## -----------------  Section 2  ------------------------------
### Classical vs Deep Learning Models
- Some examples :
1. If-Else Rule based chatbot
2. Speech Recog
3. Bag of words model : Classification
4. CNN for text Recognition

### End to end deep learning
- suppose 2 models are there, 1 for converrting speech to text and another for analyzing the text. So error can increase
- Soln : end to end deep learning. ie. 1 model for whole thing
    - ie. seq to seq is end to end deep learning model
    
### Seq2Seq Architecture
- Issues with bag of words model
    1. fixed sized inputs and outputs
    2. does not consider ordering of words

- SOLN : RNNs 
    

### Seq2Seq Architecture
- We have a dense vector correspoding where we have start of sentence, end of sentence and a value correspoding to each word in our sentence.
- ex. length of sentence 8, .'. length of vector corresponding to the sentence = SOS + 8 + EOS = 10 
    - value of SOS = 1 always and EOS = 2
    - every word have a value, if word is two places we can see same value two places
    - We can remove SOS as we know starting sentence. EOS is imp as it tells when the output will terminate.
    
- As we see end of sentence, we start generating output. 

- So we have an enocder part and a decoder part.
- We can have deep networks as well

### Training
- ex. input : Did you like that EOS
    - output : Yes it was great EOS
- In Seq2Seq, in decoder part we are passing the output of previous time step to the next time step as input


- Q. : How can it adapt to different input and output lengths for different examples ? 
- Ans : Encoder part we have single weight w1, decoder we have single weight w2. Is is a time step, so time steps can change but weight will be same

### Beam Search Decoding
1. Greedy Decoding : y<1> word that has highest probab is fed to next time step in decoder and thus we get y<2> that has highes prob. This continues till we get EOS.
    - Greedy because we look at the word with highest probability
    
2. Beam Search Decoding:
    - here we will look at top n probability words, ex. top 3 or top 10. ie. 3 beams or 10 beams.
    - Now we have three versions of seq2seq, one version with word as yes, another with I'm and another with Thanks.
    - now same for each of the three, another three seq2seq will be produced. This is TREE like structure.
    - Thus we choose a combination or beam which has the maximum joint probability. 
    - NOTE: beam grows quickly. 1st time 3, 2nd time 9 ...
        - Soln : Truncating the beams : 
            - if joint probab starts going low, it will throw the beam.
    - There are also techniques for variability ie. all answers are not similar.
    

### Attention Mechanism
- In Seq2Seq model we have an encoder LSTM and a decoder LSTM.
    - at the last input time step ie. EOS step, we have the representation of whole input which is the meaning of our sentence.
    - decoder will take this and gives us some response
    - this is weak point of this architecture. We are having memory but also stacking up the meaning to the end time step.
    - now this is a fixed dimensional representation but input can be of variable lengths .'. It becomes a lot of info to store if input becomes large
    - Now this representation or meaning will be taken by the decoder and it should be able to maintain all the info in the layer.
    - This approach is OK for short sentences and short responses

- Soln: Additional to the representation, our decoder should have access to previous input timestep additional to the last one 
    - now with learning for each word we get weights. We will have a Context Vector which is weghted sum of all these layers. 
    - ie. w1*a1 + w2*a2 + w3*a3 { suppose we had 3 word input } 
    - w1,w2,w3 are for diff timesteps
    - Now we will feed this context vector as an additional layer to decoder as input.
    
#### Global vs Local attention
- This was global attention, additionally we have Local attention. 
- In global attention, we take the all the words in the input and add the weighted sum to the context vector, but in case of local attention

## ------------ SECTION-3 -------------------
- cornel movie dialog data
- we have other metadata but we need 2 files
    - movie conversations txt
        - we have conversations
        - each line tells the lines composing conversations
    - movie lines txt
        - contains lines taken from movies
        - 1st column : id of line
        - 2nd column : u means user ex. u0
        - 3rd : movie name
        - 4th : actor name
        - 5th : movie line
        
## ----- Part 1 : Data Preprocessing ------
        

In [1]:
import numpy as np
import tensorflow as tf
import re
import time

  from ._conv import register_converters as _register_converters


In [2]:
lines = open('./data/cornell movie-dialogs corpus/movie_lines.txt',encoding = 'utf-8', errors='ignore').read().split("\n")

In [4]:
conversations = open('./data/cornell movie-dialogs corpus/movie_conversations.txt',encoding = 'utf-8', errors='ignore').read().split("\n")

In [5]:
lines

['L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!',
 'L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!',
 'L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.',
 'L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?',
 "L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.",
 'L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow',
 "L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.",
 'L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No',
 'L870 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I\'m kidding.  You know how sometimes you just become this "persona"?  And you don\'t know how to quit?',
 'L869 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Like my fear of wearing pastels?',
 'L868 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ The "real you".',
 'L867 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ What good stuff?',
 "L866 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ I figured yo

In [6]:
conversations

["u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L204', 'L205', 'L206']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L207', 'L208']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L271', 'L272', 'L273', 'L274', 'L275']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L276', 'L277']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L280', 'L281']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L363', 'L364']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L365', 'L366']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L367', 'L368']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L401', 'L402', 'L403']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L404', 'L405', 'L406', 'L407']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L575', 'L576']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L577', 'L578']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L662', 'L663']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L693', 'L69

- Now we need to build a python dictionary that will map lineId -> text
    - we have a kind of mapping but we will build a python dict

- We want a dataset containing input and output
    - soln : using dictionary

In [15]:
# creating a dict
id2line = {}
for line in lines:
    _line = line.split(" +++$+++ ")
    if len(_line) == 5:
        id2line[_line[0]] = _line[4]

In [16]:
id2line

{'L1045': 'They do not!',
 'L1044': 'They do to!',
 'L985': 'I hope so.',
 'L984': 'She okay?',
 'L925': "Let's go.",
 'L924': 'Wow',
 'L872': "Okay -- you're gonna need to learn how to lie.",
 'L871': 'No',
 'L870': 'I\'m kidding.  You know how sometimes you just become this "persona"?  And you don\'t know how to quit?',
 'L869': 'Like my fear of wearing pastels?',
 'L868': 'The "real you".',
 'L867': 'What good stuff?',
 'L866': "I figured you'd get to the good stuff eventually.",
 'L865': 'Thank God!  If I had to hear one more story about your coiffure...',
 'L864': "Me.  This endless ...blonde babble. I'm like, boring myself.",
 'L863': 'What crap?',
 'L862': 'do you listen to this crap?',
 'L861': 'No...',
 'L860': 'Then Guillermo says, "If you go any lighter, you\'re gonna look like an extra on 90210."',
 'L699': 'You always been this selfish?',
 'L698': 'But',
 'L697': "Then that's all you had to say.",
 'L696': 'Well, no...',
 'L695': "You never wanted to go out with 'me, did y

In [50]:
# create a list of all conversation with ids
conversations_ids = []
for conversation in conversations[:-1]:
    _conversation = conversation.split(" +++$+++ ")[-1][1:-1]
    _conversation = _conversation.replace("'","").replace(" ","")
    conversations_ids.append(_conversation.split(","))

In [51]:

conversations_ids[0]

['L194', 'L195', 'L196', 'L197']

In [54]:
# Getting seperately the questions and the answers
questions = []
answers = []

for conversation in conversations_ids:
    for i in range(len(conversation)-1):
        questions.append(id2line[conversation[i]])
        answers.append(id2line[conversation[i+1]])
        
        

In [55]:
print(questions[0])
print(answers[0])

Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.
Well, I thought we'd start with pronunciation, if that's okay with you.


In [58]:
def clean_text(text):
    text = text.lower()
    text = re.sub("i'm","i am",text)
    text = re.sub("he's","he is",text)
    text = re.sub("she's","she is",text)
    text = re.sub("that's","that is",text)
    text = re.sub("what's","what is",text)
    text = re.sub("where's","where is",text)
    text = re.sub("\'ll'"," will",text)
    text = re.sub("\'ve"," have",text)
    text = re.sub("\'re"," are",text)
    text = re.sub("\'d"," would",text)
    text = re.sub("won't","will not",text)
    text = re.sub("can't","can not",text)
    
    text = re.sub("[-()\"#/@;:<>{}=+|,?.-]","",text)
    return text

In [59]:
# cleaning the questions
clean_questions = []
for question in questions:
    clean_questions.append(clean_text(question))
    
# cleaning the answers
clean_answers = []
for answer in answers:
    clean_answers.append(clean_text(answer))

In [61]:
print(questions[0])
print(answers[0])
print("-"*40)
print(clean_questions[0])
print(clean_answers[0])

Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.
Well, I thought we'd start with pronunciation, if that's okay with you.
----------------------------------------
can we make this quick  roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad  again
well i thought we would start with pronunciation if that is okay with you


In [62]:
# create a dict that maps each word to its numer of occurences
word2count = {}
for question in clean_questions:
    for word in question.split():
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
            
for answer in clean_answers:
    for word in answer.split():
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
            


In [63]:
# 

{'can': 25581,
 'we': 37583,
 'make': 6747,
 'this': 33523,
 'quick': 337,
 'roxanne': 1,
 'korrine': 1,
 'and': 65607,
 'andrew': 56,
 'barrett': 19,
 'are': 54580,
 'having': 1217,
 'an': 9482,
 'incredibly': 60,
 'horrendous': 4,
 'public': 368,
 'break': 895,
 'up': 16049,
 'on': 27238,
 'the': 140644,
 'quad': 2,
 'again': 3193,
 'well': 14111,
 'i': 195046,
 'thought': 4550,
 'would': 20009,
 'start': 1656,
 'with': 24961,
 'pronunciation': 2,
 'if': 18952,
 'that': 66860,
 'is': 79611,
 'okay': 6097,
 'you': 209825,
 'not': 41850,
 'hacking': 18,
 'gagging': 9,
 'spitting': 16,
 'part': 1417,
 'please': 3209,
 'asking': 746,
 'me': 44904,
 'out': 18468,
 'so': 19059,
 'cute': 272,
 'what': 55094,
 'your': 29938,
 'name': 3122,
 'no': 27575,
 "it's": 25845,
 'my': 29687,
 'fault': 482,
 "didn't": 8735,
 'have': 46595,
 'a': 102010,
 'proper': 138,
 'introduction': 19,
 'cameron': 35,
 'thing': 5728,
 'am': 37862,
 'at': 15290,
 'mercy': 68,
 'of': 56296,
 'particularly': 111,
 'h

In [65]:
threshold = 20
# creating two dictionaries
questionswords2int = {}
word_num = 0
for word, count in word2count.items():
    if count >= threshold:
        questionswords2int[word] = word_num
        word_num += 1
        
answerswords2int = {}
word_num = 0
for word, count in word2count.items():
    if count >= threshold:
        answerswords2int[word] = word_num
        word_num += 1
    

In [66]:
questionswords2int

{'can': 0,
 'we': 1,
 'make': 2,
 'this': 3,
 'quick': 4,
 'and': 5,
 'andrew': 6,
 'are': 7,
 'having': 8,
 'an': 9,
 'incredibly': 10,
 'public': 11,
 'break': 12,
 'up': 13,
 'on': 14,
 'the': 15,
 'again': 16,
 'well': 17,
 'i': 18,
 'thought': 19,
 'would': 20,
 'start': 21,
 'with': 22,
 'if': 23,
 'that': 24,
 'is': 25,
 'okay': 26,
 'you': 27,
 'not': 28,
 'part': 29,
 'please': 30,
 'asking': 31,
 'me': 32,
 'out': 33,
 'so': 34,
 'cute': 35,
 'what': 36,
 'your': 37,
 'name': 38,
 'no': 39,
 "it's": 40,
 'my': 41,
 'fault': 42,
 "didn't": 43,
 'have': 44,
 'a': 45,
 'proper': 46,
 'cameron': 47,
 'thing': 48,
 'am': 49,
 'at': 50,
 'mercy': 51,
 'of': 52,
 'particularly': 53,
 'breed': 54,
 'loser': 55,
 'sister': 56,
 'date': 57,
 'until': 58,
 'she': 59,
 'does': 60,
 'why': 61,
 'mystery': 62,
 'used': 63,
 'to': 64,
 'be': 65,
 'really': 66,
 'popular': 67,
 'when': 68,
 'started': 69,
 'high': 70,
 'school': 71,
 'then': 72,
 'it': 73,
 'was': 74,
 'just': 75,
 'like': 7

In [67]:
answerswords2int

{'can': 0,
 'we': 1,
 'make': 2,
 'this': 3,
 'quick': 4,
 'and': 5,
 'andrew': 6,
 'are': 7,
 'having': 8,
 'an': 9,
 'incredibly': 10,
 'public': 11,
 'break': 12,
 'up': 13,
 'on': 14,
 'the': 15,
 'again': 16,
 'well': 17,
 'i': 18,
 'thought': 19,
 'would': 20,
 'start': 21,
 'with': 22,
 'if': 23,
 'that': 24,
 'is': 25,
 'okay': 26,
 'you': 27,
 'not': 28,
 'part': 29,
 'please': 30,
 'asking': 31,
 'me': 32,
 'out': 33,
 'so': 34,
 'cute': 35,
 'what': 36,
 'your': 37,
 'name': 38,
 'no': 39,
 "it's": 40,
 'my': 41,
 'fault': 42,
 "didn't": 43,
 'have': 44,
 'a': 45,
 'proper': 46,
 'cameron': 47,
 'thing': 48,
 'am': 49,
 'at': 50,
 'mercy': 51,
 'of': 52,
 'particularly': 53,
 'breed': 54,
 'loser': 55,
 'sister': 56,
 'date': 57,
 'until': 58,
 'she': 59,
 'does': 60,
 'why': 61,
 'mystery': 62,
 'used': 63,
 'to': 64,
 'be': 65,
 'really': 66,
 'popular': 67,
 'when': 68,
 'started': 69,
 'high': 70,
 'school': 71,
 'then': 72,
 'it': 73,
 'was': 74,
 'just': 75,
 'like': 7

In [None]:
# adding last token, as a replacement for words with less freq than threshold
tokens = ["<PAD>","<EOS>","<OUT>","<SOS>"]
for token in tokens:
    questionswords2int[token]