## -----------------  Section 2  ------------------------------
### Classical vs Deep Learning Models
- Some examples :
1. If-Else Rule based chatbot
2. Speech Recog
3. Bag of words model : Classification
4. CNN for text Recognition

### End to end deep learning
- suppose 2 models are there, 1 for converrting speech to text and another for analyzing the text. So error can increase
- Soln : end to end deep learning. ie. 1 model for whole thing
    - ie. seq to seq is end to end deep learning model
    
### Seq2Seq Architecture
- Issues with bag of words model
    1. fixed sized inputs and outputs
    2. does not consider ordering of words

- SOLN : RNNs 
    

### Seq2Seq Architecture
- We have a dense vector correspoding where we have start of sentence, end of sentence and a value correspoding to each word in our sentence.
- ex. length of sentence 8, .'. length of vector corresponding to the sentence = SOS + 8 + EOS = 10 
    - value of SOS = 1 always and EOS = 2
    - every word have a value, if word is two places we can see same value two places
    - We can remove SOS as we know starting sentence. EOS is imp as it tells when the output will terminate.
    
- As we see end of sentence, we start generating output. 

- So we have an enocder part and a decoder part.
- We can have deep networks as well

### Training
- ex. input : Did you like that EOS
    - output : Yes it was great EOS
- In Seq2Seq, in decoder part we are passing the output of previous time step to the next time step as input


- Q. : How can it adapt to different input and output lengths for different examples ? 
- Ans : Encoder part we have single weight w1, decoder we have single weight w2. Is is a time step, so time steps can change but weight will be same

### Beam Search Decoding
1. Greedy Decoding : y<1> word that has highest probab is fed to next time step in decoder and thus we get y<2> that has highes prob. This continues till we get EOS.
    - Greedy because we look at the word with highest probability
    
2. Beam Search Decoding:
    - here we will look at top n probability words, ex. top 3 or top 10. ie. 3 beams or 10 beams.
    - Now we have three versions of seq2seq, one version with word as yes, another with I'm and another with Thanks.
    - now same for each of the three, another three seq2seq will be produced. This is TREE like structure.
    - Thus we choose a combination or beam which has the maximum joint probability. 
    - NOTE: beam grows quickly. 1st time 3, 2nd time 9 ...
        - Soln : Truncating the beams : 
            - if joint probab starts going low, it will throw the beam.
    - There are also techniques for variability ie. all answers are not similar.
    

### Attention Mechanism
- In Seq2Seq model we have an encoder LSTM and a decoder LSTM.
    - at the last input time step ie. EOS step, we have the representation of whole input which is the meaning of our sentence.
    - decoder will take this and gives us some response
    - this is weak point of this architecture. We are having memory but also stacking up the meaning to the end time step.
    - now this is a fixed dimensional representation but input can be of variable lengths .'. It becomes a lot of info to store if input becomes large
    - Now this representation or meaning will be taken by the decoder and it should be able to maintain all the info in the layer.
    - This approach is OK for short sentences and short responses

- Soln: Additional to the representation, our decoder should have access to previous input timestep additional to the last one 
    - now with learning for each word we get weights. We will have a Context Vector which is weghted sum of all these layers. 
    - ie. w1*a1 + w2*a2 + w3*a3 { suppose we had 3 word input } 
    - w1,w2,w3 are for diff timesteps
    - Now we will feed this context vector as an additional layer to decoder as input.
    
#### Global vs Local attention
- This was global attention, additionally we have Local attention. 
- In global attention, we take the all the words in the input and add the weighted sum to the context vector, but in case of local attention

## ------------ SECTION-3 -------------------
- cornel movie dialog data
- we have other metadata but we need 2 files
    - movie conversations txt
        - we have conversations
        - each line tells the lines composing conversations
    - movie lines txt
        - contains lines taken from movies
        - 1st column : id of line
        - 2nd column : u means user ex. u0
        - 3rd : movie name
        - 4th : actor name
        - 5th : movie line
        
## ----- Part 1 : Data Preprocessing ------
        

In [1]:
import numpy as np
import tensorflow as tf
import re
import time

  from ._conv import register_converters as _register_converters


In [2]:
lines = open('./data/cornell movie-dialogs corpus/movie_lines.txt',encoding = 'utf-8', errors='ignore').read().split("\n")

In [4]:
conversations = open('./data/cornell movie-dialogs corpus/movie_conversations.txt',encoding = 'utf-8', errors='ignore').read().split("\n")

In [5]:
lines

['L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!',
 'L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!',
 'L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.',
 'L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?',
 "L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.",
 'L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow',
 "L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.",
 'L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No',
 'L870 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I\'m kidding.  You know how sometimes you just become this "persona"?  And you don\'t know how to quit?',
 'L869 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Like my fear of wearing pastels?',
 'L868 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ The "real you".',
 'L867 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ What good stuff?',
 "L866 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ I figured yo

In [6]:
conversations

["u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L204', 'L205', 'L206']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L207', 'L208']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L271', 'L272', 'L273', 'L274', 'L275']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L276', 'L277']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L280', 'L281']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L363', 'L364']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L365', 'L366']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L367', 'L368']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L401', 'L402', 'L403']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L404', 'L405', 'L406', 'L407']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L575', 'L576']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L577', 'L578']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L662', 'L663']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L693', 'L69

- Now we need to build a python dictionary that will map lineId -> text
    - we have a kind of mapping but we will build a python dict

- We want a dataset containing input and output
    - soln : using dictionary

In [15]:
# creating a dict
id2line = {}
for line in lines:
    _line = line.split(" +++$+++ ")
    if len(_line) == 5:
        id2line[_line[0]] = _line[4]

In [16]:
id2line

{'L1045': 'They do not!',
 'L1044': 'They do to!',
 'L985': 'I hope so.',
 'L984': 'She okay?',
 'L925': "Let's go.",
 'L924': 'Wow',
 'L872': "Okay -- you're gonna need to learn how to lie.",
 'L871': 'No',
 'L870': 'I\'m kidding.  You know how sometimes you just become this "persona"?  And you don\'t know how to quit?',
 'L869': 'Like my fear of wearing pastels?',
 'L868': 'The "real you".',
 'L867': 'What good stuff?',
 'L866': "I figured you'd get to the good stuff eventually.",
 'L865': 'Thank God!  If I had to hear one more story about your coiffure...',
 'L864': "Me.  This endless ...blonde babble. I'm like, boring myself.",
 'L863': 'What crap?',
 'L862': 'do you listen to this crap?',
 'L861': 'No...',
 'L860': 'Then Guillermo says, "If you go any lighter, you\'re gonna look like an extra on 90210."',
 'L699': 'You always been this selfish?',
 'L698': 'But',
 'L697': "Then that's all you had to say.",
 'L696': 'Well, no...',
 'L695': "You never wanted to go out with 'me, did y

In [50]:
# create a list of all conversation with ids
conversations_ids = []
for conversation in conversations[:-1]:
    _conversation = conversation.split(" +++$+++ ")[-1][1:-1]
    _conversation = _conversation.replace("'","").replace(" ","")
    conversations_ids.append(_conversation.split(","))

In [51]:

conversations_ids[0]

['L194', 'L195', 'L196', 'L197']

In [54]:
# Getting seperately the questions and the answers
questions = []
answers = []

for conversation in conversations_ids:
    for i in range(len(conversation)-1):
        questions.append(id2line[conversation[i]])
        answers.append(id2line[conversation[i+1]])
        
        

In [55]:
print(questions[0])
print(answers[0])

Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.
Well, I thought we'd start with pronunciation, if that's okay with you.


In [58]:
def clean_text(text):
    text = text.lower()
    text = re.sub("i'm","i am",text)
    text = re.sub("he's","he is",text)
    text = re.sub("she's","she is",text)
    text = re.sub("that's","that is",text)
    text = re.sub("what's","what is",text)
    text = re.sub("where's","where is",text)
    text = re.sub("\'ll'"," will",text)
    text = re.sub("\'ve"," have",text)
    text = re.sub("\'re"," are",text)
    text = re.sub("\'d"," would",text)
    text = re.sub("won't","will not",text)
    text = re.sub("can't","can not",text)
    
    text = re.sub("[-()\"#/@;:<>{}=+|,?.-]","",text)
    return text

In [59]:
# cleaning the questions
clean_questions = []
for question in questions:
    clean_questions.append(clean_text(question))
    
# cleaning the answers
clean_answers = []
for answer in answers:
    clean_answers.append(clean_text(answer))

In [61]:
print(questions[0])
print(answers[0])
print("-"*40)
print(clean_questions[0])
print(clean_answers[0])

Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.
Well, I thought we'd start with pronunciation, if that's okay with you.
----------------------------------------
can we make this quick  roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad  again
well i thought we would start with pronunciation if that is okay with you


In [62]:
# create a dict that maps each word to its numer of occurences
word2count = {}
for question in clean_questions:
    for word in question.split():
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
            
for answer in clean_answers:
    for word in answer.split():
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
            


In [63]:
# 

{'can': 25581,
 'we': 37583,
 'make': 6747,
 'this': 33523,
 'quick': 337,
 'roxanne': 1,
 'korrine': 1,
 'and': 65607,
 'andrew': 56,
 'barrett': 19,
 'are': 54580,
 'having': 1217,
 'an': 9482,
 'incredibly': 60,
 'horrendous': 4,
 'public': 368,
 'break': 895,
 'up': 16049,
 'on': 27238,
 'the': 140644,
 'quad': 2,
 'again': 3193,
 'well': 14111,
 'i': 195046,
 'thought': 4550,
 'would': 20009,
 'start': 1656,
 'with': 24961,
 'pronunciation': 2,
 'if': 18952,
 'that': 66860,
 'is': 79611,
 'okay': 6097,
 'you': 209825,
 'not': 41850,
 'hacking': 18,
 'gagging': 9,
 'spitting': 16,
 'part': 1417,
 'please': 3209,
 'asking': 746,
 'me': 44904,
 'out': 18468,
 'so': 19059,
 'cute': 272,
 'what': 55094,
 'your': 29938,
 'name': 3122,
 'no': 27575,
 "it's": 25845,
 'my': 29687,
 'fault': 482,
 "didn't": 8735,
 'have': 46595,
 'a': 102010,
 'proper': 138,
 'introduction': 19,
 'cameron': 35,
 'thing': 5728,
 'am': 37862,
 'at': 15290,
 'mercy': 68,
 'of': 56296,
 'particularly': 111,
 'h

In [65]:
threshold = 20
# creating two dictionaries
questionswords2int = {}
word_num = 0
for word, count in word2count.items():
    if count >= threshold:
        questionswords2int[word] = word_num
        word_num += 1
        
answerswords2int = {}
word_num = 0
for word, count in word2count.items():
    if count >= threshold:
        answerswords2int[word] = word_num
        word_num += 1
    

In [72]:
for word,count in questionswords2int.items():
    print(word+":::"+str(count))

can:::0
we:::1
make:::2
this:::3
quick:::4
and:::5
andrew:::6
are:::7
having:::8
an:::9
incredibly:::10
public:::11
break:::12
up:::13
on:::14
the:::15
again:::16
well:::17
i:::18
thought:::19
would:::20
start:::21
with:::22
if:::23
that:::24
is:::25
okay:::26
you:::27
not:::28
part:::29
please:::30
asking:::31
me:::32
out:::33
so:::34
cute:::35
what:::36
your:::37
name:::38
no:::39
it's:::40
my:::41
fault:::42
didn't:::43
have:::44
a:::45
proper:::46
cameron:::47
thing:::48
am:::49
at:::50
mercy:::51
of:::52
particularly:::53
breed:::54
loser:::55
sister:::56
date:::57
until:::58
she:::59
does:::60
why:::61
mystery:::62
used:::63
to:::64
be:::65
really:::66
popular:::67
when:::68
started:::69
high:::70
school:::71
then:::72
it:::73
was:::74
just:::75
like:::76
got:::77
sick:::78
or:::79
something:::80
gosh:::81
only:::82
could:::83
find:::84
kat:::85
boyfriend:::86
ma:::87
head:::88
right:::89
see:::90
ready:::91
for:::92
don't:::93
want:::94
know:::95
how:::96
say:::97
though:::98
us

thirty:::1095
percent:::1096
background:::1097
young:::1098
parents:::1099
killed:::1100
mother:::1101
giving:::1102
birth:::1103
fucking:::1104
doctor:::1105
czech:::1106
gave:::1107
drugs:::1108
made:::1109
blamed:::1110
mother's:::1111
famous:::1112
caught:::1113
mental:::1114
hospital:::1115
shared:::1116
gun:::1117
chair:::1118
shot:::1119
alright:::1120
homicide:::1121
custody:::1122
police:::1123
department:::1124
cooperate:::1125
da:::1126
married:::1127
divorced:::1128
place:::1129
quarters:::1130
practicing:::1131
putting:::1132
fires:::1133
station:::1134
empty:::1135
prostitute:::1136
turn:::1137
tricks:::1138
couldn't:::1139
job:::1140
green:::1141
card:::1142
approached:::1143
lot:::1144
below:::1145
table:::1146
whore:::1147
sorry:::1148
desperate:::1149
cop:::1150
prison:::1151
gay:::1152
jealous:::1153
immigration:::1154
stall:::1155
process:::1156
eddie:::1157
recommended:::1158
glad:::1159
met:::1160
packed:::1161
coffee:::1162
kitchen:::1163
problems:::1164
drag:::1

life!:::2095
reality:::2096
business!:::2097
destroying:::2098
glass:::2099
produce:::2100
monster:::2101
bride:::2102
fare:::2103
cornelius:::2104
guide:::2105
she!:::2106
moment:::2107
gentle:::2108
precious:::2109
perfect!:::2110
exception:::2111
fine!:::2112
over!:::2113
thing!:::2114
resort:::2115
methods:::2116
radio:::2117
tickets:::2118
priests:::2119
vacation:::2120
noble:::2121
identified:::2122
evil:::2123
now!:::2124
importance:::2125
fortyeight:::2126
itself:::2127
living:::2128
conditions:::2129
ultimate:::2130
warrior:::2131
stands:::2132
gum:::2133
options:::2134
enter:::2135
supreme:::2136
names:::2137
paradise:::2138
detailed:::2139
plane:::2140
dallas:::2141
safe:::2142
chosen:::2143
others:::2144
are!:::2145
know!:::2146
hardly:::2147
admitted:::2148
loved:::2149
basic:::2150
cat:::2151
timing:::2152
perfect:::2153
forgetting:::2154
sat:::2155
cab:::2156
points:::2157
license:::2158
fall:::2159
lap:::2160
y'know:::2161
resist:::2162
nickname:::2163
shorter:::2164
fi

confirm:::3094
warn:::3095
trick:::3096
movement:::3097
pilot:::3098
jet:::3099
maintenance:::3100
panel:::3101
heading:::3102
fat:::3103
colonel:::3104
wing:::3105
responding:::3106
torn:::3107
jammed:::3108
walter:::3109
answering:::3110
engines:::3111
terrorists:::3112
threw:::3113
speech:::3114
apparently:::3115
sixth:::3116
representative:::3117
taylor:::3118
compromise:::3119
ambassador:::3120
claiming:::3121
east:::3122
hostages:::3123
kings:::3124
senior:::3125
staff:::3126
deadline:::3127
fighters:::3128
rescue:::3129
spying:::3130
justice:::3131
helped:::3132
served:::3133
exist:::3134
knots:::3135
affirmative:::3136
remarkable:::3137
aircraft:::3138
psychology:::3139
dumping:::3140
turkey:::3141
georgia:::3142
tower:::3143
romeo:::3144
zulu:::3145
changing:::3146
signs:::3147
mitchell:::3148
finally:::3149
upper:::3150
communication:::3151
linked:::3152
network:::3153
military:::3154
satellites:::3155
stations:::3156
cloud:::3157
rooms:::3158
kit:::3159
comfortable:::3160
sc

gambling:::4094
dollars!:::4095
begged:::4096
comin':::4097
gamble:::4098
dough:::4099
buti:::4100
stone:::4101
heavens:::4102
establish:::4103
alibi:::4104
admission:::4105
affect:::4106
slightest:::4107
phyllis:::4108
oldfashioned:::4109
bells:::4110
invite:::4111
closest:::4112
respectable:::4113
sport:::4114
stuffy:::4115
speeches:::4116
deals:::4117
comfort:::4118
misunderstood:::4119
butler:::4120
enough!:::4121
gorgeous:::4122
ho:::4123
yourself!:::4124
attractive:::4125
ummm:::4126
stalling:::4127
parker:::4128
exchange:::4129
harris:::4130
letter:::4131
property:::4132
satisfaction:::4133
lick:::4134
shots:::4135
away!:::4136
reservations:::4137
vague:::4138
opposite:::4139
vault:::4140
checking:::4141
robbed:::4142
guilty:::4143
chances:::4144
lying!:::4145
clyde:::4146
steady:::4147
reserve:::4148
nickel:::4149
robbery:::4150
available:::4151
arguing:::4152
spot:::4153
what'll:::4154
ladies':::4155
scratch:::4156
sticks:::4157
dynamite:::4158
liar:::4159
grand:::4160
gee:::4

bitches:::5093
blowin':::5094
slavery:::5095
classified:::5096
nigga:::5097
letting:::5098
righteous:::5099
brooklyn:::5100
sloan:::5101
certified:::5102
fruit:::5103
dat:::5104
dela:::5105
nose:::5106
regardless:::5107
marvin:::5108
con:::5109
chill:::5110
patience:::5111
fingers:::5112
savings:::5113
talented:::5114
tap:::5115
singing:::5116
pitch:::5117
mantan:::5118
lookin':::5119
happenin':::5120
pierre:::5121
creator:::5122
traitor:::5123
racist:::5124
material:::5125
drawn:::5126
century:::5127
honesty:::5128
informed:::5129
culture:::5130
niggers:::5131
monsieur:::5132
offended:::5133
spike:::5134
ole:::5135
presence:::5136
revolutionary:::5137
raises:::5138
desmond:::5139
deaf:::5140
searching:::5141
previous:::5142
surface:::5143
country's:::5144
entertainment:::5145
millennium:::5146
deemed:::5147
cast:::5148
'n:::5149
ignorant:::5150
lazy:::5151
exactly!:::5152
unbelievable:::5153
stupidity:::5154
protest:::5155
negroes:::5156
educated:::5157
characters:::5158
alabama:::515

greatness:::6093
thou:::6094
vile:::6095
hast:::6096
accepting:::6097
inviting:::6098
secretly:::6099
rebel:::6100
graves:::6101
romance:::6102
oui:::6103
servant:::6104
peaceful:::6105
invasion:::6106
supply:::6107
stranger:::6108
die!:::6109
battlefield:::6110
border:::6111
tricked:::6112
inherited:::6113
jason's:::6114
monitor:::6115
journal:::6116
friend's:::6117
sleepy:::6118
grades:::6119
touching:::6120
remembered:::6121
carlos:::6122
closely:::6123
lenny:::6124
outer:::6125
alternate:::6126
screamed:::6127
worm:::6128
cellular:::6129
catches:::6130
ouch:::6131
wasting:::6132
twisted:::6133
freaking:::6134
ha!:::6135
inner:::6136
mansion:::6137
tijuana:::6138
spiritual:::6139
boom:::6140
spit:::6141
retard:::6142
naughty:::6143
loving:::6144
info:::6145
hysterical:::6146
psych:::6147
humble:::6148
whoa!:::6149
extension:::6150
tuck:::6151
blade:::6152
shitless:::6153
hurry!:::6154
student:::6155
funds:::6156
wipe:::6157
traveling:::6158
exit:::6159
laszlo:::6160
concentration:::

reactor:::7092
heal:::7093
craft:::7094
cookies:::7095
wilderness:::7096
baltimore:::7097
openly:::7098
strung:::7099
coats:::7100
desperately:::7101
gates:::7102
courts:::7103
colored:::7104
saint:::7105
geese:::7106
yessir:::7107
declare:::7108
drunken:::7109
democratic:::7110
child!:::7111
party!:::7112
champion:::7113
crushed:::7114
allowing:::7115
spies:::7116
cornwallis:::7117
honor!:::7118
tender:::7119
resignation:::7120
war!:::7121
portrait:::7122
appointed:::7123
marquis:::7124
purity:::7125
discipline:::7126
issued:::7127
wealthy:::7128
bodyguard:::7129
curtis:::7130
acres:::7131
scandal:::7132
engagement:::7133
tales:::7134
dreamt:::7135
unemployed:::7136
reserved:::7137
chili:::7138
palmer:::7139
stunt:::7140
karen:::7141
mattress:::7142
kidnapping:::7143
setup:::7144
deal's:::7145
sister's:::7146
panicked:::7147
associate:::7148
anyplace:::7149
harvey:::7150
morgan:::7151
murray:::7152
bernie:::7153
script:::7154
tricky:::7155
leo:::7156
canyon:::7157
wayne:::7158
dean:::

lasted:::8092
maximum:::8093
corps:::8094
smokey:::8095
chopper:::8096
column:::8097
yuh:::8098
sheldon:::8099
chronicle:::8100
connell:::8101
chairman:::8102
stuff's:::8103
spencer:::8104
swat:::8105
greg:::8106
barnett:::8107
vehicles:::8108
recruited:::8109
metro:::8110
teresa:::8111
scottie:::8112
payin':::8113
maranzano:::8114
rothstein:::8115
stu:::8116
uyouu:::8117
darned:::8118
ui:::8119
leave!:::8120
uthisu:::8121
th:::8122
paula:::8123
sonny:::8124
ranger:::8125
pigeons:::8126
foley:::8127
diz:::8128
paine:::8129
steering:::8130
saunders:::8131
lightning:::8132
publisher:::8133
clayton:::8134
willet:::8135
dam:::8136
damon:::8137
lena:::8138
banner:::8139
aa:::8140
babbling:::8141
notices:::8142
fitted:::8143
toto:::8144
presented:::8145
mandrake:::8146
dawson:::8147
snoopy:::8148
yet!:::8149
cedar:::8150
traveled:::8151
mason:::8152
elected:::8153
mumford:::8154
acknowledge:::8155
haunting:::8156
immune:::8157
brady:::8158
exhusband:::8159
nba:::8160
extortion:::8161
doorway

In [73]:
len(questionswords2int)

8834

In [74]:
# adding last token, as a replacement for words with less freq than threshold
tokens = ["<PAD>","<EOS>","<OUT>","<SOS>"]
for token in tokens:
    questionswords2int[token] = len(questionswords2int)
    answerswords2int[token] = len(answerswords2int)

In [75]:
# creating the inverse of a dictionary of answerswords2int dict
answersints2word = {count: w for w,count in answerswords2int.items()}

In [76]:
answersints2word

{0: 'can',
 1: 'we',
 2: 'make',
 3: 'this',
 4: 'quick',
 5: 'and',
 6: 'andrew',
 7: 'are',
 8: 'having',
 9: 'an',
 10: 'incredibly',
 11: 'public',
 12: 'break',
 13: 'up',
 14: 'on',
 15: 'the',
 16: 'again',
 17: 'well',
 18: 'i',
 19: 'thought',
 20: 'would',
 21: 'start',
 22: 'with',
 23: 'if',
 24: 'that',
 25: 'is',
 26: 'okay',
 27: 'you',
 28: 'not',
 29: 'part',
 30: 'please',
 31: 'asking',
 32: 'me',
 33: 'out',
 34: 'so',
 35: 'cute',
 36: 'what',
 37: 'your',
 38: 'name',
 39: 'no',
 40: "it's",
 41: 'my',
 42: 'fault',
 43: "didn't",
 44: 'have',
 45: 'a',
 46: 'proper',
 47: 'cameron',
 48: 'thing',
 49: 'am',
 50: 'at',
 51: 'mercy',
 52: 'of',
 53: 'particularly',
 54: 'breed',
 55: 'loser',
 56: 'sister',
 57: 'date',
 58: 'until',
 59: 'she',
 60: 'does',
 61: 'why',
 62: 'mystery',
 63: 'used',
 64: 'to',
 65: 'be',
 66: 'really',
 67: 'popular',
 68: 'when',
 69: 'started',
 70: 'high',
 71: 'school',
 72: 'then',
 73: 'it',
 74: 'was',
 75: 'just',
 76: 'like

{'can': 0,
 'we': 1,
 'make': 2,
 'this': 3,
 'quick': 4,
 'and': 5,
 'andrew': 6,
 'are': 7,
 'having': 8,
 'an': 9,
 'incredibly': 10,
 'public': 11,
 'break': 12,
 'up': 13,
 'on': 14,
 'the': 15,
 'again': 16,
 'well': 17,
 'i': 18,
 'thought': 19,
 'would': 20,
 'start': 21,
 'with': 22,
 'if': 23,
 'that': 24,
 'is': 25,
 'okay': 26,
 'you': 27,
 'not': 28,
 'part': 29,
 'please': 30,
 'asking': 31,
 'me': 32,
 'out': 33,
 'so': 34,
 'cute': 35,
 'what': 36,
 'your': 37,
 'name': 38,
 'no': 39,
 "it's": 40,
 'my': 41,
 'fault': 42,
 "didn't": 43,
 'have': 44,
 'a': 45,
 'proper': 46,
 'cameron': 47,
 'thing': 48,
 'am': 49,
 'at': 50,
 'mercy': 51,
 'of': 52,
 'particularly': 53,
 'breed': 54,
 'loser': 55,
 'sister': 56,
 'date': 57,
 'until': 58,
 'she': 59,
 'does': 60,
 'why': 61,
 'mystery': 62,
 'used': 63,
 'to': 64,
 'be': 65,
 'really': 66,
 'popular': 67,
 'when': 68,
 'started': 69,
 'high': 70,
 'school': 71,
 'then': 72,
 'it': 73,
 'was': 74,
 'just': 75,
 'like': 7

In [79]:
# add EOS token to end to all answers
    # EOS token will specify end to decoding part
for i in range(len(clean_answers)):
    clean_answers[i] += " <EOS>"

In [84]:
## Translate all questions and answers to integers
# and replace all word filtered out by <OUT>
# additionally we will sort questions and answers by length as to 
    # optimize training performance

questions_into_list = []
for question in clean_questions:
    ints = []
    for word in question.split():
        if word not in questionswords2int:
            ints.append(questionswords2int['<OUT>'])
        else:
            ints.append(questionswords2int[word])
    questions_into_list.append(ints)
    
answers_into_list = []
for answer in clean_answers:
    ints = []
    for word in answer.split():
        if word not in answerswords2int:
            ints.append(answerswords2int['<OUT>'])
        else:
            ints.append(answerswords2int[word])
    answers_into_list.append(ints)

In [88]:
len(clean_answers)

221616

In [87]:
len(answers_into_list)

221616

In [None]:
# Sorting questions and answers by the length of questions
sorted_clean_questions = []
sorted_clean_answers = []
for length in range(1,25+1):
    

In [90]:
print(questions_into_list[0])
print(clean_questions[0])

[0, 1, 2, 3, 4, 8836, 8836, 5, 6, 8836, 7, 8, 9, 10, 8836, 11, 12, 13, 14, 15, 8836, 16]
can we make this quick  roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad  again
