The first step is to import the dependencies.

In [1]:
import pandas as pd
import nltk
import gensim
from gensim import corpora, models, similarities

1. Pandas: to create and read dataframes.
2. NLTK: for natural language operations, in our case word tokenizing.
3. Gensim: to create a Word2Vec model. The model generates the feature vectors that act as input to our final LSTM model.

The csv file contains sample chat scenarios in question and answer format.

In [3]:
# Skip malformed rows instead of raising; pandas 1.3+ spells this option on_bad_lines='warn'.
df = pd.read_csv('/Users/rajdesai/Downloads/sample_chat.csv', error_bad_lines=False)


b'Skipping line 7: expected 3 fields, saw 4\nSkipping line 8: expected 3 fields, saw 4\n'


We split the questions and answers into two lists and concatenate them to form our corpus.

We then tokenize the corpus, which gives a list of token lists that will be fed into the Word2Vec model.

In [4]:
x=df['Question'].values.tolist()
y=df['Answer'].values.tolist()
corpus= x+y
tok_corp= [nltk.word_tokenize(sent) for sent in corpus]  

Gensim's Word2Vec expects a sequence of sentences as its input, with each sentence given as a list of words.

Parameters:
min_count: the minimum number of times a word must occur to be included in the model. The default is 5. We set it to 1, so every word is kept.
size: the dimensionality of the word vectors, i.e. the size of the hidden layer, which corresponds to the "degrees of freedom" the training algorithm has. Larger values require more training data but can lead to more accurate models. Reasonable values are in the tens to hundreds.
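
To make the effect of min_count concrete, here is a minimal pure-Python sketch of the frequency cut-off it implies. The toy corpus and the helper name build_vocab are illustrative assumptions and not part of Gensim's API; Gensim applies an equivalent filter internally when it builds its vocabulary.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Return the set of words that occur at least min_count times.

    Hypothetical helper: it mimics only the frequency cut-off.
    """
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, n in counts.items() if n >= min_count}


# Toy tokenized corpus (illustrative data, not the CSV used above).
toy_corpus = [["how", "are", "you"], ["i", "am", "fine"], ["how", "is", "it"]]

vocab_all = build_vocab(toy_corpus, min_count=1)   # keeps every word
vocab_freq = build_vocab(toy_corpus, min_count=2)  # keeps only repeated words
print(len(vocab_all), sorted(vocab_freq))
```

On a small corpus, a cut-off of 5 can remove many words, whereas a cut-off of 1 keeps them all.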

In [6]:
model = gensim.models.Word2Vec(tok_corp, min_count=1, size = 32)
print(model)

Word2Vec(vocab=54, size=32, alpha=0.025)


We save the model so that it can be loaded again later.

In [7]:
model.save('sample.bin')

## Chatbot Preprocessing

We have chat logs available in JSON format, which we pre-process to make them suitable as training data for our model.

In [9]:
import json
import numpy as np
from gensim import corpora, models, similarities
import pickle

We load the previously created Word2Vec model.

In [28]:
model = gensim.models.Word2Vec.load('sample.bin')


In [29]:
# Open the chat log with a context manager so the file is closed after loading.
with open('/Users/rajdesai/Downloads/conversation.json') as file:
    data = json.load(file)

In [31]:
print(data["conversations"][1])

['Hello', 'Hi', 'How are you doing?', 'I am doing well.', 'That is good to hear', 'Yes it is.', 'Can I help you with anything?', 'Yes, I have a question.', 'What is your question?', 'Could I borrow a cup of sugar?', "I'm sorry, but I don't have any.", 'Thank you anyway', 'No problem']


1. We build our own corpus of conversations.
2. We store the utterances in two lists, because the model needs the data as (input, reply) pairs in sequence.
3. The first list holds utterance j of conversation i, and the second list holds utterance j+1, i.e. the reply to it.
4. Next, we tokenize the sentences.
5. We look up each token in our Word2Vec model.
6. The result is a list of word vectors for every sentence.
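
The pairing described in the steps above can be sketched on its own. This is a minimal illustration that uses the first four utterances of the sample conversation printed earlier; the helper name make_pairs is an assumption introduced here for clarity.

```python
def make_pairs(conversation):
    """Pair each utterance with the reply that follows it."""
    return list(zip(conversation, conversation[1:]))


# First four utterances of the sample conversation printed above.
conversation = ["Hello", "Hi", "How are you doing?", "I am doing well."]

pairs = make_pairs(conversation)
x = [utterance for utterance, _ in pairs]  # inputs: utterance j
y = [reply for _, reply in pairs]          # targets: utterance j+1
print(x)
print(y)
```

A conversation of n utterances yields n-1 pairs, so the last utterance only ever appears as a target.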

In [32]:
cor = data["conversations"]

# Pair every utterance (x) with the reply that follows it (y).
x = []
y = []
for conversation in cor:
    for j in range(len(conversation) - 1):
        x.append(conversation[j])
        y.append(conversation[j + 1])

# Lower-case and tokenize both sides.
tok_x = [nltk.word_tokenize(sent.lower()) for sent in x]
tok_y = [nltk.word_tokenize(sent.lower()) for sent in y]

# End-of-sentence marker, also used as padding.
# Its length (300) must match the vector size of the loaded Word2Vec model.
sentend = np.ones(300, dtype=np.float32)

# Look up the vector of every in-vocabulary token.
vec_x = [[model.wv[w] for w in sent if w in model.wv.vocab] for sent in tok_x]
vec_y = [[model.wv[w] for w in sent if w in model.wv.vocab] for sent in tok_y]

# Truncate to 14 word vectors, then pad with sentend to exactly 15 time steps.
for tok_sent in vec_x + vec_y:
    tok_sent[14:] = []
    tok_sent.extend([sentend] * (15 - len(tok_sent)))
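
The truncate-and-pad step can also be written as a small stand-alone helper, which makes it easy to check on dummy data. This is only a sketch: the function name pad_sentence and the zero vectors are illustrative assumptions, while the length of 15 and the all-ones marker follow the cell above.

```python
import numpy as np


def pad_sentence(vectors, max_len=15, dim=300):
    """Truncate to max_len - 1 vectors, then pad with an all-ones marker."""
    sentend = np.ones(dim, dtype=np.float32)  # end-of-sentence / padding vector
    padded = list(vectors[:max_len - 1])
    padded.extend([sentend] * (max_len - len(padded)))
    return np.array(padded, dtype=np.float32)


# Dummy word vectors (all zeros) standing in for Word2Vec output.
short_sent = [np.zeros(300, dtype=np.float32)] * 3   # a 3-word sentence
long_sent = [np.zeros(300, dtype=np.float32)] * 20   # a 20-word sentence

print(pad_sentence(short_sent).shape, pad_sentence(long_sent).shape)
```

Every sentence ends up with the same shape, and at least one end-of-sentence marker always follows the word vectors.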

In [34]:
print(data.keys)  # without parentheses this prints the method object; data.keys() lists the keys

<built-in method keys of dict object at 0x0000000018DAE288>


In [36]:
vec_x[0][0][0:4]

array([ 0.12354546,  0.00536549, -0.1516405 ,  0.08004843], dtype=float32)

We serialize our data, in this case the two lists of vectors obtained from the Word2Vec model.

In [37]:
with open('conversation.pickle','wb') as f:
    pickle.dump([vec_x,vec_y],f)

## LSTM

Now we create an LSTM model for training our chatbot.

Keras is used to build the RNN model, with TensorFlow running as the backend.

In [38]:
import os
import pickle
import numpy as np
from keras.models import Sequential
import gensim
from keras.layers import LSTM
from sklearn.model_selection import train_test_split

We load the lists of vectors that we pickled earlier.

In [39]:
with open('conversation.pickle','rb') as f:
    vec_x,vec_y=pickle.load(f) 

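Now we convert the lists of vectors into NumPy arrays.
Although they are slower to work with, especially with an LSTM, we use them because they are simple to handle and our data set is small.

In [40]:
# Every sentence has 15 vectors of equal length, so a dense float32 array works
# (the np.object alias no longer exists in recent NumPy releases).
vec_x = np.array(vec_x, dtype=np.float32)
vec_y = np.array(vec_y, dtype=np.float32)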


We split the data into test and train sets.

In [42]:
x_train,x_test, y_train,y_test = train_test_split(vec_x, vec_y, test_size=0.2, random_state=1)
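
The sizes of the two sets follow from simple arithmetic. The sketch below assumes that the test share is rounded up and the remainder is used for training, which is consistent with the shapes printed later in this section.

```python
import math

n_samples = 92   # number of (utterance, reply) pairs reported in this section
test_size = 0.2

# Assumption: the test share is rounded up; the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```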

The input to every LSTM layer must be three-dimensional.

The three dimensions of this input are:

1. Samples: one sequence is one sample. A batch comprises one or more samples.
2. Time steps: one time step is one point of observation in the sample.
3. Features: one feature is one observation at a time step.

This means that the input layer expects a 3D array of data when fitting the model and when making predictions, even if specific dimensions of the array contain a single value, e.g. one sample or one feature.
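
As a small illustration of the three-dimensional layout, the following NumPy sketch turns a single encoded sentence into a batch of one sample. The zero-filled array is dummy data; the sizes 15 and 300 are the ones used in this notebook.

```python
import numpy as np

timesteps, features = 15, 300

# One encoded sentence: 15 time steps of 300 features each.
single = np.zeros((timesteps, features), dtype=np.float32)

# Add a leading sample axis so the array becomes (samples, timesteps, features).
batch = single[np.newaxis, ...]
print(single.shape, batch.shape)
```

The same reshaping is needed at prediction time, when only one sentence is passed to the model.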

In [43]:
print(vec_x.shape)
print(vec_y.shape)

(92, 15, 300)
(92, 15, 300)


So we have three-dimensional arrays which are ready to be fed into the LSTM model.

We define the parameters for our model:

1. weights: list of NumPy arrays to set as initial weights. The list should have 3 elements, of shapes [(input_dim, output_dim), (output_dim, output_dim), (output_dim,)].

2. return_sequences: Boolean. Whether to return only the last output of the output sequence (False) or the full sequence (True). We return the full sequence, since every time step corresponds to one word of the reply.

3. init: weight initialization function. We use the Glorot normal initializer, also called the Xavier normal initializer. It draws samples from a truncated normal distribution centered on 0 with stddev = sqrt(2 / (fan_in + fan_out)), where fan_in is the number of input units in the weight tensor and fan_out is the number of output units.

4. activation: activation function. In our case we use the sigmoid function.

5. loss: we use cosine_proximity so that the loss values do not become NaN, which would make them useless to the model.
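
Two of the quantities in this list can be checked numerically. The sketch below computes the Glorot standard deviation for a hypothetical 300-by-300 weight matrix and a plain NumPy version of negative cosine similarity. The function names are assumptions made for this example, and the exact reduction Keras applies inside cosine_proximity differs between versions, so this shows the idea rather than the library's implementation.

```python
import math

import numpy as np


def glorot_stddev(fan_in, fan_out):
    """Standard deviation used by the Glorot (Xavier) normal initializer."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def negative_cosine(y_true, y_pred, eps=1e-7):
    """Negative cosine similarity along the last axis (lower is better)."""
    y_true = y_true / (np.linalg.norm(y_true, axis=-1, keepdims=True) + eps)
    y_pred = y_pred / (np.linalg.norm(y_pred, axis=-1, keepdims=True) + eps)
    return -np.sum(y_true * y_pred, axis=-1)


# Hypothetical weight matrix with 300 input and 300 output units.
print(round(glorot_stddev(300, 300), 4))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(negative_cosine(a, a), negative_cosine(a, -a), negative_cosine(a, b))
```

Because both vectors are normalized first, the result always lies between -1 (same direction) and 1 (opposite direction), so the loss stays bounded regardless of the magnitude of the predictions.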


In [52]:
model=Sequential()
model.add(LSTM(output_dim=300,input_shape=x_train.shape[1:],return_sequences=True, init='glorot_normal', inner_init='glorot_normal', activation='sigmoid'))
model.add(LSTM(output_dim=300,input_shape=x_train.shape[1:],return_sequences=True, init='glorot_normal', inner_init='glorot_normal', activation='sigmoid'))
model.add(LSTM(output_dim=300,input_shape=x_train.shape[1:],return_sequences=True, init='glorot_normal', inner_init='glorot_normal', activation='sigmoid'))
model.add(LSTM(output_dim=300,input_shape=x_train.shape[1:],return_sequences=True, init='glorot_normal', inner_init='glorot_normal', activation='sigmoid'))
model.compile(loss='cosine_proximity', optimizer='adam', metrics=['accuracy'])




In [46]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(73, 15, 300)
(73, 15, 300)
(19, 15, 300)
(19, 15, 300)



Now we train the model.

In [53]:
# nb_epoch is the Keras 1 argument name; Keras 2 and later call it epochs.
model.fit(x_train, y_train, nb_epoch=50, validation_data=(x_test, y_test))
model.save('LSTM5000.h5')




Train on 73 samples, validate on 19 samples
Epoch 1/50
...
Epoch 50/50


In [54]:
predictions = model.predict(x_test)
mod = gensim.models.Word2Vec.load('sample.bin')

## Chat

Now we write a loop that takes text as input and tries to generate an appropriate reply with the help of our model.

In [57]:
import os
from scipy import spatial
import numpy as np
import gensim
import nltk
from keras.models import load_model

In [64]:
model=load_model('LSTM5000.h5')
mod = gensim.models.Word2Vec.load('/Users/rajdesai/Downloads/doc2vec.bin')


In [None]:
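while True:
    x = input("Enter your query:")
    if x == "1":  # typing 1 ends the chat
        print("Thanks for chatting")
        break

    # End-of-sentence marker, also used as padding (same as in preprocessing).
    sentend = np.ones((300,), dtype=np.float32)

    # Tokenize the query and keep the vectors of in-vocabulary words.
    sent = nltk.word_tokenize(x.lower())
    sentvec = [mod.wv[w] for w in sent if w in mod.wv.vocab]

    # Truncate to 14 vectors, then pad with sentend to exactly 15 time steps.
    sentvec[14:] = []
    sentvec.extend([sentend] * (15 - len(sentvec)))
    sentvec = np.array([sentvec])  # shape (1, 15, 300)

    # Map each predicted vector back to its nearest word in the Word2Vec vocabulary.
    predictions = model.predict(sentvec)
    outputlist = [mod.wv.most_similar([predictions[0][i]])[0][0] for i in range(15)]
    output = ' '.join(outputlist)
    print(output)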
                

Enter your query:h
so they they but but have have apart duluth duluth duluth duluth duluth duluth duluth
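
The decoding step picks, for every predicted vector, the vocabulary word whose embedding is closest by cosine similarity. The sketch below reproduces that idea on a tiny hypothetical vocabulary with 3-dimensional vectors; the helper nearest_word and the toy embeddings are assumptions for illustration, not Gensim's implementation.

```python
import numpy as np


def nearest_word(vector, vocab_vectors):
    """Return the vocabulary word whose vector has the highest cosine similarity."""
    best_word, best_score = None, -np.inf
    for word, candidate in vocab_vectors.items():
        score = np.dot(vector, candidate) / (
            np.linalg.norm(vector) * np.linalg.norm(candidate)
        )
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Hypothetical 3-dimensional embeddings; the real vectors here are 300-dimensional.
toy_vocab = {
    "hello": np.array([1.0, 0.0, 0.0]),
    "sugar": np.array([0.0, 1.0, 0.0]),
    "thanks": np.array([0.0, 0.0, 1.0]),
}

predicted = np.array([0.1, 0.9, 0.2])  # stand-in for one predicted time step
print(nearest_word(predicted, toy_vocab))
```

Since every time step is decoded independently, consecutive predictions that barely differ map to the same word, which is one way repeated words such as those in the reply above can arise.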
