# INFO-7390 ChatBot using Deep Learning and NLP

Name - Rohan Subhash Yewale<br/>
NUID - 001087414

# Abstract

Our Aim is to build a chatbot using Deep Learning, NLP and implement using SEQ2SEQ Architecture.
* **Data Preprocessing** 
* **Build SEQ2SEQ Model** 
* **Training SEQ2SEQ Model** 
* **Testing SEQ2SEQ Model**

The data set that we are going to use is the Cornell movie corpus. It's a data set of more than 600 movies containing thousands of conversations between lots of characters. We'll build a general chatbot that can have a general conversation with us like a friend, to talk about everyday life. Thats why movies are perfect because in movies you have a lot of random conversations, general conversations between friends. Our model can be trained on other more specific dataset like calender assistant or navigation assistant. These are some other applications of the chatbot.

<Img src="https://github.com/rhnyewale/INFO7390-ChatBot-with-Deep-Learning-and-NLP/blob/main/Images/DataFlow.jpg?raw=true">



**Why chatbot is essential?**

Chatbot is one of the most scalable approach a business can take in order to interact with users. If one support representative interacts with 100 customers, then for 100,000 customers we’ll need 1000 support representatives. But one chatbot can handle and interact with all the 100,000 customers and cuts lots of costs and efficiency. Deep learning has allowed the complexities within NLP to be easier to model and can be leveraged to build a chatbot which has a real conversation with a human.

**Why seq2seq model ?**

tf-seq2seq is a general-purpose encoder-decoder framework for Tensorflow that can be used for Machine Translation, Text Summarization, Conversational Modeling, Image Captioning, and more.


We built tf-seq2seq with the following goals in mind:

* **General Purpose**: We initially built this framework for Machine Translation, but have since used it for a variety of other tasks, including Summarization, Conversational Modeling, and Image Captioning. As long as your problem can be phrased as encoding input data in one format and decoding it into another format, you should be able to use or extend this framework.

* **Usability**: You can train a model with a single command. Several types of input data are supported, including standard raw text.

* **Reproducibility**: Training pipelines and models are configured using YAML files. This allows other to run your exact same model configurations.

* **Extensibility**: Code is structured in a modular way and that easy to build upon. For example, adding a new type of attention mechanism or encoder architecture requires only minimal code changes.

* **Documentation**: All code is documented using standard Python docstrings, and we have written guides to help you get started with common tasks.

* **Good Performance**: For the sake of code simplicity, we did not try to squeeze out every last bit of performance, but the implementation is fast enough to cover almost all production and research use cases. tf-seq2seq also supports distributed training to trade off computational power and training time.


![https://google.github.io/seq2seq/t](https://3.bp.blogspot.com/-3Pbj_dvt0Vo/V-qe-Nl6P5I/AAAAAAAABQc/z0_6WtVWtvARtMk0i9_AtLeyyGyV6AI4wCLcB/s1600/nmt-model-fast.gif)

To Understand seq2seq model we should have prior knowledge of-
* [RNN-Recurrent Neural Network](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) - Blog by Christopher Olah
* LSTM-Long Short Term Memory

Drawback with the Bag of Words model
1.	Fixed-sized input
2.	Doesn’t take word order into account
3.	Fixed-sized output
To overcome all these limitations we need RNN.


Sequence To Sequence model introduced in [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation](https://arxiv.org/abs/1406.1078) has since then, become the Go-To model for Dialogue Systems and Machine Translation. It consists of two RNNs (Recurrent Neural Network) : An Encoder and a Decoder. The encoder takes a sequence(sentence) as input and processes one symbol(word) at each timestep. Its objective is to convert a sequence of symbols into a fixed size feature vector that encodes only the important information in the sequence while losing the unnecessary information. You can visualize data flow in the encoder along the time axis, as the flow of local information from one end of the sequence to another.

<Img src="https://github.com/rhnyewale/INFO7390-ChatBot-with-Deep-Learning-and-NLP/blob/main/Images/seq2seq.jpg?raw=true">

Each hidden state influences the next hidden state and the final hidden state can be seen as the summary of the sequence. This state is called the context or thought vector, as it represents the intention of the sequence. From the context, the decoder generates another sequence, one symbol(word) at a time. Here, at each time step, the decoder is influenced by the context and the previously generated symbols.

<Img src="https://github.com/rhnyewale/INFO7390-ChatBot-with-Deep-Learning-and-NLP/blob/main/Images/enc_dec.jpg?raw=true">
    
There are a few challenges in using this model. The most disturbing one is that the model cannot handle variable length sequences. It is disturbing because almost all the sequence-to-sequence applications, involve variable length sequences. The next one is the vocabulary size. The decoder has to run softmax over a large vocabulary of say 20,000 words, for each word in the output. That is going to slow down the training process, even if your hardware is capable of handling it. Representation of words is of great importance. How do you represent the words in the sequence? Use of one-hot vectors means we need to deal with large sparse vectors due to large vocabulary and there is no semantic meaning to words encoded into one-hot vectors.


In [62]:
# Importing the Libraries
import numpy as np
import tensorflow as tf
import re
import time
import pandas as pd

# Dataset

Dataset-https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html<br/>
This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts:
- 220,579 conversational exchanges between 10,292 pairs of movie characters
- involves 9,035 characters from 617 movies
- in total 304,713 utterances


# PART 1 - DATA PREPROCESSING

In [64]:
################## PART 1-DATA PREPROCESSING #####################

# Importing the dataset
lines = open('movie_lines.txt', encoding = 'utf-8', errors = 'ignore').read().split('\n')
conversations = open('movie_conversations.txt', encoding = 'utf-8', errors = 'ignore').read().split('\n')

In [3]:
lines

['L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!',
 'L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!',
 'L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.',
 'L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?',
 "L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.",
 'L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow',
 "L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.",
 'L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No',
 'L870 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I\'m kidding.  You know how sometimes you just become this "persona"?  And you don\'t know how to quit?',
 'L869 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Like my fear of wearing pastels?',
 'L868 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ The "real you".',
 'L867 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ What good stuff?',
 "L866 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ I figured yo

In lines we got a conversation with one line said by a character.<br/>
For Example in the second line CAMERON which is user number 2(u2) in a specific movie(m0) saying "They do to!".<br/>
L1044 is a unique key identifier.

In [4]:
conversations

["u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L204', 'L205', 'L206']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L207', 'L208']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L271', 'L272', 'L273', 'L274', 'L275']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L276', 'L277']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L280', 'L281']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L363', 'L364']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L365', 'L366']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L367', 'L368']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L401', 'L402', 'L403']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L404', 'L405', 'L406', 'L407']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L575', 'L576']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L577', 'L578']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L662', 'L663']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L693', 'L69

Each row corresponds to one conversations between two characters in a specific movie with the unique key of the lines.
For Example in first row, user u0 & u2 in movie m0 had conversation composed of four lines or four sentences were exchanged between them. We don't have the conversation text, just the unique id.

## Creating Dictionaries

We got our data set, next step is to make a dictionary that will map each lines ID with its text.<br/>
We already have a mapping between the key and interfaces and the lines.<br/>
But we want to make a python dictionary because we will use it afterward to get our data ready for the
future neural network that we're going to build.<br/>

We want data set that contains basically two columns the input and the output because the inputs will be fed into the neural network and the outputs will be the target. We will compare the target that is the real reply by the characters to the answer with the chatbot reply. That's how chatbot will learn to speak whether it's prediction is close to the target.

Easiest way to have a dataset composed of input and output is with the dictionaries beacause we'll have to keep track of the conversations to make a correct mapping between the inputs and the outputs.

### Creating a Dictionary that maps each line and its id

In [5]:
#Creating a Dictionary that maps each line and its id

id2line = {}

for line in lines:
    _line = line.split(' +++$+++ ')        #_line temporary variable
    if len(_line) == 5:                    # To avoid shiffting issue we take lines with only 5 elements after splitting
        id2line[_line[0]] = _line[4]

In [6]:
#dataset lines:'**L1045** +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ **They do not!**',<br/>
#We are trying to map unique id **L1045** with the text **"They do not!"**. 

In [7]:
id2line

{'L221261': 'Oh bullshit.',
 'L196413': 'You know.',
 'L411948': 'She pepper-sprayed me, man!  She pepper- sprayed me!',
 'L635097': "Well, I guess they'll just up an' run anyhow, them two.",
 'L463045': "Just wanted to make you aware... before today's over, we'll be standing on top of your mountain of horse and pissing down on you.",
 'L500628': 'One other thing. Do you have any tattoos?',
 'L131267': 'I guess so...',
 'L661328': 'Oh, come on, Tripp. Cut the kid some slack.',
 'L358768': 'The world would be so much simpler if it were all just about good and evil. Unfortunately I find it much more slippery and elusive place.',
 'L256846': 'Say more.',
 'L555130': "What you're trying to say is that you don't want me to love you.  Is that it?",
 'L461692': "I'll rephrase the question.",
 'L237586': 'Tell me about Beaumont- does he understand how brilliant you are, how lucky he is to have you?',
 'L463412': "What about Jack Daniels?  Wasn't he a decorated general in the Civil War?",
 'L39

In [8]:
len(id2line)

304713

So there are total 304713 lines said between the characters.<br/>
Key: Is the unique id<br/>
Value: Is the line said by the character

### Creating a list of all of the conversations

In [9]:
conversations[0]

"u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"

We'll create a list which will contain only the id ['L194', 'L195', 'L196', 'L197'].<br/>
We'll split by '+++$+++' and then take only last element'[-1]'.<br/>
Then remove the square brackets, "'", and blank spaces and then again split by ',' to get the list.

In [10]:
# Creating a list of all of the conversations

conversations_ids = []
for conversation in conversations[:-1]:           
    _conversation = conversation.split(' +++$+++ ')[-1][1:-1].replace("'","").replace(" ","")
    conversations_ids.append(_conversation.split(','))

In [11]:
conversations_ids

[['L194', 'L195', 'L196', 'L197'],
 ['L198', 'L199'],
 ['L200', 'L201', 'L202', 'L203'],
 ['L204', 'L205', 'L206'],
 ['L207', 'L208'],
 ['L271', 'L272', 'L273', 'L274', 'L275'],
 ['L276', 'L277'],
 ['L280', 'L281'],
 ['L363', 'L364'],
 ['L365', 'L366'],
 ['L367', 'L368'],
 ['L401', 'L402', 'L403'],
 ['L404', 'L405', 'L406', 'L407'],
 ['L575', 'L576'],
 ['L577', 'L578'],
 ['L662', 'L663'],
 ['L693', 'L694', 'L695'],
 ['L696', 'L697', 'L698', 'L699'],
 ['L860', 'L861'],
 ['L862', 'L863', 'L864', 'L865'],
 ['L866', 'L867', 'L868', 'L869'],
 ['L870', 'L871', 'L872'],
 ['L924', 'L925'],
 ['L984', 'L985'],
 ['L1044', 'L1045'],
 ['L49', 'L50', 'L51'],
 ['L571', 'L572', 'L573'],
 ['L579', 'L580'],
 ['L595', 'L596', 'L597'],
 ['L598', 'L599', 'L600'],
 ['L659', 'L660'],
 ['L952', 'L953'],
 ['L394', 'L395'],
 ['L396', 'L397'],
 ['L589', 'L590', 'L591'],
 ['L592', 'L593'],
 ['L756', 'L757', 'L758'],
 ['L759', 'L760'],
 ['L164', 'L165'],
 ['L319', 'L320'],
 ['L441', 'L442', 'L443', 'L444', 'L445']

So we got the list of all line ids of the conversation.<br/>
In each list the first id will be the question and the next id will be the answer to that question.

### Getting separately the questions and the answers

So we want to seperate lists, one for the questions and other for the answers but we want both the list of same size.<br/>
We'll get the text for each id from the id2line dictionary. 

In [12]:
questions = []
answers = []

for conversation in conversations_ids:
    for i in range(len(conversation)- 1):
        questions.append(id2line[conversation[i]])
        answers.append(id2line[conversation[i+1]])

In [13]:
questions[18]

'How do you get your hair to look like that?'

In [14]:
answers[18]

"Eber's Deep Conditioner every two days. And I never, ever use a blowdryer without the diffuser attachment."

### Clean Text Function - Cleaning questions and answers
Now we'll create a function which will be used to clean the text in questions and answers.<br/>
Text cleaning will help chatbot during training process.

In [66]:
# Cleaning texts

def clean_text(text):
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"how's", "how is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", text)
    return text

In [16]:
# Cleaning the questions
clean_questions = []
for question in questions:
    clean_questions.append(clean_text(question))

In [17]:
questions[:5]

['Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.',
 "Well, I thought we'd start with pronunciation, if that's okay with you.",
 'Not the hacking and gagging and spitting part.  Please.',
 "You're asking me out.  That's so cute. What's your name again?",
 "No, no, it's my fault -- we didn't have a proper introduction ---"]

In [18]:
clean_questions[:5]

['can we make this quick  roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad  again',
 'well i thought we would start with pronunciation if that is okay with you',
 'not the hacking and gagging and spitting part  please',
 'you are asking me out  that is so cute what is your name again',
 "no no it's my fault  we didn't have a proper introduction "]

We can see the difference in the cleaned_questions. Similarly we'll clean the answers

In [19]:
# Cleaning the answers
clean_answers = []
for answer in answers:
    clean_answers.append(clean_text(answer))

### Taking a Word Count

In this part we'll remove the words which are not important or we can say that which are not frequently used as we want to optimize  the training process.<br/>
We'll create a dictionary **word2count** which will count its occurance in all the questions and answers.


In [20]:
# Creating a dictionary that maps each word to its number of occurrences
word2count = {}
for question in clean_questions:
    for word in question.split():
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1

In [21]:
for answer in clean_answers:
    for word in answer.split():
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1

In [22]:
word2count

{'catwalk': 2,
 'station': 343,
 'fisting': 2,
 '`you': 1,
 'upward': 6,
 'boatsupposed': 2,
 'committeeyou': 1,
 "equipment's": 1,
 'hatchit': 1,
 "todon't": 1,
 'surewhere': 1,
 'archaeologists': 1,
 'lunar': 10,
 'scabbard': 2,
 'hollywoodnow': 1,
 'ditto': 5,
 'catches': 41,
 'intelligent': 87,
 'caporegimes': 2,
 'sharpshooters': 2,
 'damnation!': 1,
 'disgraceful': 7,
 '*are*': 6,
 'warwhy': 1,
 'love': 4534,
 'amso': 1,
 'onput': 1,
 "'under": 4,
 'grandfather': 96,
 'vos': 2,
 'nu!': 1,
 'adios': 11,
 'swordmaker': 1,
 'chucker': 1,
 'extends': 3,
 'punitive': 7,
 'avenge': 19,
 "dormer's": 2,
 'dalmatia': 2,
 'sokolow': 1,
 "parish'r": 1,
 'depositors': 3,
 'stillwell': 2,
 'who': 10593,
 'married!!': 1,
 'rez': 5,
 "swede's": 1,
 "'boost": 2,
 'lobbyist': 1,
 'spigs': 1,
 'burgel!': 4,
 'twen': 2,
 'hereand': 2,
 'fainted': 11,
 'uhsenatori': 1,
 "file's": 6,
 'youhimust': 1,
 'lessening': 1,
 'behaviorist': 5,
 'weirdoes': 2,
 'sga': 5,
 'ambulence': 3,
 'honesty': 24,
 'nor

### Tokenization and Filtering the non frequent words

We'll choose a threshold and then words occuring below that threshold will be removed from the questions and answers.
We can try with different threshold values and treat it as an hyperparameter.


In [67]:
# Creating two dictionaries that map the questions words and the answers words to a unique integer
threshold_questions = 15
questionswords2int = {}
word_number = 0
for word, count in word2count.items():
    if count >= threshold_questions:
        questionswords2int[word] = word_number
        word_number += 1

In [24]:
questionswords2int.get('accent')

7662

In [68]:
threshold_answers = 15
answerswords2int = {}
word_number = 0
for word, count in word2count.items():
    if count >= threshold_answers:
        answerswords2int[word] = word_number
        word_number += 1

In [26]:
answerswords2int

{'ext': 7744,
 'station': 0,
 'here': 2406,
 'year': 3293,
 'gamble': 3294,
 'soap': 6662,
 'look!': 3295,
 'huge': 8608,
 'met': 4372,
 'bud': 2243,
 'herbert': 3296,
 'lake': 2080,
 'pneumonia': 1178,
 'photographer': 6253,
 'gordon': 8592,
 'catches': 1,
 'intelligent': 2,
 'iowa': 4377,
 'dangerous': 8594,
 'love': 3,
 'bloody': 2763,
 'curly': 4401,
 'report': 7413,
 'papers': 7745,
 'grandfather': 4,
 'payroll': 4378,
 'losers': 6628,
 'sophisticated': 5535,
 'harry!': 3298,
 'fights': 3299,
 'mind': 7746,
 'spring': 5692,
 'normal': 7747,
 'refused': 4751,
 'who': 5,
 'arizona': 2266,
 'hudson': 8096,
 'cedar': 2246,
 'suspected': 6665,
 'greater': 4380,
 'gladly': 2247,
 'drives': 1450,
 'cowboys': 1297,
 'in!': 6693,
 'obvious': 4557,
 'vada': 2581,
 'never': 3856,
 'ken': 4382,
 'felt': 7749,
 'plank': 7750,
 'honesty': 6,
 'moore': 1181,
 'shrimp': 1182,
 'brazil': 6667,
 'baby': 3301,
 'ungrateful': 7,
 'maker': 3302,
 'brady': 4383,
 'commander': 5537,
 'position': 303,
 '

In [27]:
answerswords2int.get('slowing')

5597

### Adding tokens
Last tokens are useful for the encoder and the decoder in the seq2seq model which are the start of string that we're going to encode by **SOS**, the end of string by **EOS**.<br/>
**PAD** is important for our model because the process training data and the sequences in the batches should all have the same length and therfore we have to put this token in an empty position.<br/>
The last token will be **OUT** will correspond to all the words that were filtered out by our two previous dictionaries i.e words with less than threshold count will be replaced by the token **OUT**.

In [28]:
# Adding the last tokens to these two dictionaries
tokens = ['<PAD>', '<EOS>', '<OUT>', '<SOS>']
for token in tokens:
    questionswords2int[token] = len(questionswords2int) + 1
for token in tokens:
    answerswords2int[token] = len(answerswords2int) + 1

Now we'll inverse answerswords2int dictionary because we'll use this inverse mapping when building the Seq2Seq model.

In [29]:
# Creating the inverse dictionary of the answerswords2int dictionary
answersints2word = {w_i: w for w, w_i in answerswords2int.items()}

In [33]:
answersints2word

{0: 'station',
 1: 'catches',
 2: 'intelligent',
 3: 'love',
 4: 'grandfather',
 5: 'who',
 6: 'honesty',
 7: 'ungrateful',
 8: 'status',
 9: 'cindy',
 10: 'close',
 11: 'rains',
 12: "sal's",
 13: 'trapped',
 14: 'experienced',
 15: 'virginia',
 16: 'worried',
 17: 'nicolet',
 18: "earth's",
 19: 'freaky',
 20: 'champagne',
 21: 'sofa',
 22: 'downstairs',
 23: 'direct',
 24: 'artistic',
 25: 'caen',
 26: 'slap',
 27: 'covers',
 28: 'cared',
 29: 'practice',
 30: 'bertrand',
 31: 'pm',
 32: 'lemon',
 33: 'pale',
 34: 'standing',
 35: "'im",
 36: 'birdie',
 37: 'wear',
 38: 'safely',
 39: 'cooperate',
 40: 'fighting',
 41: 'five',
 42: 'resort',
 43: 'cue',
 44: 'broker',
 45: 'heal',
 46: 'locker',
 47: 'lisa',
 48: 'otherwise',
 49: 'learning',
 50: 'ring',
 51: 'request',
 52: 'duck',
 53: 'including',
 54: 'geek',
 55: 'sergeant',
 56: "workin'",
 57: 'episode',
 58: 'lefferts',
 59: 'crisis',
 60: 'regan',
 61: 'connected',
 62: 'now',
 63: 'interrupt',
 64: 'destroying',
 65: 'ele

In [30]:
# Adding the End Of String EOS token to the end of every answer as it is important for the decoding part of Seq2Seq Model
for i in range(len(clean_answers)):
    clean_answers[i] += ' <EOS>'

In [34]:
clean_answers[:5]

['well i thought we would start with pronunciation if that is okay with you <EOS>',
 'not the hacking and gagging and spitting part  please <EOS>',
 "okay then how 'bout we try out some french cuisine  saturday  night <EOS>",
 'forget it <EOS>',
 'cameron <EOS>']

Now we'll be translating clean_questions and clean_answers into integers that were assigned in the dictionary previously for all words with occurance greater than the threshold value. 

We are doing this because then we want to sort all the questions and all the answers by their length and to sort that well these two lists of questions and answers translated into integers will be very helpful. Sort of the list will optimize the training speed and reduce the loss as it will reduce the amount of padding during the training.

In [31]:
# Translating all the questions and the answers into integers
# and Replacing all the words that were filtered out by <OUT> 
questions_into_int = []
for question in clean_questions:
    ints = []
    for word in question.split():
        if word not in questionswords2int:
            ints.append(questionswords2int['<OUT>'])
        else:
            ints.append(questionswords2int[word])
    questions_into_int.append(ints)
answers_into_int = []
for answer in clean_answers:
    ints = []
    for word in answer.split():
        if word not in answerswords2int:
            ints.append(answerswords2int['<OUT>'])
        else:
            ints.append(answerswords2int[word])
    answers_into_int.append(ints)

Let's check if the words are replaced by the unique key from dictionary.

In [37]:
clean_questions[0]

'can we make this quick  roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad  again'

In [38]:
questionswords2int.get('can')

8274

We can see that the word 'can' is replaced by the unique key 8274

In [36]:
questions_into_int[0]

[8274,
 5208,
 6106,
 5168,
 1475,
 8824,
 8824,
 3699,
 4690,
 8824,
 3133,
 236,
 5273,
 2388,
 8824,
 3726,
 4543,
 4037,
 245,
 5312,
 8824,
 142]

In [32]:
# Sorting questions and answers by the length of questions
sorted_clean_questions = []
sorted_clean_answers = []
for length in range(1, 25 + 1): 
    for i in enumerate(questions_into_int):
        if len(i[1]) == length:
            sorted_clean_questions.append(questions_into_int[i[0]])
            sorted_clean_answers.append(answers_into_int[i[0]])

# PART 2 - BUILDING THE SEQ2SEQ MODEL

## Creating placeholders

Why are we creating placeholders ?

So in tensorflow all variables are used in tensors.<br/>
Tensors are like an advanced array more advanced than numpy array which is of a single type and that allows fastest computations in the deep neural networks.

So we need to go from the numpy arrays to tensors.
But that's not all, because then in tensorflow all the variables used in tensors must be defined as
what we call tensorflow placeholders.

So basically that's an even more advanced data structure that can contain tensors and also other features.

To start to build a deep neural network with tensorflow we need to create some placeholders for the inputs and the target, then add a learning rate and even more hyperparameter. So we are creating this placeholders to be able to use these variables in the future training.

In [39]:
# Creating placeholders for the inputs and the targets
def model_inputs():
    inputs = tf.placeholder(tf.int32, [None, None], name = 'input')
    targets = tf.placeholder(tf.int32, [None, None], name = 'target')
    lr = tf.placeholder(tf.float32, name = 'learning_rate')           #To Hold Hyperparameter Learning Rate 
    keep_prob = tf.placeholder(tf.float32, name = 'keep_prob')        #To control the dropout rate 
    return inputs, targets, lr, keep_prob

The dropout is the rate of the neurons. You choose to override during one iteration in the training.
Usually the dropout rate is 20 percent so that you activate 20 percent of the neurons during the different
iterations of the training. And so this keep_prob parameter for which we're going to create a placeholder will be used to control this dropout rate because sometimes the drop out is too high. So we will need to reduce it and sometimes it is too low so we will need to increase that. And this will be done by the keep_prob parameter.

## Preprocessing the Targets

That's because the decoder will only accept a certain format of the targets.<br/>

RNN of the decoder will not accept a single target, target must be in batches for example 10 answers at a time.<br/>
We need to do two things-
* create target batches
* add SOS token at the beginning of each target

Parameter:<br/>
target - to get pre-processed targets<br/>
word2int - to get the unique key identifier of SOS token<br/>
batch size - number of targets to pass<br/>

In this function we'll concatenate the left side of SOS token with the right side i.e tokens of the answers.

In [40]:
# Preprocessing the targets
def preprocess_targets(targets, word2int, batch_size):
    left_side = tf.fill([batch_size, 1], word2int['<SOS>'])
    right_side = tf.strided_slice(targets, [0,0], [batch_size, -1], [1,1])
    preprocessed_targets = tf.concat([left_side, right_side], 1)
    return preprocessed_targets

## Creating the Encoder RNN Layer



**encoder_rnn** Parameters:<br/>
rnn_inputs - model_inputs from previous function <br/>
rnn_size - number of input tensors of the encoder <br/>
num_layers - number of layers <br/>
keep_prob - to apply dropout regularization to our LSTM <br/>
sequence_length - list of the length of each question in the batch <br/>

This function will return encoder_state which will be the input for the decoder.

In [41]:
# Creating the Encoder RNN
def encoder_rnn(rnn_inputs, rnn_size, num_layers, keep_prob, sequence_length):
    lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
    lstm_dropout = tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob = keep_prob)
    encoder_cell = tf.contrib.rnn.MultiRNNCell([lstm_dropout] * num_layers)
    encoder_output, encoder_state = tf.nn.bidirectional_dynamic_rnn(cell_fw = encoder_cell,
                                                                    cell_bw = encoder_cell,
                                                                    sequence_length = sequence_length,
                                                                    inputs = rnn_inputs,
                                                                    dtype = tf.float32)
    return encoder_state

## Decoding the Training & Test/Validation Set

https://www.tensorflow.org/programmers_guide/embedding<br/>
https://www.tensorflow.org/api_docs/python/tf/variable_scope<br/>
https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/contrib/seq2seq/prepare_attention<br/>
https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/contrib/seq2seq/attention_decoder_fn_train<br/>
https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/contrib/seq2seq/dynamic_rnn_decoder<br/>
https://www.tensorflow.org/api_docs/python/tf/nn/dropout<br/>

Now we'll decode the observations of the training set and at the same time we'll return the output of the decoder, some observations will go back into the neural network to update the weights and improve the ability of the chatbot to talk like a human.

attention keys - key to be compared with the target state <br/>
attention values - values to construct the context vectors which is returned by the encoder and used in the decoder as the first element of the decoding<br/>
attention_score_function is used to compute the similarity between the keys and the target states<br/>
attention_construct_function is used to build attention state

In [42]:
# Decoding the training set
def decode_training_set(encoder_state, decoder_cell, decoder_embedded_input, sequence_length, decoding_scope, output_function, keep_prob, batch_size):
    attention_states = tf.zeros([batch_size, 1, decoder_cell.output_size])
    attention_keys, attention_values, attention_score_function, attention_construct_function = tf.contrib.seq2seq.prepare_attention(attention_states, attention_option = "bahdanau", num_units = decoder_cell.output_size)
    training_decoder_function = tf.contrib.seq2seq.attention_decoder_fn_train(encoder_state[0],
                                                                              attention_keys,
                                                                              attention_values,
                                                                              attention_score_function,
                                                                              attention_construct_function,
                                                                              name = "attn_dec_train")
    decoder_output, decoder_final_state, decoder_final_context_state = tf.contrib.seq2seq.dynamic_rnn_decoder(decoder_cell,
                                                                                                              training_decoder_function,
                                                                                                              decoder_embedded_input,
                                                                                                              sequence_length,
                                                                                                              scope = decoding_scope)
    decoder_output_dropout = tf.nn.dropout(decoder_output, keep_prob)
    return output_function(decoder_output_dropout)

https://www.tensorflow.org/programmers_guide/embedding<br/>
http://web.stanford.edu/class/cs20si/lectures/notes_04.pdf<br/>
https://www.tensorflow.org/api_docs/python/tf/variable_scope<br/>

We are going to make the same function but for the new kind of observations from the test and validation set that won't be used for the training.

This function will be used to predict the observations of the test i.e answer the question asked in the test phase.
This function will also be used for the validation set.

Validation set we will make during the training, which will keep 10% of the trainset for **cross-validation** which is a technique that keeps a small part of the training data to test the predictive power on new observations. This is very useful to **reduce overfitting** and **improve the accuracy**.

The function is same as above only change is, instead of **attention_decoder_fn__train** we are using **attention_decoder_fn_inference**. Because once the chatbot is trained it has logic inside its brain and therefore it is able to deduce logically infer the answers to the questions that being asked based on the logic it learned during the training.

In [43]:
# Decoding the test/validation set
def decode_test_set(encoder_state, decoder_cell, decoder_embeddings_matrix, sos_id, eos_id, maximum_length, num_words, decoding_scope, output_function, keep_prob, batch_size):
    attention_states = tf.zeros([batch_size, 1, decoder_cell.output_size])
    attention_keys, attention_values, attention_score_function, attention_construct_function = tf.contrib.seq2seq.prepare_attention(attention_states, attention_option = "bahdanau", num_units = decoder_cell.output_size)
    test_decoder_function = tf.contrib.seq2seq.attention_decoder_fn_inference(output_function,
                                                                              encoder_state[0],
                                                                              attention_keys,
                                                                              attention_values,
                                                                              attention_score_function,
                                                                              attention_construct_function,
                                                                              decoder_embeddings_matrix,
                                                                              sos_id,
                                                                              eos_id,
                                                                              maximum_length,
                                                                              num_words,
                                                                              name = "attn_dec_inf")
    test_predictions, decoder_final_state, decoder_final_context_state = tf.contrib.seq2seq.dynamic_rnn_decoder(decoder_cell,
                                                                                                                test_decoder_function,
                                                                                                                scope = decoding_scope)
    return test_predictions

Using **attention** in our **decoding layers** reduces the loss of our model by about 20% and increases the training time by about 20%. I’d say that it’s a fair trade-off. 
Some notes to make:
* The model performs best when the attention states are set with zeros.
* The two attention options are **bahdanau** and **luong**. **Bahdanau** is less computationally expensive and better results were achieved with it.

## Creating the Decoder RNN

The decoder is another RNN that takes the encoder output vector(s) and outputs a sequence of words to create the translation.

Simple Decoder
In the simplest seq2seq decoder we use only last output of the encoder. This last output is sometimes called the context vector as it encodes context from the entire sequence. This context vector is used as the initial hidden state of the decoder.

At every step of decoding, the decoder is given an input token and hidden state. The initial input token is the start-of-string <SOS> token, and the first hidden state is the context vector (the encoder’s last hidden state).

https://google.github.io/seq2seq/decoders/



In [44]:
# Creating the Decoder RNN
def decoder_rnn(decoder_embedded_input, decoder_embeddings_matrix, encoder_state, num_words, sequence_length, rnn_size, num_layers, word2int, keep_prob, batch_size):
    with tf.variable_scope("decoding") as decoding_scope:
        lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
        lstm_dropout = tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob = keep_prob)
        decoder_cell = tf.contrib.rnn.MultiRNNCell([lstm_dropout] * num_layers)
        weights = tf.truncated_normal_initializer(stddev = 0.1)      # weight initialized
        biases = tf.zeros_initializer()                              # bias
        output_function = lambda x: tf.contrib.layers.fully_connected(x,
                                                                      num_words,
                                                                      None,
                                                                      scope = decoding_scope,
                                                                      weights_initializer = weights,
                                                                      biases_initializer = biases)
        training_predictions = decode_training_set(encoder_state,
                                                   decoder_cell,
                                                   decoder_embedded_input,
                                                   sequence_length,
                                                   decoding_scope,
                                                   output_function,
                                                   keep_prob,
                                                   batch_size)
        decoding_scope.reuse_variables()
        test_predictions = decode_test_set(encoder_state,
                                           decoder_cell,
                                           decoder_embeddings_matrix,
                                           word2int['<SOS>'],
                                           word2int['<EOS>'],
                                           sequence_length - 1,
                                           num_words,
                                           decoding_scope,
                                           output_function,
                                           keep_prob,
                                           batch_size)
    return training_predictions, test_predictions

## Building the seq2seq model

https://www.tensorflow.org/api_docs/python/tf/contrib/layers/embed_sequence<br/>
https://www.tensorflow.org/api_docs/python/tf/Variable<br/>
https://www.tensorflow.org/api_docs/python/tf/random_uniform<br/>
https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup<br/>


This is the final brain of our chatbot. We'll chat with this model.
This function will return training and test predictions.
* encoder_embedded_input
* encoder_state - input for the decoder
* get preprocessed targets
* decoder_embedding_matrix
* decoder_embedded_input
* training_predictions, test_predictions


In [45]:
# Building the seq2seq model
def seq2seq_model(inputs, targets, keep_prob, batch_size, sequence_length, answers_num_words, questions_num_words, encoder_embedding_size, decoder_embedding_size, rnn_size, num_layers, questionswords2int):
    encoder_embedded_input = tf.contrib.layers.embed_sequence(inputs,
                                                              answers_num_words + 1,
                                                              encoder_embedding_size,
                                                              initializer = tf.random_uniform_initializer(0, 1))
    encoder_state = encoder_rnn(encoder_embedded_input, rnn_size, num_layers, keep_prob, sequence_length)
    preprocessed_targets = preprocess_targets(targets, questionswords2int, batch_size)
    decoder_embeddings_matrix = tf.Variable(tf.random_uniform([questions_num_words + 1, decoder_embedding_size], 0, 1))
    decoder_embedded_input = tf.nn.embedding_lookup(decoder_embeddings_matrix, preprocessed_targets)
    training_predictions, test_predictions = decoder_rnn(decoder_embedded_input,
                                                         decoder_embeddings_matrix,
                                                         encoder_state,
                                                         questions_num_words,
                                                         sequence_length,
                                                         rnn_size,
                                                         num_layers,
                                                         questionswords2int,
                                                         keep_prob,
                                                         batch_size)
    return training_predictions, test_predictions

# Part 3 - TRAINING THE SEQ2SEQ MODEL

## Setting Hyperparameters

So **epochs** is basically one complete iteration of the training i.e forward propagating with the batches of inputs and then backpropagating for updating weights.<br/>

We are setting it to 100 but if it's taking time we can try 50 epochs but not less than that.<br/>
**batch_size** with 32 if it takes too long then try 128.<br/>

In [46]:
# Setting the Hyperparameters
epochs = 50
batch_size = 32
rnn_size = 1024
num_layers = 3                             
encoding_embedding_size = 1024
decoding_embedding_size = 1024
learning_rate = 0.001                    #should not be too high and also not too low
learning_rate_decay = 0.9               # percent by which learning rate is reduced over the iterations of the training, most commonly used is 0.9
min_learning_rate = 0.0001              # for early stopping
keep_probability = 0.5                  # dropout rate recommended by Geoffrey Hinton for the hidden units

In [47]:
# Defining a tensor session
tf.reset_default_graph()
session = tf.InteractiveSession()

In [48]:
# Loading model inputs
inputs, targets, lr, keep_prob = model_inputs()

Earlier we had set the sequence length to 25 i.e questions and answers won't be greater than 25 words

In [49]:
# Setting the sequence length
sequence_length = tf.placeholder_with_default(25, None, name = 'sequence_length') 

In [50]:
# Getting the shape of the inputs tensor
input_shape = tf.shape(inputs)

## Getting the training and test predictions

In [51]:
training_predictions, test_predictions = seq2seq_model(tf.reverse(inputs, [-1]), #inversing the input to produce better output
                                                       targets,
                                                       keep_prob,
                                                       batch_size,
                                                       sequence_length,
                                                       len(answerswords2int),
                                                       len(questionswords2int),
                                                       encoding_embedding_size,
                                                       decoding_embedding_size,
                                                       rnn_size,
                                                       num_layers,
                                                       questionswords2int)


## Setting up the Loss Error, the Optimizer and Gradient Clipping

https://www.tensorflow.org/api_docs/python/tf/name_scope<br/>
https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/sequence_loss<br/>
https://www.tensorflow.org/api_docs/python/tf/ones<br/>
https://www.tensorflow.org/api_docs/python/tf/clip_by_value<br/>

Gradient Clipping is a technique used to avoid exploding gradients and vanishing gradient problems.

In [52]:
# Setting up the Loss Error, the Optimizer and Gradient Clipping
with tf.name_scope("optimization"):
    loss_error = tf.contrib.seq2seq.sequence_loss(training_predictions,
                                                  targets,
                                                  tf.ones([input_shape[0], sequence_length]))
    optimizer = tf.train.AdamOptimizer(learning_rate)
    gradients = optimizer.compute_gradients(loss_error)
    clipped_gradients = [(tf.clip_by_value(grad_tensor, -5., 5.), grad_variable) for grad_tensor, grad_variable in gradients if grad_tensor is not None]
    optimizer_gradient_clipping = optimizer.apply_gradients(clipped_gradients)

This sets up the structure of our graph.
* I choose to use an interactive session to provide a little more flexibility when building this model, but you can use whatever session type you wish.
* Sequence length will be the max line length for each batch. I sorted my inputs by length to reduce the amount of padding when creating the batches. This helped to speed up training.
* If you are unfamiliar with seq2seq models, the input is often reversed. This helps a model to produce better outputs because when the input data is being fed into the model, the start of the sequence will now become closer to the start of the output sequence.
* Although I have clipped my gradients at ±5, I didn’t notice much of a difference with ±1.

## Padding the sequences with the PAD token

Before training, we work on the dataset to convert the variable length sequences into fixed length sequences, by padding. We use a few special symbols to fill in the sequence.

EOS : End of sentence<br/>
PAD : Filler<br/>
SOS : Start decoding<br/>
OUT : Unknown; word not in vocabulary<br/>

Consider the following query-response pair.

In [53]:
#Question:['Who', 'are', 'you']
#Answer: [<SOS>, 'I', 'am', 'a', 'bot', '.', <EOS> ]

Assuming that we would like our sentences (queries and responses) to be of fixed length, 10, this pair will be converted to:

In [54]:
#After Padding 
#Question:['Who', 'are', 'you', <PAD>, <PAD>, <PAD>, <PAD>]
#Answer:  [<SOS>, 'I', 'am', 'a', 'bot', '.', <EOS>, <PAD>]

In [55]:
# Padding the sequences with the <PAD> token

def apply_padding(batch_of_sequences, word2int):
    max_sequence_length = max([len(sequence) for sequence in batch_of_sequences]) #getting max sequence length
    return [sequence + [word2int['<PAD>']] * (max_sequence_length - len(sequence)) for sequence in batch_of_sequences] #padding

## Splitting the data into batches of questions and answers

Get the questions and answers in batch and then apply padding to the batch.

In [57]:
# Splitting the data into batches of questions and answers
def split_into_batches(questions, answers, batch_size):
    for batch_index in range(0, len(questions) // batch_size):
        start_index = batch_index * batch_size
        questions_in_batch = questions[start_index : start_index + batch_size]
        answers_in_batch = answers[start_index : start_index + batch_size]
        padded_questions_in_batch = np.array(apply_padding(questions_in_batch, questionswords2int))
        padded_answers_in_batch = np.array(apply_padding(answers_in_batch, answerswords2int))
        yield padded_questions_in_batch, padded_answers_in_batch
 

In [58]:
# Splitting the questions and answers into training and validation sets
training_validation_split = int(len(sorted_clean_questions) * 0.15)       # Calculating 15% of total number of questions for cross-validation
training_questions = sorted_clean_questions[training_validation_split:]   #Remaining 85% for training 
training_answers = sorted_clean_answers[training_validation_split:]         
validation_questions = sorted_clean_questions[:training_validation_split] # 15% of questions for cross-validation
validation_answers = sorted_clean_answers[:training_validation_split]
 

## Training

We'll calculate training loss for every 100 batch.<br/>
Check validation loss in middle of every epoch.<br/>
To calculate loss on 100 batches.<br/>
List of validation loss error which will be helpful for early stopping.

In [None]:
batch_index_check_training_loss = 100
batch_index_check_validation_loss = ((len(training_questions)) // batch_size // 2) - 1
total_training_loss_error           #total loss on 100 batches
list_validation_loss_error = []     #list of validation loss error for early stopping technique 
early_stopping_check = 0            #
early_stopping_stop = 100
checkpoint = "./chatbot_weights.ckpt" # For Windows users, replace this line of code by: checkpoint = "./chatbot_weights.ckpt"
session.run(tf.global_variables_initializer())
for epoch in range(1, epochs + 1): 
    #Training loss error
    for batch_index, (padded_questions_in_batch, padded_answers_in_batch) in enumerate(split_into_batches(training_questions, training_answers, batch_size)):
        starting_time = time.time()
        _, batch_training_loss_error = session.run([optimizer_gradient_clipping, loss_error], {inputs: padded_questions_in_batch,
                                                                                               targets: padded_answers_in_batch,
                                                                                               lr: learning_rate,
                                                                                               sequence_length: padded_answers_in_batch.shape[1],
                                                                                               keep_prob: keep_probability})
        total_training_loss_error += batch_training_loss_error
        ending_time = time.time()
        batch_time = ending_time - starting_time
        if batch_index % batch_index_check_training_loss == 0:
            print('Epoch: {:>3}/{}, Batch: {:>4}/{}, Training Loss Error: {:>6.3f}, Training Time on 100 Batches: {:d} seconds'.format(epoch,
                                                                                                                                       epochs,
                                                                                                                                       batch_index,
                                                                                                                                       len(training_questions) // batch_size,
                                                                                                                                       total_training_loss_error / batch_index_check_training_loss,
                                                                                                                                       int(batch_time * batch_index_check_training_loss)))
            total_training_loss_error = 0
        if batch_index % batch_index_check_validation_loss == 0 and batch_index > 0: 
            total_validation_loss_error = 0
            starting_time = time.time()
            # Validation loss error
            for batch_index_validation, (padded_questions_in_batch, padded_answers_in_batch) in enumerate(split_into_batches(validation_questions, validation_answers, batch_size)):
                batch_validation_loss_error = session.run(loss_error, {inputs: padded_questions_in_batch,
                                                                       targets: padded_answers_in_batch,
                                                                       lr: learning_rate,
                                                                       sequence_length: padded_answers_in_batch.shape[1],
                                                                       keep_prob: 1})
                total_validation_loss_error += batch_validation_loss_error
            ending_time = time.time()
            batch_time = ending_time - starting_time
            # Calculate average validation loss
            average_validation_loss_error = total_validation_loss_error / (len(validation_questions) / batch_size)
            print('Validation Loss Error: {:>6.3f}, Batch Validation Time: {:d} seconds'.format(average_validation_loss_error, int(batch_time)))
            # Applying decay to learning rate
            learning_rate *= learning_rate_decay
            if learning_rate < min_learning_rate:
                learning_rate = min_learning_rate
            list_validation_loss_error.append(average_validation_loss_error)
            if average_validation_loss_error <= min(list_validation_loss_error):
                print('I speak better now!!')
                early_stopping_check = 0
                saver = tf.train.Saver()
                saver.save(session, checkpoint)
            else:
                print("Sorry I do not speak better, I need to practice more.")
                early_stopping_check += 1
                if early_stopping_check == early_stopping_stop:
                    break
    if early_stopping_check == early_stopping_stop:
        print("My apologies, I cannot speak better anymore. This is the best I can do.")
        break
print("Game Over")

# PART 4 - TESTING THE SEQ2SEQ MODEL

Training for our model will take days. As the model trains the weights of the chatbot will be downloaded. We can use these pretrained weights(brain) of the chatbot, but as it hasn't learned much it won't be accurate.

To chat with the chatbot with the help of pre-trained weights run the code except the training part. Comment the training part.
Load the pre-trained weight model.

In [None]:
checkpoint = "./best_weights_training.ckpt.data-00000-of-00001"
session = tf.InteractiveSession()
session.run(tf.global_variables_initializer())
saver = tf.train.Saver() #Initializing the session to connect with the weights-checkpoint
saver.restore(session, checkpoint) 

In [65]:
# Converting the questions from strings to lists of encoding integers
def convert_string2int(question, word2int):
    question = clean_text(question)
    return [word2int.get(word, word2int['<OUT>']) for word in question.split()]


In [None]:
# Setting up the chat
while(True):
    question = input("You: ")
    if question == 'Goodbye':  # to end the conversation
        break
    question = convert_string2int(question, questionswords2int)
    question = question + [questionswords2int['<PAD>']] * (25 - len(question)) # padding the question
    fake_batch = np.zeros((batch_size, 25)) #as the neural network only accept input in batches we need to create fake batch
    fake_batch[0] = question # adding question in the batch 
    predicted_answer = session.run(test_predictions, {inputs: fake_batch, keep_prob: 0.5})[0] #to get the predicted answer
    answer = ''
    #Converting the answer in the correct readable format i.e removing tokens like EOS, OUT
    for i in np.argmax(predicted_answer, 1):  
        if answersints2word[i] == 'i':
            token = ' I'
        elif answersints2word[i] == '<EOS>':
            token = '.'
        elif answersints2word[i] == '<OUT>':
            token = 'out'
        else:
            token = ' ' + answersints2word[i]
        answer += token
        if token == '.':
            break
    print('ChatBot: ' + answer)

To chat with the chatbot run the chatbot.py file in Spyder.<br/>
As the training of the model takes more than 2 days we didn't get the optimized weights.<br/>
We can run the chatbot.py file with the pretrain weights to test our chatbot.<br/>

Please find the output in the github as the pretrained weights are greater than 500mb.

Github - https://github.com/rhnyewale/INFO7390-ChatBot-with-Deep-Learning-and-NLP

# Conclusion

We implemented ChatBot using Seq2Seq model.

We completed below task under each section:

* **Data Preprocessing** - Text Cleaning, Word2Count, Tokenization
* **Build SEQ2SEQ Model** - Functions for Model Inputs, Encoder, Decoder and ensembling it to build SEQ2SEQ Model
* **Training SEQ2SEQ Model** - Set Hyperparameters, Train the model and get optimized weights for better chat experience
* **Testing SEQ2SEQ Model** - Test the chatbot and according to the experience try to optimize or tune the model by changing Hyperparameter



# Citation

https://arxiv.org/pdf/1406.1078v3.pdf<br/>
https://google.github.io/seq2seq/<br/>
https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html<br/>
https://colah.github.io/posts/2015-08-Understanding-LSTMs/<br/>
https://www.guru99.com/seq2seq-model.html<br/>
https://towardsdatascience.com/day-1-2-attention-seq2seq-models-65df3f49e263#:~:text=Jul%2013%2C%202019%C2%B76%20min,in%20production%20in%20late%202016.<br/>
https://towardsdatascience.com/generative-chatbots-using-the-seq2seq-model-d411c8738ab5<br/>




# License

Copyright 2020 Rohan Subhash Yewale

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.