# AI CHATBOT

<hr>

In this jupyter notebook, we will build a chatbot using sequence to sequence model.

In [3]:
# Import the libraries
import numpy as np
import tensorflow as tf
import contractions
from tqdm import tqdm
import re
import time

In [2]:
# Tensorflow 1.4.0
assert tf.__version__ == '1.4.0'

<br>

# 1. Load the Dataset

<hr>

For creating our chatbot, we will use a data set called <a href="https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html">Cornell Movie Dialogs Corpus</a> which this corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts:

- 220,579 conversational exchanges between 10,292 pairs of movie characters
- involves 9,035 characters from 617 movies
- in total 304,713 utterances


However for our purpose, we will only use <i>movie_conversations.txt</i> and <i>movie_lines.txt</i>.


In [4]:
# Load the lines dataset
lines = open(file = "./dataset/movie_lines.txt", encoding = "utf-8", errors = "ignore").read()

# Split the dataset for every line
lines = lines.split("\n")

In [5]:
# Print lines
lines[:5]

['L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!',
 'L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!',
 'L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.',
 'L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?',
 "L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go."]

In lines dataset, we have 5 columns. Now let's consider the first line:

- **Line id** --> like L1045
- **User id** --> like u0
- **Movie id** --> like m0
- **User name** --> like BIANCA
- **Text** --> like They do not!

In [6]:
# Load the conversations dataset
conversations = open(file = "./dataset/movie_conversations.txt", encoding = "utf-8", errors = "ignore").read()

# Split the dataset for every line
conversations = conversations.split("\n")

In [7]:
# Print convesations
conversations[:5]

["u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L204', 'L205', 'L206']",
 "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L207', 'L208']"]

In conversations dataset, we have 4 columns. Now let's consider the first line:

- **User number 1** --> like u0
- **User number 2** --> like u2
- **Movie id** --> like m0
- **Conversations (array of line ids)** --> ['L194', 'L195', 'L196', 'L197']

<br>

# 2. Data Prepration

<hr>

Before applying the preprocessing steps, we need to prepare our dataset. In other word, we need to exactly specify our inputs and our outputs.

In [8]:
### Create a dictionary that maps each line id and its line text

# Initialize an empty dictionary
id2line = {}

# Iterate through lines
for i_line in lines:
    
    # Split the line with +++$+++                       # _ before lines means that this is temporary
    _lines = i_line.split(" +++$+++ ")
    
    # Make sure the length is 5 (otherwise we are going to get the wrong information because of indexing)
    if len(_lines) == 5:
        
        # Get the line id 
        line_id = _lines[0]
        
        # Get the text
        text = _lines[-1]
        
        # Map line ids to its text
        id2line[line_id] = text

In [9]:
# Take a look at id2line
id2line

{'L1045': 'They do not!',
 'L1044': 'They do to!',
 'L985': 'I hope so.',
 'L984': 'She okay?',
 'L925': "Let's go.",
 'L924': 'Wow',
 'L872': "Okay -- you're gonna need to learn how to lie.",
 'L871': 'No',
 'L870': 'I\'m kidding.  You know how sometimes you just become this "persona"?  And you don\'t know how to quit?',
 'L869': 'Like my fear of wearing pastels?',
 'L868': 'The "real you".',
 'L867': 'What good stuff?',
 'L866': "I figured you'd get to the good stuff eventually.",
 'L865': 'Thank God!  If I had to hear one more story about your coiffure...',
 'L864': "Me.  This endless ...blonde babble. I'm like, boring myself.",
 'L863': 'What crap?',
 'L862': 'do you listen to this crap?',
 'L861': 'No...',
 'L860': 'Then Guillermo says, "If you go any lighter, you\'re gonna look like an extra on 90210."',
 'L699': 'You always been this selfish?',
 'L698': 'But',
 'L697': "Then that's all you had to say.",
 'L696': 'Well, no...',
 'L695': "You never wanted to go out with 'me, did y

In [10]:
### Create a list of all conversations

# Initialize an empty array
conversation_ids = []

# Iterate through conversations                  # We don't take the last row because it's an empty string
for i_conversation in conversations[:-1]:
    
    # Split each conversation with +++$+++
    _conversation = i_conversation.split(" +++$+++ ")
    
    # Get the lines ids
    _conversation = _conversation[-1]
    
    # Remove the square brackets                 # Remmber that this array is inside a string so we can use indexing like below
    _conversation = _conversation[1:-1]
    
    # Remove quotes
    _conversation = _conversation.replace("'", "")
    
    # Remove empty spaces
    _conversation = _conversation.replace(" ", "")
    
    # Split the conversation with ,
    _conversation = _conversation.split(",")
    
    # Append the conversations to our list
    conversation_ids.append(_conversation)

In [11]:
# Print conversation_ids
conversation_ids

[['L194', 'L195', 'L196', 'L197'],
 ['L198', 'L199'],
 ['L200', 'L201', 'L202', 'L203'],
 ['L204', 'L205', 'L206'],
 ['L207', 'L208'],
 ['L271', 'L272', 'L273', 'L274', 'L275'],
 ['L276', 'L277'],
 ['L280', 'L281'],
 ['L363', 'L364'],
 ['L365', 'L366'],
 ['L367', 'L368'],
 ['L401', 'L402', 'L403'],
 ['L404', 'L405', 'L406', 'L407'],
 ['L575', 'L576'],
 ['L577', 'L578'],
 ['L662', 'L663'],
 ['L693', 'L694', 'L695'],
 ['L696', 'L697', 'L698', 'L699'],
 ['L860', 'L861'],
 ['L862', 'L863', 'L864', 'L865'],
 ['L866', 'L867', 'L868', 'L869'],
 ['L870', 'L871', 'L872'],
 ['L924', 'L925'],
 ['L984', 'L985'],
 ['L1044', 'L1045'],
 ['L49', 'L50', 'L51'],
 ['L571', 'L572', 'L573'],
 ['L579', 'L580'],
 ['L595', 'L596', 'L597'],
 ['L598', 'L599', 'L600'],
 ['L659', 'L660'],
 ['L952', 'L953'],
 ['L394', 'L395'],
 ['L396', 'L397'],
 ['L589', 'L590', 'L591'],
 ['L592', 'L593'],
 ['L756', 'L757', 'L758'],
 ['L759', 'L760'],
 ['L164', 'L165'],
 ['L319', 'L320'],
 ['L441', 'L442', 'L443', 'L444', 'L445']

In [12]:
### Get the questions and answers seperatly

# Initialize an empry list for questions
questions = []

# Initialize an empry list for answers
answers = []

# Iterate through conversation ids
for i_conversation in conversation_ids:
    
    # Iterate through length of i_conversation
    for index in range(len(i_conversation) - 1):

        # Get the text of current index
        current_text = id2line[i_conversation[index]]
        
        # Append the text to questions array
        questions.append(current_text)
        
        # Get the text of next index
        next_text = id2line[i_conversation[index + 1]]
        
        # Append the text to answers array
        answers.append(next_text)

In [13]:
# Take a look at questions
questions[:5]

['Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.',
 "Well, I thought we'd start with pronunciation, if that's okay with you.",
 'Not the hacking and gagging and spitting part.  Please.',
 "You're asking me out.  That's so cute. What's your name again?",
 "No, no, it's my fault -- we didn't have a proper introduction ---"]

In [14]:
# Take a look at answers
answers[:5]

["Well, I thought we'd start with pronunciation, if that's okay with you.",
 'Not the hacking and gagging and spitting part.  Please.',
 "Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?",
 'Forget it.',
 'Cameron.']

<br>

# 3. Data Preprocessing

<hr>

In this section we will take the following steps for preprocessing our dataset:
1. Lowercasing the text
2. Decontracting the text
3. Remove the punctuation

In [15]:
# Preprocessing function
def preprocess_text(text):
    """
    Function for applying the preprocesssing steps.
    """
    # 1. Lowercase the text
    text = text.lower()
    
    # 2. De-contract the text
    text = contractions.fix(text)
    
    # 3. Remove the punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)

    return text

In [16]:
# Apply preprocesing function to our questions
questions_preprocessed = []
for i_question in tqdm(questions):
    questions_preprocessed.append(preprocess_text(i_question))

# Apply preprocesing function to our answers
answers_preprocessed = []
for i_answer in tqdm(answers):
    answers_preprocessed.append(preprocess_text(i_answer))

100%|██████████| 221616/221616 [00:51<00:00, 4275.71it/s]
100%|██████████| 221616/221616 [00:57<00:00, 3887.97it/s]


In [17]:
# Take a look at preprocessed questions
questions_preprocessed[:5]

['can we make this quick   roxanne korrine and andrew barrett are having an incredibly horrendous public break  up on the quad   again ',
 'well  i thought we would start with pronunciation  if that is okay with you ',
 'not the hacking and gagging and spitting part   please ',
 'you are asking me out   that is so cute  what is your name again ',
 'no  no  it is my fault    we did not have a proper introduction    ']

In [18]:
# Take a look at preprocessed answers
answers_preprocessed[:5]

['well  i thought we would start with pronunciation  if that is okay with you ',
 'not the hacking and gagging and spitting part   please ',
 'okay    then how  bout we try out some french cuisine   saturday   night ',
 'forget it ',
 'cameron ']

In [19]:
# Initialize an empty dictionary for counting words
word2count = {}

# Create a word count dictionary for questions
for i_question in tqdm(questions_preprocessed):
    for i_word in i_question.split():
        if i_word not in word2count:
            word2count[i_word] = 1
        else:
            word2count[i_word] += 1

# Create a word count dictionary for answers
for i_answer in tqdm(questions_preprocessed):
    for i_word in i_answer.split():
        if i_word not in word2count:
            word2count[i_word] = 1
        else:
            word2count[i_word] += 1

100%|██████████| 221616/221616 [00:01<00:00, 124866.25it/s]
100%|██████████| 221616/221616 [00:01<00:00, 169782.75it/s]


In [20]:
### Remove the least used words

# Threshold; If occurrence of each word is below this number, it will get deleted
threshold = 20    

# Map the question words to a unique integer
questionwords2int = {}
word_number = 0
for i_word, i_count in word2count.items():
    if i_count >= threshold:
        questionwords2int[i_word] = word_number
        word_number += 1

# Map the answer words to a unique integer
answerwords2int = {}
word_number = 0
for i_word, i_count in word2count.items():
    if i_count >= threshold:
        answerwords2int[i_word] = word_number
        word_number += 1

In [21]:
# Take a look at questionsword2int
questionwords2int

{'can': 0,
 'we': 1,
 'make': 2,
 'this': 3,
 'quick': 4,
 'and': 5,
 'andrew': 6,
 'barrett': 7,
 'are': 8,
 'having': 9,
 'an': 10,
 'incredibly': 11,
 'public': 12,
 'break': 13,
 'up': 14,
 'on': 15,
 'the': 16,
 'again': 17,
 'well': 18,
 'i': 19,
 'thought': 20,
 'would': 21,
 'start': 22,
 'with': 23,
 'if': 24,
 'that': 25,
 'is': 26,
 'okay': 27,
 'you': 28,
 'not': 29,
 'part': 30,
 'please': 31,
 'asking': 32,
 'me': 33,
 'out': 34,
 'so': 35,
 'cute': 36,
 'what': 37,
 'your': 38,
 'name': 39,
 'no': 40,
 'it': 41,
 'my': 42,
 'fault': 43,
 'did': 44,
 'have': 45,
 'a': 46,
 'proper': 47,
 'introduction': 48,
 'cameron': 49,
 'thing': 50,
 'm': 51,
 'at': 52,
 'mercy': 53,
 'of': 54,
 'particularly': 55,
 'hideous': 56,
 'breed': 57,
 'loser': 58,
 'sister': 59,
 'cannot': 60,
 'date': 61,
 'until': 62,
 'she': 63,
 'does': 64,
 'why': 65,
 'mystery': 66,
 'used': 67,
 'to': 68,
 'be': 69,
 'really': 70,
 'popular': 71,
 'when': 72,
 'started': 73,
 'high': 74,
 'school': 7

In [22]:
# Take a look at answerwords2int
answerwords2int

{'can': 0,
 'we': 1,
 'make': 2,
 'this': 3,
 'quick': 4,
 'and': 5,
 'andrew': 6,
 'barrett': 7,
 'are': 8,
 'having': 9,
 'an': 10,
 'incredibly': 11,
 'public': 12,
 'break': 13,
 'up': 14,
 'on': 15,
 'the': 16,
 'again': 17,
 'well': 18,
 'i': 19,
 'thought': 20,
 'would': 21,
 'start': 22,
 'with': 23,
 'if': 24,
 'that': 25,
 'is': 26,
 'okay': 27,
 'you': 28,
 'not': 29,
 'part': 30,
 'please': 31,
 'asking': 32,
 'me': 33,
 'out': 34,
 'so': 35,
 'cute': 36,
 'what': 37,
 'your': 38,
 'name': 39,
 'no': 40,
 'it': 41,
 'my': 42,
 'fault': 43,
 'did': 44,
 'have': 45,
 'a': 46,
 'proper': 47,
 'introduction': 48,
 'cameron': 49,
 'thing': 50,
 'm': 51,
 'at': 52,
 'mercy': 53,
 'of': 54,
 'particularly': 55,
 'hideous': 56,
 'breed': 57,
 'loser': 58,
 'sister': 59,
 'cannot': 60,
 'date': 61,
 'until': 62,
 'she': 63,
 'does': 64,
 'why': 65,
 'mystery': 66,
 'used': 67,
 'to': 68,
 'be': 69,
 'really': 70,
 'popular': 71,
 'when': 72,
 'started': 73,
 'high': 74,
 'school': 7

These are some special tokens which is used in seq2seq:

- `SOS` - Start of string which is the first token that should start in the decoding layers.
    
- `EOS` - End of sentence. As soon as decoder generates this token we consider the answer to be complete (you can't use usual punctuation marks for this purpose cause their meaning can be different)

- `UNK` - Unknown token. This is used to replace the rare words that did not fit in your vocabulary. So your sentence My name is guotong1988 will be translated into My name is _unk_.

- `PAD` - Your GPU (or CPU at worst) processes your training data in batches and all the sequences in your batch should have the same length. If the max length of your sequence is 8, your sentence My name is guotong1988 will be padded from either side to fit this length: My name is guotong1988 _pad_ _pad_ _pad_ _pad_

- `OUT` - Tokens by which all the words that were filtered out by our previous dictionaries will be replaced. We call it OUT as in filter out.

In [23]:
# Adding the special tokens to the previous two dictionaries
tokens = ["<PAD>", "<EOS>", "<OUT>", "<SOS>"]

for i_token in tokens:
    questionwords2int[i_token] = len(questionwords2int) + 1
    
for i_token in tokens:
    answerwords2int[i_token] = len(answerwords2int) + 1

In [24]:
# Create the inverse dictionary of answerwords2int
answerints2word = {i_index: i_word for i_word, i_index in answerwords2int.items()}

In [25]:
# Take a look at answerints2word
answerints2word

{0: 'can',
 1: 'we',
 2: 'make',
 3: 'this',
 4: 'quick',
 5: 'and',
 6: 'andrew',
 7: 'barrett',
 8: 'are',
 9: 'having',
 10: 'an',
 11: 'incredibly',
 12: 'public',
 13: 'break',
 14: 'up',
 15: 'on',
 16: 'the',
 17: 'again',
 18: 'well',
 19: 'i',
 20: 'thought',
 21: 'would',
 22: 'start',
 23: 'with',
 24: 'if',
 25: 'that',
 26: 'is',
 27: 'okay',
 28: 'you',
 29: 'not',
 30: 'part',
 31: 'please',
 32: 'asking',
 33: 'me',
 34: 'out',
 35: 'so',
 36: 'cute',
 37: 'what',
 38: 'your',
 39: 'name',
 40: 'no',
 41: 'it',
 42: 'my',
 43: 'fault',
 44: 'did',
 45: 'have',
 46: 'a',
 47: 'proper',
 48: 'introduction',
 49: 'cameron',
 50: 'thing',
 51: 'm',
 52: 'at',
 53: 'mercy',
 54: 'of',
 55: 'particularly',
 56: 'hideous',
 57: 'breed',
 58: 'loser',
 59: 'sister',
 60: 'cannot',
 61: 'date',
 62: 'until',
 63: 'she',
 64: 'does',
 65: 'why',
 66: 'mystery',
 67: 'used',
 68: 'to',
 69: 'be',
 70: 'really',
 71: 'popular',
 72: 'when',
 73: 'started',
 74: 'high',
 75: 'school

In [26]:
# Add <EOS> to the end of each answers;  <EOS> is used for decoding part of our seq2seq which we start the process of decoding after recieving <EOS>
for index in range(len(answers_preprocessed)):
    answers_preprocessed[index] += " <EOS>"

In [27]:
# Take a look at answers_preprocessed
answers_preprocessed[:5]

['well  i thought we would start with pronunciation  if that is okay with you  <EOS>',
 'not the hacking and gagging and spitting part   please  <EOS>',
 'okay    then how  bout we try out some french cuisine   saturday   night  <EOS>',
 'forget it  <EOS>',
 'cameron  <EOS>']

In [28]:
# Translate all the questions & answers into integer
# And replace all the words that were filtered out by <OUT>
questions_to_int = []
for i_question in tqdm(questions_preprocessed):
    ints = []
    for i_word in i_question.split():
        if i_word not in questionwords2int:
            ints.append(questionwords2int["<OUT>"])
        else:
            ints.append(questionwords2int[i_word])
    questions_to_int.append(ints)
            
answers_to_int = []
for i_answer in tqdm(answers_preprocessed):
    ints = []
    for i_word in i_answer.split():
        if i_word not in answerwords2int:
            ints.append(answerwords2int["<OUT>"])
        else:
            ints.append(answerwords2int[i_word])
    answers_to_int.append(ints)

100%|██████████| 221616/221616 [00:02<00:00, 91214.64it/s]
100%|██████████| 221616/221616 [00:02<00:00, 92291.36it/s] 


In [29]:
# Take a look at questions_to_int
questions_to_int[:5]

[[0,
  1,
  2,
  3,
  4,
  8795,
  8795,
  5,
  6,
  7,
  8,
  9,
  10,
  11,
  8795,
  12,
  13,
  14,
  15,
  16,
  8795,
  17],
 [18, 19, 20, 1, 21, 22, 23, 8795, 24, 25, 26, 27, 23, 28],
 [29, 16, 8795, 5, 8795, 5, 8795, 30, 31],
 [28, 8, 32, 33, 34, 25, 26, 35, 36, 37, 26, 38, 39, 17],
 [40, 40, 41, 26, 42, 43, 1, 44, 29, 45, 46, 47, 48]]

In [30]:
# Take a look at answers_to_int
answers_to_int[:5]

[[18, 19, 20, 1, 21, 22, 23, 8795, 24, 25, 26, 27, 23, 28, 8794],
 [29, 16, 8795, 5, 8795, 5, 8795, 30, 31, 8794],
 [27, 76, 100, 1509, 1, 871, 34, 487, 390, 8795, 213, 246, 8794],
 [249, 41, 8794],
 [49, 8794]]

In [31]:
# Sort the questions & answers by the length of questions
# And remove the long texts (longer than 25 words)
sorted_clean_questions = []
sorted_clean_answers = []

for i_length in range(1, 25 + 1):
    for i in enumerate(questions_to_int):
        if len(i[1]) == i_length:
            sorted_clean_questions.append(questions_to_int[i[0]])
            sorted_clean_answers.append(answers_to_int[i[0]])

In [32]:
# Take a look at sorted_clean_questions; At start the lenfth is 1 and it get larger at the end
sorted_clean_questions

[[49],
 [65],
 [126],
 [150],
 [136],
 [40],
 [178],
 [40],
 [183],
 [184],
 [225],
 [37],
 [65],
 [136],
 [65],
 [113],
 [134],
 [298],
 [214],
 [37],
 [226],
 [183],
 [27],
 [65],
 [183],
 [8795],
 [462],
 [254],
 [183],
 [193],
 [8795],
 [8795],
 [674],
 [100],
 [8795],
 [8795],
 [37],
 [40],
 [37],
 [8795],
 [150],
 [93],
 [37],
 [37],
 [8795],
 [641],
 [40],
 [40],
 [947],
 [833],
 [1116],
 [40],
 [235],
 [72],
 [214],
 [40],
 [143],
 [1250],
 [214],
 [1108],
 [1108],
 [1108],
 [1108],
 [214],
 [338],
 [152],
 [27],
 [95],
 [674],
 [37],
 [1134],
 [1244],
 [1483],
 [1062],
 [1489],
 [214],
 [1528],
 [37],
 [37],
 [214],
 [1576],
 [1576],
 [1576],
 [1576],
 [1576],
 [1576],
 [27],
 [674],
 [70],
 [674],
 [235],
 [100],
 [1606],
 [1576],
 [1576],
 [1576],
 [8795],
 [1576],
 [1576],
 [70],
 [1644],
 [679],
 [1766],
 [214],
 [39],
 [214],
 [1210],
 [214],
 [1328],
 [214],
 [18],
 [126],
 [37],
 [65],
 [1815],
 [1746],
 [214],
 [1823],
 [214],
 [214],
 [1796],
 [226],
 [1142],
 [72],
 

In [33]:
# Take a look at sorted_clean_answers
sorted_clean_answers

[[16,
  50,
  26,
  49,
  19,
  51,
  52,
  16,
  53,
  54,
  46,
  55,
  56,
  57,
  54,
  58,
  42,
  59,
  19,
  60,
  61,
  62,
  63,
  64,
  8794],
 [8795,
  66,
  63,
  67,
  68,
  69,
  70,
  71,
  72,
  63,
  73,
  74,
  75,
  76,
  41,
  77,
  78,
  79,
  63,
  80,
  81,
  54,
  41,
  82,
  83,
  8794],
 [105, 8794],
 [1506, 79, 104, 1525, 34, 152, 613, 8794],
 [28, 156, 231, 3, 6310, 8794],
 [27, 28, 8, 163, 68, 257, 68, 1263, 100, 68, 619, 8794],
 [289, 740, 135, 8794],
 [28, 243, 98, 68, 194, 230, 41, 8794],
 [196, 8794],
 [21, 28, 127, 617, 33, 46, 1490, 49, 8794],
 [42,
  2547,
  160,
  19,
  215,
  80,
  46,
  106,
  1107,
  52,
  970,
  16,
  8795,
  147,
  223,
  523,
  8794],
 [114, 8795, 96, 46, 274, 8794],
 [181, 77, 79, 46, 275, 276, 8794],
 [282,
  25,
  19,
  283,
  19,
  21,
  113,
  97,
  284,
  78,
  119,
  281,
  271,
  77,
  268,
  41,
  5,
  19,
  45,
  29,
  285,
  286,
  96,
  8795,
  212,
  155,
  5,
  42,
  8795,
  8795,
  8795,
  287,
  8794],
 [119, 6

<br>

# 4. Building the Seq2Seq Model

---

A typical sequence to sequence model has two parts which both the parts are practically two different neural network models combined into one giant network.:

- **An Encoder:** The task of an encoder network is to understand the input sequence, and create a smaller dimensional representation of it. This representation is then forwarded to a decoder network
- **An Decoder:** The task of an decoder network is to generates a sequence of its own that represents the output.

This can be used for machine translation or for free-from question answering (generating a natural language answer given a natural language question) -- in general, it is applicable any time you need to generate text.

In [34]:
# Function for creating the placeholers for inputs and targets
def model_inputs():
    """
    Create placeholders for inputs, targets, learning rate, and keep probability for dropout.
    """
    # Inputs
    inputs = tf.placeholder(dtype = tf.int32, shape = [None, None], name = 'input')
    
    # Targets
    targets = tf.placeholder(dtype = tf.int32, shape = [None, None], name = 'target')
    
    # Learning rate
    lr = tf.placeholder(dtype = tf.float32, name = 'learning_rate')
    
    # Keep probability (for dropout)
    keep_prob = tf.placeholder(dtype = tf.float32, name = 'keep_prob')
    
    return inputs, targets, lr, keep_prob

In [35]:
# Function for preprocesing the targets (create batches + add <SOS> at the start of each sentence)
def preprocess_targets(targets, word2int, batch_size):
    """
    Preprocessing the target by:
        1. Creating batches
        2. Adding <SOS> at the begining of each sentence
    """
    # Left side: <SOS> tensor
    left_side = tf.fill(dims = [batch_size, 1], value = word2int['<SOS>'])
    
    # Right side: Sliced targets
    right_side = tf.strided_slice(input_ = targets, begin = [0,0], end = [batch_size, -1], strides = [1,1])
    
    # Concatenating two tensors
    preprocessed_targets = tf.concat(values = [left_side, right_side], axis = 1)
    
    return preprocessed_targets

In [36]:
# Create the Encoder RNN Layer
def encoder_rnn(rnn_inputs, rnn_size, num_layers, keep_prob, sequence_length):
    """
    Encoder RNN Layer
    """
    # Basic LSTM cell
    lstm = tf.contrib.rnn.BasicLSTMCell(num_units = rnn_size)
    
    # Add dropout
    lstm_dropout = tf.contrib.rnn.DropoutWrapper(cell = lstm, input_keep_prob = keep_prob)
    
    # Create RNN cell composed sequentially of multiple simple cells
    encoder_cell = tf.contrib.rnn.MultiRNNCell(cells = [lstm_dropout] * num_layers)
    
    # Dynamic version of Bidirectional RNN
    encoder_output, encoder_state = tf.nn.bidirectional_dynamic_rnn(cell_fw = encoder_cell,
                                                                    cell_bw = encoder_cell,
                                                                    sequence_length = sequence_length,
                                                                    inputs = rnn_inputs,
                                                                    dtype = tf.float32)
    return encoder_state

In [37]:
# Decode the training set
def decode_training_set(encoder_state, decoder_cell, decoder_embedded_input, sequence_length, decoding_scope, 
                        output_function, keep_prob, batch_size):
    """
    Decoding the training set.
    """
    # Initialize the attention
    attention_states = tf.zeros(shape = [batch_size, 1, decoder_cell.output_size])
    
    # Prepare keys / values / functions for attention
    attention_keys, attention_values, attention_score_function, attention_construct_function = tf.contrib.seq2seq.prepare_attention(attention_states = attention_states, 
                                                                                                                                    attention_option = "bahdanau", 
                                                                                                                                    num_units = decoder_cell.output_size)
    
    # Attentional decoder function for dynamic_rnn_decoder during training
    training_decoder_function = tf.contrib.seq2seq.attention_decoder_fn_train(encoder_state = encoder_state[0],
                                                                              attention_keys = attention_keys,
                                                                              attention_values = attention_values,
                                                                              attention_score_fn = attention_score_function,
                                                                              attention_construct_fn = attention_construct_function,
                                                                              name = "attn_dec_train")
    
    # Dynamic RNN decoder for a sequence-to-sequence model specified by RNNCell and decoder function
    decoder_output, decoder_final_state, decoder_final_context_state = tf.contrib.seq2seq.dynamic_rnn_decoder(cell = decoder_cell,
                                                                                                              decoder_fn = training_decoder_function,
                                                                                                              inputs = decoder_embedded_input,
                                                                                                              sequence_length = sequence_length,
                                                                                                              scope = decoding_scope)
    # Add dropout
    decoder_output_dropout = tf.nn.dropout(x = decoder_output, keep_prob = keep_prob)
    
    return output_function(decoder_output_dropout)

In [38]:
# Decode the test/validation set
def decode_test_set(encoder_state, decoder_cell, decoder_embeddings_matrix, sos_id, eos_id, maximum_length, num_words, 
                    decoding_scope, output_function, keep_prob, batch_size):
    """
    Decoding the test set / validation set.
    """
    # Initialize the attention
    attention_states = tf.zeros(shape = [batch_size, 1, decoder_cell.output_size])
    
    # Prepare keys / values / functions for attention
    attention_keys, attention_values, attention_score_function, attention_construct_function = tf.contrib.seq2seq.prepare_attention(attention_states = attention_states, 
                                                                                                                                    attention_option = "bahdanau", 
                                                                                                                                    num_units = decoder_cell.output_size)
    # Attentional decoder function for dynamic_rnn_decoder during inference
    test_decoder_function = tf.contrib.seq2seq.attention_decoder_fn_inference(output_fn = output_function,
                                                                              encoder_state = encoder_state[0],
                                                                              attention_keys = attention_keys,
                                                                              attention_values = attention_values,
                                                                              attention_score_fn = attention_score_function,
                                                                              attention_construct_fn = attention_construct_function,
                                                                              embeddings = decoder_embeddings_matrix,
                                                                              start_of_sequence_id = sos_id,
                                                                              end_of_sequence_id = eos_id,
                                                                              maximum_length = maximum_length,
                                                                              num_decoder_symbols = num_words,
                                                                              name = "attn_dec_inf")
    
    # Dynamic RNN decoder for a sequence-to-sequence model specified by RNNCell and decoder function
    test_predictions, decoder_final_state, decoder_final_context_state = tf.contrib.seq2seq.dynamic_rnn_decoder(cell = decoder_cell,
                                                                                                                decoder_fn = test_decoder_function,
                                                                                                                scope = decoding_scope)
    return test_predictions

In [39]:
# Create Decoder RNN
def decoder_rnn(decoder_embedded_input, decoder_embeddings_matrix, encoder_state, num_words, sequence_length, 
                rnn_size, num_layers, word2int, keep_prob, batch_size):
    """
    Creating Decoder RNN
    """
    # Variable scope
    with tf.variable_scope("decoding") as decoding_scope:
        
        # Basic LSTM cell
        lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
        
        # Add dropout
        lstm_dropout = tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob = keep_prob)
        
        # Create RNN cell composed sequentially of multiple simple cells
        decoder_cell = tf.contrib.rnn.MultiRNNCell([lstm_dropout] * num_layers)
        
        # Initialize the weights
        weights = tf.truncated_normal_initializer(stddev = 0.1)
        
        # Initialize the biases
        biases = tf.zeros_initializer()
        
        # Add a Fully Connected Layer
        output_function = lambda x: tf.contrib.layers.fully_connected(inputs = x,
                                                                      num_outputs = num_words,
                                                                      normalizer_fn = None,
                                                                      scope = decoding_scope,
                                                                      weights_initializer = weights,
                                                                      biases_initializer = biases)
        # Decoding the training set
        training_predictions = decode_training_set(encoder_state = encoder_state,
                                                   decoder_cell = decoder_cell,
                                                   decoder_embedded_input = decoder_embedded_input,
                                                   sequence_length = sequence_length,
                                                   decoding_scope = decoding_scope,
                                                   output_function = output_function,
                                                   keep_prob = keep_prob,
                                                   batch_size = batch_size)
        
        # Reuse variables
        decoding_scope.reuse_variables()
        
        # Decoding the test set
        test_predictions = decode_test_set(encoder_state = encoder_state,
                                           decoder_cell = decoder_cell,
                                           decoder_embeddings_matrix = decoder_embeddings_matrix,
                                           sos_id = word2int['<SOS>'],
                                           eos_id = word2int['<EOS>'],
                                           maximum_length = sequence_length - 1,
                                           num_words = num_words,
                                           decoding_scope = decoding_scope,
                                           output_function = output_function,
                                           keep_prob = keep_prob,
                                           batch_size = batch_size)
        
    return training_predictions, test_predictions

In [40]:
# Build the seq2seq model
def seq2seq_model(inputs, targets, keep_prob, batch_size, sequence_length, answers_num_words, questions_num_words, 
                  encoder_embedding_size, decoder_embedding_size, rnn_size, num_layers, questionswords2int):
    """
    Seq2Seq Model
    """
    # Maps a sequence of symbols to a sequence of embeddings
    encoder_embedded_input = tf.contrib.layers.embed_sequence(ids = inputs,
                                                              vocab_size = answers_num_words + 1,
                                                              embed_dim = encoder_embedding_size,
                                                              initializer = tf.random_uniform_initializer(0, 1))
    
    # Encoder RNN Layer
    encoder_state = encoder_rnn(rnn_inputs = encoder_embedded_input, 
                                rnn_size = rnn_size, 
                                num_layers = num_layers, 
                                keep_prob = keep_prob, 
                                sequence_length = sequence_length)
    
    # Preprocessing the target by creating batches & Adding <SOS> at the begining of each sentence
    preprocessed_targets = preprocess_targets(targets = targets, 
                                              word2int = questionswords2int, 
                                              batch_size = batch_size)
    
    # Decoder embedding matrix
    decoder_embeddings_matrix = tf.Variable(tf.random_uniform(shape = [questions_num_words + 1, decoder_embedding_size], minval = 0, maxval = 1))
    
    # Decoder embedding input
    decoder_embedded_input = tf.nn.embedding_lookup(params = decoder_embeddings_matrix, 
                                                    ids = preprocessed_targets)
    
    # Decoder RNN Layer
    training_predictions, test_predictions = decoder_rnn(decoder_embedded_input = decoder_embedded_input,
                                                         decoder_embeddings_matrix = decoder_embeddings_matrix,
                                                         encoder_state = encoder_state,
                                                         num_words = questions_num_words,
                                                         sequence_length = sequence_length,
                                                         rnn_size = rnn_size,
                                                         num_layers = num_layers,
                                                         word2int = questionswords2int,
                                                         keep_prob = keep_prob,
                                                         batch_size = batch_size)
    
    return training_predictions, test_predictions

<br>

# 5. Training the Seq2Seq Model

---

In [41]:
# Setting the Hyperparameters
epochs = 10
batch_size = 64
rnn_size = 512
num_layers = 3
encoding_embedding_size = 512
decoding_embedding_size = 512
learning_rate = 0.01
learning_rate_decay = 0.9
min_learning_rate = 0.0001
keep_probability = 0.5  # Based on the paper for Geoffery Hinton: "Dropout 20% of the input units and 50% of the hidden units"

In [42]:
# Reset the default graph
tf.reset_default_graph()

# Define a session
session = tf.InteractiveSession()

In [43]:
# Load the model input
inputs, targets, lr, keep_prob = model_inputs()

In [44]:
# Set the sequence length (if you remmber, before we set this to 25 as maximum length)
sequence_length = tf.placeholder_with_default(25, None, name = 'sequence_length')

In [45]:
# Get the shape of the inputs tensor
input_shape = tf.shape(inputs)

In [None]:
# Get the training & test predictions
training_predictions, test_predictions = seq2seq_model(inputs = tf.reverse(tensor = inputs, axis = [-1]), 
                                                       targets = targets, 
                                                       keep_prob = keep_prob, 
                                                       batch_size = batch_size, 
                                                       sequence_length = sequence_length, 
                                                       answers_num_words = len(answerints2word), 
                                                       questions_num_words = len(questionwords2int), 
                                                       encoder_embedding_size = encoding_embedding_size, 
                                                       decoder_embedding_size = decoding_embedding_size, 
                                                       rnn_size = rnn_size, 
                                                       num_layers = num_layers, 
                                                       questionswords2int = questionwords2int)

In [None]:
# Set up the Loss Error, the Optimizer, and the Gradient Clipping
with tf.name_scope("optimization"):
    loss_error = tf.contrib.seq2seq.sequence_loss(training_predictions,
                                                  targets,
                                                  tf.ones([input_shape[0], sequence_length]))
    optimizer = tf.train.AdamOptimizer(learning_rate)
    gradients = optimizer.compute_gradients(loss_error)
    clipped_gradients = [(tf.clip_by_value(grad_tensor, -5., 5.), grad_variable) for grad_tensor, grad_variable in gradients if grad_tensor is not None]
    optimizer_gradient_clipping = optimizer.apply_gradients(clipped_gradients)

In [None]:
# Padding the sequences with the <PAD> token (for having the same length)
# Question: [ 'Who', 'are', 'you', '<PAD>', '<PAD>', '<PAD>', '<PAD>' ]
# Answer: [ <SOS>, 'I', 'am', 'a', 'bot', '.', '<EOS>', '<PAD>' ]
def apply_padding(batch_of_sequences, word2int):
    max_sequence_length = max([len(sequence) for sequence in batch_of_sequences])
    return [sequence + [word2int['<PAD>']] * (max_sequence_length - len(sequence)) for sequence in batch_of_sequences]

In [None]:
# Split the dataset into batches of questions and answers
def split_into_batches(questions, answers, batch_size):
    for batch_index in range(0, len(questions) // batch_size):
        start_index = batch_index * batch_size
        questions_in_batch = questions[start_index : start_index + batch_size]
        answers_in_batch = answers[start_index : start_index + batch_size]
        padded_questions_in_batch = np.array(apply_padding(questions_in_batch, questionwords2int))
        padded_answers_in_batch = np.array(apply_padding(answers_in_batch, answerwords2int))
        yield padded_questions_in_batch, padded_answers_in_batch

In [None]:
# Split the questions & answers into training and validation sets
training_validation_split = int(len(sorted_clean_questions) * 0.15)
training_questions = sorted_clean_questions[training_validation_split:]
training_answers = sorted_clean_answers[training_validation_split:]
validation_questions = sorted_clean_questions[:training_validation_split]
validation_answers = sorted_clean_answers[:training_validation_split]

In [None]:
# Training
batch_index_check_training_loss = 100
batch_index_check_validation_loss = ((len(training_questions)) // batch_size // 2) - 1
total_training_loss_error = 0
list_validation_loss_error = []
early_stopping_check = 0
early_stopping_stop = 1000
checkpoint = "chatbot_weights.ckpt" # For Windows users, replace this line of code by: checkpoint = "./chatbot_weights.ckpt"
session.run(tf.global_variables_initializer())
for epoch in range(1, epochs + 1):
    for batch_index, (padded_questions_in_batch, padded_answers_in_batch) in enumerate(split_into_batches(training_questions, training_answers, batch_size)):
        starting_time = time.time()
        _, batch_training_loss_error = session.run([optimizer_gradient_clipping, loss_error], {inputs: padded_questions_in_batch,
                                                                                               targets: padded_answers_in_batch,
                                                                                               lr: learning_rate,
                                                                                               sequence_length: padded_answers_in_batch.shape[1],
                                                                                               keep_prob: keep_probability})
        total_training_loss_error += batch_training_loss_error
        ending_time = time.time()
        batch_time = ending_time - starting_time
        if batch_index % batch_index_check_training_loss == 0:
            print('Epoch: {:>3}/{}, Batch: {:>4}/{}, Training Loss Error: {:>6.3f}, Training Time on 100 Batches: {:d} seconds'.format(epoch,
                                                                                                                                       epochs,
                                                                                                                                       batch_index,
                                                                                                                                       len(training_questions) // batch_size,
                                                                                                                                       total_training_loss_error / batch_index_check_training_loss,
                                                                                                                                       int(batch_time * batch_index_check_training_loss)))
            total_training_loss_error = 0
        if batch_index % batch_index_check_validation_loss == 0 and batch_index > 0:
            total_validation_loss_error = 0
            starting_time = time.time()
            for batch_index_validation, (padded_questions_in_batch, padded_answers_in_batch) in enumerate(split_into_batches(validation_questions, validation_answers, batch_size)):
                batch_validation_loss_error = session.run(loss_error, {inputs: padded_questions_in_batch,
                                                                       targets: padded_answers_in_batch,
                                                                       lr: learning_rate,
                                                                       sequence_length: padded_answers_in_batch.shape[1],
                                                                       keep_prob: 1})
                total_validation_loss_error += batch_validation_loss_error
            ending_time = time.time()
            batch_time = ending_time - starting_time
            average_validation_loss_error = total_validation_loss_error / (len(validation_questions) / batch_size)
            print('Validation Loss Error: {:>6.3f}, Batch Validation Time: {:d} seconds'.format(average_validation_loss_error, int(batch_time)))
            learning_rate *= learning_rate_decay
            if learning_rate < min_learning_rate:
                learning_rate = min_learning_rate
            list_validation_loss_error.append(average_validation_loss_error)
            if average_validation_loss_error <= min(list_validation_loss_error):
                print('I speak better now!!')
                early_stopping_check = 0
                saver = tf.train.Saver()
                saver.save(session, checkpoint)
            else:
                print("Sorry I do not speak better, I need to practice more.")
                early_stopping_check += 1
                if early_stopping_check == early_stopping_stop:
                    break
    if early_stopping_check == early_stopping_stop:
        print("My apologies, I cannot speak better anymore. This is the best I can do.")
        break
print("Game Over")

Epoch:   1/10, Batch:    0/2689, Training Loss Error:  0.091, Training Time on 100 Batches: 1798 seconds
Epoch:   1/10, Batch:  100/2689, Training Loss Error:  9.336, Training Time on 100 Batches: 789 seconds
Epoch:   1/10, Batch:  200/2689, Training Loss Error:  4.966, Training Time on 100 Batches: 1309 seconds
Epoch:   1/10, Batch:  300/2689, Training Loss Error:  2.495, Training Time on 100 Batches: 1160 seconds
Epoch:   1/10, Batch:  400/2689, Training Loss Error:  2.389, Training Time on 100 Batches: 2446 seconds
Epoch:   1/10, Batch:  500/2689, Training Loss Error:  2.338, Training Time on 100 Batches: 2396 seconds
Epoch:   1/10, Batch:  600/2689, Training Loss Error:  2.180, Training Time on 100 Batches: 2235 seconds
Epoch:   1/10, Batch:  700/2689, Training Loss Error:  2.372, Training Time on 100 Batches: 1640 seconds
Epoch:   1/10, Batch:  800/2689, Training Loss Error:  2.232, Training Time on 100 Batches: 1197 seconds
Epoch:   1/10, Batch:  900/2689, Training Loss Error:  2