<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-in-action/blob/master/10-sequence-to-sequence-models-and-attention/1_building_chatbot_using_sequence_to_sequence_networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building a chatbot using sequence-to-sequence networks

This notebook walks you through the steps needed to train a chatbot. For the chatbot training, you’ll use the [Cornell movie dialog corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). You’ll train a sequence-to-sequence network to “adequately” reply to your questions or statements. The chatbot example is adapted from the sequence-to-sequence example on the [Keras blog](https://github.com/fchollet/keras/blob/master/examples/lstm_seq2seq.py).



## Setup

In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from tensorflow.keras import backend as keras_backend
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, LSTM
from tensorflow.keras.preprocessing import sequence

import os
import tarfile
import re
import tqdm

import glob
from random import shuffle
from nltk.tokenize import TreebankWordTokenizer

import requests

In [2]:
!wget -q https://github.com/rahiakela/natural-language-processing-in-action/raw/master/10-sequence-to-sequence-models-and-attention/dataset/moviedialog.csv

In [3]:
df = pd.read_csv("moviedialog.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,statement,reply
0,0,you're asking me out. that's so cute. what's y...,forget it.
1,1,"no, no, it's my fault we didn't have a proper ...",cameron.
2,2,"gosh, if only we could find kat a boyfriend...",let me see what i can do.
3,3,c'esc ma tete. this is my head,right. see? you're ready for the quiz.
4,4,how is our little find the wench a date plan p...,"well, there's someone i think might be"


In [4]:
print(df.shape)
df = df[["statement", "reply"]]
df.head()

(64350, 3)


Unnamed: 0,statement,reply
0,you're asking me out. that's so cute. what's y...,forget it.
1,"no, no, it's my fault we didn't have a proper ...",cameron.
2,"gosh, if only we could find kat a boyfriend...",let me see what i can do.
3,c'esc ma tete. this is my head,right. see? you're ready for the quiz.
4,how is our little find the wench a date plan p...,"well, there's someone i think might be"


In [5]:
df = df.dropna()
print(df.shape)

(64350, 2)


## Preparing the corpus for your training

First, you need to load the corpus and generate the training sets from it. The training data will determine the set of characters the encoder and decoder will support during the training and during the generation phase. Please note that this implementation doesn’t support characters that haven’t been included during the training phase.

Using the entire Cornell Movie Dialog dataset can be computationally intensive because a few sequences have more than 2,000 tokens, and 2,000 time steps take a while to unroll. The majority of dialog samples, however, have fewer than 100 characters.

For this example, the dialog corpus has already been preprocessed: samples are limited to those with fewer than 100 characters, odd characters are removed, and only lowercase characters are allowed.
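The preprocessing described above could look roughly like the following sketch. It is not the script that produced `moviedialog.csv`: the character whitelist in `ODD_CHARS`, the helper names `clean_text` and `preprocess_pairs`, and the toy pairs are all assumptions made for illustration.

```python
import re

# Hypothetical whitelist: lowercase letters, digits, spaces, and basic punctuation.
# The character set actually used for moviedialog.csv is not stated in this notebook.
ODD_CHARS = re.compile(r"[^a-z0-9 .,!?']")
MAX_CHARS = 100


def clean_text(text):
    """Lowercase the text and drop every character outside the whitelist."""
    return ODD_CHARS.sub("", text.lower())


def preprocess_pairs(pairs, max_chars=MAX_CHARS):
    """Clean both sides of each pair and keep pairs where both are shorter than max_chars."""
    cleaned = [(clean_text(statement), clean_text(reply)) for statement, reply in pairs]
    return [(s, r) for s, r in cleaned if len(s) < max_chars and len(r) < max_chars]


# Toy data: the second statement is too long and should be filtered out.
toy_pairs = [
    ("Hello THERE!", "Hi."),
    ("x" * 150, "too long input"),
    ("What's up?", "Not much #"),
]
toy_clean = preprocess_pairs(toy_pairs)
print(toy_clean)
```

Filtering on both the statement and the reply keeps the encoder and decoder sequence lengths bounded by the same limit.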

You’ll loop over the corpus file and generate the training pairs (technically 3-tuples: input text, target text with start token, and target text). While reading the corpus, you’ll also generate a set of input and target characters, which you’ll then use to one-hot encode the samples. The input and target characters don’t have to match.

But characters that aren’t included in the sets can’t be read or generated during the generation phase. The result of the following listing is two lists of input and target texts (strings), as well as two sets of characters that have been seen in the training corpus.

In [6]:
# The arrays hold the input and target text read from the corpus file.
input_texts, target_texts = [], []

# The sets hold the seen characters in the input and target text.
input_vocabulary = set()
output_vocabulary = set()

"""
The target sequence is annotated with a start (first) and stop (last) token; the characters representing the tokens are
defined here. These tokens can’t be part of the normal sequence text and should be uniquely used as start and stop tokens.
"""
start_token = "\t"
stop_token = "\n"

"""
max_training_samples defines how many lines are used for the training. It’s the lower number
of either a user-defined maximum or the total number of lines loaded from the file.
"""
max_training_samples = min(25000, len(df))

# Only the first max_training_samples pairs are read, so the cap defined above takes effect.
for input_text, target_text in zip(df.statement.iloc[:max_training_samples],
                                   df.reply.iloc[:max_training_samples]):
  # The target_text needs to be wrapped with the start and stop tokens.
  target_text = start_token + target_text + stop_token
  input_texts.append(input_text)
  target_texts.append(target_text)

  # Compile the vocabularies: the sets of unique characters seen in the input and target texts.
  input_vocabulary.update(input_text)
  output_vocabulary.update(target_text)
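
The listing above wraps each target in a start and a stop token. The following minimal sketch shows why: in the Keras example this notebook adapts, the decoder target is the decoder input shifted one time step ahead. The slicing used here is one simple way to show that offset on a single reply, not necessarily how the training arrays are filled later in this notebook.

```python
start_token = "\t"
stop_token = "\n"

reply = "forget it."
target_text = start_token + reply + stop_token

# Decoder input: begins with the start token and omits the final stop token.
decoder_input = target_text[:-1]
# Decoder target: the same text one step ahead, ending with the stop token.
decoder_target = target_text[1:]

# At every time step the decoder sees one character and should predict the next.
for step, (seen, expected) in enumerate(zip(decoder_input, decoder_target)):
    print(step, repr(seen), "->", repr(expected))
```

At step 0 the decoder sees only the start token and has to produce the first character of the reply; at the last step it has to produce the stop token, which ends generation.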

In [7]:
input_texts[:5]

["you're asking me out. that's so cute. what's your name again?",
 "no, no, it's my fault we didn't have a proper introduction ",
 'gosh, if only we could find kat a boyfriend...',
 "c'esc ma tete. this is my head",
 'how is our little find the wench a date plan progressing?']

In [8]:
target_texts[:5]

['\tforget it.\n',
 '\tcameron.\n',
 '\tlet me see what i can do.\n',
 "\tright. see? you're ready for the quiz.\n",
 "\twell, there's someone i think might be \n"]

In [9]:
list(input_vocabulary)[:10]

['g', 'c', 'm', '5', 'l', 'x', '.', "'", 'n', '1']

In [10]:
list(output_vocabulary)[:10]

['g', 'c', 'm', '5', 'l', 'x', '.', "'", 'n', '1']
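
Because characters outside these sets can’t be read or generated, it can help to check a new input against the vocabulary before encoding it. The helper `find_unknown_chars` and the toy vocabulary below are hypothetical and only illustrate the idea; in the notebook you would pass `input_vocabulary` instead.

```python
def find_unknown_chars(text, vocabulary):
    """Return the sorted characters of `text` that are missing from `vocabulary`."""
    return sorted(set(text) - set(vocabulary))


# Toy vocabulary standing in for input_vocabulary (an assumption for this sketch).
toy_vocabulary = set("abcdefghijklmnopqrstuvwxyz .,?!'")

known_only = find_unknown_chars("how are you?", toy_vocabulary)
with_unknown = find_unknown_chars("Hello #1", toy_vocabulary)

print(known_only)
print(with_unknown)
```

An empty result means the text can be encoded as is. Otherwise, the listed characters would have to be removed or replaced first, for example by lowercasing the input.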

## Building your character dictionary

You need to convert each character of the input and target texts into a one-hot vector. To generate the one-hot vectors, you build token dictionaries (one for the input text and one for the target text) in which every character is mapped to an index. You also build the reverse dictionaries (index to character), which you’ll use during the generation phase to convert a generated index back to a character.

In [11]:
# You convert the character sets into sorted lists of characters, which you then use to generate the dictionary.
input_vocabulary = sorted(input_vocabulary)
output_vocabulary = sorted(output_vocabulary)

# For the input and target data, you determine the number of unique characters, which sets the width of the one-hot matrices.
input_vocab_size = len(input_vocabulary)
output_vocab_size = len(output_vocabulary)

# For the input and target data, you also determine the maximum sequence length in characters.
max_encoder_seq_length = max(len(txt) for txt in input_texts)
max_decoder_seq_length = max(len(txt) for txt in target_texts)

# Loop over input_vocabulary and output_vocabulary to create the lookup dictionaries, which you use to generate the one-hot vectors.
input_token_index = {char: i for i, char in enumerate(input_vocabulary)}
target_token_index = {char: i for i, char in enumerate(output_vocabulary)}

# Loop over the newly created dictionaries to create the reverse lookups.
reverse_input_char_index = {i: char for char, i in input_token_index.items()}
reverse_target_char_index = {i: char for char, i in target_token_index.items()}

In [14]:
list(input_token_index)[:10]

[' ', '!', "'", ',', '.', '0', '1', '2', '3', '4']

In [15]:
list(target_token_index)[:10]

['\t', '\n', ' ', '!', "'", ',', '.', '0', '1', '2']
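
To see how the forward and reverse dictionaries work together, here is a small round trip on a toy vocabulary. The names `token_index`, `reverse_index`, `encode`, and `decode` are illustrative only; the notebook’s own dictionaries are built in the same way from the full corpus.

```python
# Toy vocabulary built the same way as above: sorted unique characters.
toy_vocabulary = sorted(set("hi there" + "hello"))
token_index = {char: i for i, char in enumerate(toy_vocabulary)}
reverse_index = {i: char for char, i in token_index.items()}


def encode(text):
    """Map each character to its index; raises KeyError for unseen characters."""
    return [token_index[char] for char in text]


def decode(indices):
    """Map each index back to its character and join the result."""
    return "".join(reverse_index[i] for i in indices)


indices = encode("hello")
round_trip = decode(indices)
print(indices)
print(round_trip)
```

Each index marks the position that is set to 1 in the character’s one-hot vector, and the reverse dictionary recovers the character from that position.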

## Generate one-hot encoded training sets