<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-in-action/blob/10-sequence-to-sequence-models-and-attention/1_building_chatbot_using_sequence_to_sequence_networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a chatbot using sequence-to-sequence networks

This notebook walks you through the steps needed to train a chatbot. For training, you’ll use the [Cornell movie dialog corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). You’ll train a sequence-to-sequence network to “adequately” reply to your questions or statements. The chatbot is adapted from the sequence-to-sequence example on the [Keras blog](https://github.com/fchollet/keras/blob/master/examples/lstm_seq2seq.py).



## Setup

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from tensorflow.keras import backend as keras_backend
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, LSTM
from tensorflow.keras.preprocessing import sequence

import os
import tarfile
import re
import tqdm

import glob
from random import shuffle
from nltk.tokenize import TreebankWordTokenizer

import requests

In [0]:
data_path = keras.utils.get_file('cornell_movie_dialogs_corpus.zip', 
                            origin='http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip',
                            extract=True)

In [8]:
data_path

'/root/.keras/datasets/cornell_movie_dialogs_corpus.zip'

In [0]:
!cp /root/.keras/datasets/cornell_movie_dialogs_corpus.zip .

In [0]:
import zipfile
with zipfile.ZipFile('cornell_movie_dialogs_corpus.zip', 'r') as zip_ref:
    zip_ref.extractall('cornell_movie_dialogs_corpus')

## Preparing the corpus for your training

First, you need to load the corpus and generate the training sets from it. The training data determines the set of characters the encoder and decoder will support, both during training and during generation. Note that this implementation doesn’t support characters that weren’t seen during the training phase.

Using the entire Cornell Movie Dialog dataset can be computationally intensive because a few sequences have more than 2,000 tokens, and 2,000 time steps take a while to unroll. But the majority of dialog samples are fewer than 100 characters long.

For this example, the dialog corpus has been preprocessed by limiting samples to those with fewer than 100 characters, removing odd characters, and allowing only lowercase characters.
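The preprocessing described above could look roughly like the following. This is a minimal sketch, not the code used to build the dataset: the helper names `clean_text` and `clean_pairs`, the sample pairs, and the whitelist that defines "odd characters" are all assumptions made for illustration.

```python
import re

MAX_LEN = 100  # keep only samples with fewer than 100 characters
# Assumption: "odd characters" means anything outside this small whitelist.
DISALLOWED = re.compile(r"[^a-z0-9 .,!?']")

def clean_text(text):
    """Lowercase, drop characters outside the whitelist, and squeeze repeated spaces."""
    text = DISALLOWED.sub('', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

def clean_pairs(pairs, max_len=MAX_LEN):
    """Return cleaned (statement, reply) pairs whose sides are non-empty and shorter than max_len."""
    cleaned = []
    for statement, reply in pairs:
        statement, reply = clean_text(statement), clean_text(reply)
        if statement and reply and len(statement) < max_len and len(reply) < max_len:
            cleaned.append((statement, reply))
    return cleaned

# Made-up samples: the second pair is dropped because its reply is too long.
raw_pairs = [
    ("Hello THERE!", "Hi -- how are you?"),
    ("Short one.", "x" * 150),
]
pairs = clean_pairs(raw_pairs)
print(pairs)
```

Filtering on both sides of the pair keeps the encoder and the decoder sequences short, which bounds the number of time steps the network has to unroll.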

You’ll loop over the corpus file and generate the training pairs (technically 3-tuples: input text, target text with start token, and target text). While reading the corpus, you’ll also generate a set of input characters and a set of target characters, which you’ll then use to one-hot encode the samples. The input and target characters don’t have to match.

But characters that aren’t included in these sets can’t be read or generated during the generation phase. The corpus-loading listing below produces two lists of input and target texts (strings), as well as two sets of characters that have been seen in the training corpus.
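To make the role of the character sets concrete, here is a toy sketch on two made-up pairs, separate from the corpus listing. It builds the 3-tuple pieces, derives the character sets, and one-hot encodes one input string. The names `one_hot`, `decoder_inputs`, and `decoder_targets`, as well as the use of `'\t'` and `'\n'` as start and stop tokens, are assumptions for illustration.

```python
import numpy as np

# Assumption: tab marks the start of a reply and newline marks its end.
START, STOP = '\t', '\n'
pairs = [("hi", "hello"), ("bye", "see you")]  # made-up stand-in for the corpus

input_texts = [statement for statement, _ in pairs]
# The decoder input carries the start token; the decoder target is what it should emit.
decoder_inputs = [START + reply for _, reply in pairs]
decoder_targets = [reply + STOP for _, reply in pairs]

# Character sets seen in the data; input and target sets need not match.
input_chars = sorted(set(''.join(input_texts)))
target_chars = sorted(set(''.join(decoder_inputs + decoder_targets)))
input_index = {ch: i for i, ch in enumerate(input_chars)}

def one_hot(text, char_index, max_len):
    """Encode text as a (max_len, vocabulary size) matrix with one 1.0 per time step."""
    encoded = np.zeros((max_len, len(char_index)), dtype='float32')
    for t, ch in enumerate(text):
        encoded[t, char_index[ch]] = 1.0  # raises KeyError for a character not in the set
    return encoded

max_input_len = max(len(text) for text in input_texts)
encoded_hi = one_hot("hi", input_index, max_input_len)
print(encoded_hi.shape)
```

Because the lookup table is built only from characters in the training data, encoding a string with an unseen character fails, which is the limitation described above.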