# Building a Chatbot using Tensorflow Seq2seq

**Can we teach a machine to talk?**

In this tutorial, I will apply [Sequence to Sequence Learning](https://arxiv.org/pdf/1409.3215.pdf) method published by Google in 2014 to train a model for replying a sentence. The dataset I use is the [Movie Dialog Corpus provided by Cornell](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html).

Are you ready to make your machine to talk to you? Let's get started!

---

> Note: The tutorial uses some code from two online resources: [How to build your first chatbot](https://tutorials.botsfloor.com/how-to-build-your-first-chatbot-c84495d4622d) and [Building a Chatbot](https://github.com/Currie32/Chatbot-from-Movie-Dialogue/blob/master/Chatbot_Attention.ipynb). I've shared the draft with and reviewed by the instructor, and it's confirmed that *it's okay to use these resources in this way*. Please check [the private Piazza post](https://piazza.com/class/jcizpany5u6522?cid=1595) for more information.
    
---

## Introduction

**Sequence-to-sequence learning** is about training a model that converts one sequences domain to another. For example, it's used for transalation where it convert `Hello` to `你好`. It can also be used for building a chatbot that answers your techinical problem if it's trained on an IT helpdesk dataset ([A Neural Conversation Model, Google](https://arxiv.org/pdf/1506.05869v1.pdf)). In our case, we expect to train a chatbot that is capable of normal and casual conversations by using a movie subtitle dataset.

    "how are you?" -> [Seq2Seq Model] -> "good, you?"
    
In a sequence to sequence model, there's a encoder layer, a decoder layer and an intermediate that connects these two layers.

![](https://cdn-images-1.medium.com/max/1600/1*3lj8AGqfwEE5KCTJ-dXTvg.png) ([image source](https://towardsdatascience.com/sequence-to-sequence-model-introduction-and-concepts-44d9b41cd42d))

Each word in the input sequence is first be embedded using the word embedding technique learned in our class, then it's be fed into the encoder. The output of a single decoder component is fed into the next decoder component since the first word may affect what the next word is. By having this model setup, we are able to train the model that knows how to convert one sequence to another.

![](https://cdn-images-1.medium.com/max/1600/1*Ismhi-muID5ooWf3ZIQFFg.png)([image source](https://towardsdatascience.com/sequence-to-sequence-model-introduction-and-concepts-44d9b41cd42d))

If you want to learn even more about this topic without reading difficult papers, the best way it to watch [this talk](https://www.youtube.com/watch?v=G5RY_SUJih4) held by Quoc Le, Google.

**Recurrent Neural Networks (RNN)** are the networks that contain loops in them, allowing information to persist. However, to make it looks like a traditional neural network, we can unroll the loop. This will simply the question since we can apply the old techniques in training. The chain-like architecture also makes it intimately related to sequences and lists.

![](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png)([image source](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))

However, there's problem of long-term dependencies in RNN. In RNN, it only look at recent information to perform the present task and it doesn't perform well when the gap between two dependents is large. For example, RNN may works well on predicting the next word of "the clouds are in the ..." is "sky", but it performs badly while trying to predict the next word of "I grew up in Japan... I speak fluent ...(Japanese)".

![](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-longtermdependencies.png)([image source](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))

**Long Short Term Memory networks (LSTM)** are a special kind of RNN and t hey are capable of learning long-term dependencies. Inside the cell of LSTM, it contains different paths that enable the data be remembered for long-term or be forgotten. Explaining the details will be too much for this tutorial, please refer to [this resource](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) for detailed information.

![](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png)([image source](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))

Okay, now we are good to start with the code!

## Environment Setup

You will need to install some Python dependencies before starting following code. Please make sure you install exactly version 1.0.0 for Tensorflow since there's are some unsolved bug in the later versions while using deepcopy.

    $ pip install tensorflow==1.0.0 wget==3.2

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import re
import time
import wget
import zipfile
import os
import ntpath
import shutil
import string
tf.__version__

'1.0.0'

## Load dataset

We download the corpus dataset from the Internet and it's cached under `DATA_ROOT_DIR`. We will onle need two files as input, `movie_lines.txt` and `movie_converstaions.txt`. `movie_lines.txt` stores the original sentence in the subtitle and each of the line has a corresponding ID. `movie_conversations.txt` contains lists of line IDs, and each list of line IDs represents a conversation.

In [2]:
RAW_DATA_URL = "http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip"
DATA_ROOT_DIR = "./data/"
DATA_DIR = DATA_ROOT_DIR + "cornell movie-dialogs corpus/"

LINE_DATA = DATA_DIR + "movie_lines.txt"
CONVERSATION_DATA = DATA_DIR + "movie_conversations.txt"

def load_data():
    
    # Remove previous data
    try:
        shutil.rmtree(DATA_ROOT_DIR)
    except OSError:
        pass
    
    try:
        os.remove(ntpath.basename(RAW_DATA_URL))
    except OSError:
        pass    
    
    # Download
    raw_data_zip = wget.download(RAW_DATA_URL)
    
    # Unzip
    with zipfile.ZipFile(raw_data_zip, "r") as zip_ref:
        zip_ref.extractall(DATA_ROOT_DIR)
        
    # Load data
    with open(LINE_DATA, encoding="utf-8", errors="ignore") as f:
        lines = f.read().split("\n")
    
    with open(CONVERSATION_DATA, encoding="utf-8", errors="ignore") as f:
        conversations = f.read().split("\n")
        
    return lines, conversations

In [3]:
lines, conversations = load_data()

In [4]:
# Prints out the heading lines
# and see if you can get the same result as mine!
print("lines", lines[:3])
print("conversations", conversations[:3])

lines ['L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!', 'L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!', 'L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.']
conversations ["u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']", "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']", "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']"]


## Process data

Before feeding the data into the model, they should be processed in order to match the format we are expecting. The process data steps include:

1. Extract question and answer pair
2. Clean and filter data
3. Tokenize
4. Sort

### Extract Question and Answer Pair

As you can see, the raw data contains lots of characters that aren't helpful in our case. Now, we will process the raw data the try to retrieve a clean sequence of questions and answers. `questions` and `answers` are lists, and the entry of these two lists are maatched. That is, questions[i] is the question for answers[i].

First, we build the mapping object that maps line ID to the sentence.

In [5]:
def get_id2line(lines):
    id2line = {}
    for line in lines:
        fields = line.split(" +++$+++ ")
        if len(fields) != 5:
            continue
        _id, _content = fields[0], fields[4]
        id2line[_id] = _content
    return id2line

In [6]:
id2line = get_id2line(lines)

Then, we retrieve the line IDs for each conversation.

In [7]:
def get_conversation_ids():
    conversation_ids = []
    for line in conversations:
        fields = line.split(" +++$+++ ")
        if len(fields) != 4:
            continue
        # Remove commas and spaces
        ids = fields[-1][1:-1].replace("'", "").replace(" ", "").split(",")
        conversation_ids.append(ids)
    return conversation_ids

In [8]:
conversation_ids = get_conversation_ids()

In [9]:
conversation_ids[:3]

[['L194', 'L195', 'L196', 'L197'],
 ['L198', 'L199'],
 ['L200', 'L201', 'L202', 'L203']]

Finally, we build the question/answer list pair by using the two objects just created.

In [10]:
def get_question_answer(id2line, conversation_ids):
    
    questions, answers = [], []
    
    # Visit all conversations
    for ids in conversation_ids:
        
        # The answer is the next line of the question
        for i in range(len(ids)-1):
            questions.append(id2line[ids[i]])
            answers.append(id2line[ids[i+1]])
            
    return questions, answers

In [11]:
questions, answers = get_question_answer(id2line, conversation_ids)

In [12]:
print(len(questions), len(answers))

221616 221616


### Clean and Filter Data

It's not a good idea to use the whole original data since there are many parts may confuse our model. Here, we will run through mutliple functions in cleaning and filtering the data.

First, we replace upper case letters with their lower case letter, remove common abbreviations, and remove unnecessary spaces.

In [13]:
def clean_text(text):
    
    # Only lower case is used
    text = text.lower()
    
    # Avoid common abbreviations 
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"it's", "it is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "that is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"how's", "how is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"n'", "ng", text)
    text = re.sub(r"'bout", "about", text)
    text = re.sub(r"'til", "until", text)
    
    # Remove punctuation
    translator = str.maketrans("", "", string.punctuation)
    text = text.translate(translator)
    
    # Remove mutiple spaces
    text = " ".join(text.split())
    
    return text

In [14]:
print(clean_text("No, no, it's my fault -- we didn't have a proper introduction ---"))

no no it is my fault we did not have a proper introduction


In [15]:
def clean_question_answer(questions, answers):
    _questions = []
    for question in questions:
        _questions.append(clean_text(question))
    _answers = []
    for answer in answers:
        _answers.append(clean_text(answer))
    return _questions, _answers

In [16]:
questions, answers = clean_question_answer(questions, answers)

In [17]:
print(len(questions), len(answers))
for i in range(5):
    print("Q: %s" % questions[i])
    print("A: %s" % answers[i])
    print("---")

221616 221616
Q: can we make this quick roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad again
A: well i thought we would start with pronunciation if that is okay with you
---
Q: well i thought we would start with pronunciation if that is okay with you
A: not the hacking and gagging and spitting part please
---
Q: not the hacking and gagging and spitting part please
A: okay then how about we try out some french cuisine saturday night
---
Q: you are asking me out that is so cute that is your name again
A: forget it
---
Q: no no it is my fault we did not have a proper introduction
A: cameron
---


Then, we only keep the questions and the answers that has length between 2 and 20. Because, the RNN model we use later only works on a fixed length sequence and 20 is a reasonable size.

In [18]:
def filter_question_answer(questions, answers,
                           min_line_length, max_line_length):
    
    _questions, _answers = [], []

    for i in range(len(questions)):
        question, answer = questions[i], answers[i]
        q_size, a_size = len(question.split()), len(answer.split())
        if ((q_size >= min_line_length and q_size <= max_line_length) and
            (a_size >= min_line_length and a_size <= max_line_length)):
            _questions.append(question)
            _answers.append(answer)

    return _questions, _answers

In [19]:
MIN_LINE_LENGTH, MAX_LINE_LENGTH = 2, 20
questions, answers = filter_question_answer(
    questions, answers, MIN_LINE_LENGTH, MAX_LINE_LENGTH)

In [20]:
lengths = []
for question in questions:
    lengths.append(len(question.split()))
for answer in answers:
    lengths.append(len(answer.split()))

# Create a dataframe so that the values can be inspected
lengths = pd.DataFrame(lengths, columns=['counts'])
lengths.describe()

Unnamed: 0,counts
count,276692.0
mean,7.937989
std,4.7467
min,2.0
25%,4.0
50%,7.0
75%,11.0
max,20.0


In [21]:
len(questions), len(answers)

(138346, 138346)

We calculates the word frequence of both question and answer sentences.

In [22]:
def get_word_freq(questions, answers):
    word_freq = {}
    for question in questions:
        for word in question.split():
            word_freq[word] = word_freq.get(word, 0) + 1
    for answer in answers:
        for word in answer.split():
            word_freq[word] = word_freq.get(word, 0) + 1
    return word_freq

In [23]:
word_freq = get_word_freq(questions, answers)
print(len(word_freq))

43438


We create two maps the converts a word to its word ID (wid), one is for words in the question sentences and another one is for the words in answer sentences. These two maps are importance in later phases since we always need this information to convert a word to its wid.

In [24]:
WORD_COUNT_THRESHOLD = 10
def get_word_id(questions, answers, word_freq):
    
    question_wid_map, answer_wid_map = {}, {}
    
    # Questions
    wid = 0
    for word, count in word_freq.items():
        if count < WORD_COUNT_THRESHOLD:
            continue
        question_wid_map[word] = wid
        wid += 1
    
    # Answers
    wid = 0
    for word, count in word_freq.items():
        if count < WORD_COUNT_THRESHOLD:
            continue
        answer_wid_map[word] = wid
        wid += 1
    
    return question_wid_map, answer_wid_map

In [25]:
question_wid_map, answer_wid_map = get_word_id(questions, answers, word_freq)

In [26]:
len(question_wid_map), len(answer_wid_map)

(8072, 8072)

### Tokenize 

We assign wid for the four special tokens. They are:

- `<PAD>`: Padding words we used for the shorter input because that it's expected all inputs should have the same length
- `<EOS>`: This token tells the decoder where a sentences ends.
- `<UNK>`: Words that we ignore will be replaced by this token.
- `<GO>`: This token tells the decoder when to start generating the output.

In [27]:
def add_special_tokens(question_wid_map, answer_wid_map):
    tokens = ["<PAD>", "<EOS>", "<UNK>", "<GO>"]
    for token in tokens:
        question_wid_map[token] = len(question_wid_map) + 1
    for token in tokens:
        answer_wid_map[token] = len(answer_wid_map) + 1
    return question_wid_map, answer_wid_map

In [28]:
question_wid_map, answer_wid_map = add_special_tokens(question_wid_map, answer_wid_map)

In [29]:
len(question_wid_map), len(answer_wid_map)

(8076, 8076)

We also need the reverse maps of the above maps since it's helpful in reading the answer, which is the case that we need to convert an ID back to its original word.

In [30]:
def get_reverse_wid_map(question_wid_map, answer_wid_map):
    question_wid_map_r, answer_wid_map_r = {}, {}
    for k, v in question_wid_map.items():
        answer_wid_map_r[v] = k
    for k, v in answer_wid_map.items():
        question_wid_map_r[v] = k
    return question_wid_map_r, answer_wid_map_r

In [31]:
question_wid_map_r, answer_wid_map_r = get_reverse_wid_map(question_wid_map, answer_wid_map)

Add `<EOS>` at the end of each answer.

In [32]:
for i in range(len(answers)):
    answers[i] += " <EOS>"

Build two objects, `question_wids` and `answer_wids`, to store the questions and answers represented in `wid`. Words that are ignore or unrecognized are replaced by `<UNK>` token. These two objects are important and will be used as the processed input dataset of the model.

In [33]:
def convert_to_wid(questions, answers, question_wid_map, answer_wid_map):
    
    question_wids = []
    for question in questions:
        wids = []
        for word in question.split():
            wids.append(question_wid_map[word]
                        if word in question_wid_map
                        else question_wid_map["<UNK>"])
        question_wids.append(wids)
    
    answer_wids = []
    for answer in answers:
        wids = []
        for word in answer.split():
            wids.append(answer_wid_map[word]
                        if word in answer_wid_map
                        else question_wid_map["<UNK>"])
        answer_wids.append(wids)
            
    return question_wids, answer_wids

In [34]:
# Tokenized question and answer data
question_wids, answer_wids = convert_to_wid(
    questions, answers, question_wid_map, answer_wid_map)

In [35]:
len(question_wids), len(answer_wids)

(138346, 138346)

### Sort

Sorting the questions and answers by the length of questions will speed up the padding procedure while training. We apply bucket sort here.

In [36]:
def sort_question_answer_wid(question_wid, answer_wid):
    # Bucket sort
    sorted_q, sorted_a = [], []
    for length in range(1, MAX_LINE_LENGTH + 1):
        for k, v in enumerate(question_wid):
            if len(v) == length:
                sorted_q.append(question_wid[k])
                sorted_a.append(answer_wid[k])
    return sorted_q, sorted_a

In [37]:
sorted_q, sorted_a = sort_question_answer_wid(question_wids, answer_wids)

In [38]:
print(len(sorted_q), len(sorted_a))
print(sorted_q[:3])
print(sorted_a[:3])

138346 138346
[[58, 48], [1, 84], [0, 85]]
[[1, 59, 59, 59, 60, 61, 62, 1, 63, 12, 64, 36, 65, 66, 8074], [11, 193, 175, 55, 61, 21, 6, 20, 160, 11, 8074], [154, 8, 9, 166, 11, 467, 55, 272, 8074]]


## Build Model

"**Tensorflow** is an open source software library for numberical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them."

We are now starting to introduce tensorflow library into our task. We will first setup the tensors needed for later calcuation, build the seq2seq model, then train and test the model.

### Setup session

We first initiailze a tensorflow session. Remember that, if you are running the later code multiple time, you should always run this `get_ft_session` again because it resets the graph. Otherwise, you will be given an error saying the variables are redefined.

In [39]:
def get_tf_session():
    tf.reset_default_graph()
    return tf.InteractiveSession()

In [40]:
session = get_tf_session()

### Setup tensors

Here we setup the tensors for holding the input data and run-time parameters.

In [41]:
def get_input_tensors():
    ids = tf.placeholder(tf.int32, [None, None], name="ids")
    targets = tf.placeholder(tf.int32, [None, None], name="targets")
    lr = tf.placeholder(tf.float32, name="learning_rate")
    keep_prob = tf.placeholder(tf.float32, name="keep_prob")
    return ids, targets, lr, keep_prob

In [42]:
ids, targets, lr, keep_prob = get_input_tensors()

In [43]:
print(ids)
print(targets)
print(lr)
print(keep_prob)

Tensor("ids:0", shape=(?, ?), dtype=int32)
Tensor("targets:0", shape=(?, ?), dtype=int32)
Tensor("learning_rate:0", dtype=float32)
Tensor("keep_prob:0", dtype=float32)


In [44]:
def get_parameter_tensors(ids):
    return (
        tf.placeholder_with_default(
            MAX_LINE_LENGTH, None, name="sequence_length"
        ),
        tf.shape(ids)
    )

In [45]:
sequence_length, input_shape = get_parameter_tensors(ids)

In [46]:
print(sequence_length)
print(input_shape)

Tensor("sequence_length:0", dtype=int32)
Tensor("Shape:0", shape=(2,), dtype=int32)


### Setup hyperparameters

There are few parameters needed to be manually set instead of being trained, which is called the hyperparameters. Feel free to change the value and see what changes in the output.

In [47]:
EPOCHS = 100
BATCH_SIZE = 128
RNN_SIZE = 512
NUM_LAYERS = 2
ENCODING_EMBEDDING_SIZE = 512
DECODING_EMBEDDING_SIZE = 512
LEARNING_RATE = 0.005
LEARNING_RATE_DECAY = 0.9
MIN_LEARNING_RATE = 0.0001
KEEP_PROBABILITY = 0.75

### Setup model

Now it comes to the most difficult part of this tutorial, building the seq2seq model. The `get_seq2seq_model` builds the seq2seq model using several helper functions. It returns the training and inference logits.

In [48]:
def get_seq2seq_model(
    ids, targets, keep_prob,
    batch_size, sequence_length,
    answer_wid_size, question_wid_size,
    enc_embedding_size, dec_embedding_size,
    rnn_size, num_layers, question_wid_map):
    
    # Encode: embed sequence layer
    enc_embed_input = tf.contrib.layers.embed_sequence(
        ids, answer_wid_size+1, enc_embedding_size,
        initializer = tf.random_uniform_initializer(0, 1)
    )
    
    # Encode: RNN layer of the embed input
    enc_state = get_encoding_layer(
        enc_embed_input, rnn_size, num_layers, keep_prob, sequence_length
    )
    
    # Decode: wrap the raw input
    dec_input = get_encoding_input(targets, question_wid_map, batch_size)
    
    # Decode: 
    dec_embeddings = tf.Variable(
        tf.random_uniform([question_wid_size+1, dec_embedding_size], 0, 1)
    )
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)
    
    # Gets the decoding layer
    train_logits, inference_logits = get_decoding_layer(
        dec_embed_input, dec_embeddings, enc_state,
        question_wid_size, sequence_length, rnn_size, num_layers,
        question_wid_map, keep_prob, batch_size
    )
    
    return train_logits, inference_logits

`get_encoding_layer` is a helper function that returns a bidirection dynamic rnn object by wrapping the input parameters.

In [49]:
def get_encoding_layer(rnn_inputs, rnn_size, num_layers,
                       keep_prob, sequence_length):
    lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
    drop = tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob = keep_prob)
    enc_cell = tf.contrib.rnn.MultiRNNCell([drop] * num_layers)
    _, enc_state = tf.nn.bidirectional_dynamic_rnn(
        cell_fw = enc_cell,
        cell_bw = enc_cell,
        sequence_length = sequence_length,
        inputs = rnn_inputs,
        dtype=tf.float32
    )
    return enc_state

Remove the last word of each batch and concat with the <GO> token to the begining of each batch.

In [50]:
def get_encoding_input(targets, question_wid_map, batch_size):
    ending = tf.strided_slice(targets, [0, 0], [batch_size, -1], [1, 1])
    dec_input = tf.concat(
        [tf.fill([batch_size, 1], question_wid_map["<GO>"]), ending], 1)
    return dec_input

Create the decoding cells and returns the training and inference decoding layers.

In [51]:
def get_decoding_layer(dec_embed_input, dec_embeddings,
                       encoder_state, question_wid_size,
                       sequence_length, rnn_size,
                       num_layers, word_map, keep_prob, batch_size):
    
    with tf.variable_scope("decoding") as decoding_scope:
                
        lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
        drop = tf.contrib.rnn.DropoutWrapper(
            lstm, input_keep_prob=keep_prob)
        dec_cell = tf.contrib.rnn.MultiRNNCell([drop] * num_layers)
        
        weights = tf.truncated_normal_initializer(stddev=0.1)
        biases = tf.zeros_initializer()
        
        output_fn = lambda x: tf.contrib.layers.fully_connected(
            x, question_wid_size, None, scope=decoding_scope,
            weights_initializer=weights,
            biases_initializer=biases
        )
        
        train_logits = get_decoding_layer_train(
            encoder_state, dec_cell, dec_embed_input,
            sequence_length, decoding_scope, output_fn,
            keep_prob, batch_size
        )
        
        decoding_scope.reuse_variables()
        inference_logits = get_decoding_layer_infer(
            encoder_state, dec_cell, dec_embeddings,
            word_map["<GO>"], word_map["<EOS>"],
            sequence_length-1, question_wid_size,
            decoding_scope, output_fn, keep_prob, batch_size
        )
        
        return train_logits, inference_logits

Gets the decode function of the training data.

In [52]:
def get_decoding_layer_train(
    encoder_state, dec_cell, dec_embed_input,
    sequence_length, decoding_scope, output_fn, keep_prob, batch_size):
    
    
    attention_states = tf.zeros([batch_size, 1, dec_cell.output_size])
    
    att_keys, att_vals, att_score_fn, att_construct_fn = \
            tf.contrib.seq2seq.prepare_attention(
                attention_states,
                attention_option="bahdanau",
                num_units=dec_cell.output_size
            )
    
    train_decoder_fn = tf.contrib.seq2seq.attention_decoder_fn_train(
        encoder_state[0],
        att_keys,
        att_vals,
        att_score_fn,
        att_construct_fn,
        name = "attn_dec_train")
    
    train_pred, _, _ = tf.contrib.seq2seq.dynamic_rnn_decoder(
        dec_cell, 
        train_decoder_fn, 
        dec_embed_input, 
        sequence_length, 
        scope=decoding_scope)
    
    train_pred_drop = tf.nn.dropout(train_pred, keep_prob)
    return output_fn(train_pred_drop)

Gets the decode logits of the prediction data.

In [53]:
def get_decoding_layer_infer(
    encoder_state, dec_cell, dec_embeddings,
    start_of_sequence_id, end_of_sequence_id,
    maximum_length, vocab_size, decoding_scope,
    output_fn, keep_prob, batch_size):

    
    attention_states = tf.zeros([batch_size, 1, dec_cell.output_size])
    
    att_keys, att_vals, att_score_fn, att_construct_fn = \
            tf.contrib.seq2seq.prepare_attention(
                attention_states,
                attention_option="bahdanau",
                num_units=dec_cell.output_size)
    
    infer_decoder_fn = tf.contrib.seq2seq.attention_decoder_fn_inference(
        output_fn, 
        encoder_state[0], 
        att_keys, 
        att_vals, 
        att_score_fn, 
        att_construct_fn, 
        dec_embeddings,
        start_of_sequence_id, 
        end_of_sequence_id, 
        maximum_length, 
        vocab_size, 
        name = "attn_dec_inf")
    
    inference_logits, _, _ = tf.contrib.seq2seq.dynamic_rnn_decoder(
        dec_cell, 
        infer_decoder_fn, 
        scope=decoding_scope)
    
    return inference_logits

Finally, we use the top wrapper function to get a seq2seq model.

In [54]:
train_logits, inference_logits = get_seq2seq_model(
    tf.reverse(ids, [-1]), targets, keep_prob,
    BATCH_SIZE, sequence_length,
    len(answer_wids), len(question_wids),
    ENCODING_EMBEDDING_SIZE, DECODING_EMBEDDING_SIZE,
    RNN_SIZE, NUM_LAYERS, question_wid_map)

NOTE: be careful on reuse

## Train

### Setup training optimizer

After having the seq2seq model ready, we setup an optimizer object that optimize the seq2seq model. In the same time, we also retrieve the cost tensor object which presents the cost of the current trained model.

In [55]:
def get_train_optimizer(train_logits, targets,
                        learning_rate, sequence_length):

    tf.identity(inference_logits, "logits")
    
    with tf.name_scope("optimization"):
        
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(
            train_logits,
            targets,
            tf.ones([input_shape[0], sequence_length])
        )
        
        # Optimizer
        optimizer = tf.train.AdamOptimizer(learning_rate)
        
        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [
            (tf.clip_by_value(grad, -5., 5.), var)
            for grad, var in gradients if grad is not None
        ]
        train_op = optimizer.apply_gradients(capped_gradients)
    
    return train_op, cost

In [56]:
train_optimizer, cost = get_train_optimizer(
    train_logits, targets,
    LEARNING_RATE, sequence_length)

In [57]:
print(train_optimizer, cost)

name: "optimization/Adam"
op: "NoOp"
input: "^optimization/Adam/update_EmbedSequence/embeddings/ApplyAdam"
input: "^optimization/Adam/update_bidirectional_rnn/fw/multi_rnn_cell/cell_0/basic_lstm_cell/weights/ApplyAdam"
input: "^optimization/Adam/update_bidirectional_rnn/fw/multi_rnn_cell/cell_0/basic_lstm_cell/biases/ApplyAdam"
input: "^optimization/Adam/update_bidirectional_rnn/fw/multi_rnn_cell/cell_1/basic_lstm_cell/weights/ApplyAdam"
input: "^optimization/Adam/update_bidirectional_rnn/fw/multi_rnn_cell/cell_1/basic_lstm_cell/biases/ApplyAdam"
input: "^optimization/Adam/update_Variable/ApplyAdam"
input: "^optimization/Adam/update_decoding/attention_keys/weights/ApplyAdam"
input: "^optimization/Adam/update_decoding/attention_score/attnW/ApplyAdam"
input: "^optimization/Adam/update_decoding/attention_score/attnV/ApplyAdam"
input: "^optimization/Adam/update_decoding/multi_rnn_cell/cell_0/basic_lstm_cell/weights/ApplyAdam"
input: "^optimization/Adam/update_decoding/multi_rnn_cell/cell_0

### Split data

We take 85% of the original dataset as training data and 15% as validation data. This enables us to know when to "stop" training (we should stop training if the accuracy on test data starts to decrease).

In [58]:
train_valid_pivot = int(len(sorted_q) * 0.15)

train_questions = sorted_q[train_valid_pivot:]
train_answers = sorted_a[train_valid_pivot:]

valid_questions = sorted_q[:train_valid_pivot]
valid_answers = sorted_a[:train_valid_pivot]

In [59]:
print(len(train_questions), len(valid_questions))

117595 20751


### Run Train

We start to train our defined model now. The training function here is designed for running on a normal laptop instead of a powerful machine that uses GPU or TPU. That is, we don't expect to come to a point that we see accuracy on the validation data starts to decrease because that will take forever if you run this training on a normal machine. Instead, I added a `loss_stop` variable, which is the threshold you set manually based on the capacity of your machine, the training will stop when it found the current loss value if lower than `lost_stop`. In my case, it took me around 1.5 hours to get a model with loss < 2.0.

In [60]:
def train(
    session, cost, epochs, train_questions, train_answers,
    batch_size, learning_rate, keep_probability, loss_stop,
    question_wid_map, answer_wid_map):
    
    # Epoch iteration
    for epoch_i in range(1, epochs + 1):
        
        # Batch iteration
        for batch_i, (questions_batch, answers_batch) in enumerate(
            get_batch(train_questions, train_answers, batch_size,
                      question_wid_map, answer_wid_map)):
            
            # Marks starting time
            start_time = time.time()
            
            # Runs train optimizer
            _, loss = session.run(
                [train_optimizer, cost],
                {
                    ids: questions_batch,
                    targets: answers_batch,
                    lr: learning_rate,
                    sequence_length: answers_batch.shape[1],
                    keep_prob: keep_probability
                }
            )
            
            # Logs ending metrics
            end_time = time.time()
            
            # Prints some logs
            print("Epoch %d/%d Batch %d/%d Loss: %.3f Seconds: %.2f" % (
                epoch_i, epochs, batch_i,
                len(train_questions) // batch_size,
                loss, end_time - start_time
            ))

            if loss < loss_stop:
                return

In [61]:
def get_batch(questions, answers, batch_size,
              question_wid_map, answer_wid_map):
    
    for batch_i in range(0, len(questions) // batch_size):
        
        # Finds starting index of this batch
        start_i = batch_i * batch_size
        
        # Gets the batch of questions and answers
        questions_batch = questions[start_i:start_i + batch_size]
        answers_batch = answers[start_i:start_i + batch_size]
        
        # Pads and parses the questions, answers into numpy arrays
        padded_questions_batch = np.array(
            get_padded_batch(questions_batch, question_wid_map)
        )
        padded_answers_batch = np.array(
            get_padded_batch(answers_batch, answer_wid_map)
        )
        
        # Generates the batch into final output
        # https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do
        yield padded_questions_batch, padded_answers_batch

In [62]:
def get_padded_batch(sentence_batch, word_to_int):
    max_sentence = max(
        [len(sentence) for sentence in sentence_batch]
    )
    return [
        sentence +
        [word_to_int["<PAD>"]] * (max_sentence - len(sentence))
        for sentence in sentence_batch
    ]

In [63]:
session.run(tf.global_variables_initializer())
train(
    session, cost, EPOCHS, train_questions, train_answers,
    BATCH_SIZE, LEARNING_RATE, KEEP_PROBABILITY, 2.0,
    question_wid_map, answer_wid_map
)

Epoch 1/100 Batch 0/918 Loss: 11.678 Seconds: 45.31
Epoch 1/100 Batch 1/918 Loss: 17.817 Seconds: 46.16
Epoch 1/100 Batch 2/918 Loss: 36.943 Seconds: 47.98
Epoch 1/100 Batch 3/918 Loss: 7.723 Seconds: 42.30
Epoch 1/100 Batch 4/918 Loss: 9.962 Seconds: 41.90
Epoch 1/100 Batch 5/918 Loss: 5.850 Seconds: 43.69
Epoch 1/100 Batch 6/918 Loss: 4.922 Seconds: 38.39
Epoch 1/100 Batch 7/918 Loss: 5.748 Seconds: 40.18
Epoch 1/100 Batch 8/918 Loss: 5.312 Seconds: 43.63
Epoch 1/100 Batch 9/918 Loss: 4.563 Seconds: 39.41
Epoch 1/100 Batch 10/918 Loss: 3.700 Seconds: 39.61
Epoch 1/100 Batch 11/918 Loss: 4.479 Seconds: 40.46
Epoch 1/100 Batch 12/918 Loss: 4.340 Seconds: 40.09
Epoch 1/100 Batch 13/918 Loss: 3.600 Seconds: 40.22
Epoch 1/100 Batch 14/918 Loss: 3.583 Seconds: 40.17
Epoch 1/100 Batch 15/918 Loss: 3.290 Seconds: 41.80
Epoch 1/100 Batch 16/918 Loss: 3.469 Seconds: 39.47
Epoch 1/100 Batch 17/918 Loss: 3.365 Seconds: 40.70
Epoch 1/100 Batch 18/918 Loss: 3.322 Seconds: 40.79
Epoch 1/100 Batch 1

## Let's Chat!

Finally, we can use the trained `answer_logits` to answer my questions. I built a function called `ask` which only takes one parameter, the question string.

In [64]:
def ask(question, session=session, question_wid_map=question_wid_map,
        answer_wid_map_r=answer_wid_map_r,
        inference_logits=inference_logits, batch_size=BATCH_SIZE):
    
    question_seq = question_to_seq(question, question_wid_map)
    question_seq += [question_wid_map["<PAD>"]] * (MAX_LINE_LENGTH - len(question_seq))
    batch_shell = np.zeros((batch_size, MAX_LINE_LENGTH))
    batch_shell[0] = question_seq
    
    # Run the model
    answer_logits = session.run(
        inference_logits, {
            ids: batch_shell,
            keep_prob: 1.0
        }
    )[0]

    pad = answer_wid_map["<PAD>"]
    return " ".join([answer_wid_map_r[i]
                     for i in np.argmax(answer_logits, 1)
                     if i != pad])

In [65]:
def question_to_seq(question, word_to_int):
    question = clean_text(question)
    return [
        word_to_int.get(word, word_to_int["<UNK>"])
        for word in question.split()
    ]

In [67]:
print(ask("how are you"))

i am <UNK> <EOS>


By increasing the size of the dataset set, and our computation power. We can definitely build a chatbot that appeared in [Black Mirror](https://en.wikipedia.org/wiki/Be_Right_Back) soon.

## Reference

- [Cornell Movie Dialogs Corpus](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)
- [A ten-minute introduction to sequence-to-sequence learning in Keras](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html)
- [Sequence to sequence model: Introduction and concepts](https://towardsdatascience.com/sequence-to-sequence-model-introduction-and-concepts-44d9b41cd42d)
- [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Tensorflow](https://www.tensorflow.org/)