# Building a Chatbot

In this project, we will build a chatbot using scripts from the television shows South Park and The Simpsons. The main features of our model are LSTM cells, a bidirectional dynamic RNN, and an attention cell wrapper.

The data for our chatbot come from two Kaggle datasets: [South Park](https://www.kaggle.com/tovarischsukhov/southparklines) and [The Simpsons](https://www.kaggle.com/wcukierski/the-simpsons-by-the-data). Much more data could be added to this model, but I thought it would be interesting to use similar types of data in the hope of creating a *personality*, in this case some sort of cartoon personality.

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
import re
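# Note: the tf.placeholder and tf.contrib APIs used below belong to TensorFlow 1.x;
# tf.contrib was removed in TensorFlow 2.0, so check the version before running.
tf.__version__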

In [None]:
# Load the data
# I'm only going to use a small portion of the data so that it loads quickly on Kaggle.
southpark = pd.read_csv("../input/All-seasons.csv")[:100] 
#simpsons = pd.read_csv("simpsons.csv") # can't load on Kaggle, but you can do it on your own pc.

In [None]:
southpark.head()

In [None]:
print("South Park lines:")
for i in range(0,5):
    print("Line #",i+1)
    print(southpark.Line[i])

In [None]:
def clean_text(text):
    '''Clean text by removing unnecessary characters and altering the format of words.'''

    text = text.lower()
    
    text = re.sub(r"\n", "",  text)
    text = re.sub(r"[-()]", "", text)
    text = re.sub(r"\.", " .", text)
    text = re.sub(r"\!", " !", text)
    text = re.sub(r"\?", " ?", text)
    text = re.sub(r"\,", " ,", text)
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"it's", "it is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"n'", "ng", text)
    # Collapse elongated interjections ("ohhh", "ahhhh") in a single pass each.
    # Replacing "ohh" first would leave a stray "h" behind for longer runs.
    text = re.sub(r"oh{2,}", "oh", text)
    text = re.sub(r"ah{2,}", "ah", text)
    
    return text

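The order of the substitutions in `clean_text` matters: for example, "won't" has to be handled before the generic "n't" rule. One way to make that order explicit is to keep the rules in an ordered table. This is only a minimal sketch with a reduced rule set; `RULES` and `clean_text_sketch` are names I made up for illustration and are not used elsewhere in the notebook.

```python
import re

# Ordered (pattern, replacement) pairs; specific rules come before general ones.
# This is a reduced, illustrative subset of the rules in clean_text.
RULES = [
    (r"[-()\n]", ""),          # drop hyphens, parentheses and newlines
    (r"([.!?,])", r" \1"),     # put a space before punctuation
    (r"\bi'm\b", "i am"),
    (r"\bwhat's\b", "what is"),
    (r"\bwon't\b", "will not"),
    (r"\bcan't\b", "cannot"),
    (r"n't\b", " not"),        # generic rule, must come after won't / can't
    (r"oh{2,}", "oh"),         # collapse "ohh", "ohhh", ...
]

def clean_text_sketch(text):
    """Lowercase the text and apply each rule in order."""
    text = text.lower()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

print(clean_text_sketch("Ohhhh, what's that? I won't go!"))
```

Adding or reordering a rule then only means editing the table, which makes it easier to spot rules that shadow each other.
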
In [None]:
# Clean the scripts and add them to the same list.
text = []

for line in southpark.Line:
    text.append(clean_text(line))

In [None]:
# Take a look at some of the text to ensure that it has been cleaned well.
limit = 0
for i in range(limit,limit+20):
    print(text[i])

In [None]:
# Find the length of lines
lengths = []
for line in text:
    lengths.append(len(line.split()))

# Create a dataframe so that the values can be inspected
lengths = pd.DataFrame(lengths, columns=['counts'])

In [None]:
lengths.describe()

In [None]:
print(np.percentile(lengths, 80))
print(np.percentile(lengths, 85))
print(np.percentile(lengths, 90))
print(np.percentile(lengths, 95))
print(np.percentile(lengths, 99))

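The percentiles above are what guide the choice of a maximum line length. Here is a small sketch of picking a cutoff from a chosen percentile in plain Python. It uses the nearest-rank definition, whereas `np.percentile` interpolates linearly by default, so the two can differ slightly. The sample lengths and the helper name are hypothetical.

```python
import math

def nearest_rank_percentile(values, pct):
    """Return the smallest value such that at least pct% of values are <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical word counts per line.
sample_lengths = [3, 5, 7, 8, 9, 10, 12, 14, 20, 45]

cutoff = nearest_rank_percentile(sample_lengths, 90)
kept = [n for n in sample_lengths if n <= cutoff]
print(cutoff, len(kept))
```
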
In [None]:
# Limit the text we will use to the shorter 95%.
max_line_length = 30

short_text = []
for line in text:
    if len(line.split()) <= max_line_length:
        short_text.append(line)

In [None]:
# Create a dictionary for the frequency of the vocabulary
vocab = {}
for line in short_text:
    for word in line.split():
        if word not in vocab:
            vocab[word] = 1
        else:
            vocab[word] += 1

In [None]:
# Limit the vocabulary to words used at least 3 times.
threshold = 3
count = 0
for k,v in vocab.items():
    if v >= threshold:
        count += 1

In [None]:
print("Size of total vocab:", len(vocab))
print("Size of vocab we will use:", count)

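The counting and thresholding steps above can also be written with `collections.Counter` from the standard library. This is just a sketch on a few made-up lines standing in for `short_text`; the variable names are illustrative.

```python
from collections import Counter

# Hypothetical cleaned lines standing in for short_text.
lines = ["oh my god", "oh no", "my god , no", "oh"]

# Count every whitespace-separated token.
word_counts = Counter(word for line in lines for word in line.split())

# Keep words that appear at least `threshold` times.
threshold = 2
kept_words = {word for word, n in word_counts.items() if n >= threshold}

print(word_counts.most_common(1))
print(sorted(kept_words))
```
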
In [None]:
# In case we want to use different vocabulary sizes for the source and target text, 
# we can set different threshold values.
# Nonetheless, we will create dictionaries to provide a unique integer for each word.
source_vocab_to_int = {}

word_num = 0
for k,v in vocab.items():
    if v >= threshold:
        source_vocab_to_int[k] = word_num
        word_num += 1
        
target_vocab_to_int = {}

word_num = 0
for k,v in vocab.items():
    if v >= threshold:
        target_vocab_to_int[k] = word_num
        word_num += 1

In [None]:
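# Add the special tokens to the vocabulary dictionaries.
# Note: len(...)+1 leaves one unused id between the words and the special tokens;
# the model accounts for this later by sizing its embeddings with vocab_size+1.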
codes = ['<PAD>','<EOS>','<UNK>','<GO>']

for code in codes:
    source_vocab_to_int[code] = len(source_vocab_to_int)+1
    
for code in codes:
    target_vocab_to_int[code] = len(target_vocab_to_int)+1

In [None]:
# Create dictionaries to map the unique integers to their respective words.
# i.e. an inverse dictionary for vocab_to_int.
source_int_to_vocab = {v_i: v for v, v_i in source_vocab_to_int.items()}
target_int_to_vocab = {v_i: v for v, v_i in target_vocab_to_int.items()}

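An alternative way to build these dictionaries, which is not what this notebook does, is to reserve the special tokens first so that all ids are contiguous. The sketch below also shows the `<UNK>` fallback when encoding a line. `build_vocab` and `encode` are hypothetical helper names.

```python
SPECIAL_TOKENS = ['<PAD>', '<EOS>', '<UNK>', '<GO>']

def build_vocab(words):
    """Assign contiguous ids: special tokens first, then words in sorted order."""
    vocab_to_int = {token: i for i, token in enumerate(SPECIAL_TOKENS)}
    for word in sorted(set(words)):
        vocab_to_int[word] = len(vocab_to_int)
    int_to_vocab = {i: word for word, i in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab

def encode(line, vocab_to_int):
    """Map each word to its id, falling back to <UNK> for unseen words."""
    unk = vocab_to_int['<UNK>']
    return [vocab_to_int.get(word, unk) for word in line.split()]

vocab_to_int, int_to_vocab = build_vocab(["oh", "my", "god"])
ids = encode("oh my kenny", vocab_to_int)
print(ids)
print([int_to_vocab[i] for i in ids])
```

With contiguous ids starting at 0, `<PAD>` is always id 0 and the embedding table needs exactly `len(vocab_to_int)` rows.
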
In [None]:
# Check the length of the dictionaries.
print(len(source_vocab_to_int))
print(len(source_int_to_vocab))
print(len(target_vocab_to_int))
print(len(target_int_to_vocab))

In [None]:
# Create the source and target texts.
# The target text is the line following the source text.
source_text = short_text[:-1]
target_text = short_text[1:]

for i in range(len(target_text)):
    target_text[i] += ' <EOS>'

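To make the pairing concrete, here is a tiny sketch of how consecutive lines become (source, target) pairs. The lines are made up. Keep in mind that treating each line as a reply to the previous one is an approximation, since neighbouring lines in a script may not always belong to the same exchange.

```python
# Hypothetical cleaned lines in script order.
lines = ["hello there", "hi", "how are you ?", "fine"]

# Each line is the source for the line that follows it.
source = lines[:-1]
target = [line + ' <EOS>' for line in lines[1:]]

for src, tgt in zip(source, target):
    print(repr(src), '->', repr(tgt))
```
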
In [None]:
# Check if the source and target text lengths match.
print(len(source_text))
print(len(target_text))

In [None]:
# Convert the text to integers. 
# Replace any words that are not in the respective vocabulary with <UNK> (unknown)
source_int = []
for line in source_text:
    sentence = []
    for word in line.split():
        if word not in source_vocab_to_int:
            sentence.append(source_vocab_to_int['<UNK>'])
        else:
            sentence.append(source_vocab_to_int[word])
    source_int.append(sentence)
    
target_int = []
for line in target_text:
    sentence = []
    for word in line.split():
        if word not in target_vocab_to_int:
            sentence.append(target_vocab_to_int['<UNK>'])
        else:
            sentence.append(target_vocab_to_int[word])
    target_int.append(sentence)

In [None]:
# Check the lengths
print(len(source_int))
print(len(target_int))

In [None]:
def model_inputs():
    '''Create placeholders for inputs to the model'''
    input_data = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    lr = tf.placeholder(tf.float32, name='learning_rate')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

    return input_data, targets, lr, keep_prob

In [None]:
def process_encoding_input(target_data, vocab_to_int, batch_size):
    '''Remove the last word id from each line in the batch and concat <GO> to the beginning of each line'''
    ending = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    dec_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)

    return dec_input

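The slice-and-concat above can be illustrated on plain Python lists: each target row loses its last id and gains a `<GO>` id at the front, so the decoder input is the target shifted right by one step. The ids below are made up, and `make_decoder_input` is a hypothetical name for this sketch.

```python
GO_ID = 3  # hypothetical id for <GO>

def make_decoder_input(target_batch, go_id):
    """Drop the last id of each row and prepend go_id."""
    return [[go_id] + row[:-1] for row in target_batch]

# Two padded target rows; 1 stands in for <EOS>, 0 for <PAD>.
targets = [[7, 8, 1], [9, 1, 0]]
print(make_decoder_input(targets, GO_ID))
```
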
In [None]:
def encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob, sequence_length, attn_length):
    '''Create the encoding layer'''
    lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
    drop = tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob = keep_prob)
    cell = tf.contrib.rnn.AttentionCellWrapper(drop, attn_length, state_is_tuple = True)
    enc_cell = tf.contrib.rnn.MultiRNNCell([cell] * num_layers)
    _, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw = enc_cell,
                                                   cell_bw = enc_cell,
                                                   sequence_length = sequence_length,
                                                   inputs = rnn_inputs, 
                                                   dtype=tf.float32)

    return enc_state

In [None]:
def decoding_layer_train(encoder_state, dec_cell, dec_embed_input, sequence_length, decoding_scope,
                         output_fn, keep_prob):
    '''Decode the training data'''
    train_decoder_fn = tf.contrib.seq2seq.simple_decoder_fn_train(encoder_state)
    train_pred, _, _ = tf.contrib.seq2seq.dynamic_rnn_decoder(
        dec_cell, train_decoder_fn, dec_embed_input, sequence_length, scope=decoding_scope)
    train_pred_drop = tf.nn.dropout(train_pred, keep_prob)
    return output_fn(train_pred_drop)

In [None]:
def decoding_layer_infer(encoder_state, dec_cell, dec_embeddings, start_of_sequence_id, end_of_sequence_id,
                         maximum_length, vocab_size, decoding_scope, output_fn, keep_prob):
    '''Decode the prediction data'''
    infer_decoder_fn = tf.contrib.seq2seq.simple_decoder_fn_inference(
        output_fn, encoder_state, dec_embeddings, start_of_sequence_id, end_of_sequence_id, maximum_length, vocab_size)
    infer_logits, _, _ = tf.contrib.seq2seq.dynamic_rnn_decoder(dec_cell, infer_decoder_fn, scope=decoding_scope)
    return infer_logits

In [None]:
def decoding_layer(dec_embed_input, dec_embeddings, encoder_state, vocab_size, sequence_length, rnn_size,
                   num_layers, vocab_to_int, keep_prob, attn_length):
    '''Create the decoding cell and input the parameters for the training and inference decoding layers'''
    
    with tf.variable_scope("decoding") as decoding_scope:
        lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
        drop = tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob = keep_prob)
        cell = tf.contrib.rnn.AttentionCellWrapper(drop, attn_length, state_is_tuple = True)
        dec_cell = tf.contrib.rnn.MultiRNNCell([cell] * num_layers)
        
        weights = tf.truncated_normal_initializer(stddev = 0.1)
        biases = tf.zeros_initializer()
        output_fn = lambda x: tf.contrib.layers.fully_connected(x, 
                                                                vocab_size, 
                                                                None, 
                                                                scope=decoding_scope,
                                                                weights_initializer = weights,
                                                                biases_initializer = biases)

        train_logits = decoding_layer_train(
            encoder_state[0], dec_cell, dec_embed_input, sequence_length, decoding_scope, output_fn, keep_prob)
        decoding_scope.reuse_variables()
        infer_logits = decoding_layer_infer(encoder_state[0], dec_cell, dec_embeddings, vocab_to_int['<GO>'],
                                            vocab_to_int['<EOS>'], sequence_length, vocab_size,
                                            decoding_scope, output_fn, keep_prob)

    return train_logits, infer_logits

In [None]:
def seq2seq_model(input_data, target_data, keep_prob, batch_size, sequence_length, source_vocab_size, target_vocab_size,
                  enc_embedding_size, dec_embedding_size, rnn_size, num_layers, vocab_to_int, attn_length):
    
    '''Use the previous functions to create the training and inference logits'''
    
    enc_embed_input = tf.contrib.layers.embed_sequence(input_data, source_vocab_size+1, enc_embedding_size)
    enc_state = encoding_layer(enc_embed_input, rnn_size, num_layers, keep_prob, sequence_length, attn_length)

    dec_input = process_encoding_input(target_data, vocab_to_int, batch_size)
    dec_embeddings = tf.Variable(tf.random_uniform([target_vocab_size+1, dec_embedding_size], -1.0, 1.0))
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)

    train_logits, infer_logits = decoding_layer(dec_embed_input, dec_embeddings, enc_state, target_vocab_size+1, 
                                                sequence_length, rnn_size, num_layers, vocab_to_int, keep_prob, 
                                                attn_length)
    
    return train_logits, infer_logits

In [None]:
# Set the parameters
epochs = 100
batch_size = 128
rnn_size = 512
num_layers = 2
encoding_embedding_size = 512
decoding_embedding_size = 512
attn_length = 10
learning_rate = 0.0005
keep_probability = 0.8

In [None]:
train_graph = tf.Graph()
with train_graph.as_default():
    
    # Load the model inputs
    input_data, targets, lr, keep_prob = model_inputs()
    # Sequence length will be the max line length for each batch
    sequence_length = tf.placeholder_with_default(max_line_length, None, name='sequence_length')
    input_shape = tf.shape(input_data)
    
    # Create the logits from the model
    train_logits, inference_logits = seq2seq_model(
        tf.reverse(input_data, [-1]), targets, keep_prob, batch_size, sequence_length, len(source_vocab_to_int), 
        len(target_vocab_to_int), encoding_embedding_size, decoding_embedding_size, rnn_size, num_layers, 
        target_vocab_to_int, attn_length)
    
    # Create a tensor to be used for making predictions.
    tf.identity(inference_logits, 'logits')
    with tf.name_scope("optimization"):
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(
            train_logits,
            targets,
            tf.ones([input_shape[0], sequence_length]))

        # Optimizer (uses the lr placeholder so the decayed rate fed at run time takes effect)
        optimizer = tf.train.AdamOptimizer(lr)

        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)

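Clipping by value, as done above, limits each gradient element independently, which can change the direction of the gradient. Clipping by norm rescales the whole vector instead and keeps its direction. The plain-Python sketch below contrasts the two on a made-up gradient; it is only an illustration and not a replacement for the TensorFlow ops.

```python
import math

def clip_by_value(values, low, high):
    """Clamp every element into [low, high] independently."""
    return [min(max(v, low), high) for v in values]

def clip_by_norm(values, max_norm):
    """Rescale the vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm <= max_norm:
        return list(values)
    return [v * max_norm / norm for v in values]

grad = [3.0, -4.0]  # hypothetical gradient with L2 norm 5
print(clip_by_value(grad, -1.0, 1.0))
print(clip_by_norm(grad, 1.0))
```
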
In [None]:
def pad_sentence_batch(sentence_batch, vocab_to_int):
    """Pad lines with <PAD> so each line of a batch has the same length"""
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [vocab_to_int['<PAD>']] * (max_sentence - len(sentence)) for sentence in sentence_batch]

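A quick sketch of the padding step on a toy batch, using 0 as a made-up `<PAD>` id. `pad_batch` is a hypothetical standalone version of the function above that takes the pad id directly.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence with pad_id to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

padded = pad_batch([[5, 6, 7], [8], [9, 10]], pad_id=0)
print(padded)
```
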
In [None]:
def batch_data(source, target, batch_size):
    """Batch source and target together"""
    for batch_i in range(0, len(source)//batch_size):
        start_i = batch_i * batch_size
        source_batch = source[start_i:start_i + batch_size]
        target_batch = target[start_i:start_i + batch_size]
        yield (np.array(pad_sentence_batch(source_batch, source_vocab_to_int)), 
               np.array(pad_sentence_batch(target_batch, target_vocab_to_int)))

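One thing to be aware of with this style of batching is that only full batches are produced, so leftover examples are dropped, and a dataset smaller than `batch_size` yields no batches at all. As far as I can tell, that means the 100-row sample loaded above with `batch_size = 128` would not run any training steps. A small sketch with a hypothetical `iter_batches` helper:

```python
def iter_batches(items, batch_size):
    """Yield consecutive full batches; leftover items are dropped."""
    for i in range(len(items) // batch_size):
        yield items[i * batch_size:(i + 1) * batch_size]

# Seven items in batches of three: the last item is dropped.
print(list(iter_batches(list(range(7)), 3)))

# Fewer items than the batch size: nothing is yielded.
print(list(iter_batches(list(range(90)), 128)))
```

Loading more rows or lowering the batch size avoids the empty case.
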
In [None]:
train_valid_split = int(len(source_int)*0.1)

train_source = source_int[train_valid_split:]
train_target = target_int[train_valid_split:]

valid_source = source_int[:train_valid_split]
valid_target = target_int[:train_valid_split]

print(len(train_source))
print(len(valid_source))

In [None]:
import time

learning_rate_decay = 0.95
display_step = 50
stop_early = 0
stop = 3
total_train_loss = 0
summary_valid_loss = []


checkpoint = "best_model.ckpt" 

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(1, epochs+1):
        for batch_i, (source_batch, target_batch) in enumerate(
                batch_data(train_source, train_target, batch_size)):
            start_time = time.time()
            _, loss = sess.run(
                [train_op, cost],
                {input_data: source_batch,
                 targets: target_batch,
                 lr: learning_rate,
                 sequence_length: target_batch.shape[1],
                 keep_prob: keep_probability})

            total_train_loss += loss
            end_time = time.time()
            batch_time = end_time - start_time
            
            if batch_i % display_step == 0:
                print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>6.3f}, Seconds: {:>4.2f}'
                      .format(epoch_i,
                              epochs, 
                              batch_i, 
                              len(train_source) // batch_size, 
                              total_train_loss / display_step, 
                              batch_time*display_step))
                total_train_loss = 0

            if batch_i % 235 == 0 and batch_i > 0:
                total_valid_loss = 0
                start_time = time.time()
                for batch_ii, (source_batch, target_batch) in \
                        enumerate(batch_data(valid_source, valid_target, batch_size)):
                    valid_loss = sess.run(
                    cost, {input_data: source_batch,
                           targets: target_batch,
                           lr: learning_rate,
                           sequence_length: target_batch.shape[1],
                           keep_prob: 1})
                    total_valid_loss += valid_loss
                end_time = time.time()
                batch_time = end_time - start_time
                # Average over the number of full batches that batch_data actually yields
                avg_valid_loss = total_valid_loss / max(1, len(valid_source) // batch_size)
                print('Valid Loss: {:>6.3f}, Seconds: {:>5.2f}'.format(avg_valid_loss, batch_time))
                
                learning_rate *= learning_rate_decay
                
                summary_valid_loss.append(avg_valid_loss)
                if avg_valid_loss <= min(summary_valid_loss):
                    print('New Record!') 
                    stop_early = 0
                    saver = tf.train.Saver() 
                    saver.save(sess, checkpoint)
                
                else:
                    print("No Improvement.")
                    stop_early += 1
                    if stop_early == stop:
                        break
        if stop_early == stop:
            print("Stopping Training.")
            break

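The early-stopping logic in the training loop can be separated from TensorFlow and looked at on its own. The sketch below walks through a made-up list of validation losses and reports the check at which training would stop. It counts only strict improvements as a new record, whereas the loop above also accepts ties; the function name and the loss values are hypothetical.

```python
def early_stopping_check(valid_losses, patience):
    """Return the index of the check at which training would stop, or None."""
    best = float('inf')
    bad_checks = 0
    for i, loss in enumerate(valid_losses):
        if loss < best:
            # New record: remember it and reset the counter.
            best = loss
            bad_checks = 0
        else:
            bad_checks += 1
            if bad_checks == patience:
                return i
    return None

losses = [2.9, 2.5, 2.6, 2.4, 2.45, 2.5, 2.6]
print(early_stopping_check(losses, patience=3))
```
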
In [None]:
def sentence_to_seq(sentence, vocab_to_int):
    '''Clean an input sentence and convert it to word ids for the model'''
    
    sentence = clean_text(sentence)
    return [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in sentence.split()]

In [None]:
# This part of the project won't work on Kaggle since it requires loading checkpoints of the model

# To create your own input sentence
#input_sentence = 'Oh my God they killed Kenny!'

# To use a sentence from the data
#random = np.random.choice(len(short_text))
#input_sentence = short_text[random]

# Clean the input sentence before it is used in the model
#input_sentence = sentence_to_seq(input_sentence, source_vocab_to_int)

#checkpoint = "./" + checkpoint 

#loaded_graph = tf.Graph()
#with tf.Session(graph=loaded_graph) as sess:
#    # Load the saved model
#    loader = tf.train.import_meta_graph(checkpoint + '.meta')
#    loader.restore(sess, checkpoint)
    
#    # Load the tensors to be used as inputs
#    input_data = loaded_graph.get_tensor_by_name('input:0')
#    logits = loaded_graph.get_tensor_by_name('logits:0')
#    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
    
#    response_logits = sess.run(logits, {input_data: [input_sentence],keep_prob: 1.0})[0]

#print('Input')
#print('  Word Ids:      {}'.format([i for i in input_sentence]))
#print('  Input Words: {}'.format([source_int_to_vocab[i] for i in input_sentence]))

#print('\nResponse')
#print('  Word Ids:      {}'.format([i for i in np.argmax(response_logits, 1)]))
#print('  Response Words: {}'.format([target_int_to_vocab[i] for i in np.argmax(response_logits, 1)]))

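The last step above takes the argmax of the logits at each time step and maps the ids back to words. Here is a plain-Python sketch of that greedy decoding on made-up logits. Unlike the commented code, it stops at the first `<EOS>`, which is my own addition; `greedy_decode` and the tiny vocabulary are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def greedy_decode(logits, int_to_vocab, eos_token='<EOS>'):
    """Pick the most likely word at each step and stop at the first <EOS>."""
    words = []
    for row in logits:
        word = int_to_vocab[argmax(row)]
        if word == eos_token:
            break
        words.append(word)
    return words

int_to_vocab = {0: '<PAD>', 1: '<EOS>', 2: 'oh', 3: 'no'}

# One row of scores per time step, one column per vocabulary id.
logits = [[0.1, 0.2, 2.0, 0.3],
          [0.0, 0.1, 0.4, 1.5],
          [0.2, 3.0, 0.1, 0.1],
          [5.0, 0.0, 0.0, 0.0]]

print(greedy_decode(logits, int_to_vocab))
```
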
# Summary

I hope that you have found this project interesting and informative. There are many ways to improve and alter this model, which could make a fun project of your own. I encourage you to look at other projects out there and see if you can combine some of their methods to make an even better one! This might give you some ideas: http://web.stanford.edu/class/cs20si/assignments/a3.pdf

One thing that I really like about this model is that it can be applied to many other types of projects, such as language translation or text simplification. Seq2seq models are pretty flexible, so it's cool to see and build a wide variety of projects.

Thanks for reading and best of luck building your models!