# Encoder-Decoder RNN


This is the entire training and evaluation pipeline for our encoder-decoder RNN. Note that in this notebook, utterance=input and response=output. Cells need to be run one at a time in order.

Please go through every line! In particular, try tracing what would happen to line 1 of the data file through the whole notebook.

In [86]:
#Imports
import numpy as np
from numpy import array
from numpy.random import rand
from numpy.random import shuffle
from pickle import load
from pickle import dump
import re
import os, sys, glob

In [None]:
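#Don't run these imports on your local machine!
import tensorflow as tf
#Keras imports
from keras.layers import LSTM, Dense, Activation, Input, Embedding, RepeatVector, TimeDistributed
from keras import optimizers
from keras.models import Sequential, load_model
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical, plot_model
#BLEU score used in the evaluation section
from nltk.translate.bleu_score import corpus_bleu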

## Preprocessing 

Our preprocessing method opens our data file and separates each line into pairs of utterances and responses.
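
As a rough illustration of that split, the sketch below traces a single line by hand. The raw line here is an assumption, reconstructed from the first pair in the printed output of this section (four tab-separated fields: tag, utterance, tag, response); the real file may differ.

```python
# Assumed raw line: tag<TAB>utterance<TAB>tag<TAB>response
raw_line = "fp\tand I'm calling from Garland, Texas.\tb\tYeah,"

tokens = raw_line.split("\t")
# Re-join each speech tag with its text, the same way split_to_pairs does
utterance = tokens[0] + "\t" + tokens[1]
response = tokens[2] + "\t" + tokens[3]

print(len(tokens))            # number of tab-separated fields
print([utterance, response])  # one [utterance, response] pair
```

Each pair keeps the speech tag glued to the front of its text, so the tag ends up being tokenized like any other word later on.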

In [84]:
######################
# Preprocessing Methods
######################

##### Load the raw dataset #####
#This method opens the raw text file, reads its full contents, and closes the file.
def load_data(filename):
    with open(filename, mode="rt") as file:
        return file.read()

##### Split data into utterance-response pairs #####
#This method splits the dataset into lines, and for each line, we build a two-element list.
#The first element is the utterance (A), and the second is the response (B).
#For the utterance and response, the speech tag and the actual text are tab-separated.
#We add each utterance-response pair to a list called pairs.
def split_to_pairs(data):
    lines = data.split("\n")
    pairs = []
    for line in lines:
        tokens = line.split("\t")
        #skip blank or malformed lines (e.g. a trailing newline at the end of the file)
        if len(tokens) < 4:
            continue
        utterance = tokens[0] + "\t" + tokens[1]
        response = tokens[2] + "\t" + tokens[3]
        pairs.append([utterance, response])
    return pairs

##### Clean the data ######
#Optionally, we could make all words lowercase, remove punctuation, etc.
#I'm going to just leave the dataset in its native form and see how it does for now.
#This method essentially just reorganizes the data into a 2D array, where each row holds:
# [utterance, response]
def clean_data(pairs):
    cleaned_data = list()
    for pair in pairs:
        clean_pair = list()
        for utt in pair:
            clean_pair.append(utt)
        cleaned_data.append(clean_pair)
    return array(cleaned_data)

##### Save pairs to file #####
def save_pairs(pairs, new_filename):
    with open(new_filename, "wb") as f:
        dump(pairs, f)
    print("Saved: %s" % new_filename)

In [85]:
#Run preprocessing
filename = "clean_dataset.txt"
data = load_data(filename)
pairs = split_to_pairs(data)
clean_pairs = clean_data(pairs)
save_pairs(clean_pairs, "utt-resp.pkl")

#Check our dataset
#you should see:
# [fp	and I'm calling from Garland, Texas.] => [b	Yeah,], etc.
for i in range(100):
    print('[%s] => [%s]' % (clean_pairs[i,0], clean_pairs[i,1]))

Saved: utt-resp.pkl
[fp	and I'm calling from Garland, Texas.] => [b	Yeah,]
[co^t	so. anyway, let me press one.] => [aa	Okay .]
[sd	and, it was an experience that I won't do again .] => [qw	How big a family do you have?]
[sd	We saw people we hadn't see in a while] => [qy	Did you have people coming from far away?]
[sd(^q)	and we're going, my gosh.] => [sv	Well you have]
[b	Yeah.] => [sv	And if, they come from far away, they take it more seriously]
[aa	I think you're right.] => [b	Yeah.]
[b	Yeah.] => [sd	My family's not very big]
[qw^d	Your family's from where?] => [sd	Well, I have a, a brother lives in Indianapolis, a sister lives in Chicago, and my folks live back in Buffalo, New York.]
[ba	no.] => [sd	I guess we have reunions about once a year or so.]
[b	Uh huh.] => [sd	We got together over Christmas.]
[ba	those are nice.] => [aa	Yeah,]
[sv	And I'm sure they're a lot more organized too, because they've done it before.] => [aa	Yeah,]
[ba	That's great.] => [sd	and every year everyone ask

## Load Datasets and Split into Train and Test Sets

In [107]:
######################
# Load dataset methods
######################
def load_sentences(filename):
    with open(filename, "rb") as f:
        return load(f)

def save_sentences(sentences, filename):
    with open(filename, "wb") as f:
        dump(sentences, f)
    print("Saved: %s" % filename)
    
def split_dataset(dataset, num_sentences):
    split = int(round(num_sentences*0.9))
    train = dataset[:split]
    test = dataset[split:]
    return train, test

In [110]:
#For testing purposes, you can change n_sentences to a smaller number.
raw_dataset = load_sentences("utt-resp.pkl")
n_sentences = len(raw_dataset)
print("Number of utterance-response pairs: ")
print(n_sentences)
dataset = raw_dataset[:n_sentences, :]
shuffle(dataset)
train, test = split_dataset(dataset, n_sentences)
save_sentences(dataset, "utt-resp-both.pkl")
save_sentences(train, "utt-resp-train.pkl")
save_sentences(test, "utt-resp-test.pkl")

Number of utterance-response pairs: 
46464
Saved: utt-resp-both.pkl
Saved: utt-resp-train.pkl
Saved: utt-resp-test.pkl


## Load Data and Tokenize

First, we load the datasets using our load_sentences method from above. We are going to load the full dataset (so we can calculate vocab and max_length sizes), and the train and test data.

Next, we tokenize the data. Tokenization is the process of mapping words to integers. We are actually going to create separate tokenizers for our input and response data. Why? Because right now, that makes the code run. We can experiment with using one tokenizer later.
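
To make "mapping words to integers" concrete, here is a minimal pure-Python sketch of the idea. It is a simplification, not the Keras Tokenizer itself: the helper names `fit_word_index` and `to_sequence` are made up for illustration, and unlike Keras this version does not strip punctuation. It does follow the same convention of numbering words from 1 in order of frequency, which leaves 0 free for padding.

```python
from collections import Counter

def fit_word_index(lines):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    # most_common() sorts by count; index 0 is deliberately left unused
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(word_index, line):
    """Replace each known word with its integer; unknown words are dropped."""
    return [word_index[w] for w in line.lower().split() if w in word_index]

word_index = fit_word_index(["yeah yeah okay", "okay yeah"])
print(word_index)                                 # {'yeah': 1, 'okay': 2}
print(to_sequence(word_index, "okay then yeah"))  # [2, 1]
```

Because each tokenizer builds its own mapping, the same word can get a different integer in the utterance tokenizer and in the response tokenizer.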

In [None]:
######################
# Load data
######################
dataset = load_sentences("utt-resp-both.pkl")
train = load_sentences("utt-resp-train.pkl")
test = load_sentences("utt-resp-test.pkl")

In [None]:
######################
# Tokenizer methods
######################
#create and fit a tokenizer on the given lines
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer
#get the max length of all phrases
def max_length(lines):
    return max(len(line.split()) for line in lines)

In [None]:
######################
# Tokenize
######################
#create tokenizers
utterance_tokenizer = create_tokenizer(dataset[:, 0])
response_tokenizer = create_tokenizer(dataset[:, 1])
#define vocabulary sizes
utterance_vocab_size = len(utterance_tokenizer.word_index) + 1
response_vocab_size = len(response_tokenizer.word_index) + 1
#define max_lengths
utterance_length = max_length(dataset[:, 0])
response_length = max_length(dataset[:, 1])
#print some statistics
print("Utterance vocabulary size: %d" % utterance_vocab_size)
print("Utterance max length: %d" % utterance_length)
print("Response vocabulary size: %d" % response_vocab_size)
print("Response max length: %d" % response_length)

## Encoding

We need to encode each utterance-response sequence to integers, and pad each encoding to the maximum phrase length (so that every sequence of encoded integers is the same length).

We need the encodings to be the same length because the model works on fixed-size arrays: the input sequences go through a word embedding, and the output sequences are one-hot encoded.
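
Here is a small pure-Python sketch of those two steps on one toy sequence. The helpers `pad_post` and `one_hot` are hypothetical stand-ins for `pad_sequences(..., padding="post")` and `to_categorical`, written out only to show what the arrays look like.

```python
def pad_post(seq, length, pad_value=0):
    """Right-pad a list of integers with pad_value until it has `length` items."""
    return seq + [pad_value] * (length - len(seq))

def one_hot(seq, vocab_size):
    """Turn each integer into a vector with a single 1 at that index."""
    return [[1 if i == idx else 0 for i in range(vocab_size)] for idx in seq]

padded = pad_post([2, 1], length=4)
print(padded)                          # [2, 1, 0, 0]
print(one_hot(padded, vocab_size=3))   # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [1, 0, 0]]
```

Note that the padding value 0 becomes its own one-hot class, which is why the vocabulary sizes above are computed as the number of words plus one.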

In [None]:
######################
# Encoding methods
######################
#this method encodes the lines and pads them to the max length
def encode_input(tokenizer, length, lines):
    encoding = tokenizer.texts_to_sequences(lines)
    encoding = pad_sequences(encoding, maxlen=length, padding="post")
    return encoding
#this method one-hot encodes the output (responses). 
#we do this because we want the model to predict the probability of each word in the vocabulary as an output.
def encode_output(sequences, vocab_size):
    output_list = list()
    for sequence in sequences:
        #to_categorical converts a class vector (integers) to binary class matrix
        encoded = to_categorical(sequence, num_classes=vocab_size)
        output_list.append(encoded)
    output = array(output_list)
    #Reshapes an output to a certain shape (the parameter for reshape is a tuple of integers)
    output = output.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return output

In [None]:
######################
# Encode data
######################
#training data
#column 0 holds the utterances (input) and column 1 holds the responses (output)
train_utterance = encode_input(utterance_tokenizer, utterance_length, train[:, 0])
train_response = encode_input(response_tokenizer, response_length, train[:, 1])
train_response = encode_output(train_response, response_vocab_size)
#test data
test_utterance = encode_input(utterance_tokenizer, utterance_length, test[:, 0])
test_response = encode_input(response_tokenizer, response_length, test[:, 1])
test_response = encode_output(test_response, response_vocab_size)

## Create model

We will create an encoder-decoder LSTM.

### What is a timestep?

A timestep is one position in a sequence: the LSTM reads its input one timestep at a time. Here is one definition:

The specified number of timesteps defines the number of input variables (X) used to predict the next time step (y).

So, basically:
The number of timesteps is the "memory" of an LSTM: it's how many inputs the network carries forward. In this case, we are using the max_length of an utterance/response as the number of timesteps. This means that every predicted word can take into account the whole sequence so far. Likewise, when we train, we are learning weights for a word based on every previous word in a sentence (this is what we want for an encoder-decoder model!)
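
To see why more timesteps means more "memory", here is a toy sketch. It is a plain one-unit recurrent cell, not an LSTM, and the weights `w_in` and `w_rec` are made-up numbers; the only point is that the state at the last timestep still depends on the very first input.

```python
import math

def simple_rnn(inputs, w_in=0.5, w_rec=0.8):
    """Run a toy one-unit recurrent cell over a sequence, one timestep per input."""
    state = 0.0
    states = []
    for x in inputs:  # each loop iteration is one timestep
        # the new state mixes the current input with the previous state
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states

with_first_word = simple_rnn([1.0, 0.0, 0.0])
without_first_word = simple_rnn([0.0, 0.0, 0.0])
print(with_first_word[-1])     # still non-zero after three timesteps
print(without_first_word[-1])  # 0.0, nothing was ever fed in
```

An LSTM adds gates that decide what to keep and what to forget, but the loop over timesteps works the same way.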

In [None]:
######################
# Methods to create model
######################
#this method creates a model based on the given inputs.
def create_model(input_vocab, output_vocab, input_timesteps, output_timesteps, n_units):
    model = Sequential() #we are doing seq2seq 
    model.add(Embedding(input_vocab, n_units, input_length=input_timesteps, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(output_timesteps))
    model.add(LSTM(n_units, return_sequences=True))
    model.add(TimeDistributed(Dense(output_vocab, activation="softmax")))
    return model
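
To follow what each layer in create_model does to the data, it can help to trace the array shapes by hand. The sketch below just writes those shapes down as tuples; it does not build a model, and the sizes in the example call (sequence lengths of 20 and 25, a response vocabulary of 5000) are made-up numbers, not values from our dataset.

```python
def trace_shapes(batch, input_timesteps, output_timesteps, n_units, output_vocab):
    """Return (layer, output shape) pairs for the layer stack used in create_model."""
    return [
        ("input integers",        (batch, input_timesteps)),
        ("Embedding",             (batch, input_timesteps, n_units)),
        # without return_sequences, the encoder LSTM keeps only its final state
        ("LSTM (encoder)",        (batch, n_units)),
        # RepeatVector copies that state once per output timestep
        ("RepeatVector",          (batch, output_timesteps, n_units)),
        ("LSTM (decoder)",        (batch, output_timesteps, n_units)),
        # one softmax over the response vocabulary per output timestep
        ("TimeDistributed Dense", (batch, output_timesteps, output_vocab)),
    ]

for name, shape in trace_shapes(64, 20, 25, 256, 5000):
    print(name, shape)
```

The last shape matches the one-hot encoded responses, which is why categorical_crossentropy can compare them directly.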

In [None]:
######################
# Create and compile model
######################
#We can change the number of hidden units (right now it's 256)
#increasing the number of hidden units may improve performance, but it will increase training time
#We can change the loss function (right now it's categorical_crossentropy)
#I also create a file called model.png that shows the shape of the model
#I thought we might want to use the image for our final presentation :)
model = create_model(utterance_vocab_size, response_vocab_size, utterance_length, response_length, 256)
model.compile(optimizer="adam", loss="categorical_crossentropy")
print(model.summary())
plot_model(model, to_file="model.png", show_shapes=True)

## Train the model 

Right now I'm using 30 epochs and a batch_size of 64. We can always up the number of epochs if we aren't getting good performance.
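
For a sense of scale, here is the arithmetic behind those settings, worked out from the 46464 pairs reported when the dataset was loaded. This assumes the full dataset is used with the 90/10 split from split_dataset, and that the last, smaller batch of each epoch is still run.

```python
import math

n_pairs = 46464      # total pairs reported when the dataset was loaded
batch_size = 64
epochs = 30

n_train = int(round(n_pairs * 0.9))   # same rule as split_dataset
n_test = n_pairs - n_train
batches_per_epoch = math.ceil(n_train / batch_size)
total_updates = batches_per_epoch * epochs

print(n_train, n_test)       # 41818 4646
print(batches_per_epoch)     # 654
print(total_updates)         # 19620
```

So upping the number of epochs by 10 would add another 6540 weight updates.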

In [None]:
filename= "model.test1"
checkpoint = ModelCheckpoint(filename, monitor="val_loss", verbose=1, save_best_only=True, mode="min")
model.fit(train_utterance, train_response, epochs=30, batch_size=64, validation_data=(test_utterance, test_response), callbacks=[checkpoint], verbose=2)

## Evaluate the model

In [None]:
#reload the datasets (just in case)
dataset = load_sentences("utt-resp-both.pkl")
train = load_sentences("utt-resp-train.pkl")
test = load_sentences("utt-resp-test.pkl")
#create tokenizers
utterance_tokenizer = create_tokenizer(dataset[:, 0])
response_tokenizer = create_tokenizer(dataset[:, 1])
#define vocabulary sizes
utterance_vocab_size = len(utterance_tokenizer.word_index) + 1
response_vocab_size = len(response_tokenizer.word_index) + 1
#define max_lengths
utterance_length = max_length(dataset[:, 0])
response_length = max_length(dataset[:, 1])
#datasets
#encode the utterances (column 0) with the same method used before training
train_utt = encode_input(utterance_tokenizer, utterance_length, train[:, 0])
test_utt = encode_input(utterance_tokenizer, utterance_length, test[:, 0])

In [None]:
######################
# Evaluation methods
######################
#reverse-lookup a word in the tokenizer 
def get_word(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
#we will need to perform this reverse-lookup for every word in a predicted sequence
#this method returns the prediction in words (not integers)
def get_prediction(model, tokenizer, source):
    prediction = model.predict(source, verbose=0)[0]
    integers = [np.argmax(vector) for vector in prediction]
    target = list()
    for i in integers:
        word = get_word(i, tokenizer)
        if word is None:
            break
        target.append(word)
    return " ".join(target)
#we need to repeat the prediction for every utterance in the test dataset
#we then compare our prediction to the actual response
#I'm using a BLEU score to compare these quantitatively, but if we get a low BLEU score I wouldn't be surprised.
def evaluate_model(model, tokenizer, sources, raw_dataset):
    actual, predicted = list(), list()
    for i, source in enumerate(sources):
        source = source.reshape((1, source.shape[0]))
        #decode with the tokenizer passed in (this should be the response tokenizer)
        translation = get_prediction(model, tokenizer, source)
        #each row of raw_dataset is [utterance, response]
        raw_source, raw_target = raw_dataset[i]
        if i < 10:
            print('src=[%s], target=[%s], predicted=[%s]' % (raw_source, raw_target, translation))
        #corpus_bleu expects a list of reference sentences for each prediction
        actual.append([raw_target.split()])
        predicted.append(translation.split())
    #calculate BLEU scores
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
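
For intuition about what BLEU-1 is measuring, here is a simplified pure-Python sketch of clipped unigram precision for one sentence pair. This is not the full BLEU score that corpus_bleu computes (there is no brevity penalty and no corpus-level aggregation), and the helper name `unigram_precision` is made up for illustration.

```python
from collections import Counter

def unigram_precision(reference, candidate):
    """Clipped unigram precision: the core of BLEU-1, without the brevity penalty."""
    if not candidate:
        return 0.0
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    # a predicted word only counts as often as it appears in the reference
    matched = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return matched / len(candidate)

reference = "how big a family do you have".split()
candidate = "how big do you you have".split()
print(unigram_precision(reference, candidate))  # 5 of the 6 predicted words match
```

The higher-order scores (BLEU-2 to BLEU-4) do the same kind of counting on word pairs, triples, and so on, which is why they drop quickly when the predicted word order is off.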

In [None]:
######################
# Evaluate
######################
model = load_model("model.test1")
#evaluate on training data (this should be pretty good)
#we pass the response tokenizer because the model predicts response words
evaluate_model(model, response_tokenizer, train_utt, train)
#evaluate on test data
evaluate_model(model, response_tokenizer, test_utt, test)