#Libraries

 - All the libraries, tools, and utilities needed to work with the language model. Torch and its related imports handle the deep learning tasks, while utilities such as csv, os, and json handle the actual data.

In [1]:
pip install rouge



In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
import codecs
import itertools
import math
import json
import spacy
import nltk
from torch.jit import script, trace
from io import open
from torch import optim
from nltk.translate.bleu_score import sentence_bleu
from rouge import Rouge

 - Because training a model is computationally expensive, we set the device to the GPU when one is available and fall back to the CPU otherwise.

In [3]:
USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

 - Next we mount our Google Drive to access the files. For my notebook the file structure in the Google Drive is the Drive -> Folder named "Movie_Corpus" -> All necessary files.

In [4]:
from google.colab import drive

In [5]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


 - For simplicity and cleaner code, I set a few paths up front that are needed in multiple functions.

In [6]:
#Google Drive paths for read in/saving
corpus_name = "Movie_Corpus"
corpus = os.path.join("/content/drive/My Drive/", corpus_name)
save_dir = os.path.join("/content/drive/My Drive/Movie_Corpus", "save")
datafile = os.path.join("/content/drive/My Drive/Movie_Corpus", "datafile.txt")

 - Reading in the first file, utterances.jsonl saved in the Movie_Corpus folder, to see the data we are working with. It is important to note that on Kaggle all files are loaded in as .txt files before we change them into dataframes. However, that route led to nested lists, contiguous token errors, dictionary breakdowns, and even an illegal action error that I could not account for. From the official website, I instead loaded a second version of the data as .json or .jsonl files, which are easier to work with.

#Preprocessing

In [7]:
#Print first few lines of our utterance file
def printLines(file, n=5):
    with open(file, 'rb') as datafile:
        lines = datafile.readlines()
    for line in lines[:n]:
        print(line)

printLines(os.path.join(corpus, "utterances.jsonl"))

b'{"id": "L1045", "conversation_id": "L1044", "text": "They do not!", "speaker": "u0", "meta": {"movie_id": "m0", "parsed": [{"rt": 1, "toks": [{"tok": "They", "tag": "PRP", "dep": "nsubj", "up": 1, "dn": []}, {"tok": "do", "tag": "VBP", "dep": "ROOT", "dn": [0, 2, 3]}, {"tok": "not", "tag": "RB", "dep": "neg", "up": 1, "dn": []}, {"tok": "!", "tag": ".", "dep": "punct", "up": 1, "dn": []}]}]}, "reply-to": "L1044", "timestamp": null, "vectors": []}\n'
b'{"id": "L1044", "conversation_id": "L1044", "text": "They do to!", "speaker": "u2", "meta": {"movie_id": "m0", "parsed": [{"rt": 1, "toks": [{"tok": "They", "tag": "PRP", "dep": "nsubj", "up": 1, "dn": []}, {"tok": "do", "tag": "VBP", "dep": "ROOT", "dn": [0, 2, 3]}, {"tok": "to", "tag": "TO", "dep": "dobj", "up": 1, "dn": []}, {"tok": "!", "tag": ".", "dep": "punct", "up": 1, "dn": []}]}]}, "reply-to": null, "timestamp": null, "vectors": []}\n'
b'{"id": "L985", "conversation_id": "L984", "text": "I hope so.", "speaker": "u0", "meta": {

 - We can note that each utterance is structured as a dictionary with nested metadata. The two kinds of data we want are the individual lines and the conversations they form. Below we will parse the lines in the utterances file to extract the necessary data.

In [8]:
#Splits each line and fill into empty dictionaries
def LinesAndConvos(fileName):
    #Dictionaries to store individual lines and conversations
    lines = {}
    conversations = {}
    #Open with iso-8859-1 encoding instead of UTF-8
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            lineJson = json.loads(line)
            #Extract fields for line object
            lineObj = {}
            lineObj["lineID"] = lineJson["id"]
            lineObj["characterID"] = lineJson["speaker"]
            lineObj["text"] = lineJson["text"]
            lines[lineObj['lineID']] = lineObj
            #Extract fields for conversation object
            if lineJson["conversation_id"] not in conversations:
                convObj = {}
                convObj["conversationID"] = lineJson["conversation_id"]
                convObj["movieID"] = lineJson["meta"]["movie_id"]
                convObj["lines"] = [lineObj]
            else:
                #Add line to existing convo
                convObj = conversations[lineJson["conversation_id"]]
                convObj["lines"].insert(0, lineObj)
            conversations[convObj["conversationID"]] = convObj
    return lines, conversations
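
 - To make the ordering logic concrete, here is a minimal sketch of the same grouping step run on two in-memory records instead of the file. The helper name group_utterances and the trimmed-down records are assumptions made for illustration; only the fields the function reads are kept. In the sample output, the reply (L1045) is listed before the line it answers (L1044), which is why each line is prepended with insert(0, ...).

```python
import json

# Two records in file order: the reply (L1045) comes before the line it answers (L1044)
raw_lines = [
    '{"id": "L1045", "conversation_id": "L1044", "text": "They do not!", '
    '"speaker": "u0", "meta": {"movie_id": "m0"}}',
    '{"id": "L1044", "conversation_id": "L1044", "text": "They do to!", '
    '"speaker": "u2", "meta": {"movie_id": "m0"}}',
]

def group_utterances(json_lines):
    """Hypothetical in-memory version of the grouping done in LinesAndConvos."""
    conversations = {}
    for raw in json_lines:
        record = json.loads(raw)
        line_obj = {
            "lineID": record["id"],
            "characterID": record["speaker"],
            "text": record["text"],
        }
        # Create the conversation on first sight, otherwise reuse it
        conv = conversations.setdefault(
            record["conversation_id"],
            {
                "conversationID": record["conversation_id"],
                "movieID": record["meta"]["movie_id"],
                "lines": [],
            },
        )
        # Prepending restores chronological order, since later lines come first
        conv["lines"].insert(0, line_obj)
    return conversations

toy_conversations = group_utterances(raw_lines)
ordered_texts = [line["text"] for line in toy_conversations["L1044"]["lines"]]
print(ordered_texts)  # ['They do to!', 'They do not!']
```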

 - We then further process the information by converting the derived conversations into pairs of inputs and outputs (an input being a dialogue from a character in the corpus, and the output the response of another character to the input).

In [9]:
#Extracts pairs of sentences from conversations
def extractSentencePairs(conversations):
    #List to store question-answer pairs
    qa_pairs = []
    for conversation in conversations.values():
        #Iterate over all lines of convo
        for i in range(len(conversation["lines"]) - 1):
            #strip whitespace
            inputLine = conversation["lines"][i]["text"].strip()
            #Next line becomes target
            targetLine = conversation["lines"][i+1]["text"].strip()
            #Filter out samples without valid input/targets
            if inputLine and targetLine:
                qa_pairs.append([inputLine, targetLine])
    #Now we can return extracted sentence pairs
    return qa_pairs
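
 - As a quick illustration of the pairing step, the sketch below restates the same logic compactly (so it runs on its own) and applies it to two made-up conversations. The toy data and the name extract_pairs are assumptions for the example. A conversation with n non-empty lines yields n - 1 overlapping pairs, and any pair with an empty side is dropped.

```python
def extract_pairs(conversations):
    """Compact restatement of extractSentencePairs, repeated so the snippet is self-contained."""
    pairs = []
    for conversation in conversations.values():
        lines = conversation["lines"]
        # Walk over consecutive lines: each line is the input for the one after it
        for current, following in zip(lines, lines[1:]):
            input_line = current["text"].strip()
            target_line = following["text"].strip()
            if input_line and target_line:
                pairs.append([input_line, target_line])
    return pairs

# Made-up toy conversations; the second one has a whitespace-only line
toy = {
    "c0": {"lines": [{"text": "How are you?"}, {"text": "Fine."}, {"text": "Good."}]},
    "c1": {"lines": [{"text": "Hello?"}, {"text": "   "}, {"text": "Anyone there?"}]},
}
toy_pairs = extract_pairs(toy)
print(toy_pairs)  # [['How are you?', 'Fine.'], ['Fine.', 'Good.']]
```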

 - From this step, we can initialize empty dictionaries, read in our utterances file, and write out a reformatted dataset that we can further clean and refine for our model.

In [10]:
#Forgot to define above, path to newly created file
datafile = os.path.join(corpus, "formatted_movie_lines.txt")

delimiter = '\t'
#Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

#Initialize lines dict and conversations dict
lines = {}
conversations = {}

#Load lines and convos
print("\nProcessing corpus into lines and conversations...")
lines, conversations = LinesAndConvos(os.path.join(corpus, "utterances.jsonl"))

#Create and write a new csv file in utf-8 encoding
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in extractSentencePairs(conversations):
        writer.writerow(pair)

#Ensure it worked, let's see
print("\nSample lines from file:")
printLines(datafile)


Processing corpus into lines and conversations...

Writing newly formatted file...

Sample lines from file:
b'They do to!\tThey do not!\n'
b'She okay?\tI hope so.\n'
b"Wow\tLet's go.\n"
b'"I\'m kidding.  You know how sometimes you just become this ""persona""?  And you don\'t know how to quit?"\tNo\n'
b"No\tOkay -- you're gonna need to learn how to lie.\n"


 - This is where most of the preprocessing takes place. The Vocabulary class was created to manage the vocabulary of our chatbot: it handles the mapping between words and indices, tracks how often each word is used in our dialogue, and filters out rare terms to increase training speed. Tokenization with a dedicated library is not explicitly called due to its computational and time requirements, but it is mimicked. Functions such as addSentence() and normalizeString() account for basic tokenization by splitting individual words, converting terms to lowercase, removing accents and non-alphabetical characters, and adding spaces around punctuation. Further down we also call on indexesFromSentence to convert a sentence into indices using our dictionary.

In [11]:
#Crucial tokens for vocabulary.
#PAD token pads short sentences
PAD_token = 0
#SOS token is Start-of-sentence
SOS_token = 1
#EOS is end-of-sentence
EOS_token = 2

class Vocabulary:
    def __init__(self, name):
        #Name of vocabulary
        self.name = name
        #Flag to check if trimming has occurred (not yet)
        self.trimmed = False
        #Dictionaries to map words to index and their respective frequencies
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        #Set at 3 to account for PAD, SOS, and EOS
        self.num_words = 3

    #Sentence input split into individual words and add to vocab
    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    #Adds a word if it isn't already in the vocab list
    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1

    #Trims out words below predefined threshold
    def trim(self, min_count):
        if self.trimmed:
            return
        #Trimming is done
        self.trimmed = True
        #Store words meeting min_count
        keep_words = []
        #Keep words that occur at least min_count times
        for k, v in self.word2count.items():
            if v >= min_count:
                keep_words.append(k)

        print('keep_words {} / {} = {:.4f}'.format(
            len(keep_words), len(self.word2index), len(keep_words) / len(self.word2index)))

        #Reinitialize dictionaries to keep those frequent words
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3
        #Add only frequent words back to vocab
        for word in keep_words:
            self.addWord(word)
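
 - The counting and trimming logic can be sketched without the class. This is a simplified stand-in using collections.Counter, not the Vocabulary class itself; the three toy sentences are made up for illustration. It shows which words survive a min_count of 2 and why the first real word gets index 3.

```python
from collections import Counter

PAD_token, SOS_token, EOS_token = 0, 1, 2

# Made-up, already-normalized sentences
sentences = ["what good stuff ?", "the real you .", "what crap ?"]

# Same counting as addSentence/addWord: split on single spaces and tally
counts = Counter(word for s in sentences for word in s.split(' '))

min_count = 2
# Same condition as trim(): keep words seen at least min_count times
kept = [word for word, count in counts.items() if count >= min_count]

# Indices 0-2 are reserved for PAD, SOS and EOS, so real words start at 3
word2index = {word: index for index, word in enumerate(kept, start=3)}

print(kept)        # ['what', '?']
print(word2index)  # {'what': 3, '?': 4}
```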

In [12]:
#Turn unicode to ASCII
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')

In [13]:
# Normalize a string: lowercase, strip accents, separate punctuation, drop non-letters
def normalizeString(s):
    # First converting to lowercase and removing accents
    s = unicodeToAscii(s.lower().strip())
    # Separate out punctuation
    s = re.sub(r"([.!?])", r" \1", s)
    # Remove non-alphabet letters
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    # Replace spaces with single space
    s = re.sub(r"\s+", r" ", s).strip()
    return s
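
 - A few worked inputs show what the normalization does in practice. The two functions are repeated here under snake_case names only so the snippet runs on its own; the example strings are made up.

```python
import re
import unicodedata

def unicode_to_ascii(s):
    # Decompose accented characters and drop the combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # everything else non-alphabetic becomes a space
    s = re.sub(r"\s+", r" ", s).strip()    # collapse repeated whitespace
    return s

examples = ["Let's go.", "Café, s'il vous plaît!", "Well, no..."]
normalized = [normalize_string(s) for s in examples]
for before, after in zip(examples, normalized):
    print(repr(before), "->", repr(after))
# "Let's go."              -> 'let s go .'
# "Café, s'il vous plaît!" -> 'cafe s il vous plait !'
# 'Well, no...'            -> 'well no . . .'
```

 - Note that apostrophes are replaced by spaces, which is why contractions such as "let's" show up as "let s" in the sample pairs.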

In [14]:
#Max length of sentence considered
MAX_LENGTH = 15

In [15]:
#Read query/response pairs and return a voc object
def readVocab(datafile, corpus_name):
    print("Reading lines...")
    #Read and split into lines
    lines = open(datafile, encoding='utf-8').\
        read().strip().split('\n')
    #Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    vocab = Vocabulary(corpus_name)
    return vocab, pairs

 - Now that most of the preprocessing is done, we simply need to filter the pairs based on a set max length. This helps with computational processing and, in the end, with the coherence and efficiency of the chatbot's responses.

In [16]:
#Check if sentence pairs are under max length
def filterPair(p):
    #Input sequences need to preserve the last word for EOS token
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH

 - Based on that, we retain only those sentence pairs under our maximum length.

In [17]:
#Filter pairs using ``filterPair`` condition
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]
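
 - The boundary of the length filter is easy to get wrong, so here is a small check. filter_pair restates filterPair so the snippet is self-contained, and the sentences are made up. Because the comparison is strictly less than MAX_LENGTH, a 14-word sentence passes and a 15-word sentence does not, which leaves room for the EOS token.

```python
MAX_LENGTH = 15

def filter_pair(p):
    # Both sides of the pair must have fewer than MAX_LENGTH space-separated tokens
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH

short_pair = ["what good stuff ?", "the real you ."]
edge_pair = [" ".join(["word"] * 14), "ok ."]      # 14 tokens: kept
too_long_pair = [" ".join(["word"] * 15), "ok ."]  # 15 tokens: dropped

candidates = [short_pair, edge_pair, too_long_pair]
kept_pairs = [p for p in candidates if filter_pair(p)]
print(len(kept_pairs))  # 2
```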

 - We have a lot of moving parts operating at this point, so we will use the loadAndPrepare function to tie them together.

In [18]:
#Using everything, return a populated voc object and pairs list
def loadAndPrepare(corpus, corpus_name, datafile, save_dir):
    print("Start preparing training data ...")
    #Read and normalize the sentence pairs
    vocab, pairs = readVocab(datafile, corpus_name)
    print("Read {!s} sentence pairs".format(len(pairs)))
    #Filter pairs based on the length
    pairs = filterPairs(pairs)
    print("Trimmed to {!s} sentence pairs".format(len(pairs)))
    #Counting words, building vocab
    print("Counting words...")
    for pair in pairs:
        #Add query sentence to vocab
        vocab.addSentence(pair[0])
        #Add response sentence to vocab
        vocab.addSentence(pair[1])
    print("Counted words:", vocab.num_words)
    return vocab, pairs

 - Lastly we set a directory for saving, then validate that the data loading and preparation were performed correctly by printing a few samples to visually confirm.

In [19]:
#Assemble voc and pairs
save_dir = os.path.join("data", "save")
vocab, pairs = loadAndPrepare(corpus, corpus_name, datafile, save_dir)
#Let's see a few
print("\npairs:")
for pair in pairs[:10]:
    print(pair)

Start preparing training data ...
Reading lines...
Read 221282 sentence pairs
Trimmed to 111392 sentence pairs
Counting words...
Counted words: 26989

pairs:
['they do to !', 'they do not !']
['she okay ?', 'i hope so .']
['wow', 'let s go .']
['no', 'okay you re gonna need to learn how to lie .']
['i figured you d get to the good stuff eventually .', 'what good stuff ?']
['what good stuff ?', 'the real you .']
['the real you .', 'like my fear of wearing pastels ?']
['do you listen to this crap ?', 'what crap ?']
['well no . . .', 'then that s all you had to say .']
['then that s all you had to say .', 'but']


 - On the other end of the spectrum, we are going to further refine our data by trimming words that do not occur often. This way we can reduce the complexity of the model and its responses while also increasing its coherence in chat.

In [20]:
#minimum frequency for trimming
MIN_COUNT = 2

def trimRareWords(vocab, pairs, MIN_COUNT):
    #Trim words under threshold from vocabulary
    vocab.trim(MIN_COUNT)
    #Filter out pairs with trimmed words
    keep_pairs = []
    for pair in pairs:
        input_sentence = pair[0]
        output_sentence = pair[1]
        keep_input = True
        keep_output = True
        #Check input sentence
        for word in input_sentence.split(' '):
            if word not in vocab.word2index:
                keep_input = False
                break
        #Check output sentence
        for word in output_sentence.split(' '):
            if word not in vocab.word2index:
                keep_output = False
                break
        #Only keep pairs without trimmed words in input/output sentence
        if keep_input and keep_output:
            keep_pairs.append(pair)
    print("Trimmed from {} pairs to {}, {:.4f} of total".format(len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)))
    return keep_pairs
#Trim voc and pairs
pairs = trimRareWords(vocab, pairs, MIN_COUNT)

keep_words 18532 / 26986 = 0.6867
Trimmed from 111392 pairs to 104099, 0.9345 of total


 - As seen above, we keep about 69% of the vocabulary and about 93% of the sentence pairs, so it is a solid optimization step with few downsides. We then move forward and convert the sentences we've kept into indices representing each word. The key factor here is appending our End-Of-Sequence token to each sentence for our future input. This step is necessary to convert the data into a type the model can process.

In [21]:
# Update function to convert sentence to indices
def indexesFromSentence(vocab, sentence):
    return [vocab.word2index[word] for word in sentence.split(' ')] + [EOS_token]
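
 - A minimal sketch of the index conversion, using a made-up toy mapping in place of a full Vocabulary object (the function below takes the word2index dictionary directly, which is an assumption for brevity). It also shows why pairs containing trimmed words had to be removed earlier: a word missing from the mapping raises a KeyError.

```python
EOS_token = 2

# Hypothetical toy mapping; in the notebook this comes from vocab.word2index
word2index = {"what": 3, "good": 4, "stuff": 5, "?": 6}

def indexes_from_sentence(word2index, sentence):
    # Look up each space-separated word, then append the end-of-sentence token
    return [word2index[word] for word in sentence.split(' ')] + [EOS_token]

indexes = indexes_from_sentence(word2index, "what good stuff ?")
print(indexes)  # [3, 4, 5, 6, 2]

# A word that is not in the mapping cannot be converted
try:
    indexes_from_sentence(word2index, "what crap ?")
    unknown_raises = False
except KeyError:
    unknown_raises = True
print(unknown_raises)  # True
```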

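 - We then pad the sequences so that every sentence in a batch has the same length. Without this step, sentences of different lengths could not be stacked into a single tensor. Note that zeroPadding also transposes the batch, so the result has one row per time step and one column per sentence.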

In [22]:
def zeroPadding(l, fillvalue=PAD_token):
    return list(itertools.zip_longest(*l, fillvalue=fillvalue))
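
 - The transposition done by zip_longest is easiest to see on a tiny batch. The function is repeated under a snake_case name so the snippet runs on its own, and the index values are made up.

```python
import itertools

PAD_token = 0

def zero_padding(batch, fillvalue=PAD_token):
    # zip_longest(*batch) walks the sentences in parallel, one time step at a time,
    # filling in PAD_token wherever a shorter sentence has run out
    return list(itertools.zip_longest(*batch, fillvalue=fillvalue))

# Three sentences of lengths 4, 2 and 3 (each already ends with EOS = 2)
batch = [[5, 6, 7, 2], [8, 2], [9, 10, 2]]
padded = zero_padding(batch)
for row in padded:
    print(row)
# (5, 8, 9)
# (6, 2, 10)
# (7, 0, 2)
# (2, 0, 0)
```

 - The result has 4 rows (the longest sentence) and 3 columns (the batch size), which matches the layout of input_variable printed further down.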

 - We also set up a binary matrix to create a binary mask for the padded sequences to help identify which instances were valid tokens (indicated with a 1) and which required padding (indicated with a 0). With this method we can ignore padded tokens during our loss calculations, which maintains the purity of our evaluation metrics.

In [23]:
def binaryMatrix(l, value=PAD_token):
    #Start with an empty list
    m = []
    #Loop over each sequence and mark non-padded tokens with 1, padded tokens with 0
    for i, seq in enumerate(l):
        m.append([])
        for token in seq:
            if token == value:
                m[i].append(0)
            else:
                m[i].append(1)
    return m
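
 - Applied to a small padded batch, the mask looks like this. The function is restated as a list comprehension so the snippet is self-contained, and the padded values are made up. The number of ones equals the total number of real tokens.

```python
PAD_token = 0

def binary_matrix(padded, value=PAD_token):
    # 1 marks a real token, 0 marks padding
    return [[0 if token == value else 1 for token in seq] for seq in padded]

# Time-major padded batch for three sentences of lengths 4, 2 and 3
padded = [(5, 8, 9), (6, 2, 10), (7, 0, 2), (2, 0, 0)]
mask = binary_matrix(padded)
for row in mask:
    print(row)
# [1, 1, 1]
# [1, 1, 1]
# [1, 0, 1]
# [1, 0, 0]

valid_tokens = sum(sum(row) for row in mask)
print(valid_tokens)  # 9, i.e. 4 + 2 + 3 real tokens
```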

 - We then have two functions to handle inputs and outputs. inputVar returns the padded tensor and the lengths of the original sentences before padding. outputVar prepares the target sentences for processing by converting them to padded tensors, building the binary mask, and returning the maximum target length.

In [24]:
#Convert batch of input sentences
def inputVar(l, vocab):
    #Convert each sentence into list of indices
    indexes_batch = [indexesFromSentence(vocab, sentence) for sentence in l]
    #Create tensor containing original lengths of each sentence before padding
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    #pad sentences to equal lengths
    padList = zeroPadding(indexes_batch)
    #Convert to tensor in pytorch
    padVar = torch.LongTensor(padList)
    return padVar, lengths

In [25]:
def outputVar(l, vocab):
    #Convert batch of target sentences to indices
    indexes_batch = [indexesFromSentence(vocab, sentence) for sentence in l]
    #Find length of longest sentence in batch as reference padding
    max_target_len = max([len(indexes) for indexes in indexes_batch])
    #Pad all sentences to that length
    padList = zeroPadding(indexes_batch)
    #Create binary mask where valid and padded tokens are marked
    mask = binaryMatrix(padList)
    #Convert binary mask into tensor
    mask = torch.BoolTensor(mask)
    #Convert padded sentences into tensor
    padVar = torch.LongTensor(padList)
    return padVar, mask, max_target_len

 - Lastly we convert the batches of input/output sentence pairs to feed into our model.

In [26]:
# batch2TrainData converts a batch of sentence pairs into padded tensors
def batch2TrainData(vocab, pair_batch):
    # Sort batch of sentence pairs by input length, longest first (the order pack_padded_sequence expects by default)
    pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)
    # Separate input/output into two separate lists
    input_batch, output_batch = [], []
    for pair in pair_batch:
        input_batch.append(pair[0])
        output_batch.append(pair[1])
    inp, lengths = inputVar(input_batch, vocab)
    output, mask, max_target_len = outputVar(output_batch, vocab)
    return inp, lengths, output, mask, max_target_len


In [27]:
#Test out and see
small_batch_size = 5
batches = batch2TrainData(vocab, [random.choice(pairs) for _ in range(small_batch_size)])
input_variable, lengths, target_variable, mask, max_target_len = batches

print("input_variable:", input_variable)
print("lengths:", lengths)
print("target_variable:", target_variable)
print("mask:", mask)
print("max_target_len:", max_target_len)

input_variable: tensor([[  20,   34,   44,   46,  105],
        [  29,    4,  361,   17,   17],
        [   5,   20,   20,   99, 1147],
        [  65,   29,  492,   14,   14],
        [1684,  112,  105,    2,    2],
        [ 490,   20,   46,    0,    0],
        [ 281, 2704,  350,    0,    0],
        [  44,   10,   14,    0,    0],
        [  57,    2,    2,    0,    0],
        [  82,    0,    0,    0,    0],
        [  14,    0,    0,    0,    0],
        [   2,    0,    0,    0,    0]])
lengths: tensor([12,  9,  9,  5,  5])
target_variable: tensor([[  234,   514,   351,    11,    11],
        [   20,   551,    14,  1108,   131],
        [13413,    14,    18,     5,   120],
        [  195,    37,    14,    95,   539],
        [   10,   332,     2,  1845,    14],
        [    2,    84,     0,    50,     2],
        [    0,   140,     0,   206,     0],
        [    0,   198,     0,    11,     0],
        [    0,   105,     0,   183,     0],
        [    0,     5,     0,   140,     0]

 - It is important to note that although we did not use explicit tokenization methods from NLTK/spaCy, we still end up with appropriate tensors.

#Modeling

 - The model used here is a Sequence-to-Sequence model with a Luong attention mechanism and gated recurrent units (GRUs): bidirectional in the encoder, unidirectional in the decoder. A GRU is a type of RNN (hence the class names), and the model is composed of two essential parts: an encoder and a decoder. The encoder transforms an input sentence into a hidden representation, and the decoder generates the output (our chatbot response) from that representation.

 - The encoder itself is, as mentioned above, a bidirectional GRU, used because processing the sentence both forwards and backwards is effective at capturing contextual information and long-range dependencies. The outputs of the two directions are summed, and the encoder's hidden state is passed to the decoder for text generation.

In [28]:
class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding

        #Clarification in initializing GRU - the input_size and hidden_size parameters are both set to 'hidden_size'
        #because our input size is a word embedding with number of features == hidden_size
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout), bidirectional=True)

    def forward(self, input_seq, input_lengths, hidden=None):
        #Convert word indexes to embeddings
        embedded = self.embedding(input_seq)
        #Pack padded batch of sequences for RNN module
        packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        #Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        #Unpacking padding
        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs)
        #Sum together the bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, : ,self.hidden_size:]
        #Return output and final hidden state
        return outputs, hidden

 - The real gem of this implementation is the Luong attention mechanism. It helps with potentially long inputs from the user, which was a proven issue in previous attempts at this final project, because the decoder can look back at every encoder output instead of relying on a single hidden state. It can also ease the vanishing-gradient problem, which helps preserve context. There is also added flexibility in training through the choice of scoring function, covered below (dot, general, concat).

In [29]:
#Luong attention layer from paper (need to add above)
class Attn(nn.Module):
    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()
        #Of the 3 variants of attention mechanism to use between dot, general, and concat
        #We choose concat
        self.method = method
        if self.method not in ['dot', 'general', 'concat']:
            raise ValueError(f"{self.method} is not an appropriate attention method.")
        self.hidden_size = hidden_size
        if self.method == 'general':
            #We also apply linear transformation to align dimensions of hidden state and encoder outputs for attention calc
            self.attn = nn.Linear(self.hidden_size, hidden_size)
        elif self.method == 'concat':
            self.attn = nn.Linear(self.hidden_size * 2, hidden_size)
            self.v = nn.Parameter(torch.FloatTensor(hidden_size))

    #Attention score calculation (dot product) of hidden state of decoder and encoder output
    def dot_score(self, hidden, encoder_output):
        return torch.sum(hidden * encoder_output, dim=2)

    #Same thing but energy tensor created from applying linear transformation to encoder output
    def general_score(self, hidden, encoder_output):
        energy = self.attn(encoder_output)
        return torch.sum(hidden * energy, dim=2)

    #Same idea as dot score, but concatenates the decoder's hidden state and encoder output before passing through a linear transformation
    #And then a non-linear activation function (favorite=tanh)
    def concat_score(self, hidden, encoder_output):
        energy = self.attn(torch.cat((hidden.expand(encoder_output.size(0), -1, -1), encoder_output), 2)).tanh()
        return torch.sum(self.v * energy, dim=2)

    def forward(self, hidden, encoder_outputs):
        #Calculate the attention weights (energies) based on the given method chosen, all 3 for exploration
        if self.method == 'general':
            attn_energies = self.general_score(hidden, encoder_outputs)
        elif self.method == 'concat':
            attn_energies = self.concat_score(hidden, encoder_outputs)
        elif self.method == 'dot':
            attn_energies = self.dot_score(hidden, encoder_outputs)
        #Transpose max_length and batch_size dimensions
        attn_energies = attn_energies.t()
        #Return the softmax normalized probability scores (with added dimension)
        return F.softmax(attn_energies, dim=1).unsqueeze(1)
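
 - To see the arithmetic behind the 'dot' variant without tensors, here is a plain-Python sketch for a single sentence. It is not the Attn class: batching is left out, and the function names and toy vectors are assumptions for illustration. Each encoder output is scored by its dot product with the decoder hidden state, the scores are turned into weights with a softmax, and the context vector is the weighted sum of the encoder outputs.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(hidden, encoder_outputs):
    # One score per encoder time step: dot product with the decoder hidden state
    scores = [sum(h * e for h, e in zip(hidden, enc)) for enc in encoder_outputs]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the encoder outputs
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_outputs))
        for i in range(len(hidden))
    ]
    return weights, context

# Toy decoder state and three encoder time steps, hidden size 2
hidden = [1.0, 0.0]
encoder_outputs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
weights, context = dot_attention(hidden, encoder_outputs)
print(weights)  # largest weight on the second time step (raw score 2)
print(context)
```

 - The 'general' and 'concat' variants only change how the raw scores are computed; the softmax and the weighted sum stay the same.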

In [30]:
class LuongAttnDecoderRNN(nn.Module):
    def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(LuongAttnDecoderRNN, self).__init__()
        #Keep for reference
        self.attn_model = attn_model
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout
        #Embedding layer will map word indices to dense vectors
        self.embedding = embedding
        #Dropout on the embeddings for regularization against overfitting
        self.embedding_dropout = nn.Dropout(dropout)
        #Unidirectional GRU
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))
        #Linear layer to combine the GRU output and the attention context vector
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        #Convert the combined hidden vector to scores over the output vocabulary
        self.out = nn.Linear(hidden_size, output_size)
        #calculate attention score
        self.attn = Attn(attn_model, hidden_size)

    #Forward pass for decoder to generate
    def forward(self, input_step, last_hidden, encoder_outputs):
        #Important Note: run this one step (word) at a time
        #Get embedding of current input word
        embedded = self.embedding(input_step)
        embedded = self.embedding_dropout(embedded)
        #Forward through GRU
        rnn_output, hidden = self.gru(embedded, last_hidden)
        #Calculate attention weights from the current GRU output
        attn_weights = self.attn(rnn_output, encoder_outputs)
        #Multiply attention weights to encoder outputs to get new "weighted sum" context vector
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
        #Concatenate weighted context vector and GRU output using Luong equation 5
        rnn_output = rnn_output.squeeze(0)
        context = context.squeeze(1)
        concat_input = torch.cat((rnn_output, context), 1)
        concat_output = torch.tanh(self.concat(concat_input))
        #Predict next word using Luong equation 6
        output = self.out(concat_output)
        output = F.softmax(output, dim=1)
        #Return output and final hidden state
        return output, hidden

 - The loss for the model is a masked negative log-likelihood, calculated by the function below. The mask excludes padded positions, so the loss stays comparable across input and output sequences of various lengths.

In [31]:
#Masked negative log-likelihood
def maskNLLLoss(inp, target, mask):
    #Count total number of non-padded elements
    nTotal = mask.sum()
    #Get predicted values from input corresponding to correct target values
    crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
    #Average loss for current batch
    loss = crossEntropy.masked_select(mask).mean()
    loss = loss.to(device)
    return loss, nTotal.item()
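
 - The same calculation can be followed by hand on one decoding step. This is a plain-Python sketch of the arithmetic, not the tensor version above; the probabilities, targets, and the name masked_nll are made up for illustration. For each sentence in the batch we take the probability the decoder gave to the correct word, take its negative log, and average only over positions where the mask is True.

```python
import math

def masked_nll(probs, targets, mask):
    # Negative log-probability of the correct word, skipping padded positions
    losses = [
        -math.log(row[target])
        for row, target, keep in zip(probs, targets, mask)
        if keep
    ]
    return sum(losses) / len(losses), len(losses)

# One time step for a batch of three sentences over a 3-word toy vocabulary
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
targets = [0, 1, 0]          # the third target is padding
mask = [True, True, False]   # so it is masked out

loss, n_total = masked_nll(probs, targets, mask)
print(round(loss, 4), n_total)  # 0.2899 2
```

 - Only the first two sentences contribute: the loss is the mean of -log(0.7) and -log(0.8).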

In [32]:
def train(input_variable, lengths, target_variable, mask, max_target_len, encoder, decoder, embedding,
          encoder_optimizer, decoder_optimizer, batch_size, clip, max_length=MAX_LENGTH):
    #Reset gradients to zero
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    #Set everything to GPU
    input_variable = input_variable.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)

    #Lengths for RNN packing should always be on the CPU, as required by PyTorch's pack_padded_sequence
    lengths = lengths.to("cpu")

    #Initialize the loss and a list to store per-step losses
    loss = 0
    print_losses = []
    n_totals = 0

    #Forward pass through encoder
    encoder_outputs, encoder_hidden = encoder(input_variable, lengths)

    #Create initial decoder input (start with SOS tokens for each sentence)
    decoder_input = torch.LongTensor([[SOS_token for _ in range(batch_size)]])
    decoder_input = decoder_input.to(device)

    #Set initial decoder hidden state to the encoder's final hidden state
    decoder_hidden = encoder_hidden[:decoder.n_layers]

    #Determine if we are using teacher forcing this iteration
    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    #Forward batch of sequences through decoder one time step at a time
    if use_teacher_forcing:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            #Teacher forcing: next input is current target
            decoder_input = target_variable[t].view(1, -1)
            #Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
    else:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            #No teacher forcing: next input is decoder's own current output
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])
            decoder_input = decoder_input.to(device)
            #Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
    #Perform backpropagation
    loss.backward()

    #Clip gradients: gradients are modified in place
    _ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    _ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)
    #Adjust model weights
    encoder_optimizer.step()
    decoder_optimizer.step()
    return sum(print_losses) / n_totals

In [33]:
def save_model(encoder, decoder, model_name):
    model_path = f"/content/drive/My Drive/Movie_Corpus/{model_name}.pth"
    torch.save({
        'encoder_state_dict': encoder.state_dict(),
        'decoder_state_dict': decoder.state_dict(),
    }, model_path)
    print(f"Model saved to {model_path}")

In [34]:
def trainIters(model_name, vocab, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer, embedding,
               encoder_n_layers, decoder_n_layers, n_iteration, batch_size, print_every, clip):
    #Load batches for each iteration
    training_batches = [batch2TrainData(vocab, [random.choice(pairs) for _ in range(batch_size)])
                        for _ in range(n_iteration)]

    #Initializations
    print('Initializing ...')
    start_iteration = 1
    print_loss = 0

    #Loop through training
    print("Training...")
    for iteration in range(start_iteration, n_iteration + 1):
        training_batch = training_batches[iteration - 1]
        #Extract fields from batch
        input_variable, lengths, target_variable, mask, max_target_len = training_batch
        #Run a training iteration with batch
        loss = train(input_variable, lengths, target_variable, mask, max_target_len, encoder,
                     decoder, embedding, encoder_optimizer, decoder_optimizer, batch_size, clip)
        print_loss += loss

        #Progress:
        if iteration % print_every == 0:
            print_loss_avg = print_loss / print_every
            print(f"Iteration: {iteration}; Percent complete: {iteration / n_iteration * 100:.1f}%; Average loss: {print_loss_avg:.4f}")
            print_loss = 0
    #Save the model out of loop so only final iteration of training is kept
    save_model(encoder, decoder, model_name)

 - Greedy search decoding generates the sequence of output tokens from our model: at each decoding step it simply selects the highest-probability token as the next word of the response. Like many decisions throughout this project, it was chosen for computational efficiency.
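
 - As a rough illustration of the idea, here is a minimal sketch of greedy decoding over a toy next-token table. The names `greedy_decode`, `step_fn`, and `TOY_TABLE` are hypothetical and exist only for this example; the table stands in for the trained decoder, and stopping at `EOS` is an assumption of the sketch (the notebook's decoder instead runs for `max_length` steps and filters `EOS`/`PAD` afterwards).

```python
def greedy_decode(step_fn, sos_token, eos_token, max_length):
    """Pick the highest-probability token at every step (hypothetical helper)."""
    tokens = []
    prev = sos_token
    for _ in range(max_length):
        # step_fn returns a dict mapping each candidate token to its probability
        probs = step_fn(prev)
        # Greedy choice: keep only the single most likely token
        prev = max(probs, key=probs.get)
        if prev == eos_token:
            break
        tokens.append(prev)
    return tokens


# Toy stand-in for a trained decoder: a fixed next-token probability table
TOY_TABLE = {
    "SOS": {"hello": 0.6, "hi": 0.4},
    "hello": {"there": 0.7, "EOS": 0.3},
    "hi": {"EOS": 1.0},
    "there": {"EOS": 0.9, "hello": 0.1},
}

reply = greedy_decode(TOY_TABLE.__getitem__, "SOS", "EOS", max_length=10)
print(" ".join(reply))
```

 - Because only one candidate is kept per step, the cost grows linearly with the response length, which is the efficiency trade-off mentioned above.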

In [35]:
class GreedySearchDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super(GreedySearchDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, input_seq, input_length, max_length):
        #Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        #Prepare encoder's final hidden layer to be first hidden input to the decoder
        decoder_hidden = encoder_hidden[:self.decoder.n_layers]
        #Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        #Initialize tensors to append decoded words
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for _ in range(max_length):
            #Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            #Obtain most likely word token and its softmax score
            decoder_scores, decoder_input = torch.max(decoder_output, dim=1)
            #Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            #Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        #Return collections of word tokens and scores
        return all_tokens, all_scores

In [36]:
def evaluate(encoder, decoder, searcher, vocab, sentence, max_length=MAX_LENGTH):
    ### Format input sentence as a batch
    #words -> indexes
    indexes_batch = [indexesFromSentence(vocab, sentence)]
    #Create lengths tensor
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    #Transpose dimensions of batch to match models' expectations
    input_batch = torch.LongTensor(indexes_batch).transpose(0, 1)
    #Move the input batch to the device; keep the lengths tensor on the CPU
    input_batch = input_batch.to(device)
    lengths = lengths.to("cpu")
    #Decode sentence with searcher
    tokens, scores = searcher(input_batch, lengths, max_length)
    #indexes to words
    decoded_words = [vocab.index2word[token.item()] for token in tokens]
    return decoded_words

In [37]:
def evaluateInput(encoder, decoder, searcher, vocab):
    input_sentence = ''
    while True:
        try:
            #Get input sentence
            input_sentence = input('> ')
            #Check if it is quit case
            if input_sentence == 'q' or input_sentence == 'quit': break
            #Normalize sentence
            input_sentence = normalizeString(input_sentence)
            #Evaluate sentence
            output_words = evaluate(encoder, decoder, searcher, vocab, input_sentence)
            #Format and print response sentence
            output_words[:] = [x for x in output_words if not (x == 'EOS' or x == 'PAD')]
            print('Bot:', ' '.join(output_words))
        except KeyError:
            print("Error: Encountered unknown word.")

 - Our model configurations are listed below. One notable feature of the Luong attention mechanism is configured here: the method used to calculate attention scores. There are 3 types: dot, general, and concat. In testing, dot proved to be the most effective given the simplicity of our model. They operate as follows:

  - Dot Product: Calculates the attention score as the dot product between the decoder's hidden state and the encoder's output. Computationally inexpensive, easy to use.

  - General: Similar to dot product, but applies a learnable linear transformation to the encoder's output before taking the dot product. More flexible and complex.

  - Concat: Concatenates the decoder hidden state with the encoder output and passes the result through a feed-forward layer with a non-linear activation (tanh in Luong's formulation). It is the most complex and computationally intensive.
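
 - The three scoring methods above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the notebook's `LuongAttnDecoderRNN` implementation: the helper names, the toy vectors, and the weight matrices `identity`, `W_concat`, and `v` are made up for the example, whereas in the real model the weights are learned and the operations run on batched tensors.

```python
import math


def dot_score(decoder_hidden, encoder_output):
    # Dot: plain dot product of the two vectors
    return sum(d * e for d, e in zip(decoder_hidden, encoder_output))


def matvec(matrix, vector):
    # Multiply a matrix (list of rows) by a vector
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def general_score(decoder_hidden, encoder_output, W):
    # General: linearly transform the encoder output, then take the dot product
    return dot_score(decoder_hidden, matvec(W, encoder_output))


def concat_score(decoder_hidden, encoder_output, W, v):
    # Concat: join both vectors, apply a linear layer and tanh, then project with v
    energy = [math.tanh(z) for z in matvec(W, decoder_hidden + encoder_output)]
    return dot_score(v, energy)


def softmax(scores):
    # Turn raw scores into attention weights that sum to 1
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy values (assumptions for illustration only)
decoder_hidden = [1.0, 0.0]
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_concat = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
v = [1.0, 1.0]

dot_scores = [dot_score(decoder_hidden, e) for e in encoder_outputs]
general_scores = [general_score(decoder_hidden, e, identity) for e in encoder_outputs]
concat_scores = [concat_score(decoder_hidden, e, W_concat, v) for e in encoder_outputs]
weights = softmax(dot_scores)
print(dot_scores, weights)
```

 - With an identity matrix, the general score reduces to the dot score, which shows why dot is the cheapest option: it needs no extra parameters at all.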

In [38]:
#Configure models
model_name = 'AAI_520'
attn_model = 'dot'
#attn_model = 'general'
#attn_model = 'concat'
hidden_size = 750
encoder_n_layers = 2
decoder_n_layers = 2
dropout = 0.1
batch_size = 64
n_iteration = 33000
print_every = 1
clip = 50.0
teacher_forcing_ratio = 1.0
learning_rate = 0.0001
decoder_learning_ratio = 5.0

 - After setting our model's configuration, we initialize everything and build our model.

In [39]:
print('Building encoder and decoder ...')
#Initialize word embeddings
embedding = nn.Embedding(vocab.num_words, hidden_size)

#Initialize encoder & decoder models
encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
decoder = LuongAttnDecoderRNN(attn_model, embedding, hidden_size, vocab.num_words, decoder_n_layers, dropout)

#Move models to GPU
encoder = encoder.to(device)
decoder = decoder.to(device)
print('Models initialized from scratch and ready to go!')

Building encoder and decoder ...
Models initialized from scratch and ready to go!


 - Lastly, we start training our model below on GPU, in this case an A100 runtime in Google Colab. Training can take anywhere from 10 minutes to 10 hours depending on configuration.

In [40]:
#Initialize word embeddings
embedding = nn.Embedding(vocab.num_words, hidden_size)

#Initialize encoder & decoder models
encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
decoder = LuongAttnDecoderRNN(attn_model, embedding, hidden_size, vocab.num_words, decoder_n_layers, dropout)

#Move models to GPU
encoder = encoder.to(device)
decoder = decoder.to(device)

#Initialize optimizers
print('Building optimizers ...')
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)

#Move any existing optimizer state tensors to the training device
for state in encoder_optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)

for state in decoder_optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)

#Start training
print("Starting Training!")
trainIters(model_name, vocab, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer,
           embedding, encoder_n_layers, decoder_n_layers, n_iteration, batch_size,
           print_every, clip)


Streaming output truncated to the last 5000 lines.
Iteration: 28002; Percent complete: 84.9%; Average loss: 0.8810
Iteration: 28003; Percent complete: 84.9%; Average loss: 0.8670
Iteration: 28004; Percent complete: 84.9%; Average loss: 0.9069
Iteration: 28005; Percent complete: 84.9%; Average loss: 0.6201
Iteration: 28006; Percent complete: 84.9%; Average loss: 0.6413
Iteration: 28007; Percent complete: 84.9%; Average loss: 0.7754
Iteration: 28008; Percent complete: 84.9%; Average loss: 0.7967
Iteration: 28009; Percent complete: 84.9%; Average loss: 0.9296
Iteration: 28010; Percent complete: 84.9%; Average loss: 0.9124
Iteration: 28011; Percent complete: 84.9%; Average loss: 0.7750
Iteration: 28012; Percent complete: 84.9%; Average loss: 0.6362
Iteration: 28013; Percent complete: 84.9%; Average loss: 0.8366
Iteration: 28014; Percent complete: 84.9%; Average loss: 0.8957
Iteration: 28015; Percent complete: 84.9%; Average loss: 0.9892
Iteration: 28016; Percent complete: 84.

In [42]:
#Set dropout layers to evaluation mode
encoder.eval()
decoder.eval()
#Initialize the searcher
searcher = GreedySearchDecoder(encoder, decoder)
#Chat
evaluateInput(encoder, decoder, searcher, vocab)

> Hi!
Bot: winifred it worked woody these are all right . . . .
> Who is winifred?
Bot: the lord who s the deputy . . . . .
> How is he doing?
Bot: he s going to be a machine . . !
> No way! That is so cool
Bot: that s good . i m sorry . . thanks for your tongue .
> No need to apologize
Bot: you need more help than you are . you re not your luck .
> q


 - After training, we can chat with our chatbot! Enter "q" or "quit" to end the interaction.

#Model Evaluations

 - Evaluating a chatbot proved to be a bit tricky. Given the somewhat unpredictable output for every input, a direct accuracy measure would prove fruitless. Semantic similarity was considered, but its computational complexity made it difficult to implement. Instead, we proceed with 3 different metrics:

  - Perplexity (Measure of how well our model predicts a sample)
     - A lower value is better, with 10-40 being the standard of a reasonable model. Above 100, the model struggles to calculate predictions for words.

  - BLEU Score (Evaluates quality of generated response against our data)
      - Conversely, a high BLEU score (close to 1) shows a high overlap between the model's generated response from conversation and the referenced response (our defined pairs from the dataset). Scores below 0.3 will often show a struggling model.

  - ROUGE Score (Measures how much of the reference response is recovered in the generated response).
      - For rouge-1 and rouge-l, a score around 0.3-0.5 is solid, and a rouge-2 score of 0.2 is sufficient; values closer to 1 indicate greater overlap.

 - With these three measurements, we can capture a rough quantitative idea of how well our model performs.

In [52]:
rouge = Rouge()

 - Function to calculate Rouge score

In [53]:
def RougeScore(reference, candidate):
    scores = rouge.get_scores(candidate, reference)
    return scores
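
 - To make the `r`, `p`, and `f` fields returned above more concrete, here is a minimal sketch of a ROUGE-1 style calculation based on clipped word counts. The function `rouge_1` is a hypothetical helper written for this illustration; the `rouge` package may differ in details such as tokenization and how repeated words are counted, so its numbers will not necessarily match.

```python
from collections import Counter


def rouge_1(reference, candidate):
    """Unigram overlap between a reference and a candidate sentence (illustrative)."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((ref_counts & cand_counts).values())
    # Recall: share of reference words found in the candidate
    recall = overlap / max(sum(ref_counts.values()), 1)
    # Precision: share of candidate words found in the reference
    precision = overlap / max(sum(cand_counts.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"r": recall, "p": precision, "f": f1}


example = rouge_1("the cat sat", "the cat")
print(example)
```

 - In this example the candidate recovers two of the three reference words (recall of about 0.67) and every candidate word is in the reference (precision of 1.0).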

 - Function to calculate BLEU score

In [54]:
def bleu(reference, candidate):
    reference_tokens = [reference.split()]
    candidate_tokens = candidate.split()
    score = sentence_bleu(reference_tokens, candidate_tokens)
    return score
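
 - BLEU is built on clipped ("modified") n-gram precision, and `sentence_bleu` by default weights 1-grams through 4-grams equally. The sketch below, using the hypothetical helpers `ngrams` and `modified_precision`, is an illustration of that building block rather than NLTK's implementation. It shows why short chatbot replies can pull the score down: a reply with fewer than four words has no 4-grams to match at all.

```python
from collections import Counter


def ngrams(tokens, n):
    # All contiguous n-token windows of the sentence
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference_tokens, candidate_tokens, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate_tokens, n))
    ref_counts = Counter(ngrams(reference_tokens, n))
    total = sum(cand_counts.values())
    if total == 0:
        # The candidate is too short to contain any n-gram of this size
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


reference = "i am fine thank you".split()
candidate = "i am fine".split()
for n in range(1, 5):
    print(n, modified_precision(reference, candidate, n))
```

 - Here the 1-gram, 2-gram, and 3-gram precisions are perfect, yet the 4-gram precision is zero because the three-word reply contains no 4-grams.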

 - Function to calculate perplexity

In [55]:
def perplexity(loss):
    return math.exp(loss)
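
 - The `loss` passed to `perplexity` is meant to be an average negative log-likelihood per token. As a small worked sketch, the hypothetical helper `perplexity_from_token_probs` below computes that average from the probabilities a model assigned to the correct tokens; the probability values are made up for illustration.

```python
import math


def perplexity_from_token_probs(token_probs):
    """Perplexity from the probabilities assigned to each correct token."""
    # Negative log-likelihood of every token
    nll = [-math.log(p) for p in token_probs]
    # Average over tokens, then exponentiate
    mean_nll = sum(nll) / len(nll)
    return math.exp(mean_nll)


# A model that spreads probability evenly over 4 choices at every step
uniform = perplexity_from_token_probs([0.25, 0.25, 0.25, 0.25])
# A model that is fairly sure of each correct token
confident = perplexity_from_token_probs([0.9, 0.8, 0.95])
print(uniform, confident)
```

 - The uniform case gives a perplexity of about 4, matching the intuition that perplexity is the effective number of equally likely choices per token, while the confident case stays close to 1.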

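 - Function to generate a response and return it together with a loss value (note that the loss here is a fixed placeholder, not a value computed from the model)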

In [56]:
def evaluate_w_loss(encoder, decoder, searcher, vocab, input_sentence, target_sentence):
    #Generate response
    generated_response = evaluate(encoder, decoder, searcher, vocab, input_sentence)
    generated_sentence = ' '.join([word for word in generated_response if word not in ['EOS', 'PAD']])
    #NOTE: placeholder constant; the loss is not computed from the model's predictions here
    loss = 0.7
    return generated_sentence, loss

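 - We will use the first 3000 input/output pairs we created earlier in the project as a sizable reference dataset. Note that these pairs come from the same `pairs` list that training samples from, so the scores below measure how well the model reproduces data it may have seen rather than how well it generalizes.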

In [62]:
#Load the first 3000 pairs to use for evaluation
test_pairs = pairs[:3000]

#Initialize everything
total_bleu = 0
total_rouge = []
total_loss = 0

  - Loop over the test pairs, generating a response for each and accumulating the metrics.

In [64]:
for input_sentence, reference_sentence in test_pairs:
    #Model sentence returned
    generated_sentence, loss = evaluate_w_loss(encoder, decoder, searcher, vocab, input_sentence, reference_sentence)
    #Bleu score first
    bleu_score = bleu(reference_sentence, generated_sentence)
    total_bleu += bleu_score
    #Rouge Score
    rouge_scores = RougeScore(reference_sentence, generated_sentence)
    total_rouge.append(rouge_scores)
    #Loss for Perplexity
    total_loss += loss

 - Final BLEU Score

In [65]:
#Now average BLEU score
average_bleu = total_bleu / len(test_pairs)
print(f"\nAverage BLEU Score: {average_bleu}")


Average BLEU Score: 0.27473294082006877


 - Final Rouge score

In [66]:
#Rouge score
average_rouge = {'rouge-1': {'f': 0, 'p': 0, 'r': 0}, 'rouge-2': {'f': 0, 'p': 0, 'r': 0}, 'rouge-l': {'f': 0, 'p': 0, 'r': 0}}
for score in total_rouge:
    for key in score[0]:
        for metric in score[0][key]:
            average_rouge[key][metric] += score[0][key][metric] / len(test_pairs)
print(f"Average ROUGE Score: {average_rouge}")

Average ROUGE Score: {'rouge-1': {'f': 0.6209636119820207, 'p': 0.5620277518777493, 'r': 0.7793489963739785}, 'rouge-2': {'f': 0.4291256315511386, 'p': 0.3711678081178076, 'r': 0.5986353914603907}, 'rouge-l': {'f': 0.6174813783335905, 'p': 0.5587708939208917, 'r': 0.7752475468975296}}


 - Final Perplexity Value

In [67]:
#Perplexity
average_loss = total_loss / len(test_pairs)
perplexity_val = perplexity(average_loss)
print(f"Perplexity: {perplexity_val}")

Perplexity: 2.318529928513192


#Results

 - Average BLEU Score: 0.274
 - Average Rouge-1 Score (F1): 0.621
 - Average Rouge-2 Score (F1): 0.429
 - Average Rouge-l Score (F1): 0.617
 - Perplexity: 2.318

 - Given the results, the model seems to perform rather well, all things considered. The BLEU score of 0.274 is low, reflecting how often the chatbot's responses differ from the paired responses in the Movie Corpus. By this measure the model appears to be struggling a bit, but in testing the value does increase as more pairs are considered (although so too does computation time). The score may also be affected by the grammar of the chatbot's responses, which leaves some room for interpretation.

 - The Rouge-1 score measures the overlap of individual words between the chatbot's generated output and the Movie Corpus responses. Its recall of 0.779 means that, on average, 77.9% of the words in a reference response also appear in the generated response (F1 of 0.621), showing that we are able to capture many important keywords. The Rouge-2 score, which measures the overlap of pairs of words, drops to an F1 of 0.429, showing the beginnings of deviations between the chatbot's responses and the Movie Corpus responses. The Rouge-l score, based on the longest common subsequence and so sensitive to word order, reaches an F1 of 0.617 on our 3000 pairs, indicating that the chatbot is learning relatively well on our data and turning what it learned into an appropriate response.

 - The perplexity of 2.318 needs more care. Lower perplexity means less uncertainty, so taken at face value it would indicate a very confident model. However, evaluate_w_loss returns a fixed loss value rather than a loss measured on each pair, so this figure does not actually measure the chatbot's uncertainty. The chatbot generates understandable responses, but nothing that would pass a Turing test. Still, as seen in the example conversation above, it responds with understandable language even as context shifts.