# A Simple Chatbot Using PyTorch


In this notebook, we build a chatbot based on the original Transformer architecture. The training data comes from the Cornell Movie-Dialogs Corpus, which is available <a href="https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html">here</a>.

## Imports

In [166]:
import torch
from torch.jit import script, trace
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math
import json
import pandas as pd 
import torchtext
from torchtext.data import get_tokenizer


USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

## Data Preparation 

The movie corpus data has to be prepared so that it can be fed to the model we create later on. We first take a look at the data to see what steps need to be taken for this. 

In [167]:
corpus_name = "movie-corpus"
corpus = os.path.join("data", corpus_name)

def printLines(file, n=10):
    """Print the first n lines of a file"""
    with open(file, 'rb') as datafile:
        lines = datafile.readlines()
    for line in lines[:n]:
        print(line)

printLines(os.path.join(corpus, "utterances.jsonl"))

b'{"id": "L1045", "conversation_id": "L1044", "text": "They do not!", "speaker": "u0", "meta": {"movie_id": "m0", "parsed": [{"rt": 1, "toks": [{"tok": "They", "tag": "PRP", "dep": "nsubj", "up": 1, "dn": []}, {"tok": "do", "tag": "VBP", "dep": "ROOT", "dn": [0, 2, 3]}, {"tok": "not", "tag": "RB", "dep": "neg", "up": 1, "dn": []}, {"tok": "!", "tag": ".", "dep": "punct", "up": 1, "dn": []}]}]}, "reply-to": "L1044", "timestamp": null, "vectors": []}\n'
b'{"id": "L1044", "conversation_id": "L1044", "text": "They do to!", "speaker": "u2", "meta": {"movie_id": "m0", "parsed": [{"rt": 1, "toks": [{"tok": "They", "tag": "PRP", "dep": "nsubj", "up": 1, "dn": []}, {"tok": "do", "tag": "VBP", "dep": "ROOT", "dn": [0, 2, 3]}, {"tok": "to", "tag": "TO", "dep": "dobj", "up": 1, "dn": []}, {"tok": "!", "tag": ".", "dep": "punct", "up": 1, "dn": []}]}]}, "reply-to": null, "timestamp": null, "vectors": []}\n'
b'{"id": "L985", "conversation_id": "L984", "text": "I hope so.", "speaker": "u0", "meta": {

As can be seen, the utterances are stored in a JSON Lines file, with one JSON object per line. We load this into a DataFrame, as that makes the data easier to work with. 
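
To make the grouping step concrete, here is a toy sketch using only the two sample utterances printed above (the `toy` DataFrame is made up for illustration and keeps just three of the fields). In that sample the reply appears before the line it answers, which is why the collected list is reversed.

```python
import pandas as pd

# Toy rows mimicking the fields shown above; the real file has many more columns.
toy = pd.DataFrame({
    "id": ["L1045", "L1044"],
    "conversation_id": ["L1044", "L1044"],
    "text": ["They do not!", "They do to!"],
})

# Collect every utterance of a conversation into one list, in file order ...
grouped = toy.groupby("conversation_id").agg({"text": lambda x: x.tolist()}).reset_index()
# ... then reverse it so the first line of the conversation comes first.
grouped["text"] = grouped["text"].apply(lambda utterances: utterances[::-1])
print(grouped.loc[0, "text"])  # ['They do to!', 'They do not!']
```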

In [168]:
df = pd.read_json("data/movie-corpus/utterances.jsonl", lines=True)
df = (df.groupby(['conversation_id']).agg({'text': lambda x: x.tolist()})).reset_index()
df["text"] = df["text"].apply(lambda x: x[::-1])

We now take a peek at the wonderful DataFrame we just created! 

In [236]:
df.head()

| Unnamed: 0 | conversation_id | text |
|---|---|---|
| 0 | L100001 | [then why did you go see mr . koehler in the f... |
| 1 | L100003 | [hi joe ., frank what are you doing here ?, i ... |
| 2 | L10001 | [those guys ain t so tough . i fought plenty o... |
| 3 | L100011 | [hello ?, frank it s rebecca . i need to see y... |
| 4 | L100016 | [you killed him . you killed him and i got you... |


## Data Preprocessing 

We now create a vocabulary of words from the conversations loaded previously. But why do we even need this? Let's analyse this a bit deeper.

Right now we have a bunch of sentences in our DataFrame which the model cannot process directly. Models only work with numbers, so we first need to convert these sentences into numbers. How would we do that?

Now, let's say we have a dictionary that maps every word to a unique integer ID, and a second one that maps each ID back to its word. We could then easily convert each word into a number the model can use, and convert the model's output back into words. This is why we create the vocabulary.
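
As a minimal sketch of that idea (the helper `build_vocab` and the toy sentences are made up for illustration and are not part of the notebook), a vocabulary boils down to two lookups between words and integer IDs. IDs 0 to 2 are left free here because the notebook reserves them for special tokens.

```python
# Toy illustration only: build_vocab and the sample sentences are hypothetical.
def build_vocab(sentences, first_id=3):
    """Assign each new word the next free integer ID, starting at first_id."""
    word2index = {}
    for sentence in sentences:
        for word in sentence.split(" "):
            if word not in word2index:
                word2index[word] = first_id + len(word2index)
    index2word = {idx: word for word, idx in word2index.items()}
    return word2index, index2word

word2index, index2word = build_vocab(["can i go ?", "then you can go ."])
encoded = [word2index[w] for w in "you can go".split(" ")]
print(encoded)                           # [8, 3, 5]
print([index2word[i] for i in encoded])  # ['you', 'can', 'go']
```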

To make the vocabulary creation more understandable, we create the "Voc" class. 

In [169]:
# Default word tokens
PAD_token = 0  # Used for padding short sentences
SOS_token = 1  # Start-of-sentence token
EOS_token = 2  # End-of-sentence token

class Voc:
    def __init__(self, name):
        self.name = name
        self.trimmed = False
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3  # Count SOS, EOS, PAD

    def addSentence(self, sentence):
        """Takes the sentence, splits into words and then adds that to the dictionary"""
        for word in sentence.split(' '):
            self.addWord(word)
        
    def addTokens(self, tokens):
        """Takes the tokens prepared using the PyTorch library and adds that to the dictionary"""
        for token in tokens:
            self.addWord(token)

    def addWord(self, word):
        """Adds a new word to the dictionaries, or increases the count of a word that is already known"""
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1

But we cannot create the vocabulary just yet! We have to clean up our sentences so that they contain only relevant words and no gibberish or special characters. In typical text preprocessing, we do the following:
<ul>
<li>Standardize the case of the string</li>
<li>Remove special characters for a simpler processing</li>
</ul>

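We also drop sentences which are too long, so as to avoid too much resource consumption in the model training phase. The threshold is the parameter MAX_LENGTH: any sentence with MAX_LENGTH (here 10) or more words is removed from the input data. Conversations with more than two utterances are dropped as well, so that every remaining conversation is a single question/answer pair. 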
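
To see what these steps do to one sentence, here is a small stand-alone sketch. The function `normalize_demo` and the constant `MAX_WORDS` are illustrative names that mirror the cell that follows. Note how the apostrophe is replaced by a space, which is why fragments such as "ain t" show up in the data.

```python
import re

MAX_WORDS = 10  # mirrors MAX_LENGTH in the next cell

def normalize_demo(s):
    """Hypothetical stand-alone copy of the normalization steps, for illustration."""
    s = s.lower().strip()
    s = re.sub(r"([.!?])", r" \1", s)       # put a space before ., ! and ?
    s = re.sub(r"[^a-zA-Z.!?]+", " ", s)    # replace every other non-letter run with a space
    return re.sub(r"\s+", " ", s).strip()   # collapse repeated whitespace

cleaned = normalize_demo("Those guys ain't so tough.")
print(cleaned)                                # those guys ain t so tough .
print(len(cleaned.split(" ")) < MAX_WORDS)    # True
```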

In [170]:
MAX_LENGTH = 10  # Maximum sentence length to consider

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = s.lower().strip()
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

def normalizeTexts(texts):
    newTexts = []
    for s in texts:
        s = s.lower().strip()
        s = re.sub(r"([.!?])", r" \1", s)
        s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
        s = re.sub(r"\s+", r" ", s).strip()
        newTexts.append(s)
    return newTexts

# Returns True if the conversation is exactly one question/answer pair and
# both sentences have fewer than MAX_LENGTH words
def filterCondition(texts):
    if len(texts) != 2:
        return False
    return all(len(sentence.split(" ")) < MAX_LENGTH for sentence in texts)

def filterData(pairs):
    return [texts for texts in pairs if filterCondition(texts)]

# Normalize every utterance, then keep only the conversations that pass filterCondition
def prepareData(df):
    print("Start preparing training data ...")
    df["text"] = df["text"].apply(normalizeTexts)
    print("Read {!s} sentence pairs".format(len(df)))
    # .copy() returns an independent DataFrame, so adding columns later does not
    # trigger pandas' SettingWithCopyWarning
    new_df = df.loc[df["text"].apply(filterCondition)].copy()
    print("Trimmed to {!s} sentence pairs".format(len(new_df)))
    return new_df


# Build the filtered DataFrame of question/answer pairs
input_data = prepareData(df)
# Print some pairs to validate
print("\nTexts:")
for idx in range(10):
    print(input_data.iloc[idx])

Start preparing training data ...


Read 83097 sentence pairs
Trimmed to 12043 sentence pairs

Texts:
conversation_id                                  L100050
text               [can i go ?, you get his statement ?]
Name: 10, dtype: object
conversation_id                        L100052
text               [yeah ., then you can go .]
Name: 11, dtype: object
conversation_id                                    L100071
text               [are you seeing betty tonight ?, nah .]
Name: 14, dtype: object
conversation_id                                           L10012
text               [well well . mrs . brigman ., not for long .]
Name: 27, dtype: object
conversation_id                                              L100133
text               [it s a beautiful picture of her ., why are th...
Name: 32, dtype: object
conversation_id                                              L100157
text               [carolyn you want these candlesticks ?, no . y...
Name: 38, dtype: object
conversation_id                                           

Now we could start making the vocabulary right away no problem! However, I want to simplify the DataFrame a bit more so that I have an easier time dealing with it. So I split the sentence pairs into questions, answers and then proceed further. 

In [171]:
input_data.loc[:, "questions"] = input_data.loc[:, "text"].apply(lambda l: l[0])
input_data.loc[:, "answers"] = input_data.loc[:, "text"].apply(lambda l: l[1])



Finally comes the vocabulary part! We first split each sentence into "tokens" using a tokenizer. A token is not always the same as a word (punctuation marks, for example, become tokens of their own), and every token is then added to the vocabulary. 

In [172]:
tokenizer = get_tokenizer("basic_english")
input_data.loc[:, "question_tokens"] = input_data.loc[:, "questions"].apply(tokenizer)
input_data.loc[:, "answer_tokens"] = input_data.loc[:, "answers"].apply(tokenizer)


voc = Voc(corpus_name)

input_data.loc[:, "question_tokens"].apply(voc.addTokens)
input_data.loc[:, "answer_tokens"].apply(voc.addTokens)

voc.num_words



8195

We now move on to converting the sentences into number sequences, with each number being an ID for the token in the vocabulary. Since the model needs to know when a sequence starts and ends, we add special SOS_token and EOS_token. 

SOS = Start Of Sequence, EOS = End Of Sequence 

In [237]:
input_tokens = input_data.loc[:, "question_tokens"].apply(
    lambda l: [SOS_token] + [voc.word2index[ele] for ele in l] + [EOS_token])
output_tokens = input_data.loc[:, "answer_tokens"].apply(
    lambda l: [SOS_token] + [voc.word2index[ele] for ele in l] + [EOS_token])

In [198]:
input_data.loc[:, "question_tokens"]

10                                     [can, i, go, ?]
11                                           [yeah, .]
14               [are, you, seeing, betty, tonight, ?]
27                 [well, well, ., mrs, ., brigman, .]
32          [it, s, a, beautiful, picture, of, her, .]
                             ...                      
83065                               [no, questions, .]
83076                      [objection, your, honor, .]
83086                                     [i, know, .]
83089    [let, me, go, !, godammit, frank, let, go, !]
83092             [you, did, a, good, job, charlie, .]
Name: question_tokens, Length: 12043, dtype: object

In [174]:
# quick check for sequences longer than 10 tokens (including SOS and EOS): this sums the lengths of
# those sequences for the questions and for the answers. If there are not too many, we just drop them
input_tokens.apply(lambda l: len(l) if len(l) > 10 else 0).sum(), output_tokens.apply(lambda l: len(l) if len(l) > 10 else 0).sum()

(12837, 13225)

All the sequences we created have different lengths, but the sequences in a batch have to be stacked into a single tensor, which requires them to share one length. How do we solve this then? With padding, of course! 

We add a bunch of "PAD_tokens" at the end of each sequence so that all sequences have the same length, PAD_LENGTH. They should mean absolutely nothing to the model, and we make sure of that by adding a mask in the architecture. More on this later. 
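
As a small sketch of the idea in plain Python (the helpers `pad_to_length` and `token_mask` are illustrative names, not the notebook's functions), padding a sequence and marking which positions hold real tokens looks like this:

```python
PAD = 0  # same ID as PAD_token above

def pad_to_length(ids, length, pad_id=PAD):
    """Right-pad a list of token IDs; sequences already at or over `length` are returned unchanged."""
    return ids + [pad_id] * max(0, length - len(ids))

def token_mask(ids, pad_id=PAD):
    """1 for real tokens, 0 for padding positions."""
    return [0 if tok == pad_id else 1 for tok in ids]

padded = pad_to_length([1, 3, 4, 5, 6, 2], length=10)
print(padded)              # [1, 3, 4, 5, 6, 2, 0, 0, 0, 0]
print(token_mask(padded))  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```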

In [238]:
PAD_LENGTH = 15

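def zeroPadding(l, fillvalue=PAD_token):
    # Right-pad up to PAD_LENGTH; sequences longer than PAD_LENGTH are returned unchanged
    return l + [fillvalue] * (PAD_LENGTH - len(l))

def padMask(l, value=PAD_token):
    # 0 where an element equals `value`, 1 elsewhere
    return [0 if ele == value else 1 for ele in l]

# Returns the padded input sequences as a list of tensors, plus their lengths.
# The lengths are measured after padding, so they equal PAD_LENGTH for every
# sequence that was not already longer
def inputVar(l, voc):
    input_tokens = l.apply(zeroPadding)
    lengths = torch.tensor([len(indexes) for indexes in input_tokens])
    padVar = [torch.LongTensor(padElement) for padElement in input_tokens]
    return padVar, lengths

# Returns the padded target sequences, a mask, and the max (padded) target length.
# padMask is applied to the padded sequence lengths here, so the mask holds one
# entry per sequence rather than per token. The training loop ignores it because
# the Transformer builds its own masks
def outputVar(l, voc):
    output_tokens = l.apply(zeroPadding)
    max_target_len = max([len(indexes) for indexes in output_tokens])
    padmask = padMask(output_tokens.apply(len))
    padmask = torch.BoolTensor(padmask)
    padVar = [torch.LongTensor(padElement) for padElement in output_tokens]
    return padVar, padmask, max_target_len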

# Returns all items for a given batch of pairs
def batch2TrainData(voc, input_tokens, output_tokens, num_of_pairs):
    input_tokens_list = input_tokens.iloc[:num_of_pairs]
    output_tokens_list = output_tokens.iloc[:num_of_pairs]
    inp, lengths = inputVar(input_tokens_list, voc)
    output, mask, max_target_len = outputVar(output_tokens_list, voc)
    return inp, lengths, output, mask, max_target_len


# Example for validation
small_batch_size = 5
batches = batch2TrainData(voc, input_tokens, output_tokens, small_batch_size)
input_variable, lengths, target_variable, mask, max_target_len = batches

print("input_variable:", input_variable)
print("lengths:", lengths)
print("target_variable:", target_variable)
print("mask:", mask)
print("max_target_len:", max_target_len)

input_variable: [tensor([1, 3, 4, 5, 6, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]), tensor([1, 7, 8, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), tensor([ 1,  9, 10, 11, 12, 13,  6,  2,  0,  0,  0,  0,  0,  0,  0]), tensor([ 1, 14, 14,  8, 15,  8, 16,  8,  2,  0,  0,  0,  0,  0,  0]), tensor([ 1, 17, 18, 19, 20, 21, 22, 23,  8,  2,  0,  0,  0,  0,  0])]
lengths: tensor([15, 15, 15, 15, 15])
target_variable: [tensor([   1,   10,  188,  319, 2672,    6,    2,    0,    0,    0,    0,    0,
           0,    0,    0]), tensor([  1, 358,  10,   3,   5,   8,   2,   0,   0,   0,   0,   0,   0,   0,
          0]), tensor([   1, 2194,    8,    2,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0]), tensor([  1, 111, 144, 654,   8,   2,   0,   0,   0,   0,   0,   0,   0,   0,
          0]), tensor([   1,  224,    9,   41,  855, 3559,  108,    6,    2,    0,    0,    0,
           0,    0,    0])]
mask: tensor([True, True, True, True, True])
max_target_len: 15


## Model Architecture

Modelling time! 

We shall be using the original Transformer architecture from the "Attention Is All You Need" paper. One detail of the implementation is worth pointing out: 

<ul>
<li>Each attention block receives its input three times, as query, key and value (for self-attention, all three are the same tensor). The learned weight matrices W_q, W_k and W_v inside the block then project these inputs into the actual Query, Key and Value vectors</li>
</ul>


If you do not know the architecture, I recommend reading the handout file which I included in the repo. Of course, it only skims over the architecture, but I strongly believe that it gives a good overview, and you can dive deeper on the internet when necessary. 

We have the following blocks to create: 
- Multi Head Attention Block - which implements the attention formula 
- Position-Wise Feed Forward Block and Positional Encoding - used inside the Encoder and Decoder 
- Encoder Block - containing all the "Add & Norm", Attention layers needed 
- Decoder Block - again containing all the necessary layers  
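
The attention formula mentioned in the list above is scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V. Here is a minimal sketch of it on random toy tensors; the function name `attention_demo` and the shapes are assumptions for illustration, separate from the class defined in the notebook.

```python
import math
import torch

def attention_demo(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get a large negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V, weights

torch.manual_seed(0)
Q = K = V = torch.randn(1, 4, 8)           # (batch, seq_len, d_k)
mask = torch.tensor([[[1, 1, 1, 0]]])      # pretend the last key position is padding
out, weights = attention_demo(Q, K, V, mask)
print(out.shape)              # torch.Size([1, 4, 8])
print(weights.sum(dim=-1))    # every row of attention weights sums to 1
print(weights[0, :, -1])      # weights on the padded key are ~0
```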

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        # Ensure that the model dimension (d_model) is divisible by the number of heads
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        # Initialize dimensions
        self.d_model = d_model # Model's dimension
        self.num_heads = num_heads # Number of attention heads
        self.d_k = d_model // num_heads # Dimension of each head's key, query, and value
        
        # Linear layers for transforming inputs
        self.W_q = nn.Linear(d_model, d_model) # Query transformation
        self.W_k = nn.Linear(d_model, d_model) # Key transformation
        self.W_v = nn.Linear(d_model, d_model) # Value transformation
        self.W_o = nn.Linear(d_model, d_model) # Output transformation
        
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Calculate attention scores
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        # Apply mask if provided (useful for preventing attention to certain parts like padding)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        
        # Softmax is applied to obtain attention probabilities
        attn_probs = torch.softmax(attn_scores, dim=-1)
        
        # Multiply by values to obtain the final output
        output = torch.matmul(attn_probs, V)
        return output
        
    def split_heads(self, x):
        # Reshape the input to have num_heads for multi-head attention
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        
    def combine_heads(self, x):
        # Combine the multiple heads back to original shape
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)
        
    def forward(self, Q, K, V, mask=None):
        # Apply linear transformations and split heads
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        
        # Perform scaled dot-product attention
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        
        # Combine heads and apply output transformation
        output = self.W_o(self.combine_heads(attn_output))
        return output

In [None]:
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()
        
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer('pe', pe.unsqueeze(0))
        
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

In [None]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

In [None]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, enc_output, src_mask, tgt_mask):
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

I mentioned a "mask" before the architecture, if you were paying attention (ba dum tss!). This is where it comes into play! 

The following code combines all the blocks we have built so far into a final Transformer class. In this class, we see a function called "generate_mask" that returns a src_mask and a tgt_mask. 

The src_mask marks which source positions hold PAD_tokens, so that those positions can purposefully be given very low attention scores (check the MultiHeadAttention block for this). The tgt_mask hides all the future tokens from each target position, so that the Decoder cannot cheat and has to actually learn. 
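
To make the two masks tangible, here is a sketch on one tiny, made-up pair of sequences. It follows the same steps as `generate_mask` in the class below, except that it builds the "no peek" part with `torch.tril`, which gives the same lower-triangular matrix as `1 - torch.triu(..., diagonal=1)`.

```python
import torch

PAD = 0
src = torch.tensor([[1, 7, 8, 2, 0, 0]])   # one padded source sequence
tgt = torch.tensor([[1, 5, 9, 2, 0]])      # one padded target sequence

# Padding mask over the source positions: shape (batch, 1, 1, src_len)
src_mask = (src != PAD).unsqueeze(1).unsqueeze(2)

# Padding mask over the target positions, combined with a lower-triangular "no peek" mask
seq_len = tgt.size(1)
pad_mask = (tgt != PAD).unsqueeze(1).unsqueeze(3)            # (batch, 1, tgt_len, 1)
nopeek = torch.tril(torch.ones(1, seq_len, seq_len)).bool()  # (1, tgt_len, tgt_len)
tgt_mask = pad_mask & nopeek                                 # (batch, 1, tgt_len, tgt_len)

print(src_mask.shape)        # torch.Size([1, 1, 1, 6])
print(tgt_mask[0, 0].int())  # row t allows positions 0..t; the row of the PAD token is all zeros
```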

In [None]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)

        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        seq_length = tgt.size(1)
        # Build the no-peek mask on the same device as tgt so this also works on a GPU
        nopeek_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length, device=tgt.device), diagonal=1)).bool()
        tgt_mask = tgt_mask & nopeek_mask
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))

        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)

        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)

        output = self.fc(dec_output)
        return output

Now we just create the Transformer with a set of chosen Hyperparameters 

In [184]:
src_vocab_size = voc.num_words
d_model = 512
tgt_vocab_size = voc.num_words # source and target vocab belong to the same Voc object 
num_heads = 8
num_layers = 6 # number of encoder, decoder stacks
d_ff = 2048 # dimensionality of the hidden layer for the feed forward neural network 
max_seq_length = PAD_LENGTH # same as the original input because you add the input tensor to the Positional Encoding tensor   
dropout = 0.1 # dropout probability. Initial value 0.1 

transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)

Training time! 

We could take a larger batch size or iterate over the whole dataset, but for the purpose of this notebook I train on a single batch of 64 sentence pairs for 100 epochs, since my system can only handle so much training.  
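
The training cell that follows always takes the first `batch_size` pairs. If you want each step to see different data, one possible approach is to shuffle the example indices and walk through them in chunks. The helper `iter_minibatches` below is a hypothetical sketch of that idea and is not used by the notebook.

```python
import random

def iter_minibatches(num_examples, batch_size, seed=0):
    """Yield lists of example indices covering the dataset once, in shuffled order."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    for start in range(0, num_examples, batch_size):
        yield indices[start:start + batch_size]

batches = list(iter_minibatches(num_examples=200, batch_size=64))
print([len(b) for b in batches])  # [64, 64, 64, 8]
```

Each list of indices could then select rows with something like `input_tokens.iloc[batch]` before padding and stacking.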

In [185]:
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

transformer.train()

batch_size = 64 # number of sentence pairs in the batch

for epoch in range(100):
    optimizer.zero_grad()

    # training data preparation: this always takes the first batch_size pairs,
    # so every epoch trains on the same batch
    training_batch = batch2TrainData(voc, input_tokens, output_tokens, batch_size)
    input_variable, _, target_variable, _, _ = training_batch

    # input_variable and target_variable are lists of 1-D tensors right now.
    # Keep only the pairs where both sequences are exactly PAD_LENGTH long
    # (zeroPadding does not truncate sequences that are longer than PAD_LENGTH)
    temp = [(input_var, target_var) for input_var, target_var in zip(input_variable, target_variable) 
                                         if ((len(input_var) == PAD_LENGTH) and (len(target_var) == PAD_LENGTH))]
    res = list(zip(*temp))
    input_variable = list(res[0])
    target_variable = list(res[1])

    input_variable = torch.stack(input_variable, dim=0)
    target_variable = torch.stack(target_variable, dim=0)

    # decoder input: the target without its last token
    output = transformer(input_variable, target_variable[:, :-1])
    # labels: the target without its first token, so position t is trained to predict token t+1
    loss = criterion(output.contiguous().view(-1, tgt_vocab_size), target_variable[:, 1:].contiguous().view(-1)) 
    loss.backward()
    optimizer.step()
    print(f"Epoch: {epoch+1}, Loss: {loss.item()}")

Epoch: 1, Loss: 9.330514907836914
Epoch: 2, Loss: 7.404484272003174
Epoch: 3, Loss: 6.7136921882629395
Epoch: 4, Loss: 6.480833530426025
Epoch: 5, Loss: 6.34003210067749
Epoch: 6, Loss: 6.228860378265381
Epoch: 7, Loss: 6.134629726409912
Epoch: 8, Loss: 6.040375232696533
Epoch: 9, Loss: 5.932708263397217
Epoch: 10, Loss: 5.823095321655273
Epoch: 11, Loss: 5.6992926597595215
Epoch: 12, Loss: 5.568178653717041
Epoch: 13, Loss: 5.431950092315674
Epoch: 14, Loss: 5.326467514038086
Epoch: 15, Loss: 5.175271987915039
Epoch: 16, Loss: 5.047525882720947
Epoch: 17, Loss: 4.9178338050842285
Epoch: 18, Loss: 4.790621757507324
Epoch: 19, Loss: 4.666529178619385
Epoch: 20, Loss: 4.5362935066223145
Epoch: 21, Loss: 4.403532981872559
Epoch: 22, Loss: 4.261481761932373
Epoch: 23, Loss: 4.125705242156982
Epoch: 24, Loss: 3.9655356407165527
Epoch: 25, Loss: 3.8199057579040527
Epoch: 26, Loss: 3.683149576187134
Epoch: 27, Loss: 3.5613436698913574
Epoch: 28, Loss: 3.41880202293396
Epoch: 29, Loss: 3.31535

Moment of truth! We now check how it behaves on an input

One thing to note here that really confused me at first is how to actually perform inference with the model created. 

As you know from the architecture, the Decoder generates tokens one by one, meaning that it takes the tokens produced so far and predicts the next one. But how does it do that? You might have seen that the loss we chose was CrossEntropyLoss, which is used for multi-class classification problems. The model therefore outputs one score (logit) per vocabulary entry, and a softmax turns these scores into probabilities.  

This means that we get a bunch of scores as the output at every step. If we check which index has the highest score (and therefore the highest probability), we can map it back to the original word in the vocabulary and get the next word of the reply! In hindsight, this is quite simple but I had difficulty wrapping my head around it. Hope this helps people! 
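
Here is a stripped-down sketch of that loop in plain Python. The function `fake_next_token_logits` is a made-up stand-in for the Transformer that simply follows a fixed script over a 5-entry vocabulary; only the decoding logic around it matters.

```python
import math

SOS, EOS = 1, 2  # same IDs as SOS_token / EOS_token above

def softmax(logits):
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fake_next_token_logits(prefix):
    """Hypothetical stand-in for the Transformer: scores a 5-entry vocabulary given the prefix."""
    script = {1: 3, 3: 4, 4: 2}          # SOS -> 3 -> 4 -> EOS
    logits = [0.0] * 5
    logits[script[prefix[-1]]] = 5.0     # make the scripted next token the most likely one
    return logits

def greedy_decode(max_steps=10):
    prefix = [SOS]
    for _ in range(max_steps):
        logits = fake_next_token_logits(prefix)
        probs = softmax(logits)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # index of the highest probability
        # softmax preserves the ordering, so the same index also has the highest raw score
        assert next_id == max(range(len(logits)), key=logits.__getitem__)
        if next_id == EOS:
            break
        prefix.append(next_id)
    return prefix

print(greedy_decode())  # [1, 3, 4]
```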

In [235]:
def sentenceFromIndexes(voc, indexes):
    # Drop the PAD, SOS and EOS tokens and map the remaining IDs back to words
    special_tokens = [PAD_token, SOS_token, EOS_token]
    sent_tokens = [voc.index2word[idx.item()] for idx in indexes[0] if idx.item() not in special_tokens]
    return "chatbot: " + " ".join(sent_tokens) 

def chat(input_stmt):
    print("> ", input_stmt)
    transformer.eval()  # turn off dropout for inference

    # prepare the input; words that are not in the vocabulary are skipped
    input_tokens_inference = tokenizer(input_stmt)
    input_tokens_inference = [SOS_token] + [voc.word2index[ele] for ele in input_tokens_inference if ele in voc.word2index] + [EOS_token]
    
    input_variable, _ = inputVar(pd.Series([input_tokens_inference]), voc)
    encoder_input = input_variable[0].unsqueeze(0)
    # the decoder starts from the Start-of-Sequence token alone
    decoder_input = torch.tensor([[SOS_token]])

    # greedy decoding: feed the tokens generated so far and append the most likely next token
    with torch.no_grad():
        for i in range(PAD_LENGTH):
            prediction = transformer(encoder_input, decoder_input)
            prediction = prediction[:, -1:]  # scores for the last position only
            predicted_id = torch.argmax(prediction, dim=2)  # token ID with the highest score

            if predicted_id.item() == EOS_token:
                break

            decoder_input = torch.cat([decoder_input, predicted_id], dim=-1)
    return sentenceFromIndexes(voc, decoder_input)

print(chat("Hi"))

>  Hi
chatbot: what ? oh yeah .


This marks the end of the notebook! I hope you learned something from this. The main goal was to showcase my understanding of the architecture and to clear any misunderstandings I might have had. 

Finally a chatbot created from scratch! One item off my bucket list. 

Some side notes
<ul>
<li>The Embedding layer is a lookup table: it takes word indexes as input and outputs the corresponding embedding vectors for these indexes</li>
</ul>
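
The lookup-table view of the Embedding layer can be checked directly. In this small sketch (the sizes and token IDs are arbitrary, chosen only for illustration), calling the layer gives exactly the same result as indexing its weight matrix with the token IDs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)
token_ids = torch.tensor([[1, 3, 4, 2, 0]])   # (batch, seq_len)

vectors = embedding(token_ids)                # (batch, seq_len, embedding_dim)
print(vectors.shape)                          # torch.Size([1, 5, 4])
# The layer is a plain row lookup into its weight matrix
print(torch.equal(vectors, embedding.weight[token_ids]))  # True
```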