# A Simple Chatbot Using PyTorch


In this notebook, we build a chatbot based on the original Transformer architecture. The data used to train this model is taken from the Cornell Movie-Dialogs Corpus, which is available <a href="https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html">here</a>.

## Imports

In [2]:
# Download the corpus and put it in a ``data/`` directory under the current directory.
#
# After that, let’s import some necessities.
#

import torch
from torch.jit import script, trace
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math
import json


USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

## Data Preparation 

The movie corpus data has to be prepared so that it can be fed to the model we create later on. We first take a look at the data to see what steps need to be taken for this. 

In [3]:
corpus_name = "movie-corpus"
corpus = os.path.join("data", corpus_name)

def printLines(file, n=10):
    """Print the first n lines of a file"""
    with open(file, 'rb') as datafile:
        lines = datafile.readlines()
    for line in lines[:n]:
        print(line)

printLines(os.path.join(corpus, "utterances.jsonl"))

b'{"id": "L1045", "conversation_id": "L1044", "text": "They do not!", "speaker": "u0", "meta": {"movie_id": "m0", "parsed": [{"rt": 1, "toks": [{"tok": "They", "tag": "PRP", "dep": "nsubj", "up": 1, "dn": []}, {"tok": "do", "tag": "VBP", "dep": "ROOT", "dn": [0, 2, 3]}, {"tok": "not", "tag": "RB", "dep": "neg", "up": 1, "dn": []}, {"tok": "!", "tag": ".", "dep": "punct", "up": 1, "dn": []}]}]}, "reply-to": "L1044", "timestamp": null, "vectors": []}\n'
b'{"id": "L1044", "conversation_id": "L1044", "text": "They do to!", "speaker": "u2", "meta": {"movie_id": "m0", "parsed": [{"rt": 1, "toks": [{"tok": "They", "tag": "PRP", "dep": "nsubj", "up": 1, "dn": []}, {"tok": "do", "tag": "VBP", "dep": "ROOT", "dn": [0, 2, 3]}, {"tok": "to", "tag": "TO", "dep": "dobj", "up": 1, "dn": []}, {"tok": "!", "tag": ".", "dep": "punct", "up": 1, "dn": []}]}]}, "reply-to": null, "timestamp": null, "vectors": []}\n'
b'{"id": "L985", "conversation_id": "L984", "text": "I hope so.", "speaker": "u0", "meta": {

As can be seen, each line of the file is a JSON object describing a single utterance. The model needs the input in the format "<START>sent1<SEP>sent2<END>", where each <SOME_NAME> represents a "special token". More on this later. 

We first process the file so that each sentence pair sits together on a single line instead of being spread across the complex JSON. 

In [4]:
def loadLinesAndConversations(fileName):
    """Processes the file into individual lines and sentence pairs"""
    lines = {}
    conversations = {}
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            lineJson = json.loads(line)
            # Convert the JSON fields into more readable fields for the line object
            lineObj = {}
            lineObj["lineID"] = lineJson["id"]
            lineObj["characterID"] = lineJson["speaker"]
            lineObj["text"] = lineJson["text"]
            # lines[lineObj['lineID']] = lineObj

            # Convert the JSON fields into more readable fields for conversation object
            if lineJson["conversation_id"] not in conversations:
                convObj = {}
                convObj["conversationID"] = lineJson["conversation_id"]
                convObj["movieID"] = lineJson["meta"]["movie_id"]
                convObj["lines"] = [lineObj]
            else:
                convObj = conversations[lineJson["conversation_id"]]
                convObj["lines"].insert(0, lineObj)
            conversations[convObj["conversationID"]] = convObj

    return lines, conversations


def extractSentencePairs(conversations):
    """Convert the conversations object into pairs of sentences"""
    qa_pairs = []
    for conversation in conversations.values():
        # Iterate over all the lines of the conversation
        for i in range(len(conversation["lines"]) - 1):  # We ignore the last line (no answer for it)
            inputLine = conversation["lines"][i]["text"].strip()
            targetLine = conversation["lines"][i+1]["text"].strip()
            # Filter wrong samples (if one of the lists is empty)
            if inputLine and targetLine:
                qa_pairs.append([inputLine, targetLine])
    return qa_pairs
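
As a quick illustration of the pairing logic, the sketch below applies the same "each line with the line that follows it" idea to a hand-written conversation. The helper name `pairs_from_lines` and the toy dialogue are made up for this example and are not part of the notebook.

```python
def pairs_from_lines(lines):
    """Pair each line with the line that follows it (toy version of the pairing step)."""
    pairs = []
    for current, following in zip(lines, lines[1:]):
        current, following = current.strip(), following.strip()
        # Skip pairs where either side is empty, as the notebook does
        if current and following:
            pairs.append([current, following])
    return pairs


toy_conversation = ["She okay?", "I hope so.", "   ", "Let's go."]
toy_pairs = pairs_from_lines(toy_conversation)
print(toy_pairs)
```

Only the first pair survives here, because the whitespace-only line empties both pairs it takes part in. A conversation of n non-empty lines yields n - 1 pairs.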

Now that we have defined the functions to extract the sentence pairs, we write the pairs to an output file called "formatted_movie_lines.txt". This serves both as a reference for later steps and as a backup. 

In [5]:
# Define path to new file
datafile = os.path.join(corpus, "formatted_movie_lines.txt")

delimiter = '\t'
# Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

# Initialize lines dict and conversations dict
# lines = {}
conversations = {}
# Load lines and conversations
print("\nProcessing corpus into lines and conversations...")
_, conversations = loadLinesAndConversations(os.path.join(corpus, "utterances.jsonl"))

# Write new csv file
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in extractSentencePairs(conversations):
        writer.writerow(pair)

# Print a sample of lines
print("\nSample lines from file:")
printLines(datafile)


Processing corpus into lines and conversations...

Writing newly formatted file...

Sample lines from file:
b'They do to!\tThey do not!\r\n'
b'She okay?\tI hope so.\r\n'
b"Wow\tLet's go.\r\n"
b'"I\'m kidding.  You know how sometimes you just become this ""persona""?  And you don\'t know how to quit?"\tNo\r\n'
b"No\tOkay -- you're gonna need to learn how to lie.\r\n"
b"I figured you'd get to the good stuff eventually.\tWhat good stuff?\r\n"
b'What good stuff?\t"The ""real you""."\r\n'
b'"The ""real you""."\tLike my fear of wearing pastels?\r\n'
b'do you listen to this crap?\tWhat crap?\r\n'
b"What crap?\tMe.  This endless ...blonde babble. I'm like, boring myself.\r\n"
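
The doubled quotes in the sample above are standard CSV quoting, which `csv.writer` adds to any field that contains a quote character. A minimal sketch of reading such lines back with `csv.reader`, which undoes that quoting; the in-memory string below is an assumption standing in for the real file.

```python
import csv
import io

# Two lines in the same tab-separated format as the formatted file
sample = 'What good stuff?\t"The ""real you""."\nWow\tLet\'s go.\n'

# csv.reader with the same delimiter restores the original quote characters
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
for query, response in rows:
    print(repr(query), "->", repr(response))
```

Splitting each line on the tab character alone would leave the extra quote characters in place. In this notebook that is harmless, since the normalization step later replaces every non-letter character anyway.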


## Data Preprocessing 

We now create a vocabulary of words from the sentence pairs created previously. The model needs a vocabulary for its input sequences and one for its output sequences; here, both come from the same set of sentences, so a single vocabulary is shared. 

To keep the code readable, this vocabulary is stored in an object of the "Voc" class.  

In [6]:
# Default word tokens
PAD_token = 0  # Used for padding short sentences
SOS_token = 1  # Start-of-sentence token
EOS_token = 2  # End-of-sentence token

class Voc:
    def __init__(self, name):
        self.name = name
        self.trimmed = False
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3  # Count SOS, EOS, PAD

    def addSentence(self, sentence):
        """Takes the sentence, splits into words and then adds that to the dictionary"""
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        """Adds a new word to the vocabulary dictionaries, or increments its count if it is already present"""
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1

We had previously created the "formatted_movie_lines.txt" file containing the input and the target sentences together on one line. Unfortunately, this cannot be fed directly to the model either. This is because the input "sequences" given to the model should be 

<ul>
<li>Of the same length - achieved by using PAD tokens at the end of shorter sentences - this creates "Zero Pad" sentences</li>
<li>Marked with the start and end of sequence - SOS, EOS in our case - to indicate to the model where the input sequence is no longer supposed to be processed and the output can be generated</li>
</ul>

We also remove sentences which are too long, as controlled by the parameter MAX_LENGTH. A pair is kept only if both of its sentences have fewer than 10 tokens (words and punctuation marks), which leaves room for the EOS token. 
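
A minimal sketch of these requirements on a toy vocabulary: a sentence is mapped to indices, terminated with EOS, and right-padded with PAD to a fixed length. The vocabulary contents and the helper name `encode_and_pad` are illustrative assumptions; the ids 0, 1 and 2 match the PAD, SOS and EOS values defined earlier.

```python
PAD, SOS, EOS = 0, 1, 2      # same ids as PAD_token, SOS_token, EOS_token
FIXED_LENGTH = 10            # assumed fixed sequence length

# Hypothetical toy vocabulary; real ids come from the Voc object
toy_word2index = {"i": 3, "hope": 4, "so": 5, ".": 6}


def encode_and_pad(sentence, word2index, length=FIXED_LENGTH):
    """Map words to ids, append EOS, then right-pad with PAD."""
    ids = [word2index[word] for word in sentence.split(" ")] + [EOS]
    if len(ids) > length:
        raise ValueError("sentence too long for the fixed length")
    return ids + [PAD] * (length - len(ids))


encoded = encode_and_pad("i hope so .", toy_word2index)
print(encoded)  # [3, 4, 5, 6, 2, 0, 0, 0, 0, 0]
```

A sentence with fewer than 10 tokens always fits: at most 9 word ids plus the EOS id fill the 10 slots.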

In [7]:
MAX_LENGTH = 10  # Maximum sentence length to consider

# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

# Read query/response pairs and return a voc object
def readVocs(datafile, corpus_name):
    print("Reading lines...")
    # Read the file and split into lines
    lines = open(datafile, encoding='utf-8').\
        read().strip().split('\n')
    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    voc = Voc(corpus_name)
    return voc, pairs

# Returns True if both sentences in a pair 'p' are under the MAX_LENGTH threshold
def filterPair(p):
    # Input sequences need to preserve the last word for EOS token
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH

# Filter pairs using the ``filterPair`` condition
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

# Using the functions defined above, return a populated voc object and pairs list
def loadPrepareData(corpus, corpus_name, datafile, save_dir):
    print("Start preparing training data ...")
    voc, pairs = readVocs(datafile, corpus_name)
    print("Read {!s} sentence pairs".format(len(pairs)))
    pairs = filterPairs(pairs)
    print("Trimmed to {!s} sentence pairs".format(len(pairs)))
    print("Counting words...")
    for pair in pairs:
        voc.addSentence(pair[0])
        voc.addSentence(pair[1])
    print("Counted words:", voc.num_words)
    return voc, pairs


# Load/Assemble voc and pairs
save_dir = os.path.join("data", "save")
voc, pairs = loadPrepareData(corpus, corpus_name, datafile, save_dir)
# Print some pairs to validate
print("\npairs:")
for pair in pairs[:10]:
    print(pair)

Start preparing training data ...
Reading lines...
Read 221282 sentence pairs
Trimmed to 64313 sentence pairs
Counting words...
Counted words: 18082

pairs:
['they do to !', 'they do not !']
['she okay ?', 'i hope so .']
['wow', 'let s go .']
['what good stuff ?', 'the real you .']
['the real you .', 'like my fear of wearing pastels ?']
['do you listen to this crap ?', 'what crap ?']
['well no . . .', 'then that s all you had to say .']
['then that s all you had to say .', 'but']
['but', 'you always been this selfish ?']
['have fun tonight ?', 'tons']
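
To see why "Let's go." shows up as "let s go ." in the pairs above, the sketch below replays the normalization steps on a few strings. It mirrors the logic of `unicodeToAscii` and `normalizeString` under the shorter, made-up names `strip_accents` and `normalize` so that it runs on its own.

```python
import re
import unicodedata


def strip_accents(text):
    """Decompose characters and drop combining marks (category 'Mn')."""
    return "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    )


def normalize(text):
    text = strip_accents(text.lower().strip())
    text = re.sub(r"([.!?])", r" \1", text)       # put a space before . ! ?
    text = re.sub(r"[^a-zA-Z.!?]+", " ", text)    # replace other non-letters with a space
    return re.sub(r"\s+", " ", text).strip()      # collapse repeated whitespace


for raw in ["Let's go.", "They do to!", "Caf\u00e9?"]:
    print(repr(raw), "->", repr(normalize(raw)))
```

Apostrophes, digits and commas all fall under "other non-letters", so contractions are split into two tokens and numbers disappear entirely.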


This is looking good so far! In the original tutorial, the next step is to trim out words that rarely occur in order to achieve faster convergence in model training. Now while this is a good idea, it could make us miss out on rare but important words. Hence I chose not to do so. 

We now convert the input sequences into tensors. To do so, we take each sentence of a pair (sequence), map its tokens to their indices in the vocabulary, and use the resulting list of indices as a tensor. Stacking these per-sentence tensors later gives a tensor of size (batch_size, max_length_of_sequence). 

You might have also noticed that we have not done the padding yet; we shall do so now. 

In [8]:
PAD_LENGTH = 10

def indexesFromSentence(voc, sentence):
    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]

def zeroPadding(l, fillvalue=PAD_token):
    # Right-pad every sequence with fillvalue up to PAD_LENGTH
    return [seq + [fillvalue] * (PAD_LENGTH - len(seq)) for seq in l]

def binaryMatrix(l, value=PAD_token):
    # 1 for real tokens, 0 for positions equal to value (the padding token)
    m = []
    for i, seq in enumerate(l):
        m.append([])
        for token in seq:
            if token == value:
                m[i].append(0)
            else:
                m[i].append(1)
    return m

# Returns padded input sequence tensor and lengths
def inputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    padVar = [torch.LongTensor(padElement) for padElement in padList]
    return padVar, lengths

# Returns padded target sequence tensor, padding mask, and max target length
def outputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    max_target_len = max([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    mask = binaryMatrix(padList)
    mask = torch.BoolTensor(mask)
    padVar = [torch.LongTensor(padElement) for padElement in padList]
    return padVar, mask, max_target_len

# Returns all items for a given batch of pairs
def batch2TrainData(voc, pair_batch):
    pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)
    input_batch, output_batch = [], []
    for pair in pair_batch:
        input_batch.append(pair[0])
        output_batch.append(pair[1])
    inp, lengths = inputVar(input_batch, voc)
    output, mask, max_target_len = outputVar(output_batch, voc)
    return inp, lengths, output, mask, max_target_len


# Example for validation
small_batch_size = 5
batches = batch2TrainData(voc, [random.choice(pairs) for _ in range(small_batch_size)])
input_variable, lengths, target_variable, mask, max_target_len = batches

print("input_variable:", input_variable)
print("lengths:", lengths)
print("target_variable:", target_variable)
print("mask:", mask)
print("max_target_len:", max_target_len)

input_variable: [tensor([ 35,  75,  75,  75,  14, 429,  24,  14,   2,   0]), tensor([   11,   117,   251,   351,    28,    22, 10148,    14,     2,     0]), tensor([  11,  119,  609, 9130,   14,    2,    0,    0,    0,    0]), tensor([19, 94, 88, 10,  2,  0,  0,  0,  0,  0]), tensor([   89, 16548,    14,     2,     0,     0,     0,     0,     0,     0])]
lengths: tensor([9, 9, 6, 5, 4])
target_variable: [tensor([ 88,  17, 101,  14,   2,   0,   0,   0,   0,   0]), tensor([172,  10,   2,   0,   0,   0,   0,   0,   0,   0]), tensor([  14,   14, 9131,   19,   10,    2,    0,    0,    0,    0]), tensor([  11,   45,    5, 5271,   14,    2,    0,    0,    0,    0]), tensor([19, 10,  2,  0,  0,  0,  0,  0,  0,  0])]
mask: tensor([[ True,  True,  True,  True,  True, False, False, False, False, False],
        [ True,  True,  True, False, False, False, False, False, False, False],
        [ True,  True,  True,  True,  True,  True, False, False, False, False],
        [ True,  True,  True,  True,

## Model Architecture

We shall be using the original Transformer architecture from the "Attention Is All You Need" paper. One point to note about how it is used in our case: 

<ul>
<li>Separate query, key and value tensors are not prepared outside the attention module. Instead, for self-attention we simply pass the same input as the query, key and value, and the module's own learned linear layers (W_q, W_k, W_v) project it internally</li>
</ul>

In [9]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        # Ensure that the model dimension (d_model) is divisible by the number of heads
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        # Initialize dimensions
        self.d_model = d_model # Model's dimension
        self.num_heads = num_heads # Number of attention heads
        self.d_k = d_model // num_heads # Dimension of each head's key, query, and value
        
        # Linear layers for transforming inputs
        self.W_q = nn.Linear(d_model, d_model) # Query transformation
        self.W_k = nn.Linear(d_model, d_model) # Key transformation
        self.W_v = nn.Linear(d_model, d_model) # Value transformation
        self.W_o = nn.Linear(d_model, d_model) # Output transformation
        
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Calculate attention scores
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        # Apply mask if provided (useful for preventing attention to certain parts like padding)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        
        # Softmax is applied to obtain attention probabilities
        attn_probs = torch.softmax(attn_scores, dim=-1)
        
        # Multiply by values to obtain the final output
        output = torch.matmul(attn_probs, V)
        return output
        
    def split_heads(self, x):
        # Reshape the input to have num_heads for multi-head attention
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        
    def combine_heads(self, x):
        # Combine the multiple heads back to original shape
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)
        
    def forward(self, Q, K, V, mask=None):
        # Apply linear transformations and split heads
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        
        # Perform scaled dot-product attention
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        
        # Combine heads and apply output transformation
        output = self.W_o(self.combine_heads(attn_output))
        return output
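
The shape bookkeeping in `split_heads` and `combine_heads` is easier to follow on a small tensor. The sketch below is a stripped-down illustration rather than the module itself: it leaves out the learned projections (`W_q`, `W_k`, `W_v`, `W_o`) and masking, and the sizes are arbitrary toy values.

```python
import torch

batch_size, seq_length, d_model, num_heads = 2, 5, 8, 2
d_k = d_model // num_heads

x = torch.randn(batch_size, seq_length, d_model)

# Split: (batch, seq, d_model) -> (batch, heads, seq, d_k)
heads = x.view(batch_size, seq_length, num_heads, d_k).transpose(1, 2)

# Scaled dot-product attention per head, using x as query, key and value
scores = heads @ heads.transpose(-2, -1) / d_k ** 0.5   # (batch, heads, seq, seq)
weights = torch.softmax(scores, dim=-1)                  # each row sums to 1
attended = weights @ heads                               # (batch, heads, seq, d_k)

# Combine: (batch, heads, seq, d_k) -> (batch, seq, d_model)
combined = attended.transpose(1, 2).contiguous().view(batch_size, seq_length, d_model)

print(heads.shape, scores.shape, combined.shape)
```

Each head sees only its own `d_k`-sized slice of the model dimension, which is why `d_model` must be divisible by `num_heads`.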

In [10]:
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

In [11]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()
        
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer('pe', pe.unsqueeze(0))
        
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]
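
The class above fills even dimensions with sin(pos / 10000^(i / d_model)) and odd dimensions with the matching cosine, where i is the even dimension index. A minimal pure-Python sketch of the same table, written with explicit loops for readability; the function name `positional_encoding` is an assumption of this example.

```python
import math


def positional_encoding(max_seq_length, d_model):
    """Return a max_seq_length x d_model table of sinusoidal encodings."""
    table = [[0.0] * d_model for _ in range(max_seq_length)]
    for pos in range(max_seq_length):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)          # even dimensions
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)  # odd dimensions
    return table


pe = positional_encoding(max_seq_length=4, d_model=4)
for row in pe:
    print([round(value, 4) for value in row])
```

Position 0 always encodes to alternating 0 and 1, and later dimensions oscillate more slowly than earlier ones, which gives every position a distinct pattern.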

In [12]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

In [13]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, enc_output, src_mask, tgt_mask):
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

In [14]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)

        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        seq_length = tgt.size(1)
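        # Build the causal mask on the same device as tgt so the & below also works on GPU
        nopeek_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length, device=tgt.device), diagonal=1)).bool()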
        tgt_mask = tgt_mask & nopeek_mask
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))

        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)

        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)

        output = self.fc(dec_output)
        return output
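
To make `generate_mask` concrete, the sketch below builds the same masks for one short toy sequence. The token ids are arbitrary, and `torch.tril` is used here as an equivalent way to write the lower-triangular "no peek" mask.

```python
import torch

PAD = 0
src = torch.tensor([[7, 8, 2, PAD]])      # (batch=1, src_len=4)
tgt = torch.tensor([[5, 6, 2, PAD]])      # (batch=1, tgt_len=4)

# Source mask: True where the source token is not padding -> (1, 1, 1, 4)
src_mask = (src != PAD).unsqueeze(1).unsqueeze(2)

# Target padding mask over query positions -> (1, 1, 4, 1)
tgt_pad_mask = (tgt != PAD).unsqueeze(1).unsqueeze(3)

# Lower-triangular "no peek" mask: position i may attend to positions <= i
tgt_len = tgt.size(1)
nopeek_mask = torch.tril(torch.ones(1, tgt_len, tgt_len)).bool()

# Broadcasting combines both into (1, 1, 4, 4)
tgt_mask = tgt_pad_mask & nopeek_mask
print(tgt_mask[0, 0].int())
```

Row i of the printed matrix lists which target positions query i may attend to. The last row is all zeros because that query position is itself padding.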

In [15]:
src_vocab_size = voc.num_words
d_model = 512
tgt_vocab_size = voc.num_words # source and target vocab belong to the same Voc object 
num_heads = 8
num_layers = 6 # number of encoder, decoder stacks
d_ff = 2048 # dimensionality of the hidden layer for the feed forward neural network 
max_seq_length = 10 # must cover the padded sequence length (PAD_LENGTH), since the positional encoding is added to the embedded input   
dropout = 0.1 # dropout probability. Initial value 0.1 

transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)

In [16]:
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

transformer.train()

batch_size = 64 # number of sentence pairs per batch 
n_batches = 1

# Each "epoch" below is a single optimization step on one randomly sampled batch
for epoch in range(100):
    optimizer.zero_grad()

    # training data preparation 
    training_batch = batch2TrainData(voc, [random.choice(pairs) for _ in range(batch_size * n_batches)])
    input_variable, _, target_variable, _, _ = training_batch

    # input_variable and target_variable are lists of 1-D tensors (one per sentence);
    # stack them into tensors of shape (batch_size * n_batches, PAD_LENGTH)
    input_variable = torch.stack(input_variable, dim=0)
    target_variable = torch.stack(target_variable, dim=0)

    output = transformer(input_variable, target_variable[:, :-1]) # shifting decoder input by 1 token 
    loss = criterion(output.contiguous().view(-1, tgt_vocab_size), target_variable[:, 1:].contiguous().view(-1)) 
    # exclude the first token for calculating the loss 
    loss.backward()
    optimizer.step()
    print(f"Epoch: {epoch+1}, Loss: {loss.item()}")

Epoch: 1, Loss: 9.8135404586792
Epoch: 2, Loss: 8.018126487731934
Epoch: 3, Loss: 7.140478134155273
Epoch: 4, Loss: 7.090920448303223
Epoch: 5, Loss: 6.485570907592773
Epoch: 6, Loss: 6.502720832824707
Epoch: 7, Loss: 6.353397846221924
Epoch: 8, Loss: 6.691864967346191
Epoch: 9, Loss: 6.442277431488037
Epoch: 10, Loss: 6.37252140045166
Epoch: 11, Loss: 6.061436653137207
Epoch: 12, Loss: 6.40967321395874
Epoch: 13, Loss: 6.157182693481445
Epoch: 14, Loss: 6.259674072265625
Epoch: 15, Loss: 6.339513778686523
Epoch: 16, Loss: 6.153425216674805
Epoch: 17, Loss: 6.188608646392822
Epoch: 18, Loss: 6.1257829666137695
Epoch: 19, Loss: 6.09058952331543
Epoch: 20, Loss: 5.901798725128174
Epoch: 21, Loss: 6.084578514099121
Epoch: 22, Loss: 6.006988048553467
Epoch: 23, Loss: 5.9860711097717285
Epoch: 24, Loss: 5.783020496368408
Epoch: 25, Loss: 5.63553524017334
Epoch: 26, Loss: 5.853550910949707
Epoch: 27, Loss: 5.4357099533081055
Epoch: 28, Loss: 5.6947736740112305
Epoch: 29, Loss: 5.714897632598
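
The slicing in the training loop is the usual teacher-forcing shift: the decoder receives `target_variable[:, :-1]` and is scored against `target_variable[:, 1:]`. A minimal sketch on plain lists, assuming an SOS token is prepended to the target. Note that `indexesFromSentence` above appends only EOS, so the SOS prefix is an assumption of this sketch rather than what the notebook does.

```python
PAD, SOS, EOS = 0, 1, 2

# Hypothetical target sequence with SOS prepended (an assumption of this sketch)
target = [SOS, 19, 10, EOS, PAD, PAD]

decoder_input = target[:-1]   # what the decoder sees
labels = target[1:]           # what it must predict at each position

for step, (seen, expected) in enumerate(zip(decoder_input, labels)):
    ignored = " (ignored: PAD label)" if expected == PAD else ""
    print(f"step {step}: input {seen} -> label {expected}{ignored}")
```

Positions whose label is PAD do not contribute to the loss, which is what `ignore_index=0` in `nn.CrossEntropyLoss` arranges.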

In [47]:
def sentenceFromIndexes(voc, indexes):
    return [voc.index2word[index] for index in indexes] 

def chat(input_stmt):
    # prepare the input 
    input_variable, _ = inputVar([input_stmt], voc)
    tgt_variable = torch.zeros(MAX_LENGTH - 1).unsqueeze(0)

    # we can pass this input to the encoder, but what about the decoder? 
    prediction_indexes = transformer(input_variable[0].unsqueeze(0).to(torch.long), tgt_variable.to(torch.long))
    # prediction = sentenceFromIndexes(voc, prediction_indexes.to(torch.long))
    return prediction_indexes

print(chat("name"))

tensor([[[-2.5802, -2.8773,  3.8755,  ..., -3.1269, -2.9968, -3.2692],
         [-2.4594, -2.7751,  3.9226,  ..., -3.3305, -2.9728, -2.8956],
         [-2.5771, -3.1471,  4.1140,  ..., -3.2287, -3.1706, -2.9809],
         ...,
         [-2.4993, -2.8435,  4.7631,  ..., -3.4594, -3.2652, -3.0950],
         [-2.3107, -2.9953,  4.8845,  ..., -3.3894, -3.3064, -2.7514],
         [-2.4608, -2.9672,  4.7127,  ..., -3.1127, -3.5223, -3.0928]]],
       grad_fn=<ViewBackward0>)


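This is the final tensor that the transformer outputs: for each of the 9 target positions, it holds a vector of unnormalized scores (logits) over the vocabulary, not yet a sentence. Upon further analysis, we see that this is gibberish and does not actually represent any proper sentence. This could be due to the architecture of the model or the small number of training examples that was used. Note also that the decoder input passed in `chat` consists entirely of zeros, that is, PAD tokens. To correct this, we should update the architecture and the subsequent steps. 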
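
One common way to turn per-position scores into a sentence is greedy decoding: start from SOS, repeatedly pick the highest-scoring next token, and stop at EOS. The sketch below shows only that loop. `greedy_decode` and `toy_scorer` are hypothetical names, and the scorer is a stand-in for a trained model; in this notebook it would have to wrap a call to `transformer` and return the logits of the last target position, which is an assumption and not something the cells above implement.

```python
def greedy_decode(next_token_scores, src_ids, sos_id=1, eos_id=2, max_len=10):
    """Generate ids one at a time, always taking the highest-scoring token."""
    generated = [sos_id]
    for _ in range(max_len - 1):
        scores = next_token_scores(src_ids, generated)   # one score per vocabulary id
        next_id = max(range(len(scores)), key=scores.__getitem__)
        generated.append(next_id)
        if next_id == eos_id:
            break
    return generated


def toy_scorer(src_ids, generated):
    """Stand-in for a trained model: echoes the source, then emits EOS."""
    vocab_size = 8
    position = len(generated) - 1
    wanted = src_ids[position] if position < len(src_ids) else 2
    return [1.0 if token == wanted else 0.0 for token in range(vocab_size)]


print(greedy_decode(toy_scorer, [5, 6]))  # [1, 5, 6, 2]
```

The resulting ids can then be mapped back to words with a lookup such as `voc.index2word`.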

Some side notes
<ul>
<li>The embedding layer is a lookup table that takes word indexes as input and outputs the corresponding embedding vectors for these indexes</li>
</ul>
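
The lookup-table behaviour is easy to check directly. A small sketch with arbitrary toy sizes (6 words, 4-dimensional vectors):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=6, embedding_dim=4)   # 6 words, 4-dim vectors

indexes = torch.tensor([[1, 3, 3]])        # (batch=1, seq_len=3)
vectors = embedding(indexes)               # (1, 3, 4)

print(vectors.shape)
# Each output row is simply the matching row of the weight matrix
print(torch.equal(vectors[0, 0], embedding.weight[1]))
print(torch.equal(vectors[0, 1], vectors[0, 2]))
```

The same index always returns the same vector, and the rows of `embedding.weight` are updated during training like any other parameter.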