# Lab 10 - Chatbot using PyTorch
In this lab you will create a Chatbot using sequence to sequence models. The chatbot will be trained on movie scripts from the [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html), using an encoder and decoder architecture.

This lab is heavily based on the [Chatbot Tutorial](https://pytorch.org/tutorials/beginner/chatbot_tutorial.html#load-preprocess-data) available on the PyTorch website.

In [None]:
%pip install torch==1.11.0+cu113 torchdata==0.3.0 torchtext==0.12.0 -f https://download.pytorch.org/whl/cu113/torch_stable.html
%pip install tqdm ipywidgets spacy
!python -m spacy download en_core_web_sm

In [None]:
import torch
import torchtext
import random
import numpy as np

SEED = 1234
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)
torch.backends.cudnn.deterministic = True

print("PyTorch Version: ", torch.__version__)
print("torchtext Version: ", torchtext.__version__)
print(f"Using {'GPU' if str(DEVICE) == 'cuda' else 'CPU'}.")

# Data Preparation
The first step will be downloading and processing the corpus we will be working with.

We will be using the [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html) to train the model on dialog line pairs.

The corpus contains 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies, for a total of 304,713 utterances. This variety makes the dataset diverse in tone and sentiment, which makes it well suited to training a chatbot.

The primary downside of the dataset is that it needs rather thorough processing and cleaning for use in chatbot training, but we will tackle it step by step.

First, we need to download it. We'll use Python's `urllib` and `zipfile` libraries to quickly download the zip file and unzip it.

In [None]:
import urllib.request
import zipfile
from pathlib import Path

def download_cornell_movie_dialogs(extract_dir = Path(".")):
    # Download
    URL = "http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip"
    zip_path, _ = urllib.request.urlretrieve(URL)

    # Unzip
    with zipfile.ZipFile(zip_path, "r") as f:
        f.extractall(extract_dir)

download_cornell_movie_dialogs()

## Data Exploration
We can take a quick look at the raw data:

In [None]:
DATA_PATH = Path("cornell movie-dialogs corpus")

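def _print_x_lines(x, file):
    # errors="replace" keeps this helper usable for both the raw corpus
    # (ISO-8859-1) and the UTF-8 files we write later
    with open(file, "r", errors="replace") as f:
        for _ in range(x):
            print(f.readline(), end="")  # readline() keeps the trailing newline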

_print_x_lines(10, DATA_PATH / "movie_lines.txt")

As you can see, each line has:
- a line ID at the start,
- then the ID of the character saying the line,
- the ID of the movie the line belongs to,
- the character's name, and last but not least,
- the actual line

The file is essentially in CSV format with `+++$+++` as the delimiter. We'll therefore parse each line into a dictionary keyed by these field names, which makes the fields easier to access later.

The corpus additionally has a "conversations" file, which groups individual lines together into distinct conversations.

In [None]:
_print_x_lines(10, DATA_PATH / "movie_conversations.txt")

We can observe that each conversation has the following format:
- the ID of the first character involved in the conversation
- the ID of the second character involved in the conversation
- the ID of the movie the conversation takes place in
- a list of line IDs included in the conversation.

`+++$+++` is the delimiter once again.
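
As a minimal sketch of how one such record could be parsed, the snippet below splits a line on the delimiter and converts the list of line IDs with `ast.literal_eval`. The sample record and the field names are illustrative assumptions that follow the format described above; nothing is read from disk.

```python
import ast

DELIMITER = " +++$+++ "
# Assumed field names, chosen to match the four fields listed above
FIELDS = ["character1ID", "character2ID", "movieID", "lineIDs"]

# Illustrative record in the format described above (not read from the corpus file)
sample = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']\n"

# Split on the delimiter and pair each value with its field name
record = dict(zip(FIELDS, sample.strip().split(DELIMITER)))
# The last field is the text of a Python list literal; parse it without eval
record["lineIDs"] = ast.literal_eval(record["lineIDs"])

print(record)
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` when the text comes from a downloaded file.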

## Data Processing
Having our data in its raw format is very inconvenient, so we'll convert everything to Python dictionaries for easy access to all the information we need during training.

As far as lines are concerned, we will simply split them and convert them to Python dictionaries with the fields identified in the previous section as the keys.

In [None]:
def _process_line(line, field_names):
    line = line.split(" +++$+++ ")  # Delimiter
    line = dict(zip(field_names, line))  # To dict using field_names as keys
    return line

def process_lines():
    # Fields as they appear in the data
    FIELDS = ["lineID", "characterID", "movieID", "character", "text"]
    
    lines = {}  # Map from line ID to line object
    # Note the encoding, the dataset is not plain UTF-8
    with open(DATA_PATH / "movie_lines.txt", "r", encoding="iso-8859-1") as f:
        for line in f.readlines():  # For each line
            line_dict = _process_line(line, FIELDS)  # Process it
            lines[line_dict["lineID"]] = line_dict  # Store it according to its ID

    return lines

movie_lines = process_lines()

In [None]:
print(movie_lines["L1045"])
print(movie_lines["L1044"])

We'll process conversations in a very similar way. The only difference is that we will match each line ID to its actual line object and store those objects in each extracted conversation object.

In [None]:
import ast

def _process_conversation(convo, field_names):
    
    convo = convo.split(" +++$+++ ")
    convo = dict(zip(field_names, convo))
    
    # Convert to Python list; literal_eval only parses literals, unlike eval
    convo["lineIDs"] = ast.literal_eval(convo["lineIDs"])
    # Fetch actual line objects
    convo["lines"] = [movie_lines[line_id] for line_id in convo["lineIDs"]]
    
    return convo
    
def process_conversations():
    FIELDS = ["character1ID", "character2ID", "movieID", "lineIDs"]
    
    convos = []
    with open(DATA_PATH / "movie_conversations.txt", "r", encoding="iso-8859-1") as f:
        for convo in f.readlines():
            convo_dict = _process_conversation(convo, FIELDS)
            convos.append(convo_dict)
            
    return convos

movie_conversations = process_conversations()

In [None]:
movie_conversations[0]

Finally, we'll use the conversation information to extract line pairs from the corpus, each made up of a line and the response that follows it.

Then, we'll save these line pairs in a TSV file.

In [None]:
def _get_line_pairs(convos):
    pairs = []
    for convo in convos:
        # Ignore last line as it has no line to be paired with
        for i, input_line in enumerate(convo["lines"][:-1]):
            input_line = input_line["text"].strip()
            target_line = convo["lines"][i+1]["text"].strip()
            
            if input_line and target_line:  # It's possible either line is empty
                pairs.append((input_line, target_line))
    return pairs

# Example
_get_line_pairs([movie_conversations[0]])

In [None]:
import csv

output_file = DATA_PATH / "formatted_movie_lines.tsv"
delimiter = "\t"  # Tab character, since we are writing a TSV file

with open(output_file, "w", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=delimiter, lineterminator="\n")
    for pair in _get_line_pairs(movie_conversations):
        writer.writerow(pair)

In [None]:
# Visualize the TSV file
_print_x_lines(2, output_file)

If you're using SageMaker Studio Lab, you can open the actual `.tsv` file to inspect it in a more visual format.
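
Dialogue lines can themselves contain quotes or even tab characters, so it is worth checking that the TSV format round-trips them. The sketch below writes two made-up pairs to an in-memory buffer (an assumption made to keep the example self-contained) and reads them back with the same `csv` settings.

```python
import csv
import io

# Made-up pairs, including a quoted phrase and an embedded tab
pairs = [
    ("Hi there.", "Hello!"),
    ('She said "no".', "Tab\there"),
]

# Write to an in-memory buffer instead of a real file
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(pairs)

# Read the same text back with a tab delimiter
buffer.seek(0)
restored = [tuple(row) for row in csv.reader(buffer, delimiter="\t")]

print(restored == pairs)
```

The `csv` module quotes any field that contains the delimiter or a quote character, which is why reading the file back with `csv.reader` is safer than splitting each line on tabs by hand.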

### Standard NLP data processing
As the title of this subsection suggests, the next step will be applying some good old NLP data processing to the lines.

So far, we have processed the raw dataset into pairs of sentences that we know are part of a conversation. We will now apply a standard processing pipeline to each line so that it can be fed to the models we'll build later.

As usual, this primarily involves tokenisation and creating a vocabulary.

First, let's define the tokenizer. We'll use spaCy as in previous labs, but with a couple of modifications, such as removing most punctuation and lowercasing everything. As you might have noticed in the previous steps, we had to specify encodings explicitly when reading the corpus (ISO-8859-1) and when saving the line pairs (UTF-8). We want only plain Latin characters (ASCII) to remain, so we will also strip accents and remove everything else.

In [None]:
import re
import unicodedata

from torchtext.data.utils import get_tokenizer

class SpacyTokenizer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.tokenizer = get_tokenizer("spacy", language="en_core_web_sm")
        
    # Turn a Unicode string to plain ASCII, thanks to
    # https://stackoverflow.com/a/518232/2809427
    def _unicode_to_ascii(self, sentence):
        return ''.join(
            c for c in unicodedata.normalize('NFD', sentence)
            if unicodedata.category(c) != 'Mn'
        )
        
    def tokenize_sentence(self, sentence):
        # To ASCII and to lower
        sentence = self._unicode_to_ascii(sentence.lower().strip())
        # Tokenize
        sentence = self.tokenizer(sentence)
        # Remove extra punctuation (keep only letters and . ' ! ?)
        sentence = [re.sub(r"[^a-zA-Z.'!?]+", "", token) for token in sentence]
        # Remove empty strings
        sentence = [token for token in sentence if token]
        return sentence
        
    
    def forward(self, input):
        if isinstance(input, list):
            tokens = []
            for text in input:
                tokens.append(self.tokenize_sentence(text))
            return tokens
        elif isinstance(input, str):
            return self.tokenize_sentence(input)
        raise ValueError(f"Type {type(input)} is not supported.")

# Example
line = movie_lines["L197"]["text"]
print(f"Before tokenization: {line}")
tokenizer = SpacyTokenizer()
print(f"After tokenization: {tokenizer(line)}")
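
To see the normalisation steps in isolation, here is a minimal standard-library sketch. It splits on whitespace instead of using spaCy, which is a simplifying assumption, so punctuation stays attached to words and the tokens will differ from the `SpacyTokenizer` output. The helper names `strip_accents` and `simple_tokenize` are hypothetical.

```python
import re
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks (category Mn)."""
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )

def simple_tokenize(text):
    """Lowercase, strip accents, split on whitespace and drop unwanted characters."""
    text = strip_accents(text.lower().strip())
    tokens = text.split()  # whitespace split stands in for spaCy here
    tokens = [re.sub(r"[^a-zA-Z.'!?]+", "", token) for token in tokens]
    return [token for token in tokens if token]  # remove tokens that became empty

example = simple_tokenize("  Café, déjà vu... 100% sure?  ")
print(example)  # ['cafe', 'deja', 'vu...', 'sure?']
```

Note that stripping combining marks alone does not guarantee ASCII output. Characters that have no decomposition are only removed by the regular expression in the following step.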

Now that we have our tokenizer, the next step is to apply it to our line pairs and build a vocabulary from them.

First, let's actually load our line pairs into memory.

In [None]:
import csv

DATA_PATH = Path("cornell movie-dialogs corpus")

def _load_pair_data():
    with open(DATA_PATH / "formatted_movie_lines.tsv", "r", encoding="utf-8") as f:
        pairs = list(csv.reader(f, delimiter="\t"))
    return pairs
        
movie_line_pairs = _load_pair_data()

In [None]:
movie_line_pairs[0]

Before we continue, we'll trim our dataset to only include pairs where both lines are on the shorter end (fewer than $10$ tokens).

In [None]:
%%time
from tqdm import tqdm

MAX_LENGTH = 10

def filter_pairs(pairs, max_length: int):
    tokenizer = SpacyTokenizer()
    
    def _keep_pair(pair):
        return len(tokenizer(pair[0])) < max_length\
            and len(tokenizer(pair[1])) < max_length

    out_pairs = []
    for pair in tqdm(pairs):
        if _keep_pair(pair):
            out_pairs.append(pair)
    return out_pairs

print(f"Original dataset size: {len(movie_line_pairs)} pairs")
trimmed_movie_line_pairs = filter_pairs(movie_line_pairs, MAX_LENGTH)
print(f"Trimmed dataset size: {len(trimmed_movie_line_pairs)} pairs")

Next we'll use our usual Vocab building methods using torchtext to build our vocabulary.

We'll also include Beginning of Sentence (BOS) and End of Sentence (EOS) special tokens alongside our usual Unknown (UNK) and Padding (PAD) special tokens.

Additionally, we'll only include words in our vocab that appear a minimum of $3$ times.

In [None]:
from torchtext.vocab import build_vocab_from_iterator

tokenizer = SpacyTokenizer()
MIN_FREQ = 3

def _process_pairs_for_vocab(data):
    for pairs in data:  # Add tokens from both lines in the pair
        yield tokenizer(pairs[0]) + tokenizer(pairs[1])

text_vocab = build_vocab_from_iterator(
    _process_pairs_for_vocab(trimmed_movie_line_pairs),
    specials=('<unk>', '<pad>', '<bos>', '<eos>'),
    min_freq=MIN_FREQ
)

In [None]:
print(f"Unique tokens in vocabulary: {len(text_vocab)}")
print("\nFirst 20 tokens: ")
print(text_vocab.get_itos()[:20])
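
To make the effect of the minimum-frequency rule concrete, the sketch below builds a toy vocabulary with `collections.Counter`. It illustrates the idea only and is not torchtext's implementation; the helper `build_toy_vocab`, the toy sentences and the tie-breaking order are all assumptions.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials, min_freq):
    """Return (itos, stoi) with specials first, then tokens seen at least min_freq times."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Most frequent first; ties broken alphabetically (an assumed ordering)
    kept = sorted(
        (token for token, count in counts.items() if count >= min_freq),
        key=lambda token: (-counts[token], token),
    )
    itos = list(specials) + kept
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi

toy_lines = [["hello", "there"], ["hello", "you"], ["you", "there", "hello"], ["goodbye"]]
itos, stoi = build_toy_vocab(toy_lines, ("<unk>", "<pad>", "<bos>", "<eos>"), min_freq=2)

print(itos)  # ['<unk>', '<pad>', '<bos>', '<eos>', 'hello', 'there', 'you']
print("goodbye" in stoi)  # False
```

Tokens below the threshold, such as `goodbye` here, are simply absent from the vocabulary, so any sentence containing them cannot be fully converted to indices.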

Next we'll filter our dataset again to only include pairs where both lines contain tokens that are found in our vocabulary, i.e. tokens that weren't filtered by our minimum frequency rule.

In [None]:
%%time
from tqdm import tqdm

def filter_pairs(pairs, vocab):
    tokenizer = SpacyTokenizer()    
    vocab_tokens = set(vocab.get_itos())
    
    def _keep_pair(pair):
        tokens = set(tokenizer(pair[0]) + tokenizer(pair[1]))
        for token in tokens:
            if token not in vocab_tokens:
                return False
        return True
    
    out_pairs = []
    for pair in tqdm(pairs, desc="Trimming pairs based on vocab..."):
        if _keep_pair(pair):
            out_pairs.append(pair)
    return out_pairs

print(f"Original trimmed dataset size: {len(trimmed_movie_line_pairs)} pairs.")
final_movie_line_pairs = filter_pairs(trimmed_movie_line_pairs, text_vocab)
print(f"Final dataset size: {len(final_movie_line_pairs)} pairs.")

The final step is to define processing pipelines and data loaders.

We want to apply the following transformations to our data:
- For inputs (first of the two sentences in each pair):
  1. Tokenize
  2. Transform into vocabulary indices
  3. Add the EOS token
  4. Do the following:
     1. Get the length of this tokenized sentence.
     2. Pad this tokenized sentence to the maximum length of the batch and convert to a tensor. 
  - To this end, we'll use `input_transform_common` for steps 1 through 3. On top of that, we'll use `input_transform` for step 4.2 and `lengths_transform` for step 4.1. To extract sentence lengths we'll define a `ToLengths` transform.

- For outputs (second of the two sentences in each pair):
  1. Tokenize
  2. Transform into vocabulary indices
  3. Add the EOS token
  4. Do the following:
     1. Get the maximum tokenized sentence length.
     2. Pad this tokenized sentence to the maximum length of the batch and convert to a tensor.
     3. Create a mask that has 1 for each token that is not padding, and 0 for each token that is.
  - We'll use `output_transform_common` for steps 1 through 3, then `output_transform` for step 4.2 and `mask_transform` for step 4.3. For step 4.1 we'll use a little Python code when we apply the transforms, extracting the maximum length from the results of `output_transform_common`. For the mask we'll define a custom `PaddingMask` transform.

Applying this processing with the standard torchtext transforms and our custom ones gives batches whose first dimension is the batch size and whose second dimension is the token position. However, we want to be able to easily retrieve the words of every sequence in the batch at a given time step, so we will also transpose the tensors in our `collate_batch` function.

![](https://pytorch.org/tutorials/_images/seq2seq_batches.png)

In [None]:
import torchtext.transforms as T

class ToLengths(torch.nn.Module):
    """Converts a list to its length or a list of lists to a list of lengths."""
    def forward(self, input):
        if isinstance(input[0], list) or isinstance(input[0], torch.Tensor):
            lengths = []
            for text in input:
                lengths.append(len(text))
            return lengths
        elif isinstance(input, list) or isinstance(input, torch.Tensor):
            return len(input)
        raise ValueError(f"Type {type(input)} is not supported.")

class PaddingMask(torch.nn.Module):
    """Converts a list of padded sequences to a binary mask (0 for padding tokens, 1 otherwise)."""
    def __init__(self, padding_value):
        super().__init__()
        self.padding_value = padding_value
        
    def _to_mask(self, sequence):
        return [0 if token == self.padding_value else 1 for token in sequence]
    
    def forward(self, input):
        if isinstance(input[0], list) or isinstance(input[0], torch.Tensor):
            return [self._to_mask(seq) for seq in input]
        elif isinstance(input, list) or isinstance(input, torch.Tensor):
            return self._to_mask(input)
        raise ValueError(f"Type {type(input)} is not supported.")


input_transform_common = T.Sequential(
    SpacyTokenizer(),  # Tokenize
    T.VocabTransform(text_vocab),  # Convert to vocab IDs
    T.AddToken(token=text_vocab["<eos>"], begin=False),  # Add EOS
)

input_transform = T.Sequential(
    T.ToTensor(padding_value=text_vocab["<pad>"]),  # Convert to tensor and pad
)

lengths_transform = T.Sequential(
    ToLengths(),
    T.ToTensor(),
)

output_transform_common = T.Sequential(
    SpacyTokenizer(),  # Tokenize
    T.VocabTransform(text_vocab),  # Convert to vocab IDs
    T.AddToken(token=text_vocab["<eos>"], begin=False),  # Add EOS
)

output_transform = T.Sequential(
    T.ToTensor(padding_value=text_vocab["<pad>"]),  # Convert to tensor and pad
)

mask_transform = T.Sequential(
    PaddingMask(padding_value=text_vocab["<pad>"]),
    T.ToTensor(dtype=bool),
)

In [None]:
from torch.utils.data import DataLoader

BATCH_SIZE = 64

def collate_batch(batch):
    # Sort in the batch using reverse order of input length
    sort_tokenizer = SpacyTokenizer()
    batch.sort(key=lambda x: len(sort_tokenizer(x[0])), reverse=True)

    inputs, outputs = zip(*batch)
    
    # Input processing
    inputs = input_transform_common(list(inputs))
    lengths = lengths_transform(inputs)
    inputs = input_transform(inputs)
    
    # Output processing
    outputs = output_transform_common(list(outputs))
    max_output_length = max(len(output) for output in outputs)  # Step 4.1
    outputs = output_transform(outputs)
    mask = mask_transform(outputs)

    # Transpose
    inputs = inputs.T
    outputs = outputs.T
    mask = mask.T

    # Ensure boolean dtype for mask due to a torchtext bug.
    mask = mask.bool()

    return inputs.to(DEVICE), lengths.to("cpu"), outputs.to(DEVICE), mask.to(DEVICE), max_output_length

def _get_dataloader(data, batch_size):
    return DataLoader(data, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)

dataloader = _get_dataloader(final_movie_line_pairs, BATCH_SIZE)

We can then inspect the DataLoader and our batches a little bit.

In [None]:
test_dataloader = _get_dataloader(final_movie_line_pairs, 5)  # Small batch size as a test

inputs, lengths, outputs, mask, max_output_length = next(iter(test_dataloader))
print(f"Inputs: {inputs}")
print(f"Inputs size: {inputs.size()}")
print(f"Input Lengths: {lengths}")
print(f"\nOutputs: {outputs}")
print(f"Outputs size: {outputs.size()}")
print(f"Output Padding Mask: {mask}")
print(f"Padding mask size: {mask.size()}")
print(f"Max Output length: {max_output_length}")
print(f"\nPadding token value: {text_vocab['<pad>']}")

As a sanity check, note that the input lengths correspond to the number of items in each input sequence that aren't padding (i.e. whose value isn't the padding token value of `1`). In the output padding mask, a value of `0` corresponds to the positions in the `outputs` tensor where the padding token `1` is present. The maximum output length also matches the length of the longest unpadded output sequence.
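
The same checks can be reproduced on a toy batch with plain Python lists. This is only a sketch of the bookkeeping: the token indices are made up, and `PAD = 1` and `EOS = 3` are assumed to match the positions of `<pad>` and `<eos>` in the specials tuple used earlier.

```python
PAD, EOS = 1, 3  # assumed indices of <pad> and <eos>

def pad_batch(sequences, pad=PAD):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad] * (max_len - len(seq)) for seq in sequences]

def transpose(rows):
    """Turn a (batch, time) nested list into a (time, batch) nested list."""
    return [list(column) for column in zip(*rows)]

# Toy batch of index sequences, already sorted by decreasing length
batch = [[7, 8, 9], [5, 6], [4]]

with_eos = [seq + [EOS] for seq in batch]   # append EOS to every sequence
lengths = [len(seq) for seq in with_eos]    # lengths before padding
padded = pad_batch(with_eos)                # shape (batch, time)
mask = [[0 if token == PAD else 1 for token in seq] for seq in padded]
time_major = transpose(padded)              # shape (time, batch)

print(lengths)        # [4, 3, 2]
print(padded)         # [[7, 8, 9, 3], [5, 6, 3, 1], [4, 3, 1, 1]]
print(time_major[0])  # [7, 5, 4]
```

After the transpose, `time_major[t]` holds the token at time step `t` for every sequence in the batch, which is the layout the decoder loop needs.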

# Creating our Models
The brains of our chatbot is a sequence-to-sequence (seq2seq) model. The goal of a seq2seq model is to take a variable-length sequence as an input, and return a variable-length sequence as an output using a fixed-sized model.

[Sutskever et al.](https://arxiv.org/abs/1409.3215) discovered that by using two separate recurrent neural nets together, we can accomplish this task. One RNN acts as an **encoder**, which encodes a variable length input sequence to a fixed-length context vector. In theory, this context vector (the final hidden layer of the RNN) will contain semantic information about the query sentence that is input to the bot. The second RNN is a **decoder**, which takes an input word and the context vector, and returns a guess for the next word in the sequence and a hidden state to use in the next iteration.

![](https://pytorch.org/tutorials/_images/seq2seq_ts.png)



## Encoder
As our encoder, we will use an architecture very similar to what we saw in previous labs in this module: a bidirectional Recurrent Neural Network. As our RNN architecture, we will use GRUs.

We will of course also have an embedding layer, but unlike previous labs, it won't live in our model. We will want the embeddings to be consistent across our Encoder and Decoder so we will have the embedding layer be a reference to a pre-defined layer that will come in as an argument at Encoder instantiation time.

Besides that, the rest is just as in previous labs. To pass our sequences through the RNN, we will use the `pack_padded_sequence` utility along with the lengths we computed in our dataset preparation, and after we have run it through our recurrent model we will unpack it using `pad_packed_sequence`. We will also sum the bidirectional GRU outputs accordingly.

In [None]:
from torch import nn

class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding

        # Initialize GRU; the input_size and hidden_size params are both set to 'hidden_size'
        #   because our input size is a word embedding with number of features == hidden_size
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout), bidirectional=True)

    def forward(self, inputs, lengths, hidden=None):
        # Convert word indexes to embeddings & pack padded sequences
        embedded = self.embedding(inputs)
        packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths)

        # Forward pass through GRU & unpack
        outputs, hidden = self.gru(packed, hidden)
        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs)

        # Sum bidirectional GRU outputs & output
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, : ,self.hidden_size:]
        return outputs, hidden

## Decoder
The decoder RNN generates the response sentence in a token-by-token fashion. It uses the encoder’s context vectors, and internal hidden states to generate the next word in the sequence. It continues generating words until it outputs an EOS token, which if you remember we included in every line in our dataset processing pipelines.

A common problem with a vanilla seq2seq decoder is that if we rely solely on the context vector to encode the entire input sequence’s meaning, it is likely that we will have information loss. This is especially the case when dealing with long input sequences, greatly limiting the capability of our decoder.

To combat this, [Bahdanau et al.](https://arxiv.org/abs/1409.0473) created an “attention mechanism” that allows the decoder to pay attention to certain parts of the input sequence, rather than using the entire fixed context at every step.

At a high level, attention is calculated using the decoder’s current hidden state and the encoder’s outputs. The output attention weights have the same shape as the input sequence, allowing us to multiply them by the encoder outputs, giving us a weighted sum which indicates the parts of encoder output to pay attention to. [Sean Robertson’s](https://github.com/spro) figure describes this very well:

![](https://pytorch.org/tutorials/_images/attn2.png)

[Luong et al.](https://arxiv.org/abs/1508.04025) built on Bahdanau et al.’s groundwork by creating “Global attention”. As in Bahdanau et al.’s mechanism, “Global attention” considers all of the encoder’s hidden states (Luong et al. also describe a “Local attention” variant, which only looks at a window of them). The key difference is that with “Global attention”, we calculate attention weights, or energies, using the hidden state of the decoder from the current time step only, whereas Bahdanau et al.’s attention calculation requires knowledge of the decoder’s state from the previous time step. Also, Luong et al. provide various methods to calculate the attention energies between the encoder outputs and the decoder output, which are called “score functions”:

$score(\mathbf{h}_t,\bar{\mathbf{h}}_s)=\begin{cases}
    \mathbf{h}_t^T\bar{\mathbf{h}}_s & \text{dot}\\
    \mathbf{h}_t^T\mathbf{W}_\alpha \bar{\mathbf{h}}_s & \text{general}\\
    \mathbf{v}_\alpha^T \tanh (\mathbf{W}_\alpha[\mathbf{h}_t;\bar{\mathbf{h}}_s]) & \text{concat}\\
\end{cases}$

where $\mathbf{h}_t$ = current target decoder state and $\bar{\mathbf{h}}_s$ = all encoder states.
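
As a small worked example of the "dot" and "general" score functions, the sketch below uses plain Python lists instead of tensors. The vectors and the matrix `W` are made-up values, and the helper names are hypothetical; the softmax and weighted sum follow the high-level description of attention given earlier.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def dot_score(h_t, h_s):
    """score = h_t^T h_s"""
    return dot(h_t, h_s)

def general_score(h_t, h_s, W):
    """score = h_t^T W h_s"""
    return dot(h_t, matvec(W, h_s))

def softmax(scores):
    shifted = [math.exp(s - max(scores)) for s in scores]  # subtract max for stability
    total = sum(shifted)
    return [value / total for value in shifted]

h_t = [1.0, 0.0]                                       # current decoder state (toy)
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per source step (toy)
W = [[2.0, 0.0], [0.0, 1.0]]                           # toy weight matrix

dot_scores = [dot_score(h_t, h_s) for h_s in encoder_states]
general_scores = [general_score(h_t, h_s, W) for h_s in encoder_states]
weights = softmax(dot_scores)  # attention weights over the source steps

# Context vector: attention-weighted sum of the encoder states
context = [sum(w * h_s[i] for w, h_s in zip(weights, encoder_states)) for i in range(2)]

print(dot_scores)      # [1.0, 0.0, 1.0]
print(general_scores)  # [2.0, 0.0, 2.0]
```

The first and third encoder states align with `h_t`, so they receive equal and larger attention weights than the second one.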

Overall, the Global attention mechanism can be summarized by the following figure. Note that we will implement the “Attention Layer” as a separate `nn.Module` called `Attn`. The output of this module is a softmax normalized weights tensor of shape *(batch_size, 1, max_length)*.

![](https://pytorch.org/tutorials/_images/global_attn.png)

In [None]:
from torch import nn
import torch.nn.functional as F

class Attn(nn.Module):
    """Luong attention layer"""
    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()
        self.method = method
        # Different methods to compute attention energy according to the equation above
        if self.method not in ['dot', 'general', 'concat']:
            raise ValueError(f"{self.method} is not an appropriate attention method.")

        self.hidden_size = hidden_size
        if self.method == 'general':  # Weight matrix W for general attention method
            self.attn = nn.Linear(self.hidden_size, hidden_size)
        elif self.method == 'concat':  # Weight matrix W and weight vector v for concat attention method
            self.attn = nn.Linear(self.hidden_size * 2, hidden_size)
            # Random init; torch.FloatTensor(hidden_size) would leave the values uninitialized
            self.v = nn.Parameter(torch.rand(hidden_size))

    def dot_score(self, hidden, encoder_output):
        """Dot method implementation."""
        return torch.sum(hidden * encoder_output, dim=2)

    def general_score(self, hidden, encoder_output):
        """General method implementation."""
        energy = self.attn(encoder_output)
        return torch.sum(hidden * energy, dim=2)

    def concat_score(self, hidden, encoder_output):
        """Concat method implementation."""
        energy = self.attn(torch.cat((hidden.expand(encoder_output.size(0), -1, -1), encoder_output), 2)).tanh()
        return torch.sum(self.v * energy, dim=2)

    def forward(self, hidden, encoder_outputs):
        """Calculate the attention weights (energies) based on the given method"""
        if self.method == 'general':
            attn_energies = self.general_score(hidden, encoder_outputs)
        elif self.method == 'concat':
            attn_energies = self.concat_score(hidden, encoder_outputs)
        elif self.method == 'dot':
            attn_energies = self.dot_score(hidden, encoder_outputs)

        # Transpose max_length and batch_size dimensions
        attn_energies = attn_energies.t()

        # Return the softmax normalized probability scores (with added dimension)
        return F.softmax(attn_energies, dim=1).unsqueeze(1)
    
class LuongAttnDecoderRNN(nn.Module):
    def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(LuongAttnDecoderRNN, self).__init__()

        # Keep for reference
        self.attn_model = attn_model
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout

        # Define layers
        self.embedding = embedding
        self.embedding_dropout = nn.Dropout(dropout)
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

        self.attn = Attn(attn_model, hidden_size)

    def forward(self, input_step, last_hidden, encoder_outputs):
        embedded = self.embedding(input_step)  # 1. Get embedding of current word
        embedded = self.embedding_dropout(embedded)  # 1.5 Dropout on the embeddings
        
        rnn_output, hidden = self.gru(embedded, last_hidden)  # 2. Forward through unidirectional GRU
        attn_weights = self.attn(rnn_output, encoder_outputs)  # 3. Calculate attention weights from the current GRU output
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1)) # 4. Multiply attention weights to encoder outputs to get
                                                                    # new "weighted sum" context vector
        rnn_output = rnn_output.squeeze(0)  # 5. Concatenate weighted context vector and GRU output using Luong eq. 5
        context = context.squeeze(1)
        concat_input = torch.cat((rnn_output, context), 1)
        concat_output = torch.tanh(self.concat(concat_input))

        output = self.out(concat_output)  # 6. Predict next word using Luong eq. 6
        output = F.softmax(output, dim=1)  #    (softmax turns the scores into probabilities)
        return output, hidden  # 7. Return output and final hidden state

# Training
Since we are dealing with batches of padded sequences, we cannot simply consider all elements of the tensor when calculating loss. We define `maskNLLLoss` to calculate our loss based on our decoder’s output tensor, the target tensor, and a binary mask tensor describing the padding of the target tensor. This loss function calculates the average negative log likelihood of the elements that correspond to a 1 in the mask tensor.

In [None]:
def maskNLLLoss(inp, target, mask):
    nTotal = mask.sum()
    crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
    loss = crossEntropy.masked_select(mask).mean()
    loss = loss.to(DEVICE)
    return loss, nTotal.item()
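
The arithmetic behind the masked loss can be checked by hand on a tiny example. The sketch below is a pure-Python stand-in for a single time step, with made-up probabilities; `masked_nll` is a hypothetical helper and not part of the lab code.

```python
import math

def masked_nll(probs, targets, mask):
    """Average -log p(target) over the positions where the mask is True."""
    losses = [
        -math.log(row[target])
        for row, target, keep in zip(probs, targets, mask)
        if keep
    ]
    return sum(losses) / len(losses), len(losses)

# One time step for a batch of 3: each row is a distribution over 3 tokens
probs = [
    [0.50, 0.25, 0.25],
    [0.10, 0.80, 0.10],
    [0.20, 0.20, 0.60],
]
targets = [0, 1, 2]          # correct token index for each batch element
mask = [True, True, False]   # the third element is padding at this step

loss, n_total = masked_nll(probs, targets, mask)
print(loss, n_total)
```

Only the first two elements contribute, so the loss is the mean of `-log(0.5)` and `-log(0.8)`; the padded element is ignored no matter what the decoder predicted for it.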

We will use a couple of clever tricks to aid in convergence:

1. The first trick is using **teacher forcing**. This means that with some probability, set by `teacher_forcing_ratio`, we use the current target word as the decoder’s next input rather than the decoder’s current guess. This technique acts as training wheels for the decoder, aiding in more efficient training. However, teacher forcing can lead to model instability during inference, as the decoder may not have a sufficient chance to truly craft its own output sequences during training. Thus, we must be mindful of how we are setting the `teacher_forcing_ratio`, and not be fooled by fast convergence.

2. The second trick that we implement is **gradient clipping**. This is a commonly used technique for countering the “exploding gradient” problem. In essence, by clipping or thresholding gradients to a maximum value, we prevent the gradients from growing exponentially and either overflowing (`NaN`) or overshooting steep cliffs in the cost function.

![](https://pytorch.org/tutorials/_images/grad_clip.png)
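
As a rough illustration of norm-based clipping, the kind of rescaling that `clip_grad_norm_` applies, the sketch below clips a toy gradient vector in plain Python. The helper `clip_by_norm` and the small `eps` constant are assumptions made for this example.

```python
import math

def clip_by_norm(grads, max_norm, eps=1e-6):
    """Rescale grads so their L2 norm is at most max_norm; small gradients are untouched."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# A gradient with norm 5 is scaled down to (just under) norm 1
clipped, norm_before = clip_by_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

# A gradient that is already small is left unchanged
small, _ = clip_by_norm([0.3, 0.4], max_norm=1.0)

print(norm_before, norm_after)
print(small)
```

Because every component is multiplied by the same factor, clipping by norm shrinks the step size but keeps the direction of the gradient.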

Overall, the training process will involve the following sequence:

1. Forward pass entire input batch through encoder.
2. Initialize decoder inputs as the BOS token, and hidden state as the encoder’s final hidden state.
3. Forward the input batch through the decoder one time step at a time.
4. If teacher forcing: set next decoder input as the current target; else: set next decoder input as current decoder output.
5. Calculate and accumulate loss.
6. Perform backpropagation.
7. Clip gradients.
8. Update encoder and decoder model parameters.

In [None]:
def train_step(
    inputs, lengths,  # Input
    outputs, mask, max_target_len,  # Output
    encoder, decoder,  # Models
    encoder_optimizer, decoder_optimizer,  # Optimizers
    clip, teacher_forcing_ratio  # Hyper parameters.
):
    """Train function to be run for a single batch."""
    # Implicit batch size
    batch_size = inputs.size()[1]

    # Zero gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Lengths for rnn packing should always be on the CPU
    lengths = lengths.to("cpu")

    # Initialize variables
    loss, print_losses, n_totals = 0, [], 0

    # Forward pass through encoder
    encoder_outputs, encoder_hidden = encoder(inputs, lengths)

    # Create initial decoder input & Set initial decoder hidden state to the encoder's final hidden state
    decoder_input = torch.LongTensor([[text_vocab["<bos>"] for _ in range(batch_size)]]).to(DEVICE)
    decoder_hidden = encoder_hidden[:decoder.n_layers]

    # HACK Teacher Forcing - Determine if we are using teacher forcing this iteration
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    # Forward batch of sequences through decoder one time step at a time
    for t in range(max_target_len):
        decoder_output, decoder_hidden = decoder(
            decoder_input, decoder_hidden, encoder_outputs
        )

        if use_teacher_forcing:  # HACK Teacher forcing: next input is current target
            decoder_input = outputs[t].view(1, -1)
        else:  # next input is decoder's own current output
            _, topi = decoder_output.topk(1)
            # topi has shape (batch_size, 1); reshape to (1, batch_size) without leaving the device
            decoder_input = topi.view(1, -1).detach()

        # Calculate and accumulate loss
        mask_loss, nTotal = maskNLLLoss(decoder_output, outputs[t], mask[t])
        loss += mask_loss
        print_losses.append(mask_loss.item() * nTotal)
        n_totals += nTotal

    # Perform backpropagation
    loss.backward()

    # HACK Gradient clipping: gradients are modified in place
    _ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    _ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)

    # Adjust model weights
    encoder_optimizer.step()
    decoder_optimizer.step()

    return sum(print_losses) / n_totals
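
The value returned by `train_step` is a token-weighted average, not a plain mean of the per-step losses. The sketch below shows the difference on toy numbers. It assumes that `maskNLLLoss` returns the mean loss over the unmasked tokens of a time step together with the number of those tokens. The helper name `token_weighted_average` is hypothetical.

```python
def token_weighted_average(step_losses, step_counts):
    """Average per-step mean losses, weighting each step by its token count."""
    weighted_sum = sum(loss * count for loss, count in zip(step_losses, step_counts))
    return weighted_sum / sum(step_counts)


step_losses = [2.0, 4.0]  # mean loss at time steps 0 and 1
step_counts = [3, 1]      # unmasked (non-padding) tokens at each step

print(token_weighted_average(step_losses, step_counts))  # 2.5
print(sum(step_losses) / len(step_losses))               # 3.0 (unweighted mean)
```

Later time steps usually contain fewer real tokens because shorter sentences have already ended, so weighting by the token count keeps those sparse steps from dominating the reported loss.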

`train_step` handles a single batch only. We still need a function that runs it over every batch in the dataset, which makes up one epoch.

In [None]:
from tqdm import tqdm
from pathlib import Path

def train_epoch(
    iterator,
    encoder, decoder, encoder_optimizer, decoder_optimizer,
    clip, teacher_forcing_ratio
):
    """Training procedure for an epoch."""
    loss_sum, batches = 0, 0

    # Training loop 
    for batch in tqdm(iterator, desc="\tTraining"):
        # Extract fields from batch
        inputs, lengths, outputs, mask, max_target_len = batch

        # Run a training iteration with batch
        loss = train_step(
            inputs, lengths, outputs, mask, max_target_len,
            encoder, decoder, encoder_optimizer, decoder_optimizer,
            clip, teacher_forcing_ratio
        )
        loss_sum += loss
        batches += 1
    
    return loss_sum / batches

Last but not least, we need a function for the entire training process (all epochs). We will also save a checkpoint with PyTorch's `torch.save()` function whenever the training loss improves on the best value so far (in practice, usually every epoch).

In [None]:
def train(
    iterator, epochs,
    encoder, decoder, encoder_optimizer, decoder_optimizer,
    clip, teacher_forcing_ratio,
    encoder_n_layers, decoder_n_layers, hidden_size, attn_model
):
    """Training procedure for the entire model (multiple epochs)."""
    encoder.train()
    decoder.train()

    best_loss = float('inf')

    for epoch in range(epochs):
        print(f'Epoch: {epoch+1:02}')

        train_loss = train_epoch(iterator, encoder, decoder, encoder_optimizer, decoder_optimizer, clip, teacher_forcing_ratio)
        print(f'\tTrain Loss: {train_loss:.3f}')

        if train_loss < best_loss:
            best_loss = train_loss
            torch.save({
                "encoder": encoder.state_dict(),
                "decoder": decoder.state_dict(),
                "embedding": encoder.embedding.state_dict(),
                "loss": train_loss,
                "vocabulary": text_vocab,
                "encoder_n_layers": encoder_n_layers,
                "decoder_n_layers": decoder_n_layers,
                "hidden_size": hidden_size,
                "attn_model": attn_model,
            }, f"chatbot-{encoder_n_layers}-{decoder_n_layers}-{hidden_size}-{attn_model}.pt")

Now we can finally train our model. Feel free to play with the hyperparameters for better performance.

In [None]:
from torch import nn, optim

# Model parameters
HIDDEN_SIZE = 500
ENCODER_N_LAYERS = 2
DECODER_N_LAYERS = 2
ATTENTION_MODEL = "dot"
DROPOUT = 0.1

# Training length parameters
N_EPOCHS = 5
BATCH_SIZE = 64
ITERATOR = _get_dataloader(final_movie_line_pairs, BATCH_SIZE)

# Training hyper parameters
CLIP = 50.0
TEACHER_FORCING_RATIO = 1.0
LEARNING_RATE = 0.0001
DECODER_LEARNING_RATIO = 5.0

embedding = nn.Embedding(num_embeddings=len(text_vocab), embedding_dim=HIDDEN_SIZE)

encoder = EncoderRNN(
    hidden_size=HIDDEN_SIZE,
    embedding=embedding, 
    n_layers=ENCODER_N_LAYERS,
    dropout=DROPOUT
).to(DEVICE)
encoder_optimizer = optim.Adam(encoder.parameters(), lr=LEARNING_RATE)

decoder = LuongAttnDecoderRNN(
    attn_model=ATTENTION_MODEL,
    embedding=embedding,
    hidden_size=HIDDEN_SIZE,
    output_size=len(text_vocab),
    n_layers=DECODER_N_LAYERS,
    dropout=DROPOUT
).to(DEVICE)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=LEARNING_RATE * DECODER_LEARNING_RATIO)

# Run training!
print(f"Using {'GPU' if str(DEVICE) == 'cuda' else 'CPU'} for training.")
train(
    ITERATOR, N_EPOCHS,
    encoder, decoder, encoder_optimizer, decoder_optimizer,
    CLIP, TEACHER_FORCING_RATIO,
    ENCODER_N_LAYERS, DECODER_N_LAYERS, HIDDEN_SIZE, ATTENTION_MODEL
)

# Inference
In order to interact with the chatbot, we need to be able to convert its decoder's output from numbers to actual words. To do that, we will use greedy decoding, which is essentially what we're already doing in our training process when we're not using teacher forcing. At each time step, we'll choose the word from the decoder output with the highest softmax value. This choice is optimal at each individual time step, but it does not guarantee the most likely overall sequence.

To facilitate the greedy decoding operation, we'll define a `GreedySearchDecoder` class. When run, an object of this class takes an input sequence, a scalar input length tensor and a maximum length to bound the response sentence length.

The entire process will work as follows:
1. Forward input through the encoder.
2. Prepare the encoder's final hidden layer to be the first hidden input to the decoder.
3. Initialize the decoder's first input as the BOS token.
4. Initialize tensors to append decoded words to.
5. Iteratively decode one word at a time:
    1. Forward pass through decoder.
    2. Obtain most likely word token and its softmax score.
    3. Record token and score.
    4. Prepare current token to be next decoder input.
6. Return a collection of word tokens and scores.
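
Before the full implementation, the loop in step 5 can be illustrated with a toy model in plain Python. Everything here is an assumption made for illustration: the `toy_model` table stands in for the decoder's softmax output, and `greedy_decode` is a hypothetical helper. Unlike the `GreedySearchDecoder` in this lab, which always runs for `max_length` steps, this sketch stops as soon as it produces the end token.

```python
def greedy_decode(next_token_probs, start_token, end_token, max_length):
    """Pick the most probable next token at each step, for at most max_length steps."""
    tokens, scores = [], []
    current = start_token
    for _ in range(max_length):
        candidates = next_token_probs[current]
        # Greedy choice: take the highest-probability candidate right now
        current = max(candidates, key=candidates.get)
        tokens.append(current)
        scores.append(candidates[current])
        if current == end_token:
            break
    return tokens, scores


# Toy "model": probability of the next token given only the current token
toy_model = {
    "<bos>": {"i": 0.6, "you": 0.4},
    "i": {"am": 0.7, "was": 0.3},
    "you": {"are": 0.9, "<eos>": 0.1},
    "am": {"fine": 0.8, "<eos>": 0.2},
    "was": {"<eos>": 1.0},
    "are": {"<eos>": 1.0},
    "fine": {"<eos>": 1.0},
}

tokens, scores = greedy_decode(toy_model, "<bos>", "<eos>", max_length=10)
print(tokens)  # ['i', 'am', 'fine', '<eos>']
print(scores)  # [0.6, 0.7, 0.8, 1.0]
```

In this toy table the greedy path has an overall probability of about 0.336 (0.6 × 0.7 × 0.8 × 1.0), while the path `you are <eos>` would score about 0.36 (0.4 × 0.9 × 1.0). Greedy decoding misses it because it commits to `i` at the first step, which is one reason alternatives such as beam search exist.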

In [None]:
from torch import nn

class GreedySearchDecoder(nn.Module):
    def __init__(self, encoder, decoder, vocab):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.vocab = vocab

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)

        # Prepare encoder's final hidden layer to be first hidden input to the decoder & Initialize decoder input with the BOS token
        decoder_hidden = encoder_hidden[:self.decoder.n_layers]
        decoder_input = torch.ones(1, 1, device=DEVICE, dtype=torch.long) * self.vocab["<bos>"]

        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=DEVICE, dtype=torch.long)
        all_scores = torch.zeros([0], device=DEVICE)

        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder & Obtain most likely word token and its softmax score
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            decoder_scores, decoder_input = torch.max(decoder_output, dim=1)

            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)

            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)

        # Return collections of word tokens and scores
        return all_tokens, all_scores

Finally, we'll make a utility class called `Chatbot` that will provide an interface for interaction. It will also handle tokenization and pre-processing the text.

In [None]:
class Chatbot():
    def __init__(self, model_file, device, max_length=10):
        # map_location lets a checkpoint saved on GPU be loaded on a CPU-only machine
        self.model = torch.load(model_file, map_location=device)
        self.vocab = self.model["vocabulary"]
        self.hidden_size = self.model["hidden_size"]
        self.attn_model = self.model["attn_model"]
        self.device = device
        self.max_length = max_length

        self.embedding = nn.Embedding(num_embeddings=len(self.vocab), embedding_dim=self.hidden_size)
        self.embedding.load_state_dict(self.model["embedding"])
        
        self.encoder = EncoderRNN(self.hidden_size, self.embedding, self.model["encoder_n_layers"]).to(self.device)
        self.encoder.load_state_dict(self.model["encoder"])

        self.decoder = LuongAttnDecoderRNN(
            self.attn_model, self.embedding, self.hidden_size, len(self.vocab), self.model["decoder_n_layers"]
        ).to(self.device)
        self.decoder.load_state_dict(self.model["decoder"])

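        # Switch to evaluation mode so that dropout is disabled during inference
        self.encoder.eval()
        self.decoder.eval()

        self.searcher = GreedySearchDecoder(self.encoder, self.decoder, self.vocab)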
    
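    def _transform_input(self, text):
        # Apply the same preprocessing pipeline that was used for training data
        text = input_transform_common(text)
        lengths = lengths_transform(text)
        text = input_transform(text)
        text = text.T
        # Lengths stay on the CPU, as required for RNN packing
        return text.to(self.device), lengths.cpu()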
    
    def answer(self, question):
        processed_question, length = self._transform_input([question])
        # No gradients are needed at inference time
        with torch.no_grad():
            tokens, _scores = self.searcher(processed_question, length, self.max_length)
        itos = self.vocab.get_itos()  # index-to-string lookup, fetched once
        words = [itos[token.item()] for token in tokens]
        # Drop special tokens before joining the reply
        words = [word for word in words if word not in ("<eos>", "<pad>")]
        return " ".join(words)

    def chat(self):
        while True:
            try:
                question = input("> ")
                if question in ["q", "quit"]:
                    break
                
                _answer = self.answer(question)
                print("Bot: ", _answer)
            except KeyError:
                print("Error: Encountered unknown word.")


Let's try it out!

In [None]:
bot = Chatbot("chatbot-2-2-500-dot.pt", DEVICE)
bot.chat()