
# Train a Seq2Seq Chatbot


In this notebook I train a sequence-to-sequence RNN chatbot on the [Friends TV Show Script](https://www.kaggle.com/datasets/divyansh22/friends-tv-show-script?rvi=1).

I used most of the code from NLP week 9:

The code in this notebook is split into multiple files. Have a look in the folder `chatbot_src` to see the functions that define the network model (`model.py`) and the utility functions for getting the data into the right kind of structure (`util.py`).

This code is **significantly more complicated** than other PyTorch code we have seen. We are not expecting you to understand all of it now. Building a chatbot of this nature is a complex task, and this code is based on the work of many other people, listed below in the acknowledgments:

**Acknowledgments**

This notebook is adapted from the official [PyTorch chatbot tutorial](https://pytorch.org/tutorials/beginner/chatbot_tutorial.html) by [Matthew Inkawhich](https://github.com/MatthewInkawhich), which in turn borrows code from:

1) Yuan-Kuei Wu’s pytorch-chatbot implementation:
   https://github.com/ywk991112/pytorch-chatbot

2) Sean Robertson’s practical-pytorch seq2seq-translation example:
   https://github.com/spro/practical-pytorch/tree/master/seq2seq-translation

3) FloydHub Cornell Movie Corpus preprocessing code:
   https://github.com/floydhub/textutil-preprocess-cornell-movie-corpus


Let's do some imports:

In [1]:
import os
import csv
import torch
import random
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

In [2]:
device = 'cpu'

In [3]:
from chatbot_src.model import *
from chatbot_src.util import *

Define hyperparameters:

I changed the number of encoder and decoder layers from the original 2 to 3, with a hidden size of 500, and trained each setting for 4000 iterations. The aim was to see how the number of layers affects the model's performance.

In [4]:
clip = 50.0
teacher_forcing_ratio = 1.0
learning_rate = 0.0001
decoder_learning_ratio = 5.0
n_iteration = 4000
print_every = 1
save_every = 500

MIN_COUNT = 3    # Minimum word count threshold for trimming
MAX_LENGTH = 10  # Maximum sentence length to consider

# Configure models
model_name = 'cb_model'
attn_model = 'dot' # Alternatives: 'general', 'concat'
hidden_size = 500
encoder_n_layers = 3
decoder_n_layers = 3
dropout = 0.1
batch_size = 64

# Set checkpoint to load from; set to None if starting from scratch
loadFilename = None
checkpoint_iter = 4000
save_dir = 'Friends_ckpt_chatbot'
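
To get a feel for what going from 2 to 3 layers costs, here is a rough sketch that counts the trainable parameters of a plain `nn.GRU` stack with `hidden_size = 500`. This is only a stand-in: the real `EncoderRNN` and `LuongAttnDecoderRNN` live in `chatbot_src` and contain more than a GRU (and the encoder may be bidirectional), so treat the numbers as an order-of-magnitude guide. The helper name `gru_param_count` is made up for this example.

```python
import torch.nn as nn


def gru_param_count(hidden_size: int, n_layers: int) -> int:
    """Count trainable parameters of a unidirectional GRU stack."""
    # Stand-in module (assumption): input width equals hidden width, as when
    # an embedding of size hidden_size feeds the recurrent layers.
    gru = nn.GRU(hidden_size, hidden_size, n_layers)
    return sum(p.numel() for p in gru.parameters())


two_layers = gru_param_count(500, 2)
three_layers = gru_param_count(500, 3)
print(two_layers, three_layers, three_layers - two_layers)
```

Under these assumptions each extra layer adds 3 * (500 * 500 + 500 * 500 + 2 * 500) = 1,503,000 parameters, so the 3-layer setting has more capacity but also more to fit in the same 4000 iterations.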

#### Load dataset

In [5]:
corpus_name = 'friends'
data_file = 'dataset/Friends_pairs.txt'

Now let's read our data into pairs of text:

In [6]:
voc, pairs = loadPrepareData(corpus_name, data_file, MAX_LENGTH)
print("\npairs:")
for pair in pairs[:10]:
    print(pair)

Start preparing training data ...
Reading lines...
Read 58272 sentence pairs
Trimmed to 16700 sentence pairs
Counting words...
Counted words: 5575

pairs:
['instead of . . . ?', 'that s right .']
['that s right .', 'never had that dream .']
['never had that dream .', 'no .']
['hi .', 'this guy says hello i wanna kill myself .']
['this guy says hello i wanna kill myself .', 'are you okay sweetie ?']
['cookie ?', 'carol moved her stuff out today .']
['carol moved her stuff out today .', 'ohh .']
['ohh .', 'let me get you some coffee .']
['let me get you some coffee .', 'thanks .']
['thanks .', 'ooh ! oh !']
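
The pairs above are lowercased, have their punctuation spaced out, and lose characters such as apostrophes. `loadPrepareData` and `normalizeString` are defined in `chatbot_src`, so the following is only a sketch of what I assume they do, based on the PyTorch tutorial. The names `normalize_string` and `keep_pair` are hypothetical, and the strict `<` length comparison is an assumption.

```python
import re
import unicodedata

MAX_LENGTH = 10  # same threshold as in the hyperparameter cell


def normalize_string(s: str) -> str:
    """Lowercase, strip accents, space out . ! ? and drop other symbols."""
    s = "".join(
        c for c in unicodedata.normalize("NFD", s.lower().strip())
        if unicodedata.category(c) != "Mn"
    )
    s = re.sub(r"([.!?])", r" \1", s)     # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", " ", s)  # everything else becomes a space
    return re.sub(r"\s+", " ", s).strip()


def keep_pair(pair, max_length=MAX_LENGTH):
    """Keep a pair only if both sentences have fewer than max_length tokens."""
    return all(len(sentence.split(" ")) < max_length for sentence in pair)


print(normalize_string("That's right."))   # that s right .
print(normalize_string("Instead of...?"))  # instead of . . . ?
print(keep_pair(["hi .", "are you okay sweetie ?"]))
```

A length filter of this kind would explain why the 58272 pairs read from the file shrink to 16700.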


Let's remove pairs that have rare words in them.

In [7]:
pairs = trimRareWords(voc, pairs, MIN_COUNT)

keep_words 2425 / 5572 = 0.4352
Trimmed from 16700 pairs to 13172, 0.7887 of total
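
The idea behind the trimming can be sketched in a few lines. This is not the `trimRareWords` from `chatbot_src` (which works on the `voc` object); it is a simplified version that counts words directly from the pairs, and `trim_rare_pairs` is a name made up for this example.

```python
from collections import Counter


def trim_rare_pairs(pairs, min_count):
    """Drop every pair that contains a word seen fewer than min_count times."""
    counts = Counter(
        word for pair in pairs for sentence in pair for word in sentence.split(" ")
    )
    keep_words = {word for word, count in counts.items() if count >= min_count}
    return [
        pair for pair in pairs
        if all(word in keep_words for sentence in pair for word in sentence.split(" "))
    ]


toy_pairs = [
    ["hi .", "hi there ."],   # "there" appears only once
    ["hi .", "thanks ."],
    ["thanks .", "hi ."],
]
kept = trim_rare_pairs(toy_pairs, min_count=2)
print(kept)  # the first pair is dropped
```

A single rare word removes the whole pair, which is why dropping a little over half of the vocabulary still leaves about 79% of the pairs.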


Let's sample a small batch for validation:

In [8]:
# Example for validation
small_batch_size = 5
batches = batch2TrainData(voc, [random.choice(pairs) for _ in range(small_batch_size)])
input_variable, lengths, target_variable, mask, max_target_len = batches

print("input_variable:", input_variable)
print("lengths:", lengths)
print("target_variable:", target_variable)
print("mask:", mask)
print("max_target_len:", max_target_len)

input_variable: tensor([[  43, 1394,  215,  106,  106],
        [  13, 1394,    6,   42,    5],
        [  42,    5,    2,    2,    2],
        [  43,    5,    0,    0,    0],
        [  42,    5,    0,    0,    0],
        [ 158,    2,    0,    0,    0],
        [  42,    0,    0,    0,    0],
        [  42,    0,    0,    0,    0],
        [  42,    0,    0,    0,    0],
        [   2,    0,    0,    0,    0]])
lengths: tensor([10,  6,  3,  3,  3])
target_variable: tensor([[352,  89, 554, 139, 106],
        [  6,   6,  42, 112,   5],
        [  2,   2,   2, 175,   2],
        [  0,   0,   0,  24,   0],
        [  0,   0,   0, 191,   0],
        [  0,   0,   0,   6,   0],
        [  0,   0,   0,   2,   0]])
mask: tensor([[ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [False, False, False,  True, False],
        [False, False, False,  True, False],
        [False, False, False,  True, False],
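
The layout of these tensors is easier to see on a toy example. The sketch below uses plain Python lists and assumes that index 0 is the padding token and 2 is the end-of-sentence token, which is what the printed tensors suggest. `zero_padding` and `binary_mask` are hypothetical names; the real helpers are in `chatbot_src`.

```python
from itertools import zip_longest

PAD_token = 0  # assumption: index 0 is padding, as the zeros above suggest


def zero_padding(index_lists, fillvalue=PAD_token):
    """Transpose to (max_len, batch) and pad the shorter sequences."""
    return [list(row) for row in zip_longest(*index_lists, fillvalue=fillvalue)]


def binary_mask(padded, pad=PAD_token):
    """True where a real token sits, False where the cell is padding."""
    return [[token != pad for token in row] for row in padded]


# Three toy target sentences of lengths 3, 4 and 2 (2 stands in for EOS)
targets = [[352, 6, 2], [139, 112, 175, 2], [106, 2]]
padded = zero_padding(targets)
mask = binary_mask(padded)
max_target_len = len(padded)

for row, mask_row in zip(padded, mask):
    print(row, mask_row)
```

Each column is one sentence and each row is one time step, so the decoder can process time step `t` for the whole batch at once and use `mask[t]` to ignore the padded cells.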
      

#### Define train loop

In [9]:
def train(input_variable, lengths, target_variable, mask, max_target_len, encoder, decoder, embedding,
          encoder_optimizer, decoder_optimizer, batch_size, clip, max_length=MAX_LENGTH):

    # Zero gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Set device options
    input_variable = input_variable.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)
    # Lengths for RNN packing should always be on the CPU
    lengths = lengths.to("cpu")

    # Initialize variables
    loss = 0
    print_losses = []
    n_totals = 0

    # Forward pass through encoder
    encoder_outputs, encoder_hidden = encoder(input_variable, lengths)

    # Create initial decoder input (start with SOS tokens for each sentence)
    decoder_input = torch.LongTensor([[SOS_token for _ in range(batch_size)]])
    decoder_input = decoder_input.to(device)

    # Set initial decoder hidden state to the encoder's final hidden state
    decoder_hidden = encoder_hidden[:decoder.n_layers]

    # Determine if we are using teacher forcing this iteration
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    # Forward batch of sequences through decoder one time step at a time
    if use_teacher_forcing:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # Teacher forcing: next input is current target
            decoder_input = target_variable[t].view(1, -1)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss.to(device)
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
    else:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # No teacher forcing: next input is decoder's own current output
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])
            decoder_input = decoder_input.to(device)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss.to(device)
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal

    # Perform backpropagation
    loss.backward()

    # Clip gradients: gradients are modified in place
    _ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    _ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)

    # Adjust model weights
    encoder_optimizer.step()
    decoder_optimizer.step()

    return sum(print_losses) / n_totals
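
`train` relies on `maskNLLLoss`, which is imported from `chatbot_src` and not shown here. As a sketch of what I assume it computes (following the PyTorch tutorial), the function below takes the decoder's probabilities for one time step and averages the negative log-likelihood over the positions that are not padding. The name `mask_nll_loss` and the toy tensors are made up for illustration.

```python
import math

import torch


def mask_nll_loss(probs, target, mask):
    """Mean negative log-likelihood over the unmasked positions of one time step.

    probs:  (batch, vocab) probabilities from the decoder
    target: (batch,) gold token indexes
    mask:   (batch,) bool, False where the target is padding
    """
    n_total = mask.sum().item()
    # Probability the decoder assigned to each gold token
    gold_probs = torch.gather(probs, 1, target.view(-1, 1)).squeeze(1)
    loss = -torch.log(gold_probs).masked_select(mask).mean()
    return loss, n_total


probs = torch.tensor([[0.5, 0.5], [0.25, 0.75]])
target = torch.tensor([0, 1])
mask = torch.tensor([True, False])  # the second sentence is already padding
loss, n_total = mask_nll_loss(probs, target, mask)
print(loss.item(), n_total, math.log(2))
```

Only the first sentence counts here, so the loss is -log(0.5). This is also why `train` multiplies each step's loss by `nTotal` before dividing by `n_totals`: the returned value is an average per real token, not per time step.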

### Training iterations

In [10]:
def trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer, embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size, print_every, save_every, clip, corpus_name, loadFilename):

    # Load batches for each iteration
    training_batches = [batch2TrainData(voc, [random.choice(pairs) for _ in range(batch_size)])
                      for _ in range(n_iteration)]

    # Initializations
    print('Initializing ...')
    start_iteration = 1
    print_loss = 0
    if loadFilename:
        start_iteration = checkpoint['iteration'] + 1

    # Training loop
    print("Training...")
    for iteration in range(start_iteration, n_iteration + 1):
        training_batch = training_batches[iteration - 1]
        # Extract fields from batch
        input_variable, lengths, target_variable, mask, max_target_len = training_batch

        # Run a training iteration with batch
        loss = train(input_variable, lengths, target_variable, mask, max_target_len, encoder,
                     decoder, embedding, encoder_optimizer, decoder_optimizer, batch_size, clip)
        print_loss += loss

        # Print progress
        if iteration % print_every == 0:
            print_loss_avg = print_loss / print_every
            print("Iteration: {}; Percent complete: {:.1f}%; Average loss: {:.4f}".format(iteration, iteration / n_iteration * 100, print_loss_avg))
            print_loss = 0

        # Save checkpoint
        if (iteration % save_every == 0):
            directory = save_dir
            if not os.path.exists(directory):
                os.makedirs(directory)
            torch.save({
                'iteration': iteration,
                'en': encoder.state_dict(),
                'de': decoder.state_dict(),
                'en_opt': encoder_optimizer.state_dict(),
                'de_opt': decoder_optimizer.state_dict(),
                'loss': loss,
                'voc_dict': voc.__dict__,
                'embedding': embedding.state_dict()
            }, os.path.join(directory, '{}_{}.pt'.format(iteration, 'checkpoint')))
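
The checkpoint is an ordinary dictionary, so resuming is a matter of loading it back and restoring each `state_dict`. Here is a minimal round trip under simplifying assumptions: a tiny `nn.Linear` stands in for the encoder, only three of the keys are saved, and a temporary directory replaces `save_dir`.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # hypothetical stand-in for the encoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "{}_{}.pt".format(500, "checkpoint"))
    torch.save({
        "iteration": 500,
        "en": model.state_dict(),
        "en_opt": optimizer.state_dict(),
    }, path)

    # map_location lets a checkpoint saved on a GPU load on a CPU-only machine
    checkpoint = torch.load(path, map_location=torch.device("cpu"))

restored = nn.Linear(4, 2)
restored.load_state_dict(checkpoint["en"])
start_iteration = checkpoint["iteration"] + 1
print(start_iteration)
```

With `save_every = 500` and `n_iteration = 4000`, the loop above writes eight such files, named `500_checkpoint.pt` up to `4000_checkpoint.pt`.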

### Evaluate my text

In [11]:
def evaluate(encoder, decoder, searcher, voc, sentence, max_length=MAX_LENGTH):
    ### Format input sentence as a batch
    # words -> indexes
    indexes_batch = [indexesFromSentence(voc, sentence)]
    # Create lengths tensor
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    # Transpose dimensions of batch to match models' expectations
    input_batch = torch.LongTensor(indexes_batch).transpose(0, 1)
    # Use appropriate device
    input_batch = input_batch.to(device)
    lengths = lengths.to("cpu")
    # Decode sentence with searcher
    tokens, scores = searcher(input_batch, lengths, max_length, device)
    # indexes -> words
    decoded_words = [voc.index2word[token.item()] for token in tokens]
    return decoded_words


def evaluateInput(encoder, decoder, searcher, voc):
    input_sentence = ''
    while True:
        try:
            # Get input sentence
            input_sentence = input('> ')
            # Check if it is quit case
            if input_sentence == 'q' or input_sentence == 'quit': break
            # Normalize sentence
            input_sentence = normalizeString(input_sentence)
            # Evaluate sentence
            output_words = evaluate(encoder, decoder, searcher, voc, input_sentence)
            # Format and print response sentence
            output_words[:] = [x for x in output_words if not (x == 'EOS' or x == 'PAD')]
            print('Bot:', ' '.join(output_words))

        except KeyError:
            print("Error: Encountered unknown word.")
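
The `searcher` passed to `evaluate` is not defined in this notebook. Assuming it is a greedy decoder like the one in the PyTorch tutorial, the core idea is to pick the most probable token at each step and feed it back in as the next input. The sketch below shows that loop with a toy transition table in place of the real decoder; `greedy_decode`, `step_fn`, and `transition` are all made up for this example.

```python
import torch


def greedy_decode(step_fn, sos_token, max_length):
    """Pick the most probable token at every step and feed it back in."""
    token = torch.tensor([sos_token])
    tokens, scores = [], []
    for _ in range(max_length):
        probs = step_fn(token)  # (1, vocab) probabilities
        score, token = torch.max(probs, dim=1)
        tokens.append(token.item())
        scores.append(score.item())
    return tokens, scores


# Toy "decoder": token i is most likely followed by token (i + 1) % 4
transition = 0.9 * torch.roll(torch.eye(4), shifts=1, dims=1) + 0.025

tokens, scores = greedy_decode(lambda tok: transition[tok], sos_token=0, max_length=3)
print(tokens)  # [1, 2, 3]
```

Greedy search is cheap but never revisits a choice, so one poor early token can derail the rest of the reply.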

#### Setup models

In [12]:
# Load model if a ``loadFilename`` is provided
if loadFilename:
    # map_location places the checkpoint's tensors on the configured device,
    # so a model trained on a GPU also loads on a CPU-only machine
    checkpoint = torch.load(loadFilename, map_location=device)
    encoder_sd = checkpoint['en']
    decoder_sd = checkpoint['de']
    encoder_optimizer_sd = checkpoint['en_opt']
    decoder_optimizer_sd = checkpoint['de_opt']
    embedding_sd = checkpoint['embedding']
    voc.__dict__ = checkpoint['voc_dict']


print('Building encoder and decoder ...')
# Initialize word embeddings
embedding = nn.Embedding(voc.num_words, hidden_size)
if loadFilename:
    embedding.load_state_dict(embedding_sd)
# Initialize encoder & decoder models
encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
decoder = LuongAttnDecoderRNN(attn_model, embedding, hidden_size, voc.num_words, decoder_n_layers, dropout)
if loadFilename:
    encoder.load_state_dict(encoder_sd)
    decoder.load_state_dict(decoder_sd)
# Use appropriate device
encoder = encoder.to(device)
decoder = decoder.to(device)
print('Models built and ready to go!')

Building encoder and decoder ...
Models built and ready to go!
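
To make the tensor shapes concrete, here is a sketch with a stand-in for the encoder's recurrent core. I am assuming `EncoderRNN` wraps a bidirectional `nn.GRU` and sums the two directions, as in the PyTorch tutorial; the actual class is in `chatbot_src`. The sizes are toy values so the cell runs instantly.

```python
import torch
import torch.nn as nn

hidden_size, n_layers, batch, seq_len = 8, 3, 5, 10  # small toy sizes

# Stand-in for the encoder's recurrent core (assumed bidirectional GRU)
gru = nn.GRU(hidden_size, hidden_size, n_layers, bidirectional=True)
embedded = torch.randn(seq_len, batch, hidden_size)  # (seq_len, batch, hidden)
outputs, hidden = gru(embedded)

# Sum forward and backward outputs so attention sees hidden_size features
outputs = outputs[:, :, :hidden_size] + outputs[:, :, hidden_size:]

# Keep as many hidden-state slices as the decoder has layers
decoder_hidden = hidden[:n_layers]
print(outputs.shape, hidden.shape, decoder_hidden.shape)
```

Under this assumption the encoder returns `2 * n_layers` hidden-state slices, and `encoder_hidden[:decoder.n_layers]` in `train` trims them to the shape the decoder expects. That slicing only lines up cleanly when `encoder_n_layers` is at least `decoder_n_layers`, which holds for the 3 and 3 used here.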


### Run Training

Here we can finally run training.

In [13]:
# Ensure dropout layers are in train mode
encoder.train()
decoder.train()

# Initialize optimizers
print('Building optimizers ...')
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)
if loadFilename:
    encoder_optimizer.load_state_dict(encoder_optimizer_sd)
    decoder_optimizer.load_state_dict(decoder_optimizer_sd)

# Move any loaded optimizer state onto the same device as the models
for state in encoder_optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)

for state in decoder_optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)

# Run training iterations
print("Starting Training!")
trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer,
           embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size,
           print_every, save_every, clip, corpus_name, loadFilename)

Building optimizers ...
Starting Training!
Initializing ...
Training...
Iteration: 1; Percent complete: 0.0%; Average loss: 7.8081
Iteration: 2; Percent complete: 0.1%; Average loss: 7.7026
Iteration: 3; Percent complete: 0.1%; Average loss: 7.5580
Iteration: 4; Percent complete: 0.1%; Average loss: 7.2569
Iteration: 5; Percent complete: 0.1%; Average loss: 6.7680
Iteration: 6; Percent complete: 0.1%; Average loss: 6.1044
Iteration: 7; Percent complete: 0.2%; Average loss: 6.0494
Iteration: 8; Percent complete: 0.2%; Average loss: 6.1461
Iteration: 9; Percent complete: 0.2%; Average loss: 5.7875
Iteration: 10; Percent complete: 0.2%; Average loss: 5.4410
Iteration: 11; Percent complete: 0.3%; Average loss: 5.1402
Iteration: 12; Percent complete: 0.3%; Average loss: 5.0905
Iteration: 13; Percent complete: 0.3%; Average loss: 5.1736
Iteration: 14; Percent complete: 0.4%; Average loss: 5.1896
Iteration: 15; Percent complete: 0.4%; Average loss: 5.0574
Iteration: 16; Percent complete: 0.4%