# Tutorial: Transformer for Dialogue
This tutorial will go over the process of implementing a transformer for dialogue. 

Before running, make sure you have the "data_small" and "pretrained_model" folders in the same directory as this file. Both folders can be downloaded from the Dropbox link in the README.

The transformer is described in the paper Attention Is All You Need (https://arxiv.org/abs/1706.03762).

Dataset: Open subtitles - http://opus.nlpl.eu/OpenSubtitles-v2018.php

Transformer code - https://github.com/jadore801120/attention-is-all-you-need-pytorch

# Import libraries

In [34]:
import numpy as np
import json
import torch
import torch.nn.functional as F
import os
import random
from tqdm import tqdm
import ipywidgets as widgets

import transformer
from transformer.Models import Transformer
from transformer.Translator import Chatbot
from dataset import DialogueDataset, Vocab

# Load config 

Now, load the config file. This file contains all of the hyperparameters for the experiment. 

If you want to change the parameters, change them in the config.json file.
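
If you would rather experiment without editing the file each time, one option is to load it and override a few keys in memory. The sketch below is only an illustration: `load_config` and its `overrides` argument are hypothetical helpers that are not part of this tutorial's code, and the demo writes a throwaway config file so that it runs on its own.

```python
import json
import os
import tempfile


def load_config(path, overrides=None):
    """Load a JSON config and apply optional in-memory overrides.

    Hypothetical helper: unknown keys are rejected so that a typo
    in a hyperparameter name does not go unnoticed.
    """
    with open(path, "r") as f:
        config = json.load(f)
    for key, value in (overrides or {}).items():
        if key not in config:
            raise KeyError("unknown config key: {}".format(key))
        config[key] = value
    return config


# demo with a throwaway file so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.json")
    with open(demo_path, "w") as f:
        json.dump({"num_epochs": 5, "dropout": 0.3}, f)
    demo_config = load_config(demo_path, overrides={"dropout": 0.1})

print(demo_config)
```

Rejecting unknown keys is a design choice for this sketch; silently adding them would also work but hides misspelled parameter names.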

In [52]:
# load config
with open("config.json", "r") as f:
    config = json.load(f)

for key, data in config.items():
    print("{}: {}".format(key, data))

dataset_filename: data_small
output_dir: exp_1/
run_name: run_1/
old_model_dir: run_5
num_epochs: 5
history_len: 50
response_len: 15
embedding_dim: 512
model_dim: 512
inner_dim: 2048
num_layers: 6
num_heads: 8
dim_k: 64
dim_v: 64
dropout: 0.3
min_count: 1
train_batch_size: 200
val_batch_size: 25
warmup_steps: 4000
a_nice_note: baseline test
label_smoothing: False
train_len: 1999
vocab_size: 11507
device: cpu
beam_size: 4
n_best: 4
choose_best: False


In [36]:
# create the output dir to save the model and results in
# (makedirs also creates missing parent directories)
os.makedirs(config["output_dir"], exist_ok=True)

# Load Data

Next we will create our training and validation dataset objects.

The dataset takes the dataset filename, the max length for the history, and the max length for the response. You can initialize the vocab with an existing vocab object by passing it in. There is also a setting to not update the vocab with the new documents. This is useful for running pretrained models, where you need the same vocab as the old model.

We want the two datasets to share the same vocab, so the validation dataset is initialized with the training vocab, and the updated vocab from the validation dataset is then set on the train dataset.
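
To make the vocab sharing concrete, here is a minimal sketch using a toy vocabulary. `ToyVocab` and `build_dataset` are hypothetical stand-ins, not the tutorial's `Vocab` and `DialogueDataset` classes, and the `update_vocab` flag is an assumed name for the "do not update the vocab" setting.

```python
class ToyVocab:
    """Hypothetical stand-in for the tutorial's Vocab class."""

    def __init__(self):
        self.token2id = {"<blank>": 0, "<unk>": 1}

    def add_tokens(self, tokens):
        # new tokens get the next free id; known tokens keep theirs
        for token in tokens:
            if token not in self.token2id:
                self.token2id[token] = len(self.token2id)

    def encode(self, tokens):
        unk = self.token2id["<unk>"]
        return [self.token2id.get(token, unk) for token in tokens]

    def __len__(self):
        return len(self.token2id)


def build_dataset(sentences, vocab=None, update_vocab=True):
    """Return (encoded sentences, vocab), reusing `vocab` if one is given."""
    vocab = vocab if vocab is not None else ToyVocab()
    if update_vocab:
        for sentence in sentences:
            vocab.add_tokens(sentence.split())
    return [vocab.encode(s.split()) for s in sentences], vocab


train_ids, train_vocab = build_dataset(["hello there", "how are you"])
# validation reuses the training vocab, so shared words get the same ids
val_ids, shared_vocab = build_dataset(["hello you two"], vocab=train_vocab)
# frozen vocab: unseen words map to <unk> instead of growing the vocab
frozen_ids, _ = build_dataset(
    ["hello stranger"], vocab=shared_vocab, update_vocab=False)

print(val_ids, frozen_ids, len(shared_vocab))
```

Because ids are only ever appended, sentences encoded before the vocab grew remain valid afterwards.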

In [37]:
# create train dataset
train_dataset = DialogueDataset(
            os.path.join(config["dataset_filename"], "train.csv"),
            config["history_len"],
            config["response_len"])

# create validation dataset
val_dataset = DialogueDataset(
            os.path.join(config["dataset_filename"], "val.csv"),
            config["history_len"],
            config["response_len"],
            train_dataset.vocab)

# set vocab:
vocab = val_dataset.vocab
train_dataset.vocab = vocab
config["vocab_size"] = len(vocab)
vocab.save_to_dict(os.path.join(config["output_dir"], "vocab.json"))

# print info
print("train_len: {}\nval_len: {}\nvocab_size: {}".format(len(train_dataset), len(val_dataset), len(vocab)))

train_len: 19999
val_len: 1999
vocab_size: 11507


Dataloaders for the model are initialized with the datasets.

We want to shuffle the train dataset, but the order does not matter for validation.

In [38]:
# initialize dataloaders
data_loader_train = torch.utils.data.DataLoader(
            train_dataset, config["train_batch_size"], shuffle=True)
data_loader_val = torch.utils.data.DataLoader(
            val_dataset, config["val_batch_size"], shuffle=False)


# Create Model
The transformer model is initialized with the parameters in the config file. You can change these parameters to improve the model.

In [39]:
# initialize device ('cuda', or 'cpu')
device = torch.device(config["device"])

In [40]:
# create model
model = Transformer(
    config["vocab_size"],
    config["vocab_size"],
    config["history_len"],
    config["response_len"],
    d_word_vec=config["embedding_dim"],
    d_model=config["model_dim"],
    d_inner=config["inner_dim"],
    n_layers=config["num_layers"],
    n_head=config["num_heads"],
    d_k=config["dim_k"],
    d_v=config["dim_v"],
    dropout=config["dropout"]
).to(device)

# Create Optimizer

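In the transformer paper, the learning rate changes during training: it increases linearly for the first warmup steps, then decays with the inverse square root of the step number. To do this, we will make a scheduled optimizer wrapper class.

We use an Adam optimizer.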
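
The schedule used by the wrapper class in this section is lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). As a quick way to see its shape, the sketch below evaluates that formula at a few steps. The function name `noam_lr` is made up for this illustration, and the default values are the `model_dim` and `warmup_steps` from the config printed earlier.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given step (steps are counted from 1)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# the rate rises until warmup_steps, then falls
for step in (1, 1000, 4000, 16000):
    print("step {:>5}: lr = {:.6f}".format(step, noam_lr(step)))
```

The two arguments of `min` are equal exactly at `step == warmup_steps`, which is where the learning rate peaks.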

In [41]:
# optimizer class for updating the learning rate
class ScheduledOptim():
    '''A simple wrapper class for learning rate scheduling'''

    def __init__(self, optimizer, d_model, n_warmup_steps):
        self.optimizer = optimizer
        self.n_warmup_steps = n_warmup_steps
        self.n_current_steps = 0
        self.init_lr = np.power(d_model, -0.5)

    def step_and_update_lr(self):
        "Step with the inner optimizer"
        self._update_learning_rate()
        self.optimizer.step()

    def zero_grad(self):
        "Zero out the gradients by the inner optimizer"
        self.optimizer.zero_grad()

    def _get_lr_scale(self):
        return np.min([
            np.power(self.n_current_steps, -0.5),
            np.power(self.n_warmup_steps, -1.5) * self.n_current_steps])

    def _update_learning_rate(self):
        ''' Learning rate scheduling per step '''

        self.n_current_steps += 1
        lr = self.init_lr * self._get_lr_scale()

        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr


In [42]:
# create optimizer
optimizer = torch.optim.Adam(
    filter(lambda x: x.requires_grad, model.parameters()),
    betas=(0.9, 0.98), eps=1e-09)
# create a scheduled optimizer object
optimizer = ScheduledOptim(
    optimizer, config["model_dim"], config["warmup_steps"])

# Load Pretrained Model
If you want to run a pretrained model, change "old_model_dir" in the config from None (null in config.json) to the directory that contains the pretrained model.

The old model must be used with the vocab it was trained with, so that vocab is loaded as well.

In [43]:
def save_checkpoint(filename, model, optimizer):
    '''
    saves the model and optimizer state dicts to a file
    :param filename: path of the checkpoint file to write
    :param model: model whose weights are saved
    :param optimizer: optimizer whose state is saved
    :return: None
    '''
    state = {
        'model': model.state_dict(),
        'optimizer' : optimizer.state_dict(),
        }
    torch.save(state, filename)

In [44]:
def load_checkpoint(filename, model, optimizer, device):
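    '''
    loads a previous model, if the checkpoint file exists
    :param filename: file name of the checkpoint
    :param model: model with the same parameters as the one you are loading
    :param optimizer: optimizer to restore the saved state into
    :param device: device to map the loaded tensors to
    :return: loaded model, loaded optimizer
    '''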
    if os.path.isfile(filename):
        checkpoint = torch.load(filename, map_location=device)
        model.load_state_dict(checkpoint['model'])
        optimizer.load_state_dict(checkpoint['optimizer'])
    return model, optimizer


In [45]:
if config["old_model_dir"] is not None:
    model, optimizer.optimizer = load_checkpoint(os.path.join(config["old_model_dir"], "model.bin"),
                                                model, optimizer.optimizer, device)
    vocab.load_from_dict(os.path.join(config["old_model_dir"], "vocab.json"))

# Output an Example
Sometimes it is useful to see what the model is doing, so we will create a function that outputs an example from the validation set along with the model's prediction.

In [13]:
def output_example(model, val_dataset, device, vocab):
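    '''output an example and the model's prediction for that example'''
    # randint includes both end points, so randint(0, len) could index
    # one past the end; randrange excludes the upper bound
    random_index = random.randrange(len(val_dataset))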
    example = val_dataset[random_index]

    # prepare data
    h_seq, h_pos, h_seg, r_seq, r_pos = map(
        lambda x: torch.from_numpy(x).to(device).unsqueeze(0), example)

    # drop the first token (the start marker), which the model never has to predict
    gold = r_seq[:, 1:]

    # forward
    pred = model(h_seq, h_pos, h_seg, r_seq, r_pos)
    output = torch.argmax(pred, dim=1)

    # get history text
    string = "history: "
    seg = -1
    for i, idx in enumerate(h_seg.squeeze()):
        if seg != idx.item():
            string+="\n"
            seg=idx.item()
        token = vocab.id2token[h_seq.squeeze()[i].item()]
        if token != '<blank>':
            string += "{} ".format(token)

    # get target text
    string += "\nTarget:\n"
    for idx in gold.squeeze():
        token = vocab.id2token[idx.item()]
        string += "{} ".format(token)

    # get prediction
    string += "\n\nPrediction:\n"
    for idx in output:
        token = vocab.id2token[idx.item()]
        string += "{} ".format(token)

    # print
    print("\n------------------------\n")
    print(string)
    print("\n------------------------\n")

# Calculate Performance

First calculate the loss, with or without smoothing.

In Attention Is All You Need, label smoothing is applied to the loss function. This makes the model more "unsure", which improves accuracy. However, it hurts perplexity (perplexity goes up).
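
As a small worked example of label smoothing, the sketch below builds the smoothed target distribution with the same eps of 0.1 used in the loss function in this section and compares the resulting loss with the unsmoothed one. The helper names and the logits are made up for illustration, and plain Python lists are used instead of tensors.

```python
import math


def smoothed_targets(gold, n_class, eps=0.1):
    """Put 1 - eps on the gold class and spread eps over the other classes."""
    return [1 - eps if c == gold else eps / (n_class - 1)
            for c in range(n_class)]


def cross_entropy(target, logits):
    """Cross entropy between a target distribution and softmax(logits)."""
    log_norm = math.log(sum(math.exp(x) for x in logits))
    return -sum(t * (x - log_norm) for t, x in zip(target, logits))


logits = [4.0, 1.0, 0.5, 0.5]  # made-up scores; class 0 is the gold token
hard = [1.0, 0.0, 0.0, 0.0]    # one-hot target, no smoothing
soft = smoothed_targets(gold=0, n_class=4)

print(soft)
print(cross_entropy(hard, logits), cross_entropy(soft, logits))
```

For a confident, correct prediction like this one, the smoothed loss is larger than the unsmoothed loss, which is why the reported loss and perplexity get worse when smoothing is turned on.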

Then count the number of correctly predicted tokens, ignoring padding, so that accuracy can be calculated later.
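
The counting step can be sketched without tensors. In the example below, `PAD = 0` is an assumed pad id standing in for `transformer.Constants.PAD`, and `count_correct` is a hypothetical helper that mirrors the masking idea rather than the tutorial's exact code.

```python
PAD = 0  # assumed pad id, standing in for transformer.Constants.PAD


def count_correct(pred_ids, gold_ids, pad_id=PAD):
    """Return (number of correct tokens, number of non-pad gold tokens)."""
    n_correct = 0
    n_words = 0
    for pred, gold in zip(pred_ids, gold_ids):
        if gold == pad_id:
            continue  # padding positions are not scored
        n_words += 1
        n_correct += int(pred == gold)
    return n_correct, n_words


pred_ids = [5, 7, 9, 0, 3]
gold_ids = [5, 8, 9, 0, 0]  # the last two positions are padding
n_correct, n_words = count_correct(pred_ids, gold_ids)
token_accuracy = n_correct / n_words
print(n_correct, n_words, token_accuracy)
```

Without the mask, the pad position where the prediction also happens to be 0 would count as a correct token and inflate the accuracy.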

In [14]:
def cal_performance(pred, gold, smoothing=False):
    ''' Calculate the loss (with label smoothing if needed) and the number of correct non-pad tokens '''

    loss = cal_loss(pred, gold, smoothing)

    pred = pred.max(1)[1]
    gold = gold.contiguous().view(-1)
    non_pad_mask = gold.ne(transformer.Constants.PAD)
    # eq computes element-wise equality
    n_correct = pred.eq(gold)
    n_correct = n_correct.masked_select(non_pad_mask).sum().item()

    return loss, n_correct

In [15]:
def cal_loss(pred, gold, smoothing):
    ''' Calculate cross entropy loss, apply label smoothing if needed. '''

    gold = gold.contiguous().view(-1)

    if smoothing:
        eps = 0.1
        n_class = pred.size(1)

        one_hot = torch.zeros_like(pred).scatter(1, gold.view(-1, 1), 1)
        one_hot = one_hot * (1 - eps) + (1 - one_hot) * eps / (n_class - 1)
        log_prb = F.log_softmax(pred, dim=1)

        non_pad_mask = gold.ne(transformer.Constants.PAD)
        loss = -(one_hot * log_prb).sum(dim=1)
        #loss = loss.masked_select(non_pad_mask).sum()  # average later
        loss = loss.masked_select(non_pad_mask).mean()
    else:
        loss = F.cross_entropy(pred, gold, ignore_index=transformer.Constants.PAD, reduction='mean')
    return loss

# Forward Pass
First, prepare the inputs by sending the features to the device:
- src_seq: input word encodings
- src_pos: input positional encodings
- src_seg: input segment encodings, for the turns in the dialogue history
- tgt_seq: target word encodings
- tgt_pos: target positional encodings

In the code these are named h_seq, h_pos, h_seg, r_seq and r_pos (h for history, r for response).

gold is the target without the CLS token at the beginning.

If you are training, you want to clear the gradients before getting the output.
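
The reason for dropping the first token can be shown with a toy response. This sketch assumes the usual teacher-forcing setup, in which the decoder input is the target without its last token; whether the model here does that trimming internally is an assumption. `split_target` and the token strings are illustrative only.

```python
def split_target(r_seq):
    """Split a response into the decoder input and the gold labels."""
    decoder_input = r_seq[:-1]  # everything except the last token
    gold = r_seq[1:]            # everything except the first token
    return decoder_input, gold


r_seq = ["<cls>", "I", "am", "late", "</s>"]
decoder_input, gold = split_target(r_seq)

# at each position the model sees one token and must predict the next one
for seen, target in zip(decoder_input, gold):
    print("after {!r:8} predict {!r}".format(seen, target))
```

The two lists have the same length and are offset by one position, so every gold token is the continuation of the input token at the same index.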

In [16]:
# forward
def forward(phase, batch, model, optimizer):
    h_seq, h_pos, h_seg, r_seq, r_pos = map(
                lambda x: x.to(device), batch)

    gold = r_seq[:, 1:]

    # forward
    if phase == "train":
        optimizer.zero_grad()
    pred = model(h_seq, h_pos, h_seg, r_seq, r_pos)
    
    return pred, gold
        

# Backward Pass
The backward pass computes the loss and, if the model is training, updates the model's parameters.

It returns the loss and the number of correct outputs.

In [17]:
# backward
def backward(phase, pred, gold, config):
    # get loss
    loss, n_correct = cal_performance(pred, gold,
        smoothing=config["label_smoothing"])
    
    if phase == "train":
        # backward
        loss.backward()

        # update parameters and learning rate
        # (uses the notebook-level `optimizer` created above)
        optimizer.step_and_update_lr()

    return float(loss), n_correct

# Training Loop
For every epoch, the loop runs training and evaluation.

Setting the model to eval mode disables things like dropout layers that you do not want active during evaluation. Setting it back to training mode turns them on again.

Metrics are initialized, and saved to the output file
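
One of the recorded metrics is perplexity, which the training loop computes as the exponential of the average loss. A short sketch of what that number means, with `perplexity` as an illustrative helper and the vocab size taken from the output printed earlier:

```python
import math


def perplexity(average_loss):
    """Perplexity is the exponential of the average cross-entropy loss."""
    return math.exp(average_loss)


vocab_size = 11507                   # vocab size printed earlier
uniform_loss = math.log(vocab_size)  # loss of a model guessing uniformly

print(perplexity(uniform_loss))  # about the vocab size
print(perplexity(0.0))           # a perfect model has perplexity 1
```

A model that guesses uniformly over the vocab has a perplexity equal to the vocab size, so a trained model should land well below that.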

The next step before running training is to initialize a dictionary for the results of training. It is important to keep experiment results organized.

After running validation, we want to save the weights of the model only when the validation loss is lower than it has been before, so only the best model is kept. To do this, the lowest loss is initialized to an arbitrarily large number. If the validation loss is lower than the lowest loss, save the weights and set the lowest loss to the validation loss.
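
The save-only-the-best rule can be sketched on its own. In the example below the validation losses are made up, `track_best` is a hypothetical helper, and `float("inf")` is used as the starting value in place of an arbitrary large number.

```python
def track_best(val_losses):
    """Return the epochs at which a checkpoint would be saved."""
    lowest_loss = float("inf")
    saved_epochs = []
    for epoch, val_loss in enumerate(val_losses):
        if val_loss <= lowest_loss:
            saved_epochs.append(epoch)  # the real loop saves a checkpoint here
            lowest_loss = val_loss
    return saved_epochs, lowest_loss


val_losses = [5.2, 4.8, 4.9, 4.5, 4.7]  # made-up validation losses
saved_epochs, lowest_loss = track_best(val_losses)
print(saved_epochs, lowest_loss)
```

Epochs 2 and 4 do not improve on the best loss so far, so the checkpoint from epoch 3 is the one left on disk.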

In [18]:
# initialize results, add config to them
results = dict()
results["config"] = config

# initialize lowest validation loss, use to save weights
lowest_loss = 999

In [60]:
# begin training
import time  # used to time each phase

for epoch in range(config["num_epochs"]):
    epoch_metrics = dict()
    # output an example
    output_example(model, val_dataset, device, vocab)
    # run each phase per epoch
    for phase in ["train", "val"]:
        # start the timer for this phase
        start = time.time()
        if phase == "train":
            # set model to training mode
            model.train()
            dataloader = data_loader_train
        else:
            # set model to evaluation mode
            model.eval()
            dataloader = data_loader_val

        # initialize metrics
        phase_metrics = dict()
        epoch_loss = list()
        average_epoch_loss = None
        n_word_total = 0
        n_word_correct = 0
        for batch in tqdm(dataloader, mininterval=2, desc=phase, leave=False):
            # forward
            pred, gold = forward(phase, batch, model, optimizer)
            # backward
            loss, n_correct = backward(phase, pred, gold, config)

            # record loss
            epoch_loss.append(loss)
            average_epoch_loss = np.mean(epoch_loss)

            # get accuracy, counting only non-pad tokens
            non_pad_mask = gold.ne(transformer.Constants.PAD)
            n_word = non_pad_mask.sum().item()
            n_word_total += n_word
            n_word_correct += n_correct

        # record metrics
        phase_metrics["loss"] = average_epoch_loss
        phase_metrics["token_accuracy"] = n_word_correct / n_word_total

        # get perplexity
        perplexity = np.exp(average_epoch_loss)
        phase_metrics["perplexity"] = perplexity

        phase_metrics["time_taken"] = time.time() - start

        epoch_metrics[phase] = phase_metrics

        # save model if val loss is lower than any of the previous epochs
        if phase == "val":
            if average_epoch_loss <= lowest_loss:
                # "model.bin" matches the file name that the loading cell reads
                filename = os.path.join(config["output_dir"], "model.bin")
                save_checkpoint(filename, model, optimizer.optimizer)
                lowest_loss = average_epoch_loss

    results["epoch_{}".format(epoch)] = epoch_metrics

train:   0%|          | 0/100 [00:00<?, ?it/s]


------------------------

history: 
for them to both suck the sand and pretend they are dying . <s> 
If one of them is down , or if both of them are down ... <s> 
I want one of your trainers to cut their throats . <s> 
And they are to understand that . </s> 
Target:
Now , I shall leave ten thousand on account ... and the rest </s> 

Prediction:
I , I am be you <unk> <unk> the . ... I <unk> </s> 

------------------------



KeyboardInterrupt: 

In [61]:
# save results to file
with open(os.path.join(config["output_dir"], "results.json"), 'w') as f:
    json.dump(results, f)

# Chat With Your Model

Next, we can make a demo chatbot with the transformer. This is slightly different, and will use beam search. The inputs to the chatbot will be all the previous dialogue turns, the queries and responses. 
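
To give a feel for what "all the previous dialogue turns" might look like as model input, here is a minimal sketch that flattens a list of turns into tokens, positions and segment ids. The layout (each turn ending in `<s>`, the last in `</s>`) mirrors the example printed during training, but `history_features` is a hypothetical function and is not the dataset's actual `get_input_features` implementation.

```python
def history_features(turns, max_len):
    """Flatten dialogue turns into tokens, positions and segment ids.

    Hypothetical layout: every turn ends with <s>, except the last,
    which ends with </s>. Positions start at 1 so 0 can mean padding.
    """
    tokens, segments = [], []
    for turn_index, turn in enumerate(turns):
        end = "</s>" if turn_index == len(turns) - 1 else "<s>"
        words = turn.split() + [end]
        tokens.extend(words)
        segments.extend([turn_index] * len(words))
    # keep only the most recent tokens if the history is too long
    tokens, segments = tokens[-max_len:], segments[-max_len:]
    positions = list(range(1, len(tokens) + 1))
    return tokens, positions, segments


tokens, positions, segments = history_features(
    ["hi", "hello there", "how are you"], max_len=50)
print(tokens)
print(positions)
print(segments)
```

The segment ids let the model tell which turn each token came from, while the positions give the order of the tokens in the flattened history.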

The chatbot does a beam search and returns the n_best responses. If choose_best is true, it returns the response with the highest score. This may make the model's replies uninteresting, so setting choose_best to false makes it pick a response that it may consider less probable, but that is possibly something different.
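
For readers who have not seen beam search before, the following is a simplified sketch over a tiny, made-up next-token table. It is not the `Chatbot` implementation: `TOY_MODEL` and `beam_search` are invented for this illustration, and details such as length normalization are left out.

```python
import math
import random

# made-up next-token probabilities, keyed by the previous token
TOY_MODEL = {
    "<s>": {"i": 0.6, "you": 0.4},
    "i": {"am": 0.7, "see": 0.3},
    "you": {"are": 0.9, "see": 0.1},
    "am": {"</s>": 1.0},
    "are": {"</s>": 1.0},
    "see": {"</s>": 1.0},
}


def beam_search(beam_size, n_best, max_len=5):
    """Return the n_best (score, tokens) hypotheses, best first."""
    beams = [(0.0, ["<s>"])]  # (sum of log probabilities, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # keep only the beam_size highest-scoring candidates
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for candidate in candidates[:beam_size]:
            if candidate[1][-1] == "</s>":
                finished.append(candidate)  # hypothesis is complete
            else:
                beams.append(candidate)
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished[:n_best]


hypotheses = beam_search(beam_size=4, n_best=4)
best = hypotheses[0]                  # what choose_best=True would return
surprise = random.choice(hypotheses)  # what choose_best=False might return
print(best)
print(surprise)
```

With a larger beam_size more partial responses survive each step, and with a larger n_best more finished responses are available for the random pick.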

The pretrained model will also output many `<unk>` tokens because it was trained on a large dataset with a small vocab, so many examples contain these tokens and the model learns to predict them. (You can come up with a word to replace the token in your head to make things more fun for yourself.) You can also increase the number of possible results with beam_size and n_best.

Using the vocab mapping, the output sentence is created from the final result.

In [62]:
# create chatbot object
chatbot = Chatbot(config, model)
history = list()

def generate_response(history, chatbot, dataset):
    # get input features for the dialogue history
    h_seq, h_pos, h_seg = dataset.get_input_features(history)
    
    # get response from model
    response = chatbot.translate_batch(h_seq, h_pos, h_seg)
    return response

# print the response from the input
def print_response(text_widget):
    # get query, add to the end of history 
    query = text_widget.value
    history.append(query)
    # generate responses
    response = generate_response(history, chatbot, val_dataset)
    # choose response
    candidates = response[0][0]
    if config["choose_best"]:
        response = candidates[0]
    else:
        # pick a random result from the returned candidates, so the
        # index can never go past the number of results
        response = random.choice(candidates)
    
    # create output string
    output = ""
    for idx in response[:-1]:
        token = vocab.id2token[idx]
        output += "{} ".format(token)
    print(f'{query} -> {output}')
    history.append(output)

text_input = widgets.Text(placeholder='Type something',
                          description='String:',
                          disabled=False)

text_input.on_submit(print_response)


In [63]:
text_input

Text(value='', description='String:', placeholder='Type something')

hi -> I am sorry , I am late . 
