# HW10: A Simple Chatbot using GPT2

Remember that these homework assignments are graded for completion. **You may <span style="color:red">not</span> skip any section of this homework.**

**Training a Chatbot**

In this exercise, we are going to train a simple chatbot based on DistilGPT2. An overview of the GPT2 architecture is available in the Hugging Face documentation [here](https://huggingface.co/transformers/model_doc/gpt2.html). We will use the [CCPE data](https://www.aclweb.org/anthology/W19-5941.pdf) (no need to read the paper for this exercise; we provide data loading utilities). The dataset offers exciting possibilities for training sophisticated chatbots, but here we only explore a very simple version.

In [None]:
# clone the github repo
#!git clone https://github.com/google-research-datasets/ccpe
#!pip install transformers

In [None]:
import torch
print(f'Running on GPU: {torch.cuda.get_device_name()}')

Running on GPU: Tesla K80


In [None]:
import json
import numpy as np

def load_data():
    with open("ccpe/data.json") as f:
        data = json.load(f)
    conversations = []
    for conversation in data:
        for i, item in enumerate(conversation["utterances"]):
            text = item["text"]
            if i == 0:
                # the first utterance has no predecessor, so there is nothing to do
                pass
            else:
                conversations.append((last_text, text))
            last_text = text
    return conversations     
data = load_data()

# drop the last 13 pairs so that the remainder divides evenly into batches of 16
data = np.reshape(data[:-13], (-1, 16, 2))
print(type(data))
print(f'data.shape={data.shape}')
print(data[0][:5])

# please note how we arrange the data in pairs of (previous_sentence, current_sentence)
# and each batch contains 16 such sentence pairs

<class 'numpy.ndarray'>
data.shape=(716, 16, 2)
[['generally speaking what type of movies do you watch'
  'I like thrillers a lot.']
 ['I like thrillers a lot.' 'thrillers? for example?']
 ['thrillers? for example?' "Zodiac's one of my favorite movies."]
 ["Zodiac's one of my favorite movies."
  "Zodiac the movie about a serial killer from the '60s or '70s, around there."]
 ["Zodiac the movie about a serial killer from the '60s or '70s, around there."
  'Zodiac? oh wow ok, what do you like about that movie']]


**The data**
As we can see from the data, we only extract sentence pairs and will train models on these pairs in isolation. This is a very simplified version of training a chatbot. 

**In a more realistic setting**, we would include more conversation history and perhaps retrieve additional information from a fact base to generate factually accurate responses (e.g. a chatbot that suggests restaurants in a city needs a list of all restaurants in that city). We would probably also encode who is speaking (user or chatbot) and guess the speech act in order to give the desired response (e.g. if a speaker just wants to make small talk, we would guess this and reply accordingly; if we guess they are asking for factual information, we should structure our response very differently). However, in this exercise we stick to our very simplified version.
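
To illustrate the first of these points, here is a minimal sketch of how several previous turns could be packed into the context instead of only one. The helper name `build_history_pairs`, the `max_history` parameter and the `" | "` separator are hypothetical choices made for illustration; they are not part of the homework code.

```python
def build_history_pairs(utterances, max_history=3, sep=" | "):
    """Pair each utterance with up to `max_history` preceding turns.

    Hypothetical helper: `max_history` and `sep` are illustrative choices.
    """
    pairs = []
    for i in range(1, len(utterances)):
        # keep only the most recent turns as context
        start = max(0, i - max_history)
        context = sep.join(utterances[start:i])
        pairs.append((context, utterances[i]))
    return pairs


conversation = [
    "generally speaking what type of movies do you watch",
    "I like thrillers a lot.",
    "thrillers? for example?",
    "Zodiac's one of my favorite movies.",
]
history_pairs = build_history_pairs(conversation, max_history=2)
for context, reply in history_pairs:
    print(repr(context), "->", repr(reply))
```

With `max_history=1` this reduces to the (previous_sentence, current_sentence) pairs used in this exercise.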

We have already prepared the data such that it comes in batches of 16 examples each.
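
The batching step amounts to dropping the leftover pairs and grouping the rest. Below is a minimal pure-Python sketch of that idea; `make_batches` is a hypothetical helper, the toy pairs are made up, and `batch_size=16` is chosen to match the shape printed above.

```python
def make_batches(pairs, batch_size=16):
    """Group pairs into full batches, dropping the incomplete last batch."""
    usable = len(pairs) - len(pairs) % batch_size
    return [pairs[i:i + batch_size] for i in range(0, usable, batch_size)]


# toy stand-in for the (previous_sentence, current_sentence) pairs
toy_pairs = [(f"utterance {i}", f"utterance {i + 1}") for i in range(40)]
batches = make_batches(toy_pairs, batch_size=16)
print(len(batches), "batches of", len(batches[0]), "pairs;",
      len(toy_pairs) % 16, "pairs dropped")
```

Computing the remainder this way avoids hard-coding the number of dropped pairs when the dataset size changes.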

To train a chatbot, we need data, a tokenizer, a model and an optimizer.

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = "distilgpt2"

# load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(model_name, pad_token_id=0)

## GPT2 tokenizers do not define padding, cls or sep tokens, so we have to add these ourselves
## we won't need the '#' character in our text, so it will serve as the pad token
tokenizer.add_special_tokens({'pad_token': '#'})
tokenizer.add_special_tokens({'cls_token': 'bos'})
tokenizer.add_special_tokens({'sep_token': 'bos'})

##TODO load model
model = GPT2LMHeadModel.from_pretrained(model_name, 
    pad_token_id=tokenizer.pad_token_id,
    sep_token_id=tokenizer.sep_token_id
    )

Downloading:   0%|          | 0.00/0.99M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/762 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/336M [00:00<?, ?B/s]

In [None]:
##let's have a look at what the model generates before fine-tuning our chatbot

input_ids = tokenizer.encode("Why do you like this kind of movies?", return_tensors='pt')

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
)

tokenizer.decode(sample_output[0])

'Why do you like this kind of movies?\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
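
The `top_p=0.92` argument selects nucleus sampling: at each step only the smallest set of most likely tokens whose cumulative probability reaches 0.92 is kept, and the next token is drawn from that set. The sketch below shows the filtering step on a made-up distribution. `top_p_filter` is a hypothetical helper written for illustration; it is not how `generate` is implemented internally.

```python
def top_p_filter(probs, top_p=0.92):
    """Keep the smallest set of most likely tokens whose mass reaches top_p,
    then renormalise. `probs` maps token -> probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# made-up next-token distribution
toy_probs = {"I": 0.5, "Because": 0.3, "They": 0.15, "zebra": 0.05}
nucleus = top_p_filter(toy_probs, top_p=0.92)
print(nucleus)
```

In this toy example the unlikely token `zebra` falls outside the nucleus and can never be sampled.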

In [None]:
from torch.optim import Adam
##TODO implement an optimizer with learning_rate = 2e-5 for all parameters

learning_rate = 2e-5
optimizer = Adam(model.parameters(), lr=learning_rate)

In [None]:
from tqdm import tqdm

##TODO train the model
num_epochs = 1

##on my laptop, training on all 716 batches takes roughly one hour, so we just train for 100 steps 
max_steps = 100
data = data[:max_steps]

## preprocess data: padding
#max_len = np.char.str_len(data).max()
#data = np.char.ljust(data, max_len, '#')

for i, batch in enumerate(tqdm(data)):
    ##TODO prepare model input
    #In the textual entailment example in the notebook, we encode
    #[CLS-token]premise[SEP-token]hypothesis[SEP-token]
    #Here, we would like to encode
    #[CLS-token]previous_sentence[SEP-token]current_sentence[SEP-token]
    
    previous_sentences, current_sentences = list(batch[:,0]), list(batch[:,1])
    inputs = tokenizer(previous_sentences, current_sentences, padding=True, return_tensors='pt')

    ##Compute a forward step (the labels are simply the input_ids)
    #GPT-2 reads from left to right, so when it predicts the token at a given position during training it only has access to the tokens before it
    outputs = model(**inputs, labels=inputs["input_ids"])
    loss = outputs.loss

    ##Compute a backward step
    loss.backward()

    ##Perform an optimizer step
    optimizer.step()

    ##Clear gradients of the optimizer
    optimizer.zero_grad()


100%|██████████| 100/100 [16:59<00:00, 10.20s/it]
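
Because the labels in the loop above are a copy of `input_ids`, the padded positions also contribute to the loss, so the model is also trained to emit the pad character. One common remedy is to set the labels at padded positions to -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists with made-up token ids. `mask_pad_labels` is a hypothetical helper, and this is a possible refinement, not part of the homework solution.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_pad_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions by IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# made-up token ids; 2 plays the role of the pad token id here
input_ids = [[40, 588, 2, 2], [10, 11, 12, 13]]
attention_mask = [[1, 1, 0, 0], [1, 1, 1, 1]]
labels = mask_pad_labels(input_ids, attention_mask)
print(labels)
```

With the tensors returned by the tokenizer, the same effect could presumably be obtained with `inputs["input_ids"].masked_fill(inputs["attention_mask"] == 0, -100)` and passing the result as `labels`.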


In [None]:
##let's have a look at what the model generates after fine-tuning our chatbot

input_ids = tokenizer.encode("Why do you like this kind of movies?", return_tensors='pt')

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
)

tokenizer.decode(sample_output[0])

"Why do you like this kind of movies?I'm one of those big actors.#################################"
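
The decoded string contains the prompt, followed by the reply and a run of `#` pad characters. Below is a minimal sketch for extracting just the reply, assuming the prompt is reproduced verbatim at the start of the output and that `#` appears only as padding. `extract_reply` is a hypothetical helper, not part of the homework code.

```python
def extract_reply(decoded, prompt, pad_char="#"):
    """Strip the prompt prefix and trailing pad characters from generated text."""
    if decoded.startswith(prompt):
        decoded = decoded[len(prompt):]
    return decoded.rstrip(pad_char).strip()


prompt = "Why do you like this kind of movies?"
# stand-in for a decoded model output: prompt + reply + padding
decoded = prompt + "I'm one of those big actors." + "#" * 33
reply = extract_reply(decoded, prompt)
print(reply)
```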