# Generative models

This tutorial will show how to use generative models to solve:
- Question answering
- Dialogue generation
However, the same principles are applicable to 

Yay we made it!

Before starting I need to connect the drive storage to the notebook.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import os
os.chdir('/content/drive/MyDrive/Colab Notebooks/NLP')
os.getcwd()

## Generative Question Answering (QA)

Up to now we have seen how to retrieve relevant passages that may contain the answer to a question.
- What if the answer is not written explicitly in the passage?
- What if the passage is too long to read (and our lives are too dynamic to spend more than one minute reading)?

We can train a model to output the answer to a question given 
- a relevant passage (and we know how to gather relevant documents)
- the question (yes, the question contains relevant information to answer the question)
... Or we can put a pre-trained model on top of our retrieval pipeline

### QA data preparation

In this section we will be using the [WikiQA](https://aclanthology.org/D15-1237/) data set.
It's an open-domain QA data set, originally built for answer sentence selection, that we will use here for generative QA.

It's available via the HuggingFace `datasets` package, let's install it

In [None]:
!pip -q install datasets

Let's download the validation split of the WikiQA data set

In [1]:
import datasets

wiki_qa = datasets.load_dataset('wiki_qa', split='validation')
wiki_qa[:10]

Using custom data configuration default
Reusing dataset wiki_qa (/home/arcslab/.cache/huggingface/datasets/wiki_qa/default/0.1.0/d2d236b5cbdc6fbdab45d168b4d678a002e06ddea3525733a24558150585951c)


{'question_id': ['Q8', 'Q8', 'Q8', 'Q8', 'Q8', 'Q8', 'Q8', 'Q8', 'Q8', 'Q11'],
 'question': ['How are epithelial tissues joined together?',
  'How are epithelial tissues joined together?',
  'How are epithelial tissues joined together?',
  'How are epithelial tissues joined together?',
  'How are epithelial tissues joined together?',
  'How are epithelial tissues joined together?',
  'How are epithelial tissues joined together?',
  'How are epithelial tissues joined together?',
  'How are epithelial tissues joined together?',
  'how big is bmc software in houston, tx'],
 'document_title': ['Tissue (biology)',
  'Tissue (biology)',
  'Tissue (biology)',
  'Tissue (biology)',
  'Tissue (biology)',
  'Tissue (biology)',
  'Tissue (biology)',
  'Tissue (biology)',
  'Tissue (biology)',
  'BMC Software'],
 'answer': ['Cross section of sclerenchyma fibers in plant ground tissue',
  'Microscopic view of a histologic specimen of human lung tissue stained with hematoxylin and eosin .',
  'In Bi

The data set contains questions, the titles of the Wikipedia documents containing the answers, and candidate answer sentences.
Besides the correct answers, the data set contains distractor answers for a given question.
We can tell the two types of responses apart using the `label` value (`1` marks a correct answer).

Let's filter the data to retain only correct question-response pairs

In [2]:
wiki_qa_correct = wiki_qa.filter(lambda x: x['label'] == 1)
wiki_qa_correct[0]



  0%|          | 0/3 [00:00<?, ?ba/s]

{'question_id': 'Q11',
 'question': 'how big is bmc software in houston, tx',
 'document_title': 'BMC Software',
 'answer': 'Employing over 6,000, BMC is often credited with pioneering the BSM concept as a way to help better align IT operations with business needs.',
 'label': 1}

Ok now we have a lot of samples to try our system

### Knowledge retrieval

We have the questions (and the target answers), now we need to prepare our knowledge source and the retrieval system.
We can re-use the simple Wikipedia and the two encoder models from last tutorial.

Let's start loading the data

In [None]:
!pip -q install transformers==4.22.2
!pip -q install -U sentence-transformers

In [3]:
import os
from sentence_transformers import util

wikipedia_filepath = 'simplewiki-2020-11-01.jsonl.gz'
if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

We can try indexing all the paragraphs this time (hopefully it won't explode)

In [4]:
import json
import gzip

# NOTE: Change this flag to use only first paragraph
only_first = False

passages = []
# Open the file with the dump of Simple Wikipedia
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as f:
    # Iterate over the lines
    for line in f:
        # Parse the document using JSON
        data = json.loads(line.strip())
        if only_first:
            # Only add the first paragraph
            passages.append(data['paragraphs'][0])
        else:
            # Add all paragraphs
            passages.extend(data['paragraphs'])

print(f"Retrieved {len(passages)} passages")

Retrieved 509663 passages


Now we can import the two models

In [5]:
from sentence_transformers import SentenceTransformer, CrossEncoder

semb_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
xenc_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

Now let's embed the retrieved passages (We can checkpoint the embeddings to avoid repeating the computation each time)

In [6]:
import os
import pickle

# Define embeddings cache path
embeddings_cache_path = './qa_embeddings_cache.pkl'

# Load cache if available
if os.path.exists(embeddings_cache_path):
    print('Loading embeddings cache')
    with open(embeddings_cache_path, 'rb') as f:
        corpus_embeddings = pickle.load(f)
# Else compute embeddings
else:
    print('Computing embeddings')
    corpus_embeddings = semb_model.encode(passages, convert_to_tensor=True, show_progress_bar=True)
    # Save the embeddings to a file for future loading
    print(f'Saving embeddings to: \'{embeddings_cache_path}\'')
    with open(embeddings_cache_path, 'wb') as f:
        pickle.dump(corpus_embeddings, f)

Loading embeddings cache
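
The checkpointing pattern used for the embeddings can be factored into a small helper. The snippet below is only a sketch of the idea: `load_or_compute` and `expensive` are hypothetical names introduced for illustration, and a tiny list stands in for the real embeddings.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Return the cached object if cache_path exists, else compute and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    result = compute_fn()
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result

# Keep track of how many times the expensive function actually runs
calls = []

def expensive():
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]  # stand-in for real embeddings

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_cache.pkl')
    first = load_or_compute(path, expensive)   # computes and saves
    second = load_or_compute(path, expensive)  # loads from disk

print(len(calls), first == second)
```

The second call never touches `expensive`, which is the behaviour we want when re-running the notebook.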


Finally let's index the embeddings (and let's save the index, hoping it works this time)

In [None]:
!pip -q install hnswlib

In [7]:
import os
import hnswlib

# Create empty index
index = hnswlib.Index(space='cosine', dim=384)

# Define hnswlib index path
index_path = './qa_hnswlib.index'

# Load index if available
if os.path.exists(index_path):
    print('Loading index...')
    index.load_index(index_path)
# Else index data collection
else:
    # Initialise the index
    print('Start creating HNSWLIB index')
    index.init_index(max_elements=corpus_embeddings.size(0), ef_construction=400, M=64)
    #  Compute the HNSWLIB index (it may take a while)
    index.add_items(corpus_embeddings.cpu(), list(range(len(corpus_embeddings))))
    # Save the index to a file for future loading
    print(f'Saving index to: {index_path}')
    index.save_index(index_path)

Loading index...
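
To make clear what the index approximates, here is a brute-force nearest-neighbour search under cosine distance (one minus cosine similarity, which is what the `cosine` space returns). This is a minimal sketch in plain Python: `exact_knn` and the toy 2-dimensional vectors are assumptions made for the example, not part of hnswlib.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def exact_knn(query, vectors, k):
    """Return (ids, distances) of the k vectors closest to query by cosine distance."""
    distances = [1.0 - cosine_similarity(query, v) for v in vectors]
    ids = sorted(range(len(vectors)), key=lambda i: distances[i])[:k]
    return ids, [distances[i] for i in ids]

# Toy "corpus embeddings"
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
ids, dists = exact_knn([1.0, 0.1], toy_vectors, k=2)
# Convert distances back to similarity scores
scores = [1.0 - d for d in dists]
print(ids, [round(s, 3) for s in scores])
```

The exact search compares the query against every vector, which is why an approximate index becomes attractive with hundreds of thousands of passages.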


Now we have almost all the tools to answer a question

### Answering a question

We are still missing the core of the QA system, the answering model.
We are going to re-use a pre-trained model.

[FLAN T5](https://arxiv.org/abs/2210.11416) is a pre-trained encoder-decoder model trained to be used on different tasks in zero-shot or few-shot learning settings

In [None]:
!pip -q install transformers sentencepiece accelerate

In [8]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16)

Let's check quickly that it works

In [9]:
input_text = "Translate the following sentence from English to Italian: \"Vincenzo is the best\""
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

output_ids = model.generate(input_ids, max_new_tokens=32)
output_text = tokenizer.decode(output_ids[0])
print(output_text)

<pad> "Vincenzo è il migliore"</s>


Now we can test our pipeline. 
First, randomly select a question from the data set

In [10]:
import random

random.seed(1995)

idx = random.choice(range(len(wiki_qa_correct)))

sample = wiki_qa_correct[idx]
question = sample['question']
target_answer = sample['answer']

print(f'Question {idx}: {question}?')

Question 116: What was Captain Ahab's Ship in the novel "Moby Dick"?


Embed the question

In [11]:
question_embedding = semb_model.encode(question, convert_to_tensor=True)

Retrieve relevant documents keeping top $k$ matches

In [12]:
corpus_ids, distances = index.knn_query(question_embedding.cpu(), k=64)
scores = 1 - distances

print("Cosine similarity model search results")
print(f"Query: \"{question}\"")
print("---------------------------------------")
for idx, score in zip(corpus_ids[0][:5], scores[0][:5]):
    print(f"Score: {score:.4f}\nDocument: \"{passages[idx]}\"\n\n")

Cosine similarity model search results
Query: "What was Captain Ahab's Ship in the novel "Moby Dick""
---------------------------------------
Score: 0.8052
Document: "Moby-Dick is a novel written by Herman Melville. It was first published in 1851. The story is told by a seaman named Ishmael. He sails on a whaling ship called the "Pequod". Ahab is the captain of the ship. He wants to kill a white whale called Moby Dick. The whale bit his leg off. The book received mixed reviews. "Moby-Dick" is now thought to be one of the greatest novels ever written."


Score: 0.5937
Document: "Richard Basehart was an American actor. He played Admiral Harriman Nelson in the television series "Voyage To The Bottom Of The Sea". He played Ishmael in the 1956 movie "Moby Dick"."


Score: 0.5623
Document: "Herman Melville (August 1, 1819 – September 28, 1891) was an American novelist, short story writer, essayist and poet. He is best known for writing "Moby-Dick"."


Score: 0.5183
Document: "The book "A Cap

Re-rank retrieved documents

In [13]:
import numpy as np

model_inputs = [(question, passages[idx]) for idx in corpus_ids[0]]
cross_scores = xenc_model.predict(model_inputs)

print("Cross-encoder model re-ranking results")
print(f"Query: \"{question}\"")
print("---------------------------------------")
for idx in np.argsort(-cross_scores)[:5]:
    print(f"Score: {cross_scores[idx]:.4f}\nDocument: \"{passages[corpus_ids[0][idx]]}\"\n\n")

Cross-encoder model re-ranking results
Query: "What was Captain Ahab's Ship in the novel "Moby Dick""
---------------------------------------
Score: 9.1982
Document: "Moby-Dick is a novel written by Herman Melville. It was first published in 1851. The story is told by a seaman named Ishmael. He sails on a whaling ship called the "Pequod". Ahab is the captain of the ship. He wants to kill a white whale called Moby Dick. The whale bit his leg off. The book received mixed reviews. "Moby-Dick" is now thought to be one of the greatest novels ever written."


Score: 0.2596
Document: "Richard Basehart was an American actor. He played Admiral Harriman Nelson in the television series "Voyage To The Bottom Of The Sea". He played Ishmael in the 1956 movie "Moby Dick"."


Score: 0.0798
Document: "Herman Melville wrote the famous novel Moby Dick, about a whaling in America. Moby Dick is often called the greatest American novels. Emily Dickinson is one of the greatest American poets. Dickinson wrote

Use best match to answer (and compare to reference answer)

In [14]:
passage_idx = np.argsort(-cross_scores)[0]
passage = passages[corpus_ids[0][passage_idx]]

input_text = f"Given the following passage, answer the related question.\n\nPassage:\n\n{passage}\n\nQ: {question}?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
print(input_text, "\n")

output_ids = model.generate(input_ids, max_new_tokens=32)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text, "\n")

print(f"A (target): {target_answer}")

Given the following passage, answer the related question.

Passage:

Moby-Dick is a novel written by Herman Melville. It was first published in 1851. The story is told by a seaman named Ishmael. He sails on a whaling ship called the "Pequod". Ahab is the captain of the ship. He wants to kill a white whale called Moby Dick. The whale bit his leg off. The book received mixed reviews. "Moby-Dick" is now thought to be one of the greatest novels ever written.

Q: What was Captain Ahab's Ship in the novel "Moby Dick"? 

Pequod 

A (target): The story tells the adventures of wandering sailor Ishmael , and his voyage on the whaleship Pequod , commanded by Captain Ahab.


How do we know whether the passage was useful, or whether the model simply relied on knowledge memorised in its weights?
Let's try to generate the response directly, without the passage


In [15]:
input_text = f"Answer the following question.\n\nQ: {question}?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
print(input_text)

output_ids = model.generate(input_ids, max_new_tokens=32)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(f"\nA: {output_text}")

Answer the following question.

Q: What was Captain Ahab's Ship in the novel "Moby Dick"?

A: samuel
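
The two prompts compared above differ only in whether a passage is included. A small helper makes that explicit; `build_qa_prompt` is a hypothetical name, and the question and passage used here are toy inputs.

```python
def build_qa_prompt(question, passage=None):
    """Build the prompt for the answering model, with or without a supporting passage."""
    if not question.endswith('?'):
        question = question + '?'
    if passage is None:
        # Closed-book prompt: the model can only rely on its weights
        return f"Answer the following question.\n\nQ: {question}"
    # Open-book prompt: the retrieved passage is prepended to the question
    return (
        "Given the following passage, answer the related question.\n\n"
        f"Passage:\n\n{passage}\n\nQ: {question}"
    )

open_book = build_qa_prompt(
    "Who wrote Moby-Dick",
    passage="Moby-Dick is a novel written by Herman Melville."
)
closed_book = build_qa_prompt("Who wrote Moby-Dick")
print(open_book)
print()
print(closed_book)
```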


### Putting it all together

We can finally set up an entire question answering pipeline:
- We have the knowledge
- We have the retrieval system
    - We also have the re-ranking system
- We have the answering system

Let's define a function that puts everything together

In [16]:
def qa_pipeline(
    question, 
    similarity_model=semb_model, 
    embeddings_index=index, 
    re_ranking_model=xenc_model, 
    generative_model=model,
    generative_tokenizer=tokenizer,
    device=device
): 
    if not question.endswith('?'):
        question = question + '?'
    # Embed question
    question_embedding = similarity_model.encode(question, convert_to_tensor=True)
    # Search documents similar to question in index 
    corpus_ids, distances = embeddings_index.knn_query(question_embedding.cpu(), k=64)
    # Re-rank results (use the pairs built for this question, not a global variable)
    xenc_model_inputs = [(question, passages[idx]) for idx in corpus_ids[0]]
    cross_scores = re_ranking_model.predict(xenc_model_inputs)
    # Get best matching passage
    passage_idx = np.argsort(-cross_scores)[0]
    passage = passages[corpus_ids[0][passage_idx]]
    # Encode input
    input_text = f"Given the following passage, answer the related question.\n\nPassage:\n\n{passage}\n\nQ: {question}"
    input_ids = generative_tokenizer(input_text, return_tensors="pt").input_ids.to(device)
    # Generate output
    output_ids = generative_model.generate(input_ids, max_new_tokens=64)
    # Decode output
    output_text = generative_tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Return result
    return f"Passage:\n\n{passage}\n\nQ: {question}\n\nA: {output_text}"

Try it out

In [17]:
question = input("Ask a question >>> ")  # e.g., "How many fingers in a hand?", "What's the meaning of life?", ...
print()

print(qa_pipeline(question))

Ask a question >>>  What's the meaning of life?



Passage:

Many have different opinions on what the meaning is. Some say life is a war zone where we are the soldiers fighting in that war for survival. Some think it is all about the relationships that we make in our life. Some people say that life is full of violence and hatred but some say that life is full of hope and happiness. Still, other people say that the meaning of life is to achieve the goals you set in life. According to Douglas Adams, the answer to the question is 42. The biological answer is to have children, which is to pass on your genes. Others say the meaning of life is simply to live your life to the fullest.

Q: What's the meaning of life?

A: To live your life to the fullest


## Generative chatbots

Language models can be used to generate text in dialogues.
Now we are going to see how to use transformer language models as generative chatbots.

As usual we are going to use the Transformers library from HuggingFace. 
All generative models there implement a `generate()` method that we are going to use.
You can find the documentation here: https://huggingface.co/docs/transformers/main_classes/text_generation 

### Pretrained models

To start, let's play around with a pre-trained model.
We can load the [DialoGPT](https://arxiv.org/abs/1911.00536) chatbot, a fine-tuning of GPT-2 trained on large collections of conversations crawled from Reddit.

We can start by looking at different ways to decode (generate) responses using this autoregressive model.
What we want to do is use the output probability distribution to select, one at a time, the tokens composing a response.
Ideally we would select the most probable sequence overall, but searching over all possible sequences is not feasible.

Let's proceed step-by-step.
First of all get model and tokeniser

In [None]:
!pip freeze | grep transformers

In [None]:
!pip uninstall -y transformers
!pip -q install transformers==4.22.2

In [18]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium", device_map="auto", torch_dtype=torch.float16)

#### How does it work?

First we need to understand how to provide data to our model

Up to a couple of years ago, the standard approach to present the input to these models was to separate each utterance with an `end-of-sequence` token.
The model would generate an answer and stop as soon as the `end-of-sequence` token is generated.

```
"<|endoftext|>Summer loving had me a blast<|endoftext|>Summer loving happened so fast<|endoftext|>I met a girl crazy for me<|endoftext|>Met a boy cute as can be<|endoftext|>"
```

Nowadays the approach is to have an uninterrupted stream of text, like a movie script

```
"
A: Hello.
B: Is it me you're looking for?
A: I can see it in your eyes...
B: I can see it in your smile!
"
```

DialoGPT uses the `end-of-sequence` token.
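
As a sketch of the first format, the helper below joins a list of utterances with the separator token and leaves a trailing separator so that the model continues with a new utterance. `build_dialogue_input` is a hypothetical name, and the default separator is the literal string shown in the example above.

```python
def build_dialogue_input(utterances, eos_token="<|endoftext|>"):
    """Concatenate utterances, delimiting each one with the end-of-sequence token."""
    if len(utterances) == 0:
        # Empty context: the model starts the conversation
        return eos_token
    return eos_token + eos_token.join(utterances) + eos_token

toy_context = ["Hello, how are you?", "I'm fine thanks, how about you?"]
print(build_dialogue_input(toy_context))
print(build_dialogue_input([]))
```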

In [19]:
print(tokenizer.eos_token)
print(tokenizer.eos_token_id)

<|endoftext|>
50256


Let's create a context to use as input for our experiments

In [20]:
context = [
    "Hello, how are you?", 
    "I'm fine thanks, how about you?"
]

Now let's create an input string from the context

In [21]:
input_string = tokenizer.eos_token 
if len(context) > 0:
    input_string = tokenizer.eos_token + tokenizer.eos_token.join(context) + input_string

input_string

"<|endoftext|>Hello, how are you?<|endoftext|>I'm fine thanks, how about you?<|endoftext|>"

Encode input

In [22]:
input_encoding = tokenizer(input_string, return_tensors='pt').to(device)
print(input_encoding.input_ids)
print(input_encoding.input_ids.size())

tensor([[50256, 15496,    11,   703,   389,   345,    30, 50256,    40,  1101,
          3734,  5176,    11,   703,   546,   345,    30, 50256]],
       device='cuda:0')
torch.Size([1, 18])


If we run the sequence through the model, we get a series of logits as output.
Since we are using an autoregressive model, the last position holds the logits of the next token.

In [23]:
outputs = model(**input_encoding)
print(outputs.logits)
print(outputs.logits.size())

tensor([[[ -1.0938, -18.4688, -17.6875,  ..., -14.8672, -13.5625,   0.9443],
         [ -6.1445, -16.7656, -15.6094,  ..., -10.8203, -10.4062,   1.5098],
         [ -3.1699, -12.5938, -11.6328,  ...,  -7.0820,  -7.3398,   2.1406],
         ...,
         [  1.9648, -10.8828,  -8.4453,  ...,  -3.9512,  -2.6504,  15.2500],
         [  1.0215, -10.5859,  -7.7070,  ...,  -4.1914,  -1.7578,  19.2656],
         [  3.8809, -10.0625,  -9.3047,  ...,  -5.8164,  -5.3867,   3.1348]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<UnsafeViewBackward0>)
torch.Size([1, 18, 50257])


We can run these logits through a $\mathrm{softmax}(\cdot)$ and obtain the probability distribution over tokens:
- For each possible token we have the probability of it being the next in the sequence
- We can sample a token from this probability distribution and feed it back as input to get a new token
- We can iterate this process to compose a response

In [24]:
p_dist_next = torch.softmax(outputs.logits[:, -1], dim=1)
print(p_dist_next)
print(p_dist_next.sum())
print(p_dist_next.size())

tensor([[2.5570e-05, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         1.2159e-05]], device='cuda:0', dtype=torch.float16,
       grad_fn=<SoftmaxBackward0>)
tensor(1., device='cuda:0', dtype=torch.float16, grad_fn=<SumBackward0>)
torch.Size([1, 50257])


What is the most probable next token?

In [25]:
arg_max_idx = torch.argmax(p_dist_next)
print(arg_max_idx)
tokenizer.decode(arg_max_idx)

tensor(40, device='cuda:0')


'I'

Note that we don't need to run the $\mathrm{softmax}(\cdot)$ operation if we just want to find the token with highest probability.
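
Both points can be checked on a handful of numbers. The sketch below implements a numerically stable softmax in plain Python and verifies that the arg-max of the logits and of the probabilities coincide; the four logits are toy values.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability (does not change the result)
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0, -1.0]
probs = softmax(toy_logits)

# Softmax is monotonic, so the most probable token is the one with the largest logit
argmax_logits = max(range(len(toy_logits)), key=lambda i: toy_logits[i])
argmax_probs = max(range(len(probs)), key=lambda i: probs[i])

print([round(p, 4) for p in probs])
print(argmax_logits == argmax_probs)
```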

#### Deterministic decoding

Deterministic approaches always yield the same output for a given input

##### Greedy decoding

The most straightforward way is to pick the most probable token at each step and feed it back as input for the next step.
This is a very suboptimal solution: it usually yields dull responses like `"I don't know"` or causes degenerate generation (e.g., repeating the same token many times).
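
The greedy loop can be sketched on a toy model. This is a minimal illustration and not the internals of `generate()`: the bigram table `TOY_LOGITS`, its tokens and its values are made up for the example, and the "model" only looks at the last token.

```python
# Hypothetical next-token logits, indexed by the last token of the sequence
TOY_LOGITS = {
    "<s>": {"I": 2.0, "You": 1.0, "</s>": -1.0},
    "I": {"am": 2.5, "was": 1.5, "</s>": 0.0},
    "You": {"are": 2.0, "</s>": 0.5},
    "am": {"fine": 2.0, "good": 1.8, "</s>": 0.1},
    "was": {"fine": 1.0, "</s>": 1.2},
    "are": {"fine": 1.0, "</s>": 1.5},
    "fine": {"</s>": 3.0, "I": 0.0},
    "good": {"</s>": 3.0, "I": 0.0},
}

def next_token_logits(sequence):
    return TOY_LOGITS[sequence[-1]]

def greedy_decode(prefix, max_new_tokens=8, eos="</s>"):
    sequence = list(prefix)
    for _ in range(max_new_tokens):
        logits = next_token_logits(sequence)
        # Pick the token with the highest logit (no softmax needed)
        token = max(logits, key=logits.get)
        # Feed the chosen token back as input for the next step
        sequence.append(token)
        if token == eos:
            break
    # Return only the newly generated tokens
    return sequence[len(prefix):]

greedy_output = greedy_decode(["<s>"])
print(greedy_output)
```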

In [26]:
output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)

"I'm good, thanks."

##### Beam search

We cannot do an exhaustive search, but we can keep track of the $n$ most probable sequences generated so far.
This is what beam search does
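
A toy example shows why keeping more than one hypothesis helps. The sketch below is a simplified beam search in plain Python (no length penalty or other refinements a real implementation may apply); the table `TOY_PROBS` is hypothetical and built so that the locally best first token leads to a worse sequence overall.

```python
import math

# Hypothetical next-token probabilities, indexed by the last token
TOY_PROBS = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.4, "y": 0.3, "z": 0.3},
    "B": {"w": 0.9, "v": 0.1},
    "x": {"</s>": 1.0}, "y": {"</s>": 1.0}, "z": {"</s>": 1.0},
    "w": {"</s>": 1.0}, "v": {"</s>": 1.0},
}

def beam_search(start, num_beams, max_new_tokens=5, eos="</s>"):
    # Each beam is a pair (sequence, cumulative log-probability)
    beams = [([start], 0.0)]
    for _ in range(max_new_tokens):
        candidates = []
        for sequence, score in beams:
            if sequence[-1] == eos:
                # Finished hypotheses are carried over unchanged
                candidates.append((sequence, score))
                continue
            for token, prob in TOY_PROBS[sequence[-1]].items():
                candidates.append((sequence + [token], score + math.log(prob)))
        # Keep only the num_beams most probable hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(sequence[-1] == eos for sequence, _ in beams):
            break
    best_sequence, best_score = beams[0]
    return best_sequence[1:], math.exp(best_score)

greedy_like = beam_search("<s>", num_beams=1)
wider = beam_search("<s>", num_beams=2)
print(greedy_like)
print(wider)
```

With a single beam the search behaves like greedy decoding and commits to `A`; with two beams it recovers the more probable sequence starting with `B`.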

In [27]:
output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, num_beams=8, pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)

"I'm good, how about you?"

#### Sampling

Sampling-based decoding adds more spice to the output by sampling the next token according to the probability given by the language model.
The nice thing is that given the same input the generated content may change (higher diversity in the text of responses), the bad thing is that given the same input the generated content may change (possibly inconsistent behaviour).

##### Top-k

Consider only the $k$ most probable tokens and zero out the probabilities of the others 
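
On a toy distribution the filtering step looks as follows. This is a plain Python sketch, assuming the kept probabilities are renormalised before sampling; `top_k_filter` and the five probabilities are made up for the example.

```python
import random

def top_k_filter(probs, k):
    """Keep the k largest probabilities, zero out the rest and renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

toy_probs = [0.5, 0.2, 0.15, 0.1, 0.05]
filtered = top_k_filter(toy_probs, k=2)

# Sample a token id from the filtered distribution (seeded for repeatability)
rng = random.Random(0)
token_id = rng.choices(range(len(filtered)), weights=filtered, k=1)[0]
print([round(p, 3) for p in filtered], token_id)
```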

In [28]:
output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, do_sample=True, top_k=16, pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)

"I'm not so well..."

##### Top-p (nucleus sampling)

Consider only the smallest set of most probable tokens whose probabilities sum up to at least $p \in [0, 1] \subseteq \mathbb{R}$ and zero out the probabilities of the others.
Similar to top-$k$, but with a variable window size.
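
The variable window is easy to see on two toy distributions, one peaked and one flat. The sketch below is plain Python and assumes the kept probabilities are renormalised; `top_p_filter` and the example values are hypothetical.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Zero out the discarded tokens and renormalise the kept ones
    return [probs[i] / cumulative if i in keep else 0.0 for i in range(len(probs))]

peaked = [0.9, 0.05, 0.03, 0.02]
flat = [0.3, 0.3, 0.2, 0.2]

peaked_filtered = top_p_filter(peaked, p=0.75)
flat_filtered = top_p_filter(flat, p=0.75)
print(peaked_filtered)
print([round(x, 3) for x in flat_filtered])
```

With the same $p$, the peaked distribution keeps a single token while the flat one keeps three.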

In [29]:
output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, do_sample=True, top_p=0.95, top_k=0, pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)

"Quite boring. It's got a tickle herons and a jazz"

##### Temperature rescoring

Divide the logits by a value $\tau \in \mathbb{R}^+$ (strictly positive, since we divide by it):
- if $\tau > 1$ (high temperature) the distribution gets softer (reduces the probability of the most probable tokens and increases that of the least probable)
- if $\tau = 1$ the distribution is unchanged
- if $\tau < 1$ (low temperature) the distribution gets sharper (reduces the probability of the least probable tokens and increases that of the most probable)
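
The three cases can be verified numerically. The sketch below applies the temperature before a plain Python softmax; the three logits are toy values.

```python
import math

def softmax_with_temperature(logits, tau):
    # Divide the logits by the temperature, then apply a stable softmax
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(toy_logits, tau=0.5)
plain = softmax_with_temperature(toy_logits, tau=1.0)
soft = softmax_with_temperature(toy_logits, tau=2.0)

for name, dist in [("tau=0.5", sharp), ("tau=1.0", plain), ("tau=2.0", soft)]:
    print(name, [round(p, 3) for p in dist])
```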


In [30]:
output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, do_sample=True, temperature=0.8, top_k=0, pad_token_id=tokenizer.eos_token_id)
# output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, do_sample=True, temperature=1.2, top_k=0, pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)

'me too thanks'

##### Sample multiple candidates

You can also sample multiple candidates and pick one according to some criterion.

In [31]:
output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, do_sample=True, num_return_sequences=8, top_p=0.9, top_k=0, pad_token_id=tokenizer.eos_token_id)
tokenizer.batch_decode(output_ids[:, input_encoding.input_ids.size(1):], skip_special_tokens=True)

['Pretty good how about you?',
 'Not too good. But thanks anyways',
 "I don't know what to say to that, my man.",
 "It's fine, I'm chilling.",
 'Hey man, how are you?',
 'Good thanks',
 'Just fine with some tendons.',
 "I'm also fine thanks"]

For example if you combine sampling and beam search the `generate()` method will automatically output the most probable of the sampled sequences

In [32]:
output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, do_sample=True, num_beams=8, temperature=0.8, top_k=0, pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)

"Good, can't ask you anything that is too personal for you."

#### Chatting

Now pick your favourite approach and chat with DialoGPT

In [None]:
# Maximum dialogue length (in turn pairs)
max_len = 5
# Initialise dialogue history
dialogue_history = []

for i in range(max_len):
    # Read user message
    user_message = input("User: ")
    # Append message to dialogue history
    dialogue_history.append(user_message)
    # Convert dialogue to string
    input_string = tokenizer.eos_token + tokenizer.eos_token.join(dialogue_history) + tokenizer.eos_token
    # Encode input
    input_encoding = tokenizer(input_string, return_tensors='pt').to(device)
    # Generate DialoGPT response
    output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, do_sample=True, top_p=0.9, top_k=0, pad_token_id=tokenizer.eos_token_id)
    chatbot_response = tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)
    # Append chatbot response to dialogue history
    dialogue_history.append(chatbot_response)
    # Print chatbot response
    print(f"DialoGPT: {chatbot_response}")

Notes
1. As the dialogue grows, the model will start getting slower. There is a way to cache the hidden outputs to avoid re-computing attention on past tokens
2. The context of the model is limited, at some point you should start dropping older utterances
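
Dropping older utterances can be sketched as follows. This is only an illustration: `truncate_history` is a hypothetical helper and the whitespace-based counter is a crude stand-in for counting tokens with the real tokenizer.

```python
def truncate_history(history, max_tokens, count_tokens):
    """Drop the oldest utterances until the history fits in max_tokens."""
    kept = list(history)
    # Always keep at least the most recent utterance
    while len(kept) > 1 and sum(count_tokens(u) for u in kept) > max_tokens:
        kept.pop(0)
    return kept

def whitespace_token_count(text):
    # Crude approximation of the number of tokens
    return len(text.split())

history = [
    "Hello, how are you?",
    "I'm fine thanks, how about you?",
    "Pretty good.",
    "What are you up to today?",
]
truncated = truncate_history(history, max_tokens=10, count_tokens=whitespace_token_count)
print(truncated)
```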

### Fine-tuning

Now you are ready, you can finally fine-tune a generative chatbot and chat with it instead of studying for the exams! (I am not responsible for your choices, I am just offering you an alternative)

We will fine-tune a vanilla GPT-2 (pick your favourite version).
Let's load the pre-trained model

In [33]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model_id = 'gpt2'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

Let's set the padding token to be the `eos_token` to simplify some steps later

In [34]:
tokenizer.pad_token = tokenizer.eos_token

#### Data preparation

We are going to use the [Persona-Chat](https://arxiv.org/abs/1801.07243) corpus (It was used in the ConvAI 2 challenge).
It's a data set where conversations are grounded in the persona description of the two participants.

We are going to use the [ParlAI](https://parl.ai/docs/index.html) package to get the data set.

In [None]:
!pip -q install parlai

Now let's download the data set

In [35]:
from parlai.tasks.convai2.build import build

build({'datapath': './data/'})

Now the samples are stored in `.txt` files that we need to parse.
We can build a simple function that, given the path to one of the files, parses the content into Python dictionaries

In [36]:
def parse_pc(path):
    # Open file
    with open(path) as f:
        # Read raw file lines
        data = [line.strip() for line in f]
    # Data set container
    persona_chat = list()
    # Now we iterate through lines and build the data set
    for line in data:
        # Split line data from initial index
        line_idx, line_data = line.split(' ', 1)
        # Check if new conversation is started
        if line_idx == '1':
            # Add new empty dialogue in data set
            persona_chat.append(
                {'persona_a': list(), 'persona_b': list(), 'utterances': list()}
            )
        # If the line is from Speaker A persona
        if line_data.startswith('your persona: '):
            # Append it to Persona A
            persona_chat[-1]['persona_a'].append(line_data[len('your persona: '):])
        # Else if the line is from Speaker B persona
        elif line_data.startswith("partner's persona"):
            # Append it to Persona B
            persona_chat[-1]['persona_b'].append(line_data[len("partner's persona: "):])
        # Else the line is a regular dialogue line
        else:
            # Split utterances from distractors and separate A and B
            utt_a, utt_b = line_data.split('\t\t')[0].split('\t')
            # Append to dialogue utterances
            persona_chat[-1]['utterances'].append(
                {'speaker': 'A', 'text': utt_a}
            )
            persona_chat[-1]['utterances'].append(
                {'speaker': 'B', 'text': utt_b}
            )
        
    return persona_chat

Let's load training and validation data

In [37]:
training_data = parse_pc('./data/ConvAI2/train_both_original.txt')
validation_data = parse_pc('./data/ConvAI2/valid_both_original.txt')

training_data[0]

{'persona_a': ['i like to remodel homes.',
  'i like to go hunting.',
  'i like to shoot a bow.',
  'my favorite holiday is halloween.'],
 'persona_b': ['i like canning and whittling.',
  'to stay in shape , i chase cheetahs at the zoo.',
  'in high school , i came in 6th in the 100 meter dash.',
  'i eat exclusively meat.'],
 'utterances': [{'speaker': 'A',
   'text': "hi , how are you doing ? i'm getting ready to do some cheetah chasing to stay in shape ."},
  {'speaker': 'B',
   'text': 'you must be very fast . hunting is one of my favorite hobbies .'},
  {'speaker': 'A',
   'text': 'i am ! for my hobby i like to do canning or some whittling .'},
  {'speaker': 'B',
   'text': 'i also remodel homes when i am not out bow hunting .'},
  {'speaker': 'A',
   'text': "that's neat . when i was in high school i placed 6th in 100m dash !"},
  {'speaker': 'B',
   'text': "that's awesome . do you have a favorite season or time of year ?"},
  {'speaker': 'A',
   'text': 'i do not . but i do hav

Now we are going to convert all the samples to strings using the `eos_token` as separator.
We will also include the personae.

Let's define another function to do that on a single dialogue, then we will apply it to the entire data set

In [38]:
def sample_to_string(sample, eos_token):
    # Join strings of Persona A
    persona_a = ' '.join(sample['persona_a'])
    # Join strings of Persona B
    persona_b = ' '.join(sample['persona_b'])
    # Join dialogue strings
    dialogue = eos_token.join(f"{utterance['speaker']}: {utterance['text']}" for utterance in sample['utterances'])
    # Build the dialogue string
    dialogue_string = f"Persona A: {persona_a}{eos_token}Persona B: {persona_b}{eos_token}{dialogue}{eos_token}"
    
    return dialogue_string

Apply the function to all samples

In [39]:
training_data_str = [sample_to_string(dialogue, tokenizer.eos_token) for dialogue in training_data]
validation_data_str = [sample_to_string(dialogue, tokenizer.eos_token) for dialogue in validation_data]

training_data_str[0]

"Persona A: i like to remodel homes. i like to go hunting. i like to shoot a bow. my favorite holiday is halloween.<|endoftext|>Persona B: i like canning and whittling. to stay in shape , i chase cheetahs at the zoo. in high school , i came in 6th in the 100 meter dash. i eat exclusively meat.<|endoftext|>A: hi , how are you doing ? i'm getting ready to do some cheetah chasing to stay in shape .<|endoftext|>B: you must be very fast . hunting is one of my favorite hobbies .<|endoftext|>A: i am ! for my hobby i like to do canning or some whittling .<|endoftext|>B: i also remodel homes when i am not out bow hunting .<|endoftext|>A: that's neat . when i was in high school i placed 6th in 100m dash !<|endoftext|>B: that's awesome . do you have a favorite season or time of year ?<|endoftext|>A: i do not . but i do have a favorite meat since that is all i eat exclusively .<|endoftext|>B: what is your favorite meat to eat ?<|endoftext|>A: i would have to say its prime rib . do you have any fav

And now our samples are ready

#### Training

OK, we made it to fine-tuning a language model.
We can use the HuggingFace `Trainer` to do the training.

First we need to build a trainer-compatible `Dataset` object

In [40]:
from datasets import Dataset

train_data = Dataset.from_dict({'text': training_data_str})
valid_data = Dataset.from_dict({'text': validation_data_str})

... and a trainer-compatible `DatasetDict` object

In [41]:
from datasets import DatasetDict

data = DatasetDict()
data['train'] = train_data
data['validation'] = valid_data
data['test'] = valid_data

Finally use the tokenizer to convert the input strings into sequences of tokens.

In [None]:
!pip -q install --upgrade ipywidgets

In [42]:
def tokenize_function(examples):
    input_encodings = tokenizer(examples["text"], padding=True, truncation=True)
    sample = {
        'input_ids': input_encodings.input_ids
    }
    return sample

tokenized_data = data.map(tokenize_function, batched=True)

  0%|          | 0/18 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

The last step we are missing is to create a collator that groups the sequences of each batch together and prepares the labels for language modelling

In [43]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
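
To make the role of the collator more concrete, here is a pure-Python sketch of the idea for causal language modelling (`mlm=False`): the labels are a copy of the input ids, padded positions are excluded from the loss with the value `-100`, and the shift by one position is left to the model. This is not the library code; the function name and the toy token ids are hypothetical. As far as I know, the library version masks every position whose id equals the pad token id, while this sketch only masks the positions it pads itself.

```python
def collate_for_causal_lm(sequences, pad_token_id, ignore_index=-100):
    """Pad a batch of token id lists and build the labels for next-token prediction."""
    max_len = max(len(seq) for seq in sequences)
    batch = {'input_ids': [], 'attention_mask': [], 'labels': []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Pad on the right up to the longest sequence in the batch
        batch['input_ids'].append(seq + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch['attention_mask'].append([1] * len(seq) + [0] * n_pad)
        # Labels copy the inputs; padded positions are ignored by the loss
        batch['labels'].append(seq + [ignore_index] * n_pad)
    return batch

# Toy batch with made-up token ids
toy_batch = collate_for_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(toy_batch)
```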

Now we can create an instance of the trainer specifying the training arguments

In [44]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    "cooler_trainer_name", 
    evaluation_strategy="steps",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=6.25e-5,
    lr_scheduler_type="linear"
)
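
With `lr_scheduler_type="linear"` the learning rate decays linearly from its initial value to zero over the training run. The sketch below reproduces that shape in plain Python. It assumes no warm-up steps (none are set in the cell above), and both the function name and the total number of steps are hypothetical, chosen only for illustration.

```python
def linear_lr(step, total_steps, base_lr=6.25e-5, warmup_steps=0):
    """Learning rate at a given optimisation step under linear warm-up and linear decay."""
    if step < warmup_steps:
        # Linear increase from 0 to base_lr during warm-up
        return base_lr * step / max(1, warmup_steps)
    # Linear decrease from base_lr to 0 over the remaining steps
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total_steps = 1000  # hypothetical number of optimisation steps
for step in (0, 250, 500, 1000):
    print(step, linear_lr(step, total_steps))
```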

Create the trainer

In [45]:
from transformers import TrainingArguments, Trainer

trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=tokenized_data['train'], 
    eval_dataset=tokenized_data['validation'],
    data_collator=data_collator
)

And now let the training begin

In [46]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 17878
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 4
  Total optimization steps = 1674
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 2.3903        | 2.482409        |
| 1000 | 2.0887        | 2.541473        |
| 1500 | 1.9913        | 2.557334        |


The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
Saving model checkpoint to cooler_trainer_name/checkpoint-500
Configuration saved in cooler_trainer_name/checkpoint-500/config.json
Model weights saved in cooler_trainer_name/checkpoint-500/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
Saving model checkpoint to cooler_trainer_name/checkpoint-1000
Configuration saved in cooler_trainer_name/checkpoint-1000/config.json
Model weights saved in cooler_trainer_name/checkpoint-1000

TrainOutput(global_step=1674, training_loss=2.1370191858945637, metrics={'train_runtime': 4035.429, 'train_samples_per_second': 13.291, 'train_steps_per_second': 0.415, 'total_flos': 1.9251608041728e+16, 'train_loss': 2.1370191858945637, 'epoch': 3.0})
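
The figures in the training log can be reproduced by hand, which is a useful check that the batch size and the gradient accumulation behave as intended. The sketch below assumes a single device and that an incomplete group of accumulated batches at the end of an epoch does not count as an extra optimisation step; under these assumptions we get the 1674 steps reported above. It also converts the reported validation losses to perplexities.

```python
import math

num_examples = 17878        # "Num examples" in the log
per_device_batch_size = 8   # per_device_train_batch_size
grad_accum_steps = 4        # gradient_accumulation_steps
num_epochs = 3              # "Num Epochs" in the log
num_devices = 1             # assumption: a single GPU

# Number of samples seen per optimisation step
effective_batch_size = per_device_batch_size * num_devices * grad_accum_steps
# Number of mini-batches per epoch (the last one may be smaller)
batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
# Optimisation steps per epoch and in total
updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
total_steps = updates_per_epoch * num_epochs
print(effective_batch_size, total_steps)

# Validation loss reported at each evaluation step, converted to perplexity
validation_loss = {500: 2.482409, 1000: 2.541473, 1500: 2.557334}
validation_ppl = {step: math.exp(loss) for step, loss in validation_loss.items()}
print(validation_ppl)
```

Note that the training loss keeps decreasing while the validation loss grows slightly after step 500, which may be a sign of overfitting; it could be worth comparing the intermediate checkpoints.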

Finally let's save the fine-tuned model

In [47]:
from datetime import datetime

checkpoint_path = f"persona_chat_fine_tuning_{datetime.now().strftime('%Y_%m_%d_%H_%M_%S')}"
tokenizer.save_pretrained(checkpoint_path)
model.save_pretrained(checkpoint_path)
print(f"Checkpoint saved at: \'{checkpoint_path}\'")

tokenizer config file saved in persona_chat_fine_tuning_2023_05_22_17_18_33/tokenizer_config.json
Special tokens file saved in persona_chat_fine_tuning_2023_05_22_17_18_33/special_tokens_map.json
Configuration saved in persona_chat_fine_tuning_2023_05_22_17_18_33/config.json
Model weights saved in persona_chat_fine_tuning_2023_05_22_17_18_33/pytorch_model.bin


Checkpoint saved at: 'persona_chat_fine_tuning_2023_05_22_17_18_33'


#### Testing

We can compute some automatic metrics to assess the quality of the chatbot.
We can use the ParlAI utilities to compute the metrics.

Let's pick a random test dialogue and generate a response.
We will compare the original target response to the generated one

In [48]:
import random

random.seed(1995)

idx = random.choice(range(len(validation_data)))
print(idx)
dialogue = validation_data[idx]

739


Now we can pick a turn from the middle of a dialogue to be our target response

In [49]:
response_idx = len(dialogue['utterances']) // 2

original_response = dialogue['utterances'][response_idx]
original_response_string = f"{original_response['speaker']}: {original_response['text']}"
original_response_string

'A: oh did you enjoy it ?'

And we can drop the response and all the following ones to build our context

In [50]:
context = {
    'persona_a': dialogue['persona_a'],
    'persona_b': dialogue['persona_b'],
    'utterances': dialogue['utterances'][:response_idx]
}
context_string = sample_to_string(context, tokenizer.eos_token)
context_string

"Persona A: i'm interested in photography and like taking pictures. my boyfriend and i are moving into an apartment together next week. i am an elementary school teacher. i am fluent in english spanish and french.<|endoftext|>Persona B: i enjoy poetry. i am a huge star wars fan. i try various coffees as a hobby. i played football for a division a college.<|endoftext|>A: hey , jefferson here , i love poetry .<|endoftext|>B: hey mike here my boyfriend and i are moving in together next week<|endoftext|>A: wow , does he like star wars ?<|endoftext|>B: yeah i am interested in taking pictures , of yoda for sure<|endoftext|>A: wow . i try coffee as a hobby<|endoftext|>B: yeah i am fluent in english spanish and french<|endoftext|>A: wow , all i did was play football in college .<|endoftext|>B: cool , i taught a few players being an elementary school teacher<|endoftext|>"

And let's load back the model from the checkpoint

In [52]:
# checkpoint_path = 'persona_chat_fine_tuning_2023_05_22_17_18_33'

tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path , device_map="auto")

loading file vocab.json
loading file merges.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading configuration file persona_chat_fine_tuning_2023_05_22_17_18_33/config.json
Model config GPT2Config {
  "_name_or_path": "persona_chat_fine_tuning_2023_05_22_17_18_33",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_

Generate a response

In [55]:
# Encode context
input_encoding = tokenizer(context_string, return_tensors='pt').to(device)
# Generate response
output_ids = model.generate(input_encoding.input_ids, max_new_tokens=64, do_sample=True, temperature=0.7, top_k=0, pad_token_id=tokenizer.eos_token_id)
# Decode generated response
generated_response = tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)
generated_response

"A: oh, wow. i'm also a huge fan of the game of thrones. i love that show. i enjoy it. i like it too. coffee. you? own a truck? i'm on it. i'm a huge fan too. drink. it pays the bills. you like coffee too"

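The call to `generate` above samples the next token (`do_sample=True`) after dividing the logits by `temperature=0.7`, while `top_k=0` turns off the top-k filter. The sketch below shows the effect of the temperature on a toy distribution in plain Python; the function names and the logits are hypothetical, and the real implementation works on tensors over the whole vocabulary.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    """Sample one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary
probs_default = softmax_with_temperature(toy_logits, temperature=1.0)
probs_sharp = softmax_with_temperature(toy_logits, temperature=0.7)
token_id = sample_token(toy_logits, temperature=0.7, rng=random.Random(0))
print(probs_default, probs_sharp, token_id)
```

A temperature below 1 moves probability mass towards the most likely tokens, so the generated text becomes less random.
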
##### Perplexity

Perplexity can be computed from the cross-entropy of the model on the original (reference) response.
First let's process the context and the response

In [56]:
# Encode dialogue
input_encoding = tokenizer(context_string + original_response_string + tokenizer.eos_token, return_tensors='pt').to(device)
# Compute model outputs
outputs = model(**input_encoding)

We get the target labels (the ids of the response)

In [57]:
labels = tokenizer(original_response_string + tokenizer.eos_token, return_tensors='pt').input_ids.to(device)
labels.size()

torch.Size([1, 9])

And then we retain only the logits from the response

In [58]:
logits = outputs.logits[:, -labels.size(1):]
logits.size()

torch.Size([1, 9, 50257])

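Compute the average cross-entropy, shifting logits and labels by one position so that each logit is scored against the token that follows it (as a consequence, the first token of the response is not scored)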

In [59]:
import torch.nn.functional as F

# Shift logits to exclude the last element
shift_logits = logits[..., :-1, :].contiguous()
# shift labels to exclude the first element
shift_labels = labels[..., 1:].contiguous()
# Compute loss
lm_loss = F.cross_entropy(
    shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
)
lm_loss

tensor(3.6102, device='cuda:0', grad_fn=<NllLossBackward0>)

Exponentiate the loss to obtain the perplexity (PPL)

In [60]:
ppl = torch.exp(lm_loss)
ppl

tensor(36.9717, device='cuda:0', grad_fn=<ExpBackward0>)

The process can be simplified but at least now you have seen all the steps
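
As a recap of the definition, perplexity is the exponential of the average negative log-likelihood per token. The plain-Python sketch below uses made-up token probabilities (the function name is hypothetical) and also checks that exponentiating the loss printed above gives the reported perplexity.

```python
import math

def perplexity(token_probabilities):
    """Exponential of the mean negative log-likelihood of the target tokens."""
    nll = [-math.log(p) for p in token_probabilities]
    return math.exp(sum(nll) / len(nll))

# A model that always spreads its probability evenly over 4 tokens has perplexity 4
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# Exponential of the cross-entropy computed above
ppl_from_loss = math.exp(3.6102)
print(uniform, ppl_from_loss)
```

Intuitively, a perplexity of about 37 means that, on this response, the model is as uncertain as if it were choosing uniformly among roughly 37 tokens at each step.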

##### BLEU

In [61]:
from parlai.core.metrics import BleuMetric

bleu = BleuMetric.compute(generated_response, [original_response_string])
print(f"BLEU: {bleu}")

BLEU: 3.409e-08
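
BLEU is built on the clipped n-gram precision between the generated response and the reference, combined over several n-gram orders. The sketch below computes that precision in plain Python on two made-up sentences. It is not the ParlAI implementation (which also handles normalisation, smoothing and the brevity penalty), and the function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as many times as it occurs in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

candidate = "i love that show too".split()
reference = "i love that movie".split()
p1 = modified_precision(candidate, reference, 1)
p2 = modified_precision(candidate, reference, 2)
p4 = modified_precision(candidate, reference, 4)
print(p1, p2, p4)
```

Since the precisions are combined with a geometric mean, a single order with no matches (here the 4-grams) pushes the score towards zero, which is probably what happens with the very small value printed above.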


##### F1

In [62]:
from parlai.core.metrics import F1Metric

f1_score = F1Metric.compute(generated_response, [original_response_string])
print(f"F1: {f1_score}")

F1: 0.1667
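
The F1 score used for dialogue is the harmonic mean of precision and recall computed on the words shared by the generated response and the reference. Here is a minimal plain-Python sketch on made-up sentences; it only lowercases and splits on whitespace, whereas the ParlAI metric applies its own text normalisation, so the two may not give identical values. The function name is hypothetical.

```python
from collections import Counter

def unigram_f1(guess, answer):
    """Harmonic mean of word-level precision and recall between two strings."""
    guess_tokens = guess.lower().split()
    answer_tokens = answer.lower().split()
    # Multiset intersection: each shared word counts as many times as it appears in both
    overlap = sum((Counter(guess_tokens) & Counter(answer_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(guess_tokens)
    recall = overlap / len(answer_tokens)
    return 2 * precision * recall / (precision + recall)

score = unigram_f1("i love coffee too", "i love tea")
print(score)
```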


##### Chatting

There is no better way to evaluate a generative chatbot than using it to chat.
Define a custom persona for you and the chatbot (or sample two from the data set) and write a chatting loop.

In [None]:
# TODO
dialogue = {
    # Write here your persona
    'persona_a': [
        "i am vincenzo",
        "i come from italy",
        "i like pizza"
    ], 
    # Write here chatbot persona
    'persona_b': [
        "i am an ai chatbot",
        "i like chatting with other people",
        "i want to take over human kind"
    ], 
    # Here we will accumulate the history
    'utterances': list()
}

# Maximum dialogue length (in turn pairs)
max_len = 5

for i in range(max_len):
    # Read user message
    user_message = input("User: ")
    # Append message to dialogue history
    dialogue['utterances'].append(
        {'speaker': 'A', 'text': user_message.lower()}
    )
    # Convert dialogue to string
    input_string = sample_to_string(dialogue, tokenizer.eos_token)
    # Encode input
    input_encoding = tokenizer(input_string, return_tensors='pt').to(device)
    # Generate DialoGPT response
    output_ids = model.generate(input_encoding.input_ids, max_new_tokens=64, do_sample=True, top_p=0.9, top_k=0, pad_token_id=tokenizer.eos_token_id)
    chatbot_response = tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)
    # Crop initial speaker token
    chatbot_response = chatbot_response[3:]
    # Append chatbot response to dialogue history
    dialogue['utterances'].append(
        {'speaker': 'B', 'text': chatbot_response}
    )
    # Print chatbot response
    print(f"Chatbot: {chatbot_response}")
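
In the loop above, `chatbot_response[3:]` assumes that every generated turn starts with the three characters `B: `. Since we are sampling, this is likely but not guaranteed, and when it does not hold the first characters of the message are lost. A slightly more defensive alternative could be the hypothetical helper sketched below, which removes the prefix only when it is actually there.

```python
def strip_speaker_prefix(text, speaker):
    """Remove a leading 'B:' style speaker tag, if present, and surrounding spaces."""
    prefix = f"{speaker}:"
    text = text.strip()
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text.strip()

print(strip_speaker_prefix("B: i like pizza too", "B"))
print(strip_speaker_prefix("i like pizza too", "B"))
```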

## ELIZA meets DialoGPT

In the 70s ELIZA and PARRY were made to meet each other: https://www.theatlantic.com/technology/archive/2014/06/when-parry-met-eliza-a-ridiculous-chatbot-conversation-from-1972/372428/.
We could have ELIZA meet ChatGPT, but since we are humble we will settle for DialoGPT.

There is an implementation of ELIZA in Python that we can use: https://github.com/wadetb/eliza
Let's start by cloning the repository and adding it to our path

In [None]:
!git clone https://github.com/wadetb/eliza.git

In [63]:
import sys

# `sys.path` entries must be directories: add the one that contains the cloned `eliza` folder
# (here the current working directory, where the repository was cloned)
sys.path.append('.')

Now we should be able to import the package

In [64]:
from eliza.eliza import Eliza

Now let's create an instance of ELIZA using the available rules (https://github.com/wadetb/eliza/blob/master/doctor.txt)

In [65]:
eliza = Eliza()
eliza.load('./eliza/doctor.txt')
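
To get an intuition of what such rules do, here is a toy sketch of the ELIZA idea: match the user message against a list of patterns, swap first and second person in the captured fragment, and fill a response template. The rules, the reflections and the function names below are made up for illustration; they are not the contents of `doctor.txt` nor the code of the repository.

```python
import re

# Swap first- and second-person words so that the reply addresses the user
REFLECTIONS = {'i': 'you', 'am': 'are', 'my': 'your', 'me': 'you', 'you': 'i', 'your': 'my'}

# (pattern, response template) pairs, tried in order; the last one is a catch-all
RULES = [
    (re.compile(r'i am (.*)', re.IGNORECASE), "Why do you say you are {0} ?"),
    (re.compile(r'i feel (.*)', re.IGNORECASE), "Do you often feel {0} ?"),
    (re.compile(r'(.*)'), "Please go on."),
]

def reflect(fragment):
    """Replace each word with its reflected form, when one is defined."""
    return ' '.join(REFLECTIONS.get(word, word) for word in fragment.lower().split())

def toy_eliza_respond(message):
    """Answer with the template of the first rule whose pattern matches the message."""
    message = message.strip().rstrip('.!?')
    for pattern, template in RULES:
        match = pattern.search(message)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please go on."

print(toy_eliza_respond("I am worried about my exam."))
print(toy_eliza_respond("Hello"))
```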

Now you can chat with ELIZA.
Note that ELIZA manages its context internally.

In [None]:
# Maximum dialogue length (in turn pairs)
max_len = 5

for i in range(max_len):
    # Read user message
    user_message = input("User: ")
    # Ask ELIZA for response
    response = eliza.respond(user_message)
    # Print ELIZA response
    print(f"ELIZA: {response}")

But that's not as fun as having DialoGPT chat with ELIZA, right?
Let's load back DialoGPT

In [66]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model_id = 'microsoft/DialoGPT-medium'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id , device_map="auto")

loading configuration file config.json from cache at /home/arcslab/.cache/huggingface/hub/models--microsoft--DialoGPT-medium/snapshots/9d5c5fadcc072b693fb5a5e29416bbf3f503c26c/config.json
Model config GPT2Config {
  "_name_or_path": "microsoft/DialoGPT-medium",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1024,
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "conversational": {
      "max_length": 1000
   

Now we can re-use the same chatting script from before, but instead of asking the user for input, we are going to ask ELIZA

In [71]:
# Reset ELIZA
eliza = Eliza()
eliza.load('./eliza/doctor.txt')
# Maximum dialogue length (in turn pairs)
max_len = 5
# Initialise dialogue history
dialogue_history = ["Hello"]
# Print first message
print(f"DialoGPT: {dialogue_history[0]}")


for i in range(max_len):
    # Generate ELIZA response
    eliza_message = eliza.respond(dialogue_history[-1])
    # Append message to dialogue history
    dialogue_history.append(eliza_message)
    # Print ELIZA response
    print(f"ELIZA: {eliza_message}")
    # Convert dialogue to string
    input_string = tokenizer.eos_token 
    if len(dialogue_history) > 0:
        input_string = tokenizer.eos_token + tokenizer.eos_token.join(dialogue_history) + input_string
    # Encode input
    input_encoding = tokenizer(input_string, return_tensors='pt').to(device)
    # Generate DialoGPT response
    output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, do_sample=True, temperature=0.7, top_k=0, pad_token_id=tokenizer.eos_token_id)
    dialogpt_message = tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)
    # Append DialoGPT response to dialogue history
    dialogue_history.append(dialogpt_message)
    # Print DialoGPT response
    print(f"DialoGPT: {dialogpt_message}")

DialoGPT: Hello
ELIZA: How do you do. Please state your problem.
DialoGPT: i dont know what happened... i am trying to get the message that the app is closed
ELIZA: Is it because you are trying to get the message that the app is closed that you came to me ?
DialoGPT: the app closed
ELIZA: Do you feel strongly about discussing such things ?
DialoGPT: no, i am just expressing my opinions on the matter.
ELIZA: Why do you say your opinions on the matter ?
DialoGPT: and why does it annoy you
ELIZA: Does that question interest you ?
DialoGPT: It does
