# Generative models

This tutorial will show how to use generative models to solve:
- Question answering
- Dialogue generation

However, the same principles are applicable to other text generation tasks.

Yay we made it!

## Generative Question Answering (QA)

Up to now we have seen how to retrieve relevant passages that may contain the answer to a question.
- What if the answer is not written explicitly in the passage?
- What if the passage is too long to read (and our lives are too dynamic to spend more than one minute reading)?

We can train a model to output the answer to a question given:
- a relevant passage (and we know how to gather relevant documents)
- the question (yes, the question contains relevant information to answer the question)

... Or we can put a pre-trained model on top of our retrieval pipeline.

### QA data preparation

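In this section we will be using the [WikiQA](https://aclanthology.org/D15-1237/) data set.
It's a data set for open-domain QA: each question comes with candidate answer sentences, and we will use the correct ones as targets for generative QA.

It's available via the HuggingFace `datasets` package, so let's install it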

In [None]:
#!pip -q install datasets

Let's download the validation split of the WikiQA data set

In [None]:
import datasets

wiki_qa = datasets.load_dataset('wiki_qa', split='validation')
wiki_qa[:10]

The data set contains questions, the titles of the Wikipedia documents containing the answers, and the answers themselves.
Additionally, the data set contains distractor answers for each question.
We can distinguish the two types of responses using the `label` value (`1` marks a correct answer).

Let's filter the data to retain only correct question-response pairs

In [None]:
wiki_qa_correct = wiki_qa.filter(lambda x: x['label'] == 1)
wiki_qa_correct[0]

Ok now we have a lot of samples to try our system

### Knowledge retrieval

We have the questions (and the target answers), now we need to prepare our knowledge source and the retrieval system.
We can re-use the Simple Wikipedia dump and the two encoder models from the last tutorial.

Let's start by loading the data

In [None]:
#!pip -q install transformers==4.22.2
#!pip -q install -U sentence-transformers

In [None]:
import os
from sentence_transformers import util

wikipedia_filepath = 'simplewiki-2020-11-01.jsonl.gz'
if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

We can try indexing all the paragraphs this time (hopefully it won't explode)

In [None]:
import json
import gzip

# NOTE: Change this flag to use only first paragraph
only_first = False

passages = []
# Open the file with the dump of Simple Wikipedia
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as f:
    # Iterate over the lines
    for line in f:
        # Parse the document using JSON
        data = json.loads(line.strip())
        if only_first:
            # Only add the first paragraph
            passages.append(data['paragraphs'][0])
        else:
            # Add all paragraphs
            passages.extend(data['paragraphs'])

print(f"Retrieved {len(passages)} passages")

Now we can import the two models

In [None]:
from sentence_transformers import SentenceTransformer, CrossEncoder

semb_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
xenc_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

Now let's embed the retrieved passages (we can checkpoint the embeddings to avoid repeating the computation each time)

In [None]:
import os
import pickle

# Define embeddings cache path
embeddings_cache_path = './qa_embeddings_cache.pkl'

# Load cache if available
if os.path.exists(embeddings_cache_path):
    print('Loading embeddings cache')
    with open(embeddings_cache_path, 'rb') as f:
        corpus_embeddings = pickle.load(f)
# Else compute embeddings
else:
    print('Computing embeddings')
    corpus_embeddings = semb_model.encode(passages, convert_to_tensor=True, show_progress_bar=True)
    # Save the embeddings to a file for future loading
    print(f'Saving embeddings to: \'{embeddings_cache_path}\'')
    with open(embeddings_cache_path, 'wb') as f:
        pickle.dump(corpus_embeddings, f)

Finally let's index the embeddings (and let's save the index, hoping it works this time)

In [None]:
#!pip -q install hnswlib

In [None]:
import os
import hnswlib

# Create empty index
index = hnswlib.Index(space='cosine', dim=384)

# Define hnswlib index path
index_path = './qa_hnswlib.index'

# Load index if available
if os.path.exists(index_path):
    print('Loading index...')
    index.load_index(index_path)
# Else index data collection
else:
    # Initialise the index
    print('Start creating HNSWLIB index')
    index.init_index(max_elements=corpus_embeddings.size(0), ef_construction=400, M=64)
    #  Compute the HNSWLIB index (it may take a while)
    index.add_items(corpus_embeddings.cpu(), list(range(len(corpus_embeddings))))
    # Save the index to a file for future loading
    print(f'Saving index to: {index_path}')
    index.save_index(index_path)

Now we have almost all the tools to answer a question

### Answering a question

We are still missing the core of the QA system, the answering model.
We are going to re-use a pre-trained model.

[FLAN T5](https://arxiv.org/abs/2210.11416) is a pre-trained encoder-decoder model trained to be used on different tasks in zero-shot or few-shot learning settings

In [None]:
#!pip -q install transformers sentencepiece accelerate

In [None]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16)

Let's check quickly that it works

In [None]:
input_text = "Translate the following sentence from English to Italian: \"Vincenzo is the best\""
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

output_ids = model.generate(input_ids, max_new_tokens=32)
output_text = tokenizer.decode(output_ids[0])
print(output_text)

Now we can test our pipeline. 
First, randomly select a question from the data set

In [None]:
import random

random.seed(1995)

idx = random.choice(range(len(wiki_qa_correct)))

sample = wiki_qa_correct[idx]
question = sample['question']
target_answer = sample['answer']

print(f'Question {idx}: {question}?')

Embed the question

In [None]:
question_embedding = semb_model.encode(question, convert_to_tensor=True)

Retrieve relevant documents keeping top $k$ matches

In [None]:
corpus_ids, distances = index.knn_query(question_embedding.cpu(), k=64)
scores = 1 - distances

print("Cosine similarity model search results")
print(f"Query: \"{question}\"")
print("---------------------------------------")
for idx, score in zip(corpus_ids[0][:5], scores[0][:5]):
    print(f"Score: {score:.4f}\nDocument: \"{passages[idx]}\"\n\n")
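
The `scores = 1 - distances` line relies on the fact that, with `space='cosine'`, hnswlib reports the cosine distance, i.e. one minus the cosine similarity. As a rough illustration, the sketch below performs the same kind of search exactly (brute force) in pure Python. The toy 2-dimensional vectors and the helper names `cosine_similarity` and `exact_top_k` are made up for this example; the real index is approximate and much faster.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def exact_top_k(query, corpus, k):
    # Score every corpus vector against the query (brute force)
    scored = [(idx, cosine_similarity(query, emb)) for idx, emb in enumerate(corpus)]
    # Sort by decreasing similarity and keep the best k
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings (hypothetical, 2 dimensions instead of 384)
toy_corpus = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.1]

top = exact_top_k(toy_query, toy_corpus, k=2)
for idx, similarity in top:
    # Cosine distance is one minus cosine similarity
    print(f"id={idx} similarity={similarity:.4f} distance={1 - similarity:.4f}")
```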

Re-rank retrieved documents

In [None]:
import numpy as np

model_inputs = [(question, passages[idx]) for idx in corpus_ids[0]]
cross_scores = xenc_model.predict(model_inputs)

print("Cross-encoder model re-ranking results")
print(f"Query: \"{question}\"")
print("---------------------------------------")
for idx in np.argsort(-cross_scores)[:5]:
    print(f"Score: {cross_scores[idx]:.4f}\nDocument: \"{passages[corpus_ids[0][idx]]}\"\n\n")

Use best match to answer (and compare to reference answer)

In [None]:
passage_idx = np.argsort(-cross_scores)[0]
passage = passages[corpus_ids[0][passage_idx]]

input_text = f"Given the following passage, answer the related question.\n\nPassage:\n\n{passage}\n\nQ: {question}?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
print(input_text, "\n")

output_ids = model.generate(input_ids, max_new_tokens=32)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text, "\n")

print(f"A (target): {target_answer}")

How do we know whether the passage was useful, or whether the model simply relied on knowledge memorised in its weights?
Let's try to generate the response directly, without the passage


In [None]:
input_text = f"Answer the following question.\n\nQ: {question}?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
print(input_text)

output_ids = model.generate(input_ids, max_new_tokens=32)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)

### Putting it all together

We can finally set up an entire question answering pipeline:
- We have the knowledge
- We have the retrieval system
    - We also have the re-ranking system
- We have the answering system

Let's define a function that puts everything together.

In [None]:
def qa_pipeline(
    question,
    similarity_model=semb_model,
    embeddings_index=index,
    re_ranking_model=xenc_model,
    generative_model=model,
    generative_tokenizer=tokenizer,
    device=device
):
    # Make sure the question ends with a question mark
    if not question.endswith('?'):
        question = question + '?'
    # Embed question
    question_embedding = similarity_model.encode(question, convert_to_tensor=True)
    # Search documents similar to question in index
    corpus_ids, distances = embeddings_index.knn_query(question_embedding.cpu(), k=64)
    # Re-rank results
    xenc_model_inputs = [(question, passages[idx]) for idx in corpus_ids[0]]
    cross_scores = re_ranking_model.predict(xenc_model_inputs)
    # Get best matching passage
    passage_idx = np.argsort(-cross_scores)[0]
    passage = passages[corpus_ids[0][passage_idx]]
    # Encode input
    input_text = f"Given the following passage, answer the related question.\n\nPassage:\n\n{passage}\n\nQ: {question}"
    input_ids = generative_tokenizer(input_text, return_tensors="pt").input_ids.to(device)
    # Generate output
    output_ids = generative_model.generate(input_ids, max_new_tokens=64)
    # Decode output
    output_text = generative_tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Return result
    return f"Passage:\n\n{passage}\n\nQ: {question}\n\nA: {output_text}"

Try it out

In [None]:
question = input("Ask a question >>> ")
print()

print(qa_pipeline(question))

## Generative chatbots

Language models can be used to generate text in dialogues.
Now we are going to see how to use transformer language models as generative chatbots.

As usual we are going to use the Transformers library from HuggingFace.
All generative models there implement a `generate()` method, which we are going to use.
You can find the documentation here: https://huggingface.co/docs/transformers/main_classes/text_generation 

### Pretrained models

To start, let's play around with a pre-trained model.
We can load the [DialoGPT](https://arxiv.org/abs/1911.00536) chatbot, a fine-tuned version of GPT-2 trained on large collections of conversations crawled from Reddit.

We can start by looking at different ways to decode (generate) responses using this autoregressive model.
What we want to do is use the output probability distribution to select the tokens composing a response.
Ideally we would select the most probable sequence, but searching over all possible sequences is not feasible.

Let's proceed step-by-step.
First of all, get the model and tokeniser

In [None]:
#!pip freeze | grep transformers

In [None]:
#!pip uninstall -y transformers
#!pip -q install transformers==4.22.2

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large", device_map="auto", torch_dtype=torch.float16)

#### How does it work?

First we need to understand how to provide data to our model

Up to a couple of years ago, the standard approach to present the input to these models was to separate each utterance with an `end-of-sequence` token.
The model would generate a response and stop as soon as the `end-of-sequence` token is generated.

```
"<|endoftext|>Summer loving had me a blast<|endoftext|>Summer loving happened so fast<|endoftext|>I met a girl crazy for me<|endoftext|>Met a boy cute as can be<|endoftext|>"
```

Nowadays the approach is to have an uninterrupted stream of text, like a movie script

```
"
A: Hello.
B: Is it me you're looking for?
A: I can see it in your eyes...
B: I can see it in your smile!
"
```
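
To make the difference between the two input formats concrete, here is a small pure-Python sketch that builds both strings from a list of utterances. The helper names `to_eos_format` and `to_script_format`, the speaker labels and the toy dialogue are assumptions made for this illustration only.

```python
def to_eos_format(utterances, eos_token="<|endoftext|>"):
    # Each utterance is delimited by the end-of-sequence token
    if not utterances:
        return eos_token
    return eos_token + eos_token.join(utterances) + eos_token

def to_script_format(utterances, speakers=("A", "B")):
    # Speakers alternate, one line per utterance, like a movie script
    lines = [
        f"{speakers[i % len(speakers)]}: {utterance}"
        for i, utterance in enumerate(utterances)
    ]
    return "\n".join(lines) + "\n"

toy_dialogue = ["Hello.", "Is it me you're looking for?"]

eos_string = to_eos_format(toy_dialogue)
script_string = to_script_format(toy_dialogue)
print(eos_string)
print(script_string)
```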

DialoGPT uses the `end-of-sequence` token.

In [None]:
print(tokenizer.eos_token)
print(tokenizer.eos_token_id)

Let's create a context to use as input for our experiments

In [None]:
context = [
    "Hello, how are you?", 
    "I'm fine thanks, how about you?"
]

Now let's create an input string from the context

In [None]:
input_string = tokenizer.eos_token 
if len(context) > 0:
    input_string = tokenizer.eos_token + tokenizer.eos_token.join(context) + input_string

input_string

Encode input

In [None]:
input_encoding = tokenizer(input_string, return_tensors='pt').to(device)
print(input_encoding.input_ids)
print(input_encoding.input_ids.size())

If we run the sequence through the model, we get a series of logits as output.
Since we are using an autoregressive model, the last position holds the logits of the next token.

In [None]:
outputs = model(**input_encoding)
print(outputs.logits)
print(outputs.logits.size())

We can run these logits through a $\mathrm{softmax}(\cdot)$ and obtain the probability distribution over tokens:
- For each possible token we have the probability of it being the next in the sequence
- We can sample a token from this probability distribution and feed it back as input to get a new token
- We can iterate this process to compose a response

In [None]:
p_dist_next = torch.softmax(outputs.logits[:, -1], dim=1)
print(p_dist_next)
print(p_dist_next.sum())
print(p_dist_next.size())
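
The same two operations (softmax, then sampling) can be sketched in pure Python on a toy vocabulary. The vocabulary, the logits and the helper names `softmax` and `sample_index` are hypothetical; the real model works on tens of thousands of tokens.

```python
import math
import random

def softmax(logits):
    # Subtract the maximum for numerical stability, then normalise
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs, rng):
    # Draw one index with probability proportional to its weight
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy vocabulary and next-token logits (made up for the example)
toy_vocab = ["hello", "fine", "thanks", "<eos>"]
toy_logits = [2.0, 1.0, 0.5, -1.0]

toy_probs = softmax(toy_logits)
rng = random.Random(0)
sampled = [toy_vocab[sample_index(toy_probs, rng)] for _ in range(5)]

print([round(p, 3) for p in toy_probs])
print(sampled)
```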

What is the most probable next token?

In [None]:
arg_max_idx = torch.argmax(p_dist_next)
print(arg_max_idx)
tokenizer.decode(arg_max_idx)

Note that we don't need to run the $\mathrm{softmax}(\cdot)$ operation if we just want to find the token with highest probability.
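
The reason is that the softmax is strictly increasing in each logit, so it preserves the ordering of the tokens. A tiny check on made-up logits (the `softmax` helper is a pure-Python stand-in for `torch.softmax`):

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability, then normalise
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary
toy_logits = [0.3, 2.5, -1.2, 2.4]
toy_probs = softmax(toy_logits)

# Index of the largest logit and of the largest probability
argmax_logits = max(range(len(toy_logits)), key=lambda i: toy_logits[i])
argmax_probs = max(range(len(toy_probs)), key=lambda i: toy_probs[i])

print(argmax_logits, argmax_probs)
```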

#### Deterministic decoding

Deterministic approaches always yield the same output for a given input

##### Greedy decoding

The most straightforward way is to pick the most probable token each time and feed it back as input for the next step.
This is a very suboptimal solution: it usually yields dull responses like `"I don't know"` or causes degenerate generation (e.g., repeating the same token many times).

In [None]:
output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)
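
To see why greedy decoding can miss the most probable sequence, consider the following pure-Python sketch. `TOY_MODEL` is a hypothetical next-token table (each token maps to the distribution over the following token), not something produced by DialoGPT, and `greedy_decode` is a made-up helper.

```python
# Hypothetical bigram "language model": previous token -> next-token probabilities
TOY_MODEL = {
    "<bos>": {"i": 0.6, "you": 0.4},
    "i": {"don't": 0.5, "am": 0.3, "<eos>": 0.2},
    "don't": {"know": 0.9, "<eos>": 0.1},
    "know": {"<eos>": 1.0},
    "am": {"fine": 0.8, "<eos>": 0.2},
    "fine": {"<eos>": 1.0},
    "you": {"<eos>": 1.0},
}

def greedy_decode(model, start="<bos>", eos="<eos>", max_new_tokens=10):
    tokens, prob = [start], 1.0
    for _ in range(max_new_tokens):
        next_dist = model[tokens[-1]]
        # Always pick the single most probable next token
        next_token = max(next_dist, key=next_dist.get)
        prob *= next_dist[next_token]
        if next_token == eos:
            break
        tokens.append(next_token)
    # Drop the start token and return the sequence with its probability
    return tokens[1:], prob

greedy_tokens, greedy_prob = greedy_decode(TOY_MODEL)
print(greedy_tokens, round(greedy_prob, 2))
```

In this toy table the greedy path has probability 0.6 × 0.5 × 0.9 = 0.27, while the sequence `you <eos>` has probability 0.4: the locally best first token does not lead to the globally best sequence.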

##### Beam search

We cannot do an exhaustive search, but we can keep track of the top $n$ most probable sequences generated so far.
This is what beam search does

In [None]:
output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, num_beams=8, pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)
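
A minimal pure-Python sketch of the idea follows. It is a simplified version (no length normalisation or other refinements that `generate()` may apply), and both the `TOY_MODEL` table and the `beam_search` helper are hypothetical. With a single beam it behaves like greedy decoding; with two beams it finds a more probable sequence.

```python
import math

# Hypothetical bigram "language model": previous token -> next-token probabilities
TOY_MODEL = {
    "<bos>": {"i": 0.6, "you": 0.4},
    "i": {"don't": 0.5, "am": 0.3, "<eos>": 0.2},
    "don't": {"know": 0.9, "<eos>": 0.1},
    "know": {"<eos>": 1.0},
    "am": {"fine": 0.8, "<eos>": 0.2},
    "fine": {"<eos>": 1.0},
    "you": {"<eos>": 1.0},
}

def beam_search(model, num_beams, start="<bos>", eos="<eos>", max_new_tokens=10):
    # Each hypothesis is (log-probability, tokens, finished flag)
    beams = [(0.0, [start], False)]
    for _ in range(max_new_tokens):
        candidates = []
        for log_prob, tokens, finished in beams:
            if finished:
                # Finished hypotheses are carried over unchanged
                candidates.append((log_prob, tokens, True))
                continue
            # Expand the hypothesis with every possible next token
            for token, prob in model[tokens[-1]].items():
                candidates.append(
                    (log_prob + math.log(prob), tokens + [token], token == eos)
                )
        # Keep only the num_beams most probable hypotheses
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
        if all(finished for _, _, finished in beams):
            break
    best_log_prob, best_tokens, _ = beams[0]
    text_tokens = [t for t in best_tokens if t not in (start, eos)]
    return text_tokens, math.exp(best_log_prob)

one_beam_tokens, one_beam_prob = beam_search(TOY_MODEL, num_beams=1)
two_beam_tokens, two_beam_prob = beam_search(TOY_MODEL, num_beams=2)
print(one_beam_tokens, round(one_beam_prob, 2))
print(two_beam_tokens, round(two_beam_prob, 2))
```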

#### Sampling

Sampling-based decoding adds more spice to the output by sampling the next token according to the probability given by the language model.
The nice thing is that given the same input the generated content may change (higher diversity in the text of responses), the bad thing is that given the same input the generated content may change (possibly inconsistent behaviour).

##### Top-k

Consider only the $k$ most probable tokens and zero out the probabilities of the others

In [None]:
output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, do_sample=True, top_k=16, pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)
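
What the filter does to a distribution can be sketched in pure Python as below. The helper `top_k_filter` and the toy probabilities are made up for the example; after zeroing out the discarded tokens, the surviving probabilities are renormalised so they sum to one.

```python
def top_k_filter(probs, k):
    # Indices of the k most probable tokens
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalise the kept probabilities, zero out the others
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical next-token distribution over a 4-token vocabulary
toy_probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_k_filter(toy_probs, k=2)
print(filtered)
```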

##### Top-p (nucleus sampling)

Consider only the most probable tokens whose cumulative probability reaches $p \in [0, 1] \subseteq \mathbb{R}$ and zero out the probabilities of the others.
Similar to top-$k$ but with a variable window size.

In [None]:
output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, do_sample=True, top_p=0.95, top_k=0, pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)
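
The "variable window" can be seen in this pure-Python sketch: with the same $p$, a peaked distribution keeps very few tokens while a flat one keeps more. The helper `top_p_filter` and both toy distributions are assumptions made for illustration.

```python
def top_p_filter(probs, p):
    # Visit tokens from the most to the least probable
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        # Stop as soon as the kept tokens cover probability mass p
        if cumulative >= p:
            break
    # Renormalise the kept probabilities, zero out the others
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical peaked and flat next-token distributions
peaked = [0.9, 0.05, 0.03, 0.02]
flat = [0.4, 0.3, 0.2, 0.1]

peaked_filtered = top_p_filter(peaked, p=0.85)
flat_filtered = top_p_filter(flat, p=0.85)

print(sum(1 for x in peaked_filtered if x > 0), "token(s) kept for the peaked distribution")
print(sum(1 for x in flat_filtered if x > 0), "token(s) kept for the flat distribution")
```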

##### Temperature rescoring

Divide the logits by a value $\tau \in \mathbb{R}^+$:
- if $\tau > 1$ (high temperature) the distribution gets softer (reduces the probability of the most probable tokens and increases that of the least probable)
- if $\tau = 1$ the distribution is unchanged
- if $\tau < 1$ (low temperature) the distribution gets sharper (increases the probability of the most probable tokens and reduces that of the least probable)


In [None]:
output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, do_sample=True, temperature=0.8, top_k=0, pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)
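
The effect of the temperature can be checked numerically with a small pure-Python sketch. The logits and the helper `softmax_with_temperature` are hypothetical: a low temperature concentrates the probability mass on the most probable token, a high temperature spreads it out.

```python
import math

def softmax_with_temperature(logits, tau):
    # Divide the logits by the temperature before the softmax
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-token vocabulary
toy_logits = [2.0, 1.0, 0.0]

p_low = softmax_with_temperature(toy_logits, tau=0.5)
p_one = softmax_with_temperature(toy_logits, tau=1.0)
p_high = softmax_with_temperature(toy_logits, tau=2.0)

for tau, probs in ((0.5, p_low), (1.0, p_one), (2.0, p_high)):
    print(f"tau={tau}: {[round(p, 3) for p in probs]}")
```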

##### Sample multiple candidates

You can also sample multiple candidates and pick one according to some criterion.

In [None]:
output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, do_sample=True, num_return_sequences=8, top_p=0.9, top_k=0, pad_token_id=tokenizer.eos_token_id)
tokenizer.batch_decode(output_ids[:, input_encoding.input_ids.size(1):], skip_special_tokens=True)

For example, if you combine sampling and beam search, the `generate()` method automatically outputs the most probable of the sampled sequences

In [None]:
output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, do_sample=True, num_beams=8, temperature=0.8, top_k=0, pad_token_id=tokenizer.eos_token_id)
tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)

#### Chatting

Now pick your favourite approach and chat with DialoGPT

In [None]:
# Maximum dialogue length (in turn pairs)
max_len = 5
# Initialise dialogue history
dialogue_history = []

for i in range(max_len):
    # Read user message
    user_message = input("User: ")
    # Append message to dialogue history
    dialogue_history.append(user_message)
    # Convert dialogue to string
    input_string = tokenizer.eos_token 
    if len(dialogue_history) > 0:
        input_string = tokenizer.eos_token + tokenizer.eos_token.join(dialogue_history) + input_string
    # Encode input
    input_encoding = tokenizer(input_string, return_tensors='pt').to(device)
    # Generate DialoGPT response
    output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, do_sample=True, top_p=0.9, top_k=0, pad_token_id=tokenizer.eos_token_id)
    chatbot_response = tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)
    # Append chatbot response to dialogue history
    dialogue_history.append(chatbot_response)
    # Print chatbot response
    print(f"DialoGPT: {chatbot_response}")

Notes
1. As the dialogue goes on, the model will get slower. There is a way to cache the hidden outputs to avoid re-computing attention on past tokens
2. The context of the model is limited, at some point you should start dropping older utterances
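
One possible way to drop older utterances is sketched below: walk the history backwards and keep the most recent turns that fit a token budget. The helper `truncate_history` is hypothetical, and counting whitespace-separated words is only a stand-in for the real tokeniser (you could pass a function based on `tokenizer` as `count_tokens` instead).

```python
def truncate_history(dialogue_history, max_tokens, count_tokens=None):
    # Default token counter: whitespace-separated words (a rough approximation)
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())
    kept, budget = [], max_tokens
    # Visit utterances from the most recent to the oldest
    for utterance in reversed(dialogue_history):
        # +1 accounts for the separator token after each utterance
        cost = count_tokens(utterance) + 1
        if cost > budget:
            break
        kept.append(utterance)
        budget -= cost
    # Restore chronological order
    return list(reversed(kept))

toy_history = [
    "Hello, how are you?",
    "I'm fine thanks, how about you?",
    "Great!",
    "Good to hear",
]
truncated = truncate_history(toy_history, max_tokens=8)
print(truncated)
```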

### Fine-tuning

Now you are ready: you can finally fine-tune a generative chatbot and chat with it instead of studying for the exams! (I am not responsible for your choices, I am just offering you an alternative)

We will fine-tune a vanilla GPT-2 (pick your favourite version).
Let's load the pre-trained model

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model_id = 'gpt2'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id , device_map="auto")  # , torch_dtype=torch.float16)

Let's set the padding token to be the `eos_token` to simplify some steps later

In [None]:
tokenizer.pad_token = tokenizer.eos_token

#### Data preparation

We are going to use the [Persona-Chat](https://arxiv.org/abs/1801.07243) corpus (It was used in the ConvAI 2 challenge).
It's a data set where conversations are grounded in the persona description of the two participants.

We are going to use the [ParlAI](https://parl.ai/docs/index.html) package to get the data set.

In [None]:
#!pip -q install parlai

Now let's download the data set

In [None]:
from parlai.tasks.convai2.build import build

build({'datapath': './data/'})

Now the samples are stored in `.txt` files that we need to parse.
We can build a simple function that, given the path to one of the files, parses its content into Python dictionaries

In [None]:
def parse_pc(path):
    # Open file
    with open(path) as f:
        # Read raw file lines
        data = [line.strip() for line in f]
    # Data set container
    persona_chat = list()
    # Now we iterate through lines and build the data set
    for line in data:
        # Split line data from initial index
        line_idx, line_data = line.split(' ', 1)
        # Check if new conversation is started
        if line_idx == '1':
            # Add new empty dialogue in data set
            persona_chat.append(
                {'persona_a': list(), 'persona_b': list(), 'utterances': list()}
            )
        # If the line is from Speaker A persona
        if line_data.startswith('your persona: '):
            # Append it to Persona A
            persona_chat[-1]['persona_a'].append(line_data[len('your persona: '):])
        # Else if the line is from Speaker B persona
        elif line_data.startswith('partner\'s persona: '):
            # Append it to Persona B
            persona_chat[-1]['persona_b'].append(line_data[len('partner\'s persona: '):])
        # Else the line is a regular dialogue line
        else:
            # Split utterances from distractors and separate A and B
            utt_a, utt_b = line_data.split('\t\t')[0].split('\t')
            # Append to dialogue utterances
            persona_chat[-1]['utterances'].append(
                {'speaker': 'A', 'text': utt_a}
            )
            persona_chat[-1]['utterances'].append(
                {'speaker': 'B', 'text': utt_b}
            )
            
    return persona_chat
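
To make the splitting logic easier to follow, here is the same per-line parsing applied to three made-up lines. The line layout (index, then either a persona sentence or two tab-separated utterances followed by a double tab and the distractors) is inferred from the function above; the text of the lines and the helper `parse_line` are hypothetical.

```python
def parse_line(line):
    # Separate the leading line index from the rest of the line
    line_idx, line_data = line.split(' ', 1)
    if line_data.startswith('your persona: '):
        return int(line_idx), 'persona_a', line_data[len('your persona: '):]
    if line_data.startswith("partner's persona: "):
        return int(line_idx), 'persona_b', line_data[len("partner's persona: "):]
    # Dialogue line: drop the distractors after the double tab,
    # then split the two utterances on the single tab
    utt_a, utt_b = line_data.split('\t\t')[0].split('\t')
    return int(line_idx), 'utterances', (utt_a, utt_b)

# Made-up lines following the assumed file layout
toy_lines = [
    "1 your persona: i like tea.",
    "2 partner's persona: i have a dog.",
    "3 hello , how are you ?\ti am fine , thanks !\t\tdistractor one|distractor two",
]

parsed = [parse_line(line) for line in toy_lines]
for item in parsed:
    print(item)
```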

Let's load training and validation data

In [None]:
training_data = parse_pc('./data/ConvAI2/train_both_original.txt')
validation_data = parse_pc('./data/ConvAI2/valid_both_original.txt')

training_data[0]

Now we are going to convert to strings all the samples using the `eos_token` as separator.
We will also include the personas.

Let's define another function to do that on a single dialogue, then we will apply it to the entire data set

In [None]:
def sample_to_string(sample, eos_token):
    # Join strings of Persona A
    persona_a = ' '.join(sample['persona_a'])
    # Join strings of Persona B
    persona_b = ' '.join(sample['persona_b'])
    # Join dialogue strings
    dialogue = eos_token.join(f"{utterance['speaker']}: {utterance['text']}" for utterance in sample['utterances'])
    # Build the dialogue string
    dialogue_string = f"Persona A: {persona_a}{eos_token}Persona B: {persona_b}{eos_token}{dialogue}{eos_token}"
    
    return dialogue_string

Apply the function to all samples

In [None]:
training_data_str = [sample_to_string(dialogue, tokenizer.eos_token) for dialogue in training_data]
validation_data_str = [sample_to_string(dialogue, tokenizer.eos_token) for dialogue in validation_data]

training_data_str[0]

And now our samples are ready

#### Training

Ok we made it to fine-tuning a language model.
We can use the HuggingFace `Trainer` to do the training.

First we need to build a trainer-compatible `Dataset` object

In [None]:
from datasets import Dataset

train_data = Dataset.from_dict({'text': training_data_str})
valid_data = Dataset.from_dict({'text': validation_data_str})

... and a trainer-compatible `DatasetDict` object

In [None]:
from datasets import DatasetDict

data = DatasetDict()
data['train'] = train_data
data['validation'] = valid_data
data['test'] = valid_data

Finally use the tokenizer to convert the input strings into sequences of tokens.

In [None]:
!pip -q install --upgrade ipywidgets

In [None]:
def tokenize_function(examples):
    input_encodings = tokenizer(examples["text"], padding=True, truncation=True)
    sample = {
        'input_ids': input_encodings.input_ids
    }
    return sample

tokenized_data = data.map(tokenize_function, batched=True)

The last step we are missing is to create a collator that groups the sequences of the same batch together

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

Now we can create an instance of the trainer specifying the training arguments

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    "cooler_trainer_name", 
    evaluation_strategy="steps",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=6.25e-5,
    lr_scheduler_type="linear"
)
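
With gradient accumulation, the weights are updated only after several forward/backward passes, so the number of samples contributing to each update is larger than the per-device batch size. A quick arithmetic sketch, assuming a single device (`num_devices` is an assumption, not something set in the arguments above):

```python
# Values taken from the training arguments
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
# Assumption: training on a single GPU (or on CPU)
num_devices = 1

# Number of samples contributing to each optimiser update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(f"Effective batch size: {effective_batch_size}")
```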

Create the trainer

In [None]:
from transformers import TrainingArguments, Trainer

trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=tokenized_data['train'], 
    eval_dataset=tokenized_data['validation'],
    data_collator=data_collator
)

And now let the training begin

In [None]:
trainer.train()

Finally let's save the fine-tuned model

In [None]:
from datetime import datetime

checkpoint_path = f"persona_chat_fine_tuning_{datetime.now().strftime('%Y_%m_%d_%H_%M_%S')}"
tokenizer.save_pretrained(checkpoint_path)
model.save_pretrained(checkpoint_path)

#### Testing

We can compute some automatic metrics to assess the quality of the chatbot.
We can use the ParlAI utilities to compute the metrics.

Let's pick a random test dialogue and generate a response.
We will compare the original target response to the generated one

In [None]:
import random

random.seed(1995)

idx = random.choice(range(len(validation_data)))
print(idx)
dialogue = validation_data[idx]

Now we can pick a turn from the middle of a dialogue to be our target response

In [None]:
response_idx = len(dialogue['utterances']) // 2

original_response = dialogue['utterances'][response_idx]
original_response_string = f"{original_response['speaker']}: {original_response['text']}"
original_response_string

And we can drop the response and all the following ones to build our context

In [None]:
context = {
    'persona_a': dialogue['persona_a'],
    'persona_b': dialogue['persona_b'],
    'utterances': dialogue['utterances'][:response_idx]
}
context_string = sample_to_string(context, tokenizer.eos_token)
context_string

And let's load back the model from the checkpoint

In [None]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path , device_map="auto", torch_dtype=torch.float16)

Generate a response

In [None]:
# Encode context
input_encoding = tokenizer(context_string, return_tensors='pt').to(device)
# Generate response
output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, do_sample=True, top_p=0.9, top_k=0, pad_token_id=tokenizer.eos_token_id)
# Decode generated response
generated_response = tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)

##### Perplexity

Perplexity can be computed from the cross-entropy of the model on the target response.
First let's process the context and the response

In [None]:
# Encode dialogue
input_encoding = tokenizer(context_string + original_response_string + tokenizer.eos_token, return_tensors='pt').to(device)
# Compute model outputs
outputs = model(**input_encoding)

We get the target labels (the ids of the response)

In [None]:
labels = tokenizer(original_response_string + tokenizer.eos_token, return_tensors='pt').input_ids.to(device)
labels.size()

And then we retain only the logits from the response

In [None]:
logits = outputs.logits[:, -labels.size(1):]
logits.size()

Compute the average cross-entropy shifting the inputs and the outputs

In [None]:
import torch.nn.functional as F

# Shift logits to exclude the last element
shift_logits = logits[..., :-1, :].contiguous()
# shift labels to exclude the first element
shift_labels = labels[..., 1:].contiguous()
# Compute loss
lm_loss = F.cross_entropy(
    shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
)
lm_loss
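
The shift is needed because the logits at position $i$ are the prediction for the token at position $i + 1$. The alignment can be visualised on a made-up tokenised response (the tokens below are hypothetical, real ones come from the tokeniser). Note that, with this slicing, the first response token has no paired prediction, since the logits predicting it sit at the last position of the context.

```python
# Hypothetical tokenised response, including the final end-of-sequence token
response_tokens = ["B:", "i", "am", "fine", "<eos>"]

# Positions whose logits we keep: all but the last one
prediction_positions = response_tokens[:-1]
# Tokens to be predicted: all but the first one
targets = response_tokens[1:]

# Each pair is (token at the position producing the logits, token to predict)
pairs = list(zip(prediction_positions, targets))
for seen, target in pairs:
    print(f"logits after '{seen}' are scored against '{target}'")
```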

Exponentiate to have PPL

In [None]:
ppl = torch.exp(lm_loss)
ppl

The process can be simplified but at least now you have seen all the steps
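
In formula, for a response of $N$ tokens, $\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)$. The pure-Python sketch below applies it to made-up token probabilities (the helper `perplexity` is hypothetical): a model that is always undecided among 4 equally likely tokens has a perplexity of 4.

```python
import math

def perplexity(token_probs):
    # Negative log-likelihood of each target token
    nll = [-math.log(p) for p in token_probs]
    # Average cross-entropy over the sequence
    cross_entropy = sum(nll) / len(nll)
    # Perplexity is the exponential of the average cross-entropy
    return math.exp(cross_entropy)

# Hypothetical probabilities assigned by a model to the target tokens
uniform_ppl = perplexity([0.25] * 5)
confident_ppl = perplexity([0.9, 0.8, 0.95])

print(f"Uniform over 4 tokens: {uniform_ppl:.2f}")
print(f"Confident model: {confident_ppl:.2f}")
```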

##### BLEU

In [None]:
from parlai.core.metrics import BleuMetric

bleu = BleuMetric.compute(generated_response, [original_response_string])
print(f"BLEU: {bleu}")

##### F1

In [None]:
from parlai.core.metrics import F1Metric

f1_score = F1Metric.compute(generated_response, [original_response_string])
print(f"F1: {f1_score}")
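
As a rough idea of what this metric measures, here is a simplified token-level F1 in pure Python: the harmonic mean of precision and recall over the words shared by the generated and the reference response. The helper `unigram_f1` is hypothetical and only lower-cases the text; the ParlAI implementation may normalise the strings differently, so the values are not guaranteed to match.

```python
from collections import Counter

def unigram_f1(guess, reference):
    # Simplified normalisation: lower-case and split on whitespace
    guess_tokens = guess.lower().split()
    reference_tokens = reference.lower().split()
    # Count the tokens in common (respecting multiplicity)
    overlap = sum((Counter(guess_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(guess_tokens)
    recall = overlap / len(reference_tokens)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Made-up generated and reference responses
toy_f1 = unigram_f1("i like green tea", "i like tea")
print(f"F1: {toy_f1:.4f}")
```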

##### Chatting

There is no better way to evaluate a generative chatbot than using it to chat.
Define a custom persona for you and the chatbot (or sample two from the data set) and write a chatting loop.

In [None]:
# TODO

## ELIZA meets DialoGPT

In the 70s they made ELIZA and PARRY meet each other: https://www.theatlantic.com/technology/archive/2014/06/when-parry-met-eliza-a-ridiculous-chatbot-conversation-from-1972/372428/.
We could have ELIZA meet ChatGPT, but since we are humble we will settle for DialoGPT.

There is an implementation of ELIZA in Python that we can use: https://github.com/wadetb/eliza
Let's start by cloning the repository and adding it to our path

In [None]:
!git clone https://github.com/wadetb/eliza.git

In [None]:
import os
import sys

# Add the directory containing the cloned `eliza` repository to the import path
# (assumes the repository was cloned into the current working directory)
sys.path.append(os.getcwd())

Now we should be able to import the package

In [None]:
from eliza.eliza import Eliza

Now let's create an instance of ELIZA using the available rules (https://github.com/wadetb/eliza/blob/master/doctor.txt)

In [None]:
eliza = Eliza()
eliza.load('./eliza/doctor.txt')

Now you can chat with ELIZA.
Note that ELIZA manages its context internally.

In [None]:
# Maximum dialogue length (in turn pairs)
max_len = 5

for i in range(max_len):
    # Read user message
    user_message = input("User: ")
    # Ask ELIZA for response
    response = eliza.respond(user_message)
    # Print ELIZA response
    print(f"ELIZA: {response}")

But that's not as fun as having DialoGPT chat with ELIZA, right?
Let's load back DialoGPT

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model_id = 'microsoft/DialoGPT-large'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id , device_map="auto", torch_dtype=torch.float16)

Now we can re-use the same chatting script from before, but instead of asking the user for input, we are going to ask ELIZA

In [None]:
# Maximum dialogue length (in turn pairs)
max_len = 5
# Initialise dialogue history
dialogue_history = ["Hello"]
# Print first message
print(f"DialoGPT: {dialogue_history[0]}")


for i in range(max_len):
    # Generate ELIZA response
    eliza_message = eliza.respond(dialogue_history[-1])
    # Append message to dialogue history
    dialogue_history.append(eliza_message)
    # Print ELIZA response
    print(f"ELIZA: {eliza_message}")
    # Convert dialogue to string
    input_string = tokenizer.eos_token 
    if len(dialogue_history) > 0:
        input_string = tokenizer.eos_token + tokenizer.eos_token.join(dialogue_history) + input_string
    # Encode input
    input_encoding = tokenizer(input_string, return_tensors='pt').to(device)
    # Generate DialoGPT response
    output_ids = model.generate(input_encoding.input_ids, max_new_tokens=32, do_sample=True, top_p=0.9, top_k=0, pad_token_id=tokenizer.eos_token_id)
    dialogpt_message = tokenizer.decode(output_ids[0, input_encoding.input_ids.size(1):], skip_special_tokens=True)
    # Append DialoGPT response to dialogue history
    dialogue_history.append(dialogpt_message)
    # Print DialoGPT response
    print(f"DialoGPT: {dialogpt_message}")