## CA3

- Some questions require writing Python code and computing results, and the rest of them have written answers. For coding problems, you will have to fill out all code blocks that say **`YOUR CODE HERE`**.

- You will need to use GPU, which can be added through
`Edit > Notebook Settings > Hardware accelerator > (GPU)`

- For text-based answers, you should replace the text that says **`Write your answer here...`** with your actual answer.

- This assignment is designed such that each cell takes a few minutes (if that) to run. If it is taking longer than that, you might have made a mistake in your code.

---

##### *How to submit this problem set:*
- Write all the answers in this Colab notebook. Once you are finished, generate a PDF via (File -> Print -> Save as PDF) and upload it to Gradescope.
  
- **Important:** check your PDF before you submit to Gradescope to make sure it exported correctly. If Colab gets confused about your syntax, it will sometimes terminate the PDF creation routine early.

- When creating your final version of the PDF to hand in, please do a fresh restart and execute every cell in order. One handy way to do this is by clicking `Runtime -> Run All` in the notebook menu.

# Part 0 : Setup

In this assignment, you will fine-tune a FLAN-T5 model from Hugging Face for a dialogue summarization task, utilizing Parameter Efficient Fine-Tuning (PEFT), and assess the resulting model. The fine-tuning process also involves incorporating Meta AI's hate speech reward model, a binary classifier predicting 'not hate' or 'hate' for a given text. This model serves to provide feedback to the generated output, aiming to reduce the fine-tuned model's toxicity and generate less harmful content.

## Import Necessary Libraries

We will employ Hugging Face's Transformers, an open-source library that provides general-purpose architectures for natural language understanding and generation, along with a large collection of pretrained models contributed by the NLP community. This library makes it easy to use pretrained models and to run experiments on top of them. Additionally, we will use PEFT (Parameter-Efficient Fine-Tuning), a library designed for efficiently adapting pre-trained language models (PLMs) to various downstream applications.

Furthermore, the project requires the inclusion of TRL (Transformer Reinforcement Learning), a full stack library that provides a set of tools for training transformer language models with Reinforcement Learning. This encompasses the Supervised Fine-tuning step (SFT), Reward Modeling step (RM), and the Proximal Policy Optimization (PPO) step.







In [None]:
!pip install --upgrade pip
!pip install transformers
!pip install torch torchdata --quiet

!pip install datasets evaluate peft==0.3.0 --quiet

# Installing the Reinforcement Learning library directly from github.
!pip install git+https://github.com/lvwerra/trl.git@25fa1bd


Collecting git+https://github.com/lvwerra/trl.git@25fa1bd
  Cloning https://github.com/lvwerra/trl.git (to revision 25fa1bd) to c:\users\babrabush\appdata\local\temp\pip-req-build-uzxilvqn


  ERROR: Error [WinError 2] The system cannot find the file specified while executing command git version
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?


## Import Necessary Libraries

In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig
from datasets import load_dataset
from peft import PeftModel, PeftConfig, LoraConfig, TaskType
from transformers import TrainingArguments, Trainer
# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import evaluate

import numpy as np
import pandas as pd
import time

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()


# Part 1 : Fine-tuning with PEFT

## Load Dataset and LLM

You are going to continue experimenting with the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. This dataset comprises over 10,000 dialogues, each accompanied by manually annotated summaries and topics.

The first step is to preprocess the dataset. A subset of the dataset is selected and filtered by dialogue length, so that the examples are long enough to be meaningful but still easy to read. Each dialogue is then wrapped in an instruction prompt and tokenized. The token IDs are stored in the "input_ids" field, and the prompt text is saved in the "query" field. The "query" field is required by the TRL library, which is used later to steer the model toward less toxic text.

In [None]:
model_name="google/flan-t5-base"
huggingface_dataset_name = "knkarthick/dialogsum"

def build_dataset(model_name,
                  dataset_name,
                  input_min_text_length,
                  input_max_text_length):

    """
    Preprocess the dataset and split it into train and test parts.

    Parameters:
    - model_name (str): Tokenizer model name.
    - dataset_name (str): Name of the dataset to load.
    - input_min_text_length (int): Minimum length of the dialogues.
    - input_max_text_length (int): Maximum length of the dialogues.

    Returns:
    - dataset_splits (datasets.dataset_dict.DatasetDict): Preprocessed dataset containing train and test parts.
    """

    # load dataset (only "train" part will be enough).
    dataset = load_dataset(dataset_name, split="train")


    # Filter the dialogues of length between input_min_text_length and input_max_text_length characters.
    dataset = dataset.filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length, batched=False)

    # Prepare the tokenizer. Tokenizers run on the CPU, so no device placement is needed.
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize(example):

        start_prompt = 'Summarize the following conversation.\n\n'
        end_prompt = '\n\nSummary: '
        prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
        example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
        example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
        example["query"] = prompt
        return example

    # Tokenize each dialogue.
    dataset = dataset.map(tokenize, batched=True)


    # Split the dataset into train and test parts.
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits

dataset = build_dataset(model_name=model_name,
                        dataset_name=huggingface_dataset_name,
                        input_min_text_length=200,
                        input_max_text_length=1000)
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'labels', 'query'],
        num_rows: 8017
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'labels', 'query'],
        num_rows: 2005
    })
})
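
The split sizes printed above can be checked by hand. This is a minimal sketch, assuming `train_test_split` follows the scikit-learn convention of rounding the test size up and giving the remaining rows to the train split. `split_sizes` is a hypothetical helper, not part of the `datasets` library.

```python
import math

def split_sizes(n_rows, test_size):
    # Assumption: the test split is rounded up and the train split gets the rest.
    n_test = math.ceil(test_size * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test

n_rows = 8017 + 2005  # rows left after the length filter, from the output above
print(split_sizes(n_rows, 0.2))
```

Because the split is made with `shuffle=False`, the test part is simply the tail of the filtered training data.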


In [None]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

You can easily extract information regarding the number of model parameters and ascertain the count of trainable parameters. The subsequent function serves this purpose by calculating the number of trainable model parameters, all model parameters, and the percentage of trainable model parameters. You will utilize this function in the upcoming sections to compare models.

In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


## Test Original Model with Zero/Few Shot Inference



## Question 1.1
Select a random sample from the test set and write code to test the model for dialogue summarization using zero-shot inference. Note that you should use the same prompt used in the **build_dataset** function and instruct the model for summarization. Compare the model's summarized dialogue to the baseline summary. Do you think the model can summarize the dialogue well?

***WRITE YOUR ANSWER HERE IN A FEW SENTENCES***

Based on the example output, the model did not capture the essence of the dialogue. The baseline summary matches what happens (#Person1# tries to direct #Person2# to Marry Lin, no one answers, and #Person2# leaves a message), while the model's summary claims that Lin's office supplies are not available, which the dialogue never says. At least for this sample, the original model does not summarize the dialogue well in the zero-shot setting.

In [None]:
# WRITE YOUR CODE HERE!

import random

random.seed(16)

random_sample = random.choice(dataset["test"])

input_ids = tokenizer(random_sample["query"], padding="max_length",
                      truncation=True, return_tensors="pt").input_ids

with torch.no_grad():
    output = original_model.generate(input_ids, max_new_tokens=90)

model_summary_zero_shot = tokenizer.decode(output[0], skip_special_tokens=True)

print("Original Dialogue:")
print(random_sample["dialogue"])

print("\nBaseline Summary:")
print(random_sample["summary"])

print("\nModel Summary:")
print(model_summary_zero_shot)

Original Dialogue:
#Person1#: Lin's office supplies. How may I direct your call?
#Person2#: Marry Lin please.
#Person1#: Sure, just a moment.... I'm sorry no one answer the phone.
#Person2#: All right, could I leave a message?
#Person1#: Certainly!
#Person2#: Please ask her to call John.

Baseline Summary:
#Person1# directs #Person2# to Marry Lin but no one answers, so #Person2# leaves a message.

Model Summary:
#Person1#: I'm sorry Lin's office supplies aren't available.


## Question 1.2

This time, add a few example dialogues together with their summaries to the prompt (few-shot inference). Do you see any difference when giving more examples to the model compared to zero-shot inference?

***WRITE YOUR ANSWER HERE IN A FEW SENTENCES***

Comparing the generated summary with the baseline summary, we can see that with two demonstrations the model produced exactly the baseline summary, whereas the zero-shot summary in the previous question was poor. This is consistent with what we learned in class about few-shot prompting: a few in-context examples can substantially improve the output. A single example is not proof, though, and a word-for-word match should be treated with caution: the code below re-seeds with `random.seed(16)` before drawing the demonstrations, so one of them may coincide with the test sample, in which case the reference summary is already present in the prompt.

Note about prompt format:

I added "\n\n" to the beginning of the prompt ("start_prompt"), since the start prompt, "Summarize the following conversation.\n\n", would get mixed up in the text of summary for previous example. An example can be:

"Summary: #Person1# reminds Mr. Li of the meeting in his office at 11 o'clock.Summarize the following conversation."

Thus, by using "\n\n" at the beginning of start prompt, there would be proper separation between demonstrations.

In [None]:
# WRITE YOUR CODE HERE!

random.seed(16)

random_idxs = random.sample(range(0, len(dataset["test"])), 2)

sample_queries = [
    dataset["test"][random_idxs[0]]["query"],
    dataset["test"][random_idxs[1]]["query"]
]

sample_summaries = [
    dataset["test"][random_idxs[0]]["summary"],
    dataset["test"][random_idxs[1]]["summary"]
]

# "\n\n" is added before every demonstration after the first one. Otherwise
# the start prompt, "Summarize the following conversation.\n\n", runs into
# the summary text of the previous example, like:
#
#   "Summary: #Person1# reminds Mr. Li of the meeting in his office at
#   11 o'clock.Summarize the following conversation."
#
# The leading "\n\n" keeps the examples properly separated.

prompt = ""

for idx, query in enumerate(sample_queries):
    sample_summary = sample_summaries[idx]

    if idx == 0:
        prompt += query + sample_summary
    else:
        prompt += "\n\n" + query + sample_summary

prompt += "\n\n" + random_sample["query"]

#print(prompt)

input_ids = tokenizer(prompt, padding="max_length", truncation=True,
                      return_tensors="pt").input_ids

with torch.no_grad():
    output = original_model.generate(input_ids, max_new_tokens=90)

model_summary_few_shot = tokenizer.decode(output[0], skip_special_tokens=True)

print("Original Dialogue:")
print(random_sample["dialogue"])

print("\nBaseline Summary:")
print(random_sample["summary"])

print("\nModel Summary:")
print(model_summary_few_shot)

Original Dialogue:
#Person1#: Lin's office supplies. How may I direct your call?
#Person2#: Marry Lin please.
#Person1#: Sure, just a moment.... I'm sorry no one answer the phone.
#Person2#: All right, could I leave a message?
#Person1#: Certainly!
#Person2#: Please ask her to call John.

Baseline Summary:
#Person1# directs #Person2# to Marry Lin but no one answers, so #Person2# leaves a message.

Model Summary:
#Person1# directs #Person2# to Marry Lin but no one answers, so #Person2# leaves a message.
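
One thing worth checking in the few-shot cell is whether a demonstration can coincide with the test sample. Both cells call `random.seed(16)` first, so `random.choice` and `random.sample` start from the same random stream. The sketch below reproduces that pattern on plain index ranges (the size 2005 is taken from the test split printed earlier) and shows one way to exclude the test index. `pick_demo_indices` is a hypothetical helper, not part of the assignment code.

```python
import random

n_test = 2005  # number of rows in dataset["test"], from the output above

# Same pattern as the zero-shot cell: seed, then pick one test index.
random.seed(16)
test_idx = random.choice(range(n_test))

# Same pattern as the few-shot cell: re-seed, then sample two indices.
random.seed(16)
demo_idxs = random.sample(range(n_test), 2)

# True means one of the demonstrations is the test sample itself.
print("overlap:", test_idx in demo_idxs)

def pick_demo_indices(n_rows, k, exclude_idx, seed):
    """Sample k distinct indices from range(n_rows), never returning exclude_idx."""
    rng = random.Random(seed)
    candidates = [i for i in range(n_rows) if i != exclude_idx]
    return rng.sample(candidates, k)

safe_demo_idxs = pick_demo_indices(n_test, 2, test_idx, seed=16)
print("test index excluded:", test_idx not in safe_demo_idxs)
```

If the first line prints `True`, the reference summary was already part of the few-shot prompt, and the comparison with zero-shot inference should be repeated with demonstrations that exclude the test sample.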


## Parameter Efficient Fine-Tuning (PEFT)
Now, we will proceed with fine-tuning using Parameter Efficient Fine-Tuning (PEFT) instead of full fine-tuning. PEFT is a family of methods that train only a small number of (often newly added) parameters while keeping the base model frozen. It is significantly more efficient than full fine-tuning and often achieves comparable performance in evaluations.

## Question 1.3

Read about Parameter Efficient Fine-Tuning (PEFT) and explain the distinctions between PEFT and "full fine-tuning". Additionally, outline the advantages that PEFT provides in comparison to the traditional approach of full fine-tuning.

***WRITE YOUR ANSWER HERE IN A FEW SENTENCES***

Parameter Efficient Fine-Tuning (PEFT) differs from traditional "full fine-tuning" in how many parameters it updates. In full fine-tuning, every weight of the pre-trained model is adjusted, which is computationally expensive, memory-hungry, and time-consuming.

PEFT, on the other hand, freezes most or all of the pre-trained weights and trains only a small set of parameters. Depending on the method, these are a subset of the existing weights or small newly added modules; LoRA, for example, adds trainable low-rank matrices next to selected weight matrices. The knowledge encoded in the pre-trained model is preserved, while the small trainable part adapts it to the new task.

The advantages of PEFT lie in its efficiency and reduced resource requirements. Because only a small fraction of the parameters is updated, training needs less memory and compute, the saved adapter is small, and performance in task-specific evaluations is often comparable to full fine-tuning. This makes PEFT an attractive option when computational resources are limited or when rapid adaptation to a new task is needed.


Each PEFT method is defined by a PeftConfig class that stores all the important parameters for building a PeftModel. Since we're using LoRA, we'll need to load and create a LoraConfig class.

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)
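
For reference, the roles of `r` and `lora_alpha` can be written out. This is the standard LoRA formulation rather than anything specific to this notebook; $W_0$, $A$, $B$, $d_{\text{in}}$ and $d_{\text{out}}$ are notation introduced here.

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}},\quad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad A \in \mathbb{R}^{r \times d_{\text{in}}}
$$

Only $A$ and $B$ are trained, so each adapted matrix adds $r\,(d_{\text{in}} + d_{\text{out}})$ trainable parameters while $W_0$ stays frozen. With `r=32` and `lora_alpha=32` as above, the scaling factor $\alpha / r$ equals 1.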

## Question 1.4
Wrap your base model and peft_config with the get_peft_model function from PEFT library to create a PeftModel. To get a sense of the number of trainable parameters in your model, use the print_number_of_trainable_model_parameters and report the percentage of trainable parameters.

In [None]:
peft_model = get_peft_model(original_model, lora_config)
peft_model.print_trainable_parameters()

trainable params: 3538944 || all params: 251116800 || trainable%: 1.4092820552029972
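
The reported count can be reproduced by hand. The sketch below assumes the FLAN-T5-base architecture (hidden size 768, 12 encoder and 12 decoder blocks, square `q`/`v` projections). These dimensions are not stated in the notebook, so treat them as assumptions.

```python
# Assumed FLAN-T5-base dimensions (not stated in the notebook).
d_model = 768
n_encoder_layers = 12
n_decoder_layers = 12
r = 32  # LoRA rank from lora_config

# Each adapted projection gets A (r x d_model) and B (d_model x r).
params_per_module = r * (d_model + d_model)

# Encoder blocks: self-attention q and v.
# Decoder blocks: self-attention q and v plus cross-attention q and v.
n_modules = n_encoder_layers * 2 + n_decoder_layers * 4

lora_params = n_modules * params_per_module
base_params = 247_577_856  # "all model parameters" printed for the original model

print(lora_params)
print(f"{100 * lora_params / (base_params + lora_params):.4f}%")
```

Under these assumptions the totals agree with the `trainable params` and `all params` values printed by `print_trainable_parameters()`.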


In [None]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=1,
    logging_steps=20,

)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=dataset["train"],
)

Now everything is ready to train the PEFT adapter and save the model. Please note that the next cell may take up to an hour to run.

In [None]:
peft_trainer.train()

peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

Step,Training Loss
20,21.4172
40,3.1645
60,0.8737
80,0.3294
100,0.2258
120,0.1979
140,0.1676
160,0.1508
180,0.1555
200,0.1449


('./peft-dialogue-summary-checkpoint-local\\tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local\\special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local\\tokenizer.json')

## Question 1.5
Prepare the fine-tuned model by adding an adapter to the original FLAN-T5 model. Please set `is_trainable=False`, because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [None]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained(model_name,
                                                torch_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_name)

peft_model = PeftModel.from_pretrained(peft_model_base,
                                "./peft-dialogue-summary-checkpoint-local",
                                torch_dtype=torch.bfloat16,
                                is_trainable=False)

## Question 1.6
Write code to test the fine-tuned model in a zero-shot setting.

#### Clarification:

Since the result with seed == 16 was not very good, I tried another example with seed == 106. I suspected that the weak performance on the first sample might be due to chance or to factors I was not aware of, and that the model could do better. The result for seed == 106 confirmed this suspicion: the model performed much better, and its generated summary was close to the baseline summary (unlike the one generated with seed == 16).

For the sake of consistency, I kept the results for both seeds. They show that a poor result on one sample does not necessarily mean that the model is bad: the model handles some samples well and others poorly. I expected it to perform well on both seeds, but it only came close on seed == 106.

In [None]:
# WRITE YOUR CODE HERE!

random.seed(16)

random_sample = random.choice(dataset["test"])

input_ids = tokenizer(random_sample["query"], padding="max_length",
                      truncation=True, return_tensors="pt").input_ids

with torch.no_grad():
    model_output = peft_model.generate(input_ids=input_ids,
        generation_config=GenerationConfig(), max_new_tokens=200, num_beams=1)

peft_model_summary_zero_shot = tokenizer.decode(model_output[0],
                                                skip_special_tokens=True)

print("Original Dialogue:")
print(random_sample["dialogue"])

print("\nBaseline Summary:")
print(random_sample["summary"])

print("\nModel Summary:")
print(peft_model_summary_zero_shot)

Original Dialogue:
#Person1#: Lin's office supplies. How may I direct your call?
#Person2#: Marry Lin please.
#Person1#: Sure, just a moment.... I'm sorry no one answer the phone.
#Person2#: All right, could I leave a message?
#Person1#: Certainly!
#Person2#: Please ask her to call John.

Baseline Summary:
#Person1# directs #Person2# to Marry Lin but no one answers, so #Person2# leaves a message.

Model Summary:
#Person1# directs Lin's office supplies to John.


In [None]:
# WRITE YOUR CODE HERE!

random.seed(106)

random_sample = random.choice(dataset["test"])

input_ids = tokenizer(random_sample["query"], padding="max_length",
                      truncation=True, return_tensors="pt").input_ids

with torch.no_grad():
    model_output = peft_model.generate(input_ids=input_ids,
        generation_config=GenerationConfig(), max_new_tokens=200, num_beams=1)

peft_model_summary_zero_shot = tokenizer.decode(model_output[0],
                                                skip_special_tokens=True)

print("Original Dialogue:")
print(random_sample["dialogue"])

print("\nBaseline Summary:")
print(random_sample["summary"])

print("\nModel Summary:")
print(peft_model_summary_zero_shot)

Original Dialogue:
#Person1#: Hello, Mr. Kowalski?
#Person2#: I'm here, hello.
#Person1#: Which city are you staying in right now? And the name of your hotel?
#Person2#: I'm in Beijing, at the Weston Hotel.
#Person1#: Do you have your passport with you? Or do you by any chance know the number?
#Person2#: I don't have it to hand, but I know the number. It's 16211469 9. Can you do anything, like stop the card for example?
#Person1#: Let me repeat that back to you, 16211469 9. That's just what I've done, Mr. Kowalski, I've stopped your card temporarily.
#Person2#: But, what do I do if I need cash?
#Person1#: You can go to any branch of IBA and request the Emergency Assistance Service. Everything will be taken care of, there's no need to worry.
#Person2#: Thank you so much. I'll find the nearest branch and come in tomorrow. Thanks again for all of your help.

Baseline Summary:
Mr. Kowalski tells #Person1# that he is in Beijing and tells his passport number. #Person1# helps him stop his car

## Question 1.7

As you may observe, the model might not perform well, given that we have only trained it for one epoch. To save training time, please use the LoRA adapter checkpoint provided in [this link](https://huggingface.co/z7ye/peft-dialogue-summary-checkpoint/tree/main) and repeat the preceding step (zero-shot inference). Compare the results with the original model outputs in zero/few-shot settings. We will continue with this model for the remainder of the assignment.

In [None]:
# WRITE YOUR CODE HERE!

peft_model_trained = PeftModel.from_pretrained(peft_model_base,
                                "z7ye/peft-dialogue-summary-checkpoint",
                                torch_dtype=torch.bfloat16,
                                is_trainable=False)

random.seed(16)

random_sample = random.choice(dataset["test"])

input_ids = tokenizer(random_sample["query"], padding="max_length",
                      truncation=True, return_tensors="pt").input_ids

with torch.no_grad():
    model_output = peft_model_trained.generate(input_ids=input_ids,
        generation_config=GenerationConfig(), max_new_tokens=200, num_beams=1)

trained_peft_model_summary_zero_shot = tokenizer.decode(model_output[0],
                                                    skip_special_tokens=True)

print("Original Dialogue:")
print(random_sample["dialogue"])

print("\nBaseline Summary:")
print(random_sample["summary"])

print("\nModel Summary:")
print(trained_peft_model_summary_zero_shot)

Original Dialogue:
#Person1#: Lin's office supplies. How may I direct your call?
#Person2#: Marry Lin please.
#Person1#: Sure, just a moment.... I'm sorry no one answer the phone.
#Person2#: All right, could I leave a message?
#Person1#: Certainly!
#Person2#: Please ask her to call John.

Baseline Summary:
#Person1# directs #Person2# to Marry Lin but no one answers, so #Person2# leaves a message.

Model Summary:
#Person1# directs Lin's office supplies to John.


In [None]:
# WRITE YOUR CODE HERE!

peft_model_trained = PeftModel.from_pretrained(peft_model_base,
                                "z7ye/peft-dialogue-summary-checkpoint",
                                torch_dtype=torch.bfloat16,
                                is_trainable=False)

random.seed(106)

random_sample = random.choice(dataset["test"])

input_ids = tokenizer(random_sample["query"], padding="max_length",
                      truncation=True, return_tensors="pt").input_ids

with torch.no_grad():
    model_output = peft_model_trained.generate(input_ids=input_ids,
        generation_config=GenerationConfig(), max_new_tokens=200, num_beams=1)

trained_peft_model_summary_zero_shot = tokenizer.decode(model_output[0],
                                                    skip_special_tokens=True)

print("Original Dialogue:")
print(random_sample["dialogue"])

print("\nBaseline Summary:")
print(random_sample["summary"])

print("\nModel Summary:")
print(trained_peft_model_summary_zero_shot)

Original Dialogue:
#Person1#: Hello, Mr. Kowalski?
#Person2#: I'm here, hello.
#Person1#: Which city are you staying in right now? And the name of your hotel?
#Person2#: I'm in Beijing, at the Weston Hotel.
#Person1#: Do you have your passport with you? Or do you by any chance know the number?
#Person2#: I don't have it to hand, but I know the number. It's 16211469 9. Can you do anything, like stop the card for example?
#Person1#: Let me repeat that back to you, 16211469 9. That's just what I've done, Mr. Kowalski, I've stopped your card temporarily.
#Person2#: But, what do I do if I need cash?
#Person1#: You can go to any branch of IBA and request the Emergency Assistance Service. Everything will be taken care of, there's no need to worry.
#Person2#: Thank you so much. I'll find the nearest branch and come in tomorrow. Thanks again for all of your help.

Baseline Summary:
Mr. Kowalski tells #Person1# that he is in Beijing and tells his passport number. #Person1# helps him stop his car

# Part 2 : Prepare Reward Model and Reduce Toxicity

Recently, some studies have aimed to align language models' output with human preferences by incorporating their feedback into the training process. However, obtaining feedback from human labelers on model outputs can be costly. An alternative approach is to employ a reward model, leveraging feedback generated by the model.

In this section, we will utilize Meta AI's RoBERTa-based hate speech model as the reward model to refine the outputs of the previous section's model. This RoBERTa model generates logits, predicting probabilities across two classes: 'nothate' and 'hate.' The logits corresponding to 'nothate' will be treated as positive rewards. Subsequently, the model will undergo fine-tuning with Proximal Policy Optimization (PPO) using these reward values.

Fine-tuning a language model via PPO consists of roughly three steps:

* ***Rollout:*** The language model generates a response or continuation based on a query which could be the start of a sentence.
* ***Evaluation:*** The query and response are evaluated with a function, model, human feedback, or some combination of them. The important thing is that this process should yield a scalar value for each query/response pair. The optimization will aim at maximizing this value.
* ***Optimization:*** This is the most complex part. In the optimization step, the query/response pairs are used to calculate the log-probabilities of the tokens in the sequences. This is done with the model being trained and with a reference model, which is usually the pre-trained model before fine-tuning. The KL-divergence between the two outputs is used as an additional reward signal to make sure the generated responses don't deviate too far from the reference language model. The active language model is then trained with PPO.

The full process is illustrated in the following figure:

<img src="https://i.ibb.co/Z6s6SsN/trl-overview.png" class="center"/>
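
The evaluation and optimization steps above combine into one reward signal per generated token. Here is a minimal sketch of that idea in plain Python, assuming a per-token KL estimate (log-probability under the trained model minus log-probability under the reference model) and the scalar score added on the last token. `kl_penalized_rewards` and the value of `kl_coef` are illustrative and not taken from this notebook.

```python
def kl_penalized_rewards(logprobs, ref_logprobs, score, kl_coef=0.2):
    """Per-token rewards: a KL penalty on every token, plus the score on the last one."""
    rewards = [-kl_coef * (lp - ref_lp) for lp, ref_lp in zip(logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards

# Toy response of three tokens (hypothetical numbers).
logprobs = [-1.0, -2.0, -0.5]
ref_logprobs = [-1.5, -2.0, -0.5]
rewards = kl_penalized_rewards(logprobs, ref_logprobs, score=3.0)
print(rewards)
```

A token that the trained model makes more likely than the reference model receives a negative reward, which is what keeps the policy close to the reference.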

## Question 2.1

Once again, incorporate the adapter into the original model (use the `lora_config` defined in previous sections). Please note that in the previous section, you only used the model for inference. In this part, we will fine-tune the combined model, so ensure that `is_trainable` is set to `True`. Then, create a PPO model using [AutoModelForSeq2SeqLMWithValueHead](https://huggingface.co/docs/trl/main/en/models#trl.create_reference_model) and pass the fine-tuned PEFT model to it.

In [None]:
peft_model = PeftModel.from_pretrained(peft_model_base,
                                "z7ye/peft-dialogue-summary-checkpoint",
                                lora_config=lora_config,
                                torch_dtype=torch.bfloat16,
                                device_map="auto",
                                is_trainable=True)

ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,
                                                    torch_dtype=torch.bfloat16,
                                                    is_trainable=True)

As explained before, PPO requires a reference model representing the LLM before detoxification. None of the parameters of the reference model will be updated during PPO training.



In [None]:
ref_model = create_reference_model(ppo_model)

The cell below creates an instance of the RoBERTa model to serve as our reward model for predicting text toxicity. Please note that in the model, label 0 corresponds to the 'nothate' class, while label 1 corresponds to the 'hate' class.

In [None]:
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name)
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map="auto")
print(toxicity_model.config.id2label)

{0: 'nothate', 1: 'hate'}


After loading our reward model, we need to calculate the reward for a given input. To test the reward model, a **sentiment_pipe** pipeline is created to calculate the logits for the 'hate' and 'nothate' classes. Please note that the logits corresponding to 'nothate' will be treated as positive rewards. The following cell tests the reward model with two sample inputs. Keep **sentiment_pipe** in mind, as you may need it for the training section.
```
non_toxic_text = #Person 1# tells #Person 2# that the food doesn't taste quite good.
toxic_text = #Person 1# tells #Person 2# that the food tastes terrible and the chef must be a dumb person.
```



In [None]:
non_toxic_text = "#Person 1# tells #Person 2# that the food doesn't taste quite good."
toxic_text = "#Person 1# tells #Person 2# that the food tastes terrible and the chef must be a dumb person."

device = 0 if torch.cuda.is_available() else "cpu"

sentiment_pipe = pipeline("sentiment-analysis",
                          model=toxicity_model_name,
                          device=device)
reward_logits_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # Set to "none" to retrieve raw logits.
    "batch_size": 16
}

reward_probabilities_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "softmax", # Set to "softmax" to apply softmax and retrieve probabilities.
    "batch_size": 16
}

print("Reward model output:")
print("For non-toxic text")
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))
print("For toxic text")
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

Reward model output:
For non-toxic text
[{'label': 'nothate', 'score': 3.7225563526153564}, {'label': 'hate', 'score': -3.0814623832702637}]
[{'label': 'nothate', 'score': 0.9988918900489807}, {'label': 'hate', 'score': 0.0011080786352977157}]
For toxic text
[{'label': 'hate', 'score': -0.004037089645862579}, {'label': 'nothate', 'score': -0.192849799990654}]
[{'label': 'hate', 'score': 0.5470634698867798}, {'label': 'nothate', 'score': 0.4529365599155426}]
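
The two kinds of output above are related by a softmax, and the order of the returned labels is not fixed: in the toxic example 'hate' comes first, while in the non-toxic example 'nothate' comes first. The sketch below, in plain Python, reproduces the printed probabilities from the printed logits and reads the 'nothate' score by label instead of by position. The helper names `softmax` and `score_for` are introduced here for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_for(pipeline_output, label):
    """Pick the score of `label` from one pipeline result (a list of dicts)."""
    for entry in pipeline_output:
        if entry["label"] == label:
            return entry["score"]
    raise KeyError(label)


# Logits copied from the reward model output printed above.
non_toxic_logits = [
    {"label": "nothate", "score": 3.7225563526153564},
    {"label": "hate", "score": -3.0814623832702637},
]
toxic_logits = [
    {"label": "hate", "score": -0.004037089645862579},
    {"label": "nothate", "score": -0.192849799990654},
]

for name, output in [("non-toxic", non_toxic_logits), ("toxic", toxic_logits)]:
    probs = softmax([entry["score"] for entry in output])
    labelled = {entry["label"]: p for entry, p in zip(output, probs)}
    reward = score_for(output, "nothate")  # reward = 'nothate' logit
    print(name, labelled, "reward:", reward)
```

Looking the score up by label matters when the result is used as a reward: taking the first list element would return the 'hate' logit for the toxic example.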


The following cells set up the configuration parameters and initialize the PPOTrainer.

In [None]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])
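
As a quick illustration of what `collator` does, the sketch below applies it to two toy records. It turns a list of per-example dictionaries into a single dictionary of lists, so variable-length `input_ids` stay as separate items instead of being padded into one tensor. The toy records are invented for this example.

```python
def collator(data):
    # Same function as in the cell above: list of dicts -> dict of lists.
    return dict((key, [d[key] for d in data]) for key in data[0])


# Two made-up examples with input_ids of different lengths.
toy_batch = [
    {"input_ids": [101, 7, 42], "query": "Summarize: A greets B."},
    {"input_ids": [101, 9], "query": "Summarize: B thanks A."},
]

collated = collator(toy_batch)
print(collated["input_ids"])
print(collated["query"])
```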

In [None]:
dataset.set_format(type="torch")

In [None]:
learning_rate=1.41e-5
max_ppo_epochs=1
mini_batch_size=4
batch_size=16

config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

ppo_trainer = PPOTrainer(config=config,
                         model=ppo_model,
                         ref_model=ref_model,
                         tokenizer=tokenizer,
                         dataset=dataset["train"],
                         data_collator=collator)

## Question 2.2
Now it is time to fine-tune the model. The fine-tuning loop consists of the following main steps, as discussed at the beginning of this section:



1.   Get the query responses from the policy LLM (PEFT model).
2.   Obtain sentiments for the query/responses from the hate speech RoBERTa model.
3.   Optimize the policy with PPO using the (query, response, reward) triplet.

You can refer to [this link](https://huggingface.co/docs/trl/main/en/ppo_trainer) for additional information on fine-tuning.

**Note:** If training takes too long, you can limit it to a maximum of 10 steps.

In [None]:
# WRITE YOUR CODE HERE!

generation_params = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

reward_params = {
    "top_k": None,
    "function_to_apply": "none",
    "batch_size": 16
}

max_steps = 10

nothate_idx = 0
hate_idx = 1

min_length = 100
max_length = 400

length_sampler = LengthSampler(min_length, max_length)

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    if step >= max_steps:
        break

    prompts = batch["input_ids"]

    summaries = []

    for prompt in prompts:
        max_new_tokens = length_sampler()

        generation_params["max_new_tokens"] = max_new_tokens

        summary = ppo_trainer.generate(prompt, **generation_params)

        summaries.append(summary.squeeze()[-max_new_tokens:])

    batch["response"] = [tokenizer.decode(summary.squeeze()) \
                        for summary in summaries]

    query_response_pairs = [query + response for query, response in \
                            zip(batch["query"], batch["response"])]

    rewards = sentiment_pipe(query_response_pairs, **reward_params)

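    # The order of the pipeline output depends on the scores (see the toxic
    # example above), so select the 'nothate' logit by label, not by position.
    rewards = [torch.tensor(next(r["score"] for r in reward if r["label"] == "nothate"))
               for reward in rewards]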

    stats = ppo_trainer.step(prompts, summaries, rewards)

    ppo_trainer.log_stats(stats, batch, rewards)

    print("\n" + "*" * 66)

    print(f"objective/kl: {stats['objective/kl']}")
    print(f"ppo/returns/mean: {stats['ppo/returns/mean']}")
    print(f"ppo/policy/advantages_mean: {stats['ppo/policy/advantages_mean']}")

    print("*" * 66 + "\n")

0it [00:00, ?it/s]You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
1it [00:26, 26.02s/it]


******************************************************************
objective/kl: 36.644439697265625
ppo/returns/mean: -0.9190462231636047
ppo/policy/advantages_mean: 2.4316149094261164e-10
******************************************************************



2it [00:53, 26.83s/it]


******************************************************************
objective/kl: 35.948875427246094
ppo/returns/mean: -0.8906723260879517
ppo/policy/advantages_mean: 1.8006046431651157e-08
******************************************************************



3it [01:17, 25.59s/it]


******************************************************************
objective/kl: 33.34962463378906
ppo/returns/mean: -0.9622510075569153
ppo/policy/advantages_mean: 1.209582389805064e-08
******************************************************************



4it [01:42, 25.28s/it]


******************************************************************
objective/kl: 33.13213348388672
ppo/returns/mean: -0.9082210063934326
ppo/policy/advantages_mean: 2.412597766010549e-08
******************************************************************



5it [02:05, 24.67s/it]


******************************************************************
objective/kl: 28.285188674926758
ppo/returns/mean: -0.6063289046287537
ppo/policy/advantages_mean: 6.457875123544454e-09
******************************************************************



6it [02:32, 25.22s/it]


******************************************************************
objective/kl: 32.95925521850586
ppo/returns/mean: -0.9330865740776062
ppo/policy/advantages_mean: 5.4152415884800575e-09
******************************************************************



7it [03:00, 26.08s/it]


******************************************************************
objective/kl: 36.8443603515625
ppo/returns/mean: -0.9925538301467896
ppo/policy/advantages_mean: -1.2786291136990258e-08
******************************************************************



8it [03:27, 26.39s/it]


******************************************************************
objective/kl: 35.48625183105469
ppo/returns/mean: -0.9315858483314514
ppo/policy/advantages_mean: 3.066978848664803e-08
******************************************************************



9it [03:54, 26.57s/it]


******************************************************************
objective/kl: 34.01701354980469
ppo/returns/mean: -0.980808436870575
ppo/policy/advantages_mean: -5.3629181095971035e-09
******************************************************************



10it [04:18, 25.85s/it]


******************************************************************
objective/kl: 33.13680648803711
ppo/returns/mean: -0.924446702003479
ppo/policy/advantages_mean: -1.2312905361966386e-08
******************************************************************






## Question 2.3

Sample some examples from the test set and compare the outputs of the reference model and the PPO model by calculating the average reward for their generated output. Does the average reward increase after fine-tuning?

***WRITE YOUR ANSWER HERE IN A FEW SENTENCES***

Yes, the average reward increases after fine-tuning. On the 20 sampled test examples, the average reward is 1.85 for the reference model (baseline) and 1.97 for the PPO model, a gain of about 0.12. The mean absolute per-sample difference between the two models' rewards is 0.31. Since this is larger than the gain in the mean, the reward did not increase on every sample: it went up on some and down on others. Overall, PPO with the hate speech reward model moved the model in the intended direction.

However, the increase is small, which is plausible given that training ran for only 10 PPO steps and the evaluation uses only 20 samples with sampling-based generation. With more training steps and more evaluation samples, we would expect a larger and more reliable improvement of the fine-tuned model over the reference model, although this run does not verify that.

In [None]:
# WRITE YOUR CODE HERE!

batch_size = 20
results = {}
baseline_summaries = []
summaries = []

df = dataset["test"][:batch_size]

results["query"] = df["query"]
prompts = df["input_ids"]

for i in tqdm(range(batch_size)):
    max_new_tokens = length_sampler()

    generation_params["max_new_tokens"] = max_new_tokens

    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompts[i]).unsqueeze(dim=0).to(device),
        **generation_params).squeeze()[-max_new_tokens:]

    baseline_summaries.append(summary)

    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompts[i]).unsqueeze(dim=0).to(device),
        **generation_params).squeeze()[-max_new_tokens:]

    summaries.append(summary)

results["baseline_summary"] = [tokenizer.decode(baseline_summaries[i]) \
                                                for i in range(batch_size)]

results["generated_summary"] = [tokenizer.decode(summaries[i]) \
                                                for i in range(batch_size)]

baseline_summaries_queries_pairs = [query + summary for query, summary in \
                        zip(results["query"], results["baseline_summary"])]

baseline_rewards = sentiment_pipe(baseline_summaries_queries_pairs,
                                  **reward_params)

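# Select the 'nothate' logit by label; the pipeline output order depends on the scores.
results["baseline_rewards"] = [
    next(r["score"] for r in reward if r["label"] == "nothate")
    for reward in baseline_rewards]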

generated_summaries_queries_pairs = [query + summary for query, summary in \
                        zip(results["query"], results["generated_summary"])]

generated_rewards = sentiment_pipe(generated_summaries_queries_pairs,
                                   **reward_params)

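# Select the 'nothate' logit by label; the pipeline output order depends on the scores.
results["generated_rewards"] = [
    next(r["score"] for r in reward if r["label"] == "nothate")
    for reward in generated_rewards]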

100%|████████████████████████████████████████████████████████████████████████████████| 20/20 [00:41<00:00,  2.09s/it]


In [None]:
mean_rewards_generated = np.mean(results["generated_rewards"])
mean_rewards_baseline = np.mean(results["baseline_rewards"])

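# Note: "actual and real samples" refers to the reference (baseline) model's outputs,
# and "generated samples" refers to the PPO model's outputs.
print(f"Average of rewards for actual and real samples: {mean_rewards_baseline}")
print(f"Average of rewards for generated samples: {mean_rewards_generated}")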

difference = abs(np.array(results["generated_rewards"]) - \
                 np.array(results["baseline_rewards"]))

average_difference = np.mean(difference)

print(f"Average difference of rewards between generated and actual/real samples: {average_difference}")

Average of rewards for actual and real samples: 1.8514214515686036
Average of rewards for generated samples: 1.966870778799057
Average difference of rewards between generated and actual/real samples: 0.31240854859352113
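
To put the printed numbers in context, the sketch below computes the absolute and relative change in mean reward from the two averages above. It then uses a small made-up set of per-sample rewards to show why the mean absolute difference can exceed the change in the mean. The toy reward lists and the helper name `summarize_rewards` are assumptions for illustration; only the two averages are taken from the output.

```python
from statistics import mean

# Averages copied from the output above.
mean_baseline = 1.8514214515686036
mean_generated = 1.966870778799057

abs_gain = mean_generated - mean_baseline
rel_gain = abs_gain / mean_baseline
print(f"absolute gain: {abs_gain:.4f}, relative gain: {rel_gain:.2%}")


def summarize_rewards(before, after):
    """Compare paired per-sample rewards from two models."""
    diffs = [a - b for a, b in zip(after, before)]
    return {
        "mean_diff": mean(diffs),                      # signed: gains and losses cancel
        "mean_abs_diff": mean(abs(d) for d in diffs),  # unsigned: nothing cancels
        "share_improved": sum(d > 0 for d in diffs) / len(diffs),
    }


# Hypothetical per-sample rewards (not from the notebook run).
toy_before = [2.0, 1.5, 2.5, 1.0]
toy_after = [2.5, 1.0, 2.75, 1.25]
toy_stats = summarize_rewards(toy_before, toy_after)
print(toy_stats)
```

Reporting the signed mean difference and the share of improved samples alongside the absolute difference makes it easier to tell a consistent improvement from one that is driven by a few samples.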
