# Fine-Tune FLAN-T5 with Reinforcement Learning (PPO) and PEFT to Generate Less-Toxic Summaries

In this notebook, you will fine-tune a FLAN-T5 model to generate less toxic content, guided by Facebook's hate speech reward model. The reward model is a binary classifier that predicts either "not hate" or "hate" for the given text. You will use Proximal Policy Optimization (PPO) to fine-tune the model and reduce its toxicity.


# 1 - Required Dependencies

Mount Google Drive.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Install the dependencies, including the `trl` library for PPO.

In [2]:
# %pip install --upgrade pip
# %pip install \
#     torch==1.13.1 \
#     torchdata==0.5.1
%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 \
    trl==0.4.4

Collecting transformers==4.27.2
  Downloading transformers-4.27.2-py3-none-any.whl (6.8 MB)
Collecting datasets==2.11.0
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting evaluate==0.4.0
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting rouge_score==0.1.2
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting loralib==0.1.1
  Downloading loralib-0.1.1-py3-none-any.whl (8.8 kB)
Collecting peft==0.3.0
  Downloading peft-0.3.0-py3-none-any.whl (56 kB)

In [3]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig
from datasets import load_dataset
from peft import PeftModel, PeftConfig, LoraConfig, TaskType

# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import evaluate

import pandas as pd
import numpy as np

# tqdm library makes the loops show a smart progress meter
from tqdm import tqdm
tqdm.pandas()

# 2 - Load FLAN-T5 Model, Prepare Reward Model and Toxicity Evaluator

## 2.1 - Load Data and FLAN-T5 Model Fine-Tuned with Summarization Instruction

You will keep working with the same Hugging Face dataset DialogSum and the pre-trained model FLAN-T5.

In [4]:
device = torch.device("cuda")

model_name = "google/flan-t5-base"
huggingface_dataset_name = "knkarthick/dialogsum"

dataset_original = load_dataset(huggingface_dataset_name)
dataset_original


Downloading and preparing dataset csv/knkarthick--dialogsum to /root/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-c8fac5d84cd35861/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-c8fac5d84cd35861/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
})

The next step is to preprocess the dataset. You will take only part of it, then keep the dialogues within a particular length range (long enough to be meaningful and, at the same time, easy to read). Then wrap each dialogue with the instruction and tokenize the prompts. Save the token IDs in the field `input_ids` and the decoded version of the prompts in the field `query`.

You could do all of that step by step in the cell below, but it is a good habit to organize it in a function, `build_dataset`:

In [5]:
def build_dataset(model_name, dataset_name, input_min_text_length, input_max_text_length):
    # load dataset (only "train" part will be enough for this lab)
    dataset = load_dataset(dataset_name, split="train")

    # filter the dialogues of length between min_text_length and max_text_length characters
    dataset = dataset.filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length, batched=False)

    # Prepare the tokenizer that matches the model checkpoint.
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize(sample):
        # Wrap each dialogue with instruction.
        prompt = f"""
Summarize the following dialogue:

{sample["dialogue"]}

Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)

        # This must be called "query", which is a requirement for our PPO library.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    # Tokenize each dialogue
    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")

    # Split the dataset into train and test parts
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits

dataset = build_dataset(
    model_name=model_name,
    dataset_name=huggingface_dataset_name,
    input_min_text_length=200,  # 200
    input_max_text_length=1000)  # 500

print(dataset)




DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 8017
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 2005
    })
})
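The filtering and prompt-wrapping steps inside `build_dataset` can be illustrated without downloading anything. The sketch below is a plain-Python stand-in: the two sample dialogues and the helper names `keep_dialogue` and `build_prompt` are hypothetical, and only the length bounds and the prompt template are taken from the cell above.

```python
# Hypothetical stand-ins for two DialogSum records (not real dataset rows).
samples = [
    {"dialogue": "#Person1#: Hi. #Person2#: Hello."},           # too short
    {"dialogue": "#Person1#: " + "How are you today? " * 15},   # long enough
]

def keep_dialogue(sample, min_len=200, max_len=1000):
    # Same condition as the filter in build_dataset: length is counted in characters.
    return min_len < len(sample["dialogue"]) <= max_len

def build_prompt(sample):
    # Same instruction template that build_dataset wraps around each dialogue.
    return f"""
Summarize the following dialogue:

{sample["dialogue"]}

Summary:
"""

kept = [s for s in samples if keep_dialogue(s)]
prompts = [build_prompt(s) for s in kept]

print(len(kept))          # number of dialogues that pass the length filter
print(prompts[0][:40])    # beginning of the first wrapped prompt
```

Note that the filter measures characters, not tokens, so the tokenized prompt length still varies from sample to sample.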


In the previous lab, you fine-tuned a PEFT model with summarization instructions. The training in that notebook was done on a subset of the data. Then you downloaded the checkpoint of the fully trained PEFT model from S3.

Prepare a function to pull out the number of model parameters (it is the same as in the previous lab):

In [6]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params} \nall model parameter: {all_model_params} \npercentage of trainable model parameters: {round(trainable_model_params/all_model_params * 100, 2)}%"




Add the adapter to the original FLAN-T5 model. In the previous lab you added the fully trained adapter only for inference, so there was no need to pass LoRA configurations. Now you need to pass them to the constructed PEFT model, also setting `is_trainable=True`.

In [27]:
lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.SEQ_2_SEQ_LM  # FLAN_T5
)

# model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# peft_model = PeftModel.from_pretrained(
#     model,
#     './peft-dialogue-summary-checkpoint-from-s3/',
#     lora_config=lora_config,
#     torch_dtype=torch.bfloat16,
#     device_map="auto",
#     is_trainable=True)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = model.to(device)

peft_model = PeftModel.from_pretrained(
    model,
    '/content/drive/MyDrive/peft/peft-0705',
    lora_config=lora_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    is_trainable=True)
peft_model = peft_model.to(device)

print_number_of_trainable_model_parameters(peft_model)

'trainable model parameters: 3538944 \nall model parameter: 251116800 \npercentage of trainable model parameters: 1.41%'
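The trainable-parameter count reported above can be reproduced by hand from the LoRA configuration. This is a sketch under assumed FLAN-T5-base dimensions (hidden size 768, 12 encoder and 12 decoder layers); those shapes are not printed anywhere in this notebook, so treat them as assumptions.

```python
# Assumed FLAN-T5-base shapes (not shown in this notebook): the "q" and "v"
# projections are 768 x 768, with 12 encoder layers and 12 decoder layers.
d_in, d_out = 768, 768
rank = 32  # r in LoraConfig

# Each adapted Linear layer gets two low-rank matrices:
# A with shape (rank, d_in) and B with shape (d_out, rank).
params_per_module = rank * d_in + d_out * rank

# Attention blocks: 12 encoder self-attention + 12 decoder self-attention
# + 12 decoder cross-attention. "q" and "v" are adapted in each block.
attention_blocks = 12 + 12 + 12
adapted_modules = attention_blocks * 2

lora_params = params_per_module * adapted_modules
print(lora_params)  # 3538944
```

Under these assumptions the result matches the 3538944 trainable parameters printed for `peft_model`.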

In this lab, you are preparing to fine-tune the LLM using Reinforcement Learning (RL). RL will be briefly discussed in the next section of this lab, but at this stage, you just need to prepare the Proximal Policy Optimization (PPO) model by passing the instruct-fine-tuned PEFT model to it. PPO will be used to optimize the RL policy against the reward model.

In [28]:
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model, torch_dtype=torch.bfloat16, is_trainable=True)
ppo_model = ppo_model.to(device)

print(f'PPO model parameters to be updated (ValueHead + 769 params):\n{print_number_of_trainable_model_parameters(ppo_model)}\n')
print(ppo_model.v_head)

# 769 = dimension of the ValueHead (768) + bias

PPO model parameters to be updated (ValueHead + 769 params):
trainable model parameters: 3539713 
all model parameter: 251117569 
percentage of trainable model parameters: 1.41%

ValueHead(
  (dropout): Dropout(p=0.1, inplace=False)
  (summary): Linear(in_features=768, out_features=1, bias=True)
  (flatten): Flatten(start_dim=1, end_dim=-1)
)



During PPO, only a small fraction of the parameters will be updated: the LoRA adapter weights that were already trainable in the PEFT model, plus the parameters of the `ValueHead`. More information about this class of models can be found in the `trl` documentation. The number of trainable `ValueHead` parameters can be computed as $(n + 1) \times m$, where $n$ is the number of input units (here $n$ = 768) and $m$ is the number of output units (here $m$ = 1). The +1 term in the equation accounts for the bias term.
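As a quick check of the formula, the following lines plug in the `ValueHead` dimensions and compare the result with the two parameter counts printed earlier. The variable names are illustrative only.

```python
n_inputs, n_outputs = 768, 1   # Linear(in_features=768, out_features=1)

# One weight per input plus one bias, for each output unit.
value_head_params = (n_inputs + 1) * n_outputs
print(value_head_params)       # 769

trainable_peft = 3538944       # trainable parameters reported for peft_model
trainable_ppo = 3539713        # trainable parameters reported for ppo_model

# The difference between the two printed counts is exactly the ValueHead.
print(trainable_ppo - trainable_peft == value_head_params)  # True
```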

Now create a frozen copy of the PPO model that will not be fine-tuned: a reference model. The reference model will represent the LLM before detoxification. None of the parameters of the reference model will be updated during PPO training. This is on purpose.

In [9]:
ref_model = create_reference_model(ppo_model)
ref_model = ref_model.to(device)
print(f'Reference model parameters to be updated:\n{print_number_of_trainable_model_parameters(ref_model)}\n')

Reference model parameters to be updated:
trainable model parameters: 0 
all model parameter: 251117569 
percentage of trainable model parameters: 0.0%



Everything is set. It is time to prepare the reward model.

## 2.2 - Prepare Reward Model

Reinforcement learning (RL) is a branch of machine learning in which agents take actions in an environment with the aim of maximizing their cumulative rewards. The agent's behavior is defined by the **policy**, and the goal of reinforcement learning is for the agent to learn an optimal, or nearly optimal, policy that maximizes the **reward function**.

In the previous section the original policy is based on the instruct PEFT model: this is the LLM before detoxification. Now you need to define the reward model that encourages the agent to detoxify the dialogue summaries. The intuitive approach is to do some form of sentiment analysis across two classes (`nothate` and `hate`) and give a higher reward if there is a higher chance of getting class `nothate` as an output.

You will use Facebook's RoBERTa-based hate speech model as the reward model. This model outputs logits, from which the probabilities of the classes `nothate` and `hate` are computed. The FLAN-T5 model will then be fine-tuned with PPO to maximize the score of the class `nothate`.

Create an instance of the required model class for the RoBERTa model. You also need to load a tokenizer to test the model. Notice that the model label `0` corresponds to the class `nothate` and label `1` corresponds to the class `hate`.

In [42]:
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
# toxicity_model_name = "martin-ha/toxic-comment-model"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name)
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name)
toxicity_model = toxicity_model.to(device)

print(toxicity_model.config.id2label)

{0: 'nothate', 1: 'hate'}


Take some non-toxic text, tokenize it, and pass it to the model. Print the output logits, probabilities, and the corresponding reward that will be used for the fine-tuning.

In [45]:
non_toxic_text = "I want to kiss you"

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids.to(device)

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'Probabilities [not hate, hate]: {probabilities}')

# Get the logit for "not hate" - this is the reward
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')


logits [not hate, hate]: [4.657958507537842, -4.078616142272949]
Probabilities [not hate, hate]: [0.9998394250869751, 0.0001605773577466607]
reward (high): [4.657958507537842]
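To make the relationship between the logits, the probabilities, and the reward concrete, here is a minimal pure-Python sketch that recomputes the softmax from the two logits printed above. The `softmax` helper is written out for illustration; the notebook itself uses `logits.softmax(dim=-1)`.

```python
import math

# [nothate, hate] logits copied from the output above.
logits = [4.657958507537842, -4.078616142272949]

def softmax(values):
    # Subtract the maximum first for numerical stability.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax(logits)
reward = logits[0]  # the raw "nothate" logit is used as the reward

print(probabilities)
print(reward)
```

The reward is the raw logit rather than the probability, so it is not bounded to the range 0 to 1.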


In [33]:
toxic_text = "You are disgusting and terrible and I damn shit hate you"

toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids.to(device)

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'Probabilities [not hate, hate]: {probabilities}')

# Get the logit for "not hate" - this is the reward
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (low): {nothate_reward}')


logits [not hate, hate]: [4.6903839111328125, -4.226815223693848]
Probabilities [not hate, hate]: [0.9998658895492554, 0.0001340452436124906]
reward (low): [4.6903839111328125]


In [21]:
# hugging face inference pipeline
# device = 0 if torch.cuda.is_available() else "cpu"

sentiment_pipe = pipeline("sentiment-analysis", model=toxicity_model_name, device=0)

reward_logits_kwargs = {
    "top_k": None,  # return all scores
    "function_to_apply": "none",  # set to "none" to retrieve raw logits
    "batch_size": 16
}

reward_probabilities_kwargs = {
    "top_k": None,  # return all scores
    "function_to_apply": "softmax",  # set to "softmax" to apply softmax and retrieve probabilities
    "batch_size": 16
}

print("Reward model output for non-toxic text:")
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))
print("\nReward model output for toxic text:")
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

Reward model output for non-toxic text:
[{'label': 'nothate', 'score': 4.657958507537842}, {'label': 'hate', 'score': -4.078616142272949}]
[{'label': 'nothate', 'score': 0.9998394250869751}, {'label': 'hate', 'score': 0.00016057737229857594}]

Reward model output for toxic text:
[{'label': 'nothate', 'score': 4.6903839111328125}, {'label': 'hate', 'score': -4.226815223693848}]
[{'label': 'nothate', 'score': 0.9998658895492554}, {'label': 'hate', 'score': 0.0001340452436124906}]
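The training loop in section 3.2 reads the reward by position (`reward[not_hate_index]`). That is correct when the entries come back in label-id order, as in the outputs above. If a pipeline version were to order the entries differently (for example, sorted by score), a positional index could pick the wrong class. A hedged alternative is to select the score by label name; `score_for_label` below is a hypothetical helper, not part of `transformers` or `trl`.

```python
def score_for_label(pipeline_output, label="nothate"):
    # pipeline_output is a list of {"label": ..., "score": ...} dicts for one input.
    for item in pipeline_output:
        if item["label"] == label:
            return item["score"]
    raise KeyError(f"label {label!r} not found in pipeline output")

# Output copied from the non-toxic example above.
as_printed = [
    {"label": "nothate", "score": 4.657958507537842},
    {"label": "hate", "score": -4.078616142272949},
]

# Hypothetical reordering, to show the lookup does not depend on position.
reordered = list(reversed(as_printed))

print(score_for_label(as_printed))
print(score_for_label(reordered))
```

Both calls return the same `nothate` logit, regardless of the order of the entries.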


## 2.3 - Evaluate Toxicity

To evaluate the model before and after fine-tuning/detoxification, you need to set up a toxicity evaluation metric. The `toxicity score` is a decimal value between 0 and 1, where 1 is the highest toxicity.

In [22]:
toxicity_evaluator = evaluate.load(
    "toxicity",
    toxicity_model_name,
    module_type="measurement",
    toxic_label='hate')

Try to calculate toxicity for the same sentences as in section 2.2. It's no surprise that the toxicity scores are the probabilities of `hate` class returned directly from the reward model.

In [23]:
toxicity_score = toxicity_evaluator.compute(predictions=[non_toxic_text])

print("Toxicity score for non-toxic text:")
print(toxicity_score["toxicity"])

toxicity_score = toxicity_evaluator.compute(predictions=[toxic_text])

print("\nToxicity score for toxic text:")
print(toxicity_score["toxicity"])

Toxicity score for non-toxic text:
[0.00016057782340794802]

Toxicity score for toxic text:
[0.0001340452436124906]


This evaluator can be used to compute the toxicity of the dialogues prepared in section 2.1. You will need to pass the test dataset (`dataset["test"]`), the same tokenizer that was used in that section, the PEFT model prepared in section 2.1, and the toxicity evaluator. It is convenient to wrap the required steps in the function `evaluate_toxicity`.

In [37]:
def evaluate_toxicity(
    model,
    toxicity_evaluator,
    tokenizer,
    dataset,
    num_samples):

    max_new_tokens = 100
    toxicities = []
    input_texts = []
    for i, sample in tqdm(enumerate(dataset)):
        input_text = sample["query"]

        # Note: with `>`, samples 0..num_samples are processed (num_samples + 1 in total).
        if i > num_samples:
            break

        input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

        generation_config = GenerationConfig(
            max_new_tokens=max_new_tokens,
            top_k=0.0,
            top_p=1.0,
            do_sample=True)
        response_token_ids = model.generate(
            input_ids=input_ids,
            generation_config=generation_config)

        generated_text = tokenizer.decode(response_token_ids[0], skip_special_tokens=True)

        toxicity_score = toxicity_evaluator.compute(predictions=[(input_text + " " + generated_text)])

        toxicities.extend(toxicity_score["toxicity"])

    # Compute mean & std using np.
    mean = np.mean(toxicities)
    std = np.std(toxicities)

    return mean, std

And now perform the calculation of the model toxicity before fine-tuning/detoxification:

In [38]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

mean_before_detoxification, std_before_detoxification = evaluate_toxicity(
    model=peft_model,
    toxicity_evaluator=toxicity_evaluator,
    tokenizer=tokenizer,
    dataset=dataset["test"],
    num_samples=10)

print(f"toxicity [mean, std] before detox: [{mean_before_detoxification}, {std_before_detoxification}]")


11it [00:10,  1.04it/s]

toxicity [mean, std] before detox: [0.018337433427487584, 0.025512211654626752]
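`evaluate_toxicity` aggregates the per-sample scores with `np.mean` and `np.std`. By default `np.std` computes the population standard deviation (`ddof=0`). The sketch below reproduces that aggregation with the standard library; the score values are hypothetical and are not results from this notebook.

```python
import math
import statistics

# Hypothetical per-sample toxicity scores (not taken from the run above).
toxicities = [0.01, 0.03, 0.002, 0.05]

mean = statistics.fmean(toxicities)
# pstdev is the population standard deviation, matching np.std's default ddof=0.
std = statistics.pstdev(toxicities)

print(mean)
print(std)
```

With only about ten samples, the standard deviation can be as large as the mean, as in the output above, so small differences between runs should be read with caution.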





# 3 - Perform Fine-Tuning to Detoxify the Summaries

Optimize an RL policy against the reward model using Proximal Policy Optimization (PPO).

## 3.1 - Initialize PPOTrainer

Set up the configuration parameters. Load the `ppo_model` and the tokenizer. You will also load a frozen version of the model, `ref_model`. The first model is optimized while the second serves as a reference to calculate the KL divergence from the starting point. This works as an additional reward signal in PPO training to make sure the optimized model does not deviate too much from the original LLM.
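The idea of the KL term as an additional reward signal can be sketched in a few lines. This is a simplified illustration, not the `trl` implementation: the function name `shaped_rewards`, the coefficient value, and the log-probabilities are all hypothetical.

```python
def shaped_rewards(logprobs, ref_logprobs, score, kl_coef=0.2):
    # Per-token KL estimate: log-prob under the policy minus log-prob under the reference.
    kls = [lp - ref_lp for lp, ref_lp in zip(logprobs, ref_logprobs)]
    # Each token is penalized in proportion to how far the policy moved from the reference.
    rewards = [-kl_coef * kl for kl in kls]
    # The reward model's score is added once, at the last generated token.
    rewards[-1] += score
    return rewards

# Hypothetical log-probabilities for a three-token response.
logprobs = [-1.0, -0.5, -2.0]
ref_logprobs = [-1.2, -0.5, -1.0]

rewards = shaped_rewards(logprobs, ref_logprobs, score=4.0)
print(rewards)
```

In this sketch, a token that became more likely under the policy than under the reference receives a negative contribution, which is what discourages large deviations from the original model.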

In [47]:
print(type(dataset))

<class 'datasets.dataset_dict.DatasetDict'>


In [57]:
config = PPOConfig(
    model_name=model_name,
    learning_rate=1.41e-5,
    ppo_epochs=3,
    mini_batch_size=4,
    batch_size=16
)

def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

# You can uncomment the following lines to test the collator
# test_data = [{"key1": "value1", "key2": "value2", "key3": "value3"}]
# print(f"Collator input: {test_data}")
# print(f"Collator output: {collator(test_data)}")

ppo_trainer = PPOTrainer(
    config=config,
    model=ppo_model,
    ref_model=ref_model,  # output of lab2
    tokenizer=tokenizer,
    dataset=dataset['train'],
    data_collator=collator
)
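To see what the `collator` passed to `PPOTrainer` does, the snippet below runs it on two hypothetical samples. It turns a list of per-sample dictionaries into one dictionary of lists, without padding or stacking, which is why each prompt can keep its own length.

```python
def collator(data):
    # Same function as in the cell above: list of dicts -> dict of lists.
    return dict((key, [d[key] for d in data]) for key in data[0])

# Hypothetical samples with prompts of different lengths.
test_data = [
    {"input_ids": [1, 2, 3], "query": "first prompt"},
    {"input_ids": [4, 5], "query": "second prompt"},
]

batch = collator(test_data)
print(batch)
# {'input_ids': [[1, 2, 3], [4, 5]], 'query': ['first prompt', 'second prompt']}
```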

## 3.2 - Fine-Tune the Model

The fine-tuning loop consists of the following main steps:

1. Get the query responses from the policy LLM (PEFT model)
2. Get sentiments for query/responses from hate speech RoBERTa model
3. Optimize policy with PPO using the (query, response, reward) triplet.

The operation is running if you see the following metrics appearing:
- objective/kl: minimize KL divergence
- ppo/returns/mean: maximize mean returns
- ppo/policy/advantages_mean: maximize advantages

The next cell may take 20-30 mins to run.

In [50]:
output_min_length = 100
output_max_length = 400
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

reward_kwargs = {
    "top_k": None,  # return all scores
    "function_to_apply": "none",
    "batch_size": 16
}

max_ppo_steps = 10

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    # Break when you reach max_ppo_steps
    if step >= max_ppo_steps:
        break

    prompt_tensors = batch["input_ids"]

    # Generate response from FLAN-T5/PEFT LLM
    summary_tensors = []

    for prompt_tensor in prompt_tensors:
        max_new_tokens = output_length_sampler()

        generation_kwargs["max_new_tokens"] = max_new_tokens
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)

        summary_tensors.append(summary.squeeze()[-max_new_tokens:])

    # This needs to be called "response"
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute reward outputs
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = sentiment_pipe(query_response_pairs, **reward_kwargs)

    # You use the 'nothate' item because this is the score for the positive 'nothate' class
    reward_tensors = [torch.tensor(reward[not_hate_index]["score"]) for reward in rewards]

    # Run PPO step
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}')
    print('-'.join('' for x in range(100)))


0it [00:00, ?it/s]
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
1it [00:12, 12.79s/it]

objective/kl: 12.792394638061523
ppo/returns/mean: 0.45550185441970825
ppo/policy/advantages_mean: -5.172440609158002e-08
---------------------------------------------------------------------------------------------------


2it [00:30, 15.63s/it]

objective/kl: 18.375507354736328
ppo/returns/mean: -0.046177297830581665
ppo/policy/advantages_mean: 1.3213762528607731e-08
---------------------------------------------------------------------------------------------------


3it [00:46, 15.82s/it]

objective/kl: 17.497703552246094
ppo/returns/mean: -0.3666149079799652
ppo/policy/advantages_mean: 6.033265886884465e-09
---------------------------------------------------------------------------------------------------


4it [01:02, 15.92s/it]

objective/kl: 12.853426933288574
ppo/returns/mean: 0.23934730887413025
ppo/policy/advantages_mean: 8.720232536063577e-09
---------------------------------------------------------------------------------------------------


5it [01:20, 16.84s/it]

objective/kl: 17.972558975219727
ppo/returns/mean: -0.1477319896221161
ppo/policy/advantages_mean: 3.2160323293339843e-09
---------------------------------------------------------------------------------------------------


6it [01:33, 15.27s/it]

objective/kl: 10.847443580627441
ppo/returns/mean: 0.32510536909103394
ppo/policy/advantages_mean: -1.1041581871040762e-08
---------------------------------------------------------------------------------------------------


7it [01:45, 14.26s/it]

objective/kl: 11.790050506591797
ppo/returns/mean: 0.35978707671165466
ppo/policy/advantages_mean: -2.3068238874657254e-08
---------------------------------------------------------------------------------------------------


8it [01:55, 12.84s/it]

objective/kl: 7.463865756988525
ppo/returns/mean: 1.043294906616211
ppo/policy/advantages_mean: 3.561528671980341e-08
---------------------------------------------------------------------------------------------------


9it [02:05, 11.93s/it]

objective/kl: 9.347660064697266
ppo/returns/mean: 0.8618202209472656
ppo/policy/advantages_mean: 2.6210710046825625e-08
---------------------------------------------------------------------------------------------------


10it [02:16, 13.63s/it]

objective/kl: 10.959070205688477
ppo/returns/mean: 0.5493296980857849
ppo/policy/advantages_mean: 9.67917479499647e-09
---------------------------------------------------------------------------------------------------
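In the loop above, each response length is drawn from `output_length_sampler`, and the response is then truncated with `[-max_new_tokens:]`. The stand-in below illustrates both steps in plain Python. `SimpleLengthSampler` is a hypothetical class that assumes the sampler draws integers uniformly from the half-open range from the minimum (inclusive) to the maximum (exclusive); check the `trl` source for the exact behavior.

```python
import random

class SimpleLengthSampler:
    """Hypothetical stand-in: draws an integer from [min_value, max_value)."""

    def __init__(self, min_value, max_value):
        self.values = list(range(min_value, max_value))

    def __call__(self):
        return random.choice(self.values)

sampler = SimpleLengthSampler(100, 400)
lengths = [sampler() for _ in range(5)]
print(lengths)

# Negative slicing keeps at most the last `max_new_tokens` items.
tokens = list(range(10))
print(tokens[-4:])    # [6, 7, 8, 9]
print(tokens[-50:])   # sequence shorter than the limit: returned unchanged
```

Because generation can stop early at the end-of-sequence token, the slice often returns the whole generated sequence unchanged.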





## 3.3 - Evaluate the Model Qualitatively

Let's inspect some examples from the test dataset. You can compare the original `ref_model` to the fine-tuned/detoxified `ppo_model` using the reward model.

In [52]:
batch_size = 20
compare_results = {}

df_batch = dataset["test"][:batch_size]

compare_results["query"] = df_batch["query"]
prompt_tensors = df_batch["input_ids"]

summary_tensors_ref = []
summary_tensors = []

# Get response from ppo and base model.
for i in tqdm(range(batch_size)):
    gen_len = output_length_sampler()
    generation_kwargs["max_new_tokens"] = gen_len

    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device),
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors_ref.append(summary)

    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device),
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors.append(summary)

# Decode responses
compare_results["response_before"] = [tokenizer.decode(summary_tensors_ref[i]) for i in range(batch_size)]
compare_results["response_after"] = [tokenizer.decode(summary_tensors[i]) for i in range(batch_size)]

# Sentiment analysis of query/response pairs before/after
texts_before = [d + s for d, s in zip(compare_results["query"], compare_results["response_before"])]
rewards_before = sentiment_pipe(texts_before, **reward_kwargs)
compare_results["reward_before"] = [reward[not_hate_index]["score"] for reward in rewards_before]

texts_after = [d + s for d, s in zip(compare_results["query"], compare_results["response_after"])]
rewards_after = sentiment_pipe(texts_after, **reward_kwargs)
compare_results["reward_after"] = [reward[not_hate_index]["score"] for reward in rewards_after]


100%|██████████| 20/20 [00:35<00:00,  1.79s/it]


Store and review the results in a DataFrame:

In [70]:
pd.set_option('display.max_colwidth', 500)
df_compare_results = pd.DataFrame(compare_results)
df_compare_results["reward_diff"] = df_compare_results["reward_after"] - df_compare_results["reward_before"]
df_compare_results_sorted = df_compare_results.sort_values(by=["reward_diff"], ascending=False).reset_index(drop=True)
df_compare_results_sorted.head(5)

| | query | response_before | response_after | reward_before | reward_after | reward_diff |
|---|---|---|---|---|---|---|
| 0 | Summarize the following dialogue: #Person1#: It smells like an ashtray in here! #Person2#: Hi honey! What's wrong? Why do you have that look on your face? #Person1#: What's wrong? I thought we agreed that you were gonna quit smoking. #Person2#: No! I said I was going to cut down which is very different. You can't just expect me to go cold turkey overnight! #Person1#: Look, there are other ways to quit. You can try the nicotine patch, or nicotine chewing gum. We spend a fortune on cigarettes ... | <pad> Sessions on the topic include nominal hypothetical "green revolts."</s> | <pad> Quitting smoking is expensive and for the most part requires more willpower.</s> | 0.94112 | 2.308189 | 1.367069 |
| 1 | Summarize the following dialogue: #Person1#: How much are you asking for this? #Person2#: I'm offering them to you at 150 yuan a piece. Is that all right? #Person1#: Is tax already included in their price? #Person2#: Yes. Our price can't be matched. #Person1#: Would you consider a volume discount? #Person2#: If you buy 1, 000 or more, you'll get a 10 % discount. #Person1#: I'll accept your offer. Summary: </s> | <pad> Satoshi: Once you purchase a set of Ski Sitar Horses for Baan Cla, they announce a mass shipment of upwards earnings.</s> | <pad> Tap Nails will not be matched.</s> | 2.566545 | 3.411623 | 0.845078 |
| 2 | Summarize the following dialogue: #Person1#: Mom, I just finished my paper. Can you proofread it before I hand it in? #Person2#: Sure, let's take a look. Sweetie, this is terrific. Your ideas are so original. #Person1#: Thanks. #Person2#: I can tell you worked hard on it. #Person1#: I really did! I started thinking about what I wanted to say three weeks ago. #Person2#: Well, it was definitely worth all the time. #Person1#: Let's just hope my teacher agrees. Summary: </s> | <pad> Dorrie, stay in your room and George is taking care of Patrick, ensuring that someone cannot interrupt his grade on his papers.</s> | <pad> Outstanding paper. Not recommended for Ms.</s> | 2.016532 | 2.771547 | 0.755015 |
| 3 | Summarize the following dialogue: #Person1#: Here is the final draft of our contract. I'm glad that we have reached an agreement on almost every term in our trade. #Person2#: Yes, it seems to me we have come quite a long way. However, let me take a close look at the final draft. #Person1#: Do you have some points to bring up? #Person2#: Well, everything we've discussed seems to be here. #Person1#: Yes, including a description of the shirts you want to purchase this time, the total amount of ... | <pad> Lowsburg sent their proper standard test to request and attach a final draft of their trade. However,, he needs to consider this as his key to a successful bid onto the state of office after questioning about the university’s trademarks battle.</s> | <pad> Wellness Academy Home Products has been ordered a final draft of its contract to be signed by the Exotic Retail co-operative and distributor PR 1 Park 2012.</s> | 3.166683 | 3.819972 | 0.653289 |
| 4 | Summarize the following dialogue: #Person1#: Oh, my God! What's this? #Person2#: What? #Person1#: Look! This window is open. #Person2#: Did you open it before we left? #Person1#: Are you kidding? It's winter. Why would I open it? #Person2#: I don't know. Wait. Is this yours? #Person1#: No! Oh, my God! Someone has broken into the house. #Person2#: It looks that way. That's probably why the door wasn't locked when we came in. #Person1#: I locked it when I left though. #Person2#: Yes, but the r... | <pad> Tev consciousness on the roof's, broken glass doors. Inside turned into a robbery that held the owner. When the robbermen broke into the house, the owners left his top at Jon's doorstep.</s> | <pad> Rocked the current light</s> | 1.571314 | 2.164563 | 0.593249 |


Save the `ppo_model` to Google Drive, then push it to the Hugging Face Hub.

In [54]:
from huggingface_hub import notebook_login
notebook_login()


In [60]:
ppo_model.push_to_hub("linlinlin/ppo_model-0706")


CommitInfo(commit_url='https://huggingface.co/linlinlin/ppo_model-0706/commit/8f46b2f24ef03e4474c3c259382718e12083a829', commit_message='Upload model', commit_description='', oid='8f46b2f24ef03e4474c3c259382718e12083a829', pr_url=None, pr_revision=None, pr_num=None)

In [68]:
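import locale

# Show the encoding Colab currently reports (printed before the patch below).
print(locale.getpreferredencoding())

# Colab workaround: shell commands can fail when the preferred encoding is not UTF-8,
# so override the lookup to always report UTF-8.
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding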

ANSI_X3.4-1968


In [69]:
%ls

[0m[01;34mdrive[0m/  [01;34msample_data[0m/


## 3.4 - Evaluate the Model Quantitatively

In [71]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 8017
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 2005
    })
})


In [72]:
dialogues = dataset['test'][0:10]['dialogue']

In [74]:
human_baseline_summaries = dataset['test'][0:10]['summary']
print(human_baseline_summaries)

["#Person2# recommends DEL to #Person1# because DEL isn't connected through the phone line.", 'Judy and #Person1# are surprised that Richard was fired.', '#Person1# asks #Person2# to take a break but #Person2# wants to keep working and finish the report by noon.', '#Person1# can play the drums well now and wants to form a rock band, so #Person1# asks #Person2# to come for an audition as the singer.', '#Person1# and #Person2# think people are getting dependent on computers and the web. #Person1# tells #Person2# how people buy goods online.', "Alice calls Li Hong and says she can't go to see Mrs. Brown tomorrow because her mom is ill.", "#Person1# assists #Person2# in buying a toy car for #Person2#'s son.", '#Person1# wants to reconfirm the flight to London and #Person2# asks him to dial 35 for English-speaking staff.', "#Person1# likes the peaked cap while Amanda doesn't like caps", '#Person1# and Allen find someone has broken into their house. They are looking for what the robber has s

In [75]:
peft_model_summaries = []  # summaries from ref_model, the PEFT model before PPO fine-tuning
ppo_model_summaries = []

for dialogue in dialogues:
    prompt = f"""
    Summarize the following conversation.

    {dialogue}

    Summary: """

    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device)

    peft_model_outputs = ref_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
    peft_model_test_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)
    peft_model_summaries.append(peft_model_test_output)

    ppo_model_outputs = ppo_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
    ppo_model_test_output = tokenizer.decode(ppo_model_outputs[0], skip_special_tokens=True)
    ppo_model_summaries.append(ppo_model_test_output)

zipped_summaries = list(zip(human_baseline_summaries, peft_model_summaries, ppo_model_summaries))

df = pd.DataFrame(zipped_summaries, columns=['human_baseline', 'peft_model', 'ppo_model'])
df


Unnamed: 0,human_baseline,peft_model,ppo_model
0,#Person2# recommends DEL to #Person1# because DEL isn't connected through the phone line.,You can get DEL or dial-up.,You can get DEL or dial-up.
1,Judy and #Person1# are surprised that Richard was fired.,@Person1#:,@Person1#:
2,#Person1# asks #Person2# to take a break but #Person2# wants to keep working and finish the report by noon.,"@Person1#: I wish I could, but I can't.",You've been sitting there for hours.
3,"#Person1# can play the drums well now and wants to form a rock band, so #Person1# asks #Person2# to come for an audition as the singer.","You told me that you had some musical talent, right?",Get ready to audition at my house.
4,#Person1# and #Person2# think people are getting dependent on computers and the web. #Person1# tells #Person2# how people buy goods online.,"With the establishment of Internet and a lot of web companies, people are getting more and more dependent on the web.","With the establishment of Internet and a lot of web companies, people are getting more and more dependent on the web."
5,Alice calls Li Hong and says she can't go to see Mrs. Brown tomorrow because her mom is ill.,Hello?,Hello?
6,#Person1# assists #Person2# in buying a toy car for #Person2#'s son.,You're a good help.,The toy car is three hundred dollars.
7,#Person1# wants to reconfirm the flight to London and #Person2# asks him to dial 35 for English-speaking staff.,"Please dial 35, please.",Please dial 35.
8,#Person1# likes the peaked cap while Amanda doesn't like caps,@Person2#: I don't like caps at all.,@Person1#:
9,#Person1# and Allen find someone has broken into their house. They are looking for what the robber has stolen and #Person1# is afraid that the thief is still upstairs.,This window is open.,This window is open.
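
Several rows in the table above show the same text for both models. A quick way to quantify that is to count exact matches between the two lists. The sketch below is only an illustration: `count_identical` is a hypothetical helper, not part of the notebook, and the three sample rows are copied by hand from rows 5 to 7 of the table.

```python
def count_identical(before, after):
    """Return how many positions hold exactly the same text in both lists."""
    return sum(b.strip() == a.strip() for b, a in zip(before, after))


# Rows 5-7 of the table above, copied by hand (illustrative subset only).
peft_rows = ["Hello?", "You're a good help.", "Please dial 35, please."]
ppo_rows = ["Hello?", "The toy car is three hundred dollars.", "Please dial 35."]

identical = count_identical(peft_rows, ppo_rows)
print(f"{identical} of {len(peft_rows)} summaries are identical")
```

In the notebook itself, the same helper could be called as `count_identical(peft_model_summaries, ppo_model_summaries)` to cover all ten test dialogues.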


In [77]:
rouge = evaluate.load('rouge')

# Score the pre-PPO (PEFT) summaries against the human-written references.
peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

# Score the post-PPO summaries against the same references.
ppo_model_results = rouge.compute(
    predictions=ppo_model_summaries,
    references=human_baseline_summaries[0:len(ppo_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)


print("PEFT MODEL RESULTS")
print(peft_model_results)
print("PPO MODEL RESULTS")
print(ppo_model_results)


PEFT MODEL RESULTS
{'rouge1': 0.15022459077244377, 'rouge2': 0.04952380952380954, 'rougeL': 0.14852716539080785, 'rougeLsum': 0.1471247462996016}
PPO MODEL RESULTS
{'rouge1': 0.14333333333333337, 'rouge2': 0.04176470588235294, 'rougeL': 0.14137047163362954, 'rougeLsum': 0.14060606060606062}
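
To make the `rouge1` numbers above easier to interpret, here is a minimal sketch of unigram-overlap F1. It is a simplification written for illustration, not the implementation behind `evaluate.load('rouge')`: it uses a plain regex tokenizer and no stemming, so its scores will not match the library exactly. The function name `rouge1_f1` is hypothetical; the example strings are taken from row 7 of the comparison table.

```python
import re
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming."""
    pred_tokens = re.findall(r"[a-z0-9]+", prediction.lower())
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection clips each token's matches to its count in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = ("#Person1# wants to reconfirm the flight to London and #Person2# "
             "asks him to dial 35 for English-speaking staff.")
score = rouge1_f1("Please dial 35.", reference)
print(f"simplified ROUGE-1 F1: {score:.4f}")
```

A short prediction such as "Please dial 35." can have high precision but low recall against a long reference, which is one reason the aggregate scores here stay low for both models.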


In [78]:
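# Note: this is PEFT minus PPO, so a positive value means the PPO model scored LOWER on that metric.
# The values are absolute differences in ROUGE score (x100), not relative percentages.
rouge_drop = (np.array(list(peft_model_results.values())) - np.array(list(ppo_model_results.values())))

for key, value in zip(peft_model_results.keys(), rouge_drop):
    print(f'{key}: {value*100:.2f}%')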

rouge1: 0.69%
rouge2: 0.78%
rougeL: 0.72%
rougeLsum: 0.65%
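
The differences printed above are computed as PEFT minus PPO, so the positive values correspond to a small drop in ROUGE after PPO fine-tuning on these ten test dialogues. The sketch below recomputes the change with the opposite sign convention (PPO minus PEFT) and also reports it relative to the PEFT score. The two dictionaries are copied from the output above; the variable names `absolute_change` and `relative_change` are introduced here for illustration.

```python
# ROUGE results copied from the cell output above.
peft_model_results = {'rouge1': 0.15022459077244377, 'rouge2': 0.04952380952380954,
                      'rougeL': 0.14852716539080785, 'rougeLsum': 0.1471247462996016}
ppo_model_results = {'rouge1': 0.14333333333333337, 'rouge2': 0.04176470588235294,
                     'rougeL': 0.14137047163362954, 'rougeLsum': 0.14060606060606062}

absolute_change = {}
relative_change = {}
for key, peft_score in peft_model_results.items():
    # PPO minus PEFT: a negative value means ROUGE went down after PPO.
    absolute_change[key] = ppo_model_results[key] - peft_score
    relative_change[key] = absolute_change[key] / peft_score
    print(f"{key}: {absolute_change[key] * 100:+.2f} points "
          f"({relative_change[key]:+.1%} relative to PEFT)")
```

With only ten examples, differences of this size are within what a different sample could produce, so they are best read as a rough sanity check rather than a firm measurement.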
