### Fine-Tune FLAN-T5 with Reinforcement Learning (PPO) and PEFT to Generate Less-Toxic Summaries

In this notebook, we will detoxify Google's pre-trained LLM [FLAN-T5](https://huggingface.co/google/flan-t5-base) from Hugging Face, fine-tuning it on the [CNN/DailyMail dataset](https://huggingface.co/datasets/cnn_dailymail), which contains roughly 310k news articles from CNN and the Daily Mail. Each article comes with a corresponding human-written summary (the `highlights` field).

To generate less toxic content, we'll use Meta AI's hate speech reward model. The reward model is a binary classifier that predicts either `nothate` or `hate` for the given text. The algorithm used to optimize the LLM against this reward is called **Proximal Policy Optimization ([PPO](https://en.wikipedia.org/wiki/Proximal_Policy_Optimization))**.

You may need to install the following libraries:

```
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    peft==0.3.0 --quiet

# Installing the Reinforcement Learning library directly from github.
%pip install git+https://github.com/lvwerra/trl.git@25fa1bd    
```

In [1]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig
from datasets import load_dataset
from peft import PeftModel, PeftConfig, LoraConfig, TaskType               # PEFT/LORA fine-tuning

from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead  # trl: Transformer Reinforcement Learning
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import evaluate

import numpy as np
import pandas as pd

from tqdm import tqdm                                                       # to show progress
tqdm.pandas()






#### Load FLAN-T5 Model, Prepare Reward Model and Toxicity Evaluator

In [4]:
dataset_original = load_dataset("cnn_dailymail", "3.0.0", verification_mode="no_checks") # load dataset
dataset_original

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

Now that we have the dataset, we can do some preprocessing and keep only the shorter articles (selected by their length in characters).
We can then build the prompts, saving the token ids in the field `input_ids` and the decoded version of the prompts in the field `query`.

In [15]:
# create a tiny dataset for this notebook
def build_dataset(model_name, dataset, input_min_text_length, input_max_text_length):

    """
    Preprocess the dataset and split it into train and test parts.

    Parameters:
    - model_name (str): Tokenizer model name.
    - dataset (datasets.dataset_dict.DatasetDict): Dataset already loaded.
    - input_min_text_length (int): Minimum length of the articles, in characters (exclusive).
    - input_max_text_length (int): Maximum length of the articles, in characters (inclusive).
        
    Returns:
    - dataset_splits (datasets.dataset_dict.DatasetDict): Preprocessed dataset containing train and test parts.
    """
    
    # load dataset
    dataset = dataset["train"]
    
    # Filter the articles of length between input_min_text_length and input_max_text_length characters.
    dataset = dataset.filter(lambda x: len(x["article"]) > input_min_text_length and len(x["article"]) <= input_max_text_length, batched=False)

    # Prepare tokenizer. Setting device_map="auto" allows to switch between GPU and CPU automatically.
    tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")
    
    def tokenize(sample):
        
        # Wrap each article with the instruction.
        prompt = f"""
        Summarize the following article.

        {sample["article"]}

        Summary:
        """
        sample["input_ids"] = tokenizer.encode(prompt)
        
        # This must be called "query", which is a requirement of our PPO library.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    # Tokenize each article.
    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")
    
    # Split the dataset into train and test parts.
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits

# Define the model name once; later cells reuse `model_name`.
model_name = "google/flan-t5-base"

dataset = build_dataset(model_name=model_name, 
                        dataset=dataset_original,
                        input_min_text_length=200, 
                        input_max_text_length=1000)

print(dataset)

Filter:   0%|          | 0/287113 [00:00<?, ? examples/s]

Map:   0%|          | 0/3341 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'query'],
        num_rows: 2672
    })
    test: Dataset({
        features: ['article', 'highlights', 'id', 'input_ids', 'query'],
        num_rows: 669
    })
})
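
The two preprocessing steps above, the character-length filter and the unshuffled 80/20 split, can be illustrated on plain Python lists. This is only a minimal sketch and not the `datasets` implementation: `keep_article` and `split_sizes` are hypothetical helpers, and rounding the test part up is an assumption chosen because it reproduces the 2672/669 split printed above.

```python
import math

def keep_article(article: str, min_chars: int, max_chars: int) -> bool:
    """Same condition as the filter lambda: min_chars < len(article) <= max_chars."""
    return min_chars < len(article) <= max_chars

def split_sizes(num_rows: int, test_size: float) -> tuple:
    """Return (train_rows, test_rows); the test part is rounded up (assumption)."""
    test_rows = math.ceil(num_rows * test_size)
    return num_rows - test_rows, test_rows

# Four dummy articles of 150, 201, 1000 and 1001 characters.
articles = ["x" * 150, "x" * 201, "x" * 1000, "x" * 1001]
kept = [a for a in articles if keep_article(a, 200, 1000)]
print([len(a) for a in kept])   # [201, 1000]
print(split_sizes(3341, 0.2))   # (2672, 669)
```

Note that the lower bound is exclusive and the upper bound inclusive, exactly as in the `filter` call.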


In [16]:
# util function to print model parameters
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

Here we will load a pre-trained LoRA adapter (saved locally) for the original FLAN-T5 model. Since we're using this checkpoint as an intermediate step and not only for inference, we need to set `is_trainable=True` in the model definition.

In [17]:
lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, 
                                              torch_dtype=torch.bfloat16)

peft_model = PeftModel.from_pretrained(model, 
                                       './peft-dialogue-summary-checkpoint-local/', 
                                       lora_config=lora_config,
                                       torch_dtype=torch.bfloat16, 
                                       device_map="auto",                                       
                                       is_trainable=True)

print(f'PEFT model parameters to be updated:\n{print_number_of_trainable_model_parameters(peft_model)}\n')

PEFT model parameters to be updated:

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%



As we're getting ready to fine-tune the LLM using Reinforcement Learning (RL), we need to set up the **Proximal Policy Optimization (PPO)** model by passing the PEFT model to it. PPO will be used to optimize the RL policy against the reward model.

In [18]:
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model, torch_dtype=torch.bfloat16, is_trainable=True)

print(f'PPO model parameters to be updated (ValueHead + 769 params):\n{print_number_of_trainable_model_parameters(ppo_model)}\n')
print(ppo_model.v_head)

PPO model parameters to be updated (ValueHead + 769 params):

trainable model parameters: 3539713
all model parameters: 251117569
percentage of trainable model parameters: 1.41%

ValueHead(
  (dropout): Dropout(p=0.1, inplace=False)
  (summary): Linear(in_features=768, out_features=1, bias=True)
  (flatten): Flatten(start_dim=1, end_dim=-1)
)


During PPO, only a small fraction of the parameters will be updated: the LoRA adapter weights (3,538,944 parameters) and the parameters of the `ValueHead` (769 parameters). More information about this class of models can be found [here](https://huggingface.co/docs/trl/main/en/models#trl.create_reference_model). The number of `ValueHead` parameters can be computed as $(n+1) \times m$, where $n$ is the number of input units (here $n=768$) and $m$ is the number of output units (here $m=1$), which gives $769$. The $+1$ term accounts for the bias.
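
The trainable-parameter counts printed above can be reproduced by hand. The sketch below is plain arithmetic, not a query of the model: the layer counts and hidden size are assumptions about the `flan-t5-base` architecture, and the LoRA count assumes that `target_modules=["q", "v"]` adapts the query and value projections of every attention block.

```python
# Architecture figures are assumptions about flan-t5-base, not read from the model.
d_model = 768          # hidden size (matches in_features=768 of the value head)
encoder_layers = 12
decoder_layers = 12
rank = 32              # r in LoraConfig

# LoRA adds two matrices per adapted projection: A (rank x d_model) and B (d_model x rank).
params_per_projection = rank * d_model + d_model * rank

# One attention block per encoder layer, two per decoder layer
# (self-attention and cross-attention); "q" and "v" are adapted in each.
attention_blocks = encoder_layers + 2 * decoder_layers
lora_params = attention_blocks * 2 * params_per_projection

# Value head: a single Linear(768 -> 1) layer with bias, i.e. (n + 1) * m parameters.
value_head_params = (d_model + 1) * 1

print(lora_params)                      # 3538944
print(value_head_params)                # 769
print(lora_params + value_head_params)  # 3539713
```

Under these assumptions, the totals match the 3,538,944 trainable parameters of the PEFT model and the 3,539,713 of the PPO model.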

Now create a frozen copy of the PPO model which will not be fine-tuned: a reference model. The reference model will represent the LLM before detoxification. None of the parameters of the reference model will be updated during PPO training. This is on purpose.

In [19]:
ref_model = create_reference_model(ppo_model)

print(f'Reference model parameters to be updated:\n{print_number_of_trainable_model_parameters(ref_model)}\n')

Reference model parameters to be updated:

trainable model parameters: 0
all model parameters: 251117569
percentage of trainable model parameters: 0.00%



Everything is set. It is time to prepare the reward model!

#### Prepare Reward Model

**Reinforcement Learning (RL)** is one type of machine learning where agents take actions in an environment aimed at maximizing their cumulative rewards. The agent's behavior is defined by a **policy**, while the goal of RL is for the agent to learn an optimal, or nearly-optimal, policy that maximizes a **reward function**. 

In our case, the original policy is based on the PEFT model (i.e., the LLM before detoxification). Since asking human labelers to give feedback on the outputs' toxicity is very expensive, it's easier to use a reward model that encourages the agent to detoxify the article highlights. Here, this means classifying the text into two classes (`nothate` and `hate`) and giving a higher reward when the class `nothate` is more likely.

The reward model that we'll use here is the [Meta AI's RoBERTa-based hate speech model](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target). This model will output **logits** and then predict probabilities across two classes: `nothate` and `hate`. The logits of the output `nothate` will be taken as a positive reward. Then, the model will be fine-tuned with PPO using those reward values.

In [20]:
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name, device_map="auto") # used to test the model
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map="auto")
print(toxicity_model.config.id2label)

{0: 'nothate', 1: 'hate'}


##### Examples

Let's see one example with some *non-toxic* text:

In [24]:
# Example 1 - Non-toxic text
non_toxic_text = "#Person 1# tells Tommy that he didn't like the movie."

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]:\t{logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]:\t{probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high):\t\t\t{nothate_reward}')

logits [not hate, hate]:	[3.114100933074951, -2.4896178245544434]
probabilities [not hate, hate]:	[0.9963293671607971, 0.003670614678412676]
reward (high):			[3.114100933074951]


Let's now see what happens when using some *toxic* text:

In [25]:
# Example 2 - Toxic text
toxic_text = "#Person 1# tells Tommy that the movie was terrible, dumb and stupid."

toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(toxicity_input_ids).logits
print(f'logits [not hate, hate]:\t{logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]:\t{probabilities}')

# Get the logits for "not hate" - this is the reward!
nothate_reward = (logits[:, not_hate_index]).tolist() 
print(f'reward (low):\t\t\t{nothate_reward}')

logits [not hate, hate]:	[-0.6921191811561584, 0.37227335572242737]
probabilities [not hate, hate]:	[0.2564709186553955, 0.7435290813446045]
reward (low):			[-0.6921191811561584]


As we can see, in the second case the text gets a higher probability of being hateful, so the `nothate` logit, and therefore the reward, is negative (i.e., more like a punishment).
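
The relation between the printed logits, probabilities, and reward can be checked without the model. The following is a minimal sketch in plain Python that applies a softmax to the two logit pairs shown above; the `softmax` helper is written here for illustration and is not part of the notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

NOT_HATE = 0  # index of `nothate` in the model's id2label mapping

examples = {
    "non-toxic": [3.114100933074951, -2.4896178245544434],
    "toxic": [-0.6921191811561584, 0.37227335572242737],
}
for name, logits in examples.items():
    probs = softmax(logits)
    reward = logits[NOT_HATE]  # the reward is the raw logit, not the probability
    print(f"{name}: P(nothate)={probs[NOT_HATE]:.4f}, reward={reward:.4f}")
```

The probabilities are bounded between 0 and 1, while the logit used as the reward is unbounded and can be negative.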

##### HuggingFace inference pipeline

In [26]:
device = 0 if torch.cuda.is_available() else "cpu"

sentiment_pipe = pipeline("sentiment-analysis", model=toxicity_model_name, device=device)
reward_logits_kwargs = {
    "top_k": None,                  # Return all scores.
    "function_to_apply": "none",    # Set to "none" to retrieve raw logits.
    "batch_size": 16
}

reward_probabilities_kwargs = {
    "top_k": None,                  # Return all scores.
    "function_to_apply": "softmax", # Set to "softmax" to apply softmax and retrieve probabilities.
    "batch_size": 16
}

print("Reward model output:")
print("For non-toxic text")
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))
print("For toxic text")
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

Reward model output:
For non-toxic text
[{'label': 'nothate', 'score': 3.114100933074951}, {'label': 'hate', 'score': -2.4896178245544434}]
[{'label': 'nothate', 'score': 0.9963293671607971}, {'label': 'hate', 'score': 0.0036706149112433195}]
For toxic text
[{'label': 'hate', 'score': 0.37227335572242737}, {'label': 'nothate', 'score': -0.6921191811561584}]
[{'label': 'hate', 'score': 0.7435290813446045}, {'label': 'nothate', 'score': 0.2564709186553955}]


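Each call returns a score for both the `nothate` (positive) and `hate` (negative) classes: raw logits with `function_to_apply="none"` and probabilities with `function_to_apply="softmax"`. PPO will use only the logit of the `nothate` class as the positive reward signal that helps detoxify the LLM outputs. Note that the pipeline lists the labels in descending score order, so `nothate` is the first entry only when it is the more likely class.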
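
Because the pipeline output above lists the higher-scoring label first, looking up the `nothate` score by label is safer than relying on a fixed position. A minimal sketch, using the logit outputs copied from the cell above (`nothate_score` is a hypothetical helper, not part of the notebook):

```python
def nothate_score(result):
    """Return the `nothate` score from one pipeline result (a list of label/score dicts)."""
    return next(item["score"] for item in result if item["label"] == "nothate")

# Logit outputs copied from the pipeline cell above.
non_toxic_result = [{"label": "nothate", "score": 3.114100933074951},
                    {"label": "hate", "score": -2.4896178245544434}]
toxic_result = [{"label": "hate", "score": 0.37227335572242737},
                {"label": "nothate", "score": -0.6921191811561584}]

print(nothate_score(non_toxic_result))  # 3.114100933074951
print(nothate_score(toxic_result))      # -0.6921191811561584
print(toxic_result[0]["score"])         # 0.37227335572242737, which is the `hate` logit
```

For the toxic example, indexing position 0 would return the `hate` logit instead of the intended reward.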

In [27]:
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))

[{'label': 'nothate', 'score': 3.114100933074951}, {'label': 'hate', 'score': -2.4896178245544434}]
[{'label': 'nothate', 'score': 0.9963293671607971}, {'label': 'hate', 'score': 0.0036706149112433195}]


In [28]:
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

[{'label': 'hate', 'score': 0.37227335572242737}, {'label': 'nothate', 'score': -0.6921191811561584}]
[{'label': 'hate', 'score': 0.7435290813446045}, {'label': 'nothate', 'score': 0.2564709186553955}]


#### Evaluate Toxicity

To evaluate the model before and after fine-tuning/detoxification you need to set up the [toxicity evaluation metric](https://huggingface.co/spaces/evaluate-measurement/toxicity). The **toxicity score** is a decimal value between 0 and 1 where 1 is the highest toxicity.

In [29]:
toxicity_evaluator = evaluate.load("toxicity", toxicity_model_name, module_type="measurement", toxic_label="hate")

We can measure the toxicity of the two examples above. As expected, the metric assigns a higher score to the more toxic text:

In [39]:
toxicity_score = toxicity_evaluator.compute(predictions=[non_toxic_text])
print(f"Toxicity score for non-toxic text:\t{toxicity_score['toxicity'][0]:.2%}")

toxicity_score = toxicity_evaluator.compute(predictions=[toxic_text])
print(f"Toxicity score for toxic text:\t\t{toxicity_score['toxicity'][0]:.2%}")

Toxicity score for non-toxic text:	0.37%
Toxicity score for toxic text:		74.35%


This evaluator can be used to compute the toxicity of the prompts and generated summaries for our train/test datasets. We can do this with the following function, which takes a model, the toxicity evaluator, the tokenizer, a dataset, and a number of samples as input:

In [40]:
def evaluate_toxicity(model, toxicity_evaluator, tokenizer, dataset, num_samples):
    
    """
    Generate a summary for each sample and measure the toxicity of the prompt plus the summary.

    Parameters:
    - model (trl model): Model to be evaluated.
    - toxicity_evaluator (evaluate_modules toxicity metrics): Toxicity evaluator.
    - tokenizer (transformers tokenizer): Tokenizer to be used.
    - dataset (dataset): Input dataset for the evaluation.
    - num_samples (int): Index of the last sample to evaluate. The loop stops once the index exceeds it, so num_samples + 1 samples are scored.
        
    Returns:
    tuple: A tuple containing two numpy.float64 values:
    - mean (numpy.float64): Mean of the samples toxicity.
    - std (numpy.float64): Standard deviation of the samples toxicity.
    """

    max_new_tokens=100

    toxicities = []
    input_texts = []
    for i, sample in tqdm(enumerate(dataset)):
        input_text = sample["query"]
        if i > num_samples:
            break
            
        input_ids = tokenizer(input_text, return_tensors="pt", padding=True).input_ids
        generation_config = GenerationConfig(max_new_tokens=max_new_tokens, top_k=0.0, top_p=1.0, do_sample=True)
        response_token_ids = model.generate(input_ids=input_ids, generation_config=generation_config)
        generated_text = tokenizer.decode(response_token_ids[0], skip_special_tokens=True)
        toxicity_score = toxicity_evaluator.compute(predictions=[(input_text + " " + generated_text)])
        toxicities.extend(toxicity_score["toxicity"])

    mean = np.mean(toxicities)
    std = np.std(toxicities)
        
    return mean, std

And now perform the calculation of the model toxicity before fine-tuning/detoxification:

In [41]:
tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

mean_before_detoxification, std_before_detoxification = evaluate_toxicity(model=ref_model, 
                                                                          toxicity_evaluator=toxicity_evaluator, 
                                                                          tokenizer=tokenizer, 
                                                                          dataset=dataset["test"], 
                                                                          num_samples=10)

print(f'toxicity [mean, std] before detox: [{mean_before_detoxification}, {std_before_detoxification}]')

11it [00:23,  2.14s/it]

toxicity [mean, std] before detox: [0.001872384343931282, 0.0023433647308500564]
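
The progress bar above reports 11 iterations although `num_samples=10`. This follows from the `i > num_samples` stop condition in `evaluate_toxicity`. The sketch below reproduces only the loop bookkeeping, with no model involved; `count_scored` and its `inclusive_check` flag are hypothetical names used for illustration.

```python
def count_scored(dataset_size: int, num_samples: int, inclusive_check: bool = False) -> int:
    """Count how many samples a loop with the given stop condition would score."""
    scored = 0
    for i in range(dataset_size):
        stop = i >= num_samples if inclusive_check else i > num_samples
        if stop:
            break
        scored += 1
    return scored

print(count_scored(669, 10))                        # 11, as with `i > num_samples`
print(count_scored(669, 10, inclusive_check=True))  # 10, as with `i >= num_samples`
```

The mean and standard deviation reported in this notebook are therefore computed over 11 samples.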





#### Perform Fine-Tuning to Detoxify the Summaries
At this point we can optimize our RL policy against the reward model using Proximal Policy Optimization (PPO).

##### Initialize `PPOTrainer`
 
For the `PPOTrainer` initialization, we need a collator. Here it is a function that turns a list of sample dictionaries into a single dictionary of lists, as shown below:

In [42]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

test_data = [{"key1": "value1", "key2": "value2", "key3": "value3"}]
print(f'Collator input: {test_data}')
print(f'Collator output: {collator(test_data)}')

Collator input: [{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}]
Collator output: {'key1': ['value1'], 'key2': ['value2'], 'key3': ['value3']}
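
With more than one sample, the same transformation groups the values of each key across the batch. A small sketch with two made-up samples (the field values are illustrative, not taken from the dataset):

```python
def collator(data):
    # Turn a list of per-sample dicts into one dict of per-key lists.
    return {key: [sample[key] for sample in data] for key in data[0]}

batch = collator([
    {"query": "first prompt", "input_ids": [1, 2, 3]},
    {"query": "second prompt", "input_ids": [4, 5]},
])
print(batch["query"])      # ['first prompt', 'second prompt']
print(batch["input_ids"])  # [[1, 2, 3], [4, 5]]
```

No padding is applied, so sequences of different lengths stay as separate list entries.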


The `PPOTrainer` can now be configured.

With the `ppo_model` and the tokenizer loaded, the trainer can be set up. **Note** that two models are necessary here: the model that is optimized and the original model, called here `ref_model`. While the first model is optimized, the second one serves as a reference to calculate the KL-divergence from the starting point. This works as an additional (penalty) reward signal in the PPO training to make sure the optimized model does not deviate too much from the original LLM.
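
The role of the reference model can be sketched in a few lines. The example below is a simplified illustration of a KL-penalized reward, not the exact computation inside `PPOTrainer`: the coefficient `beta`, the helper name `penalized_rewards`, and all numbers are assumptions made for the example.

```python
def penalized_rewards(policy_logprobs, ref_logprobs, score, beta=0.2):
    """Per-token rewards: a KL penalty on every token plus the score on the last one.

    beta is a hypothetical penalty coefficient chosen for illustration.
    """
    kl_per_token = [p - r for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards = [-beta * kl for kl in kl_per_token]
    rewards[-1] += score  # the reward model scores the whole response once
    return rewards, sum(kl_per_token)

# Log-probabilities of three sampled response tokens under each model (made-up numbers).
policy = [-1.0, -0.5, -2.0]
reference = [-1.5, -0.5, -1.0]
rewards, kl_estimate = penalized_rewards(policy, reference, score=3.0)
print(rewards)
print(kl_estimate)  # -0.5
```

Since the estimate is computed from sampled tokens only, it can come out negative even though the true KL divergence is non-negative.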

In [43]:
learning_rate=1.41e-5
max_ppo_epochs=1
mini_batch_size=4
batch_size=16

config = PPOConfig(
    model_name=model_name,    
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

ppo_trainer = PPOTrainer(config=config, 
                         model=ppo_model, 
                         ref_model=ref_model,      # frozen version of the model
                         tokenizer=tokenizer, 
                         dataset=dataset["train"], 
                         data_collator=collator)

##### Fine-Tune the Model (~30 mins runtime)

The fine-tuning loop consists of the following main steps:
1. Get the query responses from the policy LLM (PEFT model).
2. Get sentiments for query/responses from hate speech RoBERTa model.
3. Optimize policy with PPO using the (query, response, reward) triplet.

The operation is running if you see the following metrics appearing:
* `objective/kl`: minimize KL divergence,
* `ppo/returns/mean`: maximize mean returns,
* `ppo/policy/advantages_mean`: maximize advantages.

In [44]:
output_min_length = 100
output_max_length = 400
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {"min_length": 5, "top_k": 0.0, "top_p": 1.0, "do_sample": True}
reward_kwargs = {"top_k": None, "function_to_apply": "none", "batch_size": 16}

max_ppo_steps = 10

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    # Break when you reach max_ppo_steps.
    if step >= max_ppo_steps:
        break   

    prompt_tensors = batch["input_ids"]

    # Get response from FLAN-T5/PEFT LLM.
    summary_tensors = []

    for prompt_tensor in prompt_tensors:
        max_new_tokens = output_length_sampler()        
            
        generation_kwargs["max_new_tokens"] = max_new_tokens
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)
        
        summary_tensors.append(summary.squeeze()[-max_new_tokens:])
        
    # This needs to be called "response".
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute reward outputs.
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]    
    rewards = sentiment_pipe(query_response_pairs, **reward_kwargs)

    # Select the `nothate` score by label: the pipeline sorts its output by score,
    # so the `nothate` entry is not always at position `not_hate_index`.
    reward_tensors = [torch.tensor(next(r["score"] for r in reward if r["label"] == "nothate")) for reward in rewards]

    # Run PPO step.
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)
    
    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}')
    print('-'.join('' for x in range(100)))

0it [00:00, ?it/s]You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
1it [02:03, 123.54s/it]

objective/kl: 0.019849255681037903
ppo/returns/mean: 2.3097753524780273
ppo/policy/advantages_mean: 1.4881951670986382e-08
---------------------------------------------------------------------------------------------------


2it [04:00, 119.83s/it]

objective/kl: -0.02060961350798607
ppo/returns/mean: 1.78458571434021
ppo/policy/advantages_mean: -2.3215060096504203e-08
---------------------------------------------------------------------------------------------------


3it [05:59, 119.54s/it]

objective/kl: -0.01089716237038374
ppo/returns/mean: 2.0317835807800293
ppo/policy/advantages_mean: 3.7101983707543695e-08
---------------------------------------------------------------------------------------------------


4it [08:27, 130.73s/it]

objective/kl: 0.003847053274512291
ppo/returns/mean: 1.9629297256469727
ppo/policy/advantages_mean: 1.419125794654974e-07
---------------------------------------------------------------------------------------------------


5it [11:00, 138.72s/it]

objective/kl: -0.004880266264081001
ppo/returns/mean: 1.8345730304718018
ppo/policy/advantages_mean: 4.783043294764866e-09
---------------------------------------------------------------------------------------------------


6it [13:31, 142.97s/it]

objective/kl: 0.025684449821710587
ppo/returns/mean: 1.888061761856079
ppo/policy/advantages_mean: -1.9904891956912252e-08
---------------------------------------------------------------------------------------------------


7it [15:56, 143.53s/it]

objective/kl: 0.034124914556741714
ppo/returns/mean: 2.0538034439086914
ppo/policy/advantages_mean: -2.7457463858127085e-08
---------------------------------------------------------------------------------------------------


8it [18:32, 147.57s/it]

objective/kl: 0.023807549849152565
ppo/returns/mean: 1.9823936223983765
ppo/policy/advantages_mean: 3.776089130269611e-08
---------------------------------------------------------------------------------------------------


9it [21:42, 160.73s/it]

objective/kl: -0.0023454194888472557
ppo/returns/mean: 1.6306171417236328
ppo/policy/advantages_mean: 2.8497428417040283e-08
---------------------------------------------------------------------------------------------------


10it [24:30, 147.01s/it]

objective/kl: 0.019251052290201187
ppo/returns/mean: 2.155226945877075
ppo/policy/advantages_mean: 9.549399493380406e-08
---------------------------------------------------------------------------------------------------





##### Evaluate the Model Quantitatively

We can now use the test dataset split to evaluate the toxicity score of the RL-fine-tuned PPO/PEFT model.

In [45]:
mean_after_detoxification, std_after_detoxification = evaluate_toxicity(model=ppo_model, 
                                                                        toxicity_evaluator=toxicity_evaluator, 
                                                                        tokenizer=tokenizer, 
                                                                        dataset=dataset["test"], 
                                                                        num_samples=10)
print(f'toxicity [mean, std] after detox: [{mean_after_detoxification}, {std_after_detoxification}]')

11it [00:27,  2.50s/it]

toxicity [mean, std] after detox: [0.0027875636746598916, 0.004031045784112161]





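Then, we can compare the toxicity scores of the reference model (before detoxification) and the fine-tuned model (after detoxification). The improvement is computed as `(before - after) / before`, so a positive value means toxicity decreased and a negative value means it increased. In the run shown here, with only 10 PPO steps and 11 evaluation samples, the mean toxicity went up slightly (from about 0.19% to 0.28%); the change is smaller than the standard deviation of either measurement.

In [49]:
mean_improvement = (mean_before_detoxification - mean_after_detoxification) / mean_before_detoxification
std_improvement = (std_before_detoxification - std_after_detoxification) / std_before_detoxification

print(f'Percentage improvement of toxicity score after detoxification:')
print(f'[Positive means toxicity decreased, negative means it increased]')
print(f'mean:\t{mean_improvement:.1%}')
print(f'std:\t{std_improvement:.1%}')

Percentage improvement of toxicity score after detoxification:
[Positive means toxicity decreased, negative means it increased]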
mean:	-48.9%
std:	-72.0%
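
The sign of these percentages is easy to verify by hand. A minimal sketch that recomputes them from the means and standard deviations printed earlier (`relative_reduction` is a hypothetical helper name):

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction from `before` to `after`; positive means the value went down."""
    return (before - after) / before

mean_change = relative_reduction(0.001872384343931282, 0.0027875636746598916)
std_change = relative_reduction(0.0023433647308500564, 0.004031045784112161)
print(f"mean: {mean_change:.1%}")  # mean: -48.9%
print(f"std:  {std_change:.1%}")   # std:  -72.0%

# A made-up case where toxicity goes down yields a positive value.
print(f"{relative_reduction(0.004, 0.003):.1%}")  # 25.0%
```

With this formula, the negative values reported above correspond to a higher toxicity score after fine-tuning in this particular run.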


##### Evaluate the Model Qualitatively (~3 mins runtime)

Let's inspect some examples from the test dataset. We can compare the original `ref_model` to the fine-tuned/detoxified `ppo_model` using the toxicity evaluator.

In [50]:
batch_size = 20
compare_results = {}

df_batch = dataset["test"][0:batch_size]

compare_results["query"] = df_batch["query"]
prompt_tensors = df_batch["input_ids"]

summary_tensors_ref = []
summary_tensors = []

# Get response from ppo and base model.
for i in tqdm(range(batch_size)):
    gen_len = output_length_sampler()
    generation_kwargs["max_new_tokens"] = gen_len
    
    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device), 
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors_ref.append(summary)

    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device), 
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors.append(summary)

# Decode responses.
compare_results["response_before"] = [tokenizer.decode(summary_tensors_ref[i]) for i in range(batch_size)]
compare_results["response_after"] = [tokenizer.decode(summary_tensors[i]) for i in range(batch_size)]

# Reward-model scores of query/response pairs before/after.
# The `nothate` score is selected by label because the pipeline sorts its output by score.
texts_before = [d + s for d, s in zip(compare_results["query"], compare_results["response_before"])]
rewards_before = sentiment_pipe(texts_before, **reward_kwargs)
compare_results["reward_before"] = [next(r["score"] for r in reward if r["label"] == "nothate") for reward in rewards_before]

texts_after = [d + s for d, s in zip(compare_results["query"], compare_results["response_after"])]
rewards_after = sentiment_pipe(texts_after, **reward_kwargs)
compare_results["reward_after"] = [next(r["score"] for r in reward if r["label"] == "nothate") for reward in rewards_after]

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [01:51<00:00,  5.59s/it]


In [52]:
pd.set_option('display.max_colwidth', 500)
df_compare_results = pd.DataFrame(compare_results)
df_compare_results["reward_diff"] = df_compare_results['reward_after'] - df_compare_results['reward_before']
df_compare_results_sorted = df_compare_results.sort_values(by=['reward_diff'], ascending=False).reset_index(drop=True)
df_compare_results_sorted

| | query | response_before | response_after | reward_before | reward_after | reward_diff |
|---|---|---|---|---|---|---|
| 0 | Summarize the following article. Maria Sharapova is used to making back page headlines but she doesn't always see someone reading an article about her. The World number four saw a passenger reading about her and tweeted a picture saying how she can see the piece. The passenger thought he was just casually reading about a tennis star oblivious to the fact that she was in fact right behind him. Maria Sharapova posts a picture of a passenger reading an article about her on a flight. The tennis ... | `<pad>` Top-ranked Maria Sharapova got some work on her silliness. She features in pictures and tweets about her best times in nearly 15 years.`</s>` | `<pad>` Maria Sharapova knew a passenger was reading an article about her but she was just behind her. Sharapova is one of the world's best tennis stars and after falling to defeat at Dublin 1992 Martin Wuertmuller saw the story online. The blogger had stopped reading when a passenger asked her about the information. On other flights, she saw an article about her while outs-the-way in Shanghai. Sharapova has made headlines after even struggling at her weight.`</s>` | 2.655499 | 3.426647 | 0.771148 |
| 1 | Summarize the following article. By. Jeff Powell. Follow @@jeffpowell_Mail. As five-division world champion Floyd Mayweather and Marcos Maidana prepare to go head-to-head in Las Vegas, Sportsmail's boxing columnist Jeff Powell gives his verdict from the MGM Grand... Mayweather won their fight in May by unanimous decision but the former WBA welterweight champion Maidana is back at the MGM Grand to reclaim his title. 'I think we'll see Mayweather mobile and very difficult to hit. The issue is ... | `<pad>` Floyd Mayweather, Marcos Maidana receive unanimous decision at MGM Grand. The five-division champion Wire cut the need for an extra bout. Serbia-born fighter remains in-form but inevitably will wait for Mayweather.`</s>` | `<pad>` Floyd Mayweather to contest Marcos Maidana in Las Vegas. Mayweather’s win over Maidana is confirmed by a unanimous decision at MGM Grand. Maidana is undefeated at the top of the takeover defence.`</s>` | 2.66436 | 3.288665 | 0.624305 |
| 2 | Summarize the following article. By. Ap Reporter. A Swiss football club says it will investigate reports that some of its fans urinated in a water bottle that a rival goalkeeper then drank from. Reto Felder, a goalkeeper who plays for Muri in the Swiss fourth division, claims he first though his drink was warmed by the sun. He called the prank 'crude and disgraceful', according to the Blick daily newspaper. Fans in the crowd - reportedly around 500 strong - persuaded a ball boy to pass them ... | `<pad>` Baden president promises to investigate offence of fans-pirate custodian.`</s>` | `<pad>` Baogen goalkeeper says fans urinated in a bottle during a match. German goalkeeper says he tried to pass it to a ball boy. Football club says it will investigate. Fans sang in a garden hose, uncertain affect. It is thought like-for-like wine, marbled pork, ham, pastry. Neither player was able to consume alcohol to be legally under Swiss law.`</s>` | 2.435017 | 2.844114 | 0.409098 |
| 3 | Summarize the following article. Charles James 'flipped out' while doing drugs with his 14-year-old daughter in hotel room. A father 'flipped out' while allegedly doing drugs with his 14-year-old daughter in hotel room. Charles James was arrested on Monday after an employee at a Baytown, Texas, hotel heard screams for help coming from a room. Police were called to Palace Inn Hotel and found James in the room with his teenage daughter. Detectives said James, 36, appeared to be under the influ... | `<pad>` Charles James was arrested on Monday after room on Baytown, Texas hotel heard screams for help. He appeared to be under the influence of drugs but otherwise isn't coherent. Family's father 'flipped out' poses 'Changing head' at hospital following well-taken drug search.`</s>` | `<pad>` Charles James has been arrested 'drowsily' on Monday and charged with child endangerment with child endangerment. James had come down with an seizure in the hotel while the girl asleep. Police cleared him of suspicion of child endangerment and today he was released without charge.`</s>` | 2.830459 | 3.117796 | 0.287337 |
4,"Summarize the following article. James Ward slipped to a straight-sets second-round defeat to number four seed Grigor Dimitrov at the Queen's Club Aegon Championships. British No. 3 Ward saw off Slovenian Blaz Rola on Monday, but was unable to produce what would have been a big upset against Bulgarian Dimitrov. VIDEO Scroll down to watch Tomas Berdych answer quickfire question at Queen's. A step too far: James Ward was beaten in straight sets by fourth seed Grigor Dimitrov. Soaking up the su...",<pad> James Ward falls to job-scary second-wicket defeat by first seed Dimitrov. British No. 3 missed out against Bulgarian Dimitrov in first match at first round.</s>,<pad> James Ward takes down Dimitrov in straight sets at Queen's Club Aegon Championships. British No 3 Ward wobble but manages to keep up with Dimitrov. Dimitrov pulls out 7-5 6-3 of second-round match.</s>,3.539455,3.776832,0.237378
5,"Summarize the following article. Police have located 12-year-old Selwyn Warren who went missing inside and Adelaide shopping centre on Monday morning. South Australian police confirmed that Selwyn was found'safe and well' after they appealed to the public for assistance. Selwyn had been missing since 9am on Monday when he was last seen by a family member near the Firle Shopping Centre, just north east of Adelaide's CBD. Selwyn Warren has been missing since 9am on Monday when he was last seen...",<pad> Sherdog West Shanghai has released a second search operation for teenager Selwyn Warren.</s>,<pad> Calls have been sent to police to findSelwyn Warren from where he went missing following a suspected sexual assault.</s>,4.274823,4.349048,0.074224
6,Summarize the following article. By. Mail Online Reporter. A Queens cemetery worker is resting in a hospital after falling into a freshly dug grave. The worker at St. John's Cemetery in Middle Village was setting a monument along the grave at about 10: 15 a.m. Friday when he tumbled in backwards. His name has not been released by authorities. The worker at St. John's Cemetery is in stable condition after falling backwards into an open grave. Emergency responders were able to pull him out of ...,<pad> The worker at St. John's Cemetery in Middle Village fell into well-groomed grave. Emergency responders pulled him from the eight-foot-deep grave over a 40-minute period.</s>,<pad> The worker was setting a monument along the grave at 10: 15 a.m. Friday. He was out of hospital when he became infected. Emergency services are still treating the 54-year-old.</s>,4.238442,4.298999,0.060556
7,"Summarize the following article. By. John Drayton. Stoke are one of a number of English sides interested in Livorno's Brazilian striker Paulinho, according to the Italian club's president. Aldo Spinelli said the 28-year-old, who scored 15 goals in 35 Serie A matches, is also attracting attention from Abu Dhabi. 'Al Jazira have already spoken with Paulinho, but Stoke City also want him and there are also a couple of other English clubs,' Spinelli told Gazzetta dello Sport. Wanted: Livorno str...","<pad> Portuguese striker gets attention from South African soccer's FC Barcelona. 'Manchester United' want Paulinho, but Bury also interested. Elf family also lure more interest in Filipo Riddick.</s>",<pad> Chelsea top of the Premier League table with an 11th place finish. Ice-Cat news agency Express Features. Mold Humberside represents Torremolo from Olinense.</s>,4.053303,4.099115,0.045812
8,"Summarize the following article. A woman and two men have died following a crash involving three vehicles, police said. Officers were called to a road in Mansfield Woodhouse, Nottinghamshire, at about 9.50pm on Friday. The crash on Debdale Lane involving a black BMW 3 Series Convertible, a Honda. motorbike and a Ford Ranger pick-up van. Scene: The crash, which killed three, happened on Debdale Lane yesterday evening. A 38-year-old woman and a 24-year-old man traveling in the black BMW were k...",<pad> Four people have been killed in a motorbike accident in Nottinghamshire. Pictures of the crash were taken from CCTV.</s>,"<pad> Three people dead in crash at road in Mansfield Woodhouse, Nottinghamshire, with three other vehicles. At least three people are in serious but stable condition according to police.</s>",4.546868,4.580168,0.0333
9,"Summarize the following article. Heather Watson made a first-round exit at the Dongfeng Motor Wuhan Open after losing in straight sets to Sara Errani. The British No 1 only avoided going through qualifying due to Li Na's retirement from tennis and subsequent withdrawal from her home-town tournament but could not make the most of her opportunity. Watson struggled on serve, only managing to get 46 per cent of her first serves in, and was broken six times as she went down 7-5 6-4 to the 11th se...",<pad> Heather Watson lost to Sara Errani as she heads to Big Bash Open at Dongfeng Motor Wuhan. The British No 1 suffered a struggle on serve and drawing six times with just 46 per cent of her first serve in. The 13-time World No 1 won in straight sets and continued to play through.</s>,<pad> Heather Watson cleared the courts as she lost in straight sets to Swansea's Sara Errani during the Wuhan Open. Watson struggles to get in a sprained serve.</s>,3.999867,4.030707,0.03084


Comparing the mean and median reward of the generated sequences before and after detoxification shows some difference. Note that news articles should not be biased or contain hate speech, and we expect them to be written in a neutral tone, so there is little toxicity to remove in the first place. It is therefore very likely that more computational power is needed to fine-tune this model further, and that a more powerful model is required to spot the (luckily) limited cases where the text is less neutral.
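As a rough illustration of the mean/median comparison, the sketch below recomputes those statistics for the ten sample rows shown above only, not for the full evaluation set. The list names `reward_before` and `reward_after` and the helper `summarize` are chosen here for illustration, and the snippet assumes, as in the earlier sections, that a higher reward means a less toxic summary.

```python
from statistics import mean, median

# Rewards copied from the ten sample rows above (assumed column order:
# reward before detoxification, reward after detoxification).
reward_before = [2.655499, 2.664360, 2.435017, 2.830459, 3.539455,
                 4.274823, 4.238442, 4.053303, 4.546868, 3.999867]
reward_after = [3.426647, 3.288665, 2.844114, 3.117796, 3.776832,
                4.349048, 4.298999, 4.099115, 4.580168, 4.030707]


def summarize(rewards):
    """Return (mean, median) for a list of reward values."""
    return mean(rewards), median(rewards)


mean_before, median_before = summarize(reward_before)
mean_after, median_after = summarize(reward_after)

# Per-sample improvement; this should match the last column of the
# table up to rounding.
reward_diff = [after - before
               for before, after in zip(reward_before, reward_after)]

print(f"mean:   {mean_before:.4f} -> {mean_after:.4f}")
print(f"median: {median_before:.4f} -> {median_after:.4f}")
print(f"mean improvement: {mean(reward_diff):.4f}")
```

Because these ten rows are only a small sample, the numbers are indicative at best; the same comparison over the whole evaluation set is what the paragraph above refers to.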

Performing the same detoxification on other datasets (containing, for example, conversations or colloquial exchanges) could yield more striking results.

### Acknowledgements

Thanks to DeepLearning.AI for the courses that inspired this notebook.