# Fine-Tune FLAN-T5 with Reinforcement Learning (PPO) and Parameter-Efficient Fine-Tuning (PEFT) to Generate Less-Toxic Summaries

In this notebook, we will fine-tune a FLAN-T5 model to generate less toxic content with Meta AI's hate speech reward model.

The reward model is a binary classifier that predicts either "not hate" or "hate" for the given text. We will use Proximal Policy Optimization (PPO) to fine-tune and reduce the model's toxicity.

## 1 - Set Up Kernel
```
pip3 install --upgrade pip

pip3 install --disable-pip-version-check \
    torch==2.0.0 \
    torchdata==0.6.0

pip3 install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    rouge_score==0.1.2 \
    evaluate==0.4.0 \
    peft==0.3.0
```

## 2 - Load FLAN-T5 Model, Prepare Reward Model and Toxicity Evaluator

### 2.1 - Load Data and FLAN-T5 Model Fine-Tuned with Summarization Instruction

We will keep working with the same Hugging Face dataset DialogSum and the pre-trained model FLAN-T5.

In [1]:
from datasets import load_dataset

model_name="google/flan-t5-base"
huggingface_dataset_name="knkarthick/dialogsum"

dataset_original = load_dataset(huggingface_dataset_name)

dataset_original



DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

The next step will be to preprocess the dataset. We will take only a part of it, then filter the dialogues of a particular length (just to make those examples long enough and, at the same time, easy to read). Then wrap each dialogue with the instruction and tokenize the prompts. Save the token ids in the field `input_ids` and decoded version of the prompts in the field `query`.

In [2]:
from transformers import AutoTokenizer


def build_dataset(model_name,
                  dataset_name,
                  input_min_text_length,
                  input_max_text_length):
    """
    Preprocess the dataset and split it into train and test parts.

    Parameters:
    - model_name (str): Tokenizer model name
    - dataset_name (str): Name of the dataset to load.
    - input_min_text_length (int): Minimum length of the dialogues.
    - input_max_text_length (int): Maximum length of the dialogues.

    Returns:
    - dataset_splits (datasets.dataset_dict.DatasetDict): Preprocessed dataset containing train and test parts.
    """

    # load dataset (only "train" part will be enough for this lab)
    dataset = load_dataset(dataset_name, split="train")

    # filter the dialogue of length between input_min_text_length and input_max_text_length
    dataset = dataset.filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length, batched=False)

    # Prepare tokenizer. Setting device_map="auto" is intended to allow switching between GPU and CPU automatically.
    tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

    def tokenize(sample):
        prompt = f"""
Summarize the following conversation.
{sample["dialogue"]}

Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)

        # This must be called "query", which is a requirement of our PPO library.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample
    
    # Tokenize each dialogue
    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")

    # Split the dataset into train and test parts
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits

dataset = build_dataset(model_name=model_name,
                        dataset_name=huggingface_dataset_name,
                        input_min_text_length=200,
                        input_max_text_length=1000)

print(dataset)


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 8017
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 2005
    })
})


In the previous lab, we fine-tuned the PEFT model with summarization instructions.
We can load the local version from `../week-2/peft-dialogue-summary-checkpoint-local/adapter_model.bin`, since I don't have access to the full fine-tuned model from the s3 bucket.

In [3]:
%%sh
# has already been copied
# cp ../week-2/peft-dialogue-summary-checkpoint-local/adapter_model.bin ./

Prepare a function to pull out the number of model parameters (it is the same as in the previous lab):

In [4]:
def print_total_number_of_model_parameters(model):
    print(f'{model.num_parameters():,}')

def print_number_of_trainable_model_parameters(model):
    print(f'{model.num_parameters(only_trainable=True):,}')

def print_percentage_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

Add the adapter to the original FLAN-T5 model. In the previous lab we added the fully trained adapter only for inference, so there was no need to pass the LoRA configuration.

Now we need to pass it to the constructed PEFT model and also set `is_trainable=True`.

In [5]:
from peft import PeftModel, LoraConfig, TaskType
from transformers import AutoModelForSeq2SeqLM
import torch

lora_config = LoraConfig(
    r = 32, # Rank
    lora_alpha=32,
    target_modules=["q", 'v'],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, 
                                              torch_dtype=torch.bfloat16)

peft_model = PeftModel.from_pretrained(model,
                                       './peft-dialogue-summary-checkpoint-local/',
                                       lora_config=lora_config,
                                       torch_dtype=torch.bfloat16,
                                       device_map="auto",
                                       is_trainable=True)

print(f"Total number of model parameters:")
print_total_number_of_model_parameters(peft_model)

print(f"PEFT model parameters to be updated:")
print_number_of_trainable_model_parameters(peft_model)


Total number of model parameters:
251,116,800
PEFT model parameters to be updated:
3,538,944
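
The trainable-parameter count printed above can be sanity-checked by hand. The sketch below assumes the published FLAN-T5-base dimensions (hidden size 768, 12 encoder blocks and 12 decoder blocks, with each decoder block holding one self-attention and one cross-attention layer). It also assumes that LoRA adds two matrices of shapes `r x d` and `d x r` to every targeted `q` and `v` projection. The variable names are illustrative and nothing is read from the checkpoint.

```python
# Assumed FLAN-T5-base dimensions (hard-coded, not read from the checkpoint).
d_model = 768              # input and output size of the q and v projections
encoder_blocks = 12        # one self-attention layer each
decoder_blocks = 12        # one self-attention and one cross-attention layer each
r = 32                     # LoRA rank, as in lora_config
targets_per_attention = 2  # the "q" and "v" projections

attention_layers = encoder_blocks + 2 * decoder_blocks
# Each adapted projection gets A (r x d_model) and B (d_model x r).
params_per_projection = r * d_model + d_model * r
lora_params = attention_layers * targets_per_attention * params_per_projection

print(f"{lora_params:,}")  # 3,538,944
```

Under these assumptions the result matches the number reported by `print_number_of_trainable_model_parameters`.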


In this lab, we are preparing to fine-tune the LLM using Reinforcement Learning (RL). 

RL will be briefly discussed in the next section of this lab, but at this stage, we just need to prepare the Proximal Policy Optimization (PPO) model passing the instruct-fine-tuned PEFT model to it. PPO will be used to optimize the RL policy against the reward model.

In [6]:
%pip install git+https://github.com/lvwerra/trl.git@25fa1bd

from trl import AutoModelForSeq2SeqLMWithValueHead

ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True)

print(f'PPO model parameters to be updated (ValueHead + 769 params):\n{print_percentage_of_trainable_model_parameters(ppo_model)}\n')

print(ppo_model.v_head)

PPO model parameters to be updated (ValueHead + 769 params):

trainable model parameters: 3539713
all model parameters: 251117569
percentage of trainable model parameters: 1.41%

ValueHead(
  (dropout): Dropout(p=0.1, inplace=False)
  (summary): Linear(in_features=768, out_features=1, bias=True)
  (flatten): Flatten(start_dim=1, end_dim=-1)
)


During PPO, only a small fraction of the parameters will be updated: the LoRA adapter weights and the parameters of the `ValueHead`.

More information about this class of models can be found in the [documentation](https://huggingface.co/docs/trl/main/en/models#trl.create_reference_model). The number of trainable parameters can be computed as `(n+1) * m`, where `n` is the number of input units (here `n = 768`) and `m` is the number of output units (you have `m=1`). The `+1` term in the equation takes into account the bias term.
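
As a quick check of the `(n+1) * m` formula, the short calculation below uses the `ValueHead` dimensions printed above and the trainable PEFT parameter count reported earlier in this notebook. The variable names are only for illustration.

```python
n = 768  # input units of the ValueHead's linear layer
m = 1    # output units
value_head_params = (n + 1) * m  # weights plus one bias per output unit

peft_trainable_params = 3_538_944  # trainable PEFT parameters reported earlier
total_trainable = peft_trainable_params + value_head_params

print(value_head_params)  # 769
print(total_trainable)    # 3539713
```

The total agrees with the `trainable model parameters` value printed for `ppo_model`.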

Now create a frozen copy of the PPO model which will not be fine-tuned: a reference model. The reference model will represent the LLM before detoxification. None of the parameters of the reference model will be updated during PPO training. This is on purpose.

In [7]:
from trl import create_reference_model

ref_model = create_reference_model(ppo_model)
print(f'Reference model parameters to be updated:\n{print_percentage_of_trainable_model_parameters(ref_model)}')

Reference model parameters to be updated:

trainable model parameters: 0
all model parameters: 251117569
percentage of trainable model parameters: 0.00%


### 2.2 - Prepare Reward Model

**Reinforcement Learning (RL)** is one type of machine learning where agents take actions in an environment aimed at maximizing their cumulative rewards. The agent's behavior is defined by the **policy**. And the goal of the reinforcement learning is for the agent to learn an optimal, or nearly-optimal, policy that maximizes the **reward function**.

In the previous section, the original policy is based on the instruct PEFT model: this is the LLM before detoxification. We could ask human labelers to give feedback on the toxicity of the outputs. However, it can be expensive to use them for the entire fine-tuning process. A practical way to avoid that is to use a reward model that encourages the agent to detoxify the dialogue summaries. An intuitive approach is to do some form of sentiment analysis across two classes (`nothate` and `hate`) and give a higher reward when the output is more likely to be classified as `nothate`.

We will use [Meta AI's RoBERTa-based hate speech model](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target) for the reward model. This model outputs **logits** and then predicts probabilities across two classes: `nothate` and `hate`. The logits of the `nothate` output will be taken as a positive reward. The model will then be fine-tuned with PPO using those reward values.

Create an instance of the required model class for the RoBERTa model. We also need to load a tokenizer to test the model. Notice that the model label `0` corresponds to the class `nothate` and label `1` to the class `hate`.

In [8]:
from transformers import AutoModelForSequenceClassification

toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name, device_map="auto")
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map="auto")

print(toxicity_model.config.id2label)


{0: 'nothate', 1: 'hate'}


Take some non-toxic text, tokenize it, and pass it to the model. 
Print the output logits, probabilities, and the corresponding reward that will be used for fine-tuning.

In [56]:
non_toxic_text = "#Person 1# tells Tommy that he didn't like the movie."
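toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids
# This call raised a runtime error in the original run. Moving the inputs to the
# model's device is an assumed fix for a device mismatch caused by device_map="auto".
logits = toxicity_model(toxicity_input_ids.to(toxicity_model.device)).logits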
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')

Let's show a toxic comment. This will have a low reward because it is more toxic.

In [34]:
toxic_text = "#Person 1# tells Tommy that the movie was terrible, dumb and stupid."
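toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids
# This call raised a runtime error in the original run. Moving the inputs to the
# model's device is an assumed fix for a device mismatch caused by device_map="auto".
logits = toxicity_model(toxicity_input_ids.to(toxicity_model.device)).logits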
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# Get the logits for "not hate" - this is the reward!
nothate_reward = (logits[:, not_hate_index]).tolist() 
print(f'reward (low): {nothate_reward}')


Set up a Hugging Face inference pipeline to simplify the code for the toxicity reward model:

In [39]:
from transformers import pipeline

device = 0 if torch.cuda.is_available() else "cpu"

sentiment_pipe = pipeline("sentiment-analysis", 
                          model=toxicity_model_name, 
                          device=device)
reward_logits_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # Set to "none" to retrieve raw logits.
    "batch_size": 16
}

reward_probabilities_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "softmax", # Set to "softmax" to apply softmax and retrieve probabilities.
    "batch_size": 16
}

print("Reward model output:")
print("For non-toxic text")
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))
print("For toxic text")
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

Reward model output:
For non-toxic text
[{'label': 'nothate', 'score': 3.114098072052002}, {'label': 'hate', 'score': -2.4896154403686523}]
[{'label': 'nothate', 'score': 0.9963293671607971}, {'label': 'hate', 'score': 0.0036706337705254555}]
For toxic text
[{'label': 'hate', 'score': 0.37227004766464233}, {'label': 'nothate', 'score': -0.6921156048774719}]
[{'label': 'hate', 'score': 0.7435277700424194}, {'label': 'nothate', 'score': 0.25647225975990295}]
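
The logits and probabilities printed above are related by a softmax. The following standalone check recomputes the probabilities from the printed logits using only the standard library. The `softmax` helper is written here for illustration and is not part of the pipeline.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the pipeline output above, ordered as [nothate, hate].
non_toxic_logits = [3.114098072052002, -2.4896154403686523]
toxic_logits = [-0.6921156048774719, 0.37227004766464233]

non_toxic_probs = softmax(non_toxic_logits)
toxic_probs = softmax(toxic_logits)

print(f"non-toxic text [nothate, hate]: {non_toxic_probs}")
print(f"toxic text     [nothate, hate]: {toxic_probs}")
```

The recomputed values agree with the pipeline's probabilities up to floating-point precision: about 0.9963 for `nothate` on the non-toxic text and about 0.7435 for `hate` on the toxic text.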


The outputs are the logits and the softmax probabilities for both the `nothate` (positive) and `hate` (negative) classes. PPO will use only the logits of the `nothate` class as the positive reward signal that helps detoxify the LLM outputs.

In [40]:
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))

[{'label': 'nothate', 'score': 3.114098072052002}, {'label': 'hate', 'score': -2.4896154403686523}]
[{'label': 'nothate', 'score': 0.9963293671607971}, {'label': 'hate', 'score': 0.0036706337705254555}]


In [41]:
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

[{'label': 'hate', 'score': 0.37227004766464233}, {'label': 'nothate', 'score': -0.6921156048774719}]
[{'label': 'hate', 'score': 0.7435277700424194}, {'label': 'nothate', 'score': 0.25647225975990295}]
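
Note that in the output above the toxic text lists `hate` first, so the position of the `nothate` entry is not fixed. A minimal sketch of picking the reward by label rather than by position follows. `score_for_label` is a hypothetical helper introduced here for illustration, and the two result lists are copied from the outputs above.

```python
def score_for_label(scores, label="nothate"):
    """Return the score for `label` from one pipeline result (a list of dicts)."""
    for entry in scores:
        if entry["label"] == label:
            return entry["score"]
    raise KeyError(f"label {label!r} not found in pipeline result")

# Results copied from the pipeline outputs above (raw logits).
non_toxic_result = [{'label': 'nothate', 'score': 3.114098072052002},
                    {'label': 'hate', 'score': -2.4896154403686523}]
toxic_result = [{'label': 'hate', 'score': 0.37227004766464233},
                {'label': 'nothate', 'score': -0.6921156048774719}]

reward_high = score_for_label(non_toxic_result)
reward_low = score_for_label(toxic_result)
print(reward_high, reward_low)
```

Indexing `toxic_result[0]` would instead return the `hate` logit, which would reward the toxic text.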


### 2.3 - Evaluate Toxicity

To evaluate the model before and after fine-tuning/detoxification you need to set up the [toxicity evaluation metric](https://huggingface.co/spaces/evaluate-measurement/toxicity). The toxicity score is a decimal value between 0 and 1 where 1 is the highest toxicity.

In [42]:
import evaluate

toxicity_evaluator = evaluate.load("toxicity", 
                                    toxicity_model_name,
                                    module_type="measurement",
                                    toxic_label="hate")

Downloading builder script: 100%|██████████| 6.08k/6.08k [00:00<00:00, 24.9MB/s]


Try to calculate toxicity for the same sentences as in section 2.2. It's no surprise that the toxicity scores are the probabilities of `hate` class returned directly from the reward model.

In [43]:
toxicity_score = toxicity_evaluator.compute(predictions=[
    non_toxic_text
])

print("Toxicity score for non-toxic text:")
print(toxicity_score["toxicity"])

toxicity_score = toxicity_evaluator.compute(predictions=[
    toxic_text
])

print("\nToxicity score for toxic text:")
print(toxicity_score["toxicity"])

Toxicity score for non-toxic text:
[0.0036706337705254555]

Toxicity score for toxic text:
[0.7435277700424194]


This evaluator can be used to compute the toxicity of the dialogues prepared in section 2.1. We will need to pass the test dataset (`dataset["test"]`), the same tokenizer which was used in that section, the frozen reference model prepared in section 2.1, and the toxicity evaluator. It is convenient to wrap the required steps in the function `evaluate_toxicity`.

In [50]:
import numpy as np

from tqdm import tqdm
tqdm.pandas()

from transformers import GenerationConfig

def evaluate_toxicity(model,
                      toxicity_evaluator,
                      tokenizer,
                      dataset,
                      num_samples):
    """
    Generate a summary for each sample and compute the toxicity of the query plus the generated text.

    Parameters:
    - model (trl model): Model to be evaluated.
    - toxicity_evaluator (evaluate_modules toxicity metrics): Toxicity evaluator.
    - tokenizer (transformers tokenizer): Tokenizer to be used.
    - dataset (dataset): Input dataset for the evaluation.
    - num_samples (int): Maximum number of samples for the evaluation.
        
    Returns:
    tuple: A tuple containing two numpy.float64 values:
    - mean (numpy.float64): Mean of the samples toxicity.
    - std (numpy.float64): Standard deviation of the samples toxicity.
    """

    max_new_tokens=100

    toxicities = []
    input_texts = []
    for i, sample in tqdm(enumerate(dataset)):
        input_text = sample["query"]

        if i > num_samples:
            break

        input_ids = tokenizer(input_text, return_tensors="pt", padding=True).input_ids

        generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
                                             top_k=0.0,
                                             top_p=1.0,
                                             do_sample=True)
        
        response_token_ids = model.generate(input_ids=input_ids,
                                            generation_config=generation_config)
        
        generated_text = tokenizer.decode(response_token_ids[0], skip_special_tokens=True)
        toxicity_score = toxicity_evaluator.compute(predictions=[(input_text + " " + generated_text)])
        toxicities.extend(toxicity_score["toxicity"])
    
    # Compute mean & std using np.
    mean = np.mean(toxicities)
    std = np.std(toxicities)
        
    return mean, std

And now perform the calculation of the model toxicity before fine-tuning/detoxification:

In [51]:
tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")
mean_before_detoxification, std_before_detoxification = evaluate_toxicity(model=ref_model, 
                                                                          toxicity_evaluator=toxicity_evaluator, 
                                                                          tokenizer=tokenizer, 
                                                                          dataset=dataset["test"], 
                                                                          num_samples=10)

print(f'toxicity [mean, std] before detox: [{mean_before_detoxification}, {std_before_detoxification}]')

11it [00:58,  5.34s/it]

toxicity [mean, std] before detox: [0.020027383241209794, 0.026915058424599524]





## 3 - Perform Fine-Tuning to Detoxify the Summaries

Optimize an RL policy against the reward model using Proximal Policy Optimization (PPO).

### 3.1 - Initialize `PPOTrainer`

For the `PPOTrainer` initialization, you will need a collator. Here it will be a function that turns a list of sample dictionaries into a single dictionary of lists, one list per key. You can define and test it:

In [57]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

test_data = [{"key1": "value1", "key2": "value2", "key3": "value3"}]
print(f'Collator input: {test_data}')
print(f'Collator output: {collator(test_data)}')

Collator input: [{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}]
Collator output: {'key1': ['value1'], 'key2': ['value2'], 'key3': ['value3']}
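
The test above uses a single sample. The sketch below repeats the same `collator` with two made-up samples to show how per-sample values are gathered into one list per key. The field values are placeholders, not real dataset entries.

```python
def collator(data):
    # Turn a list of dicts into a dict of lists, keyed by the first sample's keys.
    return dict((key, [d[key] for d in data]) for key in data[0])

samples = [
    {"query": "first prompt", "input_ids": [1, 2, 3]},
    {"query": "second prompt", "input_ids": [4, 5]},
]
collated = collator(samples)
print(collated)
# {'query': ['first prompt', 'second prompt'], 'input_ids': [[1, 2, 3], [4, 5]]}
```

Because the values are kept in plain Python lists, sequences of different lengths need no padding at this stage.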


Set up the configuration parameters. Load the `ppo_model` and the tokenizer. You will also load a frozen version of the model `ref_model`. The first model is optimized while the second model serves as a reference to calculate the KL-divergence from the starting point. This works as an additional reward signal in the PPO training to make sure the optimized model does not deviate too much from the original LLM.
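
To make the role of the reference model concrete, here is a simplified sketch of a KL-penalized reward. It assumes a per-token KL estimate equal to the difference between the policy and reference log-probabilities, a fixed penalty coefficient, and the reward-model score added at the last generated token. The function name `kl_penalized_rewards`, the coefficient `kl_coef`, and all numbers are made up for illustration. TRL's own implementation operates on tensors and may differ in detail.

```python
def kl_penalized_rewards(score, logprobs, ref_logprobs, kl_coef=0.2):
    """Combine a sequence-level score with a per-token KL penalty (illustrative)."""
    # Per-token KL estimate: how much more likely the policy makes each token
    # compared with the frozen reference model.
    kls = [lp - ref_lp for lp, ref_lp in zip(logprobs, ref_logprobs)]
    # Penalize every token for drifting away from the reference model.
    rewards = [-kl_coef * kl for kl in kls]
    # The reward-model score is credited to the final token of the response.
    rewards[-1] += score
    return rewards

# Hypothetical log-probabilities for a three-token response.
policy_logprobs = [-1.0, -0.5, -2.0]
reference_logprobs = [-1.2, -0.5, -1.5]

token_rewards = kl_penalized_rewards(score=3.0,
                                     logprobs=policy_logprobs,
                                     ref_logprobs=reference_logprobs)
print(token_rewards)
```

In this sketch, tokens that the policy makes more likely than the reference model does receive a negative contribution, which discourages the optimized model from drifting far from the original LLM.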

In [58]:
from trl import PPOConfig, PPOTrainer

learning_rate=1.41e-5
max_ppo_epochs=1
mini_batch_size=4
batch_size=16

config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

ppo_trainer = PPOTrainer(config=config,
                         model=ppo_model,
                         ref_model=ref_model,
                         tokenizer=tokenizer,
                         dataset=dataset["train"],
                         data_collator=collator)

### 3.2 - Fine-Tune the Model

The fine-tuning loop consists of the following main steps:

1. Get the query responses from the policy LLM (PEFT model).
2. Get sentiments for query/responses from hate speech RoBERTa model.
3. Optimize policy with PPO using the (query, response, reward) triplet.

The operation is running if you see the following metrics appearing:

 - `objective/kl`: minimize kl divergence,
 - `ppo/returns/mean`: maximize mean returns,
 - `ppo/policy/advantages_mean`: maximize advantages.

In [73]:
prompt_tensors = batch["input_ids"]
prompt_tensors = [tensor.float() for tensor in prompt_tensors]
prompt_tensors


In [79]:
from trl.core import LengthSampler

output_min_length = 100
output_max_length = 400
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

reward_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # You want the raw logits without softmax.
    "batch_size": 16
}

max_ppo_steps = 10

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    # Break when you reach max_steps
    if step >= max_ppo_steps:
        break

    prompt_tensors = batch["input_ids"]

    # Get response from FLAN-T5/PEFT LLM
    summary_tensors = []

    for prompt_tensor in prompt_tensors:
        max_new_tokens = output_length_sampler()

        generation_kwargs["max_new_tokens"] = max_new_tokens
        # Throws type error since M1 doesn't support BFloat16: https://github.com/pytorch/pytorch/issues/78168#issuecomment-1513598356
        summary = ppo_trainer.generate(prompt_tensor.long(), **generation_kwargs)

        summary_tensors.append(summary.squeeze()[-max_new_tokens:])

    # This needs to be called "response"
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute reward outputs
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]    
    rewards = sentiment_pipe(query_response_pairs, **reward_kwargs)

    # Select the `nothate` score by label: the pipeline output above lists `hate` first
    # for toxic text, so position 0 is not always the `nothate` class.
    reward_tensors = [
        torch.tensor(next(s["score"] for s in reward if s["label"] == "nothate"))
        for reward in rewards
    ]

    # Run PPO step.
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}')
    print('-'.join('' for x in range(100)))

### 3.3 - Evaluate the Model Quantitatively

Use the test dataset split to evaluate the toxicity score of the RL-fine-tuned PPO/PEFT model, which is still held in memory as `ppo_model`.

In [81]:
# Throws type error since M1 doesn't support BFloat16: https://github.com/pytorch/pytorch/issues/78168#issuecomment-1513598356
mean_after_detoxification, std_after_detoxification = evaluate_toxicity(model=ppo_model, 
                                                                        toxicity_evaluator=toxicity_evaluator, 
                                                                        tokenizer=tokenizer, 
                                                                        dataset=dataset["test"], 
                                                                        num_samples=10)

print(f'toxicity [mean, std] after detox: [{mean_after_detoxification}, {std_after_detoxification}]')

And compare the toxicity scores of the reference model (before detoxification) and fine-tuned model (after detoxification).

In [None]:
mean_improvement = (mean_before_detoxification - mean_after_detoxification) / mean_before_detoxification
std_improvement = (std_before_detoxification - std_after_detoxification) / std_before_detoxification

print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')
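
The improvement above is a relative change: the drop in toxicity divided by the starting toxicity. The worked example below uses made-up numbers (they are not results from this notebook) to show how the value is read. The helper `relative_improvement` is hypothetical.

```python
def relative_improvement(before, after):
    """Fraction by which `after` is lower than `before` (negative if it got worse)."""
    return (before - after) / before

# Hypothetical toxicity means, chosen only to illustrate the arithmetic.
hypothetical_before = 0.020
hypothetical_after = 0.015

improvement = relative_improvement(hypothetical_before, hypothetical_after)
print(f"mean: {improvement * 100:.2f}%")
```

A positive percentage means the toxicity score went down after detoxification, and a negative one means it went up. With only a handful of sampled generations, the sign can change from run to run.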

### 3.4 - Evaluate the Model Qualitatively

Let's inspect some examples from the test dataset. You can compare the original `ref_model` to the fine-tuned/detoxified `ppo_model` using the toxicity evaluator.

In [83]:
batch_size = 20
compare_results = {}

df_batch = dataset["test"][0:batch_size]

compare_results["query"] = df_batch["query"]
prompt_tensors = df_batch["input_ids"]

summary_tensors_ref = []
summary_tensors = []

# Get response from ppo and base model.
for i in tqdm(range(batch_size)):
    gen_len = output_length_sampler()
    generation_kwargs["max_new_tokens"] = gen_len
    
    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device), 
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors_ref.append(summary)

    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device), 
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors.append(summary)

# Decode responses.
compare_results["response_before"] = [tokenizer.decode(summary_tensors_ref[i]) for i in range(batch_size)]
compare_results["response_after"] = [tokenizer.decode(summary_tensors[i]) for i in range(batch_size)]

# Sentiment analysis of query/response pairs before/after.
texts_before = [d + s for d, s in zip(compare_results["query"], compare_results["response_before"])]
rewards_before = sentiment_pipe(texts_before, **reward_kwargs)
compare_results["reward_before"] = [next(s["score"] for s in reward if s["label"] == "nothate") for reward in rewards_before]

texts_after = [d + s for d, s in zip(compare_results["query"], compare_results["response_after"])]
rewards_after = sentiment_pipe(texts_after, **reward_kwargs)
compare_results["reward_after"] = [next(s["score"] for s in reward if s["label"] == "nothate") for reward in rewards_after]

Store and review the results in a DataFrame:

In [86]:
import pandas as pd

pd.set_option('display.max_colwidth', 500)
df_compare_results = pd.DataFrame(compare_results)
df_compare_results["reward_diff"] = df_compare_results['reward_after'] - df_compare_results['reward_before']
df_compare_results_sorted = df_compare_results.sort_values(by=['reward_diff'], ascending=False).reset_index(drop=True)
df_compare_results_sorted

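Once the cells above run, compare the mean and median of `reward_before` and `reward_after`. Positive `reward_diff` values mean that the reward model scores the fine-tuned model's summaries as less toxic than those of the reference model.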