### RL with PPO Example

- fine-tune a FLAN-T5 model to generate less toxic content
- Use Meta AI's hate speech reward model
- reward model is a binary classifier, predicts either "not hate" or "hate" for the given text
- use Proximal Policy Optimization (PPO) to fine-tune and reduce the model's toxicity.

In [1]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig
from datasets import load_dataset
from peft import PeftModel, PeftConfig, LoraConfig, TaskType

# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import evaluate

import numpy as np
import pandas as pd

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()




### Load Data and FLAN-T5 Model Fine-Tuned with Summarization Instruction

- use Hugging Face dataset [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) 
- pre-trained model [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) 

In [2]:
model_name="google/flan-t5-base"
huggingface_dataset_name = "knkarthick/dialogsum"

dataset_original = load_dataset(huggingface_dataset_name)

dataset_original

Found cached dataset csv (/home/scotsditch/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-cd36827d3490488d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)
100%|██████████| 3/3 [00:00<00:00, 1070.89it/s]


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

function `build_dataset`:
- preprocess the dataset
- sample from dataset
- filter the dialogues of a particular length 
    - long enough but easy to read
- wrap dialogues with the instruction 
- tokenize the prompts
- Save the token ids in the field `input_ids` 
- decoded version of the prompts in the field `query`
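
The filtering and prompt-wrapping steps listed above can be sketched without a tokenizer. This is a minimal illustration in plain Python on toy dialogues; `keep_dialogue` and `wrap_with_instruction` are hypothetical helper names, and the real `build_dataset` additionally tokenizes each prompt.

```python
def keep_dialogue(dialogue: str, min_len: int, max_len: int) -> bool:
    """Keep dialogues whose character length lies in (min_len, max_len]."""
    return min_len < len(dialogue) <= max_len


def wrap_with_instruction(dialogue: str) -> str:
    """Wrap a dialogue in the summarization instruction used for the prompts."""
    return f"\nSummarize the following conversation.\n\n{dialogue}\n\nSummary:\n"


# Toy dialogues (hypothetical data, not taken from DialogSum).
dialogues = [
    "#Person1#: Hi.",                             # too short, filtered out
    "#Person1#: Hello, how are you today? " * 3,  # within the length bounds, kept
]

prompts = [wrap_with_instruction(d) for d in dialogues if keep_dialogue(d, 20, 200)]
print(len(prompts))
print(prompts[0])
```

Note that the bounds are measured in characters, not tokens, which mirrors the `len(x["dialogue"])` check in the notebook's filter.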
 

In [3]:
def build_dataset(model_name,
                  dataset_name,
                  input_min_text_length, 
                  input_max_text_length):

    """
    Preprocess the dataset and split it into train and test parts.

    Parameters:
    - model_name (str): Tokenizer model name.
    - dataset_name (str): Name of the dataset to load.
    - input_min_text_length (int): Minimum length of the dialogues.
    - input_max_text_length (int): Maximum length of the dialogues.
        
    Returns:
    - dataset_splits (datasets.dataset_dict.DatasetDict): Preprocessed dataset containing train and test parts.
    """
    
    # load dataset (only "train" part will be enough for this lab).
    dataset = load_dataset(dataset_name, split="train")
    
    # Filter the dialogues of length between input_min_text_length and input_max_text_length characters.
    dataset = dataset.filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length, batched=False)

    # Prepare tokenizer. Tokenizers run on the CPU, so no device placement is needed here.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    def tokenize(sample):
        
        # Wrap each dialogue with the instruction.
        prompt = f"""
Summarize the following conversation.

{sample["dialogue"]}

Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)
        
        # This must be called "query", which is a requirement of our PPO library.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    # Tokenize each dialogue.
    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")
    
    # Split the dataset into train and test parts.
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits

dataset = build_dataset(model_name=model_name,
                        dataset_name=huggingface_dataset_name,
                        input_min_text_length=200, 
                        input_max_text_length=1000)

print(dataset)

Found cached dataset csv (/home/scotsditch/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-cd36827d3490488d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)
Loading cached processed dataset at /home/scotsditch/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-cd36827d3490488d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-23525d6e58aeae0c.arrow
Loading cached processed dataset at /home/scotsditch/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-cd36827d3490488d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-7f0270498d4c123d.arrow


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 8017
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 2005
    })
})


Prepare a function to report the number of trainable and total model parameters:

In [4]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

- Add the adapter to the original FLAN-T5 model
- pass lora configurations to the constructed PEFT model
- set `is_trainable=True`.

In [5]:
# use previously fine tuned model from PEFT_EXAMPLE notebook

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, 
                                              torch_dtype=torch.bfloat16)

peft_model = PeftModel.from_pretrained(model, 
                                       '/home/scotsditch/stuff/scotsditch_storage/LLM/Generative-AI-with-LLMs-main/Week-2/peft-dialogue-summary-checkpoint-local/', 
                                       lora_config=lora_config,
                                       torch_dtype=torch.bfloat16, 
                                       device_map="auto",                                       
                                       is_trainable=True)

print(f'PEFT model parameters to be updated:\n{print_number_of_trainable_model_parameters(peft_model)}\n')

PEFT model parameters to be updated:

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%
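
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the FLAN-T5-base architecture (hidden size 768, 12 encoder and 12 decoder blocks, each decoder block holding both a self-attention and a cross-attention module); these figures are assumptions, not something the notebook states. Each adapted `q` or `v` projection gets two low-rank matrices, `A` (r x d_in) and `B` (d_out x r).

```python
# Assumed FLAN-T5-base architecture (not stated in the notebook).
d_model = 768
encoder_layers = 12
decoder_layers = 12

# LoRA settings taken from lora_config.
r = 32
target_modules = ["q", "v"]

# One attention module per encoder block, two (self + cross) per decoder block.
attention_modules = encoder_layers + 2 * decoder_layers
adapted_matrices = attention_modules * len(target_modules)

# Each adapted projection adds A (r x d_model) and B (d_model x r).
params_per_matrix = r * d_model + d_model * r
lora_params = adapted_matrices * params_per_matrix

print(lora_params)  # 3538944
```

Under these assumptions the result matches the 3,538,944 trainable parameters reported for the PEFT model.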



- fine-tune the LLM using Reinforcement Learning (RL)
- prepare the Proximal Policy Optimization (PPO) model using fine tuned PEFT model
- PPO will be used to optimize the RL policy against the reward model

In [6]:
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,                                                               
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True)

print(f'PPO model parameters to be updated (ValueHead + 769 params):\n{print_number_of_trainable_model_parameters(ppo_model)}\n')
print(ppo_model.v_head)

PPO model parameters to be updated (ValueHead + 769 params):

trainable model parameters: 3539713
all model parameters: 251117569
percentage of trainable model parameters: 1.41%

ValueHead(
  (dropout): Dropout(p=0.1, inplace=False)
  (summary): Linear(in_features=768, out_features=1, bias=True)
  (flatten): Flatten(start_dim=1, end_dim=-1)
)


Note: During PPO, only a small fraction of the parameters is updated: the LoRA adapter weights plus the parameters of the `ValueHead`. The `ValueHead` accounts for the 769 extra trainable parameters reported above (3,539,713 versus 3,538,944 for the PEFT model alone). More information about this class of models can be found in the [documentation](https://huggingface.co/docs/trl/main/en/models#trl.create_reference_model). The number of `ValueHead` parameters can be computed as $(n+1) \cdot m$, where $n$ is the number of input units (here $n=768$) and $m$ is the number of output units (here $m=1$). The $+1$ term accounts for the bias.
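
The formula can be checked against the parameter counts printed above. This is a small arithmetic sketch; `linear_layer_params` is a hypothetical helper name.

```python
def linear_layer_params(n_inputs: int, n_outputs: int) -> int:
    """Weights plus one bias per output unit of a fully connected layer."""
    return (n_inputs + 1) * n_outputs


value_head_params = linear_layer_params(768, 1)
peft_trainable_params = 3538944  # trainable count reported for the PEFT model

print(value_head_params)                          # 769
print(peft_trainable_params + value_head_params)  # 3539713
```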

- create a frozen copy of the PPO model which will not be fine-tuned - a reference model
- reference model will represent the LLM before detoxification
- None of the parameters of the reference model will be updated during PPO training
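
The idea of a frozen reference copy can be illustrated with a toy stand-in for a model. This sketch is plain Python, not the TRL implementation: `ToyParam`, `ToyModel`, `make_frozen_copy`, and `count_trainable` are hypothetical names, and `requires_grad` merely mimics the PyTorch flag of the same name.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyParam:
    numel: int
    requires_grad: bool = True


@dataclass
class ToyModel:
    params: dict = field(default_factory=dict)


def make_frozen_copy(model: ToyModel) -> ToyModel:
    """Deep-copy the model and mark every parameter of the copy as non-trainable."""
    frozen = copy.deepcopy(model)
    for param in frozen.params.values():
        param.requires_grad = False
    return frozen


def count_trainable(model: ToyModel) -> int:
    """Sum the sizes of all parameters that would receive gradient updates."""
    return sum(p.numel for p in model.params.values() if p.requires_grad)


policy = ToyModel({"lora_A": ToyParam(24576), "v_head": ToyParam(769)})
reference = make_frozen_copy(policy)

print(count_trainable(policy))     # 25345
print(count_trainable(reference))  # 0
```

The deep copy matters: the reference must not share parameter objects with the policy, otherwise updates to the policy would silently change the reference as well.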

In [7]:
ref_model = create_reference_model(ppo_model)

print(f'Reference model parameters to be updated:\n{print_number_of_trainable_model_parameters(ref_model)}\n')

Reference model parameters to be updated:

trainable model parameters: 0
all model parameters: 251117569
percentage of trainable model parameters: 0.00%



### Prepare Reward Model

**Reinforcement Learning (RL)** 
Overview:
- agents take actions in an environment, aiming to maximize cumulative reward 
- agent's behavior is defined by the **policy**
- goal is for the agent to learn an optimal, or nearly optimal, policy that maximizes the **reward function**

- original policy is based on the fine-tuned PEFT model (this is the LLM before detoxification)
- human labelers could give feedback on the outputs' toxicity
- but it is expensive to use humans for the entire fine-tuning process

alternative approach: 
- reduce cost by using a reward model that encourages the agent to detoxify the dialogue summaries 
- do sentiment analysis across two classes (`nothate` and `hate`) and give a higher reward if there is a higher chance of getting class `nothate` as an output 
- i.e. use feedback generated by a model instead of by humans

Implementation below:
- use [Meta AI's RoBERTa-based hate speech model](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target) for the reward model
- model will output **logits** and then predict probabilities across two classes: `nothate` and `hate`
- logits of the output `nothate` will be taken as a positive reward
- model will be fine-tuned with PPO using those reward values

- Create the instance of the required model class for the RoBERTa model
- load tokenizer to test the model. 
- model label `0` will correspond to the class `nothate` and label `1` to the class `hate`.

In [8]:
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name, device_map="auto")
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map="auto")
print(toxicity_model.config.id2label)

{0: 'nothate', 1: 'hate'}


Take some non-toxic text, tokenize it, and pass it to the model. Print the output logits, probabilities, and the corresponding reward that will be used for fine-tuning.

In [13]:
non_toxic_text = "#Person 1# tells Tommy that he didn't like the movie."

# Move the inputs to the same device as the model (GPU if available).
toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids.to(toxicity_model.device)

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')

logits [not hate, hate]: [3.1141021251678467, -2.489619016647339]
probabilities [not hate, hate]: [0.9963293671607971, 0.0036706060636788607]
reward (high): [3.1141021251678467]


Let's show a toxic comment.  This will have a low reward because it is more toxic.

In [15]:
toxic_text = "#Person 1# tells Tommy that the movie was terrible, dumb and stupid."

# Move the inputs to the same device as the model (GPU if available).
toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids.to(toxicity_model.device)

logits = toxicity_model(toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# Get the logits for "not hate" - this is the reward!
nothate_reward = (logits[:, not_hate_index]).tolist() 
print(f'reward (low): {nothate_reward}')

logits [not hate, hate]: [-0.6921164393424988, 0.37227070331573486]
probabilities [not hate, hate]: [0.2564719617366791, 0.7435280084609985]
reward (low): [-0.6921164393424988]
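
The relationship between the logits and the probabilities printed in the two cells above can be checked by hand. The sketch below is a plain-Python softmax applied to the logged logits; it is an illustration of the calculation, not the code path the model itself uses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits [not hate, hate] copied from the outputs above.
non_toxic_logits = [3.1141021251678467, -2.489619016647339]
toxic_logits = [-0.6921164393424988, 0.37227070331573486]

print(softmax(non_toxic_logits))  # approximately [0.9963, 0.0037]
print(softmax(toxic_logits))      # approximately [0.2565, 0.7435]
```

The reward uses the raw `nothate` logit rather than the probability, so it is unbounded and can be negative, as in the toxic example.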


Set up a Hugging Face inference pipeline to simplify the code for the toxicity reward model:

In [16]:
device = 0 if torch.cuda.is_available() else "cpu"

sentiment_pipe = pipeline("sentiment-analysis", 
                          model=toxicity_model_name, 
                          device=device)
reward_logits_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # Set to "none" to retrieve raw logits.
    "batch_size": 16
}

reward_probabilities_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "softmax", # Set to "softmax" to apply softmax and retrieve probabilities.
    "batch_size": 16
}

print("Reward model output:")
print("For non-toxic text")
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))
print("For toxic text")
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

Reward model output:
For non-toxic text
[{'label': 'nothate', 'score': 3.1141021251678467}, {'label': 'hate', 'score': -2.489619016647339}]
[{'label': 'nothate', 'score': 0.9963293671607971}, {'label': 'hate', 'score': 0.0036706060636788607}]
For toxic text
[{'label': 'hate', 'score': 0.37227070331573486}, {'label': 'nothate', 'score': -0.6921164393424988}]
[{'label': 'hate', 'score': 0.7435280084609985}, {'label': 'nothate', 'score': 0.25647199153900146}]


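The outputs are the scores for both the `nothate` (positive) and `hate` (negative) classes: raw logits when `function_to_apply` is `"none"`, probabilities when it is `"softmax"`. PPO will use only the logit of the `nothate` class as the positive reward signal to help detoxify the LLM outputs. Note that the pipeline returns the entries sorted by score, so for the toxic text `hate` comes first; the reward should therefore be selected by label rather than by position.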

In [17]:
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))

[{'label': 'nothate', 'score': 3.1141021251678467}, {'label': 'hate', 'score': -2.489619016647339}]
[{'label': 'nothate', 'score': 0.9963293671607971}, {'label': 'hate', 'score': 0.0036706060636788607}]


In [18]:
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

[{'label': 'hate', 'score': 0.37227070331573486}, {'label': 'nothate', 'score': -0.6921164393424988}]
[{'label': 'hate', 'score': 0.7435280084609985}, {'label': 'nothate', 'score': 0.25647199153900146}]
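
Because the entries in the pipeline output above are ordered by score, indexing position 0 returns the `hate` score for the toxic text. A label-based lookup avoids this. The sketch below uses the logged outputs as literal data; `score_for_label` is a hypothetical helper name.

```python
def score_for_label(scores, label="nothate"):
    """Return the score of the entry carrying the given label."""
    for entry in scores:
        if entry["label"] == label:
            return entry["score"]
    raise KeyError(f"label {label!r} not found")


# Raw-logit pipeline outputs copied from the cells above.
non_toxic_scores = [{'label': 'nothate', 'score': 3.1141021251678467},
                    {'label': 'hate', 'score': -2.489619016647339}]
toxic_scores = [{'label': 'hate', 'score': 0.37227070331573486},
                {'label': 'nothate', 'score': -0.6921164393424988}]

print(score_for_label(non_toxic_scores))  # 3.1141021251678467
print(score_for_label(toxic_scores))      # -0.6921164393424988
print(toxic_scores[0]["score"])           # 0.37227070331573486 (the `hate` logit)
```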



### Evaluate Toxicity

To evaluate the model before and after fine-tuning/detoxification, set up the [toxicity evaluation metric](https://huggingface.co/spaces/evaluate-measurement/toxicity). The **toxicity score** is a decimal value between 0 and 1, where 1 is the highest toxicity.

In [19]:
toxicity_evaluator = evaluate.load("toxicity", 
                                    toxicity_model_name,
                                    module_type="measurement",
                                    toxic_label="hate")

Downloading builder script: 100%|██████████| 6.08k/6.08k [00:00<00:00, 12.2MB/s]


- calculate toxicity for the same sentences as above
- It's no surprise that the toxicity scores are the probabilities of `hate` class returned directly from the reward model

In [20]:
toxicity_score = toxicity_evaluator.compute(predictions=[
    non_toxic_text
])

print("Toxicity score for non-toxic text:")
print(toxicity_score["toxicity"])

toxicity_score = toxicity_evaluator.compute(predictions=[
    toxic_text
])

print("\nToxicity score for toxic text:")
print(toxicity_score["toxicity"])

Toxicity score for non-toxic text:
[0.003670593723654747]

Toxicity score for toxic text:
[0.743529200553894]


- evaluator can be used to compute the toxicity of the dialogues prepared previously 
- need to pass the test dataset (`dataset["test"]`), the same tokenizer that was used previously, the frozen PEFT model, and the toxicity evaluator
- wrap the required steps in the function `evaluate_toxicity`
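
The aggregation step of such an evaluation is a mean and a standard deviation over per-sample toxicity scores. A minimal sketch, using the two scores computed for the example sentences above as stand-in data: `statistics.pstdev` is the population standard deviation, which is also what `np.std` computes by default.

```python
import statistics

# Per-sample toxicity scores (the two example sentences scored above).
toxicities = [0.003670593723654747, 0.743529200553894]

mean = statistics.fmean(toxicities)
std = statistics.pstdev(toxicities)  # population std, like np.std's default (ddof=0)

print(f"toxicity [mean, std]: [{mean}, {std}]")
```

With only a handful of samples the standard deviation can easily exceed the mean, so small differences between two runs should be read with caution.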

In [27]:
def evaluate_toxicity(model, 
                      toxicity_evaluator, 
                      tokenizer, 
                      dataset, 
                      num_samples):
    
    """
    Generate summaries for samples of the dataset and compute the mean and standard deviation of their toxicity.

    Parameters:
    - model (trl model): Model to be evaluated.
    - toxicity_evaluator (evaluate_modules toxicity metrics): Toxicity evaluator.
    - tokenizer (transformers tokenizer): Tokenizer to be used.
    - dataset (dataset): Input dataset for the evaluation.
    - num_samples (int): Maximum number of samples for the evaluation.
        
    Returns:
    tuple: A tuple containing two numpy.float64 values:
    - mean (numpy.float64): Mean of the samples toxicity.
    - std (numpy.float64): Standard deviation of the samples toxicity.
    """

    max_new_tokens=100

    toxicities = []
    for i, sample in tqdm(enumerate(dataset)):
        input_text = sample["query"]

        # With `>`, the loop evaluates num_samples + 1 samples before stopping.
        if i > num_samples:
            break

        # Tokenize the prompt and move it to the model's device (GPU if available).
        input_ids = tokenizer(input_text, return_tensors="pt", padding=True).input_ids.to(next(model.parameters()).device)

        # top_k=0.0 and top_p=1.0 disable top-k and nucleus filtering, so tokens are sampled from the full distribution.
        generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
                                             top_k=0.0,
                                             top_p=1.0,
                                             do_sample=True)

        response_token_ids = model.generate(input_ids=input_ids,
                                            generation_config=generation_config)
        
        generated_text = tokenizer.decode(response_token_ids[0], skip_special_tokens=True)
        
        toxicity_score = toxicity_evaluator.compute(predictions=[(input_text + " " + generated_text)])

        toxicities.extend(toxicity_score["toxicity"])

    # Compute mean & std using np.
    mean = np.mean(toxicities)
    std = np.std(toxicities)
        
    return mean, std

And now perform the calculation of the model toxicity before fine-tuning/detoxification:

In [28]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

mean_before_detoxification, std_before_detoxification = evaluate_toxicity(model=ref_model, 
                                                                          toxicity_evaluator=toxicity_evaluator, 
                                                                          tokenizer=tokenizer, 
                                                                          dataset=dataset["test"], 
                                                                          num_samples=10)

print(f'toxicity [mean, std] before detox: [{mean_before_detoxification}, {std_before_detoxification}]')

11it [00:07,  1.40it/s]

toxicity [mean, std] before detox: [0.01606022620531307, 0.023931574985433954]






## Perform Fine-Tuning to Detoxify the Summaries
Optimize an RL policy against the reward model using Proximal Policy Optimization (PPO).


### Initialize `PPOTrainer`
 
The `PPOTrainer` initialization needs a collator. Here it is a function that converts a list of dictionaries into a dictionary of lists:

In [29]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

test_data = [{"key1": "value1", "key2": "value2", "key3": "value3"}]
print(f'Collator input: {test_data}')
print(f'Collator output: {collator(test_data)}')

Collator input: [{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}]
Collator output: {'key1': ['value1'], 'key2': ['value2'], 'key3': ['value3']}


- Set up the configuration parameters
- Load the `ppo_model` and the tokenizer
- load a frozen version of the model `ref_model`
- first model is optimized while the second model serves as a reference to calculate the KL-divergence from the starting point. This works as an additional reward signal in the PPO training to make sure the optimized model does not deviate too much from the original LLM.
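
The KL term can be pictured as a per-token penalty subtracted from the reward. The following is a simplified sketch of that idea, not TRL's exact computation: `kl_penalized_rewards` is a hypothetical helper, `beta` is a hypothetical penalty coefficient, and the log-probabilities are made up.

```python
def kl_penalized_rewards(score, logprobs, ref_logprobs, beta=0.2):
    """Per-token rewards: a KL penalty on every token, plus the reward-model score on the last one."""
    if not logprobs or len(logprobs) != len(ref_logprobs):
        raise ValueError("log-probability lists must be non-empty and of equal length")
    # Per-token KL estimate: log pi(token) - log pi_ref(token).
    rewards = [-beta * (lp - ref_lp) for lp, ref_lp in zip(logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards


# Policy identical to the reference: no penalty, only the last token carries the score.
same = kl_penalized_rewards(3.0, [-1.0, -0.5, -2.0], [-1.0, -0.5, -2.0])

# Policy assigns higher probability to its tokens than the reference: every token is penalized.
drifted = kl_penalized_rewards(1.0, [-0.5, -0.5], [-1.0, -1.0])

print(same)
print(drifted)
```

The further the optimized model's token probabilities move from those of the reference, the more of the reward-model score is cancelled out, which is what keeps the policy close to the original LLM.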

In [30]:
learning_rate=1.41e-5
max_ppo_epochs=1
mini_batch_size=4
batch_size=16

config = PPOConfig(
    model_name=model_name,    
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

ppo_trainer = PPOTrainer(config=config, 
                         model=ppo_model, 
                         ref_model=ref_model, 
                         tokenizer=tokenizer, 
                         dataset=dataset["train"], 
                         data_collator=collator)


### Fine-Tune the Model

The fine-tuning loop consists of the following main steps:
1. Get the query responses from the policy LLM (PEFT model).
2. Get sentiments for query/responses from hate speech RoBERTa model.
3. Optimize policy with PPO using the (query, response, reward) triplet.

The operation is running if you see the following metrics appearing:
* `objective/kl`: minimize KL divergence,
* `ppo/returns/mean`: maximize mean returns,
* `ppo/policy/advantages_mean`: maximize advantages.
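
The "optimize policy with PPO" step rests on PPO's clipped surrogate objective. The sketch below shows that objective for a single token in plain Python; it is an illustration of the standard formula, not the trainer's code, and `clipped_surrogate` and `epsilon=0.2` are illustrative choices.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped objective for one token.

    ratio is pi_new(token) / pi_old(token); the objective is
    min(ratio * A, clip(ratio, 1 - epsilon, 1 + epsilon) * A).
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


print(clipped_surrogate(1.5, 1.0))   # 1.2: gain is capped once the ratio exceeds 1 + epsilon
print(clipped_surrogate(0.5, -1.0))  # -0.8: the clipped term is the smaller one here
print(clipped_surrogate(1.1, 2.0))   # 2.2: ratio inside the clip range, no clipping
```

Taking the minimum removes the incentive to move the ratio far outside `[1 - epsilon, 1 + epsilon]` in a single update, which is the "proximal" part of the method.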

In [31]:
output_min_length = 100
output_max_length = 400
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

reward_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # You want the raw logits without softmax.
    "batch_size": 16
}

max_ppo_steps = 10

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    # Break when you reach max_steps.
    if step >= max_ppo_steps:
        break   

    prompt_tensors = batch["input_ids"]

    # Get response from FLAN-T5/PEFT LLM.
    summary_tensors = []

    for prompt_tensor in prompt_tensors:
        max_new_tokens = output_length_sampler()        
            
        generation_kwargs["max_new_tokens"] = max_new_tokens
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)
        
        summary_tensors.append(summary.squeeze()[-max_new_tokens:])
        
    # This needs to be called "response".
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute reward outputs.
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]    
    rewards = sentiment_pipe(query_response_pairs, **reward_kwargs)

    # Select the score of the `nothate` class by label: the pipeline sorts its output by score,
    # so position 0 is not guaranteed to be `nothate`.
    reward_tensors = [torch.tensor(next(item["score"] for item in reward if item["label"] == "nothate")) for reward in rewards]

    # Run PPO step.
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)
    
    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}')
    print('-' * 99)

0it [00:00, ?it/s]You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
1it [00:08,  8.17s/it]

objective/kl: -0.01285780593752861
ppo/returns/mean: 1.686070203781128
ppo/policy/advantages_mean: -3.034183393424428e-08
---------------------------------------------------------------------------------------------------


2it [00:20, 10.75s/it]

objective/kl: 0.056584086269140244
ppo/returns/mean: 1.0031486749649048
ppo/policy/advantages_mean: -7.895362585941257e-09
---------------------------------------------------------------------------------------------------


3it [00:30, 10.31s/it]

objective/kl: -0.04637895151972771
ppo/returns/mean: 1.4302836656570435
ppo/policy/advantages_mean: -7.594239548325277e-08
---------------------------------------------------------------------------------------------------


4it [00:36,  8.69s/it]

objective/kl: 0.038781143724918365
ppo/returns/mean: 1.617724895477295
ppo/policy/advantages_mean: 1.219014649223027e-07
---------------------------------------------------------------------------------------------------


5it [00:43,  8.07s/it]

objective/kl: 0.09141538292169571
ppo/returns/mean: 1.7703981399536133
ppo/policy/advantages_mean: 8.774095761054923e-08
---------------------------------------------------------------------------------------------------


6it [00:51,  7.83s/it]

objective/kl: 0.056717440485954285
ppo/returns/mean: 1.5906400680541992
ppo/policy/advantages_mean: 3.025949268931072e-09
---------------------------------------------------------------------------------------------------


7it [01:03,  9.39s/it]

objective/kl: -0.16287702322006226
ppo/returns/mean: 1.150525450706482
ppo/policy/advantages_mean: 3.481306265484818e-08
---------------------------------------------------------------------------------------------------


8it [01:10,  8.67s/it]

objective/kl: 0.031073477119207382
ppo/returns/mean: 1.523828148841858
ppo/policy/advantages_mean: 8.220768421551838e-08
---------------------------------------------------------------------------------------------------


9it [01:18,  8.29s/it]

objective/kl: 0.00514737144112587
ppo/returns/mean: 1.4748144149780273
ppo/policy/advantages_mean: -1.1240195618711368e-07
---------------------------------------------------------------------------------------------------


10it [01:28,  8.87s/it]

objective/kl: 0.02561032958328724
ppo/returns/mean: 1.4399124383926392
ppo/policy/advantages_mean: 4.556783750331306e-08
---------------------------------------------------------------------------------------------------
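
The `ppo/policy/advantages_mean` values in the log above are all on the order of 1e-9 to 1e-7. One plausible explanation, assuming the trainer normalizes ("whitens") the advantages to zero mean and unit variance before computing the loss, is that the mean is zero by construction up to floating-point error. The sketch below demonstrates the effect on made-up numbers; `whiten` is a hypothetical helper, not the library function.

```python
import statistics


def whiten(values, eps=1e-8):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values)
    return [(v - mean) / (variance + eps) ** 0.5 for v in values]


raw_advantages = [0.7, -1.2, 2.5, 0.1, -0.4]  # made-up values
whitened = whiten(raw_advantages)

print(statistics.fmean(whitened))   # very close to 0
print(statistics.pstdev(whitened))  # very close to 1
```

If that assumption holds, this metric carries little information about training progress; `objective/kl` and `ppo/returns/mean` are the more informative ones to watch.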






### Evaluate the Model Quantitatively

Use the test dataset split to evaluate the toxicity score of the RL-fine-tuned model (`ppo_model`, still in memory from the training loop).

In [32]:
mean_after_detoxification, std_after_detoxification = evaluate_toxicity(model=ppo_model, 
                                                                        toxicity_evaluator=toxicity_evaluator, 
                                                                        tokenizer=tokenizer, 
                                                                        dataset=dataset["test"], 
                                                                        num_samples=10)
print(f'toxicity [mean, std] after detox: [{mean_after_detoxification}, {std_after_detoxification}]')

11it [00:07,  1.56it/s]

toxicity [mean, std] after detox: [0.015395136655900966, 0.02467620710868973]





And compare the toxicity scores of the reference model (before detoxification) and fine-tuned model (after detoxification).

In [33]:
mean_improvement = (mean_before_detoxification - mean_after_detoxification) / mean_before_detoxification
std_improvement = (std_before_detoxification - std_after_detoxification) / std_before_detoxification

print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')

Percentage improvement of toxicity score after detoxification:
mean: 4.14%
std: -3.11%



### Evaluate the Model Qualitatively

- inspect some examples from the test dataset
- compare the original `ref_model` to the fine-tuned/detoxified `ppo_model` using the toxicity evaluator

In [34]:
batch_size = 20
compare_results = {}

df_batch = dataset["test"][0:batch_size]

compare_results["query"] = df_batch["query"]
prompt_tensors = df_batch["input_ids"]

summary_tensors_ref = []
summary_tensors = []

# Get response from ppo and base model.
for i in tqdm(range(batch_size)):
    gen_len = output_length_sampler()
    generation_kwargs["max_new_tokens"] = gen_len
    
    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device), 
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors_ref.append(summary)

    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device), 
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors.append(summary)

# Decode responses.
compare_results["response_before"] = [tokenizer.decode(summary_tensors_ref[i]) for i in range(batch_size)]
compare_results["response_after"] = [tokenizer.decode(summary_tensors[i]) for i in range(batch_size)]

# Sentiment analysis of query/response pairs before/after.
# Select the `nothate` score by label, because the pipeline sorts its output by score.
texts_before = [d + s for d, s in zip(compare_results["query"], compare_results["response_before"])]
rewards_before = sentiment_pipe(texts_before, **reward_kwargs)
compare_results["reward_before"] = [next(item["score"] for item in reward if item["label"] == "nothate") for reward in rewards_before]

texts_after = [d + s for d, s in zip(compare_results["query"], compare_results["response_after"])]
rewards_after = sentiment_pipe(texts_after, **reward_kwargs)
compare_results["reward_after"] = [next(item["score"] for item in reward if item["label"] == "nothate") for reward in rewards_after]

100%|██████████| 20/20 [00:27<00:00,  1.38s/it]


Store and review the results in a DataFrame

In [35]:
pd.set_option('display.max_colwidth', 500)
df_compare_results = pd.DataFrame(compare_results)
df_compare_results["reward_diff"] = df_compare_results['reward_after'] - df_compare_results['reward_before']
df_compare_results_sorted = df_compare_results.sort_values(by=['reward_diff'], ascending=False).reset_index(drop=True)
df_compare_results_sorted

index,query,response_before,response_after,reward_before,reward_after,reward_diff
0,"Summarize the following conversation. #Person1#: Hello? #Person2#: Hello? #Person1#: Can I speak to Li Hong, please? #Person2#: Speaking. #Person1#: Hi, Li Hong. This is Alice. #Person2#: Hi, Alice. How are you? #Person1#: Not bad. Li Hong, I am sorry that I can't go to see Mrs. Brown with you tomorrow morning. My mother is ill. I must take care of her. #Person2#: I'm sorry to hear that. You'd better stay at home. After all, we can visit Mrs. Brown later #Person1#: OK. Bye - bye. #Person2#: ...","<pad> Meet Alice. Lie Hong, please.</s>",<pad> Li Hong will see Alice tomorrow. You'll find her at 11:30am tomorrow.</s>,1.091108,2.157526,1.066417
1,"Summarize the following conversation. #Person1#: Could you help me, Sir? My flight got in 15 minutes ago. Everyone else has picked up the luggage but mine hasn't come through. #Person2#: I'm sorry, Madam, I'll go and find out if there is any more to come. Summary: </s>","<pad> #Person1: Sorry, sir. My flight hasn't arrived yet, Madam.</s>",<pad> This is a hard question.</s>,2.073765,2.930876,0.857111
2,"Summarize the following conversation. #Person1#: I'm forming a music band. #Person2#: Do you already know how to play an instrument? #Person1#: Uh... Yeah! I'Ve told you a thousand times that I'm learning to play the drums. Now that I know how to play well, I would like to form a rock band. #Person2#: Aside from yourself, who are the other members of the band? #Person1#: We have a guy who plays guitar, and another who plays bass. Although we still haven't found anyone to be our singer. You t...",<pad> #1 - Hey!</s>,"<pad> #Person1#: I'm forming a music band, despite my pudiny. #Person2#: Aw, okay! what's your song plan? #Person1#: I'm writing a song for money called “Full House."" #Person2#: Disappointed. What exactly are you doing with all this money? #Person1#: Do you have any good sources of financing?</s>",2.607893,3.43467,0.826777
3,"Summarize the following conversation. #Person1#: Hello. I want to reconfirm our flight to London. #Person2#: Yes, sir. Did you call the airline? #Person1#: Yes, I did. But I couldn't communicate with them in English. They speak only Spanish. So I need your help. #Person2#: Certainly, sir. What is the flight number and when are you leaving? #Person1#: We are taking IB 385 to London tomorrow at 1 p. m. #Person2#: Oh, I see, sir. We have the airline office inside the hotel. They have an English...","<pad> #Person1#: Sure. So, when do you want to reach the airline? #Person2#: By e-mail. #Listen to the automated procedure at the beginning and #Num author=#558297. #A-Credits @ the beginning, and #IPG_culled at #(illcb4aN). #Andhzrsdisk@time.com #Werkstock@time.com. ISky theo. Do you know your phone number or e-mail room number? *Freret Providez phone. At 1047 our airways are not open in cars. **FretPack #4aSjmcan7 43 and my room number. Voted two times and booked the room Kindle 4 marbled....","<pad> His name is Seung-ho. I want to reconfirm our flight to London. Now, please dial the phone number on the ticket for 353-585. Browse for articles about your flights and have Castigliones Write you more about our business. We are Brian Windsor, a private tutor and Special Trainer for clients.</s>",2.159354,2.800915,0.641562
4,"Summarize the following conversation. #Person1#: Today more and more families have personal computers. People have wider range of choice to communicate with the outside world. #Person2#: Right. With the establishment of Internet and a lot of web companies, people are getting more and more dependent on the web. #Person1#: One of the common uses of PC is that people can buy goods through it without going out to the physical stores. #Person2#: Can you tell me how it is done? #Person1#: If a cus...","<pad> As Microsoft gets advanced in computing, more and more organizations start to look to this new technologies as a marketer—I am sure this will lead to an improvement in devices' purpose.</s>","<pad> I understand computer surfing has been popular on the Internet in childhood. This is why I am sooo glad that people buy personal computers. Nowadays, people have access to computer through online shopping. One of the popular uses of personal computers is that people can buy goods through it without going out to shops. People can buy a certain goods and the web company will deliver the goods to you from his home.</s>",3.258098,3.823109,0.565011
5,"Summarize the following conversation. #Person1#: How much are you asking for this? #Person2#: I'm offering them to you at 150 yuan a piece. Is that all right? #Person1#: Is tax already included in their price? #Person2#: Yes. Our price can't be matched. #Person1#: Would you consider a volume discount? #Person2#: If you buy 1, 000 or more, you'll get a 10 % discount. #Person1#: I'll accept your offer. Summary: </s>",<pad> The discount will be 10 % if #1 buys over 1 000 in the first week.</s>,"<pad> If you like this cake, please try carrying a round 300-lb cone case it's to look like.</s>",2.865917,3.293869,0.427952
6,"Summarize the following conversation. #Person1#: Oh, my God! What's this? #Person2#: What? #Person1#: Look! This window is open. #Person2#: Did you open it before we left? #Person1#: Are you kidding? It's winter. Why would I open it? #Person2#: I don't know. Wait. Is this yours? #Person1#: No! Oh, my God! Someone has broken into the house. #Person2#: It looks that way. That's probably why the door wasn't locked when we came in. #Person1#: I locked it when I left though. #Person2#: Yes, but t...","<pad> Allen and his wife are getting off the road again. They're deserted. Someone stole most of the house, now Allen has eaten part of it.</s>","<pad> The house is on fire. Someone broke into the house. House was locked down. There is nothing where. Something must have robbed the television, tv and stereo. The one thing's on the porch remains. Inside the house is a monitor, TV, TV set, and stereo. Glass covers have been removed, the kitchen is lit with lamps and light at night.</s>",2.00025,2.196419,0.196169
7,"Summarize the following conversation. #Person1#: Mom, I just finished my paper. Can you proofread it before I hand it in? #Person2#: Sure, let's take a look. Sweetie, this is terrific. Your ideas are so original. #Person1#: Thanks. #Person2#: I can tell you worked hard on it. #Person1#: I really did! I started thinking about what I wanted to say three weeks ago. #Person2#: Well, it was definitely worth all the time. #Person1#: Let's just hope my teacher agrees. Summary: </s>",<pad> Check the papers in your student's classroom.</s>,"<pad> Parents, Tim and Mom will critique Person1's research paper.</s>",3.378499,3.523769,0.14527
8,"Summarize the following conversation. #Person1#: I would like to order some internet today. #Person2#: What kind would you like? #Person1#: What kind of internet is there? #Person2#: You can get DEL or dial-up. #Person1#: Which of those two is best? #Person2#: I would recommend DEL. #Person1#: So that one better? #Person2#: It's better because it doesn't tie up the phone. #Person1#: What do you mean by that? #Person2#: DEL isn't connected through your phone line, but dial-up is. #Person1#: S...",<pad> Go to the dl. Visit #1 up as opposed to #1. Select the DEL type and dial-up type.</s>,<pad> #Person1#: I want to order DEL or dial-up today.</s>,2.379903,2.514288,0.134385
9,"Summarize the following conversation. #Person1#: Could you help me figure out how to look for a job? #Person2#: We have lots of options, what type of job do you need? #Person1#: I want to work in an office. #Person2#: Do you want to work part-time or full-time? #Person1#: I want to work full-time. #Person2#: We have binders with local job listings or you can make use of the computers. OK? #Person1#: I am confused a bit but I am sure that I can figure it out. #Person2#: If you make an appoint...",<pad> Talk somewhere around.</s>,<pad> Talk with a job counselor at your local Job Center by calling 1-800-661-1270 and using your credit card or mail.</s>,2.329359,2.37526,0.045901
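
The `reward_diff` column appears to be `reward_after - reward_before`, with the rows sorted by it in descending order. The sketch below assumes that definition. It re-derives the column from the reward values printed in the table and summarizes the change. The names `rewards`, `recomputed_diff`, and `rewards_sorted` are illustrative and do not come from the notebook.

```python
import numpy as np
import pandas as pd

# Reward values copied from the ten rows of the table above.
rewards = pd.DataFrame({
    "reward_before": [1.091108, 2.073765, 2.607893, 2.159354, 3.258098,
                      2.865917, 2.000250, 3.378499, 2.379903, 2.329359],
    "reward_after":  [2.157526, 2.930876, 3.434670, 2.800915, 3.823109,
                      3.293869, 2.196419, 3.523769, 2.514288, 2.375260],
    "reward_diff":   [1.066417, 0.857111, 0.826777, 0.641562, 0.565011,
                      0.427952, 0.196169, 0.145270, 0.134385, 0.045901],
})

# Assumption: reward_diff = reward_after - reward_before.
rewards["recomputed_diff"] = rewards["reward_after"] - rewards["reward_before"]

# The printed values are rounded to six decimals, so compare with a tolerance.
matches = np.allclose(rewards["recomputed_diff"], rewards["reward_diff"], atol=1e-5)
print("recomputed diff matches table:", matches)

# Sort by improvement, largest first, mirroring the order shown in the table.
rewards_sorted = rewards.sort_values(by="recomputed_diff", ascending=False)

# Summary statistics over this sample.
mean_diff = rewards["recomputed_diff"].mean()
share_improved = (rewards["recomputed_diff"] > 0).mean()
print(f"mean reward improvement: {mean_diff:.4f}")
print(f"share of samples with higher reward after PPO: {share_improved:.0%}")
```

The reward reflects only the hate speech classifier's score for each response, so a higher `reward_after` does not by itself indicate a better summary.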
