# **Project Overview**
This project refines the FLAN-T5 model for dialogue summarization in two stages. First, the model is adapted with parameter-efficient fine-tuning (PEFT) using LoRA. Second, it is fine-tuned in the style of Reinforcement Learning from Human Feedback (RLHF) with PPO, where a hate-speech classifier stands in for the human-feedback reward model and steers the generated summaries away from toxic content.

Key Steps:
1. **Dataset Acquisition and Preparation**
   - Load the DialogSum dataset (`knkarthick/dialogsum`), drop very short dialogues, and tokenize the prompts and reference summaries for training.

2. **Model Initialization and Parameter-Efficient Fine-Tuning**
   - Load `google/flan-t5-base` and fine-tune it with LoRA adapters, so that only a small set of added weights is trained.

3. **Model Evaluation using ROUGE Score**
   - After parameter-efficient fine-tuning, the model is evaluated with the ROUGE score, which measures the n-gram overlap between generated and human-written summaries.

4. **Toxicity Assessment of Generated Summaries**
   - We use the `facebook/roberta-hate-speech-dynabench-r4-target` model to score the toxicity of the generated summaries before detoxification.

5. **Perform Fine-Tuning using RLHF Framework to Detoxify the Summaries**
   - The RLHF fine-tuning process involves the following components:
     - PPO Model: The policy model that is fine-tuned with PPO to make the summaries less toxic.
     - Reference Model: A frozen copy of the initial policy. The KL divergence between the PPO model and this reference is applied as a penalty on the reward during PPO training, which keeps the policy from drifting far from the original large language model (LLM).
     - Score Generator Model: The reward model, which assigns a non-toxicity score to each generated summary.

6. **Evaluation of Non-Toxicity Performance Enhancement**
   - Finally, we assess the improvements in non-toxicity achieved by the RLHF fine-tuned model to ensure the generated summaries meet high standards of safety and quality.
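
The interaction between the score generator and the reference model in step 5 can be sketched as below. This is a minimal illustration, not the trl implementation: the function name `kl_penalized_rewards`, the coefficient `kl_coef=0.2`, and the log-probability values are assumptions chosen for the example.

```python
def kl_penalized_rewards(score, policy_logprobs, ref_logprobs, kl_coef=0.2):
    """Combine a sequence-level score with a per-token KL penalty.

    Every generated token is penalized by how much more likely the policy
    makes it than the frozen reference model does; the non-toxicity score
    is added on the final token only.
    """
    rewards = [-kl_coef * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards


# Hypothetical per-token log-probabilities for a three-token summary.
policy_logprobs = [-1.0, -0.5, -2.0]
ref_logprobs = [-1.2, -0.5, -1.0]

rewards = kl_penalized_rewards(score=2.0,
                               policy_logprobs=policy_logprobs,
                               ref_logprobs=ref_logprobs)
print(rewards)
```

A token the policy prefers more strongly than the reference does gets a negative reward, so the policy cannot raise its score simply by moving far away from the original model.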

In [20]:
!pip install transformers peft datasets evaluate
!pip install git+https://github.com/lvwerra/trl.git@25fa1bd


Collecting git+https://github.com/lvwerra/trl.git@25fa1bd
  Cloning https://github.com/lvwerra/trl.git (to revision 25fa1bd) to /tmp/pip-req-build-06s5ih6l
  Running command git clone --filter=blob:none --quiet https://github.com/lvwerra/trl.git /tmp/pip-req-build-06s5ih6l
  Running command git checkout -q 25fa1bd
  Resolved https://github.com/lvwerra/trl.git to commit 25fa1bd
  Preparing metadata (setup.py) ... done


In [21]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig, Trainer, TrainingArguments
from datasets import load_dataset
from peft import PeftModel, PeftConfig, LoraConfig, TaskType

from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import evaluate

import numpy as np
import pandas as pd
from tqdm import tqdm


### **Section 1: Load Dataset and Pre-process for Fine-Tuning**




In [22]:
huggingface_dataset_name = "knkarthick/dialogsum"
dataset_original = load_dataset(huggingface_dataset_name)
dataset_original

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [23]:

def preprocess_and_split_dataset(tokenizer,
                                 dataset_to_load,
                                 min_length,
                                 ):

    # Load the dataset, using only the "train" split for this task.
    dataset = load_dataset(dataset_to_load, split="train")

    # Keep only dialogues longer than min_length characters.
    dataset = dataset.filter(lambda x: len(x["dialogue"]) > min_length, batched=False)

    def tokenize_function(example):
        # Wrap each dialogue in the summarization prompt before tokenizing.
        start_prompt = 'Summarize the following conversation.\n\n'
        end_prompt = '\n\nSummary: '
        prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
        example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
        example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
        return example

    # Tokenize the dataset in batches.
    dataset = dataset.map(tokenize_function, batched=True)
    dataset.set_format(type="torch")

    # Split the dataset into train and test sections (80/20, in original order).
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset_name = huggingface_dataset_name
min_length = 200  # Minimum dialogue length in characters.

dataset = preprocess_and_split_dataset(tokenizer,
                                       dataset_to_load=dataset_name,
                                       min_length=min_length)

print(dataset)


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'labels'],
        num_rows: 9964
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'labels'],
        num_rows: 2492
    })
})


### **Section 2: Model Initialization and Parameter-Efficient Fine-Tuning**


In [24]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
peft_model = get_peft_model(original_model,
                            lora_config)
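
To see why this configuration counts as parameter-efficient, the number of weights LoRA adds can be worked out by hand. The sketch below is an arithmetic illustration only: it assumes the published `flan-t5-base` shape (hidden size 768, 12 encoder layers, 12 decoder layers), and the helper `lora_param_count` is a hypothetical name, not part of `peft`.

```python
def lora_param_count(in_features, out_features, rank):
    """LoRA adds two matrices per target layer: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


# Assumed architecture of google/flan-t5-base.
d_model = 768
encoder_layers = 12
decoder_layers = 12

# One self-attention block per encoder layer; one self-attention and one
# cross-attention block per decoder layer.
attention_blocks = encoder_layers + 2 * decoder_layers

# target_modules=["q", "v"] adapts two projections in every attention block.
targets_per_block = 2
rank = 32

per_module = lora_param_count(d_model, d_model, rank)
total_lora_params = attention_blocks * targets_per_block * per_module
print(f"LoRA parameters per adapted projection: {per_module:,}")
print(f"Total trainable LoRA parameters: {total_lora_params:,}")
```

Under these assumptions the adapters add a few million weights, a small fraction of the full model, which is why a higher learning rate and a single epoch are practical here.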


In [25]:
import time
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
)


peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset['test']
)

In [None]:
peft_trainer.train()

peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)



### **Section 3: Model Evaluation using ROUGE Score**

In [None]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base",
                                                        torch_dtype=torch.bfloat16,
                                                        device_map='auto')
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base,
                                       'peft-dialogue-summary-checkpoint-local',
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False,
                                       )

In [None]:
index = 200
dialogue = dataset['test'][index]['dialogue']
baseline_human_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids.to(original_model.device), generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids.to(peft_model.device), generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(f'BASELINE HUMAN SUMMARY:\n{baseline_human_summary}\n')
print(f'ORIGINAL MODEL:\n{original_model_text_output}\n')
print(f'PEFT MODEL:\n{peft_model_text_output}\n')

In [None]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    # Keep the prompt flush left: indentation inside the f-string would become part of the model input.
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    original_model_outputs = original_model.generate(input_ids=input_ids.to(original_model.device), generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids.to(peft_model.device), generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries','original_model_summaries', 'peft_model_summaries'])

df.head()

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!pip install rouge_score

In [None]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)
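
For intuition about what the reported numbers mean, ROUGE-1 can be approximated in a few lines. The sketch below is a simplified illustration that assumes lowercase whitespace tokenization and no stemming; it is not the implementation behind `evaluate.load('rouge')`, and `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a generated summary and a reference summary."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(f"ROUGE-1 F1: {score:.3f}")
```

Five of the six tokens match in this toy pair, so precision and recall are both 5/6. ROUGE-2 and ROUGE-L follow the same idea with bigrams and the longest common subsequence, respectively.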

### **Section 4: Toxicity Assessment of Generated Summaries**


In [None]:
class ToxicModel:
    """Wraps the hate-speech classifier and returns a non-toxicity score per input text."""

    def __init__(self, toxicity_model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map="auto")

    def __call__(self, text, return_probs=False):
        # Pass the attention mask along with the input ids so that padding
        # tokens are ignored when a batch of texts is scored.
        inputs = self.tokenizer(text, padding=True, truncation=True, return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            logits = self.model(**inputs).logits

        if return_probs:
            # Class probabilities in the order [nothate, hate].
            return logits.softmax(dim=-1).tolist()

        # Reward: the raw logit of the `nothate` class (index 0).
        not_hate_index = 0
        return logits[:, not_hate_index].tolist()

toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_evaluator = ToxicModel(toxicity_model_name)



In [None]:
print("Non-toxic example: ")
non_toxic_text = "#Person 1# tells Tommy that he didn't like the movie."
print("Non-toxic probability", toxicity_evaluator(non_toxic_text, return_probs=True))
print("Non-toxic reward", toxicity_evaluator(non_toxic_text, return_probs=False))

print("Toxic example: ")
toxic_text = "#Person 1# tells Tommy that the movie was terrible, dumb and stupid."
print("Toxic probability", toxicity_evaluator(toxic_text, return_probs=True))
print("Toxic reward", toxicity_evaluator(toxic_text, return_probs=False))
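
The two return modes of the evaluator differ only in whether a softmax is applied to the classifier logits. The sketch below shows that relationship in plain Python; the logit values are made up for illustration and are not output from the RoBERTa model.

```python
import math


def softmax(logits):
    """Turn raw class logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits in the order [nothate, hate].
logits = [3.0, -2.5]

probs = softmax(logits)  # what return_probs=True corresponds to
reward = logits[0]       # what return_probs=False corresponds to
print(f"probabilities: {probs}")
print(f"reward (nothate logit): {reward}")
```

Using the raw logit as the reward keeps the signal unbounded, so the reward still changes for summaries whose `nothate` probability is already close to 1.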


In [None]:
def evaluate_toxicity(model,
                      toxicity_evaluator,
                      tokenizer,
                      dataset,
                      num_samples):

    """
    Generate summaries for the first `num_samples` dialogues and score them with the toxicity evaluator.

    Parameters:
    - model (PEFT or trl model): Model to be evaluated.
    - toxicity_evaluator (ToxicModel): Returns one `nothate` logit per text (higher means less toxic).
    - tokenizer (transformers tokenizer): Tokenizer to be used.
    - dataset (dataset): Input dataset for the evaluation.
    - num_samples (int): Maximum number of samples for the evaluation.

    Returns:
    tuple: A tuple containing two numpy.float64 values:
    - mean (numpy.float64): Mean of the samples' non-toxicity scores.
    - std (numpy.float64): Standard deviation of the samples' non-toxicity scores.
    """

    max_new_tokens = 100

    # Look up the device from the parameters; this works for both PEFT and trl models.
    device = next(model.parameters()).device

    generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
                                         top_k=0,
                                         top_p=1.0,
                                         do_sample=True)

    toxicities = []
    for i, sample in tqdm(enumerate(dataset)):
        # Stop once num_samples summaries have been scored.
        if i >= num_samples:
            break

        start_prompt = 'Summarize the following conversation.\n\n'
        end_prompt = '\n\nSummary: '
        input_text = start_prompt + sample["dialogue"] + end_prompt

        input_ids = tokenizer(input_text, return_tensors="pt", padding=True).input_ids

        response_token_ids = model.generate(input_ids=input_ids.to(device),
                                            generation_config=generation_config)

        generated_text = tokenizer.decode(response_token_ids[0], skip_special_tokens=True)

        toxicity_score = toxicity_evaluator(generated_text)

        toxicities.extend(toxicity_score)

    # Compute mean & std using np.
    mean = np.mean(toxicities)
    std = np.std(toxicities)

    return mean, std

tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")
mean_before_detoxification, std_before_detoxification = evaluate_toxicity(model=peft_model,
                                                                          toxicity_evaluator=toxicity_evaluator,
                                                                          tokenizer=tokenizer,
                                                                          dataset=dataset["test"],
                                                                          num_samples=10)

print(f'non-toxicity score [mean, std] before detox: [{mean_before_detoxification}, {std_before_detoxification}]')

### **Section 5: Perform Fine-Tuning using RLHF Framework to Detoxify the Summaries**


The fine-tuning loop consists of the following main steps:
1. Generate responses (summaries) for the queries with the policy LLM (PEFT model).
2. Score the responses with the hate-speech RoBERTa model; the `nothate` logit is the reward.
3. Optimize the policy with PPO using the (query, response, reward) triplets.

Training is running if the following metrics appear in the output:
* `objective/kl`: KL divergence from the reference model, which should stay small,
* `ppo/returns/mean`: mean returns, which should increase,
* `ppo/policy/advantages_mean`: mean advantages, which should increase.
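
The optimization in step 3 relies on PPO's clipped surrogate objective, which limits how far a single update can move the policy. The sketch below illustrates that objective for one token; it is a simplified stand-in rather than the trl code, and `clipped_surrogate`, `clip_range=0.2`, and the sample values are assumptions for the example.

```python
import math


def clipped_surrogate(new_logprob, old_logprob, advantage, clip_range=0.2):
    """PPO clipped objective for one token (to be maximized)."""
    # Probability ratio between the updated policy and the policy that sampled the token.
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# Unchanged policy: the objective equals the advantage.
unchanged = clipped_surrogate(new_logprob=-1.0, old_logprob=-1.0, advantage=2.0)

# The policy made a good token far more likely: the gain is capped at (1 + clip_range) * advantage.
capped = clipped_surrogate(new_logprob=0.0, old_logprob=-1.0, advantage=2.0)

print(unchanged, capped)
```

The clipping works together with the KL penalty against the reference model: one bounds each individual update, the other bounds the total drift over training.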

In [None]:
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True,
                                                               )

ref_model = create_reference_model(ppo_model)


In [None]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

learning_rate=1.41e-5
max_ppo_epochs=1
mini_batch_size=4
batch_size=16

config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

ppo_trainer = PPOTrainer(config=config,
                         model=ppo_model,
                         ref_model=ref_model,
                         tokenizer=tokenizer,
                         dataset=dataset["train"],
                         data_collator=collator)
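
The `collator` above turns a list of per-sample dictionaries into one dictionary of lists instead of stacking tensors, so entries of different lengths can share a batch. A small standalone demonstration, using made-up samples:

```python
def collator(data):
    # Same collator as above: group the values of each key across all samples.
    return dict((key, [d[key] for d in data]) for key in data[0])


# Hypothetical samples with input_ids of different lengths.
samples = [
    {"id": "a", "input_ids": [1, 2]},
    {"id": "b", "input_ids": [3]},
]

batch = collator(samples)
print(batch)
```

This is why the PPO loop can iterate over `batch["input_ids"]` one prompt tensor at a time.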

In [None]:
output_min_length = 100
output_max_length = 400
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

max_ppo_steps = 10

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    # Break when you reach max_steps.
    if step >= max_ppo_steps:
        break

    prompt_tensors = batch["input_ids"]

    # Get response from FLAN-T5/PEFT LLM.
    summary_tensors = []

    for prompt_tensor in prompt_tensors:
        max_new_tokens = output_length_sampler()

        generation_kwargs["max_new_tokens"] = max_new_tokens
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)

        summary_tensors.append(summary.squeeze()[-max_new_tokens:])

    # This needs to be called "response".
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute reward outputs. The reward is the logit of the `nothate` class,
    # so higher values mean less toxic summaries.
    rewards = toxicity_evaluator(batch["response"])
    rewards = [torch.tensor(r) for r in rewards]

    # Run PPO step.
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}')
    print('-'.join('' for x in range(100)))

### **Section 6: Evaluation of Non-Toxicity Performance Enhancement**


In [None]:
mean_after_detoxification, std_after_detoxification = evaluate_toxicity(model=ppo_model,
                                                                        toxicity_evaluator=toxicity_evaluator,
                                                                        tokenizer=tokenizer,
                                                                        dataset=dataset["test"],
                                                                        num_samples=10)
print(f'non-toxicity score [mean, std] after detox: [{mean_after_detoxification}, {std_after_detoxification}]')
# The evaluator returns the `nothate` logit, so a higher mean is better;
# a lower std means more consistent scores.
mean_improvement = (mean_after_detoxification - mean_before_detoxification) / abs(mean_before_detoxification)
std_improvement = (std_before_detoxification - std_after_detoxification) / std_before_detoxification

print('Percentage improvement of non-toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')