### Notebook 6.1: RLHF (Reinforcement Learning from Human Feedback) from Scratch 🚀  

Welcome back to the series! 🎉 In this notebook, we’ll dive into **Reinforcement Learning from Human Feedback (RLHF)**, taking a pretrained **GPT-2 model fine-tuned on IMDb movie reviews** and aligning it further using RLHF principles. Our goal is to understand and implement how RLHF can be used to generate text that aligns with human preferences for sentiment and style.  

### What’s the Goal? 🏆  

By the end of this notebook, you will:  
1. Gain a foundational understanding of **RLHF** and its importance in aligning language models with human values.  
2. **Build a reward model** to evaluate model outputs based on human preferences.  
3. **Explore TRL (Transformer Reinforcement Learning)** from Hugging Face:  
   - Dive into the library’s source code to understand **what’s happening under the hood**.  
   - Use TRL as a reference, not as a black box, ensuring we grasp the mechanics before applying it.  
4. **Implement PPO** (Proximal Policy Optimization) to fine-tune GPT-2 efficiently.  
5. Align GPT-2 to handle sentiment control and stylistic alignment based on IMDb reviews.  

<p align="center">
    <img src="images/RLHF.png" alt="RLHF overview" />
</p>

### Why Not Reinvent the Wheel? 🛠️  

While the goal is to explore RLHF concepts from scratch, implementing everything manually is impractical due to the inherent **instability of reinforcement learning training**. Instead, we will leverage the **TRL library by Hugging Face**, which provides a robust implementation of RLHF.  

However, **we won’t use TRL as a black box**. Instead, we’ll dive into the source code, observe its internals, and understand every component step by step. Only after understanding how TRL operates will we apply it to our GPT-2 fine-tuning task. This ensures a balance of theoretical understanding and practical efficiency.  

<p align="center">
    <img src="images/trl.png" alt="TRL library" />
</p>

### What’s Inside? 🔍  

#### **1: Introduction to RLHF** 🧠  

#### **2: Setting Up GPT-2 with IMDb Fine-Tuning** 🎥  
 
#### **3: Building the Reward Model** 🏗️  

#### **4: Exploring the TRL Library** 🛠️  

#### **5: Implementing PPO for RLHF** 🤖  

#### **6: RLHF Training Loop** 🔄  

#### **7: Evaluation and Analysis** 📊  

### A Word of Advice Before You Begin  

This notebook dives deep into **Reinforcement Learning**, **Proximal Policy Optimization**, and reward model training—each a substantial topic on its own. While we’ll guide you step-by-step, it’s helpful to review RL fundamentals beforehand. Alternatively, you can follow along and revisit concepts as needed.  

Let’s embrace this challenging yet rewarding journey of implementing RLHF with both theoretical rigor and practical efficiency! 🚀  

## Introduction to RLHF (Reinforcement Learning from Human Feedback)  

**Reinforcement Learning from Human Feedback (RLHF)** is a powerful technique to align language models, like GPT-2, with specific human preferences. Unlike traditional fine-tuning, RLHF integrates human feedback to guide the model’s behavior. This ensures that generated outputs not only make sense but also meet user expectations regarding sentiment, tone, or any other quality criteria.  

### Why RLHF? 🤔  

Language models like GPT-2 are pretrained to predict the next token in a sequence, giving them broad generalization capabilities. However, they might not inherently align with specific human values or preferences.  

For instance, a fine-tuned GPT-2 model trained on IMDb reviews may produce outputs spanning various sentiments—positive, neutral, or negative. But what if we want the model to generate only **positive reviews**? RLHF allows us to:  
- Tailor outputs to a desired sentiment.  
- Incorporate feedback dynamically, guiding the model to improve during training.  
- Balance specific preferences without sacrificing fluency or coherence.  

### The Scenario: Generating Positive IMDb Reviews 🌟  

Let’s use RLHF to train a GPT-2 model fine-tuned on IMDb reviews to generate **positive movie reviews** exclusively. Here’s the process:  

1. The baseline GPT-2 generates an output based on a given prompt, but the sentiment is not guaranteed to be positive.  
2. A **reward model** evaluates the sentiment of the generated review:  
   - Positive reviews receive **higher rewards**.  
   - Negative or neutral reviews receive **lower rewards**.  
3. Using reinforcement learning (specifically **PPO**), GPT-2 updates its weights to align with the reward model's preferences, producing increasingly positive outputs over time.  
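
To make step 2 concrete, here is a minimal sketch of turning a sentiment classifier's output into a scalar reward. It assumes a two-class classifier that returns `[negative, positive]` logits and uses the positive-class probability as the reward. The logits are made-up numbers and `reward_from_logits` is a hypothetical helper, not part of any library; the reward model built later in the notebook may score text differently.

```python
import math

def reward_from_logits(logits):
    """Return P(positive) from [negative, positive] logits (hypothetical helper)."""
    neg, pos = logits
    # Two-class softmax, shifted by the max logit for numerical stability.
    m = max(neg, pos)
    exp_neg, exp_pos = math.exp(neg - m), math.exp(pos - m)
    return exp_pos / (exp_neg + exp_pos)

# Made-up logits for two generated reviews (illustrative values only).
negative_review_logits = [2.0, -1.0]
positive_review_logits = [-1.5, 2.5]

print(reward_from_logits(negative_review_logits))  # low reward
print(reward_from_logits(positive_review_logits))  # high reward
```

Any monotone function of the positive-class score would serve the same purpose; what matters is that more positive text receives a larger number.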

### The RLHF Workflow  

1. **Pretrained Model**: Start with GPT-2 fine-tuned on IMDb reviews (already handled in earlier steps).  

2. **Dataset for RLHF**:  
   - Create pairs of movie reviews with human feedback indicating preferred outputs.  
   - For example, a positive review is marked as "better" compared to a neutral or negative one.  

3. **Reward Model**: Train a model that evaluates generated reviews and assigns rewards based on sentiment alignment.  

4. **Policy and PPO Algorithm**:  
   - Fine-tune GPT-2 (the **policy model**) using the reward model’s feedback.  
   - Use **Proximal Policy Optimization (PPO)** to stabilize updates and maintain fluency.  
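
The "stabilize updates" part of step 4 comes from PPO's clipped objective. The sketch below shows that objective for a single action, in plain Python. The function name `ppo_clipped_loss`, the default `clip_eps=0.2`, and the example numbers are assumptions chosen for illustration; a real implementation works on batches of token log-probabilities and adds value and entropy terms.

```python
import math

def ppo_clipped_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped PPO surrogate loss for one action (illustrative sketch)."""
    # Probability ratio between the updated policy and the policy that sampled the action.
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio within [1 - eps, 1 + eps] so one update cannot move too far.
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # PPO maximizes the smaller of the two surrogates; negate it to get a loss.
    return -min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the gain from raising the probability is capped at 1 + eps.
print(ppo_clipped_loss(logp_new=-1.0, logp_old=-1.5, advantage=1.0))
# Negative advantage: the unclipped (more pessimistic) term is kept.
print(ppo_clipped_loss(logp_new=-1.0, logp_old=-1.5, advantage=-1.0))
```

Taking the minimum makes the objective pessimistic: the policy gets no extra credit for moving the ratio outside the clip range in the direction the advantage favors.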

<p align="center">
    <img src="images/rlhf2.jpg" alt="RLHF workflow" />
</p>
   
### A Practical Example  

Here’s how RLHF can refine outputs:  

1. **Baseline Model Output**:  
   *Prompt*: *"The movie was a unique experience because..."*  
   - *Output*: "The movie was a unique experience because the plot was dull and the pacing was tedious."  

2. **Reward Model Evaluation**:  
   - Reward: Low (due to negative sentiment).  

3. **PPO Adjustment**:  
   - Adjust GPT-2 weights to produce outputs with higher rewards.  

4. **Post-RLHF Output**:  
   *Prompt*: *"The movie was a unique experience because..."*  
   - *Output*: "The movie was a unique experience because the plot was captivating and the pacing kept me on the edge of my seat."  

<p align="center">
    <img src="images/gpts_rlhf.png" alt="GPT2_RLHF" />
</p>


### Why Not Just Fine-Tune on Positive Reviews?  

Simply fine-tuning GPT-2 on positive reviews alone introduces biases but doesn’t guarantee nuanced alignment. RLHF is more effective because it:  
- Provides dynamic adaptation through reinforcement learning.  
- Penalizes outputs straying from natural language realism (via KL divergence regularization).  
- Balances sentiment alignment with fluency and coherence.  
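
As a small illustration of the KL divergence mentioned above, the sketch below compares two hypothetical policies against a reference next-token distribution. The three-token vocabulary and all probabilities are invented for the example; `kl_divergence` is a plain-Python helper written here, not a library function.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Invented next-token distributions over a 3-token vocabulary.
reference = [0.5, 0.3, 0.2]          # frozen reference model
close_policy = [0.45, 0.35, 0.20]    # policy that stayed near the reference
drifted_policy = [0.05, 0.05, 0.90]  # policy that collapsed onto one token

print(kl_divergence(close_policy, reference))    # small penalty
print(kl_divergence(drifted_policy, reference))  # large penalty
```

A policy that stays close to the reference pays almost nothing, while one that collapses onto a few tokens to chase reward pays a large penalty.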

### What’s Next in This Notebook?  

In this notebook, we will:  
1. **Build the Dataset**: Prepare IMDb data for RLHF.  
2. **Load Models**:  
   - A frozen GPT-2 model for KL divergence regularization.  
   - A GPT-2 policy model for training with PPO.  
3. **Create a Reward Model**: A sentiment evaluator assigning scores to generated reviews.  
4. **Implement PPO**: Combine rewards and responses to refine the policy model through iterative updates.  


### Step-by-Step Workflow  

We’ll approach RLHF practically, ensuring theory and implementation go hand-in-hand:  
- **Review Core Concepts**: We’ll break down the math and logic behind RLHF and PPO.  
- **Understand TRL Library**: Instead of treating it as a black box, we’ll examine the **Transformer Reinforcement Learning (TRL)** library by Hugging Face, using its source code as a reference.  
- **Apply RLHF**: Finally, we’ll use TRL to efficiently train our model, leveraging its well-tested implementations while understanding the underlying mechanisms.  

By the end, you’ll not only master RLHF but also see its potential for aligning language models with complex human preferences. Let’s dive in! 🚀

## Let's Prepare the Dataset

In [32]:
import numpy as np
import pandas as pd
from transformers import AutoTokenizer
import torch
from datasets import load_dataset

# Load the dataset
dataset_name = "stanfordnlp/imdb"
dataset = load_dataset(dataset_name)

df = pd.DataFrame(dataset['train'])

df.head(3)


|   | text | label |
|---|------|-------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |


In [33]:

# 1. Renaming the text column to review
df = df.rename(columns={'text': 'review'})

# 2. Keep only reviews longer than 200 characters
#    (.copy() avoids pandas' SettingWithCopyWarning when columns are added later)
df = df[df['review'].apply(lambda x: len(x) > 200)].copy()

# 3. Sample one prompt length in tokens, shared by every row
#    (TRL's LengthSampler instead draws a new length on each call)
min_text_length = 2
max_text_length = 8
values = list(range(min_text_length, max_text_length + 1))  # Ensure max_text_length is included
input_size = np.random.choice(values)

# Display the first 3 rows after processing
df.head(3)

|   | review | label |
|---|--------|-------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |


In [34]:

# 4. Tokenization
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Define the tokenization function
def tokenize(row):
    # Tokenize the review and truncate to the sampled length
    input_ids = tokenizer.encode(row["review"], truncation=True, max_length=input_size, padding=False)
    query = tokenizer.decode(input_ids)
    
    # Return the tokenized output as a dictionary
    return {"input_ids": input_ids, "query": query}

# Apply tokenization to each row of the DataFrame
df[['input_ids', 'query']] = df.apply(lambda row: pd.Series(tokenize(row)), axis=1)

# Convert the 'input_ids' column to tensor
df['input_ids'] = df['input_ids'].apply(torch.tensor)

# Displaying the first 3 rows after tokenization
df[['review', 'input_ids', 'query']].head(3)


|   | review | input_ids | query |
|---|--------|-----------|-------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | [tensor(40), tensor(26399), tensor(314), tenso... | I rented I AM C |
| 1 | "I Am Curious: Yellow" is a risible and preten... | [tensor(1), tensor(40), tensor(1703), tensor(4... | "I Am Curious: |
| 2 | If only to avoid making this type of film in t... | [tensor(1532), tensor(691), tensor(284), tenso... | If only to avoid making |


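Here is the same pipeline in a more compact form, using `datasets` directly instead of a pandas DataFrame. One difference: `LengthSampler` draws a new length on every call, so each review gets its own prompt length, whereas the pandas version above sampled a single length for all rows.  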

In [8]:
from trl.core import LengthSampler

def build_dataset(
    dataset_name="stanfordnlp/imdb",
    input_min_text_length=2,
    input_max_text_length=8,
):
    """
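    Build the dataset for training. This loads the data with `load_dataset`;
    customize this function to train the model on your own dataset.

    Args:
        dataset_name (`str`):
            The name of the dataset to be loaded.
        input_min_text_length (`int`):
            Lower bound passed to `LengthSampler` for the prompt length in tokens.
        input_max_text_length (`int`):
            Upper bound passed to `LengthSampler` for the prompt length in tokens.

    Returns:
        ds (`datasets.Dataset`):
            The tokenized dataset, formatted as torch tensors.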
    """
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    # load imdb with datasets
    ds = load_dataset(dataset_name, split="train")
    ds = ds.rename_columns({"text": "review"})
    ds = ds.filter(lambda x: len(x["review"]) > 200, batched=False)

    input_size = LengthSampler(input_min_text_length, input_max_text_length)

    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()]
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    ds = ds.map(tokenize, batched=False)
    ds.set_format(type="torch")
    return ds
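
To see why `input_size()` is called inside `tokenize`, here is a minimal stand-in for a per-call length sampler. `SimpleLengthSampler` is a hypothetical class written for this illustration, not TRL's implementation; it excludes the upper bound, which is an assumption, so check the installed TRL source to see how `LengthSampler` treats `max_value`. The token ids are placeholders.

```python
import random

class SimpleLengthSampler:
    """Hypothetical stand-in: every call returns a freshly sampled length."""

    def __init__(self, min_value, max_value, seed=None):
        # Upper bound excluded here (assumption, see the note above).
        self.values = list(range(min_value, max_value))
        self.rng = random.Random(seed)

    def __call__(self):
        return self.rng.choice(self.values)

sampler = SimpleLengthSampler(2, 8, seed=0)
tokens = list(range(100, 120))  # placeholder token ids for one long review

# Each "review" is cut to its own randomly drawn prompt length.
queries = [tokens[: sampler()] for _ in range(5)]
print([len(q) for q in queries])
```

Varying the prompt length per sample exposes the policy to short and slightly longer review openings instead of one fixed cut-off.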

In [40]:
# Build the dataset
ds = build_dataset(dataset_name="stanfordnlp/imdb", input_min_text_length=2, input_max_text_length=8)

In [41]:
# Let's look at the dataset
ds

Dataset({
    features: ['review', 'label', 'input_ids', 'query'],
    num_rows: 24895
})

In [50]:
# Let's look at the first row
ds[0]

{'review': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far 

We load the GPT-2 model with a value head, along with the tokenizer. We load the model twice: the first copy is optimized, while the second serves as a frozen reference for calculating the KL divergence from the starting point. This KL term acts as an additional reward signal in PPO training to make sure the optimized model does not deviate too much from the original language model.
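
One common way to combine the two signals is to penalize each generated token by the log-probability gap between the policy and the reference, then add the reward model's score on the final token. The sketch below follows that idea in plain Python; `shaped_rewards`, the coefficient `kl_coef`, and all numbers are assumptions for illustration, and the exact formula inside TRL may differ.

```python
def shaped_rewards(score, logp_policy, logp_ref, kl_coef=0.1):
    """Per-token rewards: KL penalty on every token, score added on the last one."""
    # A token the policy likes more than the reference does gets a negative reward.
    rewards = [-kl_coef * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    # The reward model's scalar score is credited to the final generated token.
    rewards[-1] += score
    return rewards

# Invented log-probabilities for a 3-token response.
logp_policy = [-1.0, -0.5, -2.0]
logp_ref = [-1.0, -1.5, -1.0]

print(shaped_rewards(score=1.0, logp_policy=logp_policy, logp_ref=logp_ref))
```

With `kl_coef=0` the policy would be free to chase the reward model's score alone; larger values keep it closer to the reference model.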

In [53]:
from transformers import AutoTokenizer  # the tokenizer class comes from transformers, not trl
from trl import AutoModelForCausalLMWithValueHead

# Load the main GPT-2 model with a value head for fine-tuning with reinforcement learning
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")

# Load a reference GPT-2 model with a value head, typically used to calculate KL divergence
# between the fine-tuned model and the original model during fine-tuning.
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")

# Load the tokenizer for GPT-2. This will handle tokenization for both the main and reference models.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Set the pad token to the end-of-sequence token, as GPT-2 does not have a padding token by default.
tokenizer.pad_token = tokenizer.eos_token


RuntimeError: Failed to import trl.models.modeling_value_head because of the following error (look up to see its traceback):
Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
cannot import name 'quantize_' from 'torchao.quantization' (c:\Users\user\anaconda3\envs\torch\lib\site-packages\torchao\quantization\__init__.py)