# How ChatGPT Works Part 3: RLHF

<a target="_blank" href="https://colab.research.google.com/github/life-efficient/RLHF-Implementation/blob/main/Notebook.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

> Reinforcement Learning from Human Feedback, or RLHF, is a technique used to update a machine learning model based on human feedback

The second and third steps in the diagram below make up RLHF:
- In step 2, the reward model is trained to predict the reward for each response, using a supervised dataset of prompts, each with several ranked responses
- In step 3, the reward model is used in the reinforcement learning setup to predict the reward for each response to an unlabelled dataset of prompts

![](https://github.com/life-efficient/RLHF-Implementation/blob/main/images/How%20chatGPT%20is%20trained.png?raw=1)

### Recap: What is Reinforcement Learning?

> Reinforcement learning is where an agent (in our case, the AI system) interacts with an environment (in our case, the chat interface, where it responds to prompts) and tries to maximise a reward which it receives for doing well (or minimise a punishment for not doing well).

![](https://github.com/life-efficient/RLHF-Implementation/blob/main/images/RL%20Formulation.png?raw=1)


In our case:
- The action taken by the bot is the response it provides
- The policy is the model that the chatbot uses to provide a response
- The reward is generated by the reward model
- The state is the chat so far

![](https://github.com/life-efficient/RLHF-Implementation/blob/main/images/ChatGPT%20RL%20Formulation.png?raw=1)
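
The mapping above can be sketched as a single step of the agent-environment loop. This is only a toy illustration: `toy_policy` and `toy_reward_model` are hypothetical stand-ins for the language model and the reward model, and the scoring rule is made up.

```python
def toy_policy(state):
    """Hypothetical stand-in for the language model: maps the chat so far to a response."""
    last_prompt = state[-1]
    return f"Here is an answer to: {last_prompt}"


def toy_reward_model(state, action):
    """Hypothetical stand-in for the reward model: scores a response given the chat so far."""
    # Made-up rule for illustration only: reward responses that mention the prompt
    return 1.0 if state[-1] in action else 0.0


state = ["How does gravity work?"]        # the state is the chat so far
action = toy_policy(state)                # the action is the response
reward = toy_reward_model(state, action)  # the reward comes from the reward model
state = state + [action]                  # the response becomes part of the next state

print(action)
print(reward)
```

In the real setup, both functions are transformer networks and the reward is used to update the policy rather than just printed.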

More specifically:
- Both the language model and the reward model are transformer neural networks
- The reward model remains fixed, assuming that it's already encoded the values we want the model to align with
- The language model is updated

![](https://github.com/life-efficient/RLHF-Implementation/blob/main/images/RLHF%20NN%20Setup%20for%20LMs.png?raw=1)

> Note that in the InstructGPT paper, both the language model and the reward model were initialised using the parameters resulting from the SFT process performed earlier.

### The Reward Model is Used to Encode Complex Behaviours that are Very Difficult to Define

> It can be very difficult to define many of the behaviours that we want our AI systems to exhibit

- What does it mean to be unbiased?
- What does it mean to act professionally?
- What does it mean to be ethical?

> Instead of trying to explicitly write out the rules for each of these things, a better approach can be to learn them from human feedback

It's hard to write the rules for these things, but it's relatively easy for a human to tell whether an output is biased, professional, or ethical.
That's why the reward model is trained on human feedback (rankings of different responses to a given prompt). 
If the reward model is trained sufficiently to fit a dataset that prefers unbiased, ethical responses etc, then it should encode these complex behaviours.
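
To make the ranking idea concrete, here is a minimal sketch of how a ranking becomes training pairs and how each pair is scored. It uses the same pairwise loss as the `loss_function` defined later in this notebook, but in plain Python; the reward values are made up for illustration.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(preferred_reward, alternate_reward):
    """-log(sigmoid(r_preferred - r_alternate)): small when the preferred response scores higher."""
    difference = preferred_reward - alternate_reward
    return -math.log(1 / (1 + math.exp(-difference)))


# A human ranking of three responses to one prompt, best first (labels are placeholders)
ranking = ["most preferable", "somewhat preferable", "least preferable"]

# Every (better, worse) pair implied by the ranking
pairs = list(combinations(ranking, 2))
print(pairs)

# Hypothetical rewards: the loss is low when the reward model agrees with the human
# ranking and high when it disagrees
loss_agree = pairwise_ranking_loss(2.0, -1.0)
loss_disagree = pairwise_ranking_loss(-1.0, 2.0)
print(loss_agree, loss_disagree)
```

Minimising this loss pushes the reward of the preferred response above the reward of the alternative, which is how the human preferences end up encoded in the reward model.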

> The reward model is used to provide the reward used in the reinforcement learning setup

## The Dataset

To implement the reinforcement learning loop, we'll need a dataset of prompts. Because the reward model provides the reward as a label for each response, we don't need human-written labels. The dataset simply returns prompts. The model completes them and the reward model scores them, and we then use the reward to update the policy for generating responses.

In [2]:
!pip install torch

In [3]:
import pandas as pd
import torch

class PromptDataset(torch.utils.data.Dataset):
    def __init__(self):
        super().__init__()
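        # Store the prompts as a plain list so that indexing past the end raises
        # IndexError, which is what lets `for prompt in dataset` terminate
        self.prompts = pd.read_csv('prompt_dataset.csv')["Prompt"].tolist()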

    def __len__(self):
        return len(self.prompts)
    
    def __getitem__(self, idx):
        return self.prompts[idx]

prompt_dataset = PromptDataset()
prompt_dataset[0]

'How does gravity work?'

## ENABLE GPU RUNTIME NOW IF ON GOOGLE COLAB

In the next few cells, we'll train and save the SFT language model and the reward model. The training can be massively accelerated by using a GPU (graphics processing unit - fast for parallel operations) instead of a CPU (central processing unit - fast for sequential operations).

> Enable the GPU runtime on Google Colab now

To do so, hit "Runtime" -> "Change runtime type" -> "Hardware accelerator" -> "GPU"

## Load in the Pre-Trained Language Model

By this point, we should already have performed supervised fine-tuning (SFT) on a large language model.

Let's load in our fine-tuned language model:

In [None]:
!pip install transformers

In [None]:
from torch.utils.tensorboard import SummaryWriter
from tqdm import tqdm
from torch.utils.data import DataLoader
import torch
import json
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config


class SFTModel(GPT2LMHeadModel):
    def __init__(self):
        configuration = GPT2Config.from_pretrained(
            'gpt2', output_hidden_states=False)
        super().__init__(config=configuration)
        # super().__init__ only builds the architecture with random weights,
        # so copy in the pre-trained GPT-2 weights explicitly
        self.load_state_dict(
            GPT2LMHeadModel.from_pretrained('gpt2').state_dict())
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # Load the tokenizer
        # Move the model to the GPU if one is available
        self.to(torch.device(
            "cuda:0" if torch.cuda.is_available() else "cpu"))

    def forward(self, prompt, response):
        # Encode the data
        entire_text = prompt + response
        context_dict = self.tokenizer(
            '<|startoftext|>' + entire_text + '<|endoftext|>',
            #    truncation=True,
            #    max_length=max_length,
            #    padding="max_length"
        )

        input_ids = torch.tensor(context_dict.input_ids)
        labels = torch.tensor(context_dict.input_ids)
        attention_mask = torch.tensor(context_dict.attention_mask)

        # Move to GPU
        input_ids = input_ids.to(self.device)
        labels = labels.to(self.device)
        attention_mask = attention_mask.to(self.device)

        # Run the model
        outputs = super().forward(
            input_ids=input_ids,
            labels=labels,
            attention_mask=attention_mask,
        )
        return outputs


class SFTDataset(torch.utils.data.Dataset):
    """Supervised Fine-Tuning Dataset

    Returns:
        prompt: str
        response: str
    """

    def __init__(self):
        with open("sft_dataset.json") as f:
            self.data = json.load(f)

    def __len__(self):
        """Defines the length of the dataset."""
        return len(self.data)

    def __getitem__(self, idx):
        """Defines how to get a sample from the dataset by indexing it.

        Returns:
            prompt: str
            response: str
        """
        return self.data[idx]["prompt"], self.data[idx]["response"]


def train_and_save_SFT_model(epochs=10):

    # Create the model
    model = SFTModel()  # Load the model

    # Create the dataset and dataloader
    dataset = SFTDataset()
    dataloader = DataLoader(dataset, batch_size=1, shuffle=True)

    # Create the optimizer
    # as used in the InstructGPT paper
    optimizer = torch.optim.Adam(
        model.parameters(), lr=1e-5, betas=(0.9, 0.95))

    # Set up logging
    writer = SummaryWriter()  # for logging our loss to TensorBoard
    # for setting the x-axis of our TensorBoard plots (loss vs. batch index)
    batch_idx = 0

    # Train the model
    for epoch in range(epochs):
        print(f"Epoch {epoch + 1}")
        for batch in tqdm(dataloader):
            # Get the data
            prompt, response = batch
            prompt = prompt[0]
            response = response[0]

            # Forward pass
            outputs = model(prompt, response)

            loss = outputs.loss

            # Backward pass
            loss.backward()
            optimizer.step()

            # Zero the gradients
            optimizer.zero_grad()

            # Log the loss
            # print(f"Loss: {loss.item()}", batch_idx)
            writer.add_scalar("SFT Model Loss/train", loss.item(), batch_idx)
            batch_idx += 1
    torch.save(model.state_dict(), "sft_model_params.pt")


In [None]:
train_and_save_SFT_model()

Now that we've trained and saved the SFT model, we need to load it in and set its parameters.

In [None]:
sft_model = SFTModel() # create model
sft_state_dict = torch.load('sft_model_params.pt', map_location=sft_model.device) # load model weights onto the model's device
sft_model.load_state_dict(sft_state_dict) # set model weights

## Load in the Pre-Trained Reward Model

By this point, we should have already trained a reward model that takes in a prompt and a response and produces a scalar reward - a measure of how good the response is for that context.

Let's load in our reward model:

In [None]:
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from tqdm import tqdm
from transformers import GPT2Model, GPT2Tokenizer
import torch
import pandas as pd


def loss_function(preferred_response_reward, alternate_response_reward):
    return -torch.mean(torch.log(torch.sigmoid(preferred_response_reward - alternate_response_reward)))


def create_response_pairs():

    data = pd.read_csv('reward_dataset.csv', sep="|")

    data = data.to_dict(orient="records")
    response_pairs = []

    for row in data:
        prompt = row["Prompt"]
        response_pairs.append(
            (prompt, row["Most preferable response"], row["Somewhat preferable response"]))
        response_pairs.append(
            (prompt, row["Most preferable response"], row["Least preferable response"]))
        response_pairs.append(
            (prompt, row["Somewhat preferable response"], row["Least preferable response"]))

    return response_pairs


class RewardDataset(torch.utils.data.Dataset):
    def __init__(self):
        """Initializes the dataset."""
        self.response_pairs = create_response_pairs()
        print("Number of response pairs:", len(self.response_pairs))

    def __len__(self):
        """Returns the length of the dataset."""
        return len(self.response_pairs)

    def __getitem__(self, idx):
        """Returns the example in the dataset at the given index."""

        # Get the response pair at the given index
        response_pair = self.response_pairs[idx]
        prompt, preferred_response, alternate_response = response_pair

        # Return the preferred response, alternate response
        return prompt, preferred_response, alternate_response


class RewardModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.device = torch.device(
            "cuda" if torch.cuda.is_available() else "cpu")
        self.backbone = GPT2Model.from_pretrained('gpt2')
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        self.regression_head = torch.nn.Linear(768, 1)
        self.to(self.device)

    def forward(self, context, response):
        """
        Returns a scalar value representing the reward for this response, given the context.
        Args:
            context (str): The context. aka. the prompt.
            response (str): The response. aka. the response to the prompt.
        Returns:
            float: The reward for generating this response given the context.    
        """

        entire_text = context + response
        context_dict = self.tokenizer(
            '<|startoftext|>' + entire_text + '<|endoftext|>',
            #    truncation=True,
            #    max_length=max_length,
            #    padding="max_length"
        )

        input_ids = torch.tensor(context_dict.input_ids)
        attention_mask = torch.tensor(context_dict.attention_mask)

        # Move to GPU
        input_ids = input_ids.to(self.device)
        attention_mask = attention_mask.to(self.device)

        # Forward pass
        gpt2_outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        all_output_vectors = gpt2_outputs.last_hidden_state
        last_output_vector = all_output_vectors[-1]

        # add batch_size dimension
        last_output_vector = last_output_vector.unsqueeze(0)
        reward = self.regression_head(last_output_vector)

        return reward


def train_and_save_reward_model(epochs=10):

    model = RewardModel()

    # Create the dataset and dataloader
    dataset = RewardDataset()

    # Create the optimizer
    optimizer = torch.optim.Adam(
        model.parameters(), lr=1e-5, betas=(0.9, 0.95))  # as used in the InstructGPT paper

    # Set up logging
    writer = SummaryWriter()  # for logging our loss to TensorBoard
    # for setting the x-axis of our TensorBoard plots (loss vs. batch index)
    batch_idx = 0
    # Train the model
    for epoch in range(epochs):
        print(f"Epoch {epoch + 1}")
        for batch in tqdm(dataset):

            prompt, preferred_response, alternate_response = batch

            preferred_response_reward = model(prompt, preferred_response)
            alternate_response_reward = model(prompt, alternate_response)

            loss = loss_function(preferred_response_reward,
                                 alternate_response_reward)

            loss.backward()

            optimizer.step()

            optimizer.zero_grad()

            writer.add_scalar("Reward Model Loss/Train",
                              loss.item(), batch_idx)
            batch_idx += 1
            # torch.save(model.state_dict(),
            #            f"epoch-{epoch}-reward_model_params.pt")
    torch.save(model.state_dict(), "reward_model_params.pt")


In [None]:
train_and_save_reward_model()

Now that we've trained and saved the reward model, we need to load it in and set its parameters.

In [None]:
reward_model = RewardModel()  # create model
# load model weights (the filename must match the one used by train_and_save_reward_model)
reward_state_dict = torch.load('reward_model_params.pt', map_location=reward_model.device)
reward_model.load_state_dict(reward_state_dict)  # set model weights

## A Simple Initial Attempt - Train the Policy to Maximise the Reward

The overall objective that RLHF optimises is rather complicated, so before we optimise for that, let's try to simply maximise the reward.
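
One subtlety is worth sketching first. The reward is computed from the decoded text of a sampled response, and sampling is not differentiable, so the reward cannot simply be backpropagated into the language model. A common workaround is the REINFORCE (score-function) estimator, which scales the gradient of the log-probability of the sampled action by the reward it received. The toy below applies it to a policy that chooses between just two candidate responses. The logits, rewards, and learning rate are made up for illustration, and this is a sketch of the idea rather than the notebook's own training code.

```python
import math
import random

random.seed(0)


def softmax(logits):
    """Turns logits into probabilities."""
    largest = max(logits)
    exps = [math.exp(logit - largest) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.0, 0.0]   # toy policy: one logit per candidate response
rewards = [1.0, 0.0]  # toy reward model: response 0 is good, response 1 is bad
learning_rate = 0.1

for step in range(200):
    probs = softmax(logits)
    # Sample an action (a response) from the current policy
    action = random.choices([0, 1], weights=probs)[0]
    reward = rewards[action]
    for k in range(len(logits)):
        # d/d(logit_k) of log pi(action) is onehot(action)[k] - probs[k]
        grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
        # Gradient ascent on reward * log pi(action)
        logits[k] += learning_rate * reward * grad_log_prob

final_probs = softmax(logits)
print(final_probs)
```

Responses that earn a reward become more likely, while nothing needs to be differentiated through the sampling step or the reward model.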


In [None]:
def reward_maximisation_objective(prompt, response, reward_model):
    """Returns the reward maximisation objective for the given prompt and response."""

    # Set the reward model to evaluation mode (disables dropout)
    reward_model.eval()

    # Get the reward for the response
    reward = reward_model(prompt, response)

    # this is a trivial function right now, 
    # but it highlights that the maximisation objective could be any function like this
    # and could include more terms

    # Return the reward
    return reward

In [None]:
from torch.utils.tensorboard import SummaryWriter

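def sample_response(policy, prompt, max_new_tokens=30):
    """Samples a response token by token.

    Returns the decoded response and the summed log-probability of its tokens.
    """
    input_ids = torch.tensor(
        policy.tokenizer(prompt).input_ids, device=policy.device).unsqueeze(0)
    prompt_length = input_ids.shape[1]
    log_prob = 0

    for _ in range(max_new_tokens):
        # Call the parent forward directly: SFTModel.forward expects strings, not token ids
        logits = GPT2LMHeadModel.forward(policy, input_ids=input_ids).logits[0, -1]
        distribution = torch.distributions.Categorical(logits=logits)
        token = distribution.sample()
        log_prob = log_prob + distribution.log_prob(token)
        input_ids = torch.cat([input_ids, token.view(1, 1)], dim=1)
        if token.item() == policy.tokenizer.eos_token_id:
            break

    response = policy.tokenizer.decode(input_ids[0, prompt_length:])
    return response, log_prob


def train_and_save_RLHF_model(epochs=10):
    """Trains the RLHF model and saves it to disk."""

    # Set up logging
    writer = SummaryWriter()
    batch_idx = 0

    # Create the prompt dataset
    prompt_dataset = PromptDataset()

    # Load the trained reward model and freeze it
    reward_model = RewardModel()
    reward_model.load_state_dict(torch.load(
        'reward_model_params.pt', map_location=reward_model.device))
    for parameter in reward_model.parameters():
        parameter.requires_grad_(False)

    # Load the SFT model, which is the initial policy
    sft_model = SFTModel()
    sft_model.load_state_dict(torch.load(
        'sft_model_params.pt', map_location=sft_model.device))

    # Create the optimizer: only the policy is trained, the reward model stays frozen
    optimizer = torch.optim.Adam(sft_model.parameters(), lr=1e-4)

    # Train the model
    for epoch in range(epochs):
        for idx in range(len(prompt_dataset)):
            prompt = prompt_dataset[idx]

            # Sample a response from the current policy
            response, log_prob = sample_response(sft_model, prompt)

            # Get the reward maximisation objective (a constant with respect to the policy)
            with torch.no_grad():
                reward = reward_maximisation_objective(
                    prompt, response, reward_model).item()

            # The reward is computed from decoded text, so it can't be backpropagated
            # into the policy. As a substitute for a direct backward pass through the
            # reward, use the REINFORCE estimator: maximise reward * log-probability
            # of the response. Torch minimises, so we negate it.
            loss = -reward * log_prob

            # Backpropagate, update the policy parameters, and zero the gradients
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            # Log the reward, which is the quantity we are trying to maximise
            writer.add_scalar("RLHF Model Objective/train", reward, batch_idx)
            batch_idx += 1

    # Save the model
    torch.save(sft_model.state_dict(), 'rlhf_model_params.pt')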

## Adding a Term to the Loss Function to Limit How Far the Model Can Deviate from the Original SFT Parameters

The overall objective used in the InstructGPT paper contains more terms than the reward alone.

One of those additional terms keeps the model tuned with RL close to the initial parameterisation produced by supervised fine-tuning.
Without it, the model can drift during RL optimisation towards gibberish that no longer models the distribution of the language well but still tricks the reward model into giving a high score.
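
In the InstructGPT paper, this term is a penalty on the log-ratio between the probability the RL policy assigns to a response and the probability the SFT model assigns to it, so the quantity being maximised for a sampled response is `reward - beta * log(pi_RL(y|x) / pi_SFT(y|x))`. The probability of a whole response follows from the chain rule: it is the product of the per-token conditional probabilities, so its log is a sum. Here is a minimal sketch in plain Python, where the per-token probabilities, the reward, and `beta` are made-up values for illustration.

```python
import math


def sequence_log_prob(token_probs):
    """Chain rule: log p(y | x) is the sum over t of log p(y_t | x, y_<t)."""
    return sum(math.log(p) for p in token_probs)


def kl_penalised_reward(reward, policy_token_probs, sft_token_probs, beta):
    """reward - beta * log(pi_RL(y | x) / pi_SFT(y | x)) for a single sampled response."""
    log_ratio = sequence_log_prob(policy_token_probs) - sequence_log_prob(sft_token_probs)
    return reward - beta * log_ratio


# Hypothetical probabilities that each model assigns to the same three response tokens
sft_token_probs = [0.5, 0.4, 0.5]
drifted_policy_token_probs = [0.9, 0.9, 0.9]  # the policy has moved away from the SFT model

beta = 0.1    # hypothetical penalty strength
reward = 1.0  # hypothetical reward model score

unchanged = kl_penalised_reward(reward, sft_token_probs, sft_token_probs, beta)
drifted = kl_penalised_reward(reward, drifted_policy_token_probs, sft_token_probs, beta)
print(unchanged, drifted)
```

A policy identical to the SFT model pays no penalty, while one that has drifted gets less credit for the same reward. In a real model, the per-token probabilities would come from a softmax over the logits at each position.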

ChatGPT uses the objective from the PPO reinforcement learning algorithm.
This is the objective that training tries to maximise using gradient ascent.

> Maximising the objective is the same as minimising the negative objective.

![](https://github.com/life-efficient/RLHF-Implementation/blob/main/images/RLHF%20LM%20PPO%20Objective.png?raw=1)
<!-- 
## The REINFORCE Objective

> REINFORCE is a reinforcement learning algorithm that PPO (the algorithm we will use) builds upon

The REINFORCE objective function is as follows: -->

<!-- ## PPO -->

<!-- - Averaged over a batch of different responses
- Rewards ratio: $\frac{reward \ with\ new\ params}{reward\ with\ old\ params}$ for the same input prompt    
- Multiplied by the advantage function
- Clipped to not change the policy too much - so that the new policy is in proximity of the other in terms of how much the reward will change -->
<!-- 
### The Rewards Ratio

![](./images/Rewards%20Ratio.png)

- If the reward ratio is > 1, then it means that taking action $a$ in state $s$ is more likely with the new policy compared to the old one.
- If the reward ratio is < 1, then it means that taking action $a$ in state $s$ is less likely with the new policy compared to the old one.

> The ratio of the rewards tells you how drastically the policy is changing per update.

### Clipping the Reward

> If the policy changes too much, the  -->

## Tasks
- Update the reward model so that it uses the SFT model as a starting point, instead of using GPT-2
- Log the generated responses to TensorBoard during training
- Implement proxy batching (gradient accumulation) in your SFT model and reward model to get more accurate gradient updates: perform several forward and backward passes, letting the gradients accumulate, then take their mean and call `optimizer.step()`
- Get the logits from your model and use them to compute the second term in the loss function, using the chain rule of conditional probability to get the probability of a whole response. See how this affects the responses generated.
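
As a starting point for the proxy batching task, here is a framework-free sketch of gradient accumulation on a one-parameter model. The data, learning rate, and `accumulation_steps` are made up for illustration. In PyTorch, the usual equivalent is to divide each loss by the number of accumulation steps before calling `loss.backward()`, and to call `optimizer.step()` and `optimizer.zero_grad()` only once every `accumulation_steps` passes.

```python
def gradient(w, x, y):
    """Derivative with respect to w of the squared error (w * x - y) ** 2."""
    return 2 * (w * x - y) * x


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # hypothetical (x, y) samples
accumulation_steps = 2
learning_rate = 0.01
w = 0.0

accumulated = 0.0
for i, (x, y) in enumerate(data, start=1):
    # One pass on a single sample: the gradient is added up, not applied yet
    accumulated += gradient(w, x, y)
    if i % accumulation_steps == 0:
        # Take the mean over the accumulated passes, then make one update
        w -= learning_rate * accumulated / accumulation_steps
        accumulated = 0.0  # plays the role of optimizer.zero_grad()

print(w)
```

Each update now averages over two samples, which gives a less noisy gradient than stepping after every sample while still only holding one sample in memory at a time.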

## What if the reward model is wrong?

The policy is optimised to maximise the reward model score.

That means everything depends on the reward model being accurate.

Even if the reward model is reasonably accurate, too much fine-tuning via RLHF can make the policy overfit to the reward model and produce responses that humans actually prefer less.
