---
title: How RLHF works?
description: | 
  A walkthrough of reinforcement learning from human feedback (RLHF): the reward model loss and the RL objective.
date: 2023-02-30
---

In [None]:
#| hide
import matplotlib.pyplot as plt
import pandas as pd

import numpy as np
import torch
import torch.nn.functional as F

### How RLHF works?

Paper: [Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2204.05862)

#| column: screen-inset-shaded

![openai-rlhf.png](./images/openai-rlhf.png)

Outline
- Why did ChatGPT take the world by storm?
    + Trained on millions of texts
    + Not aligned with human intent
        + The LM was trained on millions of articles from the internet.

### Explaining RLHF for a kid

### Breaking down RLHF

|  | What does it represent | 
|---------|:-----|
| 12      | 12   |
| 123     | 123  |
| 1       | 1    |

: Notations

output

- Align with human preferences
- Tweak the training loop of GPT

- Supervised fine-tuning (SFT): the pretrained model is fine-tuned on prompts paired with the desired outputs written by labelers
- Reward Model (RM)

#### Start with

- A pretrained language model
- A distribution of prompts on which we want our model to produce aligned outputs
- A team of trained human labelers

#### Steps
- Compare the outputs of different language models on the dataset
- Labelers score each output from these models
- The reward model learns to predict which model output the labelers would prefer
- Use the reward model as the reward function
- Fine-tune the pretrained model to maximize the reward using PPO

### Dataset

### The idea

Align with human preferences

- Reward Model

**Notations**

### Training the reward model

In [None]:
from datasets import load_dataset
dataset = load_dataset("CarperAI/openai_summarize_comparisons", split="train")



$\operatorname{loss}(\theta)=-\frac{1}{\binom{K}{2}} E_{\left(x, y_w, y_l\right) \sim D}\left[\log \left(\sigma\left(r_\theta\left(x, y_w\right)-r_\theta\left(x, y_l\right)\right)\right)\right]$

::: {.column-margin}
Page 8. Formula 1
:::

```python
import torch.nn.functional as F
from torch import nn
from torchtyping import TensorType


class PairwiseLoss(nn.Module):
    def forward(
        self,
        chosen_rewards: TensorType["batch_size", 1],
        rejected_rewards: TensorType["batch_size", 1]
    ) -> TensorType[1]: # A scalar loss
        assert len(chosen_rewards) == len(rejected_rewards)

        # log-probability that the chosen output beats the rejected one:
        # log(sigmoid(r_w - r_l)), computed in a numerically stable way
        log_probs = F.logsigmoid(chosen_rewards - rejected_rewards)
        # mean() already averages over the batch, so no extra division is needed
        return -log_probs.mean()
```

::: {.column-margin}
[reward.py](https://github.com/xrsrke/instructGOOSE/blob/2fca05409bd202ff991b829e500b9b3de1d00ca4/instruct_goose/reward.py#L52)
:::

$y_w$ and $y_l$ denote the outputs that the human labeler preferred and did not prefer, respectively.

The goal is to make the reward model align with human preferences as closely as possible. The loss should therefore be high when the reward model ranks pairs differently from the labelers, and low when it agrees with them.
- $r_\theta\left(x, y_w\right)$ is the scalar reward that the reward model assigns to prompt $x$ and the preferred summary $y_w$
- $r_\theta\left(x, y_l\right)$ is the scalar reward that the reward model assigns to prompt $x$ and the non-preferred summary $y_l$
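
To make the loss concrete, here is a small sketch that evaluates $-\log \sigma\left(r_\theta\left(x, y_w\right)-r_\theta\left(x, y_l\right)\right)$ on made-up reward values. The function name `pairwise_loss` and all the numbers are illustrative assumptions; they are not taken from the paper or from the linked repository.

```python
import torch
import torch.nn.functional as F


def pairwise_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Mean of -log(sigmoid(r_w - r_l)) over a batch of comparison pairs."""
    # logsigmoid is the numerically stable form of log(sigmoid(.))
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Hypothetical rewards where the reward model ranks every pair correctly ...
good = pairwise_loss(torch.tensor([2.0, 1.5, 3.0]), torch.tensor([0.0, -0.5, 1.0]))
# ... and the same rewards with every pair ranked the wrong way round.
bad = pairwise_loss(torch.tensor([0.0, -0.5, 1.0]), torch.tensor([2.0, 1.5, 3.0]))

print(good.item(), bad.item())  # the first value is smaller than the second
```

The loss is low when the preferred output receives the higher reward and grows as the ranking is reversed, which is the behaviour described above.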

##### Dataset

Present labelers with multiple responses to a prompt, and ask them to rank the responses.

https://huggingface.co/datasets/openai/webgpt_comparisons

##### Loss function

With 10 responses ranked for a prompt, every pair of responses is compared, which gives $\binom{10}{2} = 45$ comparisons.
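
As a rough illustration of how one ranking turns into training pairs, the sketch below enumerates every pair from a ranked list of 10 responses. The response names are placeholders, and the assumption that the list is ordered best first is mine, not something stated in the dataset.

```python
from itertools import combinations
from math import comb

# Hypothetical ranking from one labeler, best response first.
ranking = [f"response_{i}" for i in range(10)]

# combinations() keeps the list order, so in every pair the first
# element (y_w) is ranked above the second element (y_l).
pairs = list(combinations(ranking, 2))

print(len(pairs), comb(10, 2))
```

Each of these pairs contributes one $\left(x, y_w, y_l\right)$ term to the loss, and the $1 / \binom{K}{2}$ factor averages over them.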

### SFT Model

1. https://huggingface.co/Dahoas/gpt2-sft-static
2. https://huggingface.co/Dahoas/gptneo-sft-static
3. https://huggingface.co/Dahoas/gptj-sft-static

In [None]:
import torch
import torch.nn.functional as F

In [None]:
x1 = torch.tensor(-0.1)
x2 = torch.tensor(0.3)
x3 = torch.tensor(0.5)
x4 = torch.tensor(0.8)

In [None]:
y = torch.tensor(0.3)


y2 = torch.tensor(0.9)

In [None]:
x = torch.tensor(0.2)  # x was not defined above; 0.2 reproduces the output shown below
F.sigmoid(x)

tensor(0.5498)

In [None]:
F.sigmoid(y).log()

tensor(-0.5544)

In [None]:
F.sigmoid(y2).log()

tensor(-0.3412)

In [None]:
# maximize log likelihood = minimize negative log likelihood

In [None]:
F.sigmoid(x1), F.sigmoid(x2), F.sigmoid(x3), F.sigmoid(x4)

(tensor(0.4750), tensor(0.5744), tensor(0.6225), tensor(0.6900))

In [None]:
torch.tensor(1).log()

tensor(0.)

### Objective Function

$\begin{aligned}\text { objective }(\phi)= & E_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\left[r_\theta(x, y)-\beta \log \left(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)\right]+ \\& \gamma E_{x \sim D_{\text {pretrain }}}\left[\log \left(\pi_\phi^{\mathrm{RL}}(x)\right)\right]\end{aligned}$

```python
from typing import Callable

import torch.nn.functional as F
from torch import nn
from torchtyping import TensorType


class AgentObjective(nn.Module):
    """Agent objective."""
    def __init__(
        self,
        model: Callable, # the language model
        sft_model: Callable, # the reference model
        reward_model: Callable, # the reward model
        gamma: float,
        beta: float
    ):
        super().__init__()
        self.model = model
        self.sft_model = sft_model
        self.reward_model = reward_model
        self.gamma = gamma
        self.beta = beta
        
    def forward(
        self,
        input_ids: TensorType["batch_size", "seq_len"],
        attention_mask: TensorType["batch_size", "seq_len"]
    ) -> TensorType[1]: # A scalar objective value
        """Calculate the objective value given the input ids and attention mask."""
        # log-probabilities over the vocabulary; log_softmax is more stable
        # than taking torch.log of a softmax output
        model_logits = self.model(input_ids, attention_mask)
        model_log_dist = F.log_softmax(model_logits, dim=-1)
        
        sft_logits = self.sft_model(input_ids, attention_mask)
        sft_log_dist = F.log_softmax(sft_logits, dim=-1)
        
        reward_score = self.reward_model(input_ids, attention_mask)
        # log(pi_RL / pi_SFT) = log(pi_RL) - log(pi_SFT)
        log_ratio = model_log_dist - sft_log_dist
        
        # compute the coherence of the generated text
        coherent = model_log_dist
        objective = (reward_score - self.beta*log_ratio).mean() + self.gamma * coherent.mean()
        return objective
```

::: {.column-margin}
[agent.py](https://github.com/xrsrke/instructGOOSE/blob/d6b35b5a571a3933cb4b2059a6b94839b6a0571e/instruct_goose/agent.py#L94)
:::

::: {.column-margin}
Page 9. Formula 2
:::

Question 2: What does $\sigma\left(r_\theta\left(x, y_w\right)-r_\theta\left(x, y_l\right)\right)$ measure?

According to the paper, $y_w$ is the output preferred by the human labeler, while $y_l$ is the output that is not preferred by the human labeler

If the reward model predicts that

- The score for $y_w$ **is larger than** the score for $y_l$
    - Then $r_\theta\left(x, y_w\right)-r_\theta\left(x, y_l\right) > 0$
    - Then the output of the sigmoid function will be closer to 1, indicating a higher probability that the labeler will prefer $y_w$ (which is our target)
- The score for $y_w$ **is less than** the score for $y_l$
    - Then $r_\theta\left(x, y_w\right)-r_\theta\left(x, y_l\right) < 0$
    - Then the output of the sigmoid function will be closer to 0, indicating a lower predicted probability that the labeler will prefer $y_w$ (which isn’t our target)

This term measures the probability that the reward model correctly predicts which output the labeler prefers.
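
A small sketch of this reading of the sigmoid, with made-up reward scores: swapping the two outputs flips the sign of the difference, and the two predicted probabilities sum to 1. The helper name `preference_probability` is hypothetical.

```python
import torch


def preference_probability(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
    """Predicted probability that the first output is preferred over the second."""
    return torch.sigmoid(r_w - r_l)


r_a, r_b = torch.tensor(1.2), torch.tensor(0.4)  # hypothetical reward scores
p_ab = preference_probability(r_a, r_b)  # output a preferred over output b
p_ba = preference_probability(r_b, r_a)  # output b preferred over output a

print(p_ab.item(), p_ba.item())
```

Because $\sigma(d) + \sigma(-d) = 1$, the reward model only needs to get the difference between the two scores right; the absolute values of the scores do not matter for this term.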

-----
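Question 3: Why do we subtract $\beta \log \left(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)$?

- Because it encourages the RL policy to generate sequences similar to those generated by the SFT model, instead of drifting toward outputs that only exploit the reward model.
- The term grows when $\pi_\phi^{\mathrm{RL}}$ assigns a much higher probability to $y$ than $\pi^{\mathrm{SFT}}$ does. A smaller value of this term gives a larger objective, so maximizing the objective keeps the two models close.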
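
One way to sketch this penalty is to score a sampled response under both models and subtract the scaled log-ratio from the reward. Everything below is an assumption for illustration: random logits stand in for the two models, the reward values and `beta` are made up, and `sequence_log_prob` is a hypothetical helper rather than part of any library.

```python
import torch
import torch.nn.functional as F


def sequence_log_prob(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Sum of the log-probabilities of `tokens` under `logits` of shape (batch, seq_len, vocab)."""
    log_probs = F.log_softmax(logits, dim=-1)
    # pick the log-probability of the token that was actually sampled at each position
    token_log_probs = log_probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum(dim=-1)


torch.manual_seed(0)
batch_size, seq_len, vocab_size = 2, 5, 11                    # toy sizes
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))  # sampled response y
rl_logits = torch.randn(batch_size, seq_len, vocab_size)      # stand-in for the RL policy
sft_logits = torch.randn(batch_size, seq_len, vocab_size)     # stand-in for the SFT model
reward = torch.tensor([1.0, 0.5])                             # hypothetical reward-model scores
beta = 0.02                                                   # hypothetical penalty coefficient

# log(pi_RL(y|x) / pi_SFT(y|x)) for each sequence in the batch
log_ratio = sequence_log_prob(rl_logits, tokens) - sequence_log_prob(sft_logits, tokens)
penalized_reward = reward - beta * log_ratio
print(penalized_reward)
```

When the two models agree on a sequence, the log-ratio is zero and the reward passes through unchanged.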

-----
Question 4: What does $E_{x \sim D_{\text {pretrain }}}\left[\log \left(\pi_\phi^{\mathrm{RL}}(x)\right)\right]$ measure? Explain

It measures how well the RL policy still models ordinary text. Note that $x$ here is sampled from the pretraining dataset $D_{\text{pretrain}}$, not generated by the policy.

- The sum of the log probabilities of each token in a text measures how well that text matches the patterns the model has learned
- If the sum of log probabilities is high, the RL policy still assigns high likelihood to pretraining text, so it has kept its general language modeling ability and its generations are more likely to stay coherent.
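
The term can be sketched as a next-token log-likelihood on a batch of pretraining text. This is a simplified illustration under my own assumptions: random logits stand in for the RL policy, the token ids are made up, and the helper `pretrain_log_likelihood` averages per token rather than summing per sequence as the formula does.

```python
import torch
import torch.nn.functional as F


def pretrain_log_likelihood(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Mean log-probability of each next token in `tokens` under `logits`."""
    # Position t predicts token t+1, so drop the last logit and the first token.
    shifted_logits = logits[:, :-1, :]
    targets = tokens[:, 1:]
    # cross_entropy returns the mean negative log-likelihood, so negate it.
    return -F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
    )


torch.manual_seed(0)
tokens = torch.randint(0, 7, (3, 6))  # hypothetical batch of pretraining token ids
logits = torch.randn(3, 6, 7)         # stand-in for the RL policy's outputs

print(pretrain_log_likelihood(logits, tokens))
```

The value is always at most 0, and it moves closer to 0 as the model assigns more probability to the actual pretraining tokens. Scaling it by $\gamma$ and adding it to the objective rewards the policy for staying good at plain language modeling.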