# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid", font_scale=1.5, rc={'figure.figsize':(12, 6)})
from tqdm import tqdm
# for custom notebook formatting.
from IPython.display import HTML, display
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))
HTML(open('../custom.css').read())



<br>

## Natural Language Processing
### :::: Reinforcement Learning from Human Feedback ::::

<br>

<br><br><br><br><br><br><br><br><br>


- Language models trained to predict next word: $p(w_i \mid w_{i-1} \ldots w_1)$
- How do we go from this autocompletion model to an intelligent assistant?

![figs/chatgpt1.png](figs/chatgpt1.png)

> Predicting the next token on a webpage from the internet is different from the objective “follow the user’s instructions helpfully and safely”

[source](https://arxiv.org/pdf/2203.02155.pdf)

## Approach 1: Instruction Fine-tuning

- **Pretraining:** Fit a language model to predict the next word in a sentence. E.g., a transformer model.

<img src="../sequence/figs/gpt.png" width=60%/>

<br><br><br>


- **Supervised Fine-tuning:** Starting with the pretrained model, fit a model to perform a different task. E.g., classification
<img src="../sequence/figs/lstmclf4.png" width="40%"/>

But, this doesn't teach the model how to follow instructions...

How can we take a classification task and turn it into a chat session?



<br><br><br><br><br><br>

- **Supervised Instruction Fine-tuning**: Starting with the pretrained model, fine-tune a word prediction model on examples of instructions.

**Big Trick:** 
- Find benchmarks with questions and answers
- Convert these into word prediction problems

E.g., for a sentiment classification task, the input dialog might be:

**Original data:** 

> "This movie is boring"  **label:** Negative

**Transformed data:** 

> Instructions: Read the following movie review and determine if the author likes or dislikes the movie.

> Input: "This movie is boring"

> Expected Output: The user dislikes the movie.
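The transformation above can be sketched in a few lines. This is only an illustration: the function name `to_instruction_example`, the `LABEL_TO_TEXT` mapping, and the wording for the positive class are assumptions, not part of any benchmark's actual format.

```python
# Hypothetical mapping from class labels to natural-language answers.
# The "Positive" wording is an assumption made for this sketch.
LABEL_TO_TEXT = {
    "Negative": "The user dislikes the movie.",
    "Positive": "The user likes the movie.",
}

INSTRUCTIONS = (
    "Read the following movie review and determine "
    "if the author likes or dislikes the movie."
)

def to_instruction_example(text, label):
    """Turn a (text, label) pair into a (prompt, target) pair for word prediction."""
    prompt = f'Instructions: {INSTRUCTIONS}\nInput: "{text}"\nExpected Output:'
    # The model is fine-tuned to predict the target words given the prompt.
    target = " " + LABEL_TO_TEXT[label]
    return prompt, target

prompt, target = to_instruction_example("This movie is boring", "Negative")
print(prompt + target)
```

During fine-tuning, the word prediction loss would typically be computed on the target words, with the prompt serving as context.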


<br><br>

NLP researchers have been creating such benchmarks for decades:

<img src="figs/tasks.png" width=60%/>

[source](https://arxiv.org/pdf/2204.07705.pdf)

Just need to transform them into instruction following examples:

<img src="figs/entailment.png" width=60%/>

Of course, we can also throw in some tests...

<img src="figs/tests.png" width=60%/>


**Limitations of instruction fine-tuning?**

<br><br><br><br>

- Expensive to create labeled data
- There isn't always an unambiguously correct answer:
  + Write me a haiku about Tulane
- Word prediction task will penalize all errors equally, even if some words are more critical than others

<br><br><br>

## Human Feedback

What are some ways humans can give us feedback to train our chatbot directly?

<br><br><br>

1. We can ask humans to write a response:

**Write a haiku about Tulane**

> Under moss-clad oaks,
> 
> Wisdom blooms in green and blue—
> 
> Tulane's light shines through.

<br><br>

2. We can ask humans to rate the quality of different responses directly.

**Write a haiku about Tulane**

Answer 1: Rating = 8/10

> Under moss-clad oaks,
> 
> Wisdom blooms in green and blue—
> 
> Tulane's light shines through.

Answer 2: Rating = 2/10

> Snow blankets the quad,
> 
> Penguins march where scholars trod—
> 
> Tulane's icy pod.


<br><br><br>

3. We can ask humans to rank two responses, without giving an absolute score.

> Answer 1 > Answer 2

<br><br>

But, how can we use these types of feedback to update the language model?

<br><br>

For the word prediction task, we could compute an error function based on each word in the output. 

E.g., $-\log p(w_i^* \mid w_{i-1} \ldots w_1)$, where $w_i^*$ is the correct word to produce at position $i$.
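As a small worked example of this per-word error, the sketch below computes $-\log p$ for each position of a three-word answer. The probabilities are made-up numbers standing in for what a model might assign to the correct word at each step.

```python
import math

# Hypothetical probabilities the model assigns to the *correct* word
# at positions 1, 2, 3 of the target answer (made-up numbers).
p_correct = [0.50, 0.25, 0.80]

# Per-word error: -log p(w_i* | previous words)
token_losses = [-math.log(p) for p in p_correct]

# Summing over positions gives the loss for the whole answer.
total_loss = sum(token_losses)

for i, loss in enumerate(token_losses, start=1):
    print(f"position {i}: loss = {loss:.3f}")
print(f"total: {total_loss:.3f}")
```

The lower the probability given to the correct word, the larger the error at that position, so every position contributes its own training signal.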

<br>

- But now, we only receive an error / feedback once the **entire answer** is generated.

- If we think of producing the next word in the answer as an **action**, then a language model produces an answer by taking a sequence of word selection actions.

- We would like the language model to take a sequence of actions that maximizes the human feedback given at the end of the answer.

This sounds like a job for ... **reinforcement learning**

<br>

<img src="figs/rl.png" width=80%/>

The goal of Reinforcement Learning is to learn an optimal policy $\pi$ when the reward function $R$ and the transition function $T$ are not known in advance.
> Must interact with the world to learn how it works.

See more in CMPS 6740: Reinforcement Learning and CMPS 6620: Artificial Intelligence.

<br><br>
<img src="figs/rlloop.png" width=80%/>

<br><br>

In robot navigation, we let the robot wander around many times, recording the final reward collected after each trial. 

A final policy might look like this:

<img src="figs/policy.png" width=50%/>

<br><br>



## Finding optimal policies

How do we find an optimal policy? To start, we need to define the value of a state:

<img src="figs/bellman1.png" width=50%/>

<img src="figs/bellman2.png" width=50%/>

<img src="figs/policy2.png" width=40%/>

Many algorithms exist for finding good policies. There is a tradeoff between:
- **exploration**: taking random actions to learn about the search space; and
- **exploitation**: using the best policy learned so far to increase reward

<br><br>

### Policy Gradient 

One class of solutions assumes the policy $\pi_\theta$ is parameterized by $\theta$ (just like our language model is!).

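Gradient ascent (equivalently, gradient descent on the negative reward) is used to find the $\theta$ that maximizes the expected cumulative discounted reward:

$\theta^* = \operatorname{argmax}_\theta \, \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^\infty \gamma^t r_t\right]$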

- $\gamma$: discount factor (e.g., 0.5) that reduces the value of rewards far in the future.
- $r_t$: reward received at time $t$
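To make the discounted sum $\sum_t \gamma^t r_t$ concrete, here is a minimal sketch for a single finite trajectory. The helper name `discounted_return` and the reward values are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one finite trajectory of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical trajectory: no reward until the final step, similar to
# receiving human feedback only after the whole answer is generated.
rewards = [0, 0, 0, 8]

print(discounted_return(rewards, gamma=0.5))  # 8 * 0.5**3 = 1.0
print(discounted_return(rewards, gamma=1.0))  # no discounting: 8.0
```

With $\gamma < 1$, the same reward counts for less the later it arrives.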

<br>

This is harder than the gradient descent we've seen, since the agent must take many actions to compute the gradient.

In practice, we sample action trajectories according to the current policy $\pi_\theta$, and use the resulting rewards to estimate the gradient.
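One common way to estimate the gradient from samples is the score-function (REINFORCE-style) estimator, which weights $\nabla_\theta \log \pi_\theta(a)$ by the reward received. The slides do not name a specific algorithm, so treat the sketch below as one possible instance. It uses a toy one-step problem with two candidate answers, a softmax policy with one logit per answer, and hypothetical rewards of 0.8 and 0.2.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)    # fixed seed so the run is repeatable
rewards = [0.8, 0.2]      # hypothetical rewards (ratings 8/10 and 2/10, rescaled)
theta = [0.0, 0.0]        # policy parameters: one logit per candidate answer
lr = 0.1                  # step size (an arbitrary choice for this toy)

for _ in range(5000):
    probs = softmax(theta)
    # Sample an action from the current policy.
    action = rng.choices([0, 1], weights=probs)[0]
    r = rewards[action]
    # For a softmax policy, d/d theta_k log pi(action) = 1[k == action] - probs[k].
    for k in range(2):
        grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
        theta[k] += lr * r * grad_log_pi   # gradient *ascent* on reward

final_probs = softmax(theta)
print(final_probs)
```

After training, the policy should put most of its probability on the higher-reward answer. Real language-model training has many steps per trajectory and typically adds variance-reduction tricks that this toy omits.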


<br><br><br>
But wait, the human only gave us a reward (feedback) on a few possible answers per question. E.g., 8/10 and 2/10.

<br><br>
Given that the search for the best policy will have to consider many possible answers, **how can we determine the reward for an answer a human has never rated?**

<br><br><br>


## Reward Modeling

Suppose we have human ratings like:

> Answer 1: 8/10
> 
> Answer 2: 2/10

How could we fit a model to predict the numbers 8 and 2 for the two different responses?

<br><br><br>

> Starting from the supervised fine-tuning model with the final unembedding layer removed, we trained a model to take in a prompt and response, and output a scalar reward.

**Input:**

*Instruction:* Write a haiku about Tulane

*Answer:*

> Under moss-clad oaks,
> 
> Wisdom blooms in green and blue—
> 
> Tulane's light shines through.

**Output:**

8

<br><br><br><br>

**What if we only have pairwise rankings?**

Answer 1 > Answer 2

<br><br>
Assume $r_\theta(x,y)$ is the reward predicted for instruction $x$ and answer $y$.

What loss functions could we use here to learn from pairwise rankings?

<br><br><br>

loss: $-\log \sigma(r_\theta(x, y_1) - r_\theta(x, y_2))$, where $y_1$ is the answer the human preferred over $y_2$

- $r_\theta(x,y)$ is the reward predicted for instruction $x$ and response $y$
- $\sigma$ is the logistic function
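The pairwise loss used in the InstructGPT paper is $-\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$, where $y_w$ is the preferred answer and $y_l$ the rejected one. The sketch below evaluates it for a few hypothetical reward values; the function name `pairwise_loss` is an assumption for illustration.

```python
import math

def pairwise_loss(r_preferred, r_rejected):
    """-log sigmoid(r_preferred - r_rejected), computed in a stable way."""
    diff = r_preferred - r_rejected
    # -log(sigmoid(d)) = log(1 + exp(-d)); branch to avoid overflow in exp.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Hypothetical reward-model outputs for (preferred, rejected) answers.
print(pairwise_loss(3.0, 0.0))  # correct ordering, large margin: small loss
print(pairwise_loss(0.0, 0.0))  # no preference expressed: log(2)
print(pairwise_loss(0.0, 3.0))  # wrong ordering: large loss
```

Only the difference between the two rewards matters, so the model learns a relative scale rather than absolute scores.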

<img src="figs/instructgptpaper.png" width=60%/>

[source](https://arxiv.org/pdf/2203.02155.pdf)

<img src="figs/instructgpt.png" width=90%/>



**sources**

- Stanford: https://web.stanford.edu/class/cs224n/slides/cs224n-2024-lecture10-instruction-tuning-rlhf.pdf