```{contents}
```

## PPO (Proximal Policy Optimization)

**PPO** is a reinforcement learning algorithm that improves a policy (an agent’s behavior) in a way that is:

* **stable**
* **efficient**
* **safe**
* **easy to implement**
* **high-performance**

PPO is the algorithm **OpenAI** used for the reinforcement learning from human feedback (RLHF) stage of training ChatGPT.

PPO solves a major problem in RL:

> **How can a policy be updated without changing so much that training becomes unstable?**

It does this by restricting updates to stay within a *safe* region — the “proximal” part.

---

### Why PPO Is Needed (Intuition)

In classic policy gradient RL:

* You take steps to increase rewards
* Big gradient steps can **destroy** the existing policy
* Training becomes unstable or collapses

PPO fixes this.

### Key idea:

> Take the *biggest possible improvement step* **without deviating too far** from the original policy.

It is like saying:

* “Improve your behavior… but **don’t change too fast**.”

---

### PPO in the RLHF Context (ChatGPT)

In RLHF:

1. The **LLM** = policy
2. **Reward Model (RM)** gives reward to generated answers
3. PPO updates the LLM to **maximize reward**

The objective becomes:

```
Make the model more likely to produce high-reward answers,
while staying close to the original SFT model.
```

This helps limit:

* reward over-optimization (exploiting flaws in the reward model)
* drift toward toxic or degenerate outputs
* forgetting instruction-following behavior learned during SFT
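
One common way to express "maximize reward while staying close to the SFT model" is to subtract a KL-style penalty from the reward-model score. The sketch below is a minimal illustration under that assumption, not the confirmed procedure for any particular model; `kl_shaped_reward`, `beta`, and all numbers are hypothetical and chosen only for the example.

```python
def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference (SFT) model.

    logp_policy / logp_ref: per-token log-probabilities of the sampled response
    under the current policy and the frozen reference model (assumed inputs).
    """
    # Sum of per-token log-ratios: a single-sample estimate of KL(policy || ref).
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate


# Hypothetical numbers: the policy assigns more probability to its own
# response than the reference does, so the penalty is positive.
shaped = kl_shaped_reward(
    rm_score=2.0,
    logp_policy=[-1.0, -0.5, -0.2],
    logp_ref=[-1.5, -1.0, -0.7],
)
print(round(shaped, 3))
```

A larger `beta` keeps the policy closer to the reference model at the cost of slower reward improvement.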

---

### PPO Core Formula (Simple Version)

PPO modifies the policy update using a **clipped objective**:

$$
L_{PPO} = \min \Big(
r(\theta) A,\
\text{clip}(r(\theta),\ 1-\epsilon,\ 1+\epsilon) A
\Big)
$$

Where:

* $r(\theta) = \frac{\pi_\theta(a|s)}{\pi_{old}(a|s)}$
  Ratio of new policy to old policy

* $A$ = advantage (how much better the action was than expected)

* $\epsilon$ = small clipping range, typically 0.1 to 0.2

### Meaning:

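* If the update would move $r(\theta)$ outside $[1-\epsilon,\ 1+\epsilon]$ in the direction the advantage favors → the clipped term wins the $\min$ and the gradient for that sample vanishes
* If $r(\theta)$ stays inside that range → the policy improves as in an ordinary policy-gradient step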

This ensures **stable**, **bounded** updates.
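
To make the clipping behavior concrete, here is a small per-sample sketch of the clipped objective in plain Python. The function name `clipped_objective` and the default `eps=0.2` are assumptions made for illustration; real implementations work on batches of tensors.

```python
def clipped_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0): pushing the ratio past 1 + eps earns no extra objective.
good = clipped_objective(ratio=1.5, advantage=2.0)    # capped at 1.2 * 2.0
# Bad action (A < 0): dropping the ratio below 1 - eps earns no extra objective.
bad = clipped_objective(ratio=0.5, advantage=-1.0)    # capped at 0.8 * -1.0
# Inside the allowed range the objective is the plain ratio * advantage.
inside = clipped_objective(ratio=1.1, advantage=2.0)
print(good, bad, inside)
```

Because the objective stops growing once the ratio leaves the allowed range, there is no incentive to push the policy further on that sample.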

---

### PPO’s Key Innovations

#### **1. Clipped Ratio**

Prevents big destructive updates.

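#### **2. KL Constraint**

Keeps the new model close to the old one. The clipped objective does not enforce a KL limit directly; the original PPO paper describes a separate variant that uses an adaptive KL penalty instead of clipping, and RLHF setups typically add a KL penalty against the SFT model to the reward.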

#### **3. Surrogate Objective**

Optimizes expected reward in a stable way.

#### **4. Multiple Mini-Batch Updates**

Efficient and GPU-friendly.

#### **5. On-Policy but Sample-Efficient**

Unlike vanilla policy gradient, which uses each batch of collected data for a single gradient step, PPO reuses it for several steps.
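
Points 4 and 5 can be pictured as a schedule: one batch of collected transitions is shuffled and split into mini-batches several times over. The helper below is a hypothetical sketch (the name `minibatch_indices` and its parameters are not from any library) that only produces the index schedule; the gradient step itself is omitted.

```python
import random


def minibatch_indices(n_samples, batch_size, n_epochs, seed=0):
    """Yield (epoch, indices) pairs that reuse one rollout for several passes."""
    rng = random.Random(seed)
    for epoch in range(n_epochs):
        order = list(range(n_samples))
        rng.shuffle(order)  # reshuffle each epoch so mini-batches differ
        for start in range(0, n_samples, batch_size):
            yield epoch, order[start:start + batch_size]


# 8 collected transitions, reused for 3 epochs in mini-batches of 4.
batches = list(minibatch_indices(n_samples=8, batch_size=4, n_epochs=3))
print(len(batches))  # 3 epochs * 2 mini-batches per epoch = 6 gradient steps
```

Each transition is therefore used in three gradient steps rather than one, which is where the sample-efficiency gain over a single-step policy gradient comes from.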

---

### PPO Training Loop (Conceptual)

#### Step 1 — Generate Data

The model generates several candidate answers to prompts.

#### Step 2 — Reward Scoring

A reward model scores answers:

* Safe?
* Helpful?
* Correct?
* Not toxic?

#### Step 3 — Policy Update Using PPO

PPO adjusts the LLM:

* increase probability of high-scored answers
* decrease probability of low-scored answers
* keep the model close to the original policy

This loop repeats thousands of times.

---

### Simple Example (Intuition)

Prompt:

```
Explain gravity.
```

The model generates 3 answers:

| Answer              | Reward |
| ------------------- | ------ |
| A (good scientific) | +6     |
| B (half wrong)      | +2     |
| C (nonsense)        | -3     |

PPO update:

* Increase probability of **A**
* Slightly increase **B** (but less)
* Decrease probability of **C**
* Prevent the model from changing too much
* Ensure it still follows instructions

Over time → The model becomes aligned.
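
The direction of each update in this example follows from the advantage of each answer. The sketch below uses the average reward of the three answers as the baseline, which is a simplifying assumption for illustration; PPO as usually described uses a learned value function for that role.

```python
rewards = {"A": 6.0, "B": 2.0, "C": -3.0}

# Baseline: the average reward over the sampled answers (an assumption;
# a learned value function would normally play this role).
baseline = sum(rewards.values()) / len(rewards)

# Advantage = how much better an answer scored than the baseline.
advantages = {name: r - baseline for name, r in rewards.items()}

for name, adv in advantages.items():
    direction = "increase" if adv > 0 else "decrease"
    print(f"{name}: advantage {adv:+.2f} -> {direction} probability")
```

Answer B ends up with a small positive advantage, which is why its probability rises only slightly, while A gets the largest push and C is pushed down.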

---

### PPO in ChatGPT Training

PPO is used in RLHF:

```
Base Pretrained Model
→ SFT Model
→ PPO using Reward Model
→ Final ChatGPT
```

During PPO:

* High-quality answers get high rewards
* Harmful or hallucinated answers get low rewards
* Model is nudged towards preferred behavior
* Safety and helpfulness improve, as far as the reward model captures them
* Chat-like behavior is reinforced

---

### PPO vs DPO

| Feature            | PPO    | DPO       |
| ------------------ | ------ | --------- |
| Needs Reward Model | Yes    | No        |
| Uses RL            | Yes    | No        |
| Training Stability | Medium | High      |
| Cost               | High   | Low       |
| Simplicity         | Low    | Very high |
| Alignment Quality  | High   | High      |

Many open-source models are now trained with **DPO** because it is much simpler to run.

OpenAI's published RLHF work (InstructGPT) used **PPO + RM**.
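
To show why DPO needs neither a reward model nor an RL loop, here is a minimal sketch of its loss for a single preference pair. It assumes the sequence log-probabilities under the policy and under a frozen reference model are already available; the function name, argument names, and `beta=0.1` are illustrative choices.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy favors each answer than the reference model does.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as log(1 + exp(-logits)).
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: the loss equals log(2).
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen answer: the loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(neutral, improved)
```

The loss is an ordinary supervised objective over preference pairs, so it can be minimized with standard gradient descent and no sampling step.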

---

**One-Sentence Summary**

**PPO is a reinforcement learning algorithm that updates a model’s behavior by increasing good actions and limiting destructive updates, forming the core of RLHF used to train aligned LLMs like ChatGPT.**



### Demo

#### **1. Install Dependencies**

```bash
pip install gymnasium torch numpy
```

---

#### **2. PPO Implementation (Minimal, Complete, Works Anywhere)**

```python
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# --------------------------
#  Policy + Value Networks
# --------------------------
class ActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(4, 64),
            nn.ReLU(),
        )
        self.policy = nn.Linear(64, 2)   # 2 actions
        self.value = nn.Linear(64, 1)    # state value

    def forward(self, x):
        h = self.fc(x)
        return self.policy(h), self.value(h)

# --------------------------
#  PPO Agent
# --------------------------
class PPO:
    def __init__(self, lr=3e-4, gamma=0.99, eps_clip=0.2):
        self.gamma = gamma
        self.eps_clip = eps_clip
        self.model = ActorCritic()
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr)

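    def select_action(self, state):
        # Rollout only: no gradients are needed here. The detached log-prob
        # serves as the fixed "old policy" term when update() builds the ratio;
        # keeping its graph would make the second update epoch fail in backward().
        state = torch.as_tensor(state, dtype=torch.float32)
        with torch.no_grad():
            logits, value = self.model(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()

        return action.item(), dist.log_prob(action), value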

    def compute_returns(self, rewards, values, dones, gamma):
        returns = []
        R = 0
        for r, d in zip(reversed(rewards), reversed(dones)):
            if d: R = 0
            R = r + gamma * R
            returns.insert(0, R)
        return torch.tensor(returns)

    # --------------------------
    #  PPO Update (Core Logic)
    # --------------------------
    def update(self, states, actions, old_log_probs, returns, advantages):
        for _ in range(5):  # 5 epochs
            logits, values = self.model(states)
            probs = torch.softmax(logits, dim=-1)
            dist = torch.distributions.Categorical(probs)

            new_log_probs = dist.log_prob(actions)
            ratio = torch.exp(new_log_probs - old_log_probs)

            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1-self.eps_clip, 1+self.eps_clip) * advantages

            actor_loss = -torch.min(surr1, surr2).mean()
            critic_loss = (returns - values.squeeze())**2

            loss = actor_loss + critic_loss.mean()

            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

# --------------------------
#  Training Loop
# --------------------------
env = gym.make("CartPole-v1")
agent = PPO()

episodes = 200

for ep in range(episodes):
    state, _ = env.reset()
    done = False

    states, actions, log_probs, rewards, values, dones = [], [], [], [], [], []

    while not done:
        action, log_prob, value = agent.select_action(state)

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # store transitions
        states.append(state)
        actions.append(action)
        log_probs.append(log_prob)
        rewards.append(reward)
        values.append(value.squeeze())
        dones.append(done)

        state = next_state

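    # Convert lists to tensors (stack the NumPy states first; building a
    # tensor directly from a list of arrays is slow)
    states = torch.as_tensor(np.array(states), dtype=torch.float32)
    actions = torch.tensor(actions)
    # Old log-probs must be constants, otherwise repeated epochs would try to
    # backpropagate through the rollout graph a second time
    old_log_probs = torch.stack(log_probs).detach()
    values = torch.stack(values)

    returns = agent.compute_returns(rewards, values, dones, agent.gamma)
    advantages = returns - values.detach()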

    # PPO update
    agent.update(states, actions, old_log_probs, returns, advantages)

    print(f"Episode {ep+1}, Total Reward: {sum(rewards)}")
```

---

#### **3. What This Code Demonstrates (Mapping PPO Concepts)**

| PPO Concept                 | Shown in Code                                |
| --------------------------- | -------------------------------------------- |
| **Policy network**          | `self.policy`                                |
| **Value network**           | `self.value`                                 |
| **Action sampling**         | `dist.sample()`                              |
| **Old vs new policy ratio** | `ratio = exp(new_log_probs - old_log_probs)` |
| **Clipped loss**            | `torch.clamp(ratio, 1-eps, 1+eps)`           |
| **Advantage calculation**   | `advantages = returns - values`              |
| **Multiple update epochs**  | `for _ in range(5)`                          |
| **On-policy learning**      | uses data from the latest episode only       |

This is a simplified version of the same PPO setup used in LLM RLHF, where:

* **Policy** = LLM
* **Value function** = "critic" for evaluating responses
* **Rewards** = from reward model
* **Advantages** = quality above baseline
* **Clipping** = protects model from drifting too far
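
The return calculation that feeds the advantages can be checked by hand on a tiny episode. The sketch below mirrors the logic of `compute_returns` from the demo in plain Python; the three-step episode and `gamma=0.5` are made-up values chosen so the arithmetic is easy to follow.

```python
def discounted_returns(rewards, dones, gamma):
    """Return-to-go for each step, resetting at episode boundaries."""
    returns = []
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for r, d in zip(reversed(rewards), reversed(dones)):
        if d:
            running = 0.0  # nothing carries over past the end of an episode
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


result = discounted_returns([1.0, 1.0, 1.0], [False, False, True], gamma=0.5)
print(result)  # [1.75, 1.5, 1.0]
```

Subtracting the critic's value estimate from each of these returns gives the advantage used in the clipped loss.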

---

#### **4. What You’ll See When Running**

Training starts at:

```
Total Reward: ~10–20
```

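Over episodes, reward tends to increase (exact numbers vary from run to run):

```
Total Reward: 80
Total Reward: 150
Total Reward: 500 (max for CartPole-v1)
```

CartPole counts as solved once the pole stays balanced consistently. With only 200 episodes and one update per episode, this minimal version may not reach the maximum on every run.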

This is PPO learning **through rewards**, like RLHF—but with a simple environment.

---

| Concept                | What It Teaches                   | Data Needed                     | Algorithm Used | Goal                             | Difficulty |
| ---------------------- | --------------------------------- | ------------------------------- | -------------- | -------------------------------- | ---------- |
| **Instruction Tuning** | Follow instructions               | Instruction → Response          | Cross-entropy  | Make model task-aware            | Easy       |
| **SFT**                | Provide examples of good behavior | Labeled responses               | Cross-entropy  | Make model act like an assistant | Easy       |
| **RLHF**               | Learn human preferences           | Preference pairs + reward model | PPO            | Align model to human values      | Hard       |
| **PPO**                | Stable policy updates             | Reward signals                  | PPO RL         | Improve model using reward model | Hard       |
| **DPO**                | Prefer good answers directly      | Preference pairs                | DPO loss       | Alignment without RL             | Medium     |
