# PPO1 (PPO-Clip) — low-level PyTorch implementation

**Goal:** implement the classic *clipped surrogate objective* version of Proximal Policy Optimization (often referred to as **PPO1** in older codebases) using **plain PyTorch** (no RL libraries), and visualize:

- policy probability ratios \(r_t\) and clipping behavior (Plotly)
- learning curves and **reward per episode** (Plotly)

This notebook is designed to be **offline-friendly** and runs on `CartPole-v1` (Gymnasium).


## Notebook roadmap

1. PPO1 objective: intuition + the clipped surrogate (LaTeX)
2. Environment (`CartPole-v1`)
3. A minimal PyTorch actor-critic
4. Rollout collection + GAE(\(\gamma,\lambda\))
5. PPO clipped update (multiple epochs + minibatches)
6. Training loop on CartPole
7. Plotly: reward per episode
8. Plotly: PPO diagnostics over updates
9. Plotly: policy ratios and clipping
10. Stable-Baselines PPO1 reference implementation
11. Stable-Baselines `PPO1` hyperparameters (what they do + tuning tips)


## Prerequisites

- Python + PyTorch
- Gymnasium (`gymnasium`)
- Plotly

Everything is self-contained (no downloads).


In [None]:
import math
import random
from dataclasses import dataclass

import numpy as np
import pandas as pd

import plotly
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

import torch
import torch.nn as nn
import torch.nn.functional as F

import gymnasium as gym

pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

print("torch:", torch.__version__)
print("gymnasium:", gym.__version__)
print("plotly:", plotly.__version__)


In [None]:
# --- Reproducibility ---
SEED = 7
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# --- Run configuration ---
FAST_RUN = True  # set False for longer training

ENV_ID = "CartPole-v1"

ROLLOUT_STEPS = 512 if FAST_RUN else 2048
N_UPDATES = 40 if FAST_RUN else 200
TOTAL_TIMESTEPS = N_UPDATES * ROLLOUT_STEPS

UPDATE_EPOCHS = 4
MINIBATCH_SIZE = 128

GAMMA = 0.99
GAE_LAMBDA = 0.95

CLIP_EPS = 0.2
LEARNING_RATE = 3e-4
ADAM_EPS = 1e-5

ENT_COEF = 0.0
VF_COEF = 0.5
MAX_GRAD_NORM = 0.5

# Extra logging
LOG_EVERY_UPDATES = 1

# Device (suppress noisy CUDA init warnings in restricted environments)
import warnings

warnings.filterwarnings('ignore', message='CUDA initialization:.*')

cuda_ok = bool(torch.cuda.is_available())
device = torch.device("cuda" if cuda_ok else "cpu")
print("device:", device)
print("updates:", N_UPDATES)
print("total_timesteps:", TOTAL_TIMESTEPS)


## 1) PPO1 / PPO-Clip objective (clipped surrogate)

PPO maintains a *current* policy \(\pi_{\theta}\) and a *behavior* (old) policy \(\pi_{\theta_{\mathrm{old}}}\) that generated a batch of data.

Define the probability ratio:

\[
r_t(\theta) = \frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}
\]

Let \(A_t\) be an advantage estimate (commonly **GAE**). The clipped surrogate objective is:

\[
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\Big( r_t(\theta)A_t,\; \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,A_t \Big)\right]
\]

Intuition:

- If \(A_t > 0\): we *want* \(\pi_\theta\) to increase the probability of \(a_t\), but the gain is **capped** once \(r_t\) exceeds \(1+\epsilon\), so there is no incentive to push the ratio further.
- If \(A_t < 0\): we *want* \(\pi_\theta\) to decrease the probability of \(a_t\), but the gain is **capped** once \(r_t\) falls below \(1-\epsilon\).

In code we typically *minimize* the negative objective: `policy_loss = -mean(min(...))`.
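
To make the two cases concrete, here is a minimal plain-Python sketch of the per-sample surrogate and its derivative with respect to \(r_t\). The helper names (`clip`, `clipped_surrogate`, `surrogate_grad_wrt_ratio`) are illustrative and not part of the training code in this notebook, and the derivative is written out by hand rather than obtained from autograd.

```python
def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def clipped_surrogate(ratio: float, adv: float, eps: float = 0.2) -> float:
    """Per-sample PPO-Clip objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    return min(ratio * adv, clip(ratio, 1.0 - eps, 1.0 + eps) * adv)


def surrogate_grad_wrt_ratio(ratio: float, adv: float, eps: float = 0.2) -> float:
    """d(objective)/d(ratio): A while the unclipped term is selected, else 0."""
    clip_active = (adv > 0 and ratio > 1.0 + eps) or (adv < 0 and ratio < 1.0 - eps)
    return 0.0 if clip_active else adv


# (ratio, advantage) pairs covering both signs of A_t and both sides of the clip range
for r, a in [(1.1, 1.0), (1.5, 1.0), (0.5, 1.0), (0.9, -1.0), (0.5, -1.0), (1.5, -1.0)]:
    print(
        f"r={r:.1f} A={a:+.1f} "
        f"objective={clipped_surrogate(r, a):+.2f} "
        f"grad={surrogate_grad_wrt_ratio(r, a):+.1f}"
    )
```

Note the asymmetry: for \(A_t > 0\) with \(r_t = 0.5\), or \(A_t < 0\) with \(r_t = 1.5\), the gradient is still non-zero. The `min` only removes the incentive to move further in the favorable direction; it does not block corrections when the ratio has drifted the wrong way.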


## 2) Environment

We use `CartPole-v1` (discrete actions, low-dimensional state). PPO also works for continuous actions; the PPO1 clipping logic is the same.


In [None]:
env = gym.make(ENV_ID)
env.action_space.seed(SEED)

obs_dim = int(np.prod(env.observation_space.shape))
assert isinstance(env.action_space, gym.spaces.Discrete)
action_dim = env.action_space.n

print("obs_dim:", obs_dim)
print("action_dim:", action_dim)


## 3) Low-level PyTorch actor-critic

We implement:

- **actor**: outputs logits for a categorical action distribution
- **critic**: outputs state-value \(V(s)\)

No helper RL libraries; just `torch`.
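
As a rough illustration of what the actor's logits mean, the sketch below computes the log-probability of an action and the entropy of a categorical distribution by hand. It uses only the standard library; `log_softmax` and `categorical_stats` are hypothetical helper names, intended to mirror what `torch.distributions.Categorical(logits=...)` provides through `log_prob` and `entropy`.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def categorical_stats(logits: list[float], action: int) -> tuple[float, float]:
    """Return (log pi(action), entropy) for a categorical distribution over logits."""
    log_p = log_softmax(logits)
    entropy = -sum(math.exp(lp) * lp for lp in log_p)
    return log_p[action], entropy


# Equal logits give a uniform policy over the two CartPole actions.
logp, ent = categorical_stats([0.0, 0.0], action=1)
print(logp, ent)
```

With equal logits the log-probability is \(\log 0.5\) and the entropy is \(\log 2\), the maximum for two actions.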


In [None]:
class ActorCritic(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden, action_dim)
        self.value_head = nn.Linear(hidden, 1)

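        # Orthogonal init (gain sqrt(2), zero biases) is a common PPO default
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.orthogonal_(m.weight, gain=math.sqrt(2))
                nn.init.constant_(m.bias, 0.0)
        # A small gain on the policy head keeps the initial action distribution near uniform
        nn.init.orthogonal_(self.policy_head.weight, gain=0.01)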

    def forward(self, obs: torch.Tensor):
        x = self.backbone(obs)
        logits = self.policy_head(x)
        value = self.value_head(x).squeeze(-1)
        return logits, value

    @torch.no_grad()
    def act(self, obs: torch.Tensor):
        logits, value = self.forward(obs)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
        return action, log_prob, entropy, value

    def evaluate_actions(self, obs: torch.Tensor, actions: torch.Tensor):
        logits, value = self.forward(obs)
        dist = torch.distributions.Categorical(logits=logits)
        log_prob = dist.log_prob(actions)
        entropy = dist.entropy()
        return log_prob, entropy, value


agent = ActorCritic(obs_dim, action_dim).to(device)
optimizer = torch.optim.Adam(agent.parameters(), lr=LEARNING_RATE, eps=ADAM_EPS)


## 4) Rollouts + GAE

We collect an on-policy rollout of length `ROLLOUT_STEPS`, then compute:

- advantages \(A_t\) via **Generalized Advantage Estimation** (GAE)
- returns \(R_t = A_t + V(s_t)\)

Finally we do multiple epochs of minibatch optimization on the same rollout.
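
The recursion behind GAE is \(\delta_t = r_t + \gamma V(s_{t+1})(1-d_t) - V(s_t)\) and \(A_t = \delta_t + \gamma\lambda(1-d_t)A_{t+1}\), where \(d_t\) marks the end of an episode. The following standalone sketch (plain Python lists, with a hypothetical helper named `gae`) runs it on a three-step episode. The values \(\gamma = \lambda = 1\) are chosen only so that the result can be checked by hand: in that case the returns reduce to the Monte Carlo reward-to-go.

```python
def gae(rewards, values, dones, last_value, gamma, lam):
    """Backward GAE recursion over one rollout; returns (advantages, returns)."""
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        nonterminal = 0.0 if dones[t] else 1.0
        next_value = last_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
    returns = [a + v for a, v in zip(adv, values)]
    return adv, returns


# Three-step episode that terminates on the last step; constant critic V(s) = 0.5.
adv, ret = gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.5, 0.5, 0.5],
    dones=[False, False, True],
    last_value=0.0,
    gamma=1.0,
    lam=1.0,
)
print(adv)  # [2.5, 1.5, 0.5]
print(ret)  # [3.0, 2.0, 1.0]
```

Because the episode terminates at the last step, `last_value` is masked out and has no effect on the result.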


In [None]:
@dataclass
class Rollout:
    obs: torch.Tensor
    actions: torch.Tensor
    log_probs: torch.Tensor
    values: torch.Tensor
    rewards: torch.Tensor
    dones: torch.Tensor
    advantages: torch.Tensor
    returns: torch.Tensor


def compute_gae(
    rewards: np.ndarray,
    values: np.ndarray,
    dones: np.ndarray,
    last_value: float,
    *,
    gamma: float,
    lam: float,
):
    """GAE for a single-environment rollout."""
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        next_nonterminal = 1.0 - float(dones[t])
        next_value = last_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        adv[t] = gae
    ret = adv + values
    return adv, ret


In [None]:
def collect_rollout(env, agent: ActorCritic, rollout_steps: int, obs: np.ndarray):
    obs_list = []
    action_list = []
    logp_list = []
    value_list = []
    reward_list = []
    done_list = []

    episode_returns = []
    ep_return = 0.0

    for _ in range(rollout_steps):
        obs_tensor = torch.tensor(obs, dtype=torch.float32, device=device).unsqueeze(0)
        action, logp, entropy, value = agent.act(obs_tensor)

        action_item = int(action.item())
        next_obs, reward, terminated, truncated, _ = env.step(action_item)
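        # NOTE: a time-limit truncation (CartPole-v1 stops at 500 steps) is treated
        # like a true termination here, so no value is bootstrapped across it.
        done = bool(terminated or truncated)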

        obs_list.append(obs)
        action_list.append(action_item)
        logp_list.append(float(logp.item()))
        value_list.append(float(value.item()))
        reward_list.append(float(reward))
        done_list.append(done)

        ep_return += float(reward)

        obs = next_obs
        if done:
            episode_returns.append(ep_return)
            ep_return = 0.0
            obs, _ = env.reset()

    # Bootstrap value at the end of rollout
    with torch.no_grad():
        obs_tensor = torch.tensor(obs, dtype=torch.float32, device=device).unsqueeze(0)
        _, last_value = agent.forward(obs_tensor)
        last_value = float(last_value.item())

    obs_arr = np.asarray(obs_list, dtype=np.float32)
    actions_arr = np.asarray(action_list, dtype=np.int64)
    logp_arr = np.asarray(logp_list, dtype=np.float32)
    values_arr = np.asarray(value_list, dtype=np.float32)
    rewards_arr = np.asarray(reward_list, dtype=np.float32)
    dones_arr = np.asarray(done_list, dtype=np.bool_)

    adv_arr, ret_arr = compute_gae(
        rewards_arr,
        values_arr,
        dones_arr,
        last_value,
        gamma=GAMMA,
        lam=GAE_LAMBDA,
    )

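    # Advantage normalization (over the whole rollout) is a common PPO trick.
    # `ret_arr` was computed from the raw advantages above and is left unchanged.
    adv_arr = (adv_arr - adv_arr.mean()) / (adv_arr.std() + 1e-8)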

    rollout = Rollout(
        obs=torch.tensor(obs_arr, dtype=torch.float32, device=device),
        actions=torch.tensor(actions_arr, dtype=torch.int64, device=device),
        log_probs=torch.tensor(logp_arr, dtype=torch.float32, device=device),
        values=torch.tensor(values_arr, dtype=torch.float32, device=device),
        rewards=torch.tensor(rewards_arr, dtype=torch.float32, device=device),
        dones=torch.tensor(dones_arr.astype(np.float32), dtype=torch.float32, device=device),
        advantages=torch.tensor(adv_arr, dtype=torch.float32, device=device),
        returns=torch.tensor(ret_arr, dtype=torch.float32, device=device),
    )

    return rollout, episode_returns, obs


## 5) PPO1 update step

For each rollout batch we optimize the clipped surrogate objective over several epochs/minibatches.

We also log ratio statistics so we can visualize clipping.
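
The ratio statistics can be sketched in isolation. The plain-Python example below computes the clip fraction and the simple KL estimate \(\mathbb{E}[\log\pi_{\mathrm{old}} - \log\pi_\theta]\) that this notebook logs. As an assumption that goes beyond the notebook, it also shows the alternative estimator \(\mathbb{E}[(r_t - 1) - \log r_t]\), which is non-negative for every sample. The function name `ratio_diagnostics` and the toy probabilities are illustrative only.

```python
import math


def ratio_diagnostics(old_logp, new_logp, eps=0.2):
    """Ratio statistics for one minibatch of per-sample log-probabilities."""
    log_ratio = [n - o for n, o in zip(new_logp, old_logp)]
    ratio = [math.exp(lr) for lr in log_ratio]
    n = len(ratio)
    # Fraction of samples whose ratio left the interval [1 - eps, 1 + eps]
    clip_frac = sum(abs(r - 1.0) > eps for r in ratio) / n
    # Simple estimator used in this notebook: E[log pi_old - log pi_new]
    kl_simple = sum(-lr for lr in log_ratio) / n
    # Alternative (assumption, not used in the notebook): E[(r - 1) - log r] >= 0
    kl_nonneg = sum((r - 1.0) - lr for r, lr in zip(ratio, log_ratio)) / n
    return {"clip_frac": clip_frac, "kl_simple": kl_simple, "kl_nonneg": kl_nonneg}


# Old policy assigned probability 0.5 to each sampled action; the new policy shifted.
old = [math.log(0.5)] * 4
new = [math.log(0.5), math.log(0.55), math.log(0.45), math.log(0.7)]
stats = ratio_diagnostics(old, new)
print(stats)
```

On this toy batch only the last sample (ratio 1.4) is outside the clip range, and the simple estimator comes out negative. That can happen on small minibatches, which is one reason to read `approx_kl` as a rough trend indicator rather than an exact divergence.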


In [None]:
def ppo_update(agent: ActorCritic, optimizer: torch.optim.Optimizer, rollout: Rollout):
    batch_size = rollout.obs.shape[0]
    b_inds = np.arange(batch_size)

    policy_losses = []
    value_losses = []
    entropies = []
    clip_fracs = []
    approx_kls = []

    for _ in range(UPDATE_EPOCHS):
        np.random.shuffle(b_inds)
        for start in range(0, batch_size, MINIBATCH_SIZE):
            end = start + MINIBATCH_SIZE
            mb_inds = b_inds[start:end]

            obs_b = rollout.obs[mb_inds]
            actions_b = rollout.actions[mb_inds]
            old_logp_b = rollout.log_probs[mb_inds]
            adv_b = rollout.advantages[mb_inds]
            ret_b = rollout.returns[mb_inds]

            new_logp, entropy, value = agent.evaluate_actions(obs_b, actions_b)

            log_ratio = new_logp - old_logp_b
            ratio = log_ratio.exp()

            # PPO clipped surrogate
            unclipped = ratio * adv_b
            clipped = ratio.clamp(1.0 - CLIP_EPS, 1.0 + CLIP_EPS) * adv_b
            policy_loss = -torch.mean(torch.minimum(unclipped, clipped))

            value_loss = F.mse_loss(value, ret_b)
            entropy_mean = torch.mean(entropy)

            loss = policy_loss + VF_COEF * value_loss - ENT_COEF * entropy_mean

            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            nn.utils.clip_grad_norm_(agent.parameters(), MAX_GRAD_NORM)
            optimizer.step()

            # Diagnostics (based on `new_logp`/`ratio` evaluated before this optimizer step)
            with torch.no_grad():
                approx_kl = torch.mean(old_logp_b - new_logp).item()
                clip_frac = torch.mean((torch.abs(ratio - 1.0) > CLIP_EPS).float()).item()

            policy_losses.append(policy_loss.item())
            value_losses.append(value_loss.item())
            entropies.append(entropy_mean.item())
            clip_fracs.append(clip_frac)
            approx_kls.append(approx_kl)

    return {
        "policy_loss": float(np.mean(policy_losses)),
        "value_loss": float(np.mean(value_losses)),
        "entropy": float(np.mean(entropies)),
        "clip_frac": float(np.mean(clip_fracs)),
        "approx_kl": float(np.mean(approx_kls)),
    }


## 6) Train PPO1 on CartPole

We train for `TOTAL_TIMESTEPS` and record:

- reward per episode
- PPO diagnostics (losses, clip fraction, KL)
- a final batch of ratios/advantages for plotting


In [None]:
episode_rewards = []
logs = []

last_ratio_snapshot = None
last_adv_snapshot = None
last_clip_active_snapshot = None

obs, _ = env.reset(seed=SEED)

for update in range(1, N_UPDATES + 1):
    rollout, ep_returns, obs = collect_rollout(env, agent, ROLLOUT_STEPS, obs)
    episode_rewards.extend(ep_returns)

    metrics = ppo_update(agent, optimizer, rollout)

    # Capture ratio/adv snapshots (after the update) for visualization
    with torch.no_grad():
        new_logp, _, _ = agent.evaluate_actions(rollout.obs, rollout.actions)
        ratio = (new_logp - rollout.log_probs).exp()
        adv = rollout.advantages
        clip_active = ((adv >= 0) & (ratio > 1.0 + CLIP_EPS)) | (
            (adv < 0) & (ratio < 1.0 - CLIP_EPS)
        )

        last_ratio_snapshot = ratio.detach().cpu().numpy()
        last_adv_snapshot = adv.detach().cpu().numpy()
        last_clip_active_snapshot = clip_active.detach().cpu().numpy().astype(bool)

    logs.append({"update": update, "timesteps": update * ROLLOUT_STEPS, **metrics, "episodes": len(episode_rewards)})

    if update % LOG_EVERY_UPDATES == 0:
        recent = episode_rewards[-10:]
        recent_mean = float(np.mean(recent)) if recent else float("nan")
        print(
            f"update {update:>3}/{N_UPDATES} | "
            f"episodes={len(episode_rewards):>4} | "
            f"recent_reward_mean(10)={recent_mean:>7.2f} | "
            f"clip_frac={metrics['clip_frac']:.3f} | "
            f"approx_kl={metrics['approx_kl']:.4f}"
        )


In [None]:
df_logs = pd.DataFrame(logs)
df_logs.head()


## 7) Plotly: reward per episode (learning curve)

This is the most direct signal for whether the policy is improving.


In [None]:
df_ep = pd.DataFrame({"episode": np.arange(len(episode_rewards)), "reward": episode_rewards})

window = 20
if len(df_ep) >= window:
    df_ep["reward_ma"] = df_ep["reward"].rolling(window).mean()

fig = px.line(df_ep, x="episode", y="reward", title="CartPole reward per episode")
if "reward_ma" in df_ep.columns:
    fig.add_trace(
        go.Scatter(x=df_ep["episode"], y=df_ep["reward_ma"], name=f"MA({window})")
    )
fig.update_layout(xaxis_title="Episode", yaxis_title="Total reward")
fig

## 8) Plotly: PPO diagnostics over updates

We visualize clipping behavior and losses over training updates.


In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_logs["update"], y=df_logs["clip_frac"], name="clip_frac"))
fig.add_trace(go.Scatter(x=df_logs["update"], y=df_logs["approx_kl"], name="approx_kl"))
fig.update_layout(title="PPO diagnostics", xaxis_title="Update", yaxis_title="Value")
fig

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_logs["update"], y=df_logs["policy_loss"], name="policy_loss"))
fig.add_trace(go.Scatter(x=df_logs["update"], y=df_logs["value_loss"], name="value_loss"))
fig.add_trace(go.Scatter(x=df_logs["update"], y=df_logs["entropy"], name="entropy"))
fig.update_layout(title="Losses over updates", xaxis_title="Update", yaxis_title="Loss / entropy")
fig

## 9) Plotly: policy ratios \(r_t\) and clipping

Below we plot the distribution of \(r_t\) and highlight where clipping is active.

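- Before the first gradient step on a rollout every ratio is exactly 1.0. The snapshot is taken after all `UPDATE_EPOCHS` passes, so the histogram should still concentrate near 1.0.
- Some mass ends up outside \([1-\epsilon, 1+\epsilon]\): clipping removes the incentive to move further but is not a hard constraint, so PPO only discourages large deviations.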


In [None]:
ratios = last_ratio_snapshot

fig = go.Figure()
fig.add_trace(go.Histogram(x=ratios, nbinsx=60, name="r_t"))
fig.add_vline(x=1.0 - CLIP_EPS, line_dash="dash", line_color="orange")
fig.add_vline(x=1.0, line_dash="dash", line_color="gray")
fig.add_vline(x=1.0 + CLIP_EPS, line_dash="dash", line_color="orange")
fig.update_layout(
    title="Policy ratio distribution (last rollout)",
    xaxis_title="r_t = pi_new(a|s) / pi_old(a|s)",
    yaxis_title="Count",
)
fig

In [None]:
df_ratio = pd.DataFrame(
    {
        "ratio": last_ratio_snapshot,
        "advantage": last_adv_snapshot,
        "clip_active": last_clip_active_snapshot,
    }
)

fig = px.scatter(
    df_ratio,
    x="ratio",
    y="advantage",
    color="clip_active",
    title="Where clipping is active (last rollout)",
    labels={"ratio": "r_t", "advantage": "A_t"},
)
fig.add_vline(x=1.0 - CLIP_EPS, line_dash="dash", line_color="orange")
fig.add_vline(x=1.0, line_dash="dash", line_color="gray")
fig.add_vline(x=1.0 + CLIP_EPS, line_dash="dash", line_color="orange")
fig

## 10) Stable-Baselines PPO1 reference implementation

A Stable-Baselines implementation of **PPO1** exists (legacy TensorFlow 1.x codebase):

- Repo: https://github.com/hill-a/stable-baselines
- PPO1 package: https://github.com/hill-a/stable-baselines/tree/master/stable_baselines/ppo1
- Main file: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/ppo1/pposgd_simple.py
  - Exposes `class PPO1(...)` (imported by `stable_baselines/ppo1/__init__.py`)

The original OpenAI Baselines PPO implementation is also available:

- https://github.com/openai/baselines/tree/master/baselines/ppo1

Example usage (not run here):

```python
from stable_baselines import PPO1
import gym

env = gym.make("CartPole-v1")
model = PPO1("MlpPolicy", env, clip_param=0.2, timesteps_per_actorbatch=2048)
model.learn(total_timesteps=1_000_000)
```

Note: Stable-Baselines is archived/legacy and uses TF1/MPI; Stable-Baselines3 is PyTorch and offers `PPO` (conceptually closer to PPO2-style implementations).


## 11) Stable-Baselines `PPO1` hyperparameters (explained)

Stable-Baselines `PPO1` (legacy TensorFlow/MPI) exposes the following constructor signature (from `stable_baselines/ppo1/pposgd_simple.py`):

```python
PPO1(
    policy,
    env,
    gamma=0.99,
    timesteps_per_actorbatch=256,
    clip_param=0.2,
    entcoeff=0.01,
    optim_epochs=4,
    optim_stepsize=1e-3,
    optim_batchsize=64,
    lam=0.95,
    adam_epsilon=1e-5,
    schedule='linear',
    verbose=0,
    tensorboard_log=None,
    _init_setup_model=True,
    policy_kwargs=None,
    full_tensorboard_log=False,
    seed=None,
    n_cpu_tf_sess=1,
)
```

### What each hyperparameter does

- `policy`: policy class (or registered string) like `MlpPolicy`, `CnnPolicy`, etc.
- `env`: Gym env instance or an env id string (e.g. `'CartPole-v1'`).
- `gamma`: discount factor \(\gamma\).
- `timesteps_per_actorbatch`: number of environment steps collected per update **per actor** (batch size).
- `clip_param`: PPO clip parameter \(\epsilon\).
- `entcoeff`: entropy coefficient (larger → more exploration pressure).
- `optim_epochs`: number of epochs over the on-policy batch per update.
- `optim_stepsize`: optimizer step size (learning rate), optionally controlled by `schedule`.
- `optim_batchsize`: minibatch size.
- `lam`: GAE(\(\lambda\)) parameter.
- `adam_epsilon`: Adam epsilon for numerical stability.
- `schedule`: learning-rate schedule type (e.g. `'linear'`, `'constant'`, ...).

### Mapping to this notebook

- SB `timesteps_per_actorbatch` → this notebook’s `ROLLOUT_STEPS`
- SB `clip_param` → `CLIP_EPS`
- SB `entcoeff` → `ENT_COEF`
- SB `optim_epochs` → `UPDATE_EPOCHS`
- SB `optim_stepsize` → `LEARNING_RATE`
- SB `optim_batchsize` → `MINIBATCH_SIZE`
- SB `gamma` → `GAMMA`
- SB `lam` → `GAE_LAMBDA`
- SB `adam_epsilon` → `ADAM_EPS`
- SB `schedule` → not implemented here (easy extension: linearly decay `LEARNING_RATE` over updates)
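
The linear-decay extension mentioned in the last item could look like the sketch below. Both helpers (`linear_lr`, `set_lr`) are hypothetical names, and the schedule assumes a 1-indexed `update` counter as in this notebook's training loop; `set_lr` relies only on the `param_groups` attribute that `torch.optim` optimizers expose.

```python
def linear_lr(update: int, n_updates: int, base_lr: float) -> float:
    """Learning rate for 1-indexed `update`, decaying linearly from base_lr toward 0."""
    frac = 1.0 - (update - 1) / n_updates
    return base_lr * frac


def set_lr(optimizer, lr: float) -> None:
    """Write `lr` into every parameter group of a torch.optim optimizer."""
    for group in optimizer.param_groups:
        group["lr"] = lr


# Intended use inside the training loop, before calling ppo_update:
#     set_lr(optimizer, linear_lr(update, N_UPDATES, LEARNING_RATE))
print([linear_lr(u, 4, 3e-4) for u in range(1, 5)])
```

The first update runs at the full `LEARNING_RATE` and the last one at `LEARNING_RATE / N_UPDATES`, so the rate never reaches exactly zero during training.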

### Practical tuning hints

- If **reward collapses**: reduce `LEARNING_RATE`, reduce `UPDATE_EPOCHS`, or reduce `CLIP_EPS`.
- If **learning is slow**: increase `ROLLOUT_STEPS`, increase `UPDATE_EPOCHS`, or slightly increase `LEARNING_RATE`.
- Watch **`approx_kl`** and **`clip_frac`**: sustained high values mean policy updates are too aggressive.



## Pitfalls

- If training is unstable: lower `LEARNING_RATE`, check advantage normalization, and verify the done/bootstrapping logic.
- If `clip_frac` is near 0.0: updates may be too small (try higher LR or more epochs).
- If `clip_frac` is very high: updates are too aggressive (try smaller LR or smaller `CLIP_EPS`).
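
One done/bootstrapping detail worth checking: Gymnasium reports `terminated` and `truncated` separately, and this notebook merges them into a single `done` flag. A possible refinement, sketched here as an assumption rather than as part of the notebook, is to fold \(\gamma V(s')\) into the reward when an episode is cut off by the time limit, so the stored `done=True` no longer discards the value of the final state. The helper name `reward_with_timeout_bootstrap` is hypothetical, and `next_value` would have to be the critic's estimate for the observation returned by `env.step` before the reset.

```python
def reward_with_timeout_bootstrap(
    reward: float,
    terminated: bool,
    truncated: bool,
    next_value: float,
    gamma: float,
) -> float:
    """Add gamma * V(s') to the reward when the episode was truncated, not terminated.

    With done=True stored for the step, GAE drops its own bootstrap term, so the
    adjusted reward restores delta = r + gamma * V(s') - V(s) for that step.
    """
    if truncated and not terminated:
        return reward + gamma * next_value
    return reward


# Time-limit cut-off: the critic's estimate of the final state is added back in.
print(reward_with_timeout_bootstrap(1.0, False, True, next_value=10.0, gamma=0.99))
# True termination: the reward is left unchanged.
print(reward_with_timeout_bootstrap(1.0, True, False, next_value=10.0, gamma=0.99))
```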

---

## Exercises

1. Enable the **entropy bonus** (set `ENT_COEF > 0`; the loss term already exists) and compare learning curves.
2. Implement **value function clipping** (as in some PPO variants) and compare critic stability.
3. Switch to a continuous-action env (e.g., Pendulum) using a Gaussian policy.

---

## References

- Schulman et al., *Proximal Policy Optimization Algorithms* (2017): https://arxiv.org/abs/1707.06347
- Stable-Baselines PPO1 source (TF1): https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/ppo1/pposgd_simple.py
