# Proximal Policy Optimization 2 (PPO2) — from scratch in PyTorch

This notebook builds a **low-level** PPO2 implementation in **PyTorch** and uses it to train an agent on a classic control environment.

---

## Learning goals

By the end you should be able to:

- derive the PPO2 clipped objective and connect it to a trust-region intuition
- implement PPO2 (rollout → GAE → multi-epoch mini-batch updates) in **raw PyTorch**
- understand *exactly* how PPO2 differs from PPO1 (both in the paper and in Stable-Baselines naming)
- plot **episodic rewards** and training diagnostics with Plotly

---

## Prerequisites

- comfortable with gradients and backprop
- basic RL notation: policy $\pi_\theta(a\mid s)$, returns, value function $V_\phi(s)$
- packages: `torch`, `gymnasium` (or `gym`), `numpy`, `plotly`


In [None]:
import math
import time
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical, Independent, Normal

# Gymnasium first (new API), fallback to Gym (old API)
try:
    import gymnasium as gym
except ImportError:  # pragma: no cover
    import gym  # type: ignore

pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

SEED = 42
rng = np.random.default_rng(SEED)

torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device


## 1) The RL objective (notation)

We’ll use the standard episodic discounted-return objective:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T-1} \gamma^t r_t\Big]
$$

- $\tau = (s_0, a_0, r_0, s_1, \dots)$ is a trajectory sampled by following the policy.
- $\gamma \in (0, 1]$ is the discount factor.

Two key helper objects:

- **Value function**: $V_\phi(s) \approx \mathbb{E}[\sum_{k\ge 0} \gamma^k r_{t+k} \mid s_t=s]$
- **Advantage**: $A_t = Q(s_t, a_t) - V(s_t)$ — “how much better was this action than average?”
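As a small worked example of the notation above, the sketch below computes the discounted return $G_t = \sum_{k\ge 0} \gamma^k r_{t+k}$ for every step of one finished episode and then a Monte Carlo advantage estimate $G_t - V(s_t)$. The reward and value numbers are made up for illustration, and `discounted_returns` is a hypothetical helper that is not used elsewhere in the notebook.

```python
import numpy as np


def discounted_returns(episode_rewards, gamma):
    """Return G_t = sum_k gamma^k * r_{t+k} for each step t of one finished episode."""
    out = np.zeros(len(episode_rewards), dtype=np.float64)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(episode_rewards))):
        running = episode_rewards[t] + gamma * running
        out[t] = running
    return out


demo_rewards = [1.0, 1.0, 1.0]                 # illustrative rewards
demo_values = np.array([1.5, 1.5, 1.5])        # illustrative critic predictions V(s_t)

G = discounted_returns(demo_rewards, gamma=0.5)
mc_advantage = G - demo_values                 # Monte Carlo estimate of A_t

print(G)             # [1.75 1.5  1.  ]
print(mc_advantage)  # [ 0.25  0.   -0.5 ]
```

A positive entry means the sampled outcome was better than the critic expected from that state; a negative entry means it was worse.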


## 2) Policy gradients in one equation

The policy-gradient theorem gives the gradient that PPO’s surrogate objectives are built to approximate:

$$
\nabla_\theta J(\theta)
= \mathbb{E}_t\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t\big]
$$

In practice we:

1) sample data with an **old** policy $\pi_{\theta_{\text{old}}}$

2) estimate advantages $\hat{A}_t$ (often via **GAE**)

3) update the policy using mini-batch SGD.
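In an autograd framework, step 3 is usually written as a scalar loss whose gradient equals the (negated) policy-gradient estimate. The sketch below assumes a toy, state-independent policy over three actions and made-up advantages, and checks the autograd gradient against the analytic softmax formula $\nabla_{\text{logits}} \log \pi(a) = \mathrm{onehot}(a) - \pi$.

```python
import torch
from torch.distributions import Categorical

# Toy policy: one shared logit vector (uniform over 3 actions), illustrative only
logits = torch.zeros(3, requires_grad=True)
sampled_actions = torch.tensor([0, 2, 2, 1])
adv_estimates = torch.tensor([1.0, -0.5, 2.0, 0.0])   # made-up advantage estimates

dist = Categorical(logits=logits.expand(4, 3))
log_probs = dist.log_prob(sampled_actions)

# Minimizing this loss follows the policy-gradient direction E[grad log pi * A]
pg_loss = -(log_probs * adv_estimates).mean()
pg_loss.backward()

# Analytic check: grad = -mean_t A_t * (onehot(a_t) - pi), with pi = 1/3 everywhere
one_hot = torch.nn.functional.one_hot(sampled_actions, 3).float()
analytic = -((one_hot - 1.0 / 3.0) * adv_estimates.unsqueeze(1)).mean(dim=0)

print(logits.grad)
print(analytic)
```

The advantages are treated as constants here; only the log-probabilities carry gradient, which is also how the PPO2 update later in the notebook uses them.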


## 3) Why PPO exists: “big steps” break policy gradients

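A vanilla policy-gradient update has no built-in limit on step size: one large step can move the policy far from the one that collected the data, at which point the advantage estimates no longer describe it.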

PPO controls this by comparing the new policy to the old policy using the **probability ratio**:

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$

If $r_t(\theta)=1$ the new policy agrees with the old policy on that sampled action.

The classic importance-sampled surrogate (CPI) is:

$$
L^{\text{CPI}}(\theta) = \mathbb{E}_t\big[r_t(\theta)\,\hat{A}_t\big]
$$

The problem: maximizing this can push $r_t$ to extreme values — effectively taking a **too-large** policy update.
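Implementations typically store log-probabilities rather than probabilities, so the ratio is computed as $\exp(\log\pi_\theta - \log\pi_{\theta_{\text{old}}})$. A minimal NumPy sketch with made-up probabilities and advantages:

```python
import numpy as np

# Probabilities of the sampled actions under the old and new policy (illustrative)
old_logp = np.log(np.array([0.5, 0.2, 0.10]))
new_logp = np.log(np.array([0.6, 0.2, 0.05]))
sample_adv = np.array([1.0, -1.0, 2.0])

# r_t = pi_new / pi_old, computed in log space for numerical stability
ratio = np.exp(new_logp - old_logp)
l_cpi = np.mean(ratio * sample_adv)

print(ratio)   # [1.2 1.  0.5]
print(l_cpi)   # 0.4 (up to floating-point rounding)
```

Nothing in `l_cpi` stops an optimizer from driving the first ratio far above 1.2, which is the failure mode the clipped objective addresses.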


## 4) PPO1 vs PPO2 (be precise about naming)

People use “PPO1” vs “PPO2” in **two different ways**:

### A) In the PPO paper (algorithmic variants)

- **PPO-Penalty**: adds a KL penalty $\beta\,\mathrm{KL}(\pi_{\text{old}}\,\|\,\pi_\theta)$ and adapts $\beta$.
- **PPO-Clip**: uses a clipped surrogate objective (no explicit KL penalty term).

A common PPO-Penalty surrogate is:

$$
L^{\text{KLPEN}}(\theta) = \mathbb{E}_t\Big[r_t(\theta)\,\hat{A}_t - \beta\,\mathrm{KL}\big(\pi_{\theta_{\text{old}}}(\cdot\mid s_t)\,\|\,\pi_{\theta}(\cdot\mid s_t)\big)\Big]
$$

with $\beta$ tuned (often adaptively) to keep the KL near a target. PPO-Clip instead bakes the “keep it close” constraint into the objective via clipping.

Many blogs call these “PPO1” (penalty) and “PPO2” (clip). When this notebook says **PPO2**, it means **PPO-Clip**.
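For reference, the adaptive rule for $\beta$ described in the PPO paper halves the coefficient when the measured KL falls below the target divided by 1.5 and doubles it when the KL exceeds 1.5 times the target. The function name and argument names below are hypothetical; this notebook does not use a KL penalty.

```python
def adapt_kl_coef(beta, observed_kl, target_kl):
    """Adjust the KL penalty coefficient after one policy update (PPO-Penalty heuristic)."""
    if observed_kl < target_kl / 1.5:
        return beta / 2.0   # policy barely moved: relax the penalty
    if observed_kl > target_kl * 1.5:
        return beta * 2.0   # policy moved too far: tighten the penalty
    return beta             # within the tolerance band: leave unchanged


print(adapt_kl_coef(1.0, observed_kl=0.001, target_kl=0.01))  # 0.5
print(adapt_kl_coef(1.0, observed_kl=0.020, target_kl=0.01))  # 2.0
print(adapt_kl_coef(1.0, observed_kl=0.010, target_kl=0.01))  # 1.0
```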

### B) In OpenAI Baselines / Stable-Baselines (implementation families)

Stable-Baselines historically exposes **two codebases**:

- `PPO1`: an older MPI-oriented implementation (requires `mpi4py`), with different batching and optimizer plumbing.
- `PPO2`: a newer implementation that supports vectorized envs and (optionally) **value-function clipping** (`cliprange_vf`).

**Important nuance**: Stable-Baselines `PPO1` also uses the clipped surrogate; the “1 vs 2” there is mostly *engineering*, not the core objective.

Concretely, in Stable-Baselines:

- `PPO1` is documented as an “MPI version”, with hyperparameters like `timesteps_per_actorbatch`, `optim_stepsize`, `optim_batchsize`, and a learning-rate `schedule`.
- `PPO2` is documented as a “GPU version”, with hyperparameters like `n_steps` (per env), `nminibatches`, `noptepochs`, and the extra `cliprange_vf` option for value clipping.

If you’re comparing results across implementations, these differences (batch construction + optimizer details + value clipping) can matter even when the high-level PPO objective looks similar.


## 5) PPO2 clipped objective (the main idea)

PPO2 replaces the CPI surrogate with the **clipped surrogate**:

$$
L^{\text{CLIP}}(\theta)
= \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t\big)\Big]
$$

Interpretation:

- If $\hat{A}_t > 0$ (action better than baseline), we don’t want $r_t$ to grow far above $1+\epsilon$.
- If $\hat{A}_t < 0$ (action worse than baseline), we don’t want $r_t$ to shrink far below $1-\epsilon$.

So PPO2 constrains the *effective* improvement you can get from any single sample.
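A few hand-checkable numbers make the two cases concrete. The sketch below evaluates the per-sample clipped term for $\epsilon = 0.2$ at four illustrative (ratio, advantage) pairs; `clipped_term` is a throwaway helper for this example.

```python
import numpy as np


def clipped_term(ratio, adv, eps=0.2):
    """Per-sample PPO-Clip term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)


cases = [
    (1.5, 1.0),    # good action, ratio too high  -> clipped at 1.2 * A
    (0.5, 1.0),    # good action, ratio too low   -> not clipped (pessimistic branch)
    (0.5, -1.0),   # bad action, ratio too low    -> clipped at 0.8 * A
    (1.5, -1.0),   # bad action, ratio too high   -> not clipped (pessimistic branch)
]
clip_results = [float(clipped_term(r, a)) for r, a in cases]
for (r, a), val in zip(cases, clip_results):
    print(f"r={r:.1f}, A={a:+.1f} -> {val:+.2f}")
```

The second and fourth cases show why the objective takes a `min`: when the policy has moved in the direction that makes the objective worse, the unclipped value is kept, so the sample still produces a gradient that can pull the ratio back.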

### Full loss (actor + critic + entropy)

In practice we minimize the negative surrogate plus a value loss, minus an entropy bonus:

$$
\mathcal{L}(\theta,\phi) =
-L^{\text{CLIP}}(\theta)
+ c_v\,\mathbb{E}_t[(V_\phi(s_t) - \hat{R}_t)^2]
- c_e\,\mathbb{E}_t[\mathcal{H}(\pi_\theta(\cdot\mid s_t))]
$$

where $\hat{R}_t$ are “return targets” (often $\hat{A}_t + V(s_t)$).

### Value function clipping (SB/OpenAI variant)

Stable-Baselines `PPO2` optionally clips value updates (not in the original PPO paper):

$$
V^{\text{clip}}(s_t) = V_{\text{old}}(s_t) + \mathrm{clip}(V(s_t)-V_{\text{old}}(s_t), -\epsilon_v, \epsilon_v)
$$

and uses the max of the unclipped/clipped squared error.
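A small numeric sketch of that rule, with made-up values and $\epsilon_v = 0.2$. The `0.5` factor is an assumption chosen to mirror the update function later in this notebook; `clipped_value_loss` is a hypothetical NumPy helper.

```python
import numpy as np


def clipped_value_loss(v_new, v_old, targets, eps_v):
    """0.5 * mean(max(unclipped squared error, clipped squared error))."""
    v_clip = v_old + np.clip(v_new - v_old, -eps_v, eps_v)
    err_unclipped = (v_new - targets) ** 2
    err_clipped = (v_clip - targets) ** 2
    return 0.5 * np.mean(np.maximum(err_unclipped, err_clipped))


v_old = np.array([1.0])
targets = np.array([2.0])

# The new value jumps all the way to the target: the clipped prediction (1.2) still
# has error 0.64, so the max keeps the loss at 0.32 instead of 0.
loss_big_jump = clipped_value_loss(np.array([2.0]), v_old, targets, eps_v=0.2)

# The new value stays inside the clip range: both errors agree (0.81), loss 0.405.
loss_small_step = clipped_value_loss(np.array([1.1]), v_old, targets, eps_v=0.2)

print(loss_big_jump, loss_small_step)
```

Taking the max means the critic gets no credit for moving further than $\epsilon_v$ from its old prediction in a single update, even when that move lands on the target.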


In [None]:
# Visual intuition: how clipping changes the surrogate
eps = 0.2
ratios = np.linspace(0.0, 2.0, 600)

A_pos = 1.0
A_neg = -1.0

def clipped_surrogate(r, A, eps):
    r_clipped = np.clip(r, 1.0 - eps, 1.0 + eps)
    return np.minimum(r * A, r_clipped * A)

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=(
        'Surrogate term when $A_t > 0$',
        'Surrogate term when $A_t < 0$',
    ),
)

for col, A in [(1, A_pos), (2, A_neg)]:
    fig.add_trace(
        go.Scatter(x=ratios, y=ratios * A, name='CPI: $rA$', line=dict(width=2)),
        row=1,
        col=col,
    )
    fig.add_trace(
        go.Scatter(
            x=ratios,
            y=clipped_surrogate(ratios, A, eps),
            name=r'PPO2: $\min(rA, \mathrm{clip}(r)A)$',
            line=dict(width=3),
        ),
        row=1,
        col=col,
    )
    fig.add_vline(x=1.0 - eps, line=dict(color='gray', dash='dot'), row=1, col=col)
    fig.add_vline(x=1.0 + eps, line=dict(color='gray', dash='dot'), row=1, col=col)

fig.update_layout(
    title='PPO2 clipping limits how much any sample can improve the objective',
    height=380,
    legend=dict(orientation='h', yanchor='bottom', y=-0.25, xanchor='left', x=0.0),
)
# Raw string so the `\t` in `\theta` is not read as a tab; update_xaxes labels both subplots
fig.update_xaxes(title_text=r'$r_t(\theta)$', range=[0.0, 2.0])
fig.show()


## 6) Advantage estimation: GAE($\lambda$)

A practical choice is **Generalized Advantage Estimation**:

$$
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$

$$
\hat{A}_t^{\text{GAE}(\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l}
$$

- $\lambda \to 0$ → low variance, higher bias (more like TD)
- $\lambda \to 1$ → lower bias, higher variance (more like Monte Carlo)

We’ll also use $\hat{R}_t = \hat{A}_t + V(s_t)$ as the target return for the critic.
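The two limits can be checked numerically. The sketch below uses the backward recursion $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$ on a single three-step episode with made-up rewards and values, and ignores episode boundaries for simplicity; `gae_no_dones` is a hypothetical helper separate from the notebook’s own GAE function.

```python
import numpy as np


def gae_no_dones(step_rewards, step_values, last_value, gamma, lam):
    """GAE(lambda) via the backward recursion, for one rollout without episode boundaries."""
    values_ext = np.append(step_values, last_value)   # V(s_0), ..., V(s_T)
    adv = np.zeros(len(step_rewards))
    running = 0.0
    for t in reversed(range(len(step_rewards))):
        delta = step_rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


r = np.array([1.0, 1.0, 1.0])
v = np.array([0.5, 0.4, 0.3])
gamma = 0.9

adv_td = gae_no_dones(r, v, last_value=0.0, gamma=gamma, lam=0.0)  # one-step TD errors
adv_mc = gae_no_dones(r, v, last_value=0.0, gamma=gamma, lam=1.0)  # discounted return minus V

print(adv_td)  # [0.86 0.87 0.7 ]
print(adv_mc)  # [2.21 1.5  0.7 ]
```

With $\lambda = 0$ each advantage is just $\delta_t$, which leans entirely on the critic; with $\lambda = 1$ the critic only enters as a baseline and the estimate follows the sampled rewards.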


## 7) Implementation roadmap (what we’ll code)

PPO2 training loop per update:

1) Collect a rollout of length $T$ (here: `n_steps`) with the current policy.
2) Compute values $V(s_t)$, log-probs $\log\pi(a_t\mid s_t)$, and rewards.
3) Compute GAE advantages $\hat{A}_t$ and returns $\hat{R}_t$.
4) For `n_epochs` epochs:
   - shuffle the rollout into mini-batches
   - optimize the clipped policy objective + value loss + entropy bonus.

We’ll log:

- episodic returns (what you care about)
- policy loss, value loss, entropy
- approximate KL and clip fraction (sanity checks)
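Step 4 of the loop above (several epochs, each reshuffling the same rollout into mini-batches) can be sketched on its own as a generator over index arrays. The helper name and the tiny sizes are assumptions chosen to keep the output readable.

```python
import numpy as np


def iterate_minibatches(n_samples, minibatch_size, n_epochs, generator):
    """Yield (epoch, index_array) pairs; every sample appears exactly once per epoch."""
    indices = np.arange(n_samples)
    for epoch in range(n_epochs):
        generator.shuffle(indices)                    # new random order each epoch
        for start in range(0, n_samples, minibatch_size):
            # Copy so later in-place shuffles do not change already-yielded batches
            yield epoch, indices[start:start + minibatch_size].copy()


demo_rng = np.random.default_rng(0)
demo_batches = list(iterate_minibatches(10, 4, n_epochs=2, generator=demo_rng))

for epoch, idx in demo_batches:
    print(epoch, idx)
```

With 10 samples and a mini-batch size of 4, each epoch yields batches of sizes 4, 4 and 2. The last batch is smaller whenever the rollout length is not a multiple of the mini-batch size.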


In [None]:
def env_reset(env, *, seed: Optional[int] = None):
    out = env.reset(seed=seed) if seed is not None else env.reset()
    if isinstance(out, tuple) and len(out) == 2:
        obs, _info = out
        return obs
    return out


def env_step(env, action):
    out = env.step(action)
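    # Gymnasium: (obs, reward, terminated, truncated, info)
    if isinstance(out, tuple) and len(out) == 5:
        obs, reward, terminated, truncated, info = out
        # Simplification: a time-limit truncation is treated like a true termination,
        # so GAE will not bootstrap from the value of the cut-off state.
        done = bool(terminated) or bool(truncated)
        return obs, float(reward), done, info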
    # Gym: (obs, reward, done, info)
    obs, reward, done, info = out
    return obs, float(reward), bool(done), info


def set_seed_everywhere(seed: int):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


def explained_variance(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """1 - Var[y_true - y_pred] / Var[y_true]."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    var_y = np.var(y_true)
    if var_y < 1e-12:
        return float('nan')
    return float(1.0 - np.var(y_true - y_pred) / var_y)


In [None]:
class ActorCritic(nn.Module):
    def __init__(self, obs_dim: int, action_space, hidden_sizes=(64, 64)):
        super().__init__()
        self.obs_dim = int(obs_dim)
        self.action_space = action_space

        layers: List[nn.Module] = []
        in_dim = self.obs_dim
        for h in hidden_sizes:
            layers.append(nn.Linear(in_dim, h))
            layers.append(nn.Tanh())
            in_dim = h
        self.backbone = nn.Sequential(*layers)

        # Discrete actions: categorical over logits
        if isinstance(action_space, gym.spaces.Discrete):
            self.is_discrete = True
            self.n_actions = int(action_space.n)
            self.actor = nn.Linear(in_dim, self.n_actions)
            self.log_std = None
        # Continuous actions: diagonal Gaussian
        elif isinstance(action_space, gym.spaces.Box):
            self.is_discrete = False
            self.action_dim = int(np.prod(action_space.shape))
            self.actor_mean = nn.Linear(in_dim, self.action_dim)
            self.log_std = nn.Parameter(torch.zeros(self.action_dim))
        else:
            raise TypeError(f'Unsupported action space: {type(action_space)}')

        self.critic = nn.Linear(in_dim, 1)

    def _dist(self, obs: torch.Tensor):
        h = self.backbone(obs)
        if self.is_discrete:
            logits = self.actor(h)
            return Categorical(logits=logits)
        mean = self.actor_mean(h)
        std = torch.exp(self.log_std).expand_as(mean)
        return Independent(Normal(mean, std), 1)

    def value(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.backbone(obs)
        return self.critic(h).squeeze(-1)

    def act(self, obs: torch.Tensor, action: Optional[torch.Tensor] = None):
        dist = self._dist(obs)
        if action is None:
            action = dist.sample()
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
        value = self.value(obs)
        return action, log_prob, entropy, value


In [None]:
@dataclass
class Rollout:
    obs: np.ndarray
    actions: np.ndarray
    log_probs: np.ndarray
    values: np.ndarray
    rewards: np.ndarray
    dones: np.ndarray


def make_rollout_storage(n_steps: int, obs_dim: int, action_space) -> Rollout:
    obs = np.zeros((n_steps, obs_dim), dtype=np.float32)
    rewards = np.zeros((n_steps,), dtype=np.float32)
    dones = np.zeros((n_steps,), dtype=np.float32)
    values = np.zeros((n_steps,), dtype=np.float32)
    log_probs = np.zeros((n_steps,), dtype=np.float32)

    if isinstance(action_space, gym.spaces.Discrete):
        actions = np.zeros((n_steps,), dtype=np.int64)
    elif isinstance(action_space, gym.spaces.Box):
        act_dim = int(np.prod(action_space.shape))
        actions = np.zeros((n_steps, act_dim), dtype=np.float32)
    else:
        raise TypeError(f'Unsupported action space: {type(action_space)}')

    return Rollout(obs=obs, actions=actions, log_probs=log_probs, values=values, rewards=rewards, dones=dones)


In [None]:
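def compute_gae(
    rewards: np.ndarray,
    dones: np.ndarray,
    values: np.ndarray,
    next_value: float,
    *,
    gamma: float,
    gae_lambda: float,
) -> Tuple[np.ndarray, np.ndarray]:
    """Compute GAE(lambda) advantages and critic targets for one rollout.

    `dones[t]` is 1.0 when step t ended an episode; this cuts both the bootstrap
    term and the advantage recursion at that boundary. `next_value` is V of the
    observation that follows the last step. Returns (advantages, returns).
    """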
    n_steps = len(rewards)
    advantages = np.zeros((n_steps,), dtype=np.float32)
    last_gae = 0.0
    for t in reversed(range(n_steps)):
        next_nonterminal = 1.0 - dones[t]
        next_v = next_value if t == n_steps - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_v * next_nonterminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_nonterminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    return advantages, returns


## 8) PPO2 update step (PyTorch)

The heart of PPO2 is computing:

- the ratio $r_t(\theta)$ using old and new log-probs
- the clipped surrogate
- the value loss (optionally clipped)
- the entropy bonus

and then doing standard backprop + optimizer step.


In [None]:
def ppo2_update(
    model: ActorCritic,
    optimizer: torch.optim.Optimizer,
    *,
    obs: torch.Tensor,
    actions: torch.Tensor,
    old_log_probs: torch.Tensor,
    old_values: torch.Tensor,
    advantages: torch.Tensor,
    returns: torch.Tensor,
    clip_coef: float,
    vf_clip_coef: Optional[float],
    ent_coef: float,
    vf_coef: float,
    max_grad_norm: float,
) -> Dict[str, float]:
    action, log_prob, entropy, value = model.act(obs, action=actions)

    log_ratio = log_prob - old_log_probs
    ratio = torch.exp(log_ratio)

    # Policy loss (clipped)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * advantages
    policy_loss = -torch.mean(torch.min(unclipped, clipped))

    # Value loss. `None` or a negative `vf_clip_coef` disables clipping (as in the
    # original PPO paper); a non-negative value enables SB/OpenAI-style value clipping.
    if vf_clip_coef is None or vf_clip_coef < 0:
        value_loss = 0.5 * F.mse_loss(value, returns)
    else:
        v_clipped = old_values + torch.clamp(value - old_values, -vf_clip_coef, vf_clip_coef)
        v_loss1 = (value - returns).pow(2)
        v_loss2 = (v_clipped - returns).pow(2)
        value_loss = 0.5 * torch.mean(torch.max(v_loss1, v_loss2))

    entropy_loss = -torch.mean(entropy)

    loss = policy_loss + vf_coef * value_loss + ent_coef * entropy_loss

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()

    approx_kl = torch.mean(-log_ratio).item()
    clipfrac = torch.mean((torch.abs(ratio - 1.0) > clip_coef).float()).item()

    return {
        'loss': float(loss.item()),
        'policy_loss': float(policy_loss.item()),
        'value_loss': float(value_loss.item()),
        'entropy': float(torch.mean(entropy).item()),
        'approx_kl': float(approx_kl),
        'clipfrac': float(clipfrac),
    }


## 9) Train PPO2 on CartPole-v1

We’ll keep this as close as possible to the textbook PPO2 recipe:

- rollout length: `n_steps`
- multi-epoch mini-batch SGD updates
- GAE($\lambda$) advantages (normalized)
- plot episodic rewards

Tip: CartPole is fast. If you try harder environments, prefer **vectorized** envs (parallel rollouts) for more stable gradient estimates.


In [None]:
def train_ppo2(
    *,
    env_id: str = 'CartPole-v1',
    total_timesteps: int = 150_000,
    n_steps: int = 2048,
    n_epochs: int = 10,
    minibatch_size: int = 64,
    gamma: float = 0.99,
    gae_lambda: float = 0.95,
    learning_rate: float = 3e-4,
    clip_coef: float = 0.2,
    vf_clip_coef: Optional[float] = None,
    ent_coef: float = 0.0,
    vf_coef: float = 0.5,
    max_grad_norm: float = 0.5,
    target_kl: Optional[float] = 0.03,
    seed: int = 42,
) -> Dict[str, List[float]]:
    set_seed_everywhere(seed)

    env = gym.make(env_id)
    obs0 = env_reset(env, seed=seed)
    obs_dim = int(np.prod(env.observation_space.shape))

    model = ActorCritic(obs_dim=obs_dim, action_space=env.action_space).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, eps=1e-5)

    logs: Dict[str, List[float]] = {
        'timesteps': [],
        'episode_returns': [],
        'policy_loss': [],
        'value_loss': [],
        'entropy': [],
        'approx_kl': [],
        'clipfrac': [],
        'explained_variance': [],
    }

    obs = obs0
    ep_return = 0.0

    num_updates = math.ceil(total_timesteps / n_steps)
    global_step = 0

    for update in range(num_updates):
        # Linear schedules (common PPO2 choice)
        frac = 1.0 - (update / num_updates)
        lr_now = learning_rate * frac
        clip_now = clip_coef * frac
        for pg in optimizer.param_groups:
            pg['lr'] = lr_now

        rollout = make_rollout_storage(n_steps=n_steps, obs_dim=obs_dim, action_space=env.action_space)

        # Collect on-policy data
        for t in range(n_steps):
            rollout.obs[t] = np.asarray(obs, dtype=np.float32).reshape(-1)

            obs_t = torch.tensor(rollout.obs[t], dtype=torch.float32, device=device).unsqueeze(0)
            with torch.no_grad():
                action_t, logp_t, _ent_t, value_t = model.act(obs_t)

            if model.is_discrete:
                action = int(action_t.item())
            else:
                action = action_t.squeeze(0).cpu().numpy().astype(np.float32)

            next_obs, reward, done, _info = env_step(env, action)

            rollout.actions[t] = action
            rollout.log_probs[t] = float(logp_t.item())
            rollout.values[t] = float(value_t.item())
            rollout.rewards[t] = float(reward)
            rollout.dones[t] = float(done)

            ep_return += reward
            global_step += 1

            obs = next_obs
            if done:
                logs['episode_returns'].append(float(ep_return))
                ep_return = 0.0
                obs = env_reset(env)

        # Bootstrap value for the last observation
        obs_last = torch.tensor(np.asarray(obs, dtype=np.float32).reshape(-1), device=device).unsqueeze(0)
        with torch.no_grad():
            next_value = float(model.value(obs_last).item())

        adv_np, ret_np = compute_gae(
            rewards=rollout.rewards,
            dones=rollout.dones,
            values=rollout.values,
            next_value=next_value,
            gamma=gamma,
            gae_lambda=gae_lambda,
        )

        # Flatten batch tensors
        b_obs = torch.tensor(rollout.obs, dtype=torch.float32, device=device)
        if model.is_discrete:
            b_actions = torch.tensor(rollout.actions, dtype=torch.int64, device=device)
        else:
            b_actions = torch.tensor(rollout.actions, dtype=torch.float32, device=device)
        b_old_logp = torch.tensor(rollout.log_probs, dtype=torch.float32, device=device)
        b_old_values = torch.tensor(rollout.values, dtype=torch.float32, device=device)
        b_adv = torch.tensor(adv_np, dtype=torch.float32, device=device)
        b_returns = torch.tensor(ret_np, dtype=torch.float32, device=device)

        # Advantage normalization is standard PPO2 practice
        b_adv = (b_adv - b_adv.mean()) / (b_adv.std() + 1e-8)

        # PPO update: multiple epochs over the same on-policy batch
        batch_indices = np.arange(n_steps)

        metrics_accum = {
            'policy_loss': [],
            'value_loss': [],
            'entropy': [],
            'approx_kl': [],
            'clipfrac': [],
        }

        for epoch in range(n_epochs):
            rng.shuffle(batch_indices)

            for start in range(0, n_steps, minibatch_size):
                mb_idx = batch_indices[start : start + minibatch_size]

                out = ppo2_update(
                    model,
                    optimizer,
                    obs=b_obs[mb_idx],
                    actions=b_actions[mb_idx],
                    old_log_probs=b_old_logp[mb_idx],
                    old_values=b_old_values[mb_idx],
                    advantages=b_adv[mb_idx],
                    returns=b_returns[mb_idx],
                    clip_coef=float(clip_now),
                    vf_clip_coef=vf_clip_coef,
                    ent_coef=float(ent_coef),
                    vf_coef=float(vf_coef),
                    max_grad_norm=float(max_grad_norm),
                )

                for k in metrics_accum:
                    metrics_accum[k].append(out[k])

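            # Optional early stopping (common safety valve): skip the remaining epochs once the
            # mean approx KL over all minibatches seen so far in this update exceeds 1.5 * target_kl
            if target_kl is not None and np.mean(metrics_accum['approx_kl']) > 1.5 * target_kl:
                break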

        # Logging at update granularity
        logs['timesteps'].append(float(global_step))
        logs['policy_loss'].append(float(np.mean(metrics_accum['policy_loss'])))
        logs['value_loss'].append(float(np.mean(metrics_accum['value_loss'])))
        logs['entropy'].append(float(np.mean(metrics_accum['entropy'])))
        logs['approx_kl'].append(float(np.mean(metrics_accum['approx_kl'])))
        logs['clipfrac'].append(float(np.mean(metrics_accum['clipfrac'])))
        logs['explained_variance'].append(explained_variance(rollout.values, ret_np))

    env.close()
    return logs


In [None]:
# Run training (adjust total_timesteps if you're on CPU and want it faster)
logs = train_ppo2(
    env_id='CartPole-v1',
    total_timesteps=120_000,
    n_steps=1024,
    n_epochs=10,
    minibatch_size=64,
    learning_rate=3e-4,
    ent_coef=0.0,
    vf_clip_coef=0.2,  # SB/OpenAI-style value clipping (None or a negative value disables it)
)

len(logs['episode_returns']), logs['episode_returns'][:5]


In [None]:
# Plot episodic rewards (and a rolling mean)
episode_returns = np.asarray(logs['episode_returns'], dtype=np.float32)
episodes = np.arange(1, len(episode_returns) + 1)

window = 25
if len(episode_returns) >= window:
    rolling = np.convolve(episode_returns, np.ones(window) / window, mode='valid')
    rolling_x = np.arange(window, len(episode_returns) + 1)
else:
    rolling = episode_returns
    rolling_x = episodes

fig = go.Figure()
fig.add_trace(go.Scatter(x=episodes, y=episode_returns, mode='lines', name='Episode return'))
fig.add_trace(go.Scatter(x=rolling_x, y=rolling, mode='lines', name=f'Rolling mean ({window})', line=dict(width=4)))
fig.update_layout(
    title='PPO2 on CartPole-v1: episodic reward over training',
    xaxis_title='Episode',
    yaxis_title='Episodic return',
    height=420,
)
fig.show()


In [None]:
# Plot training diagnostics per update
df = {
    'update': np.arange(len(logs['timesteps'])),
    'timesteps': np.asarray(logs['timesteps']),
    'policy_loss': np.asarray(logs['policy_loss']),
    'value_loss': np.asarray(logs['value_loss']),
    'entropy': np.asarray(logs['entropy']),
    'approx_kl': np.asarray(logs['approx_kl']),
    'clipfrac': np.asarray(logs['clipfrac']),
    'explained_variance': np.asarray(logs['explained_variance']),
}

fig = make_subplots(
    rows=2,
    cols=3,
    subplot_titles=(
        'Policy loss',
        'Value loss',
        'Entropy',
        'Approx KL',
        'Clip fraction',
        'Explained variance',
    ),
)

def add_line(row, col, y, name):
    fig.add_trace(go.Scatter(x=df['update'], y=y, mode='lines', name=name), row=row, col=col)

add_line(1, 1, df['policy_loss'], 'policy_loss')
add_line(1, 2, df['value_loss'], 'value_loss')
add_line(1, 3, df['entropy'], 'entropy')
add_line(2, 1, df['approx_kl'], 'approx_kl')
add_line(2, 2, df['clipfrac'], 'clipfrac')
add_line(2, 3, df['explained_variance'], 'explained_variance')

fig.update_layout(title='Training diagnostics (per PPO update)', height=560, showlegend=False)
fig.update_xaxes(title_text='Update')
fig.show()


## 10) Stable-Baselines `PPO2` (reference implementation)

Stable-Baselines (the TensorFlow library, now in maintenance mode) provides a `PPO2` class.

Example from the Stable-Baselines docs (CartPole with a vectorized env):

```python
import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common import make_vec_env
from stable_baselines import PPO2

env = make_vec_env('CartPole-v1', n_envs=4)
model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
model.save('ppo2_cartpole')
```

We’ll list and explain the Stable-Baselines `PPO2` hyperparameters in the next section.


## 11) Stable-Baselines `PPO2` hyperparameters (explained)

Stable-Baselines `PPO2` (TensorFlow) exposes the following constructor signature (from `stable_baselines/ppo2/ppo2.py`):

```python
PPO2(
    policy,
    env,
    gamma=0.99,
    n_steps=128,
    ent_coef=0.01,
    learning_rate=2.5e-4,
    vf_coef=0.5,
    max_grad_norm=0.5,
    lam=0.95,
    nminibatches=4,
    noptepochs=4,
    cliprange=0.2,
    cliprange_vf=None,
    verbose=0,
    tensorboard_log=None,
    _init_setup_model=True,
    policy_kwargs=None,
    full_tensorboard_log=False,
    seed=None,
    n_cpu_tf_sess=None,
)
```

### What each hyperparameter does

- `policy`: policy class (or registered string) like `MlpPolicy`, `CnnPolicy`, `MlpLstmPolicy`.
- `env`: Gym env instance or an env id string (e.g. `'CartPole-v1'`).
- `gamma`: discount factor $\gamma$.
- `n_steps`: rollout horizon per env per update. With vectorized envs, the batch size is:

  $$
  n_{\text{batch}} = n_{\text{steps}} \cdot n_{\text{envs}}
  $$

- `ent_coef`: entropy coefficient $c_e$ (larger → more exploration pressure).
- `learning_rate`: learning rate (float) or a schedule function of training progress.
- `vf_coef`: value-loss coefficient $c_v$.
- `max_grad_norm`: global gradient norm clip threshold.
- `lam`: GAE($\lambda$) parameter.
- `nminibatches`: number of minibatches per update (minibatch size is `n_batch / nminibatches`). For recurrent policies, SB recommends `n_envs` be a multiple of `nminibatches`.
- `noptepochs`: number of epochs over the on-policy batch per update.
- `cliprange`: PPO clip parameter $\epsilon$ (float) or a schedule.
- `cliprange_vf`: value-function clipping range.
  - `None` (default): reuse `cliprange` for the value function (OpenAI baselines legacy behavior).
  - negative value (e.g. `-1`): **disable** value clipping (closer to the original PPO paper).
  - positive float/schedule: enable value clipping with that range.

  Note: the clip range is expressed in the same units as the value function, so a suitable setting depends on reward scaling.

- `verbose`: logging verbosity.
- `tensorboard_log`: TensorBoard log directory (or `None`).
- `_init_setup_model`: whether to build the TF graph at init.
- `policy_kwargs`: extra kwargs forwarded to the policy network constructor.
- `full_tensorboard_log`: log additional tensors/histograms (large disk usage).
- `seed`: random seed (Python/NumPy/TF). For fully deterministic TF runs, SB notes you should set `n_cpu_tf_sess=1`.
- `n_cpu_tf_sess`: number of TensorFlow threads.

### Mapping to this notebook

- SB `n_steps` → this notebook’s `n_steps`
- SB `noptepochs` → this notebook’s `n_epochs`
- SB `nminibatches` → this notebook’s `minibatch_size = n_steps / nminibatches` (single-env case)
- SB `cliprange` → this notebook’s `clip_coef`
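- SB `cliprange_vf` → this notebook’s `vf_clip_coef`, with one difference: here `None` (the default) disables value clipping, whereas SB’s `None` reuses `cliprange`
- SB `lam` → this notebook’s `gae_lambda`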
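The mapping can be written down as a small conversion helper. This is a sketch under stated assumptions: it covers the single-env case only, expects plain floats (not schedule callables), and `sb_ppo2_to_notebook_kwargs` is a hypothetical name that exists in neither library. It follows the `cliprange_vf` semantics listed above: `None` reuses `cliprange`, and a negative value disables value clipping.

```python
def sb_ppo2_to_notebook_kwargs(sb):
    """Translate a dict of Stable-Baselines PPO2 settings into train_ppo2 keyword arguments."""
    cliprange_vf = sb.get('cliprange_vf')
    if cliprange_vf is None:
        vf_clip = sb['cliprange']        # SB default: reuse the policy clip range
    elif cliprange_vf < 0:
        vf_clip = None                   # negative in SB means no value clipping
    else:
        vf_clip = cliprange_vf

    return {
        'n_steps': sb['n_steps'],
        'n_epochs': sb['noptepochs'],
        # Single env: n_batch == n_steps, so minibatch size is n_steps / nminibatches
        'minibatch_size': sb['n_steps'] // sb['nminibatches'],
        'gamma': sb['gamma'],
        'gae_lambda': sb['lam'],
        'learning_rate': sb['learning_rate'],
        'clip_coef': sb['cliprange'],
        'vf_clip_coef': vf_clip,
        'ent_coef': sb['ent_coef'],
        'vf_coef': sb['vf_coef'],
        'max_grad_norm': sb['max_grad_norm'],
    }


# Defaults copied from the constructor signature shown earlier in this section
sb_defaults = dict(
    gamma=0.99, n_steps=128, ent_coef=0.01, learning_rate=2.5e-4, vf_coef=0.5,
    max_grad_norm=0.5, lam=0.95, nminibatches=4, noptepochs=4,
    cliprange=0.2, cliprange_vf=None,
)
notebook_kwargs = sb_ppo2_to_notebook_kwargs(sb_defaults)
print(notebook_kwargs)
```

With the SB defaults this gives `minibatch_size=32`, `n_epochs=4`, and `vf_clip_coef=0.2`. Note that `train_ppo2` also anneals the learning rate and clip range linearly, which a constant SB setting does not, so the two runs would still not be identical.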
