
# PPO (Proximal Policy Optimization) with Stable Baselines3

**Goal:** Train a PPO agent on `CartPole-v1` with vectorized environments.  
**Why PPO?** It is an on-policy actor-critic method whose **clipped objective** limits how far a single update can move the policy, which makes training robust. It is also widely used.



### Why this step?  
We install the required packages so the notebook can run anywhere (local, Colab).  
- **gymnasium**: modern Gym API for RL environments  
- **stable-baselines3**: popular, well-maintained RL algorithms (DQN, PPO, A2C)  
- **tensorboard** (optional): view training curves  
- **matplotlib**: simple plotting

### How it works  
`pip` installs packages into the active environment. In Colab they last only for the current session; locally they go into whichever environment (for example a venv) is active.


In [None]:

# If running locally and already installed, you can skip this.
# In Colab, keep this cell.
# Note: remove the `-q` if you want to see full logs.
# The quotes keep shells such as zsh from treating the [extras] brackets as a glob pattern.
!pip install -q "gymnasium[classic-control]==0.29.1" stable-baselines3==2.3.2 tensorboard matplotlib numpy
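
After installing, it can help to confirm which versions actually ended up in the active environment. The snippet below is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list is simply copied from the `pip` command above.

```python
from importlib import metadata

# Distribution names taken from the pip command above.
PACKAGES = ["gymnasium", "stable-baselines3", "tensorboard", "matplotlib", "numpy"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(PACKAGES)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

If a package shows up as "not installed", rerun the install cell (in Colab, a runtime restart resets the session).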



### Why this step?
Import PPO and create a **vectorized environment** so that each rollout gathers experience from several env copies at once.

### How it works
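`make_vec_env` creates `n_envs` copies of the environment behind a single batched interface. By default it uses `DummyVecEnv`, which steps the copies one after another in the same process; pass `vec_env_cls=SubprocVecEnv` if you want each copy in its own process. PPO collects `n_steps` transitions from every copy before each update.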
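
To make the batching concrete, here is a toy sketch in plain Python. `ToyEnv` and `ToyVecEnv` are hypothetical classes written for illustration only; they are not SB3's implementation, but they show the two behaviors that matter later: one `step` call advances every copy, and a copy that finishes an episode is reset automatically.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for an environment: an episode lasts `horizon` steps."""

    def __init__(self, horizon, seed):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.rng.random()  # dummy observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.rng.random(), 1.0, done  # (obs, reward, done)


class ToyVecEnv:
    """Steps several env copies one after another and returns batched results."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs_batch, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done = env.step(action)
            if done:
                obs = env.reset()  # auto-reset: the caller never resets a single copy
            obs_batch.append(obs)
            rewards.append(reward)
            dones.append(done)
        return obs_batch, rewards, dones


vec = ToyVecEnv([ToyEnv(horizon=3, seed=i) for i in range(4)])
obs = vec.reset()
for _ in range(3):
    obs, rewards, dones = vec.step([0] * 4)
print(len(obs), rewards, dones)
```

After the third step every copy reports `done=True` and has already been reset, so the observations returned alongside those flags belong to the next episode.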


In [None]:

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

vec_env = make_vec_env("CartPole-v1", n_envs=4, seed=42)
vec_env



### Why this step?
Instantiate the PPO model and set core hyperparameters.

### How it works
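- `n_steps`: rollout length per env before an update, so each update uses `n_steps * n_envs` transitions (128 * 4 = 512 here)  
- `batch_size`: minibatch size per gradient step, ideally a divisor of `n_steps * n_envs`  
- `clip_range`: PPO's clipping parameter ε, which keeps the new-to-old policy probability ratio within [1 - ε, 1 + ε] in the objective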
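
The effect of `clip_range` is easiest to see on single samples. Below is a minimal sketch of the per-sample clipped surrogate, min(r·A, clip(r, 1 - ε, 1 + ε)·A), where r is the probability ratio and A the advantage. The function name `clipped_surrogate` and the sample values are made up for illustration; SB3 computes the same quantity on tensors inside its training loop.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# ratio = pi_new(a|s) / pi_old(a|s); the (ratio, advantage) pairs are illustrative.
samples = [(1.5, 1.0), (0.5, -1.0), (1.5, -1.0), (1.1, 1.0)]
for ratio, advantage in samples:
    print(ratio, advantage, clipped_surrogate(ratio, advantage))
```

With a positive advantage, pushing the ratio past 1 + ε earns nothing extra, so the gradient for that sample vanishes. With a negative advantage, the `min` keeps the unclipped (worse) value when the ratio has grown, so the update is still penalized for making a bad action more likely.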


In [None]:

model = PPO(
    policy="MlpPolicy",
    env=vec_env,
    n_steps=128,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.0,
    vf_coef=0.5,
    max_grad_norm=0.5,
    verbose=1,
    seed=42,
)
model



### Why this step?
Train PPO on the vectorized environment for a fixed budget of timesteps.

### How it works
PPO alternates between collecting on-policy rollouts and running several epochs of minibatch gradient descent (Adam by default in SB3) on that data. Each rollout is discarded after its update.


In [None]:

model.learn(total_timesteps=200_000, log_interval=10)
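
For the settings used in this notebook, the collect-then-update cycle works out as follows. This is a back-of-the-envelope sketch: it assumes training always collects whole rollouts until the budget is reached and that no early stopping (for example via `target_kl`) cuts the epochs short.

```python
import math

# Values copied from the cells in this notebook.
n_envs, n_steps = 4, 128
batch_size, n_epochs = 64, 10
total_timesteps = 200_000

rollout_size = n_envs * n_steps                      # transitions collected per update
minibatches_per_epoch = rollout_size // batch_size   # 512 / 64
grad_steps_per_update = n_epochs * minibatches_per_epoch
n_updates = math.ceil(total_timesteps / rollout_size)

print(rollout_size, minibatches_per_epoch, grad_steps_per_update, n_updates)
print("timesteps actually collected:", n_updates * rollout_size)
```

Because the budget is not a multiple of the rollout size, training overshoots 200,000 timesteps slightly rather than stopping mid-rollout.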



### Why this step?
Evaluate the trained PPO policy.

### How it works
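We run `evaluate_policy` with deterministic actions on a fresh, non-vectorized env for clarity; SB3 wraps it into a vectorized env internally. The function returns the mean and standard deviation of the per-episode returns.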


In [None]:

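from stable_baselines3.common.monitor import Monitor

# Monitor records unmodified episode returns and lengths; without it SB3 emits a warning during evaluation.
eval_env = Monitor(gym.make("CartPole-v1"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"PPO mean reward: {mean_reward:.2f} ± {std_reward:.2f}")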
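
The two printed numbers are summary statistics over the evaluation episodes. As a small sketch of how to read them, the snippet below recomputes a mean and a population standard deviation from a list of returns. The returns are hypothetical, not results from this notebook; passing `return_episode_rewards=True` to `evaluate_policy` should give you the real per-episode values.

```python
import statistics

# Hypothetical per-episode returns; real values come from the evaluation run.
episode_returns = [500.0, 500.0, 480.0, 500.0]

mean_return = statistics.fmean(episode_returns)
std_return = statistics.pstdev(episode_returns)  # population std, like NumPy's default
print(f"{mean_return:.2f} +/- {std_return:.2f}")
```

On `CartPole-v1` an episode is cut off after 500 steps, so a mean near 500 with a small spread indicates the policy balances the pole for the full episode almost every time.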



### Why this step?
Optionally save and reload the model, then run a short vectorized rollout to check that the reloaded policy still behaves as expected.


In [None]:
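import numpy as np

model.save("ppo_cartpole")
del model
model = PPO.load("ppo_cartpole", env=vec_env)

obs = vec_env.reset()
returns = np.zeros(vec_env.num_envs)   # running return of each env's current episode
finished = []                          # returns of completed episodes
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    # The VecEnv API returns 4 values (a single `dones` flag), not Gymnasium's 5.
    obs, rewards, dones, infos = vec_env.step(action)
    returns += rewards
    # Finished envs are reset automatically, so no manual reset is needed here.
    for i in np.flatnonzero(dones):
        finished.append(returns[i])
        returns[i] = 0.0
print(f"Completed {len(finished)} episodes, mean return: {np.mean(finished):.1f}")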
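
The 4-value `step` result above differs from the 5-value tuple a plain Gymnasium env returns. The sketch below shows roughly how the two relate, assuming the vectorized wrapper merges `terminated` and `truncated` into one `done` flag and records time-limit cutoffs under the `"TimeLimit.truncated"` info key. `to_vec_style` is a hypothetical helper for illustration, not an SB3 function, and the step results are made up.

```python
def to_vec_style(step_result):
    """Collapse Gymnasium's 5-tuple into the 4-tuple layout a VecEnv returns."""
    obs, reward, terminated, truncated, info = step_result
    done = terminated or truncated
    # Keep the distinction available: a time-limit cutoff is not a true terminal state.
    info = {**info, "TimeLimit.truncated": truncated and not terminated}
    return obs, reward, done, info


# Made-up step results: the pole falling over vs. the episode hitting the time limit.
fell = ([0.1, 0.0, 0.3, 0.0], 1.0, True, False, {})
timed_out = ([0.0, 0.0, 0.0, 0.0], 1.0, False, True, {})
print(to_vec_style(fell)[2:], to_vec_style(timed_out)[2:])
```

Both cases end the episode from the rollout loop's point of view, but only the first is a real failure of the policy.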
