
# A2C (Advantage Actor-Critic) with Stable Baselines3

**Goal:** Train an A2C agent on `CartPole-v1`.  
**Why A2C?** Simple, synchronous on-policy actor-critic; computes advantages and updates actor (policy) and critic (value) jointly.



### Why this step?  
We install the required packages so the notebook can run anywhere (local, Colab).  
- **gymnasium**: modern Gym API for RL environments  
- **stable-baselines3**: popular, well-maintained RL algorithms (DQN, PPO, A2C)  
- **tensorboard** (optional): view training curves  
- **matplotlib**: simple plotting

### How it works  
`pip` installs packages into the active Python environment. In Colab, they last only for the current session and must be reinstalled after a runtime reset; locally, they persist in whichever virtual environment is active.
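
To confirm what actually landed in the active environment, a small check like the one below can help. It is a minimal sketch that uses only the standard library; `installed_versions` is a hypothetical helper name, not part of any of the packages above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the active environment.
            found[name] = None
    return found


report = installed_versions(
    ["gymnasium", "stable-baselines3", "tensorboard", "matplotlib", "numpy"]
)
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

If a pinned package shows a different version than the one requested, the install cell probably ran in a different environment than the notebook kernel.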


In [None]:

# If running locally and already installed, you can skip this.
# In Colab, keep this cell.
# Note: remove the `-q` if you want to see full logs.
# The quotes keep shells such as zsh from interpreting the square brackets.
!pip install -q "gymnasium[classic-control]==0.29.1" "stable-baselines3==2.3.2" tensorboard matplotlib numpy



### Why this step?
Import A2C and build vectorized environments to gather small rollouts frequently.

### How it works
A2C uses short trajectories (controlled by `n_steps`) to compute returns/advantages.
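
The return computation over a short rollout can be sketched in plain Python. This is an illustration for one environment, not SB3's implementation: `n_step_returns` is a hypothetical helper, the critic values are made up, and `dones[t]` is assumed to mean "the episode ended at step `t`".

```python
def n_step_returns(rewards, dones, bootstrap_value, gamma=0.99):
    """Discounted returns for one env's rollout, bootstrapped from the critic."""
    returns = []
    running = bootstrap_value  # critic's estimate for the state after the rollout
    for reward, done in zip(reversed(rewards), reversed(dones)):
        # A terminal step cuts the return: nothing after it belongs to this episode.
        running = reward + gamma * running * (0.0 if done else 1.0)
        returns.append(running)
    returns.reverse()
    return returns


# Hypothetical 5-step rollout (n_steps=5) with a reward of 1 per step.
rewards = [1.0] * 5
dones = [False] * 5
values = [10.0] * 5       # made-up critic estimates V(s_t)
bootstrap_value = 10.0    # made-up critic estimate for the state after step 5

returns = n_step_returns(rewards, dones, bootstrap_value)
advantages = [g - v for g, v in zip(returns, values)]
print(returns)
print(advantages)
```

A positive advantage means the rollout did better than the critic expected from that state, so the actor is pushed toward the actions it took there.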


In [None]:

import gymnasium as gym
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

vec_env = make_vec_env("CartPole-v1", n_envs=4, seed=42)
vec_env



### Why this step?
Instantiate A2C and configure core hyperparameters.

### How it works
- `n_steps`: rollout length per env before each update  
- `gamma`: discount factor  
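- `gae_lambda`: bias/variance trade-off factor for Generalized Advantage Estimation (GAE). SB3's A2C default is `1.0`, which reduces GAE to the classic n-step advantage; values below 1 lower variance at the cost of some bias.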
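
To see what `gae_lambda` does, here is a minimal pure-Python sketch of the GAE recursion for a single rollout. It assumes no episode ends inside the rollout, and the rewards and critic values are invented for illustration; `gae_advantages` is a hypothetical helper, not an SB3 function.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, gae_lambda=1.0):
    """GAE over one rollout, assuming no episode boundary inside it."""
    advantages = [0.0] * len(rewards)
    next_value = last_value  # critic estimate for the state after the rollout
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [2.0, 1.5, 1.0]   # made-up critic estimates
last_value = 0.5           # made-up bootstrap value

classic = gae_advantages(rewards, values, last_value, gae_lambda=1.0)
one_step = gae_advantages(rewards, values, last_value, gae_lambda=0.0)
print("lambda=1.0:", classic)
print("lambda=0.0:", one_step)
```

With `gae_lambda=1.0` each advantage equals the bootstrapped discounted return minus the critic's value (the classic estimate); with `gae_lambda=0.0` it collapses to the one-step TD error, which has lower variance but leans entirely on the critic.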


In [None]:

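model = A2C(
    policy="MlpPolicy",   # small fully connected network for actor and critic
    env=vec_env,
    n_steps=5,            # rollout length per env; 4 envs x 5 steps = 20 transitions per update
    gamma=0.99,           # discount factor
    gae_lambda=1.0,       # 1.0 = classic advantage estimate (no extra GAE smoothing)
    ent_coef=0.0,         # weight of the entropy bonus in the loss
    vf_coef=0.5,          # weight of the value-function loss
    max_grad_norm=0.5,    # gradient clipping threshold
    learning_rate=7e-4,
    verbose=1,
    seed=42,
)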
model



### Why this step?
Train A2C for a fixed number of timesteps.

### How it works
A2C gathers short on-policy rollouts, computes advantage estimates, and updates actor & critic synchronously.
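
A quick back-of-the-envelope calculation shows how `total_timesteps` translates into updates for this notebook's settings. It is a sketch that assumes one gradient update per collected rollout and one log dump every `log_interval` updates, which is how SB3's on-policy training loop is understood to behave.

```python
import math

n_envs = 4                 # from make_vec_env(..., n_envs=4)
n_steps = 5                # from A2C(..., n_steps=5)
total_timesteps = 150_000
log_interval = 10

# Transitions collected across all envs before each update.
batch_size = n_envs * n_steps
# Rollouts (and therefore updates) needed to reach total_timesteps.
n_updates = math.ceil(total_timesteps / batch_size)
# Number of times training statistics are logged.
n_log_dumps = n_updates // log_interval

print(f"transitions per update: {batch_size}")
print(f"updates: {n_updates}")
print(f"log dumps: {n_log_dumps}")
```

Raising `log_interval` reduces console output when `verbose=1` without changing training itself.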


In [None]:

model.learn(total_timesteps=150_000, log_interval=10)



### Why this step?
Evaluate the learned policy.

### How it works
Run 10 episodes with deterministic actions on a single, separate env and report the mean ± std of the per-episode rewards.
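
The summary statistic itself is simple. The sketch below reproduces it with the standard library, assuming the standard deviation is the population form (the default of NumPy's `np.std`). The episode rewards are toy values for illustration, not results from this notebook, and `summarize` is a hypothetical helper.

```python
from statistics import mean, pstdev


def summarize(episode_rewards):
    """Return (mean, population std) of a list of per-episode rewards."""
    return mean(episode_rewards), pstdev(episode_rewards)


# Toy per-episode rewards, invented for illustration only.
toy_rewards = [200.0, 180.0, 220.0, 200.0]
avg, spread = summarize(toy_rewards)
print(f"Mean reward: {avg:.2f} ± {spread:.2f}")
```

`CartPole-v1` gives a reward of 1 per step and truncates episodes at 500 steps, so 500 is the best possible mean. Passing `return_episode_rewards=True` to `evaluate_policy` returns the per-episode lists if you want to compute other statistics.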


In [None]:

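from stable_baselines3.common.monitor import Monitor

# Monitor records unmodified episode rewards and lengths; evaluate_policy warns without it.
eval_env = Monitor(gym.make("CartPole-v1"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"A2C mean reward: {mean_reward:.2f} ± {std_reward:.2f}")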



### Why this step?
(Optionally) Save/reload the model and run a quick rollout.


In [None]:
model.save("a2c_cartpole")
del model
model = A2C.load("a2c_cartpole", env=vec_env)

obs = vec_env.reset()  # VecEnv.reset() returns only the observations (no info dict)
for step in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    # VecEnv.step() returns 4 values and automatically resets any sub-env that
    # finishes, so calling vec_env.reset() here would needlessly restart all envs.
    obs, rewards, dones, infos = vec_env.step(action)
print("Completed a short deterministic rollout.")
