
# Comparison: Q-Learning vs DQN vs PPO vs A2C

**Goal:** Provide a simple scaffold for comparing training curves and evaluation scores.  
You can **train here** or **load saved models** from the other notebooks.

> For apples-to-apples comparisons, keep seeds, environments, and timesteps aligned.
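
One way to keep runs aligned is to hold the shared settings in one place and derive every run from it. The sketch below is only an illustration: `RunConfig` and `make_runs` are hypothetical names, not part of Stable-Baselines3 or the other notebooks, and the default values simply mirror the ones used later in this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Settings that should be identical across algorithms (hypothetical helper)."""
    env_id: str = "CartPole-v1"
    seed: int = 0
    total_timesteps: int = 20_000


def make_runs(algo_names, config):
    """Pair each algorithm name with the same shared config."""
    return {name: config for name in algo_names}


shared = RunConfig()
runs = make_runs(["DQN", "PPO", "A2C"], shared)
for name, cfg in runs.items():
    print(f"{name}: env={cfg.env_id} seed={cfg.seed} timesteps={cfg.total_timesteps}")
```

Because the config is frozen, a run cannot quietly drift to a different seed or budget after it has been created.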



### Why this step?  
We install the required packages so the notebook can run anywhere (local, Colab).  
- **gymnasium**: modern Gym API for RL environments  
- **stable-baselines3**: popular, well-maintained RL algorithms (DQN, PPO, A2C)  
- **tensorboard** (optional): view training curves  
- **matplotlib**: simple plotting

### How it works  
`pip` installs packages into the active environment. In Colab, they last only for the current runtime session; locally, they go into your venv (or whichever environment `pip` resolves to).
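
After the install cell has run, it can help to confirm which versions the kernel actually sees. This is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list is taken from the bullets above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


packages = ["gymnasium", "stable-baselines3", "tensorboard", "matplotlib"]
for name, ver in installed_versions(packages).items():
    print(f"{name}: {ver or 'not installed'}")
```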


In [None]:

# If running locally and already installed, you can skip this.
# In Colab, keep this cell.
# Note: remove the `-q` if you want to see full logs.
# Quotes keep the shell from treating the square brackets as a glob pattern.
!pip install -q "gymnasium[classic-control]==0.29.1" "stable-baselines3==2.3.2" tensorboard matplotlib numpy



### Why this step?
Import the libraries and create the two environments used in the comparison.

### How it works
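We'll use `CartPole-v1` for the deep RL algorithms and `FrozenLake-v1` for tabular Q-learning: CartPole's observations are continuous, so they cannot index a Q-table directly. Because the two environments differ, the Q-learning score is a reference point, not a like-for-like comparison with DQN, PPO, and A2C.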
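
To make the discrete-versus-continuous point concrete, here is a small dependency-free sketch. It assumes the default 4x4 FrozenLake map and uses plain Python lists instead of NumPy; `cell_to_state` is a hypothetical helper, and the CartPole values are made up for illustration.

```python
N_ROWS, N_COLS = 4, 4   # default FrozenLake-v1 map is a 4x4 grid
N_ACTIONS = 4           # 0=left, 1=down, 2=right, 3=up

# One row of action values per grid cell.
q_table = [[0.0] * N_ACTIONS for _ in range(N_ROWS * N_COLS)]


def cell_to_state(row, col):
    """FrozenLake numbers cells row by row: state = row * n_cols + col."""
    return row * N_COLS + col


goal_state = cell_to_state(3, 3)  # bottom-right cell
print(goal_state, q_table[goal_state])

# A CartPole observation is four floats (cart position, cart velocity,
# pole angle, pole angular velocity), so there is no finite set of rows to index.
cartpole_obs = [0.01, -0.02, 0.03, 0.04]  # illustrative values only
print(len(cartpole_obs), "continuous features")
```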


In [None]:

import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt

from stable_baselines3 import DQN, PPO, A2C
from stable_baselines3.common.evaluation import evaluate_policy

# Envs
cartpole_env = gym.make("CartPole-v1")
frozen_env = gym.make("FrozenLake-v1", is_slippery=False)



### Why this step?
(Option A) **Load** pre-trained models (if you've saved them).

### How it works
Uncomment the relevant lines to load models produced in earlier notebooks.
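
Before uncommenting, you may want to check that the files are actually there. Stable-Baselines3's `save("dqn_cartpole")` writes `dqn_cartpole.zip`; the sketch below assumes the earlier notebooks saved under these names into the current working directory, which may not match your setup. `find_saved_models` is a hypothetical helper.

```python
from pathlib import Path


def find_saved_models(names, directory="."):
    """Map each model name to its .zip path if that file exists, else None."""
    base = Path(directory)
    result = {}
    for name in names:
        path = base / f"{name}.zip"
        result[name] = path if path.is_file() else None
    return result


saved = find_saved_models(["dqn_cartpole", "ppo_cartpole", "a2c_cartpole"])
for name, path in saved.items():
    status = f"found at {path}" if path else "missing, train it with Option B"
    print(f"{name}: {status}")
```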


In [None]:

# dqn = DQN.load("dqn_cartpole", env=cartpole_env)
# ppo = PPO.load("ppo_cartpole", env=cartpole_env)
# a2c = A2C.load("a2c_cartpole", env=cartpole_env)



### Why this step?
(Option B) **Train** small models quickly just for demonstration.

### How it works
We keep timesteps low so the cell runs fast; for serious comparison, raise budgets.


In [None]:

# Every algorithm gets the same seed and timestep budget.
dqn = DQN(
    "MlpPolicy", cartpole_env,
    learning_rate=1e-3, buffer_size=20_000, learning_starts=500,
    verbose=0, seed=0,
)
dqn.learn(total_timesteps=20_000)

ppo = PPO("MlpPolicy", cartpole_env, n_steps=128, batch_size=64, verbose=0, seed=0)
ppo.learn(total_timesteps=20_000)

a2c = A2C("MlpPolicy", cartpole_env, n_steps=5, verbose=0, seed=0)
a2c.learn(total_timesteps=20_000)



### Why this step?
Evaluate all agents on the **same** environment with deterministic policies.

### How it works
We run 20 evaluation episodes per agent and report the mean and standard deviation of the episode returns, which is enough for a quick, repeatable comparison.
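
The two numbers reported per agent can be reproduced by hand from a list of per-episode returns. This sketch assumes the standard deviation is the population form (NumPy's default), which is our understanding of what `evaluate_policy` computes; `summarize_returns` is a hypothetical helper and the returns are made-up values.

```python
from statistics import fmean, pstdev


def summarize_returns(episode_returns):
    """Mean and population standard deviation of per-episode returns."""
    return fmean(episode_returns), pstdev(episode_returns)


example_returns = [500.0, 480.0, 500.0, 460.0]  # made-up values for illustration
mean, std = summarize_returns(example_returns)
print(f"{mean:.1f} ± {std:.1f}")
```

On `CartPole-v1`, an episode is cut off after 500 steps with a reward of 1 per step, so a mean near 500 with a small spread indicates a policy that balances the pole for almost every full episode.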


In [None]:

def eval_agent(model, env, n=20):
    """Return (mean, std) of the episode returns over n deterministic episodes."""
    mean, std = evaluate_policy(model, env, n_eval_episodes=n, deterministic=True)
    return mean, std

dqn_m, dqn_s = eval_agent(dqn, cartpole_env, n=20)
ppo_m, ppo_s = eval_agent(ppo, cartpole_env, n=20)
a2c_m, a2c_s = eval_agent(a2c, cartpole_env, n=20)

print(f"DQN: {dqn_m:.1f} ± {dqn_s:.1f}")
print(f"PPO: {ppo_m:.1f} ± {ppo_s:.1f}")
print(f"A2C: {a2c_m:.1f} ± {a2c_s:.1f}")



### Why this step?
(For reference) Re-create **tabular Q-learning** quickly and evaluate on FrozenLake.

### How it works
Small helper to train a Q-table and compute mean return.
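
The training helper relies on two pieces of arithmetic: the update `Q[s, a] += alpha * (r + gamma * max(Q[ns]) - Q[s, a])` and a per-episode decay of the exploration rate. The sketch below isolates both as plain functions so the numbers can be inspected; `q_update` and `epsilon_after` are hypothetical names, and the defaults mirror the values used in this notebook.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.1, gamma=0.99):
    """One Q-learning step: move Q(s, a) a fraction alpha toward the TD target."""
    target = reward + gamma * max_next_q
    return q_sa + alpha * (target - q_sa)


def epsilon_after(episodes, eps=1.0, eps_min=0.01, eps_decay=0.995):
    """Exploration rate after a number of per-episode decays, floored at eps_min."""
    return max(eps_min, eps * eps_decay ** episodes)


# Reaching the goal (reward 1) from an all-zero table moves the value up by alpha.
print(q_update(q_sa=0.0, reward=1.0, max_next_q=0.0))
# A step with no reward still learns from the value of the next state.
print(q_update(q_sa=0.5, reward=0.0, max_next_q=1.0))

for n in (0, 100, 500, 2000):
    print(n, round(epsilon_after(n), 4))
```

With these defaults the exploration rate reaches its floor well before the 2000-episode budget ends, so the later episodes are almost entirely greedy.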


In [None]:

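def train_q_learning(env, episodes=2000, alpha=0.1, gamma=0.99, eps=1.0, eps_min=0.01, eps_decay=0.995):
    # One row per discrete state, one column per action.
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore with probability eps, otherwise act greedily.
            a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
            ns, r, term, trunc, _ = env.step(a)
            done = term or trunc
            # Do not bootstrap from a terminal state; a time-limit truncation still bootstraps.
            target = r if term else r + gamma * np.max(Q[ns])
            Q[s, a] += alpha * (target - Q[s, a])
            s = ns
        # Decay exploration once per episode, down to eps_min.
        eps = max(eps_min, eps * eps_decay)
    return Q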

Q = train_q_learning(frozen_env, episodes=2000)
policy = np.argmax(Q, axis=1)

def eval_q(policy, env, n=200):
    """Average return of a greedy tabular policy over n episodes."""
    total = 0.0
    for _ in range(n):
        s, _ = env.reset()
        done = False
        while not done:
            a = int(policy[s])
            s, r, term, trunc, _ = env.step(a)
            done = term or trunc
            total += r
    return total / n

q_avg = eval_q(policy, frozen_env, n=500)
print(f"Q-Learning on FrozenLake — avg return over 500 eps: {q_avg:.3f}")
