
# DQN (Deep Q-Network) with Stable Baselines3

**Goal:** Train a DQN agent on `CartPole-v1`.  
**Why DQN?** It extends Q-learning to large or continuous state spaces (actions must still be discrete) by approximating Q(s, a) with a neural network, and it stabilizes training with **experience replay** and a **target network**.



### Why this step?  
We install the required packages so the notebook can run anywhere (local, Colab).  
- **gymnasium**: modern Gym API for RL environments  
- **stable-baselines3**: popular, well-maintained RL algorithms (DQN, PPO, A2C)  
- **tensorboard** (optional): view training curves  
- **matplotlib**: simple plotting  
- **numpy**: array library that the packages above depend on

### How it works  
`pip` installs packages into the active Python environment. In Colab, that is the current runtime, so the packages are lost when the runtime resets; locally, it is whichever virtual environment is activated.
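To confirm which versions ended up in the active environment, a small check like the following may help. It is only a sketch using the standard library's `importlib.metadata`; `installed_versions` is a hypothetical helper name, not part of any of the packages above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report on the packages this notebook installs.
wanted = ["gymnasium", "stable-baselines3", "tensorboard", "matplotlib", "numpy"]
for name, version in installed_versions(wanted).items():
    print(f"{name}: {version or 'not installed'}")
```

Running this after the install cell should list a version for every package; a `not installed` entry suggests the install targeted a different environment than the one the kernel is using.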


In [None]:

# If running locally and already installed, you can skip this.
# In Colab, keep this cell.
# Note: remove the `-q` if you want to see full logs.
# The extras spec is quoted so shells such as zsh do not expand the brackets.
!pip install -q "gymnasium[classic-control]==0.29.1" stable-baselines3==2.3.2 tensorboard matplotlib numpy



### Why this step?
Import DQN components and create the environment.

### How it works
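- `DQN` is SB3's implementation of the algorithm; it manages the Q-network, the target network, and the replay buffer.  
- `evaluate_policy` runs several episodes and reports the returns; with `deterministic=True` the agent acts greedily, without exploration.  
- `gym.make("CartPole-v1")` creates the environment: a 4-dimensional observation, 2 discrete actions, and episodes capped at 500 steps.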


In [None]:

import gymnasium as gym
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("CartPole-v1")
env



### Why this step?
Instantiate the DQN model and set important hyperparameters.

### How it works
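- `MlpPolicy`: a small feed-forward network that maps a state to one Q-value per discrete action.  
- `buffer_size`: replay buffer capacity, in transitions.  
- `exploration_fraction` & `exploration_final_eps`: ε is annealed linearly down to `exploration_final_eps` over the first `exploration_fraction` of training (here, the first 10% of timesteps, ending at 0.02), then held constant.  
- `tau` & `target_update_interval`: with `tau=1.0`, the target network is overwritten with a full copy of the online network every `target_update_interval` environment steps.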
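The ε schedule can be sketched in plain Python. This is a minimal illustration rather than SB3's internal code: `linear_epsilon` is a hypothetical helper, the starting value of 1.0 assumes SB3's default `exploration_initial_eps`, and SB3's exact step-to-progress bookkeeping may differ slightly.

```python
def linear_epsilon(step, total_timesteps, exploration_fraction=0.1,
                   initial_eps=1.0, final_eps=0.02):
    """Linearly anneal epsilon from initial_eps to final_eps, then hold it."""
    # Number of steps over which epsilon decays (assumed > 0).
    anneal_steps = exploration_fraction * total_timesteps
    # Fraction of the decay completed so far, clipped to 1.0 once it is over.
    progress = min(step / anneal_steps, 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


# With 100_000 total timesteps, the decay covers the first 10_000 steps.
for step in (0, 5_000, 10_000, 50_000):
    print(step, round(linear_epsilon(step, 100_000), 3))
```

A short `exploration_fraction` means the agent acts almost greedily for most of training, which tends to suit simple tasks like CartPole; harder exploration problems may need a longer decay.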


In [None]:

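model = DQN(
    policy="MlpPolicy",
    env=env,
    learning_rate=1e-3,
    buffer_size=50_000,           # replay buffer capacity, in transitions
    learning_starts=1000,         # steps collected before the first gradient update
    batch_size=64,                # transitions sampled per gradient step
    tau=1.0,                      # 1.0 = hard target update (full weight copy)
    gamma=0.99,                   # discount factor
    train_freq=4,                 # train every 4 environment steps
    gradient_steps=1,             # gradient steps per training round
    target_update_interval=1000,  # environment steps between target updates
    exploration_fraction=0.1,     # fraction of training spent annealing epsilon
    exploration_final_eps=0.02,   # epsilon after annealing
    verbose=1,
    seed=42,
)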
model



### Why this step?
Train the agent for a fixed number of timesteps.

### How it works
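DQN interacts with the environment, stores each transition in the replay buffer, samples mini-batches from the buffer to take gradient steps on the Q-network, and periodically copies the Q-network's weights into the target network.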
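The pieces of that loop can be illustrated without any deep-learning library. The sketch below is a simplified stand-in, not SB3's implementation: `ReplayBuffer`, `td_target`, and `is_target_update_step` are hypothetical names, and Q-values are plain Python lists instead of network outputs.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (obs, action, reward, next_obs, done) tuples."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done):
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size, rng):
        # Uniform sampling without replacement breaks temporal correlation.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def td_target(reward, next_q_values, done, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'); no bootstrapping if terminal."""
    return reward + (0.0 if done else gamma * max(next_q_values))


def is_target_update_step(step, interval=1000):
    """True on the steps where the target network would be refreshed."""
    return step % interval == 0


rng = random.Random(0)
buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.add(obs=i, action=0, reward=1.0, next_obs=i + 1, done=(i == 4))

batch = buffer.sample(2, rng)
print(len(buffer), [t[0] for t in buffer.storage])
print(td_target(1.0, [0.5, 2.0], done=False))
```

The target is computed from the target network's Q-values (here just the list `next_q_values`), which is why refreshing that network only every `interval` steps keeps the regression target from shifting on every gradient step.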


In [None]:

model.learn(total_timesteps=100_000, log_interval=10)



### Why this step?
Evaluate the trained policy to estimate performance.

### How it works
`evaluate_policy` runs deterministic rollouts and returns a tuple: the mean and the standard deviation of the per-episode return.
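The aggregation itself is simple. As a rough sketch, assuming the standard deviation is the population form (as `numpy.std` computes by default), the two numbers could be reproduced from a list of episode returns like this; `summarize_returns` is a hypothetical helper, not an SB3 function.

```python
import statistics


def summarize_returns(episode_returns):
    """Return (mean, population standard deviation) of per-episode returns."""
    mean = statistics.fmean(episode_returns)
    std = statistics.pstdev(episode_returns)
    return mean, std


# Made-up returns for illustration only, not results from this notebook.
example_returns = [1.0, 3.0]
mean, std = summarize_returns(example_returns)
print(f"{mean:.2f} ± {std:.2f}")
```

With only 10 evaluation episodes, the standard deviation is a coarse estimate; increasing `n_eval_episodes` gives a steadier picture.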


In [None]:

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)
print(f"Evaluation — Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")



### Why this step?
Save and reload the model, then run an episode with **deterministic** actions.

### How it works
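- `model.save` writes the model's parameters and hyperparameters to `dqn_cartpole.zip`.  
- `DQN.load` rebuilds the model from that file; passing `env` attaches an environment to it.  
- `predict(..., deterministic=True)` always returns the greedy (highest-Q) action instead of occasionally taking a random ε-greedy one.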


In [None]:

model.save("dqn_cartpole")
del model
model = DQN.load("dqn_cartpole", env=env)

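obs, _ = env.reset(seed=123)
total_reward = 0.0
# CartPole-v1 truncates at 500 steps, so 1000 is only a safety cap.
for t in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    # predict returns a NumPy value; convert it to a plain int for the env.
    obs, reward, terminated, truncated, info = env.step(int(action))
    total_reward += reward
    if terminated or truncated:
        break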

print(f"Episode return (deterministic): {total_reward}")
