# 🧠 Reinforcement Learning Assignment: Q-Learning vs Deep Q-Network (DQN)

In this assignment, you will experiment with two reinforcement learning methods—**Q-learning** and **Deep Q-Networks (DQN)**—on the `CartPole-v1` environment.

### 🎯 Objectives:
- Understand the differences between tabular Q-learning and neural network-based DQN.
- Adjust hyperparameters for both models and observe how they impact performance.
- Compare the number of steps each model survives per episode.

### 📌 Instructions:
1. Run the provided Q-learning code and tune parameters.
2. Run the DQN code using `Stable-Baselines3` and tune parameters.
3. Record average episode rewards and compare both models.


## 🔢 Q-learning Implementation (Discrete State Approximation)

In [3]:
%pip install gymnasium stable-baselines3

Collecting stable-baselines3
  Downloading stable_baselines3-2.6.0-py3-none-any.whl.metadata (4.8 kB)
Downloading stable_baselines3-2.6.0-py3-none-any.whl (184 kB)
Installing collected packages: stable-baselines3
Successfully installed stable-baselines3-2.6.0
Note: you may need to restart the kernel to use updated packages.


In [5]:
import gymnasium as gym
import numpy as np

env = gym.make('CartPole-v1')

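def discretize(obs, bins):
    # np.digitize returns 0..len(edges); clip so every index fits the Q-table axis.
    return tuple(int(np.clip(np.digitize(x, bins[i]), 0, len(bins[i]) - 1))
                 for i, x in enumerate(obs))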

n_bins = 6
bins = [np.linspace(-4.8, 4.8, n_bins), np.linspace(-5, 5, n_bins),
        np.linspace(-0.418, 0.418, n_bins), np.linspace(-5, 5, n_bins)]
q_table = np.zeros([n_bins] * 4 + [env.action_space.n])

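alpha = 0.1            # learning rate for the Q-table update
gamma = 0.99           # discount factor for future rewards
epsilon = 1.0          # initial exploration rate
epsilon_decay = 0.995  # multiplicative decay applied after each episode
epsilon_min = 0.01     # lower bound on epsilon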

episodes = 500
steps_per_episode = []

for ep in range(episodes):
    state, _ = env.reset()
    state = discretize(state, bins)
    done = False
    total_steps = 0

    while not done:
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_discrete = discretize(next_state, bins)
        best_next = np.max(q_table[next_discrete])
        q_table[state][action] += alpha * (reward + gamma * best_next - q_table[state][action])

        state = next_discrete
        total_steps += 1

    epsilon = max(epsilon_min, epsilon * epsilon_decay)
    steps_per_episode.append(total_steps)

print(f"Q-learning average steps: {np.mean(steps_per_episode):.2f}")

Q-learning average steps: 15.43
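
One detail worth checking in the update rule above: `best_next` is added to the target even on the step where the pole falls. In the standard Q-learning target, a terminal state has no future return, so the bootstrap term is dropped when `terminated` is true (and kept when the episode is only cut off by the time limit). The following is a minimal sketch of that variant, not the code that produced the number above; `td_update` is a hypothetical helper name, and the tiny two-state table exists only to make the example runnable.

```python
import numpy as np

def td_update(q_table, state, action, reward, next_state, terminated,
              alpha=0.1, gamma=0.99):
    """One tabular Q-learning step that does not bootstrap from terminal states."""
    # A terminal state has no future return, so its value is treated as 0.
    best_next = 0.0 if terminated else np.max(q_table[next_state])
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]

# Toy table: 2 states (indexed by 1-tuples, like the discretized states), 2 actions.
q = np.zeros((2, 2))
q[1] = [5.0, 3.0]

non_terminal = td_update(q, (0,), 0, 1.0, (1,), terminated=False)
terminal = td_update(q, (0,), 1, 1.0, (1,), terminated=True)
print(non_terminal, terminal)
```

In the toy example the non-terminal update moves toward `1 + 0.99 * 5`, while the terminal update moves toward the reward alone.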


## 🧠 Deep Q-Network using Stable-Baselines3

The `DQN` class from **Stable-Baselines3** has several parameters you can adjust to control the training behavior of the Deep Q-Network. Here's a breakdown of the most important ones:

---

## 🧠 **Key Parameters of `DQN` in Stable-Baselines3**

### 🎯 Core Parameters

| Parameter                 | Description                                                    | Typical Values               |
| ------------------------- | -------------------------------------------------------------- | ---------------------------- |
| `policy`                  | Policy class (required); `"MlpPolicy"` for vector inputs.      | `"MlpPolicy"`, `"CnnPolicy"` |
| `env`                     | The Gymnasium environment to train on.                         | `gym.make("CartPole-v1")`    |
| `learning_rate`           | Step size for the optimizer.                                   | `1e-4` to `1e-3`             |
| `buffer_size`             | Replay buffer size (how many experiences to store).            | `10_000` to `100_000`        |
| `learning_starts`         | Timesteps before training starts (buffer warm-up).             | `1_000`                      |
| `batch_size`              | Size of mini-batches sampled from the buffer.                  | `32`, `64`, `128`            |
| `tau`                     | Soft update coefficient for the target network.                | `1.0` (hard update)          |
| `gamma`                   | Discount factor for future rewards.                            | `0.95` to `0.99`             |
| `train_freq`              | Frequency of training the model.                               | `4` (every 4 steps)          |
| `target_update_interval`  | Steps between target network updates.                          | `100` to `1000`              |
| `exploration_fraction`    | Fraction of total timesteps over which epsilon decays.         | `0.1`                        |
| `exploration_initial_eps` | Initial epsilon for exploration.                               | `1.0`                        |
| `exploration_final_eps`   | Final epsilon after decay.                                     | `0.01` to `0.05`             |
| `max_grad_norm`           | Gradient clipping value.                                       | `10`                         |
| `verbose`                 | Verbosity level (0: silent, 1: info, 2: debug).                | `1`                          |

---

### 🧪 Example Usage

```python
from stable_baselines3 import DQN

model = DQN(
    "MlpPolicy",
    env,
    learning_rate=1e-4,
    buffer_size=50000,
    learning_starts=1000,
    batch_size=64,
    tau=1.0,
    gamma=0.99,
    train_freq=4,
    target_update_interval=500,
    exploration_fraction=0.1,
    exploration_initial_eps=1.0,
    exploration_final_eps=0.01,
    verbose=1
)
```
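
To make `tau` concrete: a target network is typically blended toward the online network as `target = tau * online + (1 - tau) * target`, so `tau=1.0` is a full copy (a hard update) and smaller values move it only part of the way. The sketch below illustrates that rule on plain NumPy arrays standing in for network weights; `blend_target` is an illustrative helper written for this example, not the library's internal implementation.

```python
import numpy as np

def blend_target(online, target, tau):
    """Return the new target weights after one update with coefficient tau."""
    return tau * online + (1.0 - tau) * target

online = np.array([1.0, 2.0, 3.0])   # stand-in for online Q-network weights
target = np.zeros(3)                 # stand-in for target network weights

hard = blend_target(online, target, tau=1.0)   # full copy of the online weights
soft = blend_target(online, target, tau=0.1)   # moves 10% of the way
print(hard, soft)
```

With `tau=1.0`, how often this copy happens is governed by `target_update_interval`.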

---

### 📚 Optional Parameters

| Parameter         | Purpose                                             |
| ----------------- | --------------------------------------------------- |
| `tensorboard_log` | Path to log directory for TensorBoard visualization |
| `device`          | `"cpu"` or `"cuda"` to specify hardware             |
| `seed`            | Random seed for reproducibility                     |

 


In [4]:
import gymnasium as gym
from stable_baselines3 import DQN

env = gym.make('CartPole-v1')

model = DQN('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=50000)
model.save('dqn_cartpole')
print("✅ DQN training complete!")

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 22.8     |
|    ep_rew_mean      | 22.8     |
|    exploration_rate | 0.983    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 42602    |
|    time_elapsed     | 0        |
|    total_timesteps  | 91       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 21.4     |
|    ep_rew_mean      | 21.4     |
|    exploration_rate | 0.968    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 3521     |
|    time_elapsed     | 0        |
|    total_timesteps  | 171      |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.547    |
|    n_updates        | 17       |
-------------------------------

The output above is the **training log** from a `Stable-Baselines3` DQN model. It helps you monitor training progress and evaluate how well your agent is learning over time.

Here's how to interpret each section:

---

## 📦 **Section: `rollout/`**

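This shows the **behavior of the agent during the training episodes themselves** (collected with epsilon-greedy exploration), averaged over the most recent episodes. These are not separate evaluation runs.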

| Metric             | Meaning                                                                                                                                                                           |
| ------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ep_len_mean`      | 🕒 Average number of steps the agent survives per episode (i.e., how long it balances the pole). For example, `24.5` means the agent typically survives ~24 steps.               |
| `ep_rew_mean`      | 💰 Average reward per episode. In `CartPole-v1`, the reward is `+1` per timestep survived, so this equals `ep_len_mean`.                                                          |
| `exploration_rate` | 🎲 Current value of epsilon (ε), the probability of taking a random action. `0.05` means the agent explores 5% of the time and exploits (takes the greedy action) 95% of the time. |
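
The `exploration_rate` values in the log above (`0.983` at 91 timesteps, `0.968` at 171) are consistent with a linear decay from `exploration_initial_eps` to `exploration_final_eps` over the first `exploration_fraction` of training. The sketch below reproduces that arithmetic; `linear_epsilon` is a hypothetical helper, and the defaults in its signature (final epsilon `0.05`, fraction `0.1`) are assumed to match the library defaults used by the training cell.

```python
def linear_epsilon(step, total_timesteps, initial_eps=1.0, final_eps=0.05,
                   exploration_fraction=0.1):
    """Epsilon after `step` environment steps under a linear decay schedule."""
    decay_steps = exploration_fraction * total_timesteps
    progress = min(step / decay_steps, 1.0)  # 0 at the start, 1 once decay is done
    return initial_eps + progress * (final_eps - initial_eps)

for step in (91, 171, 5_000, 50_000):
    print(step, round(linear_epsilon(step, total_timesteps=50_000), 3))
```

Under these assumptions epsilon reaches its final value after 5,000 of the 50,000 timesteps, so the agent acts almost greedily for the remaining 90% of training.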

---

## ⏱️ **Section: `time/`**

This shows training duration and speed.

| Metric            | Meaning                                                                                                            |
| ----------------- | ------------------------------------------------------------------------------------------------------------------ |
| `episodes`        | 🎬 Total number of episodes completed so far.                                                                      |
| `fps`             | ⏩ Training speed in **frames per second**. `1094` means it's simulating 1,094 timesteps per second.                |
| `time_elapsed`    | ⏱️ Total wall-clock time elapsed (in seconds).                                                                     |
| `total_timesteps` | 🔁 Total number of environment steps taken. `49958` is very close to your `total_timesteps=50000` training target. |

---

## 🧠 **Section: `train/`**

This shows training-specific stats during updates to the model.

| Metric          | Meaning                                                                                                                                                            |
| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `learning_rate` | 🔧 Current learning rate (step size).                                                                                                                              |
| `loss`          | 📉 Temporal Difference (TD) loss: how far the Q-network's predictions are from their bootstrapped targets. A low value such as `0.0115` means the two agree closely, but it does not by itself show that the policy is good. |
| `n_updates`     | 🔁 Number of times the model has updated its weights using batches from the replay buffer.                                                                         |
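
As a rough illustration of what `loss` measures, the sketch below computes a TD loss for a small hand-made batch. The target is `r + gamma * max Q_target(s', a')`, with the bootstrap term dropped on terminal transitions, and the loss is a Huber (smooth L1) penalty on the gap between that target and `Q(s, a)`. All numbers are made up for the example, `huber` is a hand-written helper, and the choice of the Huber form is an assumption; some implementations use squared error instead.

```python
import numpy as np

def huber(error, delta=1.0):
    """Quadratic near zero, linear for |error| > delta."""
    abs_err = np.abs(error)
    return np.where(abs_err <= delta, 0.5 * error**2, delta * (abs_err - 0.5 * delta))

gamma = 0.99
q_sa = np.array([1.0, 2.0, 0.5])            # Q(s, a) for the actions actually taken
rewards = np.array([1.0, 1.0, 1.0])
next_q_target = np.array([[1.0, 2.0],       # target-network Q(s', .) per transition
                          [0.5, 0.2],
                          [3.0, 4.0]])
terminated = np.array([0.0, 0.0, 1.0])      # 1.0 marks a terminal transition

td_target = rewards + gamma * (1.0 - terminated) * next_q_target.max(axis=1)
loss = huber(td_target - q_sa).mean()
print(td_target, loss)
```

Because the target is built from the network's own estimates, this loss can shrink while the estimates themselves are still far from the true returns.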

---

## 🧪 Summary

* Your agent has **stopped exploring much** (`epsilon = 0.05`).
* The **TD loss is low**, but that only shows the Q-network agrees with its own targets, not that the policy is good.
* The **average reward is still low (`24.5`)** against a maximum of 500 in `CartPole-v1`, so the agent hasn't learned to balance the pole yet.

  * Possible reasons:

    * Not enough training (`50k` steps may be too few)
    * Untuned hyperparameters (such as `learning_rate`, `buffer_size`, or `target_update_interval`)
    * The network architecture may need adjusting

 

## 📊 Evaluation
After running both training methods, evaluate their performance by calculating average steps per episode.

You can re-run the environment using the trained models and compare results.

### ✍️ Compare:
- How does discretization impact Q-learning performance?
- How much better does DQN perform?
- What parameter changes made the biggest difference?


- How does discretization impact Q-learning performance?
    - Discretization groups similar states into one bin, so information needed for optimal decisions may be lost.
- How much better does DQN perform?
    - In evaluation, DQN averaged 24.80 steps versus 12.80 for Q-learning, about 1.94× as long (93.8% better). The gap is likely due mostly to how coarse the Q-learning bins are; increasing the number of bins can narrow it.
- What parameter changes made the biggest difference?
    - For Q-learning, the number of bins has the biggest impact, at the cost of longer training time. For DQN, the number of timesteps seems to have a large impact. Both suggest that more training typically results in better performance.
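
The information loss from coarse bins can be seen directly. The sketch below bins two pole angles with the same six-edge grid over ±0.418 rad used in the training cell, then with a finer grid. The angles `0.10` and `0.20` rad and the 12-edge grid over ±0.21 rad are illustrative choices, not values from the assignment.

```python
import numpy as np

coarse_edges = np.linspace(-0.418, 0.418, 6)   # grid from the training cell
fine_edges = np.linspace(-0.21, 0.21, 12)      # hypothetical finer grid

slightly_tilted, nearly_fallen = 0.10, 0.20    # pole angles in radians

coarse = (np.digitize(slightly_tilted, coarse_edges),
          np.digitize(nearly_fallen, coarse_edges))
fine = (np.digitize(slightly_tilted, fine_edges),
        np.digitize(nearly_fallen, fine_edges))

print("coarse bins:", coarse)   # both angles fall into the same bin
print("fine bins:", fine)       # the two angles are now distinguishable
```

In `CartPole-v1` the episode ends once the pole angle exceeds roughly 0.21 rad (12°), so most of the ±0.418 grid is never visited and the usable resolution is lower than `n_bins` suggests.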

In [None]:
# Evaluate Q-Learning
def evaluate_qlearning(q_table, bins, episodes=100):
    env = gym.make('CartPole-v1')
    total_steps = []
    
    for _ in range(episodes):
        state, _ = env.reset()
        state = discretize(state, bins)
        steps = 0
        done = False
        
        while not done:
            action = np.argmax(q_table[state])  # No exploration during evaluation
            next_state, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            state = discretize(next_state, bins)
            steps += 1
            
        total_steps.append(steps)
    
    return np.mean(total_steps)

# Evaluate DQN
def evaluate_dqn(model, episodes=100):
    env = gym.make('CartPole-v1')
    total_steps = []
    
    for _ in range(episodes):
        obs, _ = env.reset()
        steps = 0
        done = False
        
        while not done:
            action, _ = model.predict(obs, deterministic=True)  # No exploration
            obs, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            steps += 1
            
        total_steps.append(steps)
    
    return np.mean(total_steps)

q_learning_avg = evaluate_qlearning(q_table, bins)
dqn_avg = evaluate_dqn(model)

print(f"Q-Learning average steps: {q_learning_avg:.2f}")
print(f"DQN average steps: {dqn_avg:.2f}")
# Relative improvement is (ratio - 1) * 100, so a ratio of 2.0 reads as "100% better".
print(f"DQN improvement: {(dqn_avg / q_learning_avg - 1) * 100:.1f}% better")

Q-Learning average steps: 12.80
DQN average steps: 24.80
DQN improvement: 93.8% better
