# Deep Q-Learning (DQN) – Explained

## Online Reinforcement Learning

Online RL refers to learning that happens while the agent interacts with the environment, in real time. In other words:

- The agent updates its model or policy immediately after each new experience.
- No full dataset is available in advance (unlike supervised learning).
- The environment is often stochastic and partially unknown.

DQN is a classic example of online RL, because it:
- Learns from new interactions step-by-step.
- Continuously updates its estimates of action values \( Q(s, a) \).
- Balances exploration (trying new things) and exploitation (using what it knows).

---

## What is Deep Q-Learning?

Deep Q-Learning (DQN) is an online RL algorithm that combines:
- Q-learning: a value-based method that learns the action-value function \( Q(s, a) \)
- A deep neural network: approximates that function when the state space is too large or continuous to store in a table

---

## Objective of DQN

The goal is to learn the optimal action-value function, which satisfies the Bellman optimality equation:

$$
Q^*(s, a) = \mathbb{E}\left[ r + \gamma \cdot \max_{a'} Q^*(s', a') \mid s, a \right]
$$

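Where:
- $r$: reward received after taking action $a$ in state $s$
- $\gamma$: discount factor, which weights future rewards against immediate ones
- $s'$: next state
- $a'$: candidate action in the next state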
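
To make the fixed-point equation concrete, the sketch below repeatedly applies its right-hand side to a tiny deterministic MDP until the values stop changing. The states, actions, rewards, and the name `q_value_iteration` are invented for illustration and are not part of the original notebook. Because the transitions are deterministic, the expectation reduces to a single term.

```python
# Hypothetical toy MDP (not from the source): deterministic transitions.
# TRANSITIONS[(state, action)] = (reward, next_state); next_state None means terminal.
GAMMA = 0.9
ACTIONS = ("left", "right")
TRANSITIONS = {
    ("s0", "left"): (0.0, "s0"),
    ("s0", "right"): (1.0, "s1"),
    ("s1", "left"): (0.0, "s0"),
    ("s1", "right"): (10.0, None),
}


def q_value_iteration(transitions, actions, gamma, sweeps=200):
    """Apply Q(s, a) <- r + gamma * max_a' Q(s', a') for a fixed number of sweeps."""
    q = {key: 0.0 for key in transitions}
    for _ in range(sweeps):
        new_q = {}
        for (s, a), (r, s_next) in transitions.items():
            # A terminal transition has no future value to bootstrap from.
            future = 0.0 if s_next is None else max(q[(s_next, b)] for b in actions)
            new_q[(s, a)] = r + gamma * future
        q = new_q
    return q


q_star = q_value_iteration(TRANSITIONS, ACTIONS, GAMMA)
for key in sorted(q_star):
    print(key, round(q_star[key], 3))
```

In this toy problem moving right is optimal in both states, and the values for moving left equal the best value of `s0` scaled by $\gamma$.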

---

## Components of a DQN

### 1. Experience Replay

- Stores transitions: $ (s, a, r, s', \text{done}) $ in a buffer  
- Random mini-batches are sampled for training  
- Breaks correlations and improves stability
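
A minimal sketch of such a buffer, assuming a fixed capacity with oldest-first eviction and uniform sampling. The class name `ReplayBuffer`, its methods, and the capacity used here are illustrative choices, not the API of the `agents` module used later in this notebook.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s', done) transitions with uniform sampling."""

    def __init__(self, capacity, seed=None):
        # deque(maxlen=...) silently drops the oldest transition once full.
        self._memory = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self._memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement breaks the temporal ordering of transitions.
        return self._rng.sample(list(self._memory), batch_size)

    def __len__(self):
        return len(self._memory)


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.push(state=step, action=0, reward=1.0, next_state=step + 1, done=(step == 4))

batch = buffer.sample(batch_size=2)
print(len(buffer), batch)
```

Only the three most recent transitions survive in this example, so every sampled state is one of 2, 3, or 4.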


### 2. Target Network

- A copy of the Q-network, used to compute learning targets:

$$
y = r + \gamma \cdot \max_{a'} Q_{\text{target}}(s', a')
$$

- Updated less frequently than the Q-network (for example, by copying its weights every fixed number of steps) to keep targets stable
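
The effect of a frozen target copy can be sketched with lookup tables standing in for the two networks. Everything here is a simplifying assumption: the tables, the helper `td_targets`, and the hard sync via `copy.deepcopy`. The `done` masking (no bootstrap term on terminal transitions) is a common convention that the formula above does not show explicitly.

```python
import copy

GAMMA = 0.99  # illustrative discount factor


def td_targets(batch, q_target, gamma=GAMMA):
    """y = r + gamma * max_a' Q_target(s', a'); the bootstrap term is dropped when done."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else max(q_target[next_state])
        targets.append(reward + gamma * bootstrap)
    return targets


# Stand-in "networks": tables mapping state -> list of Q-values, one per action.
q_online = {"s0": [0.0, 1.0], "s1": [2.0, 4.0]}
q_target = copy.deepcopy(q_online)  # frozen copy used only for targets

batch = [("s0", 1, 1.0, "s1", False), ("s1", 0, 5.0, "s0", True)]

# Changing the online table does not move the targets until the next sync.
q_online["s1"][1] = 100.0
targets_before_sync = td_targets(batch, q_target)

q_target = copy.deepcopy(q_online)  # periodic hard sync
targets_after_sync = td_targets(batch, q_target)

print(targets_before_sync, targets_after_sync)
```

Between syncs the targets stay fixed even though the online estimates change, which is the stabilizing effect described above.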

### 3. Epsilon-Greedy Exploration

- With probability $ \varepsilon $, take a random action (exploration)  
- Otherwise, select the action that maximizes $ Q(s, a) $ (exploitation)  
- $ \varepsilon $ decays over time to shift from exploring to exploiting
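
The three rules above can be sketched as follows. The multiplicative decay schedule, the decay rate, and the floor `epsilon_min` are illustrative assumptions. The original does not say which schedule or values its agent uses.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the argmax of q_values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


def decay_epsilon(epsilon, decay=0.995, epsilon_min=0.05):
    """Shrink epsilon multiplicatively, but never below epsilon_min (assumed schedule)."""
    return max(epsilon_min, epsilon * decay)


rng = random.Random(0)
greedy_action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0, rng=rng)  # always exploits
random_action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=1.0, rng=rng)  # always explores

epsilon = 1.0
for _ in range(1000):
    epsilon = decay_epsilon(epsilon)

print(greedy_action, random_action, epsilon)
```

Keeping a floor on $ \varepsilon $ means the agent never stops exploring entirely, which is a design choice rather than a requirement of the method.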

---

## Training with Bellman Loss

The network is trained using the Bellman loss:

$$
\text{Loss} = \left( Q(s, a) - \left[ r + \gamma \cdot \max_{a'} Q_{\text{target}}(s', a') \right] \right)^2
$$

This is the squared difference between the predicted Q-value and the bootstrapped target computed with the target network, i.e. the squared Bellman (temporal-difference) error.
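
As a worked example of this loss, the sketch below evaluates it on a two-transition mini-batch, with lookup tables standing in for the two networks. The numbers, the helper name `bellman_loss`, averaging over the batch, and dropping the bootstrap term on terminal transitions are all assumptions made for illustration.

```python
def bellman_loss(batch, q_online, q_target, gamma):
    """Mean squared difference between Q(s, a) and r + gamma * max_a' Q_target(s', a')."""
    squared_errors = []
    for state, action, reward, next_state, done in batch:
        # Terminal transitions have no next state to bootstrap from.
        bootstrap = 0.0 if done else max(q_target[next_state])
        target = reward + gamma * bootstrap
        prediction = q_online[state][action]
        squared_errors.append((prediction - target) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Stand-in "networks": tables mapping state -> list of Q-values, one per action.
q_online = {"s0": [0.5, 2.0], "s1": [1.0, 3.0]}
q_target = {"s0": [0.0, 2.0], "s1": [1.0, 4.0]}

batch = [
    ("s0", 1, 1.0, "s1", False),  # target = 1 + 0.5 * 4 = 3, error = 2 - 3 = -1
    ("s1", 0, 2.0, "s0", True),   # terminal: target = 2, error = 1 - 2 = -1
]

loss = bellman_loss(batch, q_online, q_target, gamma=0.5)
print(loss)  # 1.0
```

In a real DQN the gradient of this quantity is taken only with respect to the online network's parameters, and the target is treated as a constant.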

---

## References

[1] Mnih, V., et al. (2015). *Human-level control through deep reinforcement learning*. Nature.  
[2] DQN PyTorch implementation by johnnycode8:  
https://github.com/johnnycode8/dqn_pytorch


```python
# modules

from agents import DQNAgent
import matplotlib.pyplot as plt
```

## CartPole

```python
# Train a DQN agent on CartPole-v1 for 100 episodes
agent = DQNAgent(env_name="CartPole-v1", hidden_dim=512)
rewards = agent.run(num_episodes=100, is_training=True, render=True)
print("Rewards per episode:", rewards)

# Plot the total reward collected in each episode and save the figure
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Rewards per Episode')
plt.savefig("rewards.png")
```

Episode 1 finished with reward: 38.0
Episode 2 finished with reward: 17.0
Episode 3 finished with reward: 14.0
Episode 4 finished with reward: 13.0
Episode 5 finished with reward: 30.0
Episode 6 finished with reward: 13.0
Episode 7 finished with reward: 19.0
Episode 8 finished with reward: 11.0
Episode 9 finished with reward: 10.0
Episode 10 finished with reward: 13.0
Episode 11 finished with reward: 10.0
Episode 12 finished with reward: 13.0
Episode 13 finished with reward: 11.0
Episode 14 finished with reward: 9.0
Episode 15 finished with reward: 12.0
Episode 16 finished with reward: 10.0
Episode 17 finished with reward: 10.0
Episode 18 finished with reward: 9.0
Episode 19 finished with reward: 9.0
Episode 20 finished with reward: 12.0
Episode 21 finished with reward: 10.0
Episode 22 finished with reward: 10.0
Episode 23 finished with reward: 14.0
Episode 24 finished with reward: 10.0
Episode 25 finished with reward: 19.0
Episode 26 finished with reward: 13.0
Episode 27 finished with