# Deep Q-Network (DQN) for CartPole-v1: A Comprehensive Implementation

## Abstract
This notebook presents a complete implementation of the Deep Q-Network (DQN) algorithm [1] for solving the CartPole-v1 control task from Gymnasium (the maintained successor to OpenAI Gym). We provide detailed theoretical foundations, mathematical formulations, and a practical implementation of key reinforcement learning concepts including experience replay, target networks, and ε-greedy exploration strategies. The implementation demonstrates how deep learning can be effectively combined with Q-learning to solve problems with continuous state spaces.

---

## I. Introduction

### A. Motivation
Traditional Q-learning algorithms maintain a tabular representation of state-action values, which becomes intractable for problems with large or continuous state spaces. The Deep Q-Network (DQN) addresses this limitation by using a deep neural network as a function approximator, enabling the agent to generalize across similar states.

### B. Problem Statement
The CartPole-v1 environment presents a classic control problem where an agent must balance a pole on a moving cart. The state space is continuous (4-dimensional), and the action space is discrete (2 actions: move left or right). The goal is to learn a policy that maximizes the cumulative reward by keeping the pole balanced as long as possible.

### C. Key Contributions of DQN [1]
1. **Experience Replay**: Breaking temporal correlations in training data
2. **Target Network**: Stabilizing the learning process
3. **Deep Neural Network**: Function approximation for continuous state spaces
4. **ε-greedy Exploration**: Balancing exploration and exploitation

---

## II. Theoretical Background

### A. Markov Decision Process (MDP)
The reinforcement learning problem is formalized as a Markov Decision Process defined by the tuple (S, A, P, R, γ):
- **S**: State space
- **A**: Action space  
- **P**: Transition probability P(s'|s,a)
- **R**: Reward function R(s,a,s')
- **γ**: Discount factor ∈ [0,1]

### B. Q-Learning Foundation
The optimal action-value function Q*(s,a) represents the expected cumulative discounted reward starting from state s, taking action a, and following the optimal policy thereafter:

**Q*(s,a) = E[R_t + γR_{t+1} + γ²R_{t+2} + ... | s_t=s, a_t=a]**

The Bellman optimality equation states:

**Q*(s,a) = E_{s'}[r + γ max_{a'} Q*(s',a') | s,a]**
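
To make the Bellman optimality equation concrete, the sketch below applies it repeatedly as an update rule on a tiny deterministic MDP. The two-state transition table, the rewards, and γ = 0.9 are hypothetical values chosen for illustration; they are unrelated to CartPole.

```python
# Hypothetical toy MDP: transitions[(state, action)] = (reward, next_state)
GAMMA = 0.9
transitions = {
    (0, 0): (0.0, 0),  # stay in state 0, no reward
    (0, 1): (1.0, 1),  # move to state 1, reward 1
    (1, 0): (0.0, 0),  # move back to state 0, no reward
    (1, 1): (2.0, 1),  # stay in state 1, reward 2
}

def q_value_iteration(transitions, gamma, sweeps=500):
    """Repeatedly apply Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    q = {sa: 0.0 for sa in transitions}
    actions = sorted({a for _, a in transitions})
    for _ in range(sweeps):
        q = {
            (s, a): r + gamma * max(q[(s_next, a2)] for a2 in actions)
            for (s, a), (r, s_next) in transitions.items()
        }
    return q

q_star = q_value_iteration(transitions, GAMMA)
# Fixed point: Q*(1,1) = 2 / (1 - 0.9) = 20 and Q*(0,1) = 1 + 0.9 * 20 = 19
```

Because the toy MDP is deterministic, the expectation over s' disappears and the fixed point can be checked by hand.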

### C. Deep Q-Network Architecture
DQN approximates Q*(s,a) using a neural network with parameters θ:

**Q(s,a;θ) ≈ Q*(s,a)**

The network takes a state s as input and outputs Q-values for all actions simultaneously.

### D. Key DQN Innovations

#### 1. Experience Replay
Store transitions (s_t, a_t, r_t, s_{t+1}) in replay memory D. During training, sample random mini-batches to:
- Break temporal correlations between consecutive samples
- Improve data efficiency through reuse of experiences
- Reduce variance in updates

#### 2. Target Network
Maintain two networks:
- **Q-network** (parameters θ): Updated every step
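- **Target network** (parameters θ⁻): Updated slowly. The original DQN [1] copies θ into θ⁻ every C steps (hard update); this notebook uses soft updates with rate τ instead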

This separation stabilizes training by providing consistent targets during learning.

#### 3. Loss Function
The temporal difference (TD) error is minimized using:

**L(θ) = E_{(s,a,r,s')~D}[(r + γ max_{a'} Q(s',a';θ⁻) - Q(s,a;θ))²]**

Where:
- r + γ max_{a'} Q(s',a';θ⁻) is the TD target (using target network)
- Q(s,a;θ) is the current prediction (using Q-network)
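
As a worked example of the loss above, the following sketch computes TD targets and the mean squared TD error for a two-transition mini-batch using plain Python lists. All numbers are made up, and the helper names `td_targets` and `squared_td_loss` are hypothetical; the notebook's actual training code uses PyTorch tensors.

```python
GAMMA = 0.99

def td_targets(rewards, next_q_target, dones, gamma=GAMMA):
    """y = r for terminal transitions, else r + gamma * max_a' Q(s', a'; theta^-)."""
    return [
        r if done else r + gamma * max(q_next)
        for r, q_next, done in zip(rewards, next_q_target, dones)
    ]

def squared_td_loss(q_taken, targets):
    """Mean of (y - Q(s, a; theta))^2 over the mini-batch."""
    return sum((y - q) ** 2 for q, y in zip(q_taken, targets)) / len(targets)

# Hypothetical mini-batch of two transitions (all numbers are made up).
rewards = [1.0, 1.0]
next_q_target = [[0.5, 2.0], [3.0, 1.0]]  # target-network Q-values at s'
dones = [False, True]                      # the second transition is terminal
q_taken = [2.5, 1.5]                       # Q(s, a; theta) for the actions taken

targets = td_targets(rewards, next_q_target, dones)  # [1 + 0.99 * 2.0, 1.0]
loss = squared_td_loss(q_taken, targets)
```

Note that the terminal transition ignores its next-state Q-values entirely, which matches the `y = r` branch of the pseudocode in Section III.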

---

## III. Algorithm Overview

### DQN Algorithm Pseudocode

```
Initialize replay memory D with capacity N
Initialize Q-network with random weights θ
Initialize target network with weights θ⁻ = θ

For episode = 1 to M:
    Initialize state s_1
    For t = 1 to T:
        Select action a_t:
            with probability ε: random action (exploration)
            otherwise: a_t = argmax_a Q(s_t, a; θ) (exploitation)
        
        Execute action a_t, observe reward r_t and next state s_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}) in D
        
        Sample random mini-batch of transitions from D
        For each transition (s, a, r, s'):
            if s' is terminal:
                y = r
            else:
                y = r + γ max_{a'} Q(s', a'; θ⁻)
        
        Perform gradient descent on (y - Q(s,a;θ))²
        
        Soft update target network: θ⁻ ← τθ + (1-τ)θ⁻
```

---

## IV. Implementation Details

### A. Environment Specifications
- **State Space**: 4-dimensional continuous (cart position, cart velocity, pole angle, pole angular velocity)
- **Action Space**: 2 discrete actions (push cart left or right)
- **Reward**: +1 for every timestep the pole remains upright
- **Episode Termination**: |pole angle| > 12° or |cart position| > 2.4 units; episodes are also truncated after 500 timesteps

### B. Network Architecture
- Input layer: 4 neurons (state dimensions)
- Hidden layer 1: 128 neurons with ReLU activation
- Hidden layer 2: 128 neurons with ReLU activation
- Output layer: 2 neurons (Q-values for each action, no activation)

### C. Hyperparameters
The following hyperparameters are used (justified in Section X):
- Batch size: 128
- Discount factor γ: 0.99
- Learning rate α: 1e-4
- Replay memory capacity: 10,000
- Target network update rate τ: 0.005
- Initial exploration ε: 0.9
- Final exploration ε: 0.05
- Exploration decay: 1000 steps

---

## V. References
[1] Mnih, V., et al. (2015). "Human-level control through deep reinforcement learning." Nature, 518(7540), 529-533.

[2] Sutton, R. S., & Barto, A. G. (2018). "Reinforcement learning: An introduction." MIT press.

[3] van Hasselt, H., Guez, A., & Silver, D. (2016). "Deep reinforcement learning with double Q-learning." AAAI.

---

# VI. Installations and Imports

In [None]:
!sudo apt-get update
!pip install 'imageio==2.4.0'
!sudo apt-get install -y xvfb ffmpeg
!pip3 install gymnasium[classic_control]

In [14]:
import math
import base64
import random
import imageio
import IPython
import matplotlib
import gymnasium as gym
from itertools import count
import matplotlib.pyplot as plt
from collections import namedtuple, deque

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# VII. Utility Functions for Rendering Environment

These helper functions allow us to:
1. Record agent performance as video
2. Embed videos in the notebook for visualization
3. Evaluate trained policies visually

In [15]:
def embed_mp4(filename):
    """Convert MP4 video to base64 and embed in HTML for notebook display"""
    # Read the file inside a context manager so the handle is closed promptly
    with open(filename, 'rb') as f:
        video = f.read()
    b64 = base64.b64encode(video)
    tag = '''
    <video width="640" height="480" controls>
    <source src="data:video/mp4;base64,{0}" type="video/mp4">
    Your browser does not support the video tag.
    </video>'''.format(b64.decode())
    
    return IPython.display.HTML(tag)

In [16]:
def create_policy_eval_video(env, policy, filename, num_episodes=1, fps=30):
    """Create a video of the agent following a given policy"""
    filename = filename + ".mp4"
    with imageio.get_writer(filename, fps=fps) as video:
        for _ in range(num_episodes):
            state, info = env.reset()
            video.append_data(env.render())
            while True:
                state = torch.from_numpy(state).unsqueeze(0).to(DEVICE)
                action = policy(state)
                state, reward, terminated, truncated, _ = env.step(action.item())
                video.append_data(env.render())
                # Stop on failure (terminated) or on the time limit (truncated)
                if terminated or truncated:
                    break
    return embed_mp4(filename)

# VIII. Experience Replay Memory and Q-Network Architecture

## A. Experience Replay: Theoretical Foundation

### 1. Motivation for Experience Replay
Standard online reinforcement learning suffers from:
- **Temporal Correlation**: Consecutive samples are highly correlated, violating the i.i.d. assumption
- **Sample Inefficiency**: Each experience is used once and discarded
- **Catastrophic Forgetting**: Rapid policy changes can erase previous learning

### 2. Experience Replay Mechanism
Store transitions τ = (s_t, a_t, r_t, s_{t+1}) in replay buffer D with capacity N. During training:
1. Store current transition: D ← D ∪ {τ_t}
2. Sample mini-batch: B ~ Uniform(D) where |B| = batch_size
3. Update network using samples from B

### 3. Mathematical Benefits
By sampling uniformly from D, we:
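- **Decorrelate samples**: a batch mixes transitions from many episodes and timesteps; two specific transitions land in the same batch with probability |B|(|B|−1) / (|D|(|D|−1)), which is small when |D| ≫ |B|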
- **Stabilize gradients**: Reduce variance in gradient estimates
- **Improve sample efficiency**: Each transition can be reused multiple times

**Expected gradient with replay**:
∇L(θ) = E_{(s,a,r,s')~Uniform(D)}[∇_θ(y - Q(s,a;θ))²]

where y = r + γ max_{a'} Q(s',a';θ⁻)

## B. Replay Memory Implementation Details

### Data Structure
- **Circular Buffer**: Uses collections.deque with maxlen for automatic FIFO replacement
- **Capacity**: 10,000 transitions (tunable based on memory constraints)
- **Sampling**: Random uniform sampling without replacement within each batch
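
The FIFO replacement and without-replacement sampling described above can be seen on a deliberately tiny buffer. This is a minimal sketch with a capacity of 3 and integer stand-ins for transitions, both chosen only for illustration.

```python
import random
from collections import deque

buffer = deque([], maxlen=3)       # capacity 3 instead of 10,000, for illustration
for transition_id in range(5):
    buffer.append(transition_id)   # once full, the oldest entry is dropped

contents = list(buffer)            # [2, 3, 4]: entries 0 and 1 were evicted
batch = random.sample(contents, 2) # two distinct entries, chosen uniformly
```

A `deque` with `maxlen` discards from the opposite end on every append once it is full, so no explicit eviction logic is needed.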

## C. Q-Network Architecture: Deep Neural Network Design

### 1. Architecture Specifications
The Q-network Q(s;θ) : R^n → R^m maps states to action values:

**Input Layer**: n = 4 neurons (state dimensionality)
↓
**Hidden Layer 1**: 128 neurons + ReLU
   Mathematically: h₁ = ReLU(W₁s + b₁)
↓
**Hidden Layer 2**: 128 neurons + ReLU
   Mathematically: h₂ = ReLU(W₂h₁ + b₂)
↓
**Output Layer**: m = 2 neurons (action dimensionality)
   Mathematically: Q(s) = W₃h₂ + b₃

### 2. Activation Function Choice

**ReLU (Rectified Linear Unit)**: f(x) = max(0, x)
- **Advantages**:
  * Mitigates vanishing gradient problem
  * Computationally efficient
  * Promotes sparse activation
  * Empirically effective for deep networks

**No activation on output**: Q-values can be any real number (unbounded)

### 3. Network Capacity
Total parameters:
- Layer 1: 4 × 128 + 128 = 640 parameters
- Layer 2: 128 × 128 + 128 = 16,512 parameters  
- Layer 3: 128 × 2 + 2 = 258 parameters
- **Total**: 17,410 trainable parameters

A network of this size is ample for CartPole's 4-dimensional state and remains cheap to train on a CPU.
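
The parameter count above can be reproduced with a few lines of arithmetic. The helper `count_mlp_parameters` is a hypothetical name introduced for this sketch; it simply sums weights and biases for each pair of adjacent layer widths.

```python
def count_mlp_parameters(layer_sizes):
    """Weights plus biases for fully connected layers of the given widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )

per_layer = [
    count_mlp_parameters([4, 128]),    # 640
    count_mlp_parameters([128, 128]),  # 16,512
    count_mlp_parameters([128, 2]),    # 258
]
total_params = count_mlp_parameters([4, 128, 128, 2])  # 17,410
```

Once the `DQN` class from this notebook is defined, the same total should be obtainable with `sum(p.numel() for p in DQN(4, 2).parameters())`.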

In [17]:
class ReplayMemory(object):
    """Experience Replay Memory with fixed capacity"""
    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, transition):
        """Save a transition"""
        self.memory.append(transition)

    def sample(self, batch_size):
        """Randomly sample a batch of transitions"""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

In [18]:
# Complete the Q-Network below. 
# The Q-Network takes a state as input and the output is a vector so that each element is the q-value for an action.

class DQN(nn.Module):
    """Deep Q-Network with 3 fully connected layers"""
    def __init__(self, n_observations, n_actions):
        super(DQN, self).__init__()
        # ==================================== Your Code (Begin) ====================================
        # Define a simple feedforward neural network with 3 layers
        # Architecture: Input -> 128 -> 128 -> Output
        self.layer1 = nn.Linear(n_observations, 128)
        self.layer2 = nn.Linear(128, 128)
        self.layer3 = nn.Linear(128, n_actions)
        # ==================================== Your Code (End) ====================================

    def forward(self, x):
        # ==================================== Your Code (Begin) ====================================
        # Forward pass through the network with ReLU activations
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        return self.layer3(x)  # No activation on output layer (Q-values can be any real number)
        # ==================================== Your Code (End) ====================================

# IX. Action Selection Policies: Exploration-Exploitation Trade-off

## A. The Exploration-Exploitation Dilemma

One of the fundamental challenges in reinforcement learning is balancing:
- **Exploitation**: Choosing actions known to yield high rewards
- **Exploration**: Trying new actions to discover potentially better strategies

This dilemma is formalized as the **multi-armed bandit problem** extended to sequential decision-making.

## B. Greedy Policy (Pure Exploitation)

### 1. Definition
The greedy policy deterministically selects the action with maximum Q-value:

**π_greedy(s) = argmax_{a∈A} Q(s,a;θ)**

### 2. Implementation Details
```python
with torch.no_grad():  # Disable gradient computation for efficiency
    action = qnet(state).argmax(dim=1).view(1, 1)  # Index of the largest Q-value, shape (1, 1)
```

### 3. Use Cases
- **Policy Evaluation**: Assessing learned policy performance
- **Testing**: Final deployment after training
- **Deterministic Behavior**: When exploration is undesirable

### 4. Limitations
- **No exploration**: Cannot discover better actions
- **Suboptimal convergence**: May converge to local optima
- **Sensitive to initialization**: Poor initial estimates persist

## C. ε-Greedy Policy (Exploration-Exploitation Balance)

### 1. Mathematical Formulation
The ε-greedy policy is defined as:

**π_ε(a|s) = {
    1-ε + ε/|A|,  if a = argmax_{a'} Q(s,a';θ)  (best action)
    ε/|A|,         otherwise                      (random actions)
}**

Simplified interpretation:
- Probability ε: select random action (uniform over A)
- Probability 1-ε: select best action
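
The two descriptions above are equivalent: choosing uniformly at random with probability ε still picks the best action a fraction 1/|A| of the time. The sketch below computes the resulting action probabilities; `epsilon_greedy_probs` is a hypothetical helper and the Q-values are made up.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return pi_eps(a|s) for each action given the Q-values of one state."""
    n_actions = len(q_values)
    # Index of the greedy action (ties resolve to the lowest index)
    best = max(range(n_actions), key=lambda a: q_values[a])
    probs = [epsilon / n_actions] * n_actions  # uniform exploration mass
    probs[best] += 1.0 - epsilon               # remaining mass on the greedy action
    return probs

probs = epsilon_greedy_probs([0.2, 1.3], epsilon=0.1)  # [0.05, 0.95]
```

With two actions and ε = 0.1, the greedy action is chosen with probability 1 − ε + ε/2 = 0.95, not 0.9.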

### 2. Epsilon Decay Schedule
To shift from exploration to exploitation over time, ε is decayed:

**ε(t) = ε_end + (ε_start - ε_end) × exp(-t/τ)**

Where:
- **ε_start = 0.9**: Initial exploration (90% random actions)
- **ε_end = 0.05**: Final exploration (5% random actions)
- **τ = 1000**: Decay constant in steps (`EPS_DECAY` in the code; not to be confused with the soft-update rate τ of the target network)
- **t**: Current timestep

### 3. Decay Behavior Analysis
The exponential decay ensures:
- **Rapid initial exploration**: High ε at start discovers state space
- **Gradual transition**: Smooth shift from exploration to exploitation
- **Minimal final exploration**: Small ε_end maintains robustness

**Half-life**: t_{1/2} = τ ln(2) ≈ 693 steps (time for the excess exploration ε(t) − ε_end to halve; ε itself does not drop to half its value in that time)
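
The schedule and its half-life can be checked numerically. This sketch reuses the constants quoted in this section and a hypothetical helper `epsilon_at`.

```python
import math

EPS_START, EPS_END, EPS_DECAY = 0.9, 0.05, 1000

def epsilon_at(t):
    """Exponential decay of epsilon from EPS_START toward EPS_END."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-t / EPS_DECAY)

half_life = EPS_DECAY * math.log(2)   # about 693 steps
eps_half = epsilon_at(half_life)      # 0.05 + 0.85 / 2 = 0.475
schedule = {t: round(epsilon_at(t), 3) for t in (0, 1000, 2000, 5000)}
# schedule -> {0: 0.9, 1000: 0.363, 2000: 0.165, 5000: 0.056}
```

After one half-life ε is 0.475 rather than 0.45, because only the part above ε_end decays.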

### 4. Theoretical Justification
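ε-greedy exploration is attractive for the following reasons, with one caveat:
- **Persistent exploration**: With ε > 0, every action keeps a nonzero probability in every visited state, which is one of the conditions under which tabular Q-learning converges
- **Caveat**: These tabular results do not carry over to neural-network function approximation, and ε-greedy does not by itself provide PAC (Probably Approximately Correct) sample-efficiency bounds
- **Computational efficiency**: Simple to implement and compute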

### 5. Implementation Considerations
```python
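sample = random.random()  # Uniform in [0, 1)
if sample > eps_threshold:
    with torch.no_grad():
        action = qnet(state).argmax(dim=1).view(1, 1)  # Exploitation
else:
    action = torch.tensor([[random.randrange(n_actions)]], dtype=torch.long)  # Exploration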
```

## D. Alternative Exploration Strategies (Not Implemented Here)

### 1. Boltzmann Exploration (Softmax)
**π(a|s) = exp(Q(s,a)/τ) / Σ_{a'} exp(Q(s,a')/τ)**
- Temperature τ controls randomness
- Favors high-value actions probabilistically
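
For comparison with ε-greedy, here is a minimal sketch of Boltzmann action selection in plain Python. It is illustrative only and is not used anywhere in this notebook; the function names and the example Q-values are hypothetical.

```python
import math
import random

def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values; subtracting the max keeps exp() numerically stable."""
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def boltzmann_action(q_values, temperature, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

cold = boltzmann_probs([1.0, 2.0], temperature=0.1)    # nearly greedy
hot = boltzmann_probs([1.0, 2.0], temperature=100.0)   # nearly uniform
action = boltzmann_action([1.0, 2.0], temperature=1.0)
```

A low temperature concentrates probability on the highest-valued action, while a high temperature approaches uniform random selection.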

### 2. Upper Confidence Bound (UCB)
Selects actions with highest upper confidence bound on value estimate

### 3. Thompson Sampling
Bayesian approach: sample from posterior distribution over Q-values

### 4. Noisy Networks
Add parametric noise to network weights for exploration

Now we define two policies. We use the greedy policy for evaluation and the ε-greedy policy during training.

In [19]:
# This function takes in a state and returns the best action according to your q-network.
# Don't forget "torch.no_grad()". We don't want gradient flowing through our network. 

# state shape: (1, state_size) -> output shape: (1, 1)  
def greedy_policy(qnet, state):
    # ==================================== Your Code (Begin) ====================================
    with torch.no_grad():
        # Get Q-values for all actions and select the action with maximum Q-value
        return qnet(state).max(1)[1].view(1, 1)
    # ==================================== Your Code (End) ====================================

In [20]:
# state shape: (1, state_size) -> output shape: (1, 1)
# Don't forget "torch.no_grad()". We don't want gradient flowing through our network.

def e_greedy_policy(qnet, state, current_timestep):
    """Epsilon-greedy action selection with exponential decay"""
    eps_threshold = EPS_END + (EPS_START - EPS_END) * math.exp(-1. * current_timestep / EPS_DECAY)
    # ==================================== Your Code (Begin) ====================================
    # With probability "eps_threshold" choose a random action 
    # and with probability 1-"eps_threshold" choose the best action according to your Q-Network.
    
    sample = random.random()
    if sample > eps_threshold:
        # Exploitation: choose best action
        with torch.no_grad():
            return qnet(state).max(1)[1].view(1, 1)
    else:
        # Exploration: choose random action
        return torch.tensor([[random.randrange(n_actions)]], device=device, dtype=torch.long)
    # ==================================== Your Code (End) ====================================

# X. Experimental Setup and Hyperparameter Configuration

## A. Hyperparameter Selection and Justification

### 1. BATCH_SIZE = 128
**Purpose**: Number of transitions sampled from replay buffer per update

**Trade-offs**:
- **Larger batches**: 
  * More stable gradient estimates (lower variance)
  * Better GPU utilization
  * More computation per update (this notebook performs one update per environment step regardless of batch size)
- **Smaller batches**:
  * Higher gradient variance
  * More frequent updates
  * May escape local minima better

**Justification**: 128 provides good balance for CartPole. Gradient variance is acceptable while maintaining computational efficiency.

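**Mathematical insight**: For independent samples, the variance of the mini-batch gradient estimate decreases as O(1/batch_size), so its standard deviation decreases as O(1/√batch_size)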

### 2. GAMMA (γ) = 0.99
**Purpose**: Discount factor for future rewards

**Interpretation**: Future reward T steps ahead is discounted by γ^T
- γ = 0: Only immediate rewards matter (myopic)
- γ → 1: All future rewards equally important (far-sighted)

**Effective horizon**: H_eff = 1/(1-γ) = 100 steps

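**Justification**: CartPole-v1 episodes last up to 500 steps. With γ=0.99 the agent still weighs consequences roughly 100 steps ahead, while the discounted return stays bounded by 1/(1-γ) = 100 for the +1-per-step reward, which keeps the regression targets on a moderate scale.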

**Effect on learning**: Higher γ increases variance in value estimates but enables long-term planning.
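
The effective-horizon figure can be verified with a few lines of arithmetic. This is a small numerical check, assuming the +1-per-step reward of CartPole and a full 500-step episode.

```python
GAMMA = 0.99

effective_horizon = 1.0 / (1.0 - GAMMA)   # 100 steps
weight_at_100 = GAMMA ** 100              # about 0.366: a reward 100 steps ahead
# Discounted return of a 500-step episode with reward +1 at every step
discounted_return_500 = sum(GAMMA ** t for t in range(500))
```

The discounted return of a full episode comes out just under 100, so the Q-values the network has to represent stay well below the undiscounted return of 500.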

### 3. Exploration Parameters
**EPS_START = 0.9**: Initial exploration rate (90%)
- **Justification**: High initial exploration ensures comprehensive state space coverage
- **Critical for**: Discovering successful strategies early in training

**EPS_END = 0.05**: Final exploration rate (5%)
- **Justification**: Maintains minimal exploration to adapt to environment changes
- **Prevents**: Complete elimination of exploration (robustness)

**EPS_DECAY = 1000**: Decay time constant
- **Justification**: ε falls to about 0.36 at t = 1,000 and about 0.17 at t = 2,000; how many episodes that covers depends on episode length
- **Balances**: Sufficient exploration vs. timely exploitation

### 4. TAU (τ) = 0.005
**Purpose**: Soft update rate for target network

**Update rule**: θ⁻ ← τθ + (1-τ)θ⁻

**Effective time constant**: T_eff = 1/τ = 200 updates

**Trade-offs**:
- **Small τ (slow updates)**: 
  * More stable learning
  * Target network lags behind
  * Prevents oscillations
- **Large τ (fast updates)**:
  * Rapid target adaptation  
  * Less stability
  * At τ=1 the target network equals the Q-network after every update, which is equivalent to having no target network

**Justification**: τ=0.005 provides stability while preventing excessive lag. Target network tracks Q-network over ~200 updates.
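
To see what a time constant of about 200 updates means, the sketch below applies the soft-update rule to scalar stand-ins for the network weights. Holding the online value fixed at 1.0 is a simplification made for illustration; in training the Q-network keeps moving.

```python
TAU = 0.005

def soft_update(target, online, tau=TAU):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return tau * online + (1.0 - tau) * target

target, online = 0.0, 1.0   # scalar stand-ins for network weights
for _ in range(200):        # 200 updates = 1 / TAU
    target = soft_update(target, online)

gap_closed = target         # equals 1 - (1 - TAU) ** 200, about 0.633
```

After 1/τ updates the target has closed roughly 63% of the gap, the usual behavior of an exponential moving average.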

### 5. LR (Learning Rate) = 1e-4
**Purpose**: Step size for gradient descent

**Adam optimizer adaptive learning**: Combines momentum and RMSprop
- Maintains per-parameter learning rates
- Adapts to gradient history
- More robust than SGD

**Trade-offs**:
- **High LR**: Faster initial learning, instability, overshooting
- **Low LR**: Stable convergence, slow learning, may get stuck

**Justification**: 1e-4 is conservative for DQN. Ensures stable learning in CartPole's relatively simple environment.

**Typical range**: [1e-5, 1e-3] for deep RL

### 6. Memory Capacity = 10,000
**Purpose**: Maximum transitions stored in replay buffer

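**Memory requirement**: 10,000 × (4 + 1 + 1 + 4) × 4 bytes = 400 KB of raw values, assuming 4 bytes per number (negligible; per-tensor object overhead adds to this but the total stays small)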

**Trade-offs**:
- **Larger capacity**:
  * More diverse experiences
  * Better decorrelation
  * Slower adaptation to policy changes
- **Smaller capacity**:
  * Less memory usage
  * Faster adaptation
  * Risk of overfitting to recent experiences

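**Justification**: 10,000 transitions correspond to roughly 20-50 episodes of 200-500 steps each, and to many more of the short episodes seen early in training. This provides sufficient diversity for CartPole while maintaining adaptation capability.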

## B. System Components

### 1. Environment: CartPole-v1
- **State**: [cart_position, cart_velocity, pole_angle, pole_angular_velocity]
- **Actions**: {0: Push left, 1: Push right}
- **Reward**: +1 per timestep
- **Terminated**: |angle| > 12° or |position| > 2.4
- **Truncated**: after 500 timesteps (a time limit, not a failure state)

### 2. Q-Network (Policy Network)
- **Role**: Current Q-value approximation
- **Updates**: Every timestep (if batch available)
- **Gradients**: Backpropagation from TD error

### 3. Target Network
- **Role**: Stable target for TD computation
- **Updates**: Soft updates with rate τ
- **Purpose**: Prevent moving target problem

### 4. Optimizer: Adam
- **Parameters**: β₁=0.9, β₂=0.999, ε=1e-8 (PyTorch defaults)
- **Advantages**: Adaptive learning rates, momentum, bias correction
- **Memory**: Maintains first and second moment estimates

### 5. Loss Function: Smooth L1 (Huber Loss)
**Definition** (x is the TD error; PyTorch's `nn.SmoothL1Loss` with its default `beta=1.0` corresponds to δ = 1):
L_δ(x) = {
  0.5x²,        if |x| ≤ δ
  δ(|x| - 0.5δ), if |x| > δ
}

**Advantages over MSE**:
- Less sensitive to outliers
- Quadratic for small errors (fast convergence)
- Linear for large errors (robustness)
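
The difference from a squared loss is easiest to see on one small and one large error. This is a plain-Python sketch of the definition above with δ = 1; the function names `huber` and `half_squared` are hypothetical.

```python
def huber(x, delta=1.0):
    """Quadratic for |x| <= delta, linear beyond it."""
    if abs(x) <= delta:
        return 0.5 * x * x
    return delta * (abs(x) - 0.5 * delta)

def half_squared(x):
    """Squared-error counterpart, scaled to match huber() for small errors."""
    return 0.5 * x * x

small = (huber(0.5), half_squared(0.5))     # identical: (0.125, 0.125)
large = (huber(10.0), half_squared(10.0))   # (9.5, 50.0)
```

For a TD error of 10 the Huber loss is more than five times smaller, and its gradient magnitude is capped at δ, which limits the influence of occasional large errors.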

## C. Computational Resources
- **Device**: CUDA GPU if available, else CPU
- **Training time**: ~2-5 minutes on CPU, ~30 seconds on GPU
- **Memory usage**: < 1 GB RAM

In [None]:
BATCH_SIZE = 128
GAMMA = 0.99
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 1000
TAU = 0.005
LR = 1e-4

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
env = gym.make("CartPole-v1", render_mode='rgb_array')
n_actions = env.action_space.n
state, info = env.reset()
n_observations = len(state)
q_network = DQN(n_observations, n_actions).to(device)
target_network = DQN(n_observations, n_actions).to(device)
target_network.load_state_dict(q_network.state_dict())  # Initialize target network with same weights
optimizer = optim.Adam(q_network.parameters(), lr=LR)
memory = ReplayMemory(10000)

print(f"Device: {device}")
print(f"State space: {n_observations}")
print(f"Action space: {n_actions}")
print("\nRandom agent before training:")
create_policy_eval_video(env, lambda s: greedy_policy(q_network, s), "random_agent")

# Training Loop

## DQN Algorithm Steps:

### For each episode:
1. **Reset environment** and get initial state
2. **Interact with environment**:
   - Select action using ε-greedy policy
   - Execute action and observe reward and next state
   - Store transition in replay memory
   
3. **Learn from experience**:
   - Sample random batch from replay memory
   - Compute predicted Q-values: Q(s,a) using q_network
   - Compute target Q-values: r + γ × max_a' Q(s',a') using target_network
   - Compute loss (e.g., Huber loss or MSE)
   - Update q_network via gradient descent
   
4. **Soft update target network**:
   - θ⁻ ← τθ + (1-τ)θ⁻
   - Slowly update target network to track q_network
   
5. **Track performance**:
   - Record episode duration and total reward

In [None]:
Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))

num_episodes = 200
episode_returns = []
episode_durations = []

for i_episode in range(num_episodes):

    # ==================================== Your Code (Begin) ====================================
    # 1. Start a new episode
    state, info = env.reset()
    state = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
    
    total_reward = 0
    t = 0
    
    for t in count():
        # 2. Run the environment for 1 step using e-greedy policy
        # The decay schedule expects the global step count: steps from finished episodes plus t
        action = e_greedy_policy(q_network, state, sum(episode_durations) + t)
        observation, reward, terminated, truncated, _ = env.step(action.item())
        reward = torch.tensor([reward], device=device)
        total_reward += reward.item()
        
        if terminated:
            next_state = None
        else:
            next_state = torch.tensor(observation, dtype=torch.float32, device=device).unsqueeze(0)
        
        # 3. Add the (state, action, next_state, reward) to replay memory
        memory.push(Transition(state, action, next_state, reward))
        
        # Move to next state
        state = next_state
        
        # 4. Optimize your q_network for 1 iteration
        if len(memory) >= BATCH_SIZE:
            # 4.1 Sample one batch from replay memory
            transitions = memory.sample(BATCH_SIZE)
            batch = Transition(*zip(*transitions))
            
            # Create masks for non-final states
            non_final_mask = torch.tensor(tuple(map(lambda s: s is not None, batch.next_state)), 
                                         device=device, dtype=torch.bool)
            non_final_next_states = torch.cat([s for s in batch.next_state if s is not None])
            
            state_batch = torch.cat(batch.state)
            action_batch = torch.cat(batch.action)
            reward_batch = torch.cat(batch.reward)
            
            # 4.2 Compute predicted state-action values using q_network
            # Q(s_t, a) - the model computes Q(s_t), then we select the columns of actions taken
            state_action_values = q_network(state_batch).gather(1, action_batch)
            
            # 4.3 Compute expected state-action values using target_network
            next_state_values = torch.zeros(BATCH_SIZE, device=device)
            with torch.no_grad():
                next_state_values[non_final_mask] = target_network(non_final_next_states).max(1)[0]
            # Compute the expected Q values: r + gamma * max_a' Q(s', a')
            expected_state_action_values = (next_state_values * GAMMA) + reward_batch
            
            # 4.4 Compute loss function and optimize q_network for 1 step
            # Huber loss is less sensitive to outliers than MSE
            criterion = nn.SmoothL1Loss()
            loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))
            
            # Optimize the model
            optimizer.zero_grad()
            loss.backward()
            # Gradient clipping to prevent exploding gradients
            torch.nn.utils.clip_grad_value_(q_network.parameters(), 100)
            optimizer.step()
        
        # 5. Soft update the weights of target_network after every step,
        # as in the pseudocode of Section III
        # θ⁻ ← τ θ + (1 − τ) θ⁻
        target_net_state_dict = target_network.state_dict()
        policy_net_state_dict = q_network.state_dict()
        for key in policy_net_state_dict:
            target_net_state_dict[key] = policy_net_state_dict[key]*TAU + target_net_state_dict[key]*(1-TAU)
        target_network.load_state_dict(target_net_state_dict)

        if terminated or truncated:
            episode_durations.append(t + 1)
            episode_returns.append(total_reward)
            break
    
    # Print progress every 10 episodes
    if (i_episode + 1) % 10 == 0:
        avg_duration = sum(episode_durations[-10:]) / 10
        avg_return = sum(episode_returns[-10:]) / 10
        print(f'Episode {i_episode+1}/{num_episodes} | Avg Duration: {avg_duration:.1f} | Avg Return: {avg_return:.1f}')

    # ==================================== Your Code (End) ====================================  

print('\nTraining Complete!')
print(f'Final Average Duration (last 10 episodes): {sum(episode_durations[-10:]) / 10:.1f}')
print(f'Final Average Return (last 10 episodes): {sum(episode_returns[-10:]) / 10:.1f}')

# Plot results
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(range(1, num_episodes+1), episode_durations)
plt.xlabel('Episode')
plt.ylabel('Duration')
plt.title('Episode Durations Over Training')
plt.grid(True)

plt.subplot(1, 2, 2)
plt.plot(range(1, num_episodes+1), episode_returns)
plt.xlabel('Episode')
plt.ylabel('Total Return')
plt.title('Episode Returns Over Training')
plt.grid(True)

plt.tight_layout()
plt.show()

# Evaluation

Now let's visualize how well our trained agent performs!
Compare this video to the random agent from before training.

In [None]:
# Render trained model
print("Trained agent performance:")
create_policy_eval_video(env, lambda s: greedy_policy(q_network, s), "trained_agent")

# Analysis and Discussion

## What We Learned:
1. **Experience Replay** helps break correlation between consecutive samples
2. **Target Network** provides stable learning targets
3. **ε-greedy exploration** balances exploration and exploitation
4. **Soft updates** of target network prevent instability

## Expected Results:
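- Gymnasium registers CartPole-v1 with a reward threshold of 475, averaged over 100 consecutive episodes; the often-quoted threshold of 195 applies to CartPole-v0, whose episodes are capped at 200 steps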
- The agent should learn to balance the pole for longer durations
- Training curves should show increasing episode durations
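
A solved criterion of this kind can be checked against the recorded returns with a moving average. The sketch below is one possible way to do it: the helper names are hypothetical, the returns are synthetic, and the threshold is a parameter (475 is the reward threshold Gymnasium registers for CartPole-v1).

```python
def moving_average(values, window=100):
    """Mean of the last `window` values for every index where a full window exists."""
    return [
        sum(values[i - window:i]) / window
        for i in range(window, len(values) + 1)
    ]

def first_solved_episode(returns, threshold, window=100):
    """1-based episode at which the windowed mean first reaches threshold, else None."""
    for offset, avg in enumerate(moving_average(returns, window)):
        if avg >= threshold:
            return offset + window
    return None

# Synthetic returns: 100 short episodes followed by 100 full-length ones.
fake_returns = [20.0] * 100 + [500.0] * 100
solved_at = first_solved_episode(fake_returns, threshold=475.0)
never_solved = first_solved_episode([20.0] * 150, threshold=475.0)
```

In the notebook, the same function could be applied to `episode_returns` after training; with only 200 training episodes, the 100-episode window leaves little room for the average to reach the threshold.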

## Further Improvements:
- **Double DQN**: Reduce overestimation bias
- **Dueling DQN**: Separate value and advantage streams
- **Prioritized Experience Replay**: Sample important transitions more frequently
- **Noisy Networks**: Add parametric noise for exploration

In [None]:
# Optional: Save the trained model
torch.save({
    'q_network_state_dict': q_network.state_dict(),
    'target_network_state_dict': target_network.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'episode_durations': episode_durations,
    'episode_returns': episode_returns,
}, 'dqn_cartpole.pth')

print("Model saved as 'dqn_cartpole.pth'")