# Reinforcement Learning Plan B - Part 3: From Tabular to Deep RL

This notebook bridges the gap between tabular RL methods and deep reinforcement learning. We'll explore function approximation, understand why neural networks are necessary for complex environments, and implement the foundation for deep RL algorithms.

**Learning Objectives:**
- Understand the limitations of tabular methods and the need for function approximation
- Master linear and neural network function approximation theory
- Understand the "deadly triad" and the instability it can cause
- Build neural network foundations for value function approximation
- Explore experience replay and target networks
- Transition from discrete to continuous state spaces

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from collections import defaultdict, deque
import pandas as pd
from typing import Tuple, Dict, List, Optional, Union
import time
import warnings
import random
import gym
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Set random seeds for reproducibility
np.random.seed(42)
random.seed(42)
torch.manual_seed(42)

# CPU optimization for MacBook Air M2
torch.set_num_threads(8)
device = torch.device('cpu')

print(f"PyTorch version: {torch.__version__}")
print(f"Device: {device}")
print(f"Torch threads: {torch.get_num_threads()}")
print("Environment setup complete!")

## 1. The Curse of Dimensionality and Need for Function Approximation

Tabular methods become impractical when state or action spaces grow large. **Function approximation** allows us to generalize across states using learned representations.

### Mathematical Framework

Instead of storing $V(s)$ for each state $s$, we approximate it with a parameterized function:
$$\hat{V}(s, \mathbf{w}) \approx V^\pi(s)$$

Where $\mathbf{w} \in \mathbb{R}^d$ are the **learnable parameters**.

### Types of Function Approximation

1. **Linear Function Approximation**:
   $$\hat{V}(s, \mathbf{w}) = \mathbf{w}^T \phi(s)$$
   Where $\phi(s)$ are **feature vectors** representing state $s$.

2. **Neural Network Approximation**:
   $$\hat{V}(s, \mathbf{w}) = f_{\text{neural}}(s; \mathbf{w})$$
   Where $f_{\text{neural}}$ is a neural network with parameters $\mathbf{w}$.

### Objective Function

We minimize the **Mean Squared Value Error** (MSVE):
$$\text{MSVE}(\mathbf{w}) = \sum_{s \in \mathcal{S}} d(s) [V^\pi(s) - \hat{V}(s, \mathbf{w})]^2$$

Where $d(s)$ is a state distribution (often the stationary distribution under policy $\pi$).
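
To make the objective concrete, here is a minimal sketch that evaluates the MSVE for a linear approximator on a three-state toy problem. The feature matrix `phi`, the "true" values `v_true`, and the distribution `d` are all made-up numbers chosen for illustration; they do not come from any environment in this notebook.

```python
import numpy as np

# Hypothetical 3-state problem (all numbers are illustrative assumptions)
phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])         # row s holds the feature vector phi(s)
v_true = np.array([1.0, 2.0, 4.0])   # assumed V^pi(s)
d = np.array([0.5, 0.3, 0.2])        # state distribution d(s), sums to 1

def msve(w: np.ndarray) -> float:
    """Weighted squared error between V^pi and the linear estimate phi @ w."""
    v_hat = phi @ w
    return float(np.sum(d * (v_true - v_hat) ** 2))

w = np.array([1.0, 1.0])
print(f"MSVE at w={w}: {msve(w):.3f}")
print(f"MSVE at w=0: {msve(np.zeros(2)):.3f}")
```

Because $d(s)$ weights each squared error, mistakes in rarely visited states contribute less to the objective than mistakes in frequently visited ones.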

### Gradient Descent Update

The gradient descent update rule is:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{1}{2}\alpha \nabla [V^\pi(S_t) - \hat{V}(S_t, \mathbf{w}_t)]^2$$
$$= \mathbf{w}_t + \alpha [V^\pi(S_t) - \hat{V}(S_t, \mathbf{w}_t)] \nabla \hat{V}(S_t, \mathbf{w}_t)$$
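
As a small worked example of the update rule, the sketch below repeatedly applies it to a single state with a linear approximator, for which $\nabla \hat{V}(s, \mathbf{w}) = \phi(s)$. The feature vector and the target value are assumptions made for illustration; in practice $V^\pi(S_t)$ is not known and has to be replaced by an estimate.

```python
import numpy as np

def gradient_step(w, phi_s, v_target, alpha):
    """One update: w <- w + alpha * (target - w.phi) * phi, for linear V-hat."""
    error = v_target - w @ phi_s
    return w + alpha * error * phi_s

w = np.zeros(2)
phi_s = np.array([1.0, 2.0])   # hypothetical feature vector for S_t
v_target = 3.0                 # assumed known V^pi(S_t), illustration only

errors = []
for _ in range(20):
    errors.append(abs(v_target - w @ phi_s))
    w = gradient_step(w, phi_s, v_target, alpha=0.1)

print(f"first errors: {errors[:3]}")
print(f"final prediction: {w @ phi_s:.6f}")
```

With these numbers each step multiplies the error by $1 - \alpha \lVert \phi \rVert^2 = 0.5$, so the prediction approaches the target geometrically.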

## 2. CartPole Environment: Continuous States

Let's implement the classic CartPole environment to demonstrate function approximation with continuous state spaces.

In [None]:
class CartPoleWrapper:
    """
    Wrapper for CartPole environment with additional functionality.
    
    CartPole has a 4-dimensional continuous state space:
    - Cart Position: episode terminates outside [-2.4, 2.4]
    - Cart Velocity: [-inf, inf]
    - Pole Angle: episode terminates outside [-0.2095, 0.2095] radians (~12 degrees)
    - Pole Angular Velocity: [-inf, inf]
    """
    
    def __init__(self, render_mode=None):
        try:
            # Try new gym API first
            self.env = gym.make('CartPole-v1', render_mode=render_mode)
        except Exception:
            # Fallback to old gym API, which does not accept render_mode
            self.env = gym.make('CartPole-v1')
        
        self.state_dim = 4
        self.action_dim = 2  # Left (0) or Right (1)
        
        # State bounds for normalization
        self.state_bounds = np.array([
            [-2.4, 2.4],      # Cart Position
            [-3.0, 3.0],      # Cart Velocity (clipped)
            [-0.21, 0.21],    # Pole Angle
            [-3.0, 3.0]       # Pole Angular Velocity (clipped)
        ])
    
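    def reset(self):
        """Reset environment and return initial state."""
        result = self.env.reset()
        if isinstance(result, tuple):
            # New gym API returns (state, info)
            state, _info = result
            return state
        # Old gym API returns the state directly
        return result
    
    def step(self, action):
        """Take action and return (state, reward, done, info)."""
        # Call step() exactly once: retrying inside an except block would
        # advance the environment a second time under the old API.
        result = self.env.step(action)
        if len(result) == 5:
            # New gym API: (state, reward, terminated, truncated, info)
            state, reward, terminated, truncated, info = result
            done = terminated or truncated
            return state, reward, done, info
        # Old gym API: (state, reward, done, info)
        return result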
    
    def normalize_state(self, state):
        """Normalize state to [-1, 1] range."""
        state = np.clip(state, self.state_bounds[:, 0], self.state_bounds[:, 1])
        normalized = 2 * (state - self.state_bounds[:, 0]) / (self.state_bounds[:, 1] - self.state_bounds[:, 0]) - 1
        return normalized
    
    def close(self):
        """Close the environment."""
        self.env.close()

# Create CartPole environment
cartpole_env = CartPoleWrapper()
print(f"CartPole Environment:")
print(f"  State dimension: {cartpole_env.state_dim}")
print(f"  Action dimension: {cartpole_env.action_dim}")
print(f"  State bounds: {cartpole_env.state_bounds}")

# Test the environment
state = cartpole_env.reset()
print(f"\nSample state: {state}")
print(f"Normalized: {cartpole_env.normalize_state(state)}")

# Run a short episode to see the dynamics
print("\nSample episode (first 10 steps):")
state = cartpole_env.reset()
for step in range(10):
    action = cartpole_env.env.action_space.sample()  # Random action
    next_state, reward, done, info = cartpole_env.step(action)
    print(f"Step {step}: Action={action}, Reward={reward:.1f}, Done={done}")
    print(f"  State: [{next_state[0]:.3f}, {next_state[1]:.3f}, {next_state[2]:.3f}, {next_state[3]:.3f}]")
    
    if done:
        print(f"  Episode ended at step {step + 1}")
        break
    state = next_state

## 3. Linear Function Approximation

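Linear function approximation is the simplest form of function approximation. With on-policy sampling and standard step-size conditions, semi-gradient TD(0) with linear features is guaranteed to converge, although to the TD fixed point rather than to the minimum-MSVE solution.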

### Mathematical Foundation

For linear approximation:
$$\hat{V}(s, \mathbf{w}) = \sum_{i=1}^d w_i \phi_i(s) = \mathbf{w}^T \phi(s)$$

The **gradient** is simply:
$$\nabla_\mathbf{w} \hat{V}(s, \mathbf{w}) = \phi(s)$$

### Semi-Gradient TD Update

Since we don't have access to the true value $V^\pi(S_t)$, we use the **TD target** as an approximation:
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha [R_{t+1} + \gamma \hat{V}(S_{t+1}, \mathbf{w}_t) - \hat{V}(S_t, \mathbf{w}_t)] \phi(S_t)$$

This is called **semi-gradient** because we don't differentiate through the target $\hat{V}(S_{t+1}, \mathbf{w}_t)$.
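
The update can be sketched on a tiny deterministic chain. This is an illustrative assumption rather than part of the CartPole experiments: two non-terminal states, a reward of 1 per step, and one-hot features (which makes the linear approximator equivalent to a table, so the exact values are recoverable).

```python
import numpy as np

gamma, alpha = 0.9, 0.1

# Hypothetical chain: s0 -> s1 -> terminal, reward 1 on every step
phi = np.eye(2)                      # one-hot features (tabular special case)
transitions = [(0, 1.0, 1, False),   # (s, r, s_next, done)
               (1, 1.0, None, True)]

w = np.zeros(2)
for _ in range(500):
    for s, r, s_next, done in transitions:
        v_next = 0.0 if done else w @ phi[s_next]
        td_error = r + gamma * v_next - w @ phi[s]
        # Semi-gradient: the target is treated as a constant, so only
        # grad V-hat(S_t) = phi(S_t) appears in the update.
        w += alpha * td_error * phi[s]

print(f"learned values: {w}")   # true values are V(s0) = 1.9, V(s1) = 1.0
```

The true values follow from the Bellman equation: $V(s_1) = 1$ and $V(s_0) = 1 + 0.9 \cdot 1 = 1.9$.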

### Feature Engineering

Good features $\phi(s)$ are crucial for linear approximation:
- **Polynomial features**: $[1, s_1, s_2, s_1^2, s_1 s_2, s_2^2, \ldots]$
- **RBF features**: Radial basis functions centered at different points
- **Fourier features**: Sine and cosine basis functions
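
As a minimal sketch of the Fourier option, the function below builds the cosine-only Fourier basis that is common in RL, $\phi_{\mathbf{c}}(s) = \cos(\pi\, \mathbf{c} \cdot s)$ with integer coefficient vectors $\mathbf{c} \in \{0, \ldots, n\}^d$. The function name `fourier_features` and the `order` parameter are hypothetical, and the code assumes the state has already been normalized to $[-1, 1]$ per dimension.

```python
import itertools
import numpy as np

def fourier_features(state: np.ndarray, order: int = 2) -> np.ndarray:
    """Cosine Fourier basis: phi_c(s) = cos(pi * c . s) for c in {0..order}^d.

    Assumes `state` is already normalized to [-1, 1] per dimension.
    """
    s = (np.asarray(state, dtype=float) + 1.0) / 2.0   # rescale to [0, 1]
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=len(s))))
    return np.cos(np.pi * (coeffs @ s))

features = fourier_features(np.array([0.1, -0.2, 0.05, 0.3]), order=2)
print(features.shape)   # (81,)
```

The number of features grows as $(n+1)^d$, so for a 4-dimensional state an order of 2 already yields 81 features. The all-zero coefficient vector produces a constant feature that plays the role of a bias term.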

In [None]:
class LinearFunctionApproximator:
    """
    Linear function approximator for value functions.
    
    Implements various feature extraction methods and semi-gradient TD learning.
    """
    
    def __init__(self, state_dim: int, feature_type: str = 'polynomial', 
                 degree: int = 2, n_rbf: int = 10):
        self.state_dim = state_dim
        self.feature_type = feature_type
        self.degree = degree
        self.n_rbf = n_rbf
        
        # Determine feature dimension
        if feature_type == 'polynomial':
            # Polynomial features up to given degree
            from itertools import combinations_with_replacement
            self.feature_dim = sum(len(list(combinations_with_replacement(range(state_dim), d))) 
                                 for d in range(degree + 1))
        elif feature_type == 'rbf':
            # RBF features
            self.feature_dim = n_rbf
            # Create random RBF centers
            self.rbf_centers = np.random.randn(n_rbf, state_dim)
            self.rbf_width = 1.0
        else:
            # Simple linear features (identity)
            self.feature_dim = state_dim
        
        # Initialize weights
        self.weights = np.zeros(self.feature_dim)
        
        print(f"Linear FA: {feature_type} features, dim: {state_dim} -> {self.feature_dim}")
    
    def extract_features(self, state: np.ndarray) -> np.ndarray:
        """Extract features from state."""
        if self.feature_type == 'polynomial':
            return self._polynomial_features(state)
        elif self.feature_type == 'rbf':
            return self._rbf_features(state)
        else:
            return state  # Linear features
    
    def _polynomial_features(self, state: np.ndarray) -> np.ndarray:
        """Generate polynomial features."""
        from itertools import combinations_with_replacement
        
        features = []
        
        for d in range(self.degree + 1):
            for indices in combinations_with_replacement(range(self.state_dim), d):
                if d == 0:
                    features.append(1.0)  # Bias term
                else:
                    feature_val = 1.0
                    for idx in indices:
                        feature_val *= state[idx]
                    features.append(feature_val)
        
        return np.array(features)
    
    def _rbf_features(self, state: np.ndarray) -> np.ndarray:
        """Generate RBF (Radial Basis Function) features."""
        features = []
        
        for center in self.rbf_centers:
            distance = np.linalg.norm(state - center)
            feature_val = np.exp(-distance**2 / (2 * self.rbf_width**2))
            features.append(feature_val)
        
        return np.array(features)
    
    def predict(self, state: np.ndarray) -> float:
        """Predict value for given state."""
        features = self.extract_features(state)
        return np.dot(self.weights, features)
    
    def update(self, state: np.ndarray, target: float, alpha: float = 0.01) -> float:
        """Update weights using semi-gradient descent."""
        features = self.extract_features(state)
        prediction = np.dot(self.weights, features)
        error = target - prediction
        
        # Semi-gradient update
        self.weights += alpha * error * features
        
        return abs(error)

# Test different feature types
test_state = np.array([0.1, -0.2, 0.05, 0.3])

print("\nTesting different feature extractors:")
print(f"Test state: {test_state}")

# Linear features
linear_fa = LinearFunctionApproximator(4, 'linear')
linear_features = linear_fa.extract_features(test_state)
print(f"\nLinear features ({len(linear_features)}): {linear_features}")

# Polynomial features
poly_fa = LinearFunctionApproximator(4, 'polynomial', degree=2)
poly_features = poly_fa.extract_features(test_state)
print(f"\nPolynomial features (degree 2, {len(poly_features)}): {poly_features[:10]}...")  # Show first 10

# RBF features
rbf_fa = LinearFunctionApproximator(4, 'rbf', n_rbf=8)
rbf_features = rbf_fa.extract_features(test_state)
print(f"\nRBF features ({len(rbf_features)}): {rbf_features}")

## 4. Semi-Gradient TD with Linear Function Approximation

Let's implement a complete semi-gradient TD agent for CartPole using linear function approximation.

In [None]:
class LinearTDAgent:
    """
    Semi-gradient TD agent with linear function approximation.
    
    Approximates the action-value function with one linear approximator
    per action, using any of the supported feature extraction methods.
    """
    
    def __init__(self, env, feature_type: str = 'polynomial', degree: int = 2, 
                 alpha: float = 0.01, gamma: float = 0.99, epsilon: float = 0.1):
        self.env = env
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        
        # Create separate function approximators for each action
        self.q_functions = {
            action: LinearFunctionApproximator(
                env.state_dim, feature_type, degree
            ) for action in range(env.action_dim)
        }
        
        # Learning statistics
        self.episode_rewards = []
        self.episode_lengths = []
        self.td_errors = []
        self.weights_history = []
    
    def choose_action(self, state: np.ndarray, training: bool = True) -> int:
        """Choose action using ε-greedy policy."""
        if training and np.random.random() < self.epsilon:
            return np.random.randint(self.env.action_dim)
        else:
            # Greedy action selection
            q_values = [self.q_functions[a].predict(state) for a in range(self.env.action_dim)]
            return int(np.argmax(q_values))
    
    def get_q_values(self, state: np.ndarray) -> np.ndarray:
        """Get Q-values for all actions in given state."""
        return np.array([self.q_functions[a].predict(state) for a in range(self.env.action_dim)])
    
    def train_episode(self) -> Tuple[float, int]:
        """Train for one episode using semi-gradient Q-learning (max over next-state actions)."""
        state = self.env.reset()
        state = self.env.normalize_state(state)
        
        episode_reward = 0
        episode_length = 0
        episode_td_errors = []
        
        while True:
            # Choose action
            action = self.choose_action(state, training=True)
            
            # Take action
            next_state, reward, done, _ = self.env.step(action)
            next_state = self.env.normalize_state(next_state)
            
            # Calculate the Q-learning TD target (no bootstrapping once the episode is done)
            if done:
                target = reward
            else:
                next_q_values = self.get_q_values(next_state)
                target = reward + self.gamma * np.max(next_q_values)
            
            # Update Q-function
            td_error = self.q_functions[action].update(state, target, self.alpha)
            episode_td_errors.append(td_error)
            
            episode_reward += reward
            episode_length += 1
            
            if done:
                break
            
            state = next_state
        
        # Store statistics
        self.episode_rewards.append(episode_reward)
        self.episode_lengths.append(episode_length)
        self.td_errors.extend(episode_td_errors)
        
        return episode_reward, episode_length
    
    def train(self, num_episodes: int = 1000, verbose: bool = True) -> None:
        """Train the agent for multiple episodes."""
        
        for episode in range(num_episodes):
            reward, length = self.train_episode()
            
            # Track weight evolution
            if episode % 100 == 0:
                weights_snapshot = {
                    action: fa.weights.copy() 
                    for action, fa in self.q_functions.items()
                }
                self.weights_history.append(weights_snapshot)
            
            # Decay epsilon
            if self.epsilon > 0.01:
                self.epsilon *= 0.995
            
            if episode % 100 == 0 and verbose:
                avg_reward = np.mean(self.episode_rewards[-100:]) if len(self.episode_rewards) >= 100 else np.mean(self.episode_rewards)
                avg_length = np.mean(self.episode_lengths[-100:]) if len(self.episode_lengths) >= 100 else np.mean(self.episode_lengths)
                avg_td_error = np.mean(self.td_errors[-1000:]) if len(self.td_errors) >= 1000 else np.mean(self.td_errors)
                print(f"Episode {episode}: Avg Reward = {avg_reward:.2f}, Avg Length = {avg_length:.1f}, "
                      f"TD Error = {avg_td_error:.4f}, ε = {self.epsilon:.4f}")
    
    def evaluate(self, num_episodes: int = 10) -> Tuple[float, float]:
        """Evaluate the learned policy."""
        rewards = []
        lengths = []
        
        for _ in range(num_episodes):
            state = self.env.reset()
            state = self.env.normalize_state(state)
            
            episode_reward = 0
            episode_length = 0
            
            while True:
                action = self.choose_action(state, training=False)
                next_state, reward, done, _ = self.env.step(action)
                next_state = self.env.normalize_state(next_state)
                
                episode_reward += reward
                episode_length += 1
                
                if done:
                    break
                
                state = next_state
            
            rewards.append(episode_reward)
            lengths.append(episode_length)
        
        return np.mean(rewards), np.mean(lengths)
    
    def plot_learning_curves(self) -> None:
        """Plot learning progress."""
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        
        # Episode rewards
        window = min(50, len(self.episode_rewards) // 10)
        if window > 1:
            smooth_rewards = pd.Series(self.episode_rewards).rolling(window=window).mean()
            axes[0, 0].plot(smooth_rewards, 'r-', linewidth=2, label=f'{window}-episode average')
        
        axes[0, 0].plot(self.episode_rewards, alpha=0.3, color='blue')
        axes[0, 0].set_xlabel('Episode')
        axes[0, 0].set_ylabel('Episode Reward')
        axes[0, 0].set_title('Learning Progress: Episode Rewards')
        if window > 1:
            axes[0, 0].legend()
        axes[0, 0].grid(True, alpha=0.3)
        
        # Episode lengths
        if window > 1:
            smooth_lengths = pd.Series(self.episode_lengths).rolling(window=window).mean()
            axes[0, 1].plot(smooth_lengths, 'r-', linewidth=2, label=f'{window}-episode average')
        
        axes[0, 1].plot(self.episode_lengths, alpha=0.3, color='green')
        axes[0, 1].set_xlabel('Episode')
        axes[0, 1].set_ylabel('Episode Length')
        axes[0, 1].set_title('Learning Progress: Episode Lengths')
        if window > 1:
            axes[0, 1].legend()
        axes[0, 1].grid(True, alpha=0.3)
        
        # TD Errors
        if len(self.td_errors) > 100:
            td_window = min(500, len(self.td_errors) // 20)
            smooth_td = pd.Series(self.td_errors).rolling(window=td_window).mean()
            axes[1, 0].plot(smooth_td, 'orange', linewidth=2)
        else:
            axes[1, 0].plot(self.td_errors, alpha=0.7, color='orange')
        
        axes[1, 0].set_xlabel('Update Step')
        axes[1, 0].set_ylabel('TD Error')
        axes[1, 0].set_title('TD Error Evolution')
        axes[1, 0].grid(True, alpha=0.3)
        
        # Weight evolution (magnitude of weights for action 0)
        if self.weights_history:
            weight_magnitudes = [np.linalg.norm(w[0]) for w in self.weights_history]
            checkpoints = np.arange(0, len(self.episode_rewards), 100)[:len(weight_magnitudes)]
            axes[1, 1].plot(checkpoints, weight_magnitudes, 'purple', marker='o', linewidth=2)
            axes[1, 1].set_xlabel('Episode')
            axes[1, 1].set_ylabel('Weight Magnitude (Action 0)')
            axes[1, 1].set_title('Weight Evolution')
            axes[1, 1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()

print("Linear TD agent implementation complete!")

In [None]:
# Train Linear TD agent with different feature types
print("Training Linear TD agents with different feature types...")

feature_types = ['polynomial', 'rbf']
linear_results = {}

for feature_type in feature_types:
    print(f"\n=== Training with {feature_type} features ===")
    
    agent = LinearTDAgent(
        cartpole_env, 
        feature_type=feature_type, 
        degree=2,  # only used by polynomial features; ignored for RBF
        alpha=0.01,
        gamma=0.99,
        epsilon=0.1
    )
    
    # Train
    agent.train(num_episodes=1000, verbose=True)
    
    # Evaluate
    eval_reward, eval_length = agent.evaluate(num_episodes=20)
    
    linear_results[feature_type] = {
        'agent': agent,
        'eval_reward': eval_reward,
        'eval_length': eval_length,
        'final_training_reward': np.mean(agent.episode_rewards[-100:])
    }
    
    print(f"\n{feature_type.capitalize()} Features Results:")
    print(f"  Final training reward: {linear_results[feature_type]['final_training_reward']:.2f}")
    print(f"  Evaluation reward: {eval_reward:.2f}")
    print(f"  Evaluation length: {eval_length:.2f}")

print("\n=== Linear Function Approximation Comparison ===")
print(f"{'Feature Type':<15} {'Training Reward':<18} {'Eval Reward':<15} {'Eval Length':<15}")
print("-" * 65)
for feature_type, results in linear_results.items():
    print(f"{feature_type:<15} {results['final_training_reward']:<18.2f} "
          f"{results['eval_reward']:<15.2f} {results['eval_length']:<15.2f}")

In [None]:
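# Plot learning curves for the linear agent with the highest evaluation reward
best_feature_type = max(linear_results, key=lambda ft: linear_results[ft]['eval_reward'])
best_linear_agent = linear_results[best_feature_type]['agent']
print(f"Best linear agent: {best_feature_type} features")
best_linear_agent.plot_learning_curves()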

## 5. Neural Network Function Approximation

Neural networks provide more expressive function approximation but introduce additional challenges.

### Mathematical Framework

For a neural network with parameters $\mathbf{w} = \{W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, \ldots\}$:

$$\hat{Q}(s, a; \mathbf{w}) = f_{\text{neural}}(s, a; \mathbf{w})$$

### Gradient Computation

The gradient is computed via **backpropagation**:
$$\nabla_\mathbf{w} \hat{Q}(s, a; \mathbf{w}) = \frac{\partial f_{\text{neural}}(s, a; \mathbf{w})}{\partial \mathbf{w}}$$

### Deep Q-Learning Update

The neural network update becomes:
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha [R_{t+1} + \gamma \max_{a'} \hat{Q}(S_{t+1}, a'; \mathbf{w}_t) - \hat{Q}(S_t, A_t; \mathbf{w}_t)] \nabla_{\mathbf{w}} \hat{Q}(S_t, A_t; \mathbf{w}_t)$$

### Challenges with Neural Networks

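1. **Instability**: With non-linear approximation, bootstrapped updates can oscillate or diverge
2. **Correlation**: Consecutive transitions are strongly correlated, which violates the i.i.d. assumption behind stochastic gradient descent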
3. **Moving targets**: The target values change as we update the network
4. **Catastrophic forgetting**: New experiences can overwrite old knowledge
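
The deep Q-learning update above can be sketched for a single transition in PyTorch. This is a minimal illustration under stated assumptions, not the agent built later in this notebook: the small `q_net`, the plain SGD optimizer, and the transition values are all made up. Minimizing half the squared TD error with the target held constant reproduces the update rule term by term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
q_net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=0.01)
gamma = 0.99

# One hypothetical transition (S_t, A_t, R_{t+1}, S_{t+1}); values are made up
state = torch.tensor([[0.1, -0.2, 0.05, 0.3]])
action = torch.tensor([1])
reward = torch.tensor([1.0])
next_state = torch.tensor([[0.12, -0.1, 0.04, 0.2]])

# The target is computed without gradient tracking (the "semi-gradient" part)
with torch.no_grad():
    target = reward + gamma * q_net(next_state).max(dim=1).values

# Q(S_t, A_t; w): pick the output that corresponds to the action taken
q_sa = q_net(state).gather(1, action.unsqueeze(1)).squeeze(1)
loss = 0.5 * F.mse_loss(q_sa, target)   # 1/2 * (TD error)^2

optimizer.zero_grad()
loss.backward()      # backpropagation computes grad_w Q(S_t, A_t; w)
optimizer.step()     # w <- w + lr * td_error * grad_w Q(S_t, A_t; w)

print(f"TD error before the step: {(target - q_sa).item():.4f}")
```

Note that the target was computed with the pre-update weights; after the step, recomputing it would give a different number, which is the moving-target problem listed above.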

In [None]:
class DQNNetwork(nn.Module):
    """
    Deep Q-Network for value function approximation.
    
    Simple fully-connected network optimized for CPU training.
    """
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dims: List[int] = [64, 64]):
        super(DQNNetwork, self).__init__()
        
        self.state_dim = state_dim
        self.action_dim = action_dim
        
        # Build network layers
        layers = []
        input_dim = state_dim
        
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(0.1)  # Light dropout for regularization
            ])
            input_dim = hidden_dim
        
        # Output layer
        layers.append(nn.Linear(input_dim, action_dim))
        
        self.network = nn.Sequential(*layers)
        
        # Initialize weights
        self.apply(self._init_weights)
    
    def _init_weights(self, module):
        """Initialize network weights."""
        if isinstance(module, nn.Linear):
            torch.nn.init.xavier_uniform_(module.weight)
            torch.nn.init.zeros_(module.bias)
    
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Forward pass through the network."""
        return self.network(state)
    
    def get_q_values(self, state: np.ndarray) -> np.ndarray:
        """Get Q-values for a single state."""
        # Disable dropout for inference, then restore the previous mode so
        # action selection does not silently leave the network in eval mode.
        was_training = self.training
        self.eval()
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            q_values = self.forward(state_tensor).squeeze(0).numpy()
        self.train(was_training)
        return q_values
    
    def get_action(self, state: np.ndarray) -> int:
        """Get greedy action for a single state."""
        q_values = self.get_q_values(state)
        return int(np.argmax(q_values))

# Test the network
test_network = DQNNetwork(state_dim=4, action_dim=2, hidden_dims=[32, 32])
print(f"DQN Network:")
print(f"  Parameters: {sum(p.numel() for p in test_network.parameters())}")
print(f"  Architecture: {test_network}")

# Test forward pass
test_state = np.array([0.1, -0.2, 0.05, 0.3])
test_q_values = test_network.get_q_values(test_state)
print(f"\nTest state: {test_state}")
print(f"Q-values: {test_q_values}")
print(f"Greedy action: {test_network.get_action(test_state)}")

## 6. The Deadly Triad and Instability

The **Deadly Triad** consists of three components that can cause instability when combined:

1. **Function Approximation**: Using parameterized functions instead of tables
2. **Bootstrapping**: Using estimates to update estimates (TD learning)
3. **Off-policy learning**: Learning about a different policy than the one being followed

### Why Instability Occurs

- **Moving targets**: The target $R + \gamma \max_{a'} Q(s', a'; \mathbf{w})$ changes as $\mathbf{w}$ updates
- **Correlation**: Consecutive experiences are highly correlated
- **Distribution shift**: The data distribution changes as the policy evolves
- **Overestimation**: The $\max$ over noisy Q-value estimates is biased upward, and bootstrapping propagates that bias
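
A tiny example, adapted from a well-known two-state illustration in Sutton and Barto's textbook, shows how the three ingredients interact. Two states share a single weight $w$, with $\hat{V}(s_1) = w$ and $\hat{V}(s_2) = 2w$, all rewards are zero (so the true values are zero), and only the transition $s_1 \to s_2$ is ever updated, as can happen when off-policy data over-represents it. The step size and discount below are arbitrary choices for the sketch.

```python
gamma, alpha = 0.99, 0.1

# Function approximation: one shared weight, features 1 and 2
phi_s1, phi_s2 = 1.0, 2.0
w = 1.0
history = [w]

for _ in range(50):
    # Bootstrapping: the target uses the current estimate of V(s2)
    td_error = 0.0 + gamma * (phi_s2 * w) - (phi_s1 * w)
    # Off-policy sampling: only s1 is ever updated, s2 never corrects itself
    w += alpha * td_error * phi_s1
    history.append(w)

print(f"w after 1 update:   {history[1]:.3f}")
print(f"w after 50 updates: {history[-1]:.1f}")
```

Each update multiplies $w$ by $1 + \alpha(2\gamma - 1)$, which exceeds 1 whenever $\gamma > 0.5$, so the weight grows without bound even though $w = 0$ represents the true values exactly.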

### Solutions

1. **Experience Replay**: Store and sample experiences randomly
2. **Target Networks**: Use separate network for computing targets
3. **Gradient Clipping**: Limit the magnitude of gradients
4. **Regularization**: Add constraints to prevent overfitting

In [None]:
class ExperienceReplayBuffer:
    """
    Experience replay buffer for storing and sampling transitions.
    
    Breaks the correlation between consecutive experiences by
    storing experiences and sampling random mini-batches.
    """
    
    def __init__(self, capacity: int = 10000):
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically
    
    def push(self, state: np.ndarray, action: int, reward: float, 
             next_state: np.ndarray, done: bool) -> None:
        """Store a transition in the buffer."""
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size: int) -> Tuple[torch.Tensor, ...]:
        """Sample a random batch of transitions."""
        if len(self.buffer) < batch_size:
            batch_size = len(self.buffer)
        
        batch = random.sample(self.buffer, batch_size)
        
        # Unpack batch
        states, actions, rewards, next_states, dones = zip(*batch)
        
        # Convert to tensors
        states = torch.FloatTensor(np.array(states))
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(np.array(next_states))
        dones = torch.BoolTensor(dones)
        
        return states, actions, rewards, next_states, dones
    
    def __len__(self) -> int:
        return len(self.buffer)

class SimpleDQNAgent:
    """
    Simple DQN agent demonstrating the deadly triad and basic solutions.
    
    Includes experience replay and optional target networks.
    """
    
    def __init__(self, env, hidden_dims: List[int] = [32, 32], 
                 lr: float = 0.001, gamma: float = 0.99, epsilon: float = 0.1,
                 buffer_size: int = 10000, batch_size: int = 32,
                 use_target_network: bool = True, target_update_freq: int = 100):
        
        self.env = env
        self.gamma = gamma
        self.epsilon = epsilon
        self.batch_size = batch_size
        self.use_target_network = use_target_network
        self.target_update_freq = target_update_freq
        
        # Networks
        self.q_network = DQNNetwork(env.state_dim, env.action_dim, hidden_dims)
        if use_target_network:
            self.target_network = DQNNetwork(env.state_dim, env.action_dim, hidden_dims)
            self.target_network.load_state_dict(self.q_network.state_dict())
            self.target_network.eval()
        
        # Optimizer
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)
        
        # Experience replay
        self.replay_buffer = ExperienceReplayBuffer(buffer_size)
        
        # Learning statistics
        self.episode_rewards = []
        self.episode_lengths = []
        self.losses = []
        self.update_count = 0
        
        print(f"DQN Agent: {sum(p.numel() for p in self.q_network.parameters())} parameters")
        print(f"Target network: {use_target_network}")
        print(f"Buffer size: {buffer_size}, Batch size: {batch_size}")
    
    def choose_action(self, state: np.ndarray, training: bool = True) -> int:
        """Choose action using ε-greedy policy."""
        if training and np.random.random() < self.epsilon:
            return np.random.randint(self.env.action_dim)
        else:
            return self.q_network.get_action(state)
    
    def update_target_network(self) -> None:
        """Update target network by copying weights from main network."""
        if self.use_target_network:
            self.target_network.load_state_dict(self.q_network.state_dict())
    
    def train_step(self) -> float:
        """Perform one training step using experience replay."""
        if len(self.replay_buffer) < self.batch_size:
            return 0.0
        
        # Sample batch from replay buffer
        states, actions, rewards, next_states, dones = self.replay_buffer.sample(self.batch_size)
        
        # Current Q values
        current_q_values = self.q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        
        # Next Q values
        with torch.no_grad():
            if self.use_target_network:
                next_q_values = self.target_network(next_states).max(1)[0]
            else:
                next_q_values = self.q_network(next_states).max(1)[0]
            
            target_q_values = rewards + (self.gamma * next_q_values * ~dones)
        
        # Compute loss
        loss = F.mse_loss(current_q_values, target_q_values)
        
        # Optimize
        self.optimizer.zero_grad()
        loss.backward()
        
        # Gradient clipping for stability
        torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), max_norm=1.0)
        
        self.optimizer.step()
        
        # Update target network
        self.update_count += 1
        if self.use_target_network and self.update_count % self.target_update_freq == 0:
            self.update_target_network()
        
        return loss.item()
    
    def train_episode(self) -> Tuple[float, int]:
        """Train for one episode."""
        state = self.env.reset()
        state = self.env.normalize_state(state)
        
        episode_reward = 0
        episode_length = 0
        episode_losses = []
        
        while True:
            # Choose action
            action = self.choose_action(state, training=True)
            
            # Take action
            next_state, reward, done, _ = self.env.step(action)
            next_state = self.env.normalize_state(next_state)
            
            # Store transition
            self.replay_buffer.push(state, action, reward, next_state, done)
            
            # Train if enough experiences
            if len(self.replay_buffer) >= self.batch_size:
                loss = self.train_step()
                episode_losses.append(loss)
            
            episode_reward += reward
            episode_length += 1
            
            if done:
                break
            
            state = next_state
        
        # Store statistics
        self.episode_rewards.append(episode_reward)
        self.episode_lengths.append(episode_length)
        if episode_losses:
            self.losses.extend(episode_losses)
        
        return episode_reward, episode_length
    
    def train(self, num_episodes: int = 1000, verbose: bool = True) -> None:
        """Train the agent for multiple episodes."""
        
        for episode in range(num_episodes):
            reward, length = self.train_episode()
            
            # Decay epsilon
            if self.epsilon > 0.01:
                self.epsilon *= 0.995
            
            if episode % 100 == 0 and verbose:
                # Slicing already handles histories shorter than the window
                avg_reward = np.mean(self.episode_rewards[-100:])
                avg_length = np.mean(self.episode_lengths[-100:])
                avg_loss = np.mean(self.losses[-1000:]) if self.losses else 0
                buffer_size = len(self.replay_buffer)
                print(f"Episode {episode}: Avg Reward = {avg_reward:.2f}, Avg Length = {avg_length:.1f}, "
                      f"Loss = {avg_loss:.4f}, Buffer = {buffer_size}, ε = {self.epsilon:.4f}")
    
    def evaluate(self, num_episodes: int = 10) -> Tuple[float, float]:
        """Evaluate the learned policy."""
        rewards = []
        lengths = []
        
        for _ in range(num_episodes):
            state = self.env.reset()
            state = self.env.normalize_state(state)
            
            episode_reward = 0
            episode_length = 0
            
            while True:
                action = self.choose_action(state, training=False)
                next_state, reward, done, _ = self.env.step(action)
                next_state = self.env.normalize_state(next_state)
                
                episode_reward += reward
                episode_length += 1
                
                if done:
                    break
                
                state = next_state
            
            rewards.append(episode_reward)
            lengths.append(episode_length)
        
        return np.mean(rewards), np.mean(lengths)

print("DQN agent implementation complete!")

In [None]:
# Compare DQN with and without target networks
print("Training DQN agents with different configurations...")

dqn_configs = {
    'DQN_basic': {'use_target_network': False},
    'DQN_target': {'use_target_network': True, 'target_update_freq': 100}
}

dqn_results = {}

for name, config in dqn_configs.items():
    print(f"\n=== Training {name} ===")
    
    agent = SimpleDQNAgent(
        cartpole_env,
        hidden_dims=[32, 32],
        lr=0.001,
        gamma=0.99,
        epsilon=0.1,
        buffer_size=5000,
        batch_size=32,
        **config
    )
    
    # Train
    agent.train(num_episodes=800, verbose=True)
    
    # Evaluate
    eval_reward, eval_length = agent.evaluate(num_episodes=20)
    
    dqn_results[name] = {
        'agent': agent,
        'eval_reward': eval_reward,
        'eval_length': eval_length,
        'final_training_reward': np.mean(agent.episode_rewards[-100:])
    }
    
    print(f"\n{name} Results:")
    print(f"  Final training reward: {dqn_results[name]['final_training_reward']:.2f}")
    print(f"  Evaluation reward: {eval_reward:.2f}")
    print(f"  Evaluation length: {eval_length:.2f}")

print("\n=== DQN Configuration Comparison ===")
print(f"{'Configuration':<15} {'Training Reward':<18} {'Eval Reward':<15} {'Eval Length':<15}")
print("-" * 65)
for name, results in dqn_results.items():
    print(f"{name:<15} {results['final_training_reward']:<18.2f} "
          f"{results['eval_reward']:<15.2f} {results['eval_length']:<15.2f}")

In [None]:
# Plot DQN learning curves and compare with linear methods
def plot_comprehensive_comparison():
    """Compare all implemented methods."""
    
    fig, axes = plt.subplots(2, 2, figsize=(16, 10))
    
    # Episode rewards comparison
    window = 50
    
    # Linear methods
    for feature_type, results in linear_results.items():
        agent = results['agent']
        if len(agent.episode_rewards) >= window:
            smooth_rewards = pd.Series(agent.episode_rewards).rolling(window=window).mean()
            axes[0, 0].plot(smooth_rewards, label=f'Linear ({feature_type})', alpha=0.8)
    
    # DQN methods
    for name, results in dqn_results.items():
        agent = results['agent']
        if len(agent.episode_rewards) >= window:
            smooth_rewards = pd.Series(agent.episode_rewards).rolling(window=window).mean()
            axes[0, 0].plot(smooth_rewards, label=name, alpha=0.8, linewidth=2)
    
    axes[0, 0].set_xlabel('Episode')
    axes[0, 0].set_ylabel('Average Episode Reward')
    axes[0, 0].set_title('Learning Progress Comparison')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    # Final performance comparison
    all_methods = list(linear_results.keys()) + list(dqn_results.keys())
    all_rewards = ([linear_results[k]['eval_reward'] for k in linear_results.keys()] + 
                  [dqn_results[k]['eval_reward'] for k in dqn_results.keys()])
    
    colors = ['skyblue', 'lightgreen', 'orange', 'lightcoral']
    bars = axes[0, 1].bar(range(len(all_methods)), all_rewards, 
                         color=colors[:len(all_methods)], alpha=0.8)
    axes[0, 1].set_xlabel('Method')
    axes[0, 1].set_ylabel('Evaluation Reward')
    axes[0, 1].set_title('Final Performance Comparison')
    axes[0, 1].set_xticks(range(len(all_methods)))
    axes[0, 1].set_xticklabels(all_methods, rotation=45, ha='right')
    
    # Add value labels
    for bar, reward in zip(bars, all_rewards):
        axes[0, 1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5, 
                       f'{reward:.1f}', ha='center', va='bottom')
    
    # Loss evolution for DQN
    for name, results in dqn_results.items():
        agent = results['agent']
        if agent.losses:
            loss_window = min(100, len(agent.losses) // 10)
            if loss_window > 1:
                smooth_loss = pd.Series(agent.losses).rolling(window=loss_window).mean()
                axes[1, 0].plot(smooth_loss, label=name, alpha=0.8)
    
    axes[1, 0].set_xlabel('Update Step')
    axes[1, 0].set_ylabel('Loss')
    axes[1, 0].set_title('DQN Loss Evolution')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)
    
    # Training budget per method (episodes run, not episodes needed to converge)
    method_episodes = []
    for method in all_methods:
        if method in linear_results:
            episodes = len(linear_results[method]['agent'].episode_rewards)
        else:
            episodes = len(dqn_results[method]['agent'].episode_rewards)
        method_episodes.append(episodes)
    
    bars2 = axes[1, 1].bar(range(len(all_methods)), method_episodes,
                          color=colors[:len(all_methods)], alpha=0.8)
    axes[1, 1].set_xlabel('Method')
    axes[1, 1].set_ylabel('Training Episodes')
    axes[1, 1].set_title('Training Episodes per Method')
    axes[1, 1].set_xticks(range(len(all_methods)))
    axes[1, 1].set_xticklabels(all_methods, rotation=45, ha='right')
    
    plt.tight_layout()
    plt.show()

plot_comprehensive_comparison()

## 7. Key Insights and Theoretical Analysis

Let's analyze the results and extract key insights about function approximation in RL.

In [None]:
def analyze_function_approximation_results():
    """Analyze and summarize insights from function approximation experiments."""
    
    print("=== Function Approximation Analysis ===")
    print()
    
    # Performance comparison
    print("1. PERFORMANCE COMPARISON:")
    all_results = {**linear_results, **dqn_results}
    
    best_method = max(all_results.keys(), key=lambda x: all_results[x]['eval_reward'])
    worst_method = min(all_results.keys(), key=lambda x: all_results[x]['eval_reward'])
    
    print(f"   • Best performing method: {best_method} ({all_results[best_method]['eval_reward']:.2f} reward)")
    print(f"   • Worst performing method: {worst_method} ({all_results[worst_method]['eval_reward']:.2f} reward)")
    print(f"   • Performance gap: {all_results[best_method]['eval_reward'] - all_results[worst_method]['eval_reward']:.2f}")
    print()
    
    print("2. LINEAR VS NEURAL NETWORK APPROXIMATION:")
    linear_avg = np.mean([linear_results[k]['eval_reward'] for k in linear_results.keys()])
    dqn_avg = np.mean([dqn_results[k]['eval_reward'] for k in dqn_results.keys()])
    
    print(f"   • Average linear performance: {linear_avg:.2f}")
    print(f"   • Average DQN performance: {dqn_avg:.2f}")
    
    if dqn_avg > linear_avg:
        print(f"   • Neural networks outperform linear methods by {dqn_avg - linear_avg:.2f}")
    else:
        print(f"   • Linear methods outperform neural networks by {linear_avg - dqn_avg:.2f}")
    print()
    
    print("3. FEATURE ENGINEERING IMPACT:")
    if 'polynomial' in linear_results and 'rbf' in linear_results:
        poly_perf = linear_results['polynomial']['eval_reward']
        rbf_perf = linear_results['rbf']['eval_reward']
        print(f"   • Polynomial features: {poly_perf:.2f}")
        print(f"   • RBF features: {rbf_perf:.2f}")
        print(f"   • Feature choice impact: {abs(poly_perf - rbf_perf):.2f}")
    print()
    
    print("4. TARGET NETWORK IMPACT:")
    if 'DQN_basic' in dqn_results and 'DQN_target' in dqn_results:
        basic_perf = dqn_results['DQN_basic']['eval_reward']
        target_perf = dqn_results['DQN_target']['eval_reward']
        print(f"   • DQN without target network: {basic_perf:.2f}")
        print(f"   • DQN with target network: {target_perf:.2f}")
        print(f"   • Target network improvement: {target_perf - basic_perf:.2f}")
        
        if target_perf > basic_perf:
            print("   • In this run, the target network improved evaluation reward (single seed, so treat as suggestive)")
        else:
            print("   • Target networks show minimal or negative impact (possibly due to environment simplicity)")
    print()
    
    print("5. SAMPLE COMPLEXITY:")
    for method, results in all_results.items():
        # Episodes each agent was trained for: a fixed budget, not a convergence measure
        episodes = len(results['agent'].episode_rewards)
        print(f"   • {method}: {episodes} training episodes")
    print()
    
    print("6. THEORETICAL INSIGHTS:")
    print("   • Linear function approximation:")
    print("     - Provides convergence guarantees under certain conditions")
    print("     - Limited expressiveness requires good feature engineering")
    print("     - Stable learning with proper step sizes")
    print("   • Neural network approximation:")
    print("     - Higher expressiveness but no convergence guarantees")
    print("     - Requires careful hyperparameter tuning")
    print("     - Prone to instability (deadly triad)")
    print("   • Experience replay:")
    print("     - Breaks correlation in sequential data")
    print("     - Improves sample efficiency")
    print("     - Helps stabilize off-policy neural network training")
    print("   • Target networks:")
    print("     - Reduce non-stationarity in target values")
    print("     - Provide more stable learning")
    print("     - May slow adaptation to environment changes")
    print()
    
    print("7. PRACTICAL RECOMMENDATIONS:")
    print("   • For simple environments: Linear methods may suffice")
    print("   • For complex environments: Neural networks are necessary")
    print("   • Use experience replay for off-policy value-based methods such as DQN")
    print("   • Consider target networks for stability")
    print("   • Feature engineering is crucial for linear methods")
    print("   • Monitor for instability and adjust hyperparameters accordingly")

analyze_function_approximation_results()

## 8. Preparing for Deep Reinforcement Learning

Let's wrap up by closing the environment and summarizing the takeaways that the next notebook builds on.

In [None]:
# Clean up environments
cartpole_env.close()

print("=== Transition to Deep RL - Key Takeaways ===")
print()
print("From this notebook, we've learned:")
print()
print("1. **Function Approximation Necessity**:")
print("   - Tabular methods don't scale to large state spaces")
print("   - Function approximation enables generalization")
print("   - Trade-off between expressiveness and stability")
print()
print("2. **Linear Function Approximation**:")
print("   - Offers convergence guarantees under certain conditions")
print("   - Requires good feature engineering")
print("   - Limited expressiveness for complex environments")
print()
print("3. **Neural Network Approximation**:")
print("   - High expressiveness and automatic feature learning")
print("   - Introduces instability challenges (deadly triad)")
print("   - Requires sophisticated techniques for stable learning")
print()
print("4. **Key Solutions for Stability**:")
print("   - Experience replay breaks correlation")
print("   - Target networks reduce non-stationarity")
print("   - Gradient clipping prevents exploding gradients")
print("   - Regularization prevents overfitting")
print()
print("5. **Ready for Deep RL**:")
print("   - Understand function approximation theory")
print("   - Implemented basic DQN components")
print("   - Recognize instability issues and solutions")
print("   - Have foundation for advanced algorithms")
print()
print("Next: Deep Q-Learning with advanced techniques and complex environments!")

## Summary

In this notebook, we bridged the gap between tabular RL and deep reinforcement learning:

### **Function Approximation Theory**
- **Curse of Dimensionality**: Understanding why tabular methods fail for large state spaces
- **Linear Approximation**: Mathematical foundation with $\hat{V}(s, \mathbf{w}) = \mathbf{w}^T \phi(s)$
- **Semi-Gradient Methods**: TD learning with function approximation
- **Feature Engineering**: Polynomial, RBF, and custom feature extraction
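
As a compact reminder of how the linear form and the semi-gradient update fit together, here is a minimal, framework-free sketch of one TD(0) step for $\hat{V}(s, \mathbf{w}) = \mathbf{w}^T \phi(s)$. The function names `linear_value` and `semi_gradient_td0_update`, the feature vectors, and the step size are illustrative assumptions, not code from earlier in this notebook.

```python
from typing import List, Tuple


def linear_value(w: List[float], phi: List[float]) -> float:
    """V_hat(s, w) = w^T phi(s)."""
    return sum(wi * xi for wi, xi in zip(w, phi))


def semi_gradient_td0_update(
    w: List[float],
    phi_s: List[float],
    reward: float,
    phi_next: List[float],
    done: bool,
    alpha: float = 0.1,
    gamma: float = 0.99,
) -> Tuple[List[float], float]:
    """Apply one semi-gradient TD(0) step and return (new weights, TD error)."""
    # The bootstrapped next-state value is treated as a constant: no gradient
    # is taken through it, which is what makes the update "semi"-gradient.
    next_value = 0.0 if done else linear_value(w, phi_next)
    td_error = reward + gamma * next_value - linear_value(w, phi_s)
    # For a linear approximator, the gradient of V_hat w.r.t. w is phi(s).
    new_w = [wi + alpha * td_error * xi for wi, xi in zip(w, phi_s)]
    return new_w, td_error


# Hypothetical two-feature example starting from zero weights.
w, delta = semi_gradient_td0_update(
    [0.0, 0.0], phi_s=[1.0, 0.5], reward=1.0, phi_next=[0.0, 1.0], done=False
)
print(w, delta)
```

With zero initial weights the TD error equals the reward, so each weight moves by `alpha * reward * phi_i`.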

### **Neural Network Foundations**
- **Deep Q-Networks**: Neural network implementation for value function approximation
- **Backpropagation**: Gradient computation for complex function approximators
- **Network Architecture**: Design choices for RL applications

### **The Deadly Triad**
- **Three Components**: Function approximation + Bootstrapping + Off-policy learning
- **Instability Issues**: Moving targets, correlation, distribution shift
- **Practical Solutions**: Experience replay, target networks, gradient clipping

### **Experience Replay**
- **Breaking Correlation**: Random sampling from stored experiences
- **Implementation**: Efficient buffer management and batch sampling
- **Benefits**: Improved sample efficiency and stability
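
The buffer idea can be sketched in a few lines of standard-library Python. This is a simplified illustration under assumed names (`TinyReplayBuffer` is hypothetical and is not the buffer class used by the agent above): a bounded deque drops the oldest transitions, and uniform random sampling is what breaks the temporal correlation between consecutive steps.

```python
import random
from collections import deque


class TinyReplayBuffer:
    """Fixed-capacity FIFO store of (state, action, reward, next_state, done)."""

    def __init__(self, capacity: int):
        # deque(maxlen=...) silently evicts the oldest item once full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done) -> None:
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform sampling without replacement decorrelates the batch.
        batch = random.sample(self.buffer, batch_size)
        # Transpose a list of transitions into per-field tuples.
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self) -> int:
        return len(self.buffer)


# Hypothetical usage: five transitions pushed into a buffer of capacity three.
buf = TinyReplayBuffer(capacity=3)
for t in range(5):
    buf.push(t, 0, 1.0, t + 1, False)
states, actions, rewards, next_states, dones = buf.sample(2)
print(len(buf), states)
```

Only the three most recent transitions survive, so the sampled states always come from `{2, 3, 4}`.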

### **Target Networks**
- **Concept**: Separate network for computing target values
- **Stability**: Reducing non-stationarity in learning targets
- **Update Strategy**: Periodic copying of main network weights
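
The two pieces above, computing targets from a frozen copy and refreshing that copy periodically, can be illustrated without any deep learning framework. The helpers `td_target` and `maybe_sync` and the parameter dictionaries are hypothetical stand-ins; in PyTorch the hard copy is commonly written as `target_net.load_state_dict(online_net.state_dict())`, where both network names are placeholders.

```python
import copy
from typing import Dict, List


def td_target(reward: float, next_q_values: List[float], done: bool,
              gamma: float = 0.99) -> float:
    """y = r + gamma * max_a' Q_target(s', a'), with no bootstrap at terminal steps."""
    bootstrap = 0.0 if done else max(next_q_values)
    return reward + gamma * bootstrap


def maybe_sync(online_params: Dict, target_params: Dict,
               update_count: int, target_update_freq: int) -> Dict:
    """Hard update: return a copy of the online parameters every N updates."""
    if update_count % target_update_freq == 0:
        return copy.deepcopy(online_params)
    return target_params


# Hypothetical parameters; `next_q` stands for the target network's output at s'.
online = {"w": [1.0, 2.0]}
target = {"w": [0.0, 0.0]}
next_q = [0.5, 2.0]

y = td_target(reward=1.0, next_q_values=next_q, done=False, gamma=0.9)
target = maybe_sync(online, target, update_count=100, target_update_freq=100)
print(y, target)
```

Between syncs the target parameters stay fixed, so the regression target `y` does not shift with every gradient step on the online network.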

### **Comparative Analysis**
- **Linear vs Neural**: Trade-offs between simplicity and expressiveness
- **Feature Impact**: Importance of representation in linear methods
- **Stability Techniques**: Empirical validation of theoretical solutions

### **Practical Insights**
- **Environment Complexity**: Matching method sophistication to problem difficulty
- **Hyperparameter Sensitivity**: Critical parameters for stable learning
- **Implementation Details**: CPU optimization for MacBook Air M2

This foundation prepares us for deep Q-learning with advanced techniques like Double DQN, Dueling DQN, and Rainbow improvements in the next notebook. We now understand both the theoretical necessity and practical implementation of function approximation in reinforcement learning.