# Reinforcement Learning Plan B - Part 2: Monte Carlo & Temporal Difference Learning

This notebook covers Monte Carlo methods, n-step temporal difference learning, and eligibility traces. For each, we present the mathematical foundations, implement the algorithm, and compare convergence behavior and practical trade-offs.

**Learning Objectives:**
- Understand Monte Carlo prediction and control methods
- Master eligibility traces and n-step temporal difference learning
- Analyze on-policy vs off-policy learning paradigms
- Implement importance sampling for off-policy methods
- Compare convergence properties and sample efficiency
- Build intuition for function approximation preparation

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict, deque
import pandas as pd
from typing import Tuple, Dict, List, Optional, Union
import time
import warnings
from scipy.stats import norm
import random
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Set random seeds for reproducibility
np.random.seed(42)
random.seed(42)

print("Environment setup complete!")
print(f"NumPy version: {np.__version__}")
print(f"Matplotlib version: {plt.matplotlib.__version__}")

## 1. Monte Carlo Methods: Mathematical Foundation

Monte Carlo (MC) methods learn directly from episodes of experience without bootstrapping. They use the actual returns experienced to estimate value functions.

### Monte Carlo Prediction

The goal is to estimate $V^\pi(s)$ given a policy $\pi$. For each state $s$, we collect returns from episodes that visit $s$:

$$V^\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]$$

Where $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ is the return from time $t$.

### First-Visit vs Every-Visit MC

- **First-visit MC**: Only consider the first visit to state $s$ in each episode
- **Every-visit MC**: Consider every visit to state $s$ in each episode

Both converge to $V^\pi(s)$ as the number of visits approaches infinity.
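
To make the distinction concrete, the sketch below computes returns for a short hand-made episode and shows which returns each variant would record. The states `A` and `B`, the rewards, and the helper names `discounted_returns` and `collect_returns` are illustrative assumptions, not part of the environments used later in this notebook.

```python
from collections import defaultdict

def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}], where rewards[t] holds R_{t+1}."""
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G          # G_t = R_{t+1} + gamma * G_{t+1}
        returns.append(G)
    returns.reverse()
    return returns

def collect_returns(states, rewards, gamma, first_visit):
    """Group the returns that follow visits to each state."""
    samples = defaultdict(list)
    seen = set()
    for s, G in zip(states, discounted_returns(rewards, gamma)):
        if first_visit and s in seen:
            continue               # first-visit MC ignores repeat visits
        seen.add(s)
        samples[s].append(G)
    return dict(samples)

# Hypothetical episode: A -> B -> A -> terminal
states = ["A", "B", "A"]
rewards = [1.0, 2.0, 4.0]          # R_1, R_2, R_3

first = collect_returns(states, rewards, gamma=0.5, first_visit=True)
every = collect_returns(states, rewards, gamma=0.5, first_visit=False)
print("first-visit:", first)
print("every-visit:", every)
```

State `A` is visited twice, so every-visit MC records two returns for it while first-visit MC records only the return following the first visit.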

### Monte Carlo Backup

The MC update rule is:
$$V(s) \leftarrow V(s) + \alpha[G_t - V(s)]$$

Where $\alpha$ is the learning rate and $G_t$ is the actual return from the episode.
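
The choice of $\alpha$ changes what the estimate represents. As a small sketch (the helper names and the sample returns are assumptions made for illustration): with a step size of $1/N$, where $N$ counts the returns seen so far, the update reproduces the sample average, while a constant $\alpha$ weights recent returns more heavily.

```python
def incremental_mean(returns):
    """Apply V <- V + (1/N) * (G - V); this equals the running sample mean."""
    V, n = 0.0, 0
    for G in returns:
        n += 1
        V += (G - V) / n
    return V

def constant_alpha(returns, alpha, V0=0.0):
    """Apply V <- V + alpha * (G - V) with a fixed step size."""
    V = V0
    for G in returns:
        V += alpha * (G - V)
    return V

sample_returns = [2.0, 4.0, 6.0, 8.0]   # hypothetical returns observed from one state

mean_estimate = incremental_mean(sample_returns)
recent_estimate = constant_alpha(sample_returns, alpha=0.5)
print(mean_estimate, recent_estimate)
```

The constant step size ends closer to the most recent returns, which is useful when the policy (and therefore the return distribution) keeps changing, as it does in control.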

### Key Properties

1. **Unbiased**: First-visit MC estimates are unbiased because they average actual sampled returns (every-visit MC is biased for finite samples but still consistent)
2. **High Variance**: Returns can vary significantly between episodes
3. **Model-Free**: No need for environment dynamics
4. **Episodic**: Requires complete episodes to compute returns

## 2. Enhanced Environments for Advanced Methods

We'll create more complex environments to better demonstrate the differences between algorithms.

In [None]:
class CliffWalkingEnvironment:
    """
    Cliff Walking environment - a classic RL benchmark.
    
    The agent must navigate from start to goal while avoiding a cliff.
    This environment highlights the difference between on-policy and off-policy methods.
    """
    
    def __init__(self, height: int = 4, width: int = 12):
        self.height = height
        self.width = width
        
        # Define actions: 0=Up, 1=Down, 2=Left, 3=Right
        self.actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # (dy, dx)
        self.action_names = ['Up', 'Down', 'Left', 'Right']
        self.num_actions = len(self.actions)
        
        # Set up environment
        self.start_state = (height-1, 0)  # Bottom-left
        self.goal_state = (height-1, width-1)  # Bottom-right
        
        # Define cliff (bottom row except start and goal)
        self.cliff_states = {(height-1, x) for x in range(1, width-1)}
        
        # Current state
        self.current_state = self.start_state
        self.done = False
    
    def reset(self) -> Tuple[int, int]:
        """Reset to starting state."""
        self.current_state = self.start_state
        self.done = False
        return self.current_state
    
    def step(self, action: int) -> Tuple[Tuple[int, int], float, bool, dict]:
        """Execute action and return (next_state, reward, done, info)."""
        if self.done:
            return self.current_state, 0, True, {}
        
        # Calculate next state
        dy, dx = self.actions[action]
        next_y = max(0, min(self.height - 1, self.current_state[0] + dy))
        next_x = max(0, min(self.width - 1, self.current_state[1] + dx))
        next_state = (next_y, next_x)
        
        # Calculate reward
        if next_state in self.cliff_states:
            # Fall off cliff - large negative reward and reset to start
            reward = -100
            next_state = self.start_state
            self.done = False  # Episode continues
        elif next_state == self.goal_state:
            # Reached goal
            reward = 0
            self.done = True
        else:
            # Normal move
            reward = -1
        
        self.current_state = next_state
        
        return next_state, reward, self.done, {}
    
    def get_all_states(self) -> List[Tuple[int, int]]:
        """Return all possible states."""
        return [(i, j) for i in range(self.height) for j in range(self.width)]
    
    def is_terminal(self, state: Tuple[int, int]) -> bool:
        """Check if state is terminal."""
        return state == self.goal_state
    
    def render(self, values: Optional[Dict] = None, policy: Optional[Dict] = None) -> None:
        """Visualize the cliff walking environment."""
        fig, ax = plt.subplots(figsize=(12, 6))
        
        # Create grid visualization
        grid = np.zeros((self.height, self.width))
        
        if values:
            for (y, x), value in values.items():
                grid[y, x] = value
        
        # Plot heatmap
        im = ax.imshow(grid, cmap='RdYlBu_r', alpha=0.7)
        
        # Add grid lines
        ax.set_xticks(np.arange(self.width + 1) - 0.5, minor=True)
        ax.set_yticks(np.arange(self.height + 1) - 0.5, minor=True)
        ax.grid(which='minor', color='black', linestyle='-', linewidth=1)
        
        # Mark special states
        start_y, start_x = self.start_state
        goal_y, goal_x = self.goal_state
        
        ax.text(start_x, start_y, 'S', ha='center', va='center', 
                fontsize=16, fontweight='bold', color='green')
        ax.text(goal_x, goal_y, 'G', ha='center', va='center', 
                fontsize=16, fontweight='bold', color='red')
        
        # Mark cliff states
        for (cliff_y, cliff_x) in self.cliff_states:
            ax.text(cliff_x, cliff_y, 'C', ha='center', va='center', 
                    fontsize=14, fontweight='bold', color='darkred')
            # Add red background for cliff
            ax.add_patch(plt.Rectangle((cliff_x-0.4, cliff_y-0.4), 0.8, 0.8, 
                                     facecolor='red', alpha=0.3))
        
        # Add policy arrows if provided
        if policy:
            arrow_props = dict(arrowstyle='->', lw=2, color='blue')
            for (y, x), action in policy.items():
                if not self.is_terminal((y, x)) and (y, x) not in self.cliff_states:
                    dy, dx = self.actions[action]
                    ax.annotate('', xy=(x + dx*0.3, y + dy*0.3), xytext=(x, y),
                              arrowprops=arrow_props)
        
        # Add value labels if provided
        if values:
            for (y, x), value in values.items():
                if not self.is_terminal((y, x)) and (y, x) not in self.cliff_states:
                    ax.text(x, y + 0.3, f'{value:.1f}', ha='center', va='center', 
                            fontsize=8, color='white', fontweight='bold')
        
        ax.set_title('Cliff Walking Environment')
        ax.set_xticks(range(self.width))
        ax.set_yticks(range(self.height))
        
        if values:
            plt.colorbar(im, ax=ax, label='State Value')
        
        plt.tight_layout()
        plt.show()

# Create cliff walking environment
cliff_env = CliffWalkingEnvironment()
print(f"Cliff Walking Environment: {cliff_env.height}x{cliff_env.width}")
print(f"Start: {cliff_env.start_state}, Goal: {cliff_env.goal_state}")
print(f"Cliff states: {len(cliff_env.cliff_states)} states")
print(f"Actions: {cliff_env.action_names}")

# Visualize the environment
cliff_env.render()

## 3. Monte Carlo Prediction Implementation

Let's implement both first-visit and every-visit Monte Carlo prediction methods.

### Algorithm: First-Visit MC Prediction

1. Initialize $V(s) = 0$ and $Returns(s) = \emptyset$ for all $s \in \mathcal{S}$
2. For each episode:
   - Generate episode following policy $\pi$: $S_0, A_0, R_1, S_1, A_1, \ldots, S_{T-1}, A_{T-1}, R_T$
   - Calculate returns: $G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$
   - For each state $s$ appearing in the episode:
     - If this is the first visit to $s$ in this episode:
       - Append $G_t$ to $Returns(s)$
       - $V(s) \leftarrow$ average of $Returns(s)$

In [None]:
class MonteCarloAgent:
    """
    Monte Carlo agent for value function estimation and control.
    
    Implements both first-visit and every-visit MC prediction,
    as well as on-policy MC control with an ε-greedy policy.
    """
    
    def __init__(self, env, gamma: float = 0.9, first_visit: bool = True,
                 epsilon: float = 0.1, epsilon_decay: float = 0.99):
        self.env = env
        self.gamma = gamma
        self.first_visit = first_visit
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        
        self.states = env.get_all_states()
        self.num_actions = env.num_actions
        
        # Value function and returns
        self.V = defaultdict(float)
        self.returns = defaultdict(list)
        
        # Action-value function for control
        self.Q = defaultdict(lambda: np.zeros(self.num_actions))
        self.q_returns = defaultdict(lambda: defaultdict(list))
        
        # Policy (for control)
        self.policy = defaultdict(lambda: np.ones(self.num_actions) / self.num_actions)
        
        # Learning statistics
        self.episode_rewards = []
        self.episode_lengths = []
        self.value_history = []
    
    def generate_episode(self, policy: Optional[Dict] = None, max_steps: int = 1000) -> List:
        """Generate an episode following the given policy."""
        episode = []
        state = self.env.reset()
        
        for step in range(max_steps):
            # Choose action according to policy
            if policy is None:
                # Use current policy (for control)
                action = np.random.choice(self.num_actions, p=self.policy[state])
            else:
                # Use provided policy (for prediction)
                if isinstance(policy, dict):
                    action = policy[state]
                else:
                    action = policy(state)
            
            next_state, reward, done, _ = self.env.step(action)
            episode.append((state, action, reward))
            
            if done:
                break
                
            state = next_state
        
        return episode
    
    def calculate_returns(self, episode: List) -> List[float]:
        """Calculate returns for each time step in the episode."""
        returns = []
        G = 0.0
        
        # Work backwards through the episode: G_t = R_{t+1} + γ G_{t+1}
        for _, _, reward in reversed(episode):
            G = reward + self.gamma * G
            returns.append(G)
        
        # Returns were accumulated last-to-first; restore time order
        returns.reverse()
        return returns
    
    def mc_prediction(self, policy, num_episodes: int = 10000, verbose: bool = True) -> Dict:
        """Monte Carlo prediction to estimate V^π."""
        
        for episode_num in range(num_episodes):
            # Generate episode
            episode = self.generate_episode(policy)
            returns = self.calculate_returns(episode)
            
            # Track statistics
            episode_reward = sum(step[2] for step in episode)
            self.episode_rewards.append(episode_reward)
            self.episode_lengths.append(len(episode))
            
            # Update value function
            visited_states = set()
            
            for t, ((state, action, reward), G) in enumerate(zip(episode, returns)):
                if self.first_visit and state in visited_states:
                    continue
                
                visited_states.add(state)
                self.returns[state].append(G)
                self.V[state] = np.mean(self.returns[state])
            
            # Track value function evolution
            if episode_num % 1000 == 0:
                self.value_history.append(dict(self.V))
                if verbose:
                    avg_reward = np.mean(self.episode_rewards[-100:]) if len(self.episode_rewards) >= 100 else np.mean(self.episode_rewards)
                    print(f"Episode {episode_num}: Avg Reward = {avg_reward:.3f}, Avg Length = {np.mean(self.episode_lengths[-100:]):.1f}")
        
        return dict(self.V)
    
    def epsilon_greedy_policy_from_q(self, state: Tuple[int, int]) -> np.ndarray:
        """Generate ε-greedy policy from Q-values."""
        policy = np.ones(self.num_actions) * self.epsilon / self.num_actions
        best_action = np.argmax(self.Q[state])
        policy[best_action] += 1 - self.epsilon
        return policy
    
    def mc_control(self, num_episodes: int = 10000, verbose: bool = True) -> Tuple[Dict, Dict]:
        """Monte Carlo control to find optimal policy."""
        
        for episode_num in range(num_episodes):
            # Update epsilon
            self.epsilon = max(0.01, self.epsilon * self.epsilon_decay)
            
            # Update policy based on current Q-values
            for state in self.states:
                if not self.env.is_terminal(state):
                    self.policy[state] = self.epsilon_greedy_policy_from_q(state)
            
            # Generate episode
            episode = self.generate_episode(max_steps=500)
            returns = self.calculate_returns(episode)
            
            # Track statistics
            episode_reward = sum(step[2] for step in episode)
            self.episode_rewards.append(episode_reward)
            self.episode_lengths.append(len(episode))
            
            # Update Q-function
            visited_state_actions = set()
            
            for t, ((state, action, reward), G) in enumerate(zip(episode, returns)):
                state_action = (state, action)
                
                if self.first_visit and state_action in visited_state_actions:
                    continue
                
                visited_state_actions.add(state_action)
                self.q_returns[state][action].append(G)
                self.Q[state][action] = np.mean(self.q_returns[state][action])
            
            if episode_num % 1000 == 0 and verbose:
                avg_reward = np.mean(self.episode_rewards[-100:]) if len(self.episode_rewards) >= 100 else np.mean(self.episode_rewards)
                avg_length = np.mean(self.episode_lengths[-100:]) if len(self.episode_lengths) >= 100 else np.mean(self.episode_lengths)
                print(f"Episode {episode_num}: Avg Reward = {avg_reward:.3f}, Avg Length = {avg_length:.1f}, ε = {self.epsilon:.4f}")
        
        # Extract final greedy policy
        greedy_policy = {}
        for state in self.states:
            if not self.env.is_terminal(state):
                greedy_policy[state] = np.argmax(self.Q[state])
        
        return dict(self.Q), greedy_policy
    
    def plot_learning_curves(self) -> None:
        """Plot learning progress."""
        fig, axes = plt.subplots(1, 2, figsize=(15, 5))
        
        # Episode rewards
        window = min(100, len(self.episode_rewards) // 10)
        if window > 1:
            moving_avg = pd.Series(self.episode_rewards).rolling(window=window).mean()
            axes[0].plot(moving_avg, 'r-', linewidth=2, label=f'{window}-episode average')
        
        axes[0].plot(self.episode_rewards, alpha=0.3, color='blue')
        axes[0].set_xlabel('Episode')
        axes[0].set_ylabel('Episode Reward')
        axes[0].set_title('Monte Carlo Learning Progress')
        axes[0].legend()
        axes[0].grid(True, alpha=0.3)
        
        # Episode lengths
        if window > 1:
            length_avg = pd.Series(self.episode_lengths).rolling(window=window).mean()
            axes[1].plot(length_avg, 'r-', linewidth=2, label=f'{window}-episode average')
        
        axes[1].plot(self.episode_lengths, alpha=0.3, color='green')
        axes[1].set_xlabel('Episode')
        axes[1].set_ylabel('Episode Length')
        axes[1].set_title('Episode Length Progress')
        axes[1].legend()
        axes[1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()

print("Monte Carlo agent implementation complete!")

## 4. Monte Carlo Control with Exploring Starts

Monte Carlo control finds the optimal policy through **Generalized Policy Iteration** (GPI), which interleaves policy evaluation and policy improvement.

### Mathematical Framework

The goal is to find the optimal policy $\pi^*$ by alternating between:

1. **Policy Evaluation**: Estimate $Q^{\pi_k}$ using MC methods
2. **Policy Improvement**: Update policy greedily: $\pi_{k+1}(s) = \arg\max_a Q^{\pi_k}(s,a)$

### Exploring Starts Assumption

To ensure all state-action pairs are visited, we assume **exploring starts**: every state-action pair has nonzero probability of being selected as the starting pair.

### MC Control Algorithm

1. Initialize $Q(s,a)$ arbitrarily and $\pi(s)$ arbitrarily for all $s,a$
2. Repeat:
   - Generate episode with exploring starts
   - For each state-action pair $(s,a)$ in the episode:
     - $G \leftarrow$ return following first occurrence of $(s,a)$
     - Append $G$ to $Returns(s,a)$
     - $Q(s,a) \leftarrow$ average of $Returns(s,a)$
   - For each state $s$ in the episode:
     - $\pi(s) \leftarrow \arg\max_a Q(s,a)$
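
The loop above can be sketched on a hypothetical two-level decision tree. The `TRANSITIONS` table, the function `mc_exploring_starts`, and all rewards are assumptions made for illustration and are not part of the notebook's environments. Each episode starts from a uniformly drawn state-action pair and then follows the current greedy policy.

```python
import random
from collections import defaultdict

# Hypothetical MDP: TRANSITIONS[state][action] = (next_state, reward).
# A next_state of None marks termination.
TRANSITIONS = {
    0: {0: (1, 0.0), 1: (2, 0.0)},
    1: {0: (None, 1.0), 1: (None, 0.0)},
    2: {0: (None, 5.0), 1: (None, 2.0)},
}

def mc_exploring_starts(num_episodes=500, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {s: 0 for s in TRANSITIONS}              # arbitrary initial policy
    pairs = [(s, a) for s in TRANSITIONS for a in TRANSITIONS[s]]

    for _ in range(num_episodes):
        # Exploring start: draw the first state-action pair uniformly
        state, action = rng.choice(pairs)
        episode = []
        while state is not None:
            next_state, reward = TRANSITIONS[state][action]
            episode.append((state, action, reward))
            state = next_state
            if state is not None:
                action = policy[state]                # then follow the current policy

        # Each pair occurs at most once per episode in this tree,
        # so first-visit and every-visit updates coincide here.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]   # running average of returns
            policy[s] = max(TRANSITIONS[s], key=lambda b: Q[(s, b)])  # greedy improvement

    return dict(Q), policy

Q_es, policy_es = mc_exploring_starts()
print(policy_es)
```

With these rewards the greedy policy should settle on action 1 in state 0 and action 0 in states 1 and 2, since the branch through state 2 leads to the largest terminal reward.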

In [None]:
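# Test Monte Carlo control on Cliff Walking.
# Note: exploration here comes from the ε-greedy policy rather than exploring starts,
# because the environment always resets to the same start state.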
print("Training Monte Carlo Control on Cliff Walking...")

mc_agent = MonteCarloAgent(cliff_env, gamma=0.9, epsilon=0.1, epsilon_decay=0.9995)
Q_values, mc_policy = mc_agent.mc_control(num_episodes=20000, verbose=True)

print(f"\nMonte Carlo Control completed!")
print(f"Final ε: {mc_agent.epsilon:.6f}")
print(f"Average reward (last 100 episodes): {np.mean(mc_agent.episode_rewards[-100:]):.3f}")
print(f"Average length (last 100 episodes): {np.mean(mc_agent.episode_lengths[-100:]):.1f}")

# Extract value function from Q-values
mc_values = {state: np.max(Q_values[state]) for state in cliff_env.get_all_states()}

# Plot learning progress
mc_agent.plot_learning_curves()

In [None]:
# Visualize Monte Carlo results
print("Monte Carlo Control - Learned Value Function:")
cliff_env.render(values=mc_values)

print("\nMonte Carlo Control - Learned Policy:")
cliff_env.render(policy=mc_policy)

## 5. n-Step Temporal Difference Learning

n-step methods bridge Monte Carlo and TD(0) methods by looking ahead $n$ steps instead of just 1.

### n-Step Return

The **n-step return** from time $t$ is:
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

### n-Step TD Update

The n-step TD update rule is:
$$V(S_t) \leftarrow V(S_t) + \alpha [G_t^{(n)} - V(S_t)]$$

### Special Cases

- **n = 1**: Standard TD(0)
- **n = ∞**: Monte Carlo (full return)
- **n = intermediate**: Balanced bias-variance trade-off
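
A worked example may help here. The sketch below evaluates the n-step return on a hypothetical three-step episode; the rewards, the value estimates, and the helper `n_step_return` are assumptions chosen so the arithmetic is easy to check by hand.

```python
def n_step_return(rewards, values, t, n, gamma):
    """
    n-step return from time t.
    rewards[k] holds R_{k+1}; values[k] holds the current estimate V(S_k).
    The episode terminates after len(rewards) steps.
    """
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:
        # Bootstrap only when S_{t+n} is a non-terminal state
        G += gamma ** (end - t) * values[end]
    return G

rewards = [1.0, 0.0, 2.0]          # R_1, R_2, R_3 (hypothetical)
values = [0.0, 4.0, 6.0, 0.0]      # V(S_0), V(S_1), V(S_2), V(terminal) (hypothetical)
gamma = 0.5

g1 = n_step_return(rewards, values, t=0, n=1, gamma=gamma)     # TD(0) target
g2 = n_step_return(rewards, values, t=0, n=2, gamma=gamma)
g_mc = n_step_return(rewards, values, t=0, n=10, gamma=gamma)  # n beyond T: full return
print(g1, g2, g_mc)
```

Here the one-step target is $1 + 0.5 \cdot 4 = 3$, the two-step target is $1 + 0 + 0.25 \cdot 6 = 2.5$, and any $n \geq 3$ gives the Monte Carlo return $1 + 0 + 0.25 \cdot 2 = 1.5$, which no longer depends on the value estimates.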

### Bias-Variance Trade-off

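- **Smaller n**: Lower variance (fewer sampled rewards in the target), higher bias (the target leans more on the current, possibly inaccurate, value estimate)
- **Larger n**: Higher variance (more sampled rewards in the target), lower bias (the target leans less on bootstrapping)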

### n-Step SARSA

For action-value methods:
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$$

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [G_t^{(n)} - Q(S_t, A_t)]$$

In [None]:
class NStepTDAgent:
    """
    n-Step Temporal Difference Learning Agent.
    
    Implements n-step SARSA and n-step Expected SARSA for
    action-value function learning with configurable lookahead.
    """
    
    def __init__(self, env, n: int = 3, alpha: float = 0.1, gamma: float = 0.9, 
                 epsilon: float = 0.1, epsilon_decay: float = 0.995):
        self.env = env
        self.n = n  # Number of steps to look ahead
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        
        self.states = env.get_all_states()
        self.num_actions = env.num_actions
        
        # Initialize Q-function
        self.Q = defaultdict(lambda: np.zeros(self.num_actions))
        
        # Learning statistics
        self.episode_rewards = []
        self.episode_lengths = []
        self.td_errors = []
    
    def choose_action(self, state: Tuple[int, int], training: bool = True) -> int:
        """Choose action using ε-greedy policy."""
        if training and np.random.random() < self.epsilon:
            return np.random.randint(self.num_actions)
        else:
            return np.argmax(self.Q[state])
    
    def calculate_n_step_return(self, rewards: List[float], states: List, 
                               actions: List, start_idx: int, episode_length: int) -> float:
        """Calculate n-step return from given starting index."""
        G = 0.0
        end_idx = min(start_idx + self.n, episode_length)
        
        # Sum discounted rewards
        for i in range(start_idx, end_idx):
            G += (self.gamma ** (i - start_idx)) * rewards[i]
        
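        # Add bootstrapped value if the episode did not terminate within n steps.
        # `episode_length` is len(rewards) at call time, which always equals
        # end_idx mid-episode, so comparing against it would never bootstrap.
        # states[end_idx] exists only when S_{t+n} is non-terminal.
        if end_idx < len(states):
            bootstrap_state = states[end_idx]
            bootstrap_action = actions[end_idx]
            G += (self.gamma ** (end_idx - start_idx)) * self.Q[bootstrap_state][bootstrap_action]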
        
        return G
    
    def train(self, num_episodes: int = 5000, verbose: bool = True) -> None:
        """Train using n-step SARSA."""
        
        for episode in range(num_episodes):
            # Initialize episode
            state = self.env.reset()
            action = self.choose_action(state, training=True)
            
            # Store episode data
            states = [state]
            actions = [action]
            rewards = []
            
            episode_reward = 0
            t = 0
            T = float('inf')  # Episode termination time
            
            while True:
                if t < T:
                    # Take action and observe result
                    next_state, reward, done, _ = self.env.step(action)
                    rewards.append(reward)
                    episode_reward += reward
                    
                    if done:
                        T = t + 1
                    else:
                        states.append(next_state)
                        next_action = self.choose_action(next_state, training=True)
                        actions.append(next_action)
                        action = next_action
                
                # Update Q-function for state-action pair t-n+1
                update_time = t - self.n + 1
                if update_time >= 0:
                    # Calculate n-step return
                    G = self.calculate_n_step_return(rewards, states, actions, 
                                                   update_time, len(rewards))
                    
                    # Update Q-value
                    update_state = states[update_time]
                    update_action = actions[update_time]
                    
                    old_q = self.Q[update_state][update_action]
                    td_error = G - old_q
                    self.Q[update_state][update_action] += self.alpha * td_error
                    self.td_errors.append(abs(td_error))
                
                if update_time == T - 1:
                    break
                    
                t += 1
            
            # Store episode statistics
            self.episode_rewards.append(episode_reward)
            self.episode_lengths.append(len(rewards))
            
            # Decay epsilon
            if self.epsilon > 0.01:
                self.epsilon *= self.epsilon_decay
            
            if episode % 1000 == 0 and verbose:
                avg_reward = np.mean(self.episode_rewards[-100:]) if len(self.episode_rewards) >= 100 else np.mean(self.episode_rewards)
                avg_length = np.mean(self.episode_lengths[-100:]) if len(self.episode_lengths) >= 100 else np.mean(self.episode_lengths)
                print(f"Episode {episode}: Avg Reward = {avg_reward:.3f}, Avg Length = {avg_length:.1f}, ε = {self.epsilon:.4f}")
    
    def get_policy(self) -> Dict:
        """Extract greedy policy from Q-values."""
        policy = {}
        for state in self.states:
            if not self.env.is_terminal(state):
                policy[state] = np.argmax(self.Q[state])
        return policy
    
    def get_value_function(self) -> Dict:
        """Extract value function from Q-values."""
        return {state: np.max(self.Q[state]) for state in self.states}

print("n-Step TD agent implementation complete!")

In [None]:
# Compare different n-step values
print("Comparing n-Step SARSA with different n values...")

n_values = [1, 3, 5, 10]
n_step_results = {}

for n in n_values:
    print(f"\n=== Training {n}-Step SARSA ===")
    
    agent = NStepTDAgent(cliff_env, n=n, alpha=0.1, gamma=0.9, epsilon=0.1)
    agent.train(num_episodes=10000, verbose=True)
    
    policy = agent.get_policy()
    values = agent.get_value_function()
    
    n_step_results[n] = {
        'agent': agent,
        'policy': policy,
        'values': values,
        'final_reward': np.mean(agent.episode_rewards[-100:]),
        'final_length': np.mean(agent.episode_lengths[-100:])
    }
    
    print(f"{n}-Step SARSA - Final avg reward: {n_step_results[n]['final_reward']:.3f}")
    print(f"{n}-Step SARSA - Final avg length: {n_step_results[n]['final_length']:.1f}")

print("\n=== n-Step Comparison Summary ===")
print(f"{'n':<5} {'Final Reward':<15} {'Final Length':<15} {'Convergence':<15}")
print("-" * 55)
for n in n_values:
    reward = n_step_results[n]['final_reward']
    length = n_step_results[n]['final_length']
    # Simple convergence metric: mean reward over the last quarter of episodes
    # minus the mean reward over the second quarter
    agent = n_step_results[n]['agent']
    early_reward = np.mean(agent.episode_rewards[len(agent.episode_rewards)//4:len(agent.episode_rewards)//2])
    late_reward = np.mean(agent.episode_rewards[-len(agent.episode_rewards)//4:])
    improvement = late_reward - early_reward
    print(f"{n:<5} {reward:<15.3f} {length:<15.1f} {improvement:<15.3f}")

In [None]:
# Plot comparison of n-step methods
def plot_n_step_comparison():
    """Compare learning curves for different n values."""
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Episode rewards comparison
    window = 200
    colors = ['blue', 'red', 'green', 'orange']
    
    for i, (n, results) in enumerate(n_step_results.items()):
        agent = results['agent']
        if len(agent.episode_rewards) >= window:
            smooth_rewards = pd.Series(agent.episode_rewards).rolling(window=window).mean()
            axes[0, 0].plot(smooth_rewards, label=f'n={n}', color=colors[i], alpha=0.8)
    
    axes[0, 0].set_xlabel('Episode')
    axes[0, 0].set_ylabel('Average Episode Reward')
    axes[0, 0].set_title('n-Step SARSA: Learning Progress')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    # Episode lengths comparison
    for i, (n, results) in enumerate(n_step_results.items()):
        agent = results['agent']
        if len(agent.episode_lengths) >= window:
            smooth_lengths = pd.Series(agent.episode_lengths).rolling(window=window).mean()
            axes[0, 1].plot(smooth_lengths, label=f'n={n}', color=colors[i], alpha=0.8)
    
    axes[0, 1].set_xlabel('Episode')
    axes[0, 1].set_ylabel('Average Episode Length')
    axes[0, 1].set_title('n-Step SARSA: Episode Lengths')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    
    # Final performance comparison
    n_vals = list(n_step_results.keys())
    final_rewards = [n_step_results[n]['final_reward'] for n in n_vals]
    
    bars = axes[1, 0].bar([str(n) for n in n_vals], final_rewards, 
                         color=colors[:len(n_vals)], alpha=0.7)
    axes[1, 0].set_xlabel('n (steps ahead)')
    axes[1, 0].set_ylabel('Final Average Reward')
    axes[1, 0].set_title('Final Performance Comparison')
    
    # Add value labels on bars
    for bar, reward in zip(bars, final_rewards):
        axes[1, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5, 
                       f'{reward:.2f}', ha='center', va='bottom')
    
    # TD Error comparison (first 5000 updates)
    for i, (n, results) in enumerate(n_step_results.items()):
        agent = results['agent']
        if len(agent.td_errors) > 1000:
            # Smooth TD errors
            td_smooth = pd.Series(agent.td_errors[:5000]).rolling(window=100).mean()
            axes[1, 1].plot(td_smooth, label=f'n={n}', color=colors[i], alpha=0.8)
    
    axes[1, 1].set_xlabel('Update Step')
    axes[1, 1].set_ylabel('Average TD Error')
    axes[1, 1].set_title('TD Error Evolution')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

plot_n_step_comparison()

## 6. Eligibility Traces and TD(λ)

**Eligibility traces** provide an elegant way to implement n-step methods more efficiently and bridge TD and MC methods.

### Mathematical Foundation

The **eligibility trace** for state $s$ at time $t$ is:
$$e_t(s) = \gamma \lambda e_{t-1}(s) + \mathbf{1}_{S_t = s} \quad \text{(accumulating traces)}$$

$$e_t(s) = \begin{cases}
1 & \text{if } S_t = s \\
\gamma \lambda e_{t-1}(s) & \text{if } S_t \neq s
\end{cases} \quad \text{(replacing traces)}$$

Where $\lambda \in [0,1]$ is the **trace decay parameter**.
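
The two trace types differ only when a state is revisited before its trace has decayed. A small sketch, using a hypothetical visit pattern for a single state and an assumed helper `trace_history`:

```python
def trace_history(visits, gamma, lam, kind):
    """
    Trace of one state over time.
    visits[t] is True when S_t equals that state; kind is
    'accumulating' or 'replacing'.
    """
    e = 0.0
    history = []
    for visited in visits:
        e *= gamma * lam                       # decay happens at every step
        if visited:
            e = (e + 1.0) if kind == "accumulating" else 1.0
        history.append(e)
    return history

visits = [True, True, False, True]             # hypothetical visit pattern
acc = trace_history(visits, gamma=1.0, lam=0.5, kind="accumulating")
rep = trace_history(visits, gamma=1.0, lam=0.5, kind="replacing")
print("accumulating:", acc)
print("replacing:   ", rep)
```

Accumulating traces can exceed 1 on repeated visits, whereas replacing traces are capped at 1.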

### TD(λ) Update Rule

The TD(λ) update rule is:
$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
$$V(s) \leftarrow V(s) + \alpha \delta_t e_t(s) \text{ for all } s$$
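
As a minimal sketch of this update, the function below runs one episode of tabular TD(λ) with accumulating traces. The function name, the three-state episode, and the parameter values are assumptions chosen so the result can be verified by hand; this is not the agent implemented later in the notebook.

```python
def td_lambda_episode(V, states, rewards, alpha, gamma, lam):
    """
    One episode of tabular TD(lambda), backward view, accumulating traces.
    states[t] is S_t and rewards[t] is R_{t+1}; the episode terminates
    after the last reward, and the terminal state has value 0.
    """
    V = list(V)
    e = [0.0] * len(V)
    for t, (s, r) in enumerate(zip(states, rewards)):
        is_last = (t == len(states) - 1)
        v_next = 0.0 if is_last else V[states[t + 1]]
        delta = r + gamma * v_next - V[s]            # TD error
        e = [gamma * lam * x for x in e]             # decay all traces
        e[s] += 1.0                                  # mark the current state
        V = [v + alpha * delta * x for v, x in zip(V, e)]
    return V

V0 = [0.0, 0.0, 0.0]
states = [0, 1, 2]                 # hypothetical chain 0 -> 1 -> 2 -> terminal
rewards = [0.0, 0.0, 1.0]          # reward only on the final transition

V_td0 = td_lambda_episode(V0, states, rewards, alpha=0.5, gamma=1.0, lam=0.0)
V_tdl = td_lambda_episode(V0, states, rewards, alpha=0.5, gamma=1.0, lam=0.5)
print(V_td0)
print(V_tdl)
```

With λ = 0 only the last state learns from the final reward in this episode. With λ = 0.5 the same TD error also updates the two earlier states, scaled by their decayed traces.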

### Key Insights

- **λ = 0**: Reduces to TD(0)
- **λ = 1**: Equivalent to every-visit Monte Carlo (with accumulating traces and offline updates in episodic tasks)
- **λ ∈ (0,1)**: Balances bias and variance

### Forward vs Backward View

- **Forward view**: TD(λ) as weighted average of n-step returns
- **Backward view**: Eligibility traces distribute TD error backward in time
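
In the forward view for an episodic task, the n-step returns that bootstrap receive weights $(1-\lambda)\lambda^{n-1}$ and the full Monte Carlo return receives the remaining weight, so the weights sum to 1. The sketch below assumes the n-step returns have already been computed; the numbers and the helper `lambda_return` are hypothetical.

```python
def lambda_return(n_step_returns, lam):
    """
    Forward-view lambda-return for an episodic task.
    n_step_returns[n-1] holds G_t^(n); the last entry is the full
    Monte Carlo return, which absorbs all remaining weight.
    """
    num_truncated = len(n_step_returns) - 1
    weights = [(1 - lam) * lam ** k for k in range(num_truncated)]
    weights.append(lam ** num_truncated)
    G_lam = sum(w * g for w, g in zip(weights, n_step_returns))
    return G_lam, weights

# Hypothetical 1-step, 2-step, and full (3-step) returns from some state
n_step_returns = [3.0, 2.5, 1.5]

g_half, w_half = lambda_return(n_step_returns, lam=0.5)
g_zero, _ = lambda_return(n_step_returns, lam=0.0)
g_one, _ = lambda_return(n_step_returns, lam=1.0)
print(w_half, g_half, g_zero, g_one)
```

Setting λ = 0 puts all weight on the one-step return (the TD(0) target), and λ = 1 puts all weight on the full return (the Monte Carlo target).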

### SARSA(λ)

For action-value functions:
$$e_t(s,a) = \begin{cases}
\gamma \lambda e_{t-1}(s,a) + 1 & \text{if } S_t = s, A_t = a \\
\gamma \lambda e_{t-1}(s,a) & \text{otherwise}
\end{cases}$$

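The TD error is now defined on action values, and every pair is updated in proportion to its trace:
$$\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$$
$$Q(s,a) \leftarrow Q(s,a) + \alpha \delta_t e_t(s,a) \text{ for all } s,a$$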

In [None]:
class SARSALambdaAgent:
    """
    SARSA(λ) agent with eligibility traces.
    
    Implements both accumulating and replacing traces,
    providing efficient credit assignment across time.
    """
    
    def __init__(self, env, alpha: float = 0.1, gamma: float = 0.9, 
                 lambda_param: float = 0.9, epsilon: float = 0.1, 
                 epsilon_decay: float = 0.995, trace_type: str = 'accumulating'):
        self.env = env
        self.alpha = alpha
        self.gamma = gamma
        self.lambda_param = lambda_param  # λ parameter
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.trace_type = trace_type  # 'accumulating' or 'replacing'
        
        self.states = env.get_all_states()
        self.num_actions = env.num_actions
        
        # Initialize Q-function and eligibility traces
        self.Q = defaultdict(lambda: np.zeros(self.num_actions))
        self.traces = defaultdict(lambda: np.zeros(self.num_actions))
        
        # Learning statistics
        self.episode_rewards = []
        self.episode_lengths = []
        self.td_errors = []
        self.trace_magnitudes = []  # Track trace decay
    
    def choose_action(self, state: Tuple[int, int], training: bool = True) -> int:
        """Choose action using ε-greedy policy."""
        if training and np.random.random() < self.epsilon:
            return np.random.randint(self.num_actions)
        else:
            return np.argmax(self.Q[state])
    
    def update_traces(self, state: Tuple[int, int], action: int) -> None:
        """Update eligibility traces."""
        # Decay all traces
        for s in list(self.traces.keys()):
            self.traces[s] *= self.gamma * self.lambda_param
            
            # Remove traces that are too small (for efficiency)
            if np.max(np.abs(self.traces[s])) < 1e-8:
                del self.traces[s]
        
        # Update trace for current state-action
        if self.trace_type == 'accumulating':
            self.traces[state][action] += 1.0
        elif self.trace_type == 'replacing':
            # Clear the traces of all actions in this state, then set the taken action's trace to 1
            self.traces[state][:] = 0.0
            self.traces[state][action] = 1.0
    
    def train(self, num_episodes: int = 5000, verbose: bool = True) -> None:
        """Train using SARSA(λ)."""
        
        for episode in range(num_episodes):
            # Reset traces at beginning of episode
            self.traces.clear()
            
            # Initialize episode
            state = self.env.reset()
            action = self.choose_action(state, training=True)
            
            episode_reward = 0
            episode_length = 0
            
            while True:
                # Take action
                next_state, reward, done, _ = self.env.step(action)
                episode_reward += reward
                episode_length += 1
                
                # Choose next action
                if not done:
                    next_action = self.choose_action(next_state, training=True)
                else:
                    next_action = None
                
                # Calculate TD error
                if done:
                    td_error = reward - self.Q[state][action]
                else:
                    td_error = reward + self.gamma * self.Q[next_state][next_action] - self.Q[state][action]
                
                self.td_errors.append(abs(td_error))
                
                # Update eligibility trace for current state-action
                self.update_traces(state, action)
                
                # Update Q-values for all states using eligibility traces
                total_trace_magnitude = 0
                for s in list(self.traces.keys()):
                    for a in range(self.num_actions):
                        if abs(self.traces[s][a]) > 1e-8:
                            self.Q[s][a] += self.alpha * td_error * self.traces[s][a]
                            total_trace_magnitude += abs(self.traces[s][a])
                
                self.trace_magnitudes.append(total_trace_magnitude)
                
                if done:
                    break
                
                state = next_state
                action = next_action
            
            # Store episode statistics
            self.episode_rewards.append(episode_reward)
            self.episode_lengths.append(episode_length)
            
            # Decay epsilon
            if self.epsilon > 0.01:
                self.epsilon *= self.epsilon_decay
            
            if episode % 1000 == 0 and verbose:
                avg_reward = np.mean(self.episode_rewards[-100:]) if len(self.episode_rewards) >= 100 else np.mean(self.episode_rewards)
                avg_length = np.mean(self.episode_lengths[-100:]) if len(self.episode_lengths) >= 100 else np.mean(self.episode_lengths)
                print(f"Episode {episode}: Avg Reward = {avg_reward:.3f}, Avg Length = {avg_length:.1f}, ε = {self.epsilon:.4f}")
    
    def get_policy(self) -> Dict:
        """Extract greedy policy from Q-values."""
        policy = {}
        for state in self.states:
            if not self.env.is_terminal(state):
                policy[state] = np.argmax(self.Q[state])
        return policy
    
    def get_value_function(self) -> Dict:
        """Extract value function from Q-values."""
        return {state: np.max(self.Q[state]) for state in self.states}

print("SARSA(λ) agent implementation complete!")

In [None]:
# Compare different λ values
print("Comparing SARSA(λ) with different λ values...")

lambda_values = [0.0, 0.3, 0.7, 0.9]
lambda_results = {}

for lambda_val in lambda_values:
    print(f"\n=== Training SARSA(λ={lambda_val}) ===")
    
    agent = SARSALambdaAgent(cliff_env, alpha=0.1, gamma=0.9, 
                            lambda_param=lambda_val, epsilon=0.1, 
                            trace_type='accumulating')
    agent.train(num_episodes=8000, verbose=True)
    
    policy = agent.get_policy()
    values = agent.get_value_function()
    
    lambda_results[lambda_val] = {
        'agent': agent,
        'policy': policy,
        'values': values,
        'final_reward': np.mean(agent.episode_rewards[-100:]),
        'final_length': np.mean(agent.episode_lengths[-100:])
    }
    
    print(f"SARSA(λ={lambda_val}) - Final avg reward: {lambda_results[lambda_val]['final_reward']:.3f}")
    print(f"SARSA(λ={lambda_val}) - Final avg length: {lambda_results[lambda_val]['final_length']:.1f}")

print("\n=== λ Value Comparison Summary ===")
print(f"{'λ':<5} {'Final Reward':<15} {'Final Length':<15} {'Learning Speed':<15}")
print("-" * 55)
for lambda_val in lambda_values:
    reward = lambda_results[lambda_val]['final_reward']
    length = lambda_results[lambda_val]['final_length']
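    # Measure learning speed as episodes until the rolling average comes within
    # 25% of the final performance (abs() keeps the target valid for negative rewards)
    agent = lambda_results[lambda_val]['agent']
    target_reward = reward - 0.25 * abs(reward)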
    episodes_to_target = len(agent.episode_rewards)
    
    # Find when rolling average first exceeds target
    window = 100
    for i in range(window, len(agent.episode_rewards)):
        if np.mean(agent.episode_rewards[i-window:i]) >= target_reward:
            episodes_to_target = i
            break
    
    print(f"{lambda_val:<5} {reward:<15.3f} {length:<15.1f} {episodes_to_target:<15}")

In [None]:
# Plot λ comparison and trace analysis
def plot_lambda_comparison():
    """Compare SARSA(λ) with different λ values."""
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    window = 150
    colors = ['blue', 'red', 'green', 'orange']
    
    # Episode rewards comparison
    for i, (lambda_val, results) in enumerate(lambda_results.items()):
        agent = results['agent']
        if len(agent.episode_rewards) >= window:
            smooth_rewards = pd.Series(agent.episode_rewards).rolling(window=window).mean()
            axes[0, 0].plot(smooth_rewards, label=f'λ={lambda_val}', color=colors[i], alpha=0.8)
    
    axes[0, 0].set_xlabel('Episode')
    axes[0, 0].set_ylabel('Average Episode Reward')
    axes[0, 0].set_title('SARSA(λ): Learning Progress')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    # Trace magnitude evolution (first 10000 steps)
    for i, (lambda_val, results) in enumerate(lambda_results.items()):
        agent = results['agent']
        if len(agent.trace_magnitudes) > 1000:
            trace_smooth = pd.Series(agent.trace_magnitudes[:10000]).rolling(window=100).mean()
            axes[0, 1].plot(trace_smooth, label=f'λ={lambda_val}', color=colors[i], alpha=0.8)
    
    axes[0, 1].set_xlabel('Update Step')
    axes[0, 1].set_ylabel('Total Trace Magnitude')
    axes[0, 1].set_title('Eligibility Trace Magnitudes')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    
    # Final performance comparison
    lambda_vals = list(lambda_results.keys())
    final_rewards = [lambda_results[lam]['final_reward'] for lam in lambda_vals]
    
    bars = axes[1, 0].bar([str(lam) for lam in lambda_vals], final_rewards, 
                         color=colors[:len(lambda_vals)], alpha=0.7)
    axes[1, 0].set_xlabel('λ (trace decay parameter)')
    axes[1, 0].set_ylabel('Final Average Reward')
    axes[1, 0].set_title('Final Performance vs λ')
    
    # Add value labels on bars
    for bar, reward in zip(bars, final_rewards):
        axes[1, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5, 
                       f'{reward:.2f}', ha='center', va='bottom')
    
    # TD Error comparison
    for i, (lambda_val, results) in enumerate(lambda_results.items()):
        agent = results['agent']
        if len(agent.td_errors) > 1000:
            td_smooth = pd.Series(agent.td_errors[:5000]).rolling(window=100).mean()
            axes[1, 1].plot(td_smooth, label=f'λ={lambda_val}', color=colors[i], alpha=0.8)
    
    axes[1, 1].set_xlabel('Update Step')
    axes[1, 1].set_ylabel('Average TD Error')
    axes[1, 1].set_title('TD Error Evolution')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

plot_lambda_comparison()

## 7. Off-Policy Methods and Importance Sampling

**Off-policy** methods learn about a target policy while following a different behavior policy. This is crucial for learning optimal policies while maintaining exploration.

### Importance Sampling

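To correct for the difference between the behavior policy $b(a|s)$ and the target policy $\pi(a|s)$, we use **importance sampling**. This requires *coverage*: $b(a|s) > 0$ wherever $\pi(a|s) > 0$.

Returns are reweighted by the **importance sampling ratio**, the relative probability of the remaining trajectory under the two policies: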
$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}$$
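
For the common special case of a greedy target policy and an ε-greedy behavior policy, the product simplifies considerably. The sketch below assumes exactly that setup; `importance_ratio`, the `(state, action)` trajectory format, and the `greedy_action` mapping are hypothetical names chosen for illustration.

```python
def importance_ratio(trajectory, greedy_action, epsilon, num_actions):
    """rho_{t:T-1} for a greedy target policy and an epsilon-greedy behavior policy.

    `trajectory` is a list of (state, action) pairs from time t to T-1 and
    `greedy_action` maps each state to its greedy action (both hypothetical).
    """
    b_greedy = 1 - epsilon + epsilon / num_actions   # b(a|s) for the greedy action
    rho = 1.0
    for state, action in trajectory:
        if action != greedy_action[state]:
            return 0.0                        # pi(a|s) = 0 for any non-greedy action
        rho *= 1.0 / b_greedy                 # pi(a|s) = 1 for the greedy action
    return rho

greedy_action = {0: 1, 1: 0}
print(importance_ratio([(0, 1), (1, 0)], greedy_action, epsilon=0.2, num_actions=4))
print(importance_ratio([(0, 1), (1, 3)], greedy_action, epsilon=0.2, num_actions=4))
```

A single exploratory action anywhere in the trajectory drives the ratio to zero, so only the all-greedy tail of an episode carries information about the target policy.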

### Off-Policy Monte Carlo

The off-policy MC update is:
$$V(S_t) \leftarrow V(S_t) + \alpha [\rho_{t:T-1} G_t - V(S_t)]$$

### Ordinary vs Weighted Importance Sampling

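Let $\mathcal{T}(s)$ be the set of time steps at which $s$ is visited (only first visits, for a first-visit method) and $T(t)$ the termination time of the episode containing $t$.

**Ordinary Importance Sampling**: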
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$

**Weighted Importance Sampling**:
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$

### Properties

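- **Ordinary IS**: Unbiased, but the variance can be very large (even unbounded) because the ratios themselves are unbounded
- **Weighted IS**: Biased, with a bias that vanishes as the number of samples grows, and typically much lower variance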
- **Asymptotically**: Both converge to the true value
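
The difference between the two estimators is easiest to see on a handful of made-up samples. The ratios and returns below are illustrative numbers, not results from this notebook, and the function names are assumptions of this sketch.

```python
import numpy as np

def ordinary_is(rhos, returns):
    """Sum of weighted returns divided by the number of samples."""
    return np.sum(rhos * returns) / len(returns)

def weighted_is(rhos, returns):
    """Sum of weighted returns divided by the sum of the weights."""
    total = np.sum(rhos)
    return np.sum(rhos * returns) / total if total > 0 else 0.0

# Hypothetical ratios and returns for four visits to the same state.
rhos = np.array([0.0, 2.0, 0.0, 4.0])
returns = np.array([-10.0, -12.0, -8.0, -11.0])

print(ordinary_is(rhos, returns))   # can fall outside the range of observed returns
print(weighted_is(rhos, returns))   # always a convex combination of observed returns
```

Here the ordinary estimate lies well outside the observed returns, while the weighted estimate stays between the two returns that received nonzero weight. This is the variance difference described in the list above.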

### Off-Policy TD Learning

Q-learning is naturally off-policy, but we can also create off-policy versions of SARSA using importance sampling.
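
One way to make one-step SARSA off-policy is to scale the TD error by the ratio for the *next* action, which is the $n = 1$ case of off-policy n-step SARSA. The sketch below is a hedged illustration under that assumption; `off_policy_sarsa_update`, `pi_prob`, and `b_prob` are hypothetical names, not part of the agents in this notebook.

```python
import numpy as np

def off_policy_sarsa_update(Q, s, a, r, s_next, a_next, done,
                            pi_prob, b_prob, alpha=0.1, gamma=0.9):
    """One-step off-policy SARSA update with an importance sampling correction.

    pi_prob(s, a) and b_prob(s, a) are hypothetical callables returning the
    target and behavior probabilities of taking action `a` in state `s`.
    """
    if done:
        # No next action was sampled, so the ratio is an empty product (1).
        Q[s][a] += alpha * (r - Q[s][a])
        return
    rho = pi_prob(s_next, a_next) / b_prob(s_next, a_next)
    td_error = r + gamma * Q[s_next][a_next] - Q[s][a]
    Q[s][a] += alpha * rho * td_error

# Two states, two actions; greedy target policy, uniform random behavior policy.
Q = np.zeros((2, 2))
Q[1] = [1.0, 2.0]
pi_prob = lambda s, a: 1.0 if a == np.argmax(Q[s]) else 0.0
b_prob = lambda s, a: 0.5

off_policy_sarsa_update(Q, 0, 0, 1.0, 1, 1, False, pi_prob, b_prob)  # greedy next action
off_policy_sarsa_update(Q, 0, 0, 1.0, 1, 0, False, pi_prob, b_prob)  # non-greedy: rho = 0
print(Q[0, 0])
```

In expectation over the behavior policy's next action, the weighted error equals the Expected SARSA error under the target policy, which is why the correction removes the dependence on how `a_next` was chosen.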

In [None]:
class OffPolicyMonteCarloAgent:
    """
    Off-policy Monte Carlo agent using importance sampling.
    
    Learns about a target policy while following a behavior policy,
    using importance sampling to correct for the distribution mismatch.
    """
    
    def __init__(self, env, gamma: float = 0.9, epsilon: float = 0.1, 
                 weighted_is: bool = True):
        self.env = env
        self.gamma = gamma
        self.epsilon = epsilon  # For behavior policy
        self.weighted_is = weighted_is  # Use weighted importance sampling
        
        self.states = env.get_all_states()
        self.num_actions = env.num_actions
        
        # Q-function for target policy (will be greedy)
        self.Q = defaultdict(lambda: np.zeros(self.num_actions))
        
        # For weighted importance sampling
        self.C = defaultdict(lambda: np.zeros(self.num_actions))  # Cumulative weights
        
        # Learning statistics
        self.episode_rewards = []
        self.episode_lengths = []
        self.importance_ratios = []  # Track importance sampling ratios
    
    def behavior_policy(self, state: Tuple[int, int]) -> int:
        """ε-greedy behavior policy."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.num_actions)
        else:
            return np.argmax(self.Q[state])
    
    def target_policy_prob(self, state: Tuple[int, int], action: int) -> float:
        """Target policy probability (greedy with respect to Q)."""
        best_action = np.argmax(self.Q[state])
        return 1.0 if action == best_action else 0.0
    
    def behavior_policy_prob(self, state: Tuple[int, int], action: int) -> float:
        """Behavior policy probability (ε-greedy)."""
        best_action = np.argmax(self.Q[state])
        if action == best_action:
            return 1 - self.epsilon + self.epsilon / self.num_actions
        else:
            return self.epsilon / self.num_actions
    
    def generate_episode(self, max_steps: int = 1000) -> List:
        """Generate episode using behavior policy."""
        episode = []
        state = self.env.reset()
        
        for step in range(max_steps):
            action = self.behavior_policy(state)
            next_state, reward, done, _ = self.env.step(action)
            
            episode.append((state, action, reward))
            
            if done:
                break
                
            state = next_state
        
        return episode
    
    def calculate_returns(self, episode: List) -> List[float]:
        """Calculate returns for each time step."""
        returns = []
        G = 0
        
        for _, _, reward in reversed(episode):
            G = reward + self.gamma * G
            returns.append(G)
        
        returns.reverse()  # Back to chronological order; avoids O(n^2) insert(0, ...)
        return returns
    
    def train(self, num_episodes: int = 20000, verbose: bool = True) -> None:
        """Train using off-policy Monte Carlo with importance sampling."""
        
        for episode_num in range(num_episodes):
            # Generate episode using behavior policy
            episode = self.generate_episode(max_steps=500)
            returns = self.calculate_returns(episode)
            
            # Track episode statistics
            episode_reward = sum(step[2] for step in episode)
            self.episode_rewards.append(episode_reward)
            self.episode_lengths.append(len(episode))
            
            # Process episode backwards (off-policy MC)
            G = 0
            W = 1  # Importance sampling weight
            
            for t in range(len(episode) - 1, -1, -1):
                state, action, reward = episode[t]
                G = reward + self.gamma * G
                
                if self.weighted_is:
                    # Weighted importance sampling
                    self.C[state][action] += W
                    if self.C[state][action] > 0:
                        self.Q[state][action] += (W / self.C[state][action]) * (G - self.Q[state][action])
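                else:
                    # Ordinary importance sampling: C counts visits to (state, action)
                    # and Q is the running mean of the weighted returns W * G
                    self.C[state][action] += 1
                    self.Q[state][action] += (W * G - self.Q[state][action]) / self.C[state][action]
                
                # Update importance sampling weight. behavior_prob is evaluated with the
                # current Q, which approximates the policy that generated the episode.
                target_prob = self.target_policy_prob(state, action)
                behavior_prob = self.behavior_policy_prob(state, action)
                
                if behavior_prob == 0:
                    break  # Cannot continue if behavior policy has 0 probability
                
                W = W * target_prob / behavior_prob
                
                if W == 0 and self.weighted_is:
                    break  # Zero-weight returns change nothing under weighted IS;
                           # ordinary IS must still average them in as zeros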
            
            # Track the final cumulative weight, normalized by episode length
            if len(episode) > 0:
                self.importance_ratios.append(W / len(episode))
            
            if episode_num % 2000 == 0 and verbose:
                avg_reward = np.mean(self.episode_rewards[-200:]) if len(self.episode_rewards) >= 200 else np.mean(self.episode_rewards)
                avg_length = np.mean(self.episode_lengths[-200:]) if len(self.episode_lengths) >= 200 else np.mean(self.episode_lengths)
                avg_ratio = np.mean(self.importance_ratios[-200:]) if len(self.importance_ratios) >= 200 else 0
                print(f"Episode {episode_num}: Avg Reward = {avg_reward:.3f}, Avg Length = {avg_length:.1f}, Avg IS Ratio = {avg_ratio:.6f}")
    
    def get_policy(self) -> Dict:
        """Extract target policy (greedy with respect to Q)."""
        policy = {}
        for state in self.states:
            if not self.env.is_terminal(state):
                policy[state] = np.argmax(self.Q[state])
        return policy
    
    def get_value_function(self) -> Dict:
        """Extract value function from Q-values."""
        return {state: np.max(self.Q[state]) for state in self.states}

print("Off-policy Monte Carlo agent implementation complete!")

In [None]:
# Train off-policy Monte Carlo agent
print("Training Off-Policy Monte Carlo agent...")

off_policy_agent = OffPolicyMonteCarloAgent(cliff_env, gamma=0.9, epsilon=0.3, weighted_is=True)
off_policy_agent.train(num_episodes=25000, verbose=True)

off_policy_policy = off_policy_agent.get_policy()
off_policy_values = off_policy_agent.get_value_function()

print(f"\nOff-Policy Monte Carlo completed!")
print(f"Final avg reward: {np.mean(off_policy_agent.episode_rewards[-200:]):.3f}")
print(f"Final avg length: {np.mean(off_policy_agent.episode_lengths[-200:]):.1f}")

# Plot learning progress
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Episode rewards
window = 300
if len(off_policy_agent.episode_rewards) >= window:
    smooth_rewards = pd.Series(off_policy_agent.episode_rewards).rolling(window=window).mean()
    axes[0].plot(smooth_rewards, 'b-', linewidth=2, label='Off-policy MC')

axes[0].plot(off_policy_agent.episode_rewards, alpha=0.2, color='blue')
axes[0].set_xlabel('Episode')
axes[0].set_ylabel('Episode Reward')
axes[0].set_title('Off-Policy MC: Learning Progress')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Importance sampling ratios
if len(off_policy_agent.importance_ratios) >= window:
    smooth_ratios = pd.Series(off_policy_agent.importance_ratios).rolling(window=window).mean()
    axes[1].plot(smooth_ratios, 'r-', linewidth=2)

axes[1].plot(off_policy_agent.importance_ratios, alpha=0.2, color='red')
axes[1].set_xlabel('Episode')
axes[1].set_ylabel('Average Importance Ratio')
axes[1].set_title('Importance Sampling Ratios')
axes[1].grid(True, alpha=0.3)

# Episode lengths
if len(off_policy_agent.episode_lengths) >= window:
    smooth_lengths = pd.Series(off_policy_agent.episode_lengths).rolling(window=window).mean()
    axes[2].plot(smooth_lengths, 'g-', linewidth=2)

axes[2].plot(off_policy_agent.episode_lengths, alpha=0.2, color='green')
axes[2].set_xlabel('Episode')
axes[2].set_ylabel('Episode Length')
axes[2].set_title('Episode Length Progress')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 8. Comprehensive Algorithm Comparison

Let's compare all the methods we've implemented to understand their relative strengths and convergence properties.

In [None]:
# Comprehensive comparison of all methods
def comprehensive_algorithm_comparison():
    """Compare all implemented algorithms."""
    
    algorithms = {
        'Monte Carlo Control': {
            'final_reward': np.mean(mc_agent.episode_rewards[-100:]),
            'final_length': np.mean(mc_agent.episode_lengths[-100:]),
            'type': 'On-policy MC',
            'episodes': len(mc_agent.episode_rewards)
        },
        '1-Step SARSA': {
            'final_reward': n_step_results[1]['final_reward'],
            'final_length': n_step_results[1]['final_length'],
            'type': 'n-step TD (n=1)',
            'episodes': len(n_step_results[1]['agent'].episode_rewards)
        },
        '3-Step SARSA': {
            'final_reward': n_step_results[3]['final_reward'],
            'final_length': n_step_results[3]['final_length'],
            'type': 'n-step TD (n=3)',
            'episodes': len(n_step_results[3]['agent'].episode_rewards)
        },
        'SARSA(λ=0.7)': {
            'final_reward': lambda_results[0.7]['final_reward'],
            'final_length': lambda_results[0.7]['final_length'],
            'type': 'Eligibility Traces',
            'episodes': len(lambda_results[0.7]['agent'].episode_rewards)
        },
        'Off-Policy MC': {
            'final_reward': np.mean(off_policy_agent.episode_rewards[-200:]),
            'final_length': np.mean(off_policy_agent.episode_lengths[-200:]),
            'type': 'Off-policy MC + IS',
            'episodes': len(off_policy_agent.episode_rewards)
        }
    }
    
    print("=== Comprehensive Algorithm Comparison ===")
    print(f"{'Algorithm':<18} {'Type':<20} {'Final Reward':<15} {'Final Length':<15} {'Episodes':<10}")
    print("-" * 85)
    
    for name, info in algorithms.items():
        print(f"{name:<18} {info['type']:<20} {info['final_reward']:<15.3f} {info['final_length']:<15.1f} {info['episodes']:<10}")
    
    # Find best performing algorithm
    best_algorithm = max(algorithms.keys(), key=lambda x: algorithms[x]['final_reward'])
    print(f"\nBest performing: {best_algorithm} with {algorithms[best_algorithm]['final_reward']:.3f} reward")
    
    return algorithms

comparison_results = comprehensive_algorithm_comparison()

In [None]:
# Visualize algorithm comparison
def plot_algorithm_comparison():
    """Plot comprehensive comparison of all algorithms."""
    
    fig, axes = plt.subplots(2, 2, figsize=(16, 10))
    
    # Final performance comparison
    names = list(comparison_results.keys())
    rewards = [comparison_results[name]['final_reward'] for name in names]
    
    bars = axes[0, 0].bar(range(len(names)), rewards, 
                         color=['skyblue', 'lightcoral', 'lightgreen', 'orange', 'plum'],
                         alpha=0.8)
    axes[0, 0].set_xlabel('Algorithm')
    axes[0, 0].set_ylabel('Final Average Reward')
    axes[0, 0].set_title('Final Performance Comparison')
    axes[0, 0].set_xticks(range(len(names)))
    axes[0, 0].set_xticklabels(names, rotation=45, ha='right')
    
    # Add value labels on bars
    for bar, reward in zip(bars, rewards):
        axes[0, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, 
                       f'{reward:.2f}', ha='center', va='bottom')
    
    # Episode lengths comparison
    lengths = [comparison_results[name]['final_length'] for name in names]
    
    bars2 = axes[0, 1].bar(range(len(names)), lengths,
                          color=['skyblue', 'lightcoral', 'lightgreen', 'orange', 'plum'],
                          alpha=0.8)
    axes[0, 1].set_xlabel('Algorithm')
    axes[0, 1].set_ylabel('Final Average Episode Length')
    axes[0, 1].set_title('Episode Length Comparison')
    axes[0, 1].set_xticks(range(len(names)))
    axes[0, 1].set_xticklabels(names, rotation=45, ha='right')
    
    # Learning curves comparison (sample)
    window = 200
    
    # Monte Carlo
    if len(mc_agent.episode_rewards) >= window:
        mc_smooth = pd.Series(mc_agent.episode_rewards).rolling(window=window).mean()
        axes[1, 0].plot(mc_smooth, label='MC Control', alpha=0.8)
    
    # n-step SARSA (n=3)
    agent_3step = n_step_results[3]['agent']
    if len(agent_3step.episode_rewards) >= window:
        sarsa3_smooth = pd.Series(agent_3step.episode_rewards).rolling(window=window).mean()
        axes[1, 0].plot(sarsa3_smooth, label='3-Step SARSA', alpha=0.8)
    
    # SARSA(λ)
    agent_lambda = lambda_results[0.7]['agent']
    if len(agent_lambda.episode_rewards) >= window:
        lambda_smooth = pd.Series(agent_lambda.episode_rewards).rolling(window=window).mean()
        axes[1, 0].plot(lambda_smooth, label='SARSA(λ=0.7)', alpha=0.8)
    
    axes[1, 0].set_xlabel('Episode')
    axes[1, 0].set_ylabel('Average Episode Reward')
    axes[1, 0].set_title('Learning Curves Comparison')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)
    
    # Sample efficiency comparison
    episodes_needed = []
    
    for name in names:
        if name == 'Monte Carlo Control':
            agent = mc_agent
        elif name == '3-Step SARSA':
            agent = n_step_results[3]['agent']
        elif name == 'SARSA(λ=0.7)':
            agent = lambda_results[0.7]['agent']
        elif name == 'Off-Policy MC':
            agent = off_policy_agent
        else:
            agent = n_step_results[1]['agent']
        
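        # Find episodes until the rolling average comes within 10% of final
        # performance (abs() keeps the target valid for negative rewards)
        final_perf = comparison_results[name]['final_reward']
        target = final_perf - 0.1 * abs(final_perf)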
        episodes_to_target = len(agent.episode_rewards)
        
        window_small = min(100, len(agent.episode_rewards) // 10)
        if window_small > 1:
            for i in range(window_small, len(agent.episode_rewards)):
                if np.mean(agent.episode_rewards[i-window_small:i]) >= target:
                    episodes_to_target = i
                    break
        
        episodes_needed.append(episodes_to_target)
    
    bars3 = axes[1, 1].bar(range(len(names)), episodes_needed,
                          color=['skyblue', 'lightcoral', 'lightgreen', 'orange', 'plum'],
                          alpha=0.8)
    axes[1, 1].set_xlabel('Algorithm')
    axes[1, 1].set_ylabel('Episodes to 90% Performance')
    axes[1, 1].set_title('Sample Efficiency Comparison')
    axes[1, 1].set_xticks(range(len(names)))
    axes[1, 1].set_xticklabels(names, rotation=45, ha='right')
    
    plt.tight_layout()
    plt.show()

plot_algorithm_comparison()

## 9. Theoretical Analysis and Key Insights

From our comprehensive implementation and comparison, we can extract several key theoretical and practical insights.

In [None]:
# Analyze convergence properties and insights
def analyze_convergence_properties():
    """Analyze and summarize convergence properties of different methods."""
    
    print("=== Theoretical Analysis and Key Insights ===")
    print()
    
    print("1. BIAS-VARIANCE TRADE-OFF:")
    print("   • Monte Carlo: Unbiased, high variance (uses full returns)")
    print("   • TD(0): Biased by bootstrapping, lower variance (1-step bootstrap)")
    print("   • n-step: Intermediate bias-variance (n-step bootstrap)")
    print("   • TD(λ): Weighted combination of all n-step returns")
    print()
    
    print("2. SAMPLE EFFICIENCY:")
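    # 'episodes' is the training budget each agent was given, not a measured
    # convergence time, so report it as such
    smallest_budget = min(comparison_results.keys(), 
                         key=lambda x: comparison_results[x]['episodes'])
    print(f"   • Smallest training budget: {smallest_budget} "
          f"({comparison_results[smallest_budget]['episodes']} episodes)")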
    print("   • TD methods generally more sample efficient than MC")
    print("   • Eligibility traces accelerate learning via backward credit assignment")
    print()
    
    print("3. ON-POLICY vs OFF-POLICY:")
    on_policy_reward = comparison_results['Monte Carlo Control']['final_reward']
    off_policy_reward = comparison_results['Off-Policy MC']['final_reward']
    print(f"   • On-policy MC final reward: {on_policy_reward:.3f}")
    print(f"   • Off-policy MC final reward: {off_policy_reward:.3f}")
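    if off_policy_reward < on_policy_reward:
        print("   • Off-policy episode rewards are collected by the exploratory behavior policy,")
        print("     so they understate the greedy target policy; importance sampling also adds variance")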
    print("   • Off-policy enables learning optimal policy while exploring")
    print()
    
    print("4. PARAMETER SENSITIVITY:")
    print("   • n-step performance:")
    for n in [1, 3, 5, 10]:
        if n in n_step_results:
            print(f"     - n={n}: {n_step_results[n]['final_reward']:.3f} reward")
    print("   • λ parameter performance:")
    for lam in [0.0, 0.3, 0.7, 0.9]:
        if lam in lambda_results:
            print(f"     - λ={lam}: {lambda_results[lam]['final_reward']:.3f} reward")
    print()
    
    print("5. COMPUTATIONAL CONSIDERATIONS:")
    print("   • MC: Simple updates, requires episode completion")
    print("   • TD: Online updates, immediate feedback")
    print("   • Eligibility traces: More memory, backward credit assignment")
    print("   • Off-policy: Additional importance sampling computation")
    print()
    
    print("6. PRACTICAL RECOMMENDATIONS:")
    best_method = max(comparison_results.keys(), 
                     key=lambda x: comparison_results[x]['final_reward'])
    print(f"   • Best overall performance: {best_method}")
    print("   • For online learning: Use TD methods")
    print("   • For sample efficiency: Consider n-step or eligibility traces")
    print("   • For heavy exploration: Use off-policy methods, which learn a greedy target from ε-greedy behavior")
    print("   • For stability: Start with λ ∈ [0.7, 0.9] for eligibility traces")

analyze_convergence_properties()

## Summary

In this comprehensive notebook, we've explored the mathematical foundations and practical implementations of Monte Carlo and advanced Temporal Difference learning methods:

### **Monte Carlo Methods**
- **First-visit vs Every-visit**: Different approaches to handling multiple state visits
- **MC Prediction**: Unbiased value function estimation using actual returns
- **MC Control**: Policy optimization through generalized policy iteration
- **Exploring Starts**: Ensuring sufficient exploration for convergence

### **n-Step Temporal Difference Learning**
- **n-Step Returns**: Balancing bias and variance through multi-step bootstrapping
- **Parameter Selection**: Understanding the trade-offs for different n values
- **Unified Framework**: Bridging TD(0) and Monte Carlo methods

### **Eligibility Traces**
- **TD(λ)**: Averages all n-step returns (the λ-return) and implements this efficiently with eligibility traces
- **Forward vs Backward View**: Two perspectives on the same update, equivalent for offline updating
- **SARSA(λ)**: Action-value function learning with traces
- **Trace Types**: Accumulating vs replacing traces

### **Off-Policy Learning**
- **Importance Sampling**: Correcting for policy mismatch
- **Ordinary vs Weighted IS**: Bias-variance trade-offs in importance sampling
- **Off-Policy Monte Carlo**: Learning target policy while following behavior policy

### **Key Theoretical Insights**
- **Bias-Variance Trade-off**: Fundamental concept affecting all RL algorithms
- **Sample Efficiency**: TD methods generally more efficient than MC
- **Credit Assignment**: Eligibility traces enable efficient backward credit assignment
- **Exploration vs Exploitation**: Off-policy methods enable optimal policy learning with exploration

### **Practical Guidelines**
- Choose **n-step methods** when you can afford to wait n steps for updates
- Use **eligibility traces** for efficient credit assignment and faster learning
- Apply **off-policy methods** when you need to learn optimal policies while exploring
- Set **λ ∈ [0.7, 0.9]** as a good starting point for eligibility trace parameters

These methods form the foundation for understanding function approximation and deep reinforcement learning, which we'll explore in the next notebook as we transition from tabular to neural network-based approaches.