# Deep Reinforcement Learning - Session 3
## Temporal Difference Learning and Q-Learning

---

## Learning Objectives

By the end of this session, you will understand:

**Core Concepts:**
- **Temporal Difference (TD) Learning**: Learning from experience without knowing the model
- **Q-Learning Algorithm**: Off-policy TD control for finding optimal policies
- **SARSA Algorithm**: On-policy TD control method
- **Exploration vs Exploitation**: Balancing learning and performance

**Practical Skills:**
- Implement TD(0) for policy evaluation
- Build a Q-Learning agent from scratch
- Compare SARSA and Q-Learning performance
- Design exploration strategies (epsilon-greedy, decaying epsilon)
- Analyze convergence and learning curves

**Real-World Applications:**
- Game playing (Chess, Go, Atari games)
- Robotics control and navigation
- Resource allocation and scheduling
- Autonomous trading systems

---

## Session Overview

1. **Part 1**: Introduction to Temporal Difference Learning
2. **Part 2**: TD(0) Learning - Policy Evaluation
3. **Part 3**: Q-Learning - Off-Policy Control
4. **Part 4**: SARSA - On-Policy Control
5. **Part 5**: Exploration Strategies
6. **Part 6**: Comparative Analysis and Experiments

---

## Transition from Session 2

**Previous Session (Session 2):**
- MDPs and Bellman equations
- Policy evaluation and improvement
- **Model-based** approaches (knowing P and R)

**Current Session (Session 3):**
- **Model-free** learning (no knowledge of P and R)
- Learning directly from experience
- Online learning algorithms

**Key Transition:**
From "I know the environment model" to "I learn by trying actions and observing results"

---

## Part 1: Introduction to Temporal Difference Learning

### The Limitation of Dynamic Programming

In Session 2, we used **Dynamic Programming** methods like Policy Iteration, which required:
- Complete knowledge of the environment model (transition probabilities P(s'|s,a))
- Complete knowledge of reward function R(s,a,s')
- Ability to sweep through all states multiple times

**Real-World Challenge**: In most practical scenarios, we don't have complete knowledge of the environment.

### What is Temporal Difference Learning?

**Temporal Difference (TD) Learning** is a method that combines ideas from:
- **Monte Carlo methods**: Learning from experience samples
- **Dynamic Programming**: Bootstrapping from current estimates

**Key Principle**: Update value estimates based on observed transitions, without needing the complete model.

### Core TD Concept: Bootstrapping

Instead of waiting for complete episodes (Monte Carlo), TD methods update estimates using:
- **Current estimate**: V(s_t)
- **Observed reward**: R_{t+1}
- **Next state estimate**: V(s_{t+1})

**TD Update Rule**:
```
V(s_t) ← V(s_t) + α[R_{t+1} + γV(s_{t+1}) - V(s_t)]
```

Where:
- α (alpha): Learning rate (0 < α ≤ 1)
- γ (gamma): Discount factor (0 ≤ γ ≤ 1)
- [R_{t+1} + γV(s_{t+1}) - V(s_t)]: **TD Error**
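
To make the update rule concrete, here is a minimal sketch of a single TD(0) step on plain numbers. The function name `td0_update` and the sample values are illustrative assumptions, not part of the session code.

```python
def td0_update(v_s, reward, v_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update and return (new_value, td_error)."""
    # TD error: how far the bootstrapped target is from the current estimate
    td_error = reward + gamma * v_next - v_s
    # Move the estimate a fraction alpha toward the target
    return v_s + alpha * td_error, td_error


# Hypothetical transition: V(s_t) = 0.0, R_{t+1} = -0.1, V(s_{t+1}) = 2.0
new_v, delta = td0_update(v_s=0.0, reward=-0.1, v_next=2.0)
print(f"TD error = {delta:.2f}, updated V(s_t) = {new_v:.3f}")
```

With these numbers the TD target is -0.1 + 0.9 × 2.0 = 1.7, so the estimate moves 10% of the way from 0.0 toward 1.7.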

### The Three Learning Paradigms

| Method | Model Required | Update Frequency | Variance | Bias |
|--------|----------------|------------------|----------|------|
| **Dynamic Programming** | Yes | After full sweep | None | None (exact) |
| **Monte Carlo** | No | After episode | High | None |
| **Temporal Difference** | No | After each step | Low | Some (bootstrap) |
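
The difference between the Monte Carlo and TD rows can be illustrated by computing both learning targets for the first state of one short episode. This is a sketch under assumed values: the reward sequence and the estimate `v_next_estimate` are hypothetical.

```python
def mc_return(rewards, gamma=0.9):
    """Discounted return G_t, computed backwards over a complete episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_target(reward, v_next, gamma=0.9):
    """One-step bootstrapped target R_{t+1} + gamma * V(S_{t+1})."""
    return reward + gamma * v_next


rewards = [-0.1, -0.1, 10.0]  # hypothetical episode: two steps, then the goal
v_next_estimate = 5.0         # hypothetical current estimate of V(S_{t+1})

g0 = mc_return(rewards)                           # needs the whole episode
target0 = td_target(rewards[0], v_next_estimate)  # needs only one transition
print(f"MC target: {g0:.2f}, TD target: {target0:.2f}")
```

The Monte Carlo target uses only observed rewards but is available only once the episode ends. The TD target is available after one step, but it inherits any error in the current estimate of the next state.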

### TD Learning Advantages

1. **Online Learning**: Can learn while interacting with environment
2. **No Model Required**: Works without knowing P(s'|s,a) or R(s,a,s')
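3. **Lower Variance**: The target depends on one reward and one transition rather than a full return, so updates are typically less noisy than Monte Carlo
4. **Faster Learning**: Updates after each step rather than at the end of an episode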

### Real-World Analogy: Restaurant Reviews

- **Monte Carlo**: Rate the restaurant only after the whole meal is finished (complete episode)
- **TD Learning**: Update your opinion after each dish, combining what you just tasted with what you expect from the remaining dishes

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from typing import Dict, List, Tuple, Optional
import random
from collections import defaultdict, deque
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
random.seed(42)

plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
sns.set_style("whitegrid")

print("Libraries imported successfully!")
print("Environment configured for Temporal Difference Learning")
print("Session 3: Ready to explore model-free reinforcement learning!")


In [None]:
class GridWorld:
    """
    GridWorld environment for demonstrating TD learning algorithms
    Modified from Session 2 to support episodic interaction
    """
    
    def __init__(self, size=4, goal_reward=10, step_reward=-0.1, obstacle_reward=-5):
        self.size = size
        self.goal_reward = goal_reward
        self.step_reward = step_reward
        self.obstacle_reward = obstacle_reward
        
        self.states = [(i, j) for i in range(size) for j in range(size)]
        
        self.actions = ['up', 'down', 'left', 'right']
        self.action_effects = {
            'up': (-1, 0),
            'down': (1, 0),
            'left': (0, -1),
            'right': (0, 1)
        }
        
        self.start_state = (0, 0)
        self.goal_state = (3, 3)
        self.obstacles = [(1, 1), (2, 1), (1, 2)]
        
        self.current_state = self.start_state
        
    def reset(self):
        """Reset environment to start state"""
        self.current_state = self.start_state
        return self.current_state
    
    def step(self, action):
        """
        Take action and return (next_state, reward, done, info)
        Compatible with standard RL environment interface
        """
        if self.is_terminal(self.current_state):
            return self.current_state, 0, True, {}
        
        dx, dy = self.action_effects[action]
        next_x, next_y = self.current_state[0] + dx, self.current_state[1] + dy
        
        if not (0 <= next_x < self.size and 0 <= next_y < self.size):
            next_state = self.current_state  # Stay in place
        else:
            next_state = (next_x, next_y)
        
        if next_state == self.goal_state:
            reward = self.goal_reward
        elif next_state in self.obstacles:
            reward = self.obstacle_reward
            next_state = self.current_state  # Can't move into obstacle
        else:
            reward = self.step_reward
        
        done = (next_state == self.goal_state)
        
        self.current_state = next_state
        
        return next_state, reward, done, {}
    
    def get_valid_actions(self, state):
        """Get valid actions from a state"""
        if self.is_terminal(state):
            return []
        return self.actions
    
    def is_terminal(self, state):
        """Check if state is terminal"""
        return state == self.goal_state
    
    def visualize_values(self, values, title="State Values", policy=None):
        """Visualize state values and optional policy"""
        grid = np.zeros((self.size, self.size))
        for i, j in self.obstacles:
            grid[i, j] = min(values.values(), default=0) - 1  # Make obstacles darker; default avoids an error on empty values
        
        for i in range(self.size):
            for j in range(self.size):
                state = (i, j)
                if state not in self.obstacles:
                    grid[i, j] = values.get(state, 0)
        
        fig, ax = plt.subplots(figsize=(8, 6))
        im = ax.imshow(grid, cmap='RdYlGn', aspect='equal')
        
        arrow_map = {'up': '↑', 'down': '↓', 'left': '←', 'right': '→'}
        for i in range(self.size):
            for j in range(self.size):
                state = (i, j)
                if state == self.goal_state:
                    ax.text(j, i, 'G', ha='center', va='center', 
                           fontsize=16, fontweight='bold', color='darkgreen')
                elif state in self.obstacles:
                    ax.text(j, i, 'X', ha='center', va='center', 
                           fontsize=16, fontweight='bold', color='darkred')
                elif state == self.start_state:
                    ax.text(j, i-0.3, 'S', ha='center', va='center', 
                           fontsize=12, fontweight='bold', color='blue')
                    ax.text(j, i+0.2, f'{values.get(state, 0):.1f}', 
                           ha='center', va='center', fontsize=10)
                else:
                    ax.text(j, i, f'{values.get(state, 0):.1f}', 
                           ha='center', va='center', fontsize=10)
                
                if policy and state in policy and not self.is_terminal(state):
                    action = policy[state]
                    if action in arrow_map:
                        ax.text(j+0.3, i-0.3, arrow_map[action], 
                               ha='center', va='center', fontsize=8, color='blue')
        
        ax.set_title(title, fontsize=14, fontweight='bold')
        ax.set_xticks(range(self.size))
        ax.set_yticks(range(self.size))
        plt.colorbar(im, ax=ax)
        plt.tight_layout()
        plt.show()

env = GridWorld()
print("GridWorld environment created!")
print(f"State space: {len(env.states)} states")
print(f"Action space: {len(env.actions)} actions")
print(f"Start state: {env.start_state}")
print(f"Goal state: {env.goal_state}")
print(f"Obstacles: {env.obstacles}")

state = env.reset()
print(f"\nEnvironment reset. Current state: {state}")
next_state, reward, done, info = env.step('right')
print(f"Action 'right': next_state={next_state}, reward={reward}, done={done}")


## Part 2: TD(0) Learning - Policy Evaluation

### Understanding TD(0) Algorithm

**TD(0)** is the simplest temporal difference method for policy evaluation. It updates value estimates after each step using the observed reward and the current estimate of the next state.

### Mathematical Foundation

**Bellman Equation for V^π(s)**:
```
V^π(s) = E[R_{t+1} + γV^π(S_{t+1}) | S_t = s]
```

**TD(0) Update Rule**:
```
V(S_t) ← V(S_t) + α[R_{t+1} + γV(S_{t+1}) - V(S_t)]
```

**Components**:
- **V(S_t)**: Current value estimate
- **α**: Learning rate (step size)
- **R_{t+1}**: Observed immediate reward
- **γ**: Discount factor
- **TD Target**: R_{t+1} + γV(S_{t+1})
- **TD Error**: R_{t+1} + γV(S_{t+1}) - V(S_t)

### TD(0) vs Other Methods

| Aspect | Monte Carlo | TD(0) | Dynamic Programming |
|--------|-------------|-------|-------------------|
| **Model** | Not required | Not required | Required |
| **Update** | End of episode | Every step | Full sweep |
| **Target** | Actual return G_t | R_{t+1} + γV(S_{t+1}) | Expected value |
| **Bias** | Unbiased | Biased (bootstrap) | Unbiased |
| **Variance** | High | Low | None |

### Key Properties of TD(0)

1. **Bootstrapping**: Uses current estimates to update estimates
2. **Online Learning**: Can learn during interaction
3. **Model-Free**: No need for transition probabilities
4. **Convergence**: Converges to V^π for a fixed policy under the step-size and visitation conditions listed under Convergence Conditions below

### Learning Rate (α) Impact

- **High α (e.g., 0.8)**: Fast learning, high sensitivity to recent experience
- **Low α (e.g., 0.1)**: Slow learning, more stable, averages over many experiences
- **Optimal α**: Often requires tuning based on problem characteristics
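
The trade-off can be seen by tracking a noisy target with a constant step size. The sketch below is a simplified stand-in for value learning, not a full TD agent: the helper `track_estimate` and the synthetic samples are assumptions made for illustration.

```python
import numpy as np


def track_estimate(samples, alpha):
    """Constant step-size update v <- v + alpha * (x - v), starting from 0."""
    v = 0.0
    history = []
    for x in samples:
        v += alpha * (x - v)
        history.append(v)
    return np.array(history)


# Speed: noise-free targets equal to 1.0, estimate after 5 updates
fast_clean = track_estimate(np.ones(5), alpha=0.8)
slow_clean = track_estimate(np.ones(5), alpha=0.1)
print(f"After 5 updates: alpha=0.8 -> {fast_clean[-1]:.3f}, alpha=0.1 -> {slow_clean[-1]:.3f}")

# Stability: noisy targets with true mean 1.0 (local generator, global seed untouched)
rng = np.random.default_rng(0)
samples = 1.0 + rng.normal(0.0, 1.0, size=2000)
fast = track_estimate(samples, alpha=0.8)
slow = track_estimate(samples, alpha=0.1)
print(f"Spread of late estimates: alpha=0.8 -> {np.std(fast[500:]):.2f}, "
      f"alpha=0.1 -> {np.std(slow[500:]):.2f}")
```

The high step size reaches the target in a few updates but keeps fluctuating with the noise. The low step size moves slowly but averages over many samples.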

### Convergence Conditions

TD(0) converges to V^π if:
1. Policy π is fixed
2. Learning rate α satisfies: Σα_t = ∞ and Σα_t² < ∞
3. Every state is visited infinitely often under π
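
One schedule that satisfies the learning-rate condition is α_t = 1/t. The sketch below (function and variable names are illustrative assumptions) shows that this schedule reproduces the plain sample mean, and checks numerically that the partial sums behave as the condition requires.

```python
import numpy as np


def mean_with_decaying_alpha(samples):
    """Incremental update with alpha_n = 1/n; the result equals the sample mean."""
    v = 0.0
    for n, x in enumerate(samples, start=1):
        alpha = 1.0 / n
        v += alpha * (x - v)
    return v


estimate = mean_with_decaying_alpha([2.0, 4.0, 9.0])
print(f"Estimate with alpha_n = 1/n: {estimate:.3f}")  # sample mean of 2, 4, 9

t = np.arange(1, 100_001, dtype=float)
sum_alpha = np.sum(1.0 / t)          # keeps growing (roughly like ln t)
sum_alpha_sq = np.sum(1.0 / t ** 2)  # stays bounded (approaches pi^2 / 6)
print(f"Partial sums over 100000 steps: sum(alpha) = {sum_alpha:.2f}, "
      f"sum(alpha^2) = {sum_alpha_sq:.4f}")
```

A constant α, as used by the agents in this session, does not satisfy Σα_t² < ∞. In practice it is often still used because it keeps adapting, at the cost of estimates that fluctuate around the solution instead of settling exactly.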

In [None]:
class TD0Agent:
    """
    TD(0) agent for policy evaluation
    Learns state values V(s) for a given policy
    """
    
    def __init__(self, env, policy, alpha=0.1, gamma=0.9):
        self.env = env
        self.policy = policy
        self.alpha = alpha  # Learning rate
        self.gamma = gamma  # Discount factor
        
        self.V = defaultdict(float)
        
        self.episode_rewards = []
        self.value_history = []
        
    def get_action(self, state):
        """Get action from policy"""
        if hasattr(self.policy, 'get_action'):
            return self.policy.get_action(state)
        else:
            valid_actions = self.env.get_valid_actions(state)
            return np.random.choice(valid_actions) if valid_actions else None
    
    def td_update(self, state, reward, next_state, done):
        """
        Perform TD(0) update
        V(s) ← V(s) + α[R + γV(s') - V(s)]
        """
        if done:
            td_target = reward  # No next state value for terminal states
        else:
            td_target = reward + self.gamma * self.V[next_state]
        
        td_error = td_target - self.V[state]
        self.V[state] += self.alpha * td_error
        
        return td_error
    
    def run_episode(self, max_steps=100):
        """Run one episode and learn"""
        state = self.env.reset()
        episode_reward = 0
        steps = 0
        
        while steps < max_steps:
            action = self.get_action(state)
            if action is None:
                break
            
            next_state, reward, done, _ = self.env.step(action)
            episode_reward += reward
            
            td_error = self.td_update(state, reward, next_state, done)
            
            state = next_state
            steps += 1
            
            if done:
                break
        
        return episode_reward, steps
    
    def train(self, num_episodes=1000, print_every=100):
        """Train the agent over multiple episodes"""
        print(f"Training TD(0) agent for {num_episodes} episodes...")
        print(f"Learning rate α = {self.alpha}, Discount factor γ = {self.gamma}")
        
        for episode in range(num_episodes):
            episode_reward, steps = self.run_episode()
            self.episode_rewards.append(episode_reward)
            
            if episode % 10 == 0:
                self.value_history.append(dict(self.V))
            
            if (episode + 1) % print_every == 0:
                avg_reward = np.mean(self.episode_rewards[-print_every:])
                print(f"Episode {episode + 1}: Average reward = {avg_reward:.2f}")
        
        print("Training completed!")
        return self.V
    
    def get_value_function(self):
        """Get current value function as dictionary"""
        return dict(self.V)

class RandomPolicy:
    """Random policy for testing TD(0)"""
    
    def __init__(self, env):
        self.env = env
    
    def get_action(self, state):
        """Return random valid action"""
        valid_actions = self.env.get_valid_actions(state)
        if not valid_actions:
            return None
        return np.random.choice(valid_actions)

print("Creating TD(0) agent with random policy...")

random_policy = RandomPolicy(env)
td_agent = TD0Agent(env, random_policy, alpha=0.1, gamma=0.9)

print("TD(0) agent created successfully!")
print(f"Stored value estimates so far: {len(td_agent.V)} (unvisited states default to 0.0 via defaultdict)")


In [None]:
print("Training TD(0) agent...")
V_td = td_agent.train(num_episodes=500, print_every=100)

print("\nLearned Value Function:")
env.visualize_values(V_td, title="TD(0) Learned Value Function - Random Policy")

def plot_learning_curve(episode_rewards, title="Learning Curve"):
    """Plot learning curve showing episode rewards over time"""
    plt.figure(figsize=(12, 4))
    
    plt.subplot(1, 2, 1)
    plt.plot(episode_rewards, alpha=0.6, color='blue', linewidth=0.8)
    
    window_size = 50
    if len(episode_rewards) >= window_size:
        moving_avg = []
        for i in range(len(episode_rewards)):
            start_idx = max(0, i - window_size + 1)
            moving_avg.append(np.mean(episode_rewards[start_idx:i+1]))
        plt.plot(moving_avg, color='red', linewidth=2, label=f'Moving Average ({window_size} episodes)')
        plt.legend()
    
    plt.xlabel('Episode')
    plt.ylabel('Episode Reward')
    plt.title(f'{title} - Episode Rewards')
    plt.grid(True, alpha=0.3)
    
    plt.subplot(1, 2, 2)
    plt.hist(episode_rewards, bins=30, alpha=0.7, color='green', edgecolor='black')
    plt.xlabel('Episode Reward')
    plt.ylabel('Frequency')
    plt.title('Reward Distribution')
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print(f"Learning Statistics:")
    print(f"Total episodes: {len(episode_rewards)}")
    print(f"Average reward: {np.mean(episode_rewards):.2f}")
    print(f"Reward std: {np.std(episode_rewards):.2f}")
    print(f"Min reward: {np.min(episode_rewards):.2f}")
    print(f"Max reward: {np.max(episode_rewards):.2f}")

plot_learning_curve(td_agent.episode_rewards, "TD(0) Learning")

key_states = [(0, 0), (1, 0), (2, 0), (3, 2), (2, 2)]
print(f"\nLearned values for key states:")
print("State\t\tTD(0) Value")
print("-" * 30)
for state in key_states:
    if state in V_td:
        print(f"{state}\t\t{V_td[state]:.3f}")
    else:
        print(f"{state}\t\t0.000")

print(f"\nTD(0) Value Function Learning Complete!")
print(f"The agent learned state values through interaction with the environment.")


## Part 3: Q-Learning - Off-Policy Control

### From Policy Evaluation to Control

**TD(0)** solves the **policy evaluation** problem: given a policy π, learn V^π(s).

**Q-Learning** solves the **control** problem: find the optimal policy π* and optimal action-value function Q*(s,a).

### Q-Learning Algorithm

**Objective**: Learn Q*(s,a) = optimal action-value function

**Q-Learning Update Rule**:
```
Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t)]
```

**Key Components**:
- **Q(S_t, A_t)**: Current Q-value estimate
- **α**: Learning rate
- **R_{t+1}**: Observed reward
- **γ**: Discount factor
- **max_a Q(S_{t+1}, a)**: Maximum Q-value for next state (greedy action)
- **TD Target**: R_{t+1} + γ max_a Q(S_{t+1}, a)
- **TD Error**: R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t)
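
As a worked example of these components, the sketch below applies one Q-Learning update to a small hand-filled table. The helper `q_learning_update`, the `ACTIONS` list, and all numeric values are assumptions chosen for illustration.

```python
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]


def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """Apply one Q-Learning update in place and return the TD error."""
    # Terminal transitions have no future value to bootstrap from
    best_next = 0.0 if done else max(Q[s_next][b] for b in ACTIONS)
    td_error = r + gamma * best_next - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Hypothetical table entries
Q = defaultdict(lambda: defaultdict(float))
Q[(0, 0)]["right"] = 1.0
Q[(0, 1)]["right"] = 4.0
Q[(0, 1)]["down"] = -1.0

delta = q_learning_update(Q, s=(0, 0), a="right", r=-0.1, s_next=(0, 1), done=False)
print(f"TD error = {delta:.2f}, updated Q = {Q[(0, 0)]['right']:.3f}")
```

The target uses the largest Q-value in the next state (4.0 here), regardless of which action the agent takes next.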

### Off-Policy Nature

**Q-Learning is Off-Policy**:
- **Behavior Policy**: The policy used to generate actions (e.g., ε-greedy)
- **Target Policy**: The policy being learned (greedy w.r.t. Q)
- **Independence**: Can learn optimal policy while following exploratory policy

### Q-Learning vs SARSA Comparison

| Aspect | Q-Learning | SARSA |
|--------|------------|--------|
| **Type** | Off-policy | On-policy |
| **Update Target** | max_a Q(s',a) | Q(s',a') where a' ~ π |
| **Policy Learned** | Optimal (greedy) | Current policy |
| **Exploration Impact** | No direct impact on target | Affects learning target |
| **Convergence** | To Q* under conditions | To Q^π of current policy |

### Mathematical Foundation

**Bellman Optimality Equation**:
```
Q*(s,a) = E[R_{t+1} + γ max_{a'} Q*(S_{t+1}, a') | S_t = s, A_t = a]
```

**Q-Learning approximates this by**:
1. Using sample transitions instead of expectations
2. Using current Q estimates instead of true Q*
3. Updating incrementally with learning rate α

### Convergence Properties

Q-Learning converges to Q* under these conditions:
1. **Infinite exploration**: All state-action pairs visited infinitely often
2. **Learning rate conditions**: Σα_t = ∞ and Σα_t² < ∞
3. **Bounded rewards**: |R| ≤ R_max < ∞

### Exploration-Exploitation Trade-off

**Problem**: A purely greedy policy may never try, and therefore never discover, the optimal actions

**Solution**: ε-greedy policy
- With probability ε: Choose random action (explore)
- With probability 1-ε: Choose greedy action (exploit)

**ε-greedy variants**:
- **Fixed ε**: Constant exploration rate
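- **Decaying ε**: ε decreases over time, e.g., hyperbolically (ε_t = ε_0 / (1 + decay_rate * t)) or multiplicatively (ε ← max(ε_min, ε · decay), the schedule used by the agents in this session)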
- **Adaptive ε**: ε based on learning progress
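
Two common decaying schedules can be sketched as small functions. The names and default values here are illustrative assumptions. The multiplicative form mirrors the `epsilon_decay` / `epsilon_min` parameters used by the agents in this session, but it is written in closed form as a function of the episode index.

```python
def hyperbolic_epsilon(t, eps0=1.0, decay_rate=0.01):
    """eps_t = eps0 / (1 + decay_rate * t): slow decay that approaches zero."""
    return eps0 / (1.0 + decay_rate * t)


def exponential_epsilon(t, eps0=1.0, decay=0.995, eps_min=0.01):
    """eps_t = max(eps_min, eps0 * decay**t): fast decay with a floor."""
    return max(eps_min, eps0 * decay ** t)


for t in (0, 100, 1000):
    print(f"t={t:4d}  hyperbolic={hyperbolic_epsilon(t):.3f}  "
          f"exponential={exponential_epsilon(t):.3f}")
```

The floor `eps_min` keeps a small amount of exploration for the whole run. The hyperbolic schedule keeps shrinking toward zero.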

In [None]:
class QLearningAgent:
    """
    Q-Learning agent for finding optimal policy
    Learns Q*(s,a) through off-policy temporal difference learning
    """
    
    def __init__(self, env, alpha=0.1, gamma=0.9, epsilon=0.1, epsilon_decay=0.995, epsilon_min=0.01):
        self.env = env
        self.alpha = alpha          # Learning rate
        self.gamma = gamma          # Discount factor
        self.epsilon = epsilon      # Exploration rate
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        
        self.Q = defaultdict(lambda: defaultdict(float))
        
        self.episode_rewards = []
        self.episode_steps = []
        self.epsilon_history = []
        self.q_value_history = []
        
    def get_action(self, state, explore=True):
        """
        Get action using ε-greedy policy
        """
        if not explore:
            return self.get_greedy_action(state)
        
        if np.random.random() < self.epsilon:
            valid_actions = self.env.get_valid_actions(state)
            return np.random.choice(valid_actions) if valid_actions else None
        else:
            return self.get_greedy_action(state)
    
    def get_greedy_action(self, state):
        """Get greedy action (highest Q-value)"""
        valid_actions = self.env.get_valid_actions(state)
        if not valid_actions:
            return None
        
        q_values = {action: self.Q[state][action] for action in valid_actions}
        max_q = max(q_values.values())
        
        best_actions = [action for action, q in q_values.items() if q == max_q]
        return np.random.choice(best_actions)
    
    def update_q(self, state, action, reward, next_state, done):
        """
        Q-Learning update:
        Q(s,a) ← Q(s,a) + α[R + γ max_a' Q(s',a') - Q(s,a)]
        """
        current_q = self.Q[state][action]
        
        if done:
            td_target = reward
        else:
            valid_next_actions = self.env.get_valid_actions(next_state)
            if valid_next_actions:
                max_next_q = max([self.Q[next_state][a] for a in valid_next_actions])
            else:
                max_next_q = 0.0
            td_target = reward + self.gamma * max_next_q
        
        td_error = td_target - current_q
        self.Q[state][action] += self.alpha * td_error
        
        return td_error
    
    def run_episode(self, max_steps=200):
        """Run one episode and learn"""
        state = self.env.reset()
        episode_reward = 0
        steps = 0
        
        while steps < max_steps:
            action = self.get_action(state, explore=True)
            if action is None:
                break
            
            next_state, reward, done, _ = self.env.step(action)
            episode_reward += reward
            
            td_error = self.update_q(state, action, reward, next_state, done)
            
            state = next_state
            steps += 1
            
            if done:
                break
        
        return episode_reward, steps
    
    def train(self, num_episodes=1000, print_every=100):
        """Train the Q-learning agent"""
        print(f"Training Q-Learning agent for {num_episodes} episodes...")
        print(f"Parameters: α={self.alpha}, γ={self.gamma}, ε={self.epsilon}")
        
        for episode in range(num_episodes):
            episode_reward, steps = self.run_episode()
            
            self.episode_rewards.append(episode_reward)
            self.episode_steps.append(steps)
            self.epsilon_history.append(self.epsilon)
            
            if episode % 50 == 0:
                q_snapshot = {}
                for state in self.env.states:
                    q_snapshot[state] = dict(self.Q[state])
                self.q_value_history.append(q_snapshot)
            
            self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
            
            if (episode + 1) % print_every == 0:
                avg_reward = np.mean(self.episode_rewards[-print_every:])
                avg_steps = np.mean(self.episode_steps[-print_every:])
                print(f"Episode {episode + 1}: Avg Reward = {avg_reward:.2f}, "
                      f"Avg Steps = {avg_steps:.1f}, ε = {self.epsilon:.3f}")
        
        print("Q-Learning training completed!")
    
    def get_value_function(self):
        """Extract value function V*(s) = max_a Q*(s,a)"""
        V = {}
        for state in self.env.states:
            valid_actions = self.env.get_valid_actions(state)
            if valid_actions:
                V[state] = max([self.Q[state][action] for action in valid_actions])
            else:
                V[state] = 0.0
        return V
    
    def get_policy(self):
        """Extract optimal policy π*(s) = argmax_a Q*(s,a)"""
        policy = {}
        for state in self.env.states:
            if not self.env.is_terminal(state):
                policy[state] = self.get_greedy_action(state)
        return policy
    
    def evaluate_policy(self, num_episodes=100):
        """Evaluate learned policy (no exploration)"""
        rewards = []
        steps_list = []
        successes = 0  # Episodes that actually terminated at the goal
        
        for _ in range(num_episodes):
            state = self.env.reset()
            episode_reward = 0
            steps = 0
            
            while steps < 200:
                action = self.get_action(state, explore=False)  # No exploration
                if action is None:
                    break
                
                next_state, reward, done, _ = self.env.step(action)
                episode_reward += reward
                state = next_state
                steps += 1
                
                if done:
                    successes += 1
                    break
            
            rewards.append(episode_reward)
            steps_list.append(steps)
        
        return {
            'avg_reward': np.mean(rewards),
            'std_reward': np.std(rewards),
            'avg_steps': np.mean(steps_list),
            # Count goal terminations directly instead of thresholding the return
            'success_rate': successes / len(rewards)
        }

print("Creating Q-Learning agent...")
q_agent = QLearningAgent(env, alpha=0.1, gamma=0.9, epsilon=0.1, epsilon_decay=0.995)
print("Q-Learning agent created successfully!")
print("Ready to learn optimal Q-function Q*(s,a)")


In [None]:
print("Training Q-Learning agent...")
q_agent.train(num_episodes=1000, print_every=200)

V_optimal = q_agent.get_value_function()
optimal_policy = q_agent.get_policy()

print("\nLearned Optimal Value Function V*(s):")
env.visualize_values(V_optimal, title="Q-Learning: Optimal Value Function V*", policy=optimal_policy)

print("\nEvaluating learned policy...")
evaluation = q_agent.evaluate_policy(num_episodes=100)
print(f"Policy Evaluation Results:")
print(f"Average reward: {evaluation['avg_reward']:.2f} ± {evaluation['std_reward']:.2f}")
print(f"Average steps to goal: {evaluation['avg_steps']:.1f}")
print(f"Success rate: {evaluation['success_rate']*100:.1f}%")

def plot_q_learning_analysis(agent):
    """Comprehensive analysis of Q-Learning performance"""
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    ax1 = axes[0, 0]
    ax1.plot(agent.episode_rewards, alpha=0.6, color='blue', linewidth=0.8, label='Episode Reward')
    
    window = 50
    if len(agent.episode_rewards) >= window:
        moving_avg = pd.Series(agent.episode_rewards).rolling(window=window).mean()
        ax1.plot(moving_avg, color='red', linewidth=2, label=f'Moving Average ({window})')
    
    ax1.set_xlabel('Episode')
    ax1.set_ylabel('Episode Reward')
    ax1.set_title('Q-Learning: Episode Rewards')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    ax2 = axes[0, 1]
    ax2.plot(agent.episode_steps, alpha=0.7, color='green', linewidth=0.8)
    
    if len(agent.episode_steps) >= window:
        steps_avg = pd.Series(agent.episode_steps).rolling(window=window).mean()
        ax2.plot(steps_avg, color='darkgreen', linewidth=2, label=f'Moving Average ({window})')
    
    ax2.set_xlabel('Episode')
    ax2.set_ylabel('Steps to Goal')
    ax2.set_title('Q-Learning: Steps per Episode')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    ax3 = axes[1, 0]
    ax3.plot(agent.epsilon_history, color='purple', linewidth=2)
    ax3.set_xlabel('Episode')
    ax3.set_ylabel('Epsilon (ε)')
    ax3.set_title('Exploration Rate Decay')
    ax3.grid(True, alpha=0.3)
    
    ax4 = axes[1, 1]
    final_rewards = agent.episode_rewards[-200:]  # Last 200 episodes
    ax4.hist(final_rewards, bins=20, alpha=0.7, color='orange', edgecolor='black')
    ax4.axvline(np.mean(final_rewards), color='red', linestyle='--', 
                label=f'Mean: {np.mean(final_rewards):.2f}')
    ax4.set_xlabel('Episode Reward')
    ax4.set_ylabel('Frequency')
    ax4.set_title('Final Performance Distribution')
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

plot_q_learning_analysis(q_agent)

def show_q_values(agent, states_to_show=((0, 0), (1, 0), (2, 0), (0, 1), (2, 2))):
    """Display Q-values for specific states"""
    print("\nLearned Q-values for key states:")
    print("State\t\tAction\t\tQ-value")
    print("-" * 40)
    
    for state in states_to_show:
        if not agent.env.is_terminal(state):
            valid_actions = agent.env.get_valid_actions(state)
            for action in valid_actions:
                q_val = agent.Q[state][action]
                print(f"{state}\t\t{action}\t\t{q_val:.3f}")
            print("-" * 40)

show_q_values(q_agent)

print("\nQ-Learning training and evaluation complete.")
print("If the success rate above is close to 100%, the greedy policy reaches the goal while avoiding obstacles.")


## Part 4: SARSA - On-Policy Control

### Understanding SARSA Algorithm

**SARSA** (State-Action-Reward-State-Action) is an **on-policy** temporal difference control algorithm that learns the action-value function Q^π(s,a) for the policy it is following.

### SARSA vs Q-Learning: Key Differences

| Aspect | SARSA | Q-Learning |
|--------|--------|------------|
| **Policy Type** | On-policy | Off-policy |
| **Update Target** | Q(S', A') | max_a Q(S', a) |
| **Policy Learning** | Current behavior policy | Optimal policy |
| **Exploration Effect** | Affects learned Q-values | Only affects experience collection |
| **Safety** | More conservative | More aggressive |

### SARSA Update Rule

```
Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γQ(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]
```

**SARSA Tuple**: (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})
- **S_t**: Current state
- **A_t**: Current action
- **R_{t+1}**: Reward received
- **S_{t+1}**: Next state
- **A_{t+1}**: Next action (chosen by current policy)
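
The role of A_{t+1} becomes clear when the SARSA target is compared with the Q-Learning target for the same transition. The sketch below uses a hypothetical two-action table, and the function names are illustrative.

```python
def sarsa_target(Q, r, s_next, a_next, gamma=0.9):
    """On-policy target: bootstraps from the action actually chosen next."""
    return r + gamma * Q[s_next][a_next]


def q_learning_target(Q, r, s_next, gamma=0.9):
    """Off-policy target: bootstraps from the best next action."""
    return r + gamma * max(Q[s_next].values())


# Hypothetical Q-values in the next state
Q = {"s1": {"left": 1.0, "right": 3.0}}

explore = sarsa_target(Q, r=-0.1, s_next="s1", a_next="left")   # exploratory A'
greedy = sarsa_target(Q, r=-0.1, s_next="s1", a_next="right")   # greedy A'
off_policy = q_learning_target(Q, r=-0.1, s_next="s1")

print(f"SARSA (A'=left): {explore:.2f}")
print(f"SARSA (A'=right): {greedy:.2f}")
print(f"Q-Learning: {off_policy:.2f}")
```

When the next action is greedy, the two targets coincide. When it is exploratory, the SARSA target is lower, which is how exploration ends up reflected in the learned Q-values.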

### SARSA Algorithm Steps

1. Initialize Q(s,a) arbitrarily
2. **For each episode**:
   - Initialize S
   - Choose A from S using policy derived from Q (e.g., ε-greedy)
   - **For each step of episode**:
     - Take action A, observe R, S'
     - Choose A' from S' using policy derived from Q
     - **Update**: Q(S,A) ← Q(S,A) + α[R + γQ(S',A') - Q(S,A)]
     - S ← S', A ← A'

### On-Policy Nature

**SARSA learns Q^π** where π is the policy being followed:
- The policy used to select actions IS the policy being evaluated
- Exploration actions directly affect the learned Q-values
- More conservative in dangerous environments

### Expected SARSA

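**Variant**: Instead of the sampled next action A', use the expected Q-value over next actions under the current policy, which removes the variance introduced by sampling A':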

```
Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γE[Q(S_{t+1}, A_{t+1})|S_{t+1}] - Q(S_t, A_t)]
```

Where: E[Q(S_{t+1}, A_{t+1})|S_{t+1}] = Σ_a π(a|S_{t+1}) Q(S_{t+1}, a)
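
For an ε-greedy policy the expectation has a simple closed form: every action receives probability ε/n, and the remaining 1-ε is shared among the greedy actions. The sketch below assumes that policy and uses hypothetical Q-values. The function names are illustrative and not part of the session's agent classes.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities under epsilon-greedy; tied greedy actions share 1 - epsilon."""
    n = len(q_values)
    best = max(q_values.values())
    greedy = [a for a, q in q_values.items() if q == best]
    probs = {a: epsilon / n for a in q_values}
    for a in greedy:
        probs[a] += (1.0 - epsilon) / len(greedy)
    return probs


def expected_sarsa_target(q_next, reward, epsilon, gamma=0.9):
    """R + gamma * sum_a pi(a|S') Q(S', a) for an epsilon-greedy pi."""
    probs = epsilon_greedy_probs(q_next, epsilon)
    expected_q = sum(probs[a] * q_next[a] for a in q_next)
    return reward + gamma * expected_q


# Hypothetical Q-values in the next state
q_next = {"up": 0.0, "down": 1.0, "left": 2.0, "right": 4.0}

target = expected_sarsa_target(q_next, reward=-0.1, epsilon=0.2)
greedy_target = expected_sarsa_target(q_next, reward=-0.1, epsilon=0.0)
print(f"Expected SARSA target (eps=0.2): {target:.3f}")
print(f"Expected SARSA target (eps=0.0): {greedy_target:.3f}")
```

With ε = 0 the policy is fully greedy and the target reduces to the Q-Learning target R + γ max_a Q(S', a).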

### When to Use SARSA vs Q-Learning

**Use SARSA when**:
- Safety is important (e.g., robot navigation)
- You want to learn the policy you're actually following
- Environment has "cliffs" or dangerous states
- Conservative behavior is preferred

**Use Q-Learning when**:
- You want optimal performance
- Exploration is safe
- You can afford aggressive learning
- You want to learn from experience generated by a different (e.g., more exploratory) policy

### Convergence Properties

**SARSA Convergence**:
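- With a fixed policy π, converges to Q^π
- Converges to Q* if the policy becomes greedy in the limit while still exploring every state-action pair infinitely often (GLIE), e.g., ε-greedy with ε decaying to 0
- Requires the same visitation and learning-rate conditions as Q-Learning (Σα_t = ∞ and Σα_t² < ∞)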

In [None]:
class SARSAAgent:
    """
    SARSA agent for on-policy control
    Learns Q^π(s,a) for the policy being followed
    """
    
    def __init__(self, env, alpha=0.1, gamma=0.9, epsilon=0.1, epsilon_decay=0.995, epsilon_min=0.01):
        self.env = env
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        
        self.Q = defaultdict(lambda: defaultdict(float))
        
        self.episode_rewards = []
        self.episode_steps = []
        self.epsilon_history = []
        
    def get_action(self, state, explore=True):
        """Get action using ε-greedy policy"""
        if not explore:
            return self.get_greedy_action(state)
        
        if np.random.random() < self.epsilon:
            valid_actions = self.env.get_valid_actions(state)
            return np.random.choice(valid_actions) if valid_actions else None
        else:
            return self.get_greedy_action(state)
    
    def get_greedy_action(self, state):
        """Get greedy action"""
        valid_actions = self.env.get_valid_actions(state)
        if not valid_actions:
            return None
        
        q_values = {action: self.Q[state][action] for action in valid_actions}
        max_q = max(q_values.values())
        best_actions = [action for action, q in q_values.items() if q == max_q]
        return np.random.choice(best_actions)
    
    def update_q_sarsa(self, state, action, reward, next_state, next_action, done):
        """
        SARSA update: Q(S,A) ← Q(S,A) + α[R + γQ(S',A') - Q(S,A)]
        """
        current_q = self.Q[state][action]
        
        if done:
            td_target = reward
        else:
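            # Compare against None explicitly so a valid action such as 0 is not treated as missing
            next_q = self.Q[next_state][next_action] if next_action is not None else 0.0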
            td_target = reward + self.gamma * next_q
        
        td_error = td_target - current_q
        self.Q[state][action] += self.alpha * td_error
        
        return td_error
    
    def run_episode(self, max_steps=200):
        """Run one episode using SARSA"""
        state = self.env.reset()
        action = self.get_action(state, explore=True)
        
        episode_reward = 0
        steps = 0
        
        while steps < max_steps and action is not None:
            next_state, reward, done, _ = self.env.step(action)
            episode_reward += reward
            
            if done:
                next_action = None
            else:
                next_action = self.get_action(next_state, explore=True)
            
            td_error = self.update_q_sarsa(state, action, reward, next_state, next_action, done)
            
            state = next_state
            action = next_action
            steps += 1
            
            if done:
                break
        
        return episode_reward, steps
    
    def train(self, num_episodes=1000, print_every=100):
        """Train SARSA agent"""
        print(f"Training SARSA agent for {num_episodes} episodes...")
        print(f"Parameters: α={self.alpha}, γ={self.gamma}, ε={self.epsilon}")
        
        for episode in range(num_episodes):
            episode_reward, steps = self.run_episode()
            
            self.episode_rewards.append(episode_reward)
            self.episode_steps.append(steps)
            self.epsilon_history.append(self.epsilon)
            
            self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
            
            if (episode + 1) % print_every == 0:
                avg_reward = np.mean(self.episode_rewards[-print_every:])
                avg_steps = np.mean(self.episode_steps[-print_every:])
                print(f"Episode {episode + 1}: Avg Reward = {avg_reward:.2f}, "
                      f"Avg Steps = {avg_steps:.1f}, ε = {self.epsilon:.3f}")
        
        print("SARSA training completed!")
    
    def get_value_function(self):
        """Extract value function"""
        V = {}
        for state in self.env.states:
            valid_actions = self.env.get_valid_actions(state)
            if valid_actions:
                V[state] = max([self.Q[state][action] for action in valid_actions])
            else:
                V[state] = 0.0
        return V
    
    def get_policy(self):
        """Extract learned policy"""
        policy = {}
        for state in self.env.states:
            if not self.env.is_terminal(state):
                policy[state] = self.get_greedy_action(state)
        return policy
    
    def evaluate_policy(self, num_episodes=100):
        """Evaluate learned policy"""
        rewards = []
        steps_list = []
        
        for _ in range(num_episodes):
            state = self.env.reset()
            episode_reward = 0
            steps = 0
            
            while steps < 200:
                action = self.get_action(state, explore=False)
                if action is None:
                    break
                
                next_state, reward, done, _ = self.env.step(action)
                episode_reward += reward
                state = next_state
                steps += 1
                
                if done:
                    break
            
            rewards.append(episode_reward)
            steps_list.append(steps)
        
        return {
            'avg_reward': np.mean(rewards),
            'std_reward': np.std(rewards),
            'avg_steps': np.mean(steps_list),
            'success_rate': sum(1 for r in rewards if r > 5) / len(rewards)
        }

print("Creating SARSA agent...")
sarsa_agent = SARSAAgent(env, alpha=0.1, gamma=0.9, epsilon=0.1, epsilon_decay=0.995)

print("SARSA agent created successfully!")
print("Training SARSA agent...")
sarsa_agent.train(num_episodes=1000, print_every=200)

V_sarsa = sarsa_agent.get_value_function()
sarsa_policy = sarsa_agent.get_policy()

print("\nSARSA Learned Value Function:")
env.visualize_values(V_sarsa, title="SARSA: Learned Value Function", policy=sarsa_policy)

print("\nEvaluating SARSA policy...")
sarsa_evaluation = sarsa_agent.evaluate_policy(num_episodes=100)
print("SARSA Policy Evaluation:")
print(f"Average reward: {sarsa_evaluation['avg_reward']:.2f} ± {sarsa_evaluation['std_reward']:.2f}")
print(f"Average steps: {sarsa_evaluation['avg_steps']:.1f}")
print(f"Success rate: {sarsa_evaluation['success_rate']*100:.1f}%")


In [None]:
def compare_algorithms():
    """Compare TD(0), Q-Learning, and SARSA performance"""
    
    print("=" * 80)
    print("COMPREHENSIVE ALGORITHM COMPARISON")
    print("=" * 80)
    
    algorithms = {
        'TD(0)': {
            'agent': td_agent,
            'type': 'Policy Evaluation',
            'policy_type': 'Model-free evaluation',
            'learned_values': V_td,
            'evaluation': None
        },
        'Q-Learning': {
            'agent': q_agent,
            'type': 'Off-policy Control',
            'policy_type': 'Optimal policy',
            'learned_values': V_optimal,
            'evaluation': evaluation
        },
        'SARSA': {
            'agent': sarsa_agent,
            'type': 'On-policy Control',
            'policy_type': 'Behavior policy',
            'learned_values': V_sarsa,
            'evaluation': sarsa_evaluation
        }
    }
    
    print("\n1. PERFORMANCE COMPARISON")
    print("-" * 50)
    print(f"{'Algorithm':<12} {'Type':<20} {'Avg Reward':<12} {'Success Rate':<12}")
    print("-" * 50)
    
    for name, info in algorithms.items():
        if info['evaluation']:
            avg_reward = info['evaluation']['avg_reward']
            success_rate = info['evaluation']['success_rate'] * 100
            print(f"{name:<12} {info['type']:<20} {avg_reward:<12.2f} {success_rate:<12.1f}%")
        else:
            print(f"{name:<12} {info['type']:<20} {'N/A':<12} {'N/A':<12}")
    
    print("\n2. LEARNING CURVES COMPARISON")
    plt.figure(figsize=(15, 5))
    
    plt.subplot(1, 3, 1)
    if hasattr(td_agent, 'episode_rewards'):
        plt.plot(td_agent.episode_rewards, label='TD(0)', alpha=0.7, color='blue')
    plt.plot(q_agent.episode_rewards, label='Q-Learning', alpha=0.7, color='red')
    plt.plot(sarsa_agent.episode_rewards, label='SARSA', alpha=0.7, color='green')
    plt.xlabel('Episode')
    plt.ylabel('Episode Reward')
    plt.title('Episode Rewards Comparison')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.subplot(1, 3, 2)
    window = 50
    
    if len(q_agent.episode_rewards) >= window:
        q_avg = pd.Series(q_agent.episode_rewards).rolling(window=window).mean()
        plt.plot(q_avg, label='Q-Learning', linewidth=2, color='red')
    
    if len(sarsa_agent.episode_rewards) >= window:
        sarsa_avg = pd.Series(sarsa_agent.episode_rewards).rolling(window=window).mean()
        plt.plot(sarsa_avg, label='SARSA', linewidth=2, color='green')
    
    plt.xlabel('Episode')
    plt.ylabel('Moving Average Reward')
    plt.title(f'Moving Average ({window} episodes)')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.subplot(1, 3, 3)
    plt.plot(q_agent.epsilon_history, label='Q-Learning', color='red')
    plt.plot(sarsa_agent.epsilon_history, label='SARSA', color='green')
    plt.xlabel('Episode')
    plt.ylabel('Epsilon (ε)')
    plt.title('Exploration Rate Decay')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\n3. VALUE FUNCTION COMPARISON")
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    
    if V_td:
        grid_td = np.zeros((env.size, env.size))
        for i, j in env.obstacles:
            grid_td[i, j] = min(V_td.values()) - 1
        for i in range(env.size):
            for j in range(env.size):
                state = (i, j)
                if state not in env.obstacles:
                    grid_td[i, j] = V_td.get(state, 0)
        
        im1 = axes[0].imshow(grid_td, cmap='RdYlGn', aspect='equal')
        axes[0].set_title('TD(0) Values')
        plt.colorbar(im1, ax=axes[0])
    
    grid_q = np.zeros((env.size, env.size))
    for i, j in env.obstacles:
        grid_q[i, j] = min(V_optimal.values()) - 1
    for i in range(env.size):
        for j in range(env.size):
            state = (i, j)
            if state not in env.obstacles:
                grid_q[i, j] = V_optimal.get(state, 0)
    
    im2 = axes[1].imshow(grid_q, cmap='RdYlGn', aspect='equal')
    axes[1].set_title('Q-Learning Values')
    plt.colorbar(im2, ax=axes[1])
    
    grid_s = np.zeros((env.size, env.size))
    for i, j in env.obstacles:
        grid_s[i, j] = min(V_sarsa.values()) - 1
    for i in range(env.size):
        for j in range(env.size):
            state = (i, j)
            if state not in env.obstacles:
                grid_s[i, j] = V_sarsa.get(state, 0)
    
    im3 = axes[2].imshow(grid_s, cmap='RdYlGn', aspect='equal')
    axes[2].set_title('SARSA Values')
    plt.colorbar(im3, ax=axes[2])
    
    for ax in axes:
        ax.set_xticks(range(env.size))
        ax.set_yticks(range(env.size))
    
    plt.tight_layout()
    plt.show()
    
    print("\n4. STATISTICAL ANALYSIS")
    print("-" * 50)
    
    key_states = [(0, 0), (1, 0), (2, 0), (3, 2), (2, 2)]
    print(f"{'State':<10} {'TD(0)':<10} {'Q-Learning':<12} {'SARSA':<10} {'Q-S Diff':<10}")
    print("-" * 55)
    
    for state in key_states:
        td_val = V_td.get(state, 0) if V_td else 0
        q_val = V_optimal.get(state, 0)
        s_val = V_sarsa.get(state, 0)
        diff = abs(q_val - s_val)
        
        print(f"{str(state):<10} {td_val:<10.2f} {q_val:<12.2f} {s_val:<10.2f} {diff:<10.3f}")
    
    return algorithms

comparison_results = compare_algorithms()

print("\n" + "=" * 80)
print("ALGORITHM ANALYSIS SUMMARY")
print("=" * 80)
print("1. Q-Learning: Off-policy, learns the greedy (optimal) policy regardless of how it explores")
print("2. SARSA: Learns policy being followed, more conservative")
print("3. TD(0): Policy evaluation only, foundation for control methods")
print("4. Both Q-Learning and SARSA converge given sufficient exploration and suitable learning rates")
print("5. Choice depends on application requirements (safety vs optimality)")
print("=" * 80)


## Part 5: Exploration Strategies in Reinforcement Learning

### The Exploration-Exploitation Dilemma

**The Problem**: How to balance between:
- **Exploitation**: Choose actions that are currently believed to be best
- **Exploration**: Try actions that might lead to better long-term performance

**Why It Matters**: Without proper exploration, agents may:
- Get stuck in suboptimal policies
- Never discover better strategies
- Fail to adapt to changing environments

### Common Exploration Strategies

#### 1. Epsilon-Greedy (ε-greedy)

**Basic ε-greedy**:
- With probability ε: choose random action
- With probability 1-ε: choose greedy action

- **Advantages**: Simple, widely used, and every action keeps being tried as long as ε > 0
- **Disadvantages**: Explores uniformly at random, so clearly bad actions are tried as often as promising ones

#### 2. Decaying Epsilon

- **Exponential Decay**: ε_t = ε_0 × decay_rate^t
- **Linear Decay**: ε_t = max(ε_min, ε_0 - decay_rate × t)
- **Inverse Decay**: ε_t = ε_0 / (1 + decay_rate × t)

**Rationale**: High exploration early, more exploitation as learning progresses
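
As a small worked example, the sketch below computes how many episodes exponential decay needs before ε reaches its floor, using the default values of the `SARSAAgent` above (ε_0 = 0.1, decay_rate = 0.995, ε_min = 0.01). The function names are illustrative and not part of the notebook's classes.

```python
import math

def episodes_until_floor(eps0, decay_rate, eps_min):
    """Smallest t with eps0 * decay_rate**t <= eps_min (exponential decay)."""
    if eps0 <= eps_min:
        return 0
    return math.ceil(math.log(eps_min / eps0) / math.log(decay_rate))

def exponential_epsilon(eps0, decay_rate, eps_min, t):
    """ε_t = max(ε_min, ε_0 × decay_rate^t), i.e. exponential decay clipped at a floor."""
    return max(eps_min, eps0 * decay_rate ** t)

t_floor = episodes_until_floor(eps0=0.1, decay_rate=0.995, eps_min=0.01)
print(t_floor)  # 460
print(exponential_epsilon(0.1, 0.995, 0.01, t=100))
```

With these settings the agent stops reducing exploration after roughly 460 of its 1000 training episodes; a slower decay rate pushes that point later.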

#### 3. Boltzmann Exploration (Softmax)

**Softmax Action Selection**:
```
P(a|s) = e^(Q(s,a)/τ) / Σ_b e^(Q(s,b)/τ)
```

Where τ (tau) is the **temperature** parameter:
- High τ: More random (high exploration)
- Low τ: More greedy (low exploration)
- τ → 0: Pure greedy
- τ → ∞: Pure random
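
The effect of the temperature can be checked numerically. The sketch below is an illustration with hypothetical Q-values; it subtracts the maximum before exponentiating, a standard trick that leaves the probabilities unchanged but avoids overflow at low τ.

```python
import numpy as np

def softmax_probs(q_values, tau):
    """P(a|s) proportional to exp(Q(s,a)/τ), computed in a numerically stable way."""
    z = np.asarray(q_values, dtype=float) / tau
    z -= z.max()                 # shifting by a constant does not change the result
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

q = [1.0, 2.0, 3.0]              # hypothetical Q-values for three actions
for tau in (0.1, 1.0, 10.0):
    print(tau, np.round(softmax_probs(q, tau), 3))
```

At τ = 0.1 almost all probability goes to the best action, at τ = 1 the split is roughly 0.09 / 0.245 / 0.665, and at τ = 10 the three actions are close to equally likely.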

#### 4. Upper Confidence Bound (UCB)

**UCB Action Selection**:
```
A_t = argmax_a [Q_t(a) + c√(ln(t)/N_t(a))]
```

Where:
- Q_t(a): Current value estimate
- c: Confidence parameter
- t: Time step
- N_t(a): Number of times action a has been selected
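
A minimal sketch of this selection rule is shown below. It follows the bandit form of the formula (one count per action); in a tabular RL agent one would typically keep counts per state-action pair instead. The function name `ucb_action`, the default c = 2.0, and the example numbers are assumptions for illustration.

```python
import math

def ucb_action(q_values, counts, t, c=2.0):
    """Pick argmax_a Q_t(a) + c * sqrt(ln(t) / N_t(a)).

    Untried actions (N_t(a) == 0) are selected first, a common convention
    that also avoids dividing by zero.
    """
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical estimates: action 0 looks better, but action 1 is barely explored
q_values = [1.0, 0.8]
counts = [50, 2]
chosen = ucb_action(q_values, counts, t=52, c=2.0)
print(chosen)  # 1: the exploration bonus of the rarely tried action dominates
```

Setting c = 0 removes the bonus and recovers purely greedy selection.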

#### 5. Thompson Sampling (Bayesian)

**Concept**: Maintain probability distributions over Q-values, sample from these distributions to make decisions.

**Process**:
1. Maintain beliefs about action values
2. Sample Q-values from belief distributions
3. Choose action with highest sampled value
4. Update beliefs based on observed rewards
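
The four steps can be illustrated on the simplest case, a Bernoulli bandit with Beta beliefs. This is a sketch under stated assumptions (0/1 rewards, Beta(1, 1) priors, hypothetical success probabilities); applying Thompson sampling to full Q-values requires a richer belief model than shown here.

```python
import numpy as np

def thompson_bernoulli(true_probs, num_steps=2000, seed=0):
    """Thompson sampling on a Bernoulli bandit with Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    successes = np.ones(k)   # Beta alpha parameters (step 1: initial beliefs)
    failures = np.ones(k)    # Beta beta parameters
    pulls = np.zeros(k, dtype=int)

    for _ in range(num_steps):
        samples = rng.beta(successes, failures)          # step 2: sample from beliefs
        a = int(np.argmax(samples))                      # step 3: best sampled value
        reward = float(rng.random() < true_probs[a])     # observe a 0/1 reward
        successes[a] += reward                           # step 4: Bayesian update
        failures[a] += 1.0 - reward
        pulls[a] += 1
    return pulls, successes / (successes + failures)

pulls, posterior_means = thompson_bernoulli([0.2, 0.5, 0.8])
print(pulls, np.round(posterior_means, 2))
```

Exploration here is driven by uncertainty: arms with few pulls have wide posteriors and are still sampled occasionally, while the pull counts concentrate on the best arm as evidence accumulates.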

### Exploration in Different Environments

**Stationary Environments**: ε-greedy with decay works well
**Non-stationary Environments**: Constant ε or adaptive methods
**Sparse Reward Environments**: More sophisticated exploration needed
**Dangerous Environments**: Conservative exploration (lower ε)

In [None]:
class ExplorationStrategies:
    """Collection of exploration strategies for RL agents"""
    
    @staticmethod
    def epsilon_greedy(Q, state, valid_actions, epsilon):
        """Standard ε-greedy exploration"""
        if np.random.random() < epsilon:
            return np.random.choice(valid_actions)
        else:
            q_values = {action: Q[state][action] for action in valid_actions}
            max_q = max(q_values.values())
            best_actions = [a for a, q in q_values.items() if q == max_q]
            return np.random.choice(best_actions)
    
    @staticmethod
    def boltzmann_exploration(Q, state, valid_actions, temperature):
        """Boltzmann (softmax) exploration"""
        if temperature <= 0:
            q_values = {action: Q[state][action] for action in valid_actions}
            max_q = max(q_values.values())
            best_actions = [a for a, q in q_values.items() if q == max_q]
            return np.random.choice(best_actions)
        
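        q_values = np.array([Q[state][action] for action in valid_actions])
        # Subtract the max before exponentiating to avoid overflow at low temperatures
        exp_q = np.exp((q_values - np.max(q_values)) / temperature)
        probabilities = exp_q / np.sum(exp_q)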
        
        return np.random.choice(valid_actions, p=probabilities)
    
    @staticmethod
    def decay_epsilon(initial_epsilon, episode, decay_rate, min_epsilon, decay_type='exponential'):
        """Different epsilon decay strategies"""
        if decay_type == 'exponential':
            return max(min_epsilon, initial_epsilon * (decay_rate ** episode))
        elif decay_type == 'linear':
            return max(min_epsilon, initial_epsilon - decay_rate * episode)
        elif decay_type == 'inverse':
            return max(min_epsilon, initial_epsilon / (1 + decay_rate * episode))
        else:
            return initial_epsilon

class ExplorationExperiment:
    """Experiment with different exploration strategies"""
    
    def __init__(self, env):
        self.env = env
        
    def run_exploration_experiment(self, strategies, num_episodes=500, num_runs=3):
        """Compare different exploration strategies"""
        results = {}
        
        for strategy_name, params in strategies.items():
            print(f"Testing {strategy_name}...")
            
            strategy_results = []
            for run in range(num_runs):
                if strategy_name.startswith('epsilon'):
                    agent = QLearningAgent(self.env, alpha=0.1, gamma=0.9, 
                                         epsilon=params['epsilon'], 
                                         epsilon_decay=params.get('decay', 0.995))
                    agent.train(num_episodes=num_episodes, print_every=num_episodes)
                
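                elif strategy_name == 'boltzmann':
                    agent = BoltzmannQLearning(self.env, alpha=0.1, gamma=0.9,
                                             temperature=params['temperature'])
                    agent.train(num_episodes=num_episodes, print_every=num_episodes)
                
                else:
                    # Fail loudly instead of evaluating an undefined or stale agent
                    raise ValueError(f"Unknown exploration strategy: {strategy_name}")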
                
                evaluation = agent.evaluate_policy(num_episodes=100)
                strategy_results.append({
                    'rewards': agent.episode_rewards,
                    'evaluation': evaluation,
                    'final_epsilon': getattr(agent, 'epsilon', None)
                })
            
            results[strategy_name] = strategy_results
        
        return results

class BoltzmannQLearning:
    """Q-Learning with Boltzmann exploration"""
    
    def __init__(self, env, alpha=0.1, gamma=0.9, temperature=1.0, temp_decay=0.99, min_temp=0.01):
        self.env = env
        self.alpha = alpha
        self.gamma = gamma
        self.temperature = temperature
        self.temp_decay = temp_decay
        self.min_temp = min_temp
        
        self.Q = defaultdict(lambda: defaultdict(float))
        self.episode_rewards = []
        self.temperature_history = []
        
    def get_action(self, state, explore=True):
        """Boltzmann action selection"""
        valid_actions = self.env.get_valid_actions(state)
        if not valid_actions:
            return None
        
        if not explore:
            q_values = {action: self.Q[state][action] for action in valid_actions}
            max_q = max(q_values.values())
            best_actions = [a for a, q in q_values.items() if q == max_q]
            return np.random.choice(best_actions)
        
        return ExplorationStrategies.boltzmann_exploration(
            self.Q, state, valid_actions, self.temperature)
    
    def train(self, num_episodes=1000, print_every=100):
        """Train with Boltzmann exploration"""
        for episode in range(num_episodes):
            state = self.env.reset()
            episode_reward = 0
            steps = 0
            
            while steps < 200:
                action = self.get_action(state, explore=True)
                if action is None:
                    break
                
                next_state, reward, done, _ = self.env.step(action)
                episode_reward += reward
                
                current_q = self.Q[state][action]
                if done:
                    td_target = reward
                else:
                    valid_next_actions = self.env.get_valid_actions(next_state)
                    if valid_next_actions:
                        max_next_q = max([self.Q[next_state][a] for a in valid_next_actions])
                    else:
                        max_next_q = 0.0
                    td_target = reward + self.gamma * max_next_q
                
                self.Q[state][action] += self.alpha * (td_target - current_q)
                
                state = next_state
                steps += 1
                
                if done:
                    break
            
            self.episode_rewards.append(episode_reward)
            self.temperature_history.append(self.temperature)
            
            self.temperature = max(self.min_temp, self.temperature * self.temp_decay)
            
            if (episode + 1) % print_every == 0:
                avg_reward = np.mean(self.episode_rewards[-print_every:])
                print(f"Episode {episode + 1}: Avg Reward = {avg_reward:.2f}, Temp = {self.temperature:.3f}")
    
    def evaluate_policy(self, num_episodes=100):
        """Evaluate learned policy"""
        rewards = []
        for _ in range(num_episodes):
            state = self.env.reset()
            episode_reward = 0
            steps = 0
            
            while steps < 200:
                action = self.get_action(state, explore=False)
                if action is None:
                    break
                
                next_state, reward, done, _ = self.env.step(action)
                episode_reward += reward
                state = next_state
                steps += 1
                
                if done:
                    break
            
            rewards.append(episode_reward)
        
        return {
            'avg_reward': np.mean(rewards),
            'std_reward': np.std(rewards),
            'success_rate': sum(1 for r in rewards if r > 5) / len(rewards)
        }

print("EXPLORATION STRATEGIES EXPERIMENT")
print("=" * 50)

exploration_experiment = ExplorationExperiment(env)

strategies = {
    'epsilon_0.1': {'epsilon': 0.1, 'decay': 1.0},  # Fixed epsilon
    'epsilon_0.3': {'epsilon': 0.3, 'decay': 1.0},  # Higher fixed epsilon
    'epsilon_decay_fast': {'epsilon': 0.9, 'decay': 0.99},  # Fast decay
    'epsilon_decay_slow': {'epsilon': 0.5, 'decay': 0.995},  # Slow decay
    'boltzmann': {'temperature': 2.0}  # Boltzmann exploration
}

results = exploration_experiment.run_exploration_experiment(strategies, num_episodes=300, num_runs=2)

def analyze_exploration_results(results):
    """Analyze and visualize exploration experiment results"""
    
    print("\nEXPLORATION STRATEGY COMPARISON")
    print("-" * 60)
    print(f"{'Strategy':<20} {'Avg Reward':<12} {'Success Rate':<15} {'Std Reward':<12}")
    print("-" * 60)
    
    strategy_performance = {}
    
    for strategy, runs in results.items():
        avg_rewards = [run['evaluation']['avg_reward'] for run in runs]
        success_rates = [run['evaluation']['success_rate'] for run in runs]
        
        mean_reward = np.mean(avg_rewards)
        mean_success = np.mean(success_rates)
        std_reward = np.std(avg_rewards)
        
        strategy_performance[strategy] = {
            'mean_reward': mean_reward,
            'mean_success': mean_success,
            'std_reward': std_reward
        }
        
        print(f"{strategy:<20} {mean_reward:<12.2f} {mean_success*100:<15.1f}% {std_reward:<12.3f}")
    
    plt.figure(figsize=(15, 10))
    
    plt.subplot(2, 2, 1)
    for strategy, runs in results.items():
        avg_rewards = np.mean([run['rewards'] for run in runs], axis=0)
        plt.plot(avg_rewards, label=strategy, alpha=0.8)
    
    plt.xlabel('Episode')
    plt.ylabel('Episode Reward')
    plt.title('Learning Curves by Exploration Strategy')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.subplot(2, 2, 2)
    strategies_list = list(strategy_performance.keys())
    rewards = [strategy_performance[s]['mean_reward'] for s in strategies_list]
    errors = [strategy_performance[s]['std_reward'] for s in strategies_list]
    
    bars = plt.bar(range(len(strategies_list)), rewards, yerr=errors, 
                   capsize=5, alpha=0.7, color=['blue', 'red', 'green', 'orange', 'purple'])
    plt.xticks(range(len(strategies_list)), strategies_list, rotation=45)
    plt.ylabel('Average Reward')
    plt.title('Final Performance Comparison')
    plt.grid(True, alpha=0.3)
    
    plt.subplot(2, 2, 3)
    success_rates = [strategy_performance[s]['mean_success']*100 for s in strategies_list]
    plt.bar(range(len(strategies_list)), success_rates, alpha=0.7, color='green')
    plt.xticks(range(len(strategies_list)), strategies_list, rotation=45)
    plt.ylabel('Success Rate (%)')
    plt.title('Success Rate Comparison')
    plt.grid(True, alpha=0.3)
    
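    plt.subplot(2, 2, 4)
    # Each run is a dict, so look up keys (hasattr on a dict key is always False).
    # Only the final epsilon is stored per run, so plot that per strategy.
    eps_names, eps_values = [], []
    for strategy, runs in results.items():
        final_eps = [run['final_epsilon'] for run in runs
                     if run.get('final_epsilon') is not None]
        if final_eps:
            eps_names.append(strategy)
            eps_values.append(np.mean(final_eps))
    
    plt.bar(range(len(eps_names)), eps_values, alpha=0.7, color='orange')
    plt.xticks(range(len(eps_names)), eps_names, rotation=45)
    plt.ylabel('Final Epsilon (ε)')
    plt.title('Final Exploration Rate by Strategy')
    plt.grid(True, alpha=0.3)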
    
    plt.tight_layout()
    plt.show()
    
    return strategy_performance

performance_analysis = analyze_exploration_results(results)

print("\n" + "=" * 80)
print("EXPLORATION STRATEGY INSIGHTS")
print("=" * 80)
print("1. Fixed epsilon strategies provide consistent exploration")
print("2. Decaying epsilon balances exploration and exploitation over time")
print("3. Boltzmann exploration provides principled probabilistic action selection")
print("4. Higher initial epsilon may find better solutions but converge slower")
print("5. The best strategy depends on environment characteristics")
print("=" * 80)


## Part 6: Advanced Topics and Extensions

### Double Q-Learning

**Problem with Q-Learning**: Maximization bias due to using the same Q-values for both action selection and evaluation.

**Solution**: Double Q-Learning maintains two Q-functions:
- Q_A and Q_B
- Randomly choose which one to update
- Use one for action selection, the other for evaluation

**Update Rule**:
```
If random() < 0.5:
    Q_A(S,A) ← Q_A(S,A) + α[R + γQ_B(S', argmax_a Q_A(S',a)) - Q_A(S,A)]
Else:
    Q_B(S,A) ← Q_B(S,A) + α[R + γQ_A(S', argmax_a Q_B(S',a)) - Q_B(S,A)]
```
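
The update rule can be sketched for tabular Q-functions as below, assuming the same nested-`defaultdict` tables used by the agents in this notebook. The function name `double_q_update`, the `rng` argument, and the example states are hypothetical; the stub random source exists only to make the example deterministic.

```python
import random
from collections import defaultdict

def double_q_update(Q_A, Q_B, state, action, reward, next_state,
                    next_actions, done, alpha=0.1, gamma=0.9, rng=random):
    """Apply one Double Q-Learning update to a randomly chosen table."""
    # Pick which table learns; the other one evaluates the selected action.
    if rng.random() < 0.5:
        learner, evaluator = Q_A, Q_B
    else:
        learner, evaluator = Q_B, Q_A

    if done or not next_actions:
        td_target = reward
    else:
        best = max(next_actions, key=lambda a: learner[next_state][a])  # selection
        td_target = reward + gamma * evaluator[next_state][best]        # evaluation

    td_error = td_target - learner[state][action]
    learner[state][action] += alpha * td_error
    return td_error

Q_A = defaultdict(lambda: defaultdict(float))
Q_B = defaultdict(lambda: defaultdict(float))
Q_A["s1"]["up"] = 2.0      # Q_A prefers "up" in s1
Q_B["s1"]["up"] = 1.0      # Q_B's independent estimate of that action

class AlwaysA:
    """Stub random source that forces the Q_A branch (for a reproducible demo)."""
    def random(self):
        return 0.0

err = double_q_update(Q_A, Q_B, "s0", "right", reward=1.0, next_state="s1",
                      next_actions=["up", "down"], done=False, rng=AlwaysA())
print(round(err, 3), round(Q_A["s0"]["right"], 3))  # 1.9 0.19
```

Q_A picks "up" as the best next action, but its value is read from Q_B (1.0 rather than 2.0), which is what dampens the overestimation.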

### Experience Replay

**Concept**: Store experiences in a replay buffer and sample randomly for learning.

**Benefits**:
- Breaks temporal correlations in experience
- More sample efficient: each stored transition can be reused in many updates
- Enables offline learning from stored experiences

**Implementation**:
1. Store (s, a, r, s', done) tuples in buffer
2. Sample random mini-batches for updates
3. Update Q-function using sampled experiences
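
A minimal sketch of these three steps for a tabular agent is shown below. The class `ReplayBuffer`, the helper `replay_q_update`, and the toy transitions are assumptions for illustration; the Q-table layout mirrors the nested `defaultdict` used elsewhere in this notebook.

```python
import random
from collections import defaultdict, deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s', done) transitions."""

    def __init__(self, capacity=10000, seed=None):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch without replacement."""
        batch_size = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def replay_q_update(Q, batch, actions, alpha=0.1, gamma=0.9):
    """Apply the Q-Learning update to every sampled transition."""
    for s, a, r, s2, done in batch:
        target = r if done else r + gamma * max(Q[s2][b] for b in actions)
        Q[s][a] += alpha * (target - Q[s][a])

# Toy usage: five transitions into a buffer that only holds three
buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buffer.push(i, "a", 1.0, i + 1, False)

batch = buffer.sample(2)
Q = defaultdict(lambda: defaultdict(float))
replay_q_update(Q, batch, actions=["a"])
print(len(buffer), batch)
```

Because sampled transitions were generated by older versions of the policy, replay fits off-policy methods such as Q-Learning more naturally than on-policy methods such as SARSA.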

### Multi-Step Learning

- **TD(λ)**: Generalization of TD(0) using eligibility traces
- **n-step Q-learning**: Updates based on n-step returns

**n-step Return** (bootstrapping from the action taken at step t+n, as in n-step SARSA):
```
G_t^{(n)} = R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n Q(S_{t+n}, A_{t+n})
```
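
As a worked example of this formula, the sketch below accumulates the return backwards from the bootstrap value. The function name `n_step_return` and the reward sequence are hypothetical; `bootstrap_value` stands for Q(S_{t+n}, A_{t+n}).

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """G = R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n * bootstrap_value."""
    g = bootstrap_value
    for r in reversed(rewards):     # work backwards: g <- r + γ g
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]           # hypothetical R_{t+1}, R_{t+2}, R_{t+3}
g3 = n_step_return(rewards, bootstrap_value=10.0, gamma=0.9)
print(round(g3, 3))  # 1 + 0.9*0 + 0.81*2 + 0.729*10 = 9.91
```

With a single reward (n = 1) the function returns R + γQ(S', A'), the ordinary one-step TD target.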

### Function Approximation

**Problem**: Large state spaces make tabular methods infeasible

**Solution**: Approximate Q(s,a) with function approximator:
- Linear functions: Q(s,a) = θ^T φ(s,a)
- Neural networks: Deep Q-Networks (DQN)

**Challenges**:
- Training can become unstable or diverge when bootstrapping, off-policy updates, and function approximation are combined
- Requires careful hyperparameter tuning
- May not converge to optimal solution
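
The linear case Q(s,a) = θ^T φ(s,a) can be sketched with a semi-gradient Q-Learning update. This is an illustration under stated assumptions: integer-coded states and actions, and one-hot features (which make the linear model equivalent to a Q-table) as a stand-in for real features. The function names are hypothetical.

```python
import numpy as np

def features(state, action, num_states, num_actions):
    """One-hot φ(s,a); a placeholder for hand-crafted or learned features."""
    phi = np.zeros(num_states * num_actions)
    phi[state * num_actions + action] = 1.0
    return phi

def semi_gradient_q_update(theta, s, a, r, s2, done, num_states, num_actions,
                           alpha=0.1, gamma=0.9):
    """θ <- θ + α [R + γ max_a' θᵀφ(s',a') - θᵀφ(s,a)] φ(s,a)."""
    phi = features(s, a, num_states, num_actions)
    q_sa = theta @ phi
    if done:
        target = r
    else:
        target = r + gamma * max(
            theta @ features(s2, b, num_states, num_actions)
            for b in range(num_actions))
    theta += alpha * (target - q_sa) * phi   # in-place update of the weights
    return theta

theta = np.zeros(2 * 2)                       # 2 states x 2 actions
theta = semi_gradient_q_update(theta, s=0, a=1, r=1.0, s2=1, done=True,
                               num_states=2, num_actions=2)
print(theta)  # only the weight for (s=0, a=1) moves, to 0.1
```

The update is called "semi-gradient" because the target is treated as a constant: only θ^T φ(s,a) is differentiated, not the bootstrapped term.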

### Applications and Extensions

#### 1. Game Playing
- **Atari Games**: DQN and variants
- **Board Games**: AlphaGo, AlphaZero
- **Real-time Strategy**: StarCraft II

#### 2. Robotics
- **Navigation**: Path planning with obstacles
- **Manipulation**: Grasping and object manipulation
- **Control**: Drone flight, walking robots

#### 3. Finance and Trading
- **Portfolio Management**: Asset allocation
- **Algorithmic Trading**: Buy/sell decisions
- **Risk Management**: Dynamic hedging

#### 4. Resource Management
- **Cloud Computing**: Server allocation
- **Energy Systems**: Grid management
- **Transportation**: Traffic optimization

### Recent Developments

#### Deep Reinforcement Learning
- **DQN**: Deep Q-Networks with experience replay
- **DDQN**: Double Deep Q-Networks
- **Dueling DQN**: Separate value and advantage streams
- **Rainbow**: Combination of multiple improvements

#### Policy Gradient Methods
- **REINFORCE**: Basic policy gradient
- **Actor-Critic**: Combined value and policy learning
- **PPO**: Proximal Policy Optimization
- **SAC**: Soft Actor-Critic

#### Model-Based RL
- **Dyna-Q**: Learning with simulated experience
- **MCTS**: Monte Carlo Tree Search
- **Model-Predictive Control**: Planning with learned models

In [None]:
print("=" * 80)
print("SESSION 3 SUMMARY: TEMPORAL DIFFERENCE LEARNING")
print("=" * 80)

def print_session_summary():
    """Print comprehensive session summary"""
    
    summary_points = {
        "Core Concepts Learned": [
            "Temporal Difference Learning: Bootstrap from current estimates",
            "Q-Learning: Off-policy control for optimal policies",
            "SARSA: On-policy control for behavior policies",
            "Exploration strategies: ε-greedy, Boltzmann, decay schedules",
            "Model-free learning: No environment model required"
        ],
        
        "Mathematical Foundations": [
            "TD(0): V(s) ← V(s) + α[R + γV(s') - V(s)]",
            "Q-Learning: Q(s,a) ← Q(s,a) + α[R + γmax_a'Q(s',a') - Q(s,a)]",
            "SARSA: Q(s,a) ← Q(s,a) + α[R + γQ(s',a') - Q(s,a)]",
            "TD Error: R + γV(s') - V(s) quantifies prediction error",
            "Convergence conditions: Infinite exploration + learning rate conditions"
        ],
        
        "Algorithm Comparisons": [
            "TD(0): Policy evaluation, foundation for control methods",
            "Q-Learning: Learns optimal policy, aggressive, off-policy",
            "SARSA: Learns current policy, conservative, on-policy",
            "Exploration: Critical for discovering good policies",
            "Sample efficiency: All methods learn from individual transitions"
        ],
        
        "Practical Insights": [
            "Learning rate α controls update step size",
            "Discount factor γ balances immediate vs future rewards", 
            "Exploration rate ε balances exploration vs exploitation",
            "Decaying exploration: High initial exploration, reduce over time",
            "Environment characteristics determine best algorithm choice"
        ],
        
        "Implementation Skills": [
            "Q-table implementation for discrete state-action spaces",
            "ε-greedy exploration strategy implementation",
            "Learning curve analysis and performance evaluation",
            "Hyperparameter tuning for learning rate and exploration",
            "Comparative analysis between different algorithms"
        ]
    }
    
    for category, points in summary_points.items():
        print(f"\n{category}:")
        print("-" * len(category))
        for i, point in enumerate(points, 1):
            print(f"{i}. {point}")
    
    print("\n" + "=" * 80)
    print("ALGORITHM SELECTION GUIDE")
    print("=" * 80)
    
    selection_guide = {
        "Use TD(0) when": [
            "You need to evaluate a specific policy",
            "Building foundation for control algorithms",
            "Understanding temporal difference principles"
        ],
        
        "Use Q-Learning when": [
            "You want optimal performance",
            "Environment allows aggressive exploration",
            "Off-policy learning is acceptable",
            "Sample efficiency is important"
        ],
        
        "Use SARSA when": [
            "Safety is a primary concern", 
            "Environment has dangerous states",
            "You want conservative behavior",
            "On-policy learning is required"
        ]
    }
    
    for when, reasons in selection_guide.items():
        print(f"\n{when}:")
        for reason in reasons:
            print(f"  • {reason}")

print_session_summary()

print("\n" + "=" * 80)
print("FINAL PERFORMANCE SUMMARY")
print("=" * 80)

# Look up results from earlier cells; a skipped cell leaves a placeholder string instead
q_learning_result = globals().get('evaluation', "Not evaluated")
sarsa_result = globals().get('sarsa_evaluation', "Not evaluated")

final_comparison = {
    "Q-Learning": {
        "Type": "Off-policy Control",
        "Performance": q_learning_result,
        "Convergence": "Learns optimal action values directly",
        "Exploration": "ε-greedy with decay"
    },
    "SARSA": {
        "Type": "On-policy Control",
        "Performance": sarsa_result,
        "Convergence": "Learns values of the ε-greedy policy it follows; more conservative",
        "Exploration": "ε-greedy with decay"
    }
}

print("Algorithm Performance Summary:")
for algo, details in final_comparison.items():
    print(f"\n{algo}:")
    for key, value in details.items():
        if isinstance(value, dict) and 'avg_reward' in value:
            print(f"  {key}: Avg Reward = {value['avg_reward']:.2f}")
        else:
            print(f"  {key}: {value}")

# Remind the reader if any evaluation result is missing
if any(isinstance(result, str) for result in (q_learning_result, sarsa_result)):
    print("\nRun all algorithm implementations to see performance comparison")

print("\n" + "=" * 80)
print("NEXT STEPS AND ADVANCED TOPICS")
print("=" * 80)

next_steps = [
    "Deep Q-Networks (DQN) for large state spaces",
    "Policy Gradient methods (REINFORCE, Actor-Critic)",
    "Advanced exploration (UCB, Thompson Sampling)",
    "Multi-agent reinforcement learning",
    "Continuous action spaces and control",
    "Model-based reinforcement learning",
    "Real-world applications and deployment"
]

print("Recommended next learning topics:")
for i, topic in enumerate(next_steps, 1):
    print(f"{i}. {topic}")

print("\n" + "=" * 80)
print("CONGRATULATIONS!")
print("You have completed a comprehensive study of Temporal Difference Learning")
print("Key achievements:")
print("✓ Implemented TD(0) for policy evaluation") 
print("✓ Built Q-Learning agent from scratch")
print("✓ Implemented SARSA for on-policy control")
print("✓ Explored different exploration strategies")
print("✓ Conducted comparative algorithm analysis")
print("✓ Developed an understanding of model-free reinforcement learning")
print("=" * 80)
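
The algorithm selection guide above comes down to which value each method bootstraps from. As a minimal sketch (the Q-table layout, the helper names `q_learning_target` and `sarsa_target`, and all numbers are illustrative assumptions, not the agents built earlier in this session), the two TD targets differ in a single term:

```python
import numpy as np

def q_learning_target(Q, reward, next_state, gamma, done):
    """Off-policy target: bootstrap from the greedy action in next_state."""
    if done:
        return reward
    return reward + gamma * np.max(Q[next_state])

def sarsa_target(Q, reward, next_state, next_action, gamma, done):
    """On-policy target: bootstrap from the action actually taken next."""
    if done:
        return reward
    return reward + gamma * Q[next_state, next_action]

# Hypothetical 2-state, 2-action table: in state 1, action 0 is costly and action 1 is good
Q = np.array([[0.0, 0.0],
              [-10.0, 5.0]])
reward, gamma = -1.0, 0.9

# Suppose exploration picked the costly action 0 in the next state
q_target = q_learning_target(Q, reward, next_state=1, gamma=gamma, done=False)
s_target = sarsa_target(Q, reward, next_state=1, next_action=0, gamma=gamma, done=False)
print("Q-Learning target:", q_target)  # -1 + 0.9 * 5
print("SARSA target:", s_target)       # -1 + 0.9 * (-10)
```

Q-Learning's target ignores the exploratory slip, while SARSA's target reflects it. That is why SARSA tends to learn more cautious behaviour near dangerous states.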


In [None]:
print("=" * 80)
print("INTERACTIVE LEARNING EXERCISES")
print("=" * 80)

def self_check_questions():
    """Self-assessment questions for TD learning concepts"""
    
    questions = [
        {
            "question": "What is the main advantage of TD learning over Monte Carlo methods?",
            "options": [
                "A) TD learning requires complete episodes",
                "B) TD learning can learn online from incomplete episodes", 
                "C) TD learning has no bias",
                "D) TD learning requires a model"
            ],
            "answer": "B",
            "explanation": "TD learning updates after each step using bootstrapped estimates, enabling online learning without waiting for episode completion."
        },
        
        {
            "question": "What is the key difference between Q-Learning and SARSA?",
            "options": [
                "A) Q-Learning uses different learning rates",
                "B) Q-Learning is on-policy, SARSA is off-policy",
                "C) Q-Learning uses max operation, SARSA uses actual next action",
                "D) Q-Learning requires more memory"
            ],
            "answer": "C", 
            "explanation": "Q-Learning uses max_a Q(s',a) (off-policy), while SARSA uses Q(s',a') where a' is the actual next action chosen by the current policy (on-policy)."
        },
        
        {
            "question": "Why is exploration important in reinforcement learning?",
            "options": [
                "A) To make the algorithm run faster",
                "B) To reduce memory requirements", 
                "C) To discover potentially better actions and avoid local optima",
                "D) To reduce the variance of the reward signal"
            ],
            "answer": "C",
            "explanation": "Without exploration, the agent might never discover better actions and could get stuck in suboptimal policies."
        },
        
        {
            "question": "What happens when the learning rate α is too high?",
            "options": [
                "A) Learning becomes too slow",
                "B) The algorithm may not converge and become unstable",
                "C) Memory usage increases",
                "D) Exploration decreases"
            ],
            "answer": "B",
            "explanation": "A high learning rate makes each update overwrite most of the old estimate with a single noisy TD target, so the values keep oscillating instead of settling and learning becomes unstable."
        },
        
        {
            "question": "In what situation would you prefer SARSA over Q-Learning?",
            "options": [
                "A) When you want the fastest convergence",
                "B) When the environment has dangerous states and safety is important",
                "C) When you have unlimited computational resources", 
                "D) When the state space is very large"
            ],
            "answer": "B",
            "explanation": "SARSA is more conservative because it learns the value of the policy it actually follows (exploratory actions included), so it steers away from states where a random action would be costly."
        }
    ]
    
    print("SELF-CHECK QUESTIONS")
    print("-" * 40)
    print("Test your understanding of TD learning concepts:")
    print("(Think about each question, then check the answers below)\n")
    
    for i, q in enumerate(questions, 1):
        print(f"Question {i}: {q['question']}")
        for option in q['options']:
            print(f"  {option}")
        print()
    
    print("=" * 60)
    print("ANSWERS AND EXPLANATIONS")
    print("=" * 60)
    
    for i, q in enumerate(questions, 1):
        print(f"Question {i}: Answer {q['answer']}")
        print(f"Explanation: {q['explanation']}")
        print()

self_check_questions()
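
# --- Illustrative sketch (not part of the original session code) ---
# Question 4 can be checked numerically. Assumption: a single state whose reward is
# noisy around a true value of 1.0, updated with V <- V + alpha * (r - V).
# The helper name and all constants below are hypothetical.
import numpy as np

def _lr_estimate_spread(alpha, n_steps=5000, seed=0):
    """Return the std of the running estimate over the second half of the run."""
    rng = np.random.default_rng(seed)
    value = 0.0
    history = []
    for _ in range(n_steps):
        reward = 1.0 + rng.normal(0.0, 1.0)   # noisy sample around the true value
        value += alpha * (reward - value)      # constant step-size update
        history.append(value)
    return float(np.std(history[n_steps // 2:]))

_lr_spread_high = _lr_estimate_spread(alpha=0.9)
_lr_spread_low = _lr_estimate_spread(alpha=0.1)
# A larger spread means the estimate keeps jumping around instead of settling
print(f"Estimate spread with alpha=0.9: {_lr_spread_high:.3f}")
print(f"Estimate spread with alpha=0.1: {_lr_spread_low:.3f}")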

print("=" * 80)
print("HANDS-ON CHALLENGES")
print("=" * 80)

challenges = {
    "Challenge 1: Parameter Sensitivity Analysis": {
        "description": "Investigate how different hyperparameters affect learning",
        "tasks": [
            "Test learning rates: α ∈ {0.01, 0.1, 0.3, 0.5, 0.9}",
            "Test discount factors: γ ∈ {0.5, 0.7, 0.9, 0.95, 0.99}",
            "Test exploration rates: ε ∈ {0.01, 0.1, 0.3, 0.5}",
            "Plot learning curves for each parameter setting",
            "Identify optimal parameter combinations"
        ]
    },
    
    "Challenge 2: Environment Modifications": {
        "description": "Test algorithms on modified environments",
        "tasks": [
            "Create larger grid (6x6, 8x8)",
            "Add more obstacles in different patterns",
            "Implement stochastic transitions (wind effects)",
            "Create multiple goals with different rewards",
            "Compare algorithm performance across environments"
        ]
    },
    
    "Challenge 3: Advanced Exploration": {
        "description": "Implement and compare advanced exploration strategies",
        "tasks": [
            "Implement UCB (Upper Confidence Bound) exploration",
            "Implement optimistic initialization", 
            "Implement curiosity-driven exploration",
            "Compare convergence speed and final performance",
            "Analyze exploration efficiency in different environments"
        ]
    },
    
    "Challenge 4: Algorithm Extensions": {
        "description": "Implement extensions and variants",
        "tasks": [
            "Implement Double Q-Learning to reduce maximization bias",
            "Implement Expected SARSA",
            "Implement n-step Q-Learning",
            "Add experience replay buffer",
            "Compare performance with basic algorithms"
        ]
    },
    
    "Challenge 5: Real-World Application": {
        "description": "Apply TD learning to a practical problem",
        "tasks": [
            "Design a simple inventory management problem",
            "Implement a basic trading strategy simulation", 
            "Create a path planning scenario with dynamic obstacles",
            "Apply Q-Learning or SARSA to solve the problem",
            "Analyze and visualize the learned policies"
        ]
    }
}

for challenge_name, details in challenges.items():
    print(f"{challenge_name}:")
    print(f"Description: {details['description']}")
    print("Tasks:")
    for i, task in enumerate(details['tasks'], 1):
        print(f"  {i}. {task}")
    print()
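
# --- Possible starting point for Challenge 4 (not part of the original session code) ---
# Assumptions: a tabular Q stored as a NumPy array of shape (n_states, n_actions), an
# epsilon-greedy behaviour policy, and non-terminal transitions only.
# All function and variable names below are hypothetical.
import numpy as np

def _c4_expected_sarsa_target(Q, reward, next_state, gamma, epsilon):
    """Expected SARSA: average Q(s', .) under the epsilon-greedy policy."""
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)      # exploration mass
    probs[np.argmax(Q[next_state])] += 1.0 - epsilon     # greedy mass
    return reward + gamma * float(np.dot(probs, Q[next_state]))

def _c4_double_q_update(Q_a, Q_b, state, action, reward, next_state, alpha, gamma, rng):
    """Double Q-Learning: one table selects the action, the other evaluates it."""
    if rng.random() < 0.5:
        Q_a, Q_b = Q_b, Q_a                               # swap roles half of the time
    best = np.argmax(Q_a[next_state])                     # select with the table being updated
    target = reward + gamma * Q_b[next_state, best]       # evaluate with the other table
    Q_a[state, action] += alpha * (target - Q_a[state, action])

_c4_Q = np.array([[0.0, 0.0], [1.0, 3.0]])
_c4_target = _c4_expected_sarsa_target(_c4_Q, reward=0.0, next_state=1, gamma=1.0, epsilon=0.2)
print(f"Expected SARSA target: {_c4_target:.2f}")   # 0.1 * 1.0 + 0.9 * 3.0

_c4_Q1, _c4_Q2 = _c4_Q.copy(), _c4_Q.copy()
_c4_double_q_update(_c4_Q1, _c4_Q2, state=0, action=0, reward=0.0, next_state=1,
                    alpha=0.5, gamma=1.0, rng=np.random.default_rng(0))
# Exactly one of the two tables is updated per step
print(f"Q1[0,0]={_c4_Q1[0, 0]:.2f}, Q2[0,0]={_c4_Q2[0, 0]:.2f}")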

print("=" * 80)
print("DEBUGGING AND TROUBLESHOOTING GUIDE") 
print("=" * 80)

debugging_tips = [
    "Learning not converging? Try reducing learning rate (α)",
    "Convergence too slow? Check if exploration rate is too high",
    "Poor final performance? Increase exploration during training",
    "Unstable learning? Check for implementation bugs in TD updates",
    "Agent taking random actions? Verify ε-greedy implementation",
    "Q-values exploding? Check γ < 1 and terminal-state handling, then reduce learning rate",
    "Not reaching goal? Check environment transition logic",
    "Identical performance across runs? Verify random seed handling"
]

print("Common issues and solutions:")
for i, tip in enumerate(debugging_tips, 1):
    print(f"{i}. {tip}")
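
# --- Illustrative check for tips 5 and 8 (not part of the original session code) ---
# Assumption: a standalone epsilon-greedy function; the agents in this notebook may
# implement action selection differently. All names below are hypothetical.
import numpy as np

def _dbg_epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

_dbg_rng = np.random.default_rng(42)             # fixed seed makes the run reproducible
_dbg_q = np.array([0.1, 0.5, 0.2, 0.0])          # action 1 is greedy
_dbg_eps = 0.2
_dbg_picks = [_dbg_epsilon_greedy(_dbg_q, _dbg_eps, _dbg_rng) for _ in range(20000)]
_dbg_greedy_rate = float(np.mean(np.array(_dbg_picks) == 1))
# The greedy action is also chosen by chance during exploration: 1 - eps + eps / n_actions
_dbg_expected = 1 - _dbg_eps + _dbg_eps / len(_dbg_q)
print(f"Greedy-action rate: {_dbg_greedy_rate:.3f} (expected about {_dbg_expected:.3f})")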

print("\n" + "=" * 80)
print("FINAL THOUGHTS")
print("=" * 80)
print("Temporal Difference learning combines the bootstrapping of dynamic")
print("programming with the sample-based, model-free updates of Monte Carlo methods.")
print("")
print("Key insights from this session:")
print("• TD learning enables online learning from experience")
print("• Exploration is crucial for discovering optimal policies") 
print("• Algorithm choice depends on problem characteristics")
print("• Hyperparameter tuning significantly affects performance")
print("• TD methods form the foundation of modern RL algorithms")
print("")
print("You are now ready to explore deep reinforcement learning,")
print("policy gradient methods, and advanced RL applications!")
print("=" * 80)
