# Computer Assignment 13: Advanced Model-Based RL and World Models

## Course Information
- **Course**: Deep Reinforcement Learning (DRL)
- **Instructor**: Dr. [Instructor Name]
- **Institution**: Sharif University of Technology
- **Semester**: Fall 2024
- **Assignment Number**: CA13

## Learning Objectives

By completing this assignment, students will be able to:

1. **Understand Model-Based vs Model-Free RL Trade-offs**: Analyze the fundamental differences between model-free and model-based reinforcement learning approaches, including their respective advantages, limitations, and appropriate use cases.

2. **Master World Model Architectures**: Design and implement variational world models using VAEs for learning compact latent representations of environment dynamics, including encoder-decoder architectures and stochastic dynamics modeling.

3. **Implement Imagination-Based Learning**: Develop agents that leverage learned world models for planning and decision-making in latent space, enabling sample-efficient learning through imagined trajectories.

4. **Apply Sample Efficiency Techniques**: Utilize advanced techniques such as prioritized experience replay, data augmentation, and auxiliary tasks to improve learning efficiency in deep RL.

5. **Design Transfer Learning Systems**: Build agents capable of transferring knowledge across related tasks through shared representations, fine-tuning, and meta-learning approaches.

6. **Develop Hierarchical RL Frameworks**: Implement hierarchical decision-making systems using the options framework, enabling temporal abstraction and skill composition for solving complex tasks.

## Prerequisites

Before starting this assignment, ensure you have:

- **Mathematical Background**: 
  - Probability theory and stochastic processes
  - Linear algebra and matrix operations
  - Optimization and gradient-based methods
  - Information theory (KL divergence, entropy)

- **Technical Skills**:
  - Python programming with PyTorch
  - Deep learning fundamentals (neural networks, autoencoders)
  - Basic reinforcement learning concepts (MDPs, value functions, policies)
  - Experience with Gymnasium environments

- **Prior Knowledge**:
  - Completion of CA1-CA12 assignments
  - Understanding of model-free RL algorithms (DQN, policy gradients)
  - Familiarity with neural network architectures

## Roadmap

This assignment is structured as follows:

### Section 1: Model-Free vs Model-Based Reinforcement Learning
- Theoretical foundations of model-free and model-based approaches
- Mathematical formulations and trade-off analysis
- Hybrid algorithms combining both paradigms
- Practical implementation and comparison

### Section 2: World Models and Imagination-Based Learning
- Variational autoencoders for world modeling
- Stochastic dynamics prediction in latent space
- Imagination-based planning and policy optimization
- Dreamer algorithm and modern variants

### Section 3: Sample Efficiency and Transfer Learning
- Prioritized experience replay and data augmentation
- Auxiliary tasks for improved learning
- Transfer learning techniques and meta-learning
- Domain adaptation and curriculum learning

### Section 4: Hierarchical Reinforcement Learning
- Options framework and temporal abstraction
- Hierarchical policy architectures
- Skill discovery and composition
- Applications to complex task domains

## Project Structure

```
CA13/
├── CA13.ipynb                   # Main assignment notebook
├── agents/                      # RL agent implementations
│   ├── model_free_agent.py      # Model-free RL agents
│   ├── model_based_agent.py     # Model-based RL agents
│   ├── world_model_agent.py     # World model-based agents
│   └── hierarchical_agent.py    # Hierarchical RL agents
├── models/                      # Neural network architectures
│   ├── world_model.py           # VAE-based world models
│   ├── dynamics_model.py        # Environment dynamics models
│   └── policy_networks.py       # Hierarchical policy networks
├── environments/                # Custom environments
│   ├── wrappers.py              # Environment wrappers
│   └── complex_tasks.py         # Complex task environments
├── experiments/                 # Training and evaluation scripts
│   ├── train_world_model.py     # World model training
│   ├── compare_efficiency.py    # Sample efficiency comparison
│   └── transfer_learning.py     # Transfer learning experiments
└── utils/                       # Utility functions
    ├── visualization.py         # Plotting and analysis tools
    ├── data_augmentation.py     # Data augmentation utilities
    └── evaluation.py            # Performance evaluation metrics
```

## Contents Overview

### Theoretical Foundations
- **Model-Based RL Mathematics**: Transition and reward model learning, planning algorithms
- **World Model Theory**: Variational inference, latent space dynamics, imagination-based learning
- **Sample Efficiency**: Experience replay, prioritization, auxiliary learning objectives
- **Transfer Learning**: Representation learning, fine-tuning, meta-learning algorithms

### Implementation Components
- **VAE World Models**: Encoder-decoder architectures with stochastic latent variables
- **Imagination-Based Agents**: Planning in learned latent space using world models
- **Sample-Efficient Algorithms**: Prioritized replay, data augmentation, auxiliary tasks
- **Transfer Learning Systems**: Multi-task learning, fine-tuning, domain adaptation

### Advanced Topics
- **Hierarchical RL**: Options framework, skill hierarchies, temporal abstraction
- **Meta-Learning**: Few-shot adaptation, gradient-based meta-learning
- **Curriculum Learning**: Automatic difficulty progression, teacher-student frameworks

## Evaluation Criteria

Your implementation will be evaluated based on:

1. **Correctness (40%)**: Accurate implementation of algorithms and mathematical formulations
2. **Efficiency (25%)**: Sample efficiency improvements and computational performance
3. **Innovation (20%)**: Creative extensions and novel approaches to the problems
4. **Analysis (15%)**: Quality of experimental analysis and insights

## Getting Started

1. **Environment Setup**: Ensure all dependencies are installed
2. **Code Review**: Understand the provided base implementations
3. **Incremental Development**: Start with simpler components and build complexity
4. **Testing**: Validate each component before integration
5. **Experimentation**: Run comprehensive experiments and analyze results

## Expected Outcomes

By the end of this assignment, you will have:

- **Comprehensive Understanding**: Deep knowledge of advanced model-based RL techniques
- **Practical Skills**: Ability to implement complex RL systems from scratch
- **Research Perspective**: Insight into current challenges and future directions
- **Portfolio Piece**: High-quality implementation demonstrating advanced RL capabilities

---

**Note**: This assignment represents the culmination of the Deep RL course, integrating concepts from model-free and model-based learning, advanced architectures, and practical deployment considerations. Focus on understanding the theoretical foundations while developing robust, efficient implementations.

Let's begin our exploration of advanced model-based reinforcement learning and world models! 🚀

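In [None]:

import copy
import random
from collections import defaultdict, deque

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class ModelFreeAgent: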
    """Base class for model-free RL agents."""
    
    def __init__(self, state_dim, action_dim, lr=1e-3):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.lr = lr
        self.replay_buffer = ReplayBuffer(10000)
        
    def act(self, state, epsilon=0.1):
        """Select action using epsilon-greedy policy."""
        if np.random.random() < epsilon:
            return np.random.randint(self.action_dim)
        return self.get_best_action(state)
    
    def get_best_action(self, state):
        """Get best action according to current policy."""
        raise NotImplementedError
    
    def update(self, batch):
        """Update agent from batch of experiences."""
        raise NotImplementedError

class DQNAgent(ModelFreeAgent):
    """Deep Q-Network agent (model-free)."""
    
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99):
        super().__init__(state_dim, action_dim, lr)
        self.gamma = gamma
        
        self.q_network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )
        
        self.target_network = copy.deepcopy(self.q_network)
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)
        
        self.update_count = 0
        self.losses = []
        
    def get_best_action(self, state):
        """Get action with highest Q-value."""
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            q_values = self.q_network(state_tensor)
            return q_values.argmax().item()
    
    def update(self, batch):
        """Update Q-network using DQN loss."""
        states, actions, rewards, next_states, dones = batch
        
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        dones = torch.BoolTensor(dones)
        
        current_q = self.q_network(states).gather(1, actions.unsqueeze(1))
        
        with torch.no_grad():
            next_q = self.target_network(next_states).max(1)[0]
            target_q = rewards + (self.gamma * next_q * (~dones))
        
        loss = F.mse_loss(current_q.squeeze(), target_q)
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        self.losses.append(loss.item())
        self.update_count += 1
        
        if self.update_count % 100 == 0:
            self.target_network.load_state_dict(self.q_network.state_dict())
        
        return loss.item()

class ModelBasedAgent:
    """Model-based RL agent using learned dynamics."""
    
    def __init__(self, state_dim, action_dim, lr=1e-3, planning_horizon=5):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.lr = lr
        self.planning_horizon = planning_horizon
        
        self.dynamics_model = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, state_dim)
        )
        
        self.reward_model = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )
        
        self.value_network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )
        
        self.dynamics_optimizer = optim.Adam(self.dynamics_model.parameters(), lr=lr)
        self.reward_optimizer = optim.Adam(self.reward_model.parameters(), lr=lr)
        self.value_optimizer = optim.Adam(self.value_network.parameters(), lr=lr)
        
        self.model_buffer = ReplayBuffer(10000)
        self.planning_buffer = ReplayBuffer(5000)
        
        self.model_losses = []
        self.value_losses = []
        
    def act(self, state, epsilon=0.1):
        """Select action using model-based planning."""
        if np.random.random() < epsilon:
            return np.random.randint(self.action_dim)
        
        return self.plan_action(state)
    
    def plan_action(self, state):
        """Plan best action using learned model."""
        best_action = 0
        best_value = float('-inf')
        
        for action in range(self.action_dim):
            value = self.simulate_trajectory(state, action)
            if value > best_value:
                best_value = value
                best_action = action
        
        return best_action
    
    def simulate_trajectory(self, initial_state, initial_action):
        """Simulate trajectory using learned model."""
        state = torch.FloatTensor(initial_state)
        total_reward = 0.0
        gamma = 0.99
        
        for step in range(self.planning_horizon):
            if step == 0:
                action = initial_action
            else:
                action = self.get_greedy_action(state)
            
            # Encode the action as a 1-D one-hot vector so it concatenates with the 1-D state.
            action_one_hot = F.one_hot(torch.tensor(action), self.action_dim).float()
            
            model_input = torch.cat([state, action_one_hot], dim=-1)
            
            with torch.no_grad():
                next_state = self.dynamics_model(model_input)
                reward = self.reward_model(model_input).item()
            
            total_reward += (gamma ** step) * reward
            state = next_state
        
        with torch.no_grad():
            terminal_value = self.value_network(state).item()
            total_reward += (gamma ** self.planning_horizon) * terminal_value
        
        return total_reward
    
    def get_greedy_action(self, state):
        """Get the action with the highest predicted immediate reward (a one-step greedy heuristic)."""
        best_action = 0
        best_q = float('-inf')
        
        for action in range(self.action_dim):
            # Encode the action as a 1-D one-hot vector so it concatenates with the 1-D state.
            action_one_hot = F.one_hot(torch.tensor(action), self.action_dim).float()
            model_input = torch.cat([state, action_one_hot], dim=-1)
            
            with torch.no_grad():
                q_value = self.reward_model(model_input).item()
            
            if q_value > best_q:
                best_q = q_value
                best_action = action
        
        return best_action
    
    def update_model(self, batch):
        """Update dynamics and reward models."""
        states, actions, rewards, next_states, dones = batch
        
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        
        actions_one_hot = F.one_hot(actions, self.action_dim).float()
        model_input = torch.cat([states, actions_one_hot], dim=-1)
        
        pred_next_states = self.dynamics_model(model_input)
        dynamics_loss = F.mse_loss(pred_next_states, next_states)
        
        self.dynamics_optimizer.zero_grad()
        dynamics_loss.backward()
        self.dynamics_optimizer.step()
        
        pred_rewards = self.reward_model(model_input).squeeze()
        reward_loss = F.mse_loss(pred_rewards, rewards)
        
        self.reward_optimizer.zero_grad()
        reward_loss.backward()
        self.reward_optimizer.step()
        
        total_model_loss = dynamics_loss.item() + reward_loss.item()
        self.model_losses.append(total_model_loss)
        
        return total_model_loss
    
    def update_value_function(self, batch):
        """Update value function using temporal difference learning."""
        states, actions, rewards, next_states, dones = batch
        
        states = torch.FloatTensor(states)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        dones = torch.BoolTensor(dones)
        
        current_values = self.value_network(states).squeeze()
        
        with torch.no_grad():
            next_values = self.value_network(next_states).squeeze()
            targets = rewards + 0.99 * next_values * (~dones)
        
        value_loss = F.mse_loss(current_values, targets)
        
        self.value_optimizer.zero_grad()
        value_loss.backward()
        self.value_optimizer.step()
        
        self.value_losses.append(value_loss.item())
        
        return value_loss.item()

class HybridDynaAgent:
    """Dyna-Q style hybrid agent combining model-free and model-based learning."""
    
    def __init__(self, state_dim, action_dim, lr=1e-3, planning_steps=5):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.planning_steps = planning_steps
        
        self.q_table = defaultdict(lambda: np.zeros(action_dim))
        self.lr = lr
        self.gamma = 0.99
        
        self.model = {}  # (s,a) -> (r, s', done)
        self.visited_states = set()       
        self.experience_buffer = deque(maxlen=10000)
        
    def act(self, state, epsilon=0.1):
        """Epsilon-greedy action selection."""
        if np.random.random() < epsilon:
            return np.random.randint(self.action_dim)
        
        state_key = tuple(state) if isinstance(state, np.ndarray) else state
        return np.argmax(self.q_table[state_key])
    
    def update(self, state, action, reward, next_state, done):
        """Dyna-Q update: direct RL + model learning + planning."""
        state_key = tuple(state) if isinstance(state, np.ndarray) else state
        next_state_key = tuple(next_state) if isinstance(next_state, np.ndarray) else next_state
        
        if done:
            target = reward
        else:
            target = reward + self.gamma * np.max(self.q_table[next_state_key])
        
        self.q_table[state_key][action] += self.lr * (target - self.q_table[state_key][action])
        
        self.model[(state_key, action)] = (reward, next_state_key, done)
        self.visited_states.add(state_key)
        self.experience_buffer.append((state_key, action, reward, next_state_key, done))
        
        self.planning_updates()
    
    def planning_updates(self):
        """Perform planning updates with experience simulated from the learned tabular model."""
        if not self.model:
            return
        
        # The model is a deterministic lookup table: sample a previously seen
        # (state, action) pair and query its stored outcome, as in Dyna-Q.
        model_keys = list(self.model.keys())
        for _ in range(self.planning_steps):
            state_key, action = random.choice(model_keys)
            reward, next_state_key, done = self.model[(state_key, action)]
            
            if done:
                target = reward
            else:
                target = reward + self.gamma * np.max(self.q_table[next_state_key])
            
            self.q_table[state_key][action] += self.lr * (target - self.q_table[state_key][action])

class ReplayBuffer:
    """Experience replay buffer for storing transitions."""
    
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
    
    def __len__(self):
        return len(self.buffer)

class SimpleGridWorld:
    """Deterministic grid world: start at the top-left corner, reach the bottom-right goal."""
    
    def __init__(self, size=5):
        self.size = size
        self.state = [0, 0]  # [row, col]
        self.goal = [size-1, size-1]
        self.action_space = 4  # up, right, down, left
        
    def reset(self):
        self.state = [0, 0]
        return np.array(self.state, dtype=np.float32)
    
    def step(self, action):
        moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]
        
        new_row = max(0, min(self.size-1, self.state[0] + moves[action][0]))
        new_col = max(0, min(self.size-1, self.state[1] + moves[action][1]))
        
        self.state = [new_row, new_col]
        
        if self.state == self.goal:
            reward = 10.0
            done = True
        else:
            reward = -0.1  # Small negative reward for each step
            done = False
        
        return np.array(self.state, dtype=np.float32), reward, done, {}

def compare_agents_performance():
    print("🔬 Comparing Model-Free vs Model-Based RL Performance")
    
    env = SimpleGridWorld(size=5)
    
    model_free_agent = DQNAgent(state_dim=2, action_dim=4, lr=1e-3)
    model_based_agent = ModelBasedAgent(state_dim=2, action_dim=4, lr=1e-3)
    hybrid_agent = HybridDynaAgent(state_dim=2, action_dim=4, lr=0.1)
    
    agents = {
        'Model-Free (DQN)': model_free_agent,
        'Model-Based': model_based_agent,
        'Hybrid (Dyna-Q)': hybrid_agent
    }
    
    results = {name: {'episodes': [], 'rewards': [], 'steps': []} for name in agents.keys()}
    
    num_episodes = 200
    batch_size = 32
    
    for episode in range(num_episodes):
        for agent_name, agent in agents.items():
            state = env.reset()
            episode_reward = 0
            episode_steps = 0
            max_steps = 100
            
            for step in range(max_steps):
                # Linearly decay exploration from 1.0 to 0.1 over the first 90 episodes.
                epsilon = max(0.1, 1.0 - episode / 100)
                action = agent.act(state, epsilon=epsilon)
                
                next_state, reward, done, _ = env.step(action)
                episode_reward += reward
                episode_steps += 1
                
                if agent_name == 'Model-Free (DQN)':
                    agent.replay_buffer.push(state, action, reward, next_state, done)
                    if len(agent.replay_buffer) > batch_size:
                        batch = agent.replay_buffer.sample(batch_size)
                        agent.update(batch)
                
                elif agent_name == 'Model-Based':
                    agent.model_buffer.push(state, action, reward, next_state, done)
                    if len(agent.model_buffer) > batch_size:
                        batch = agent.model_buffer.sample(batch_size)
                        agent.update_model(batch)
                        agent.update_value_function(batch)
                
                elif agent_name == 'Hybrid (Dyna-Q)':
                    agent.update(state, action, reward, next_state, done)
                
                if done:
                    break
                
                state = next_state
            
            results[agent_name]['episodes'].append(episode)
            results[agent_name]['rewards'].append(episode_reward)
            results[agent_name]['steps'].append(episode_steps)
        
        if episode % 50 == 0:
            print(f"Episode {episode}:")
            for agent_name in agents.keys():
                recent_reward = np.mean(results[agent_name]['rewards'][-10:]) if len(results[agent_name]['rewards']) >= 10 else 0
                print(f"  {agent_name}: {recent_reward:.2f} avg reward")
    
    return results

def visualize_comparison(results):
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    for agent_name, data in results.items():
        window_size = 20
        if len(data['rewards']) >= window_size:
            smoothed_rewards = pd.Series(data['rewards']).rolling(window_size).mean()
            axes[0,0].plot(data['episodes'], smoothed_rewards, label=agent_name, linewidth=2)
    
    axes[0,0].set_title('Learning Curves (Smoothed Rewards)')
    axes[0,0].set_xlabel('Episode')
    axes[0,0].set_ylabel('Episode Reward')
    axes[0,0].legend()
    axes[0,0].grid(True, alpha=0.3)
    
    for agent_name, data in results.items():
        cumulative_steps = np.cumsum(data['steps'])
        axes[0,1].plot(cumulative_steps, data['rewards'], label=agent_name, linewidth=2)
    
    axes[0,1].set_title('Sample Efficiency')
    axes[0,1].set_xlabel('Total Environment Steps')
    axes[0,1].set_ylabel('Episode Reward')
    axes[0,1].legend()
    axes[0,1].grid(True, alpha=0.3)
    
    final_performance = {}
    for agent_name, data in results.items():
        final_performance[agent_name] = np.mean(data['rewards'][-20:])  # Last 20 episodes
    
    agent_names = list(final_performance.keys())
    performance_values = list(final_performance.values())
    
    bars = axes[1,0].bar(agent_names, performance_values, color=['skyblue', 'lightcoral', 'lightgreen'])
    axes[1,0].set_title('Final Performance (Last 20 Episodes)')
    axes[1,0].set_ylabel('Average Episode Reward')
    axes[1,0].tick_params(axis='x', rotation=45)
    
    for bar, value in zip(bars, performance_values):
        axes[1,0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
                      f'{value:.2f}', ha='center', va='bottom')
    
    steps_to_completion = {}
    for agent_name, data in results.items():
        steps_to_completion[agent_name] = np.mean(data['steps'][-20:])
    
    agent_names = list(steps_to_completion.keys())
    steps_values = list(steps_to_completion.values())
    
    bars = axes[1,1].bar(agent_names, steps_values, color=['skyblue', 'lightcoral', 'lightgreen'])
    axes[1,1].set_title('Average Steps to Completion')
    axes[1,1].set_ylabel('Steps per Episode')
    axes[1,1].tick_params(axis='x', rotation=45)
    
    for bar, value in zip(bars, steps_values):
        axes[1,1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                      f'{value:.1f}', ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()
    
    return final_performance

print("🚀 Starting Model-Free vs Model-Based RL Comparison!")
comparison_results = compare_agents_performance()
final_performance = visualize_comparison(comparison_results)

print("\n📊 Comparison Results Summary:")
for agent_name, performance in final_performance.items():
    print(f"  {agent_name}: {performance:.2f} average reward")
    
print("\n💡 Key Insights:")
print("  • Model-free methods often achieve higher asymptotic performance")
print("  • Model-based methods typically learn faster initially")
print("  • Hybrid approaches can combine benefits of both")

# Section 2: World Models and Imagination-Based Learning

## 2.1 Theoretical Foundations of World Models

A world model is a learned internal model of environment dynamics that lets an agent "imagine" outcomes and plan without interacting with the environment directly.

### Core Concepts

**World Model Components:**
1. **Representation Learning**: Encode high-dimensional observations into compact latent states
2. **Dynamics Model**: Predict next latent state given current state and action
3. **Reward Model**: Predict rewards in the latent space
4. **Decoder Model**: Reconstruct observations from latent states

**Mathematical Framework:**
- **Encoder**: $z_t = \text{Encode}(o_t)$ maps observation $o_t$ to latent state $z_t$
- **Dynamics**: $z_{t+1} = f(z_t, a_t) + \epsilon_t$ where $\epsilon_t \sim \mathcal{N}(0, \Sigma)$
- **Reward**: $r_t = R(z_t, a_t)$
- **Decoder**: $\hat{o}_t = \text{Decode}(z_t)$
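
To make the four maps concrete, here is a minimal, framework-free sketch. Everything in it is hypothetical: the `ToyWorldModel` class, its hand-written linear maps, the goal at `3.0`, and the 2-D observation with a 1-D latent are illustrative stand-ins for learned networks, not part of the assignment code.

```python
import random


class ToyWorldModel:
    """Hand-written stand-in for a learned world model (hypothetical example)."""

    def __init__(self, noise_std=0.0, goal=3.0, seed=0):
        self.noise_std = noise_std
        self.goal = goal
        self.rng = random.Random(seed)

    def encode(self, obs):
        # z_t = Encode(o_t): compress the 2-D observation into one number.
        return 0.5 * (obs[0] + obs[1])

    def dynamics(self, z, action):
        # z_{t+1} = f(z_t, a_t) + eps_t, with eps_t ~ N(0, noise_std^2).
        return z + action + self.rng.gauss(0.0, self.noise_std)

    def reward(self, z, action):
        # r_t = R(z_t, a_t): negative distance to the goal plus a small action cost.
        return -abs(self.goal - z) - 0.01 * abs(action)

    def decode(self, z):
        # o_hat_t = Decode(z_t): expand the latent back into observation space.
        return (z, z)


def imagine(model, obs, actions):
    """Roll the model forward in latent space without touching an environment."""
    z = model.encode(obs)
    total = 0.0
    for a in actions:
        total += model.reward(z, a)
        z = model.dynamics(z, a)
    return model.decode(z), total


final_obs, ret = imagine(ToyWorldModel(), obs=(0.0, 0.0), actions=[1.0, 1.0, 1.0])
print(final_obs, ret)
```

Only `encode` touches the observation; the rollout itself stays in latent space, which is what makes imagination cheap when observations are high-dimensional.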

## 2.2 Variational World Models

### Variational Autoencoders (VAE) for World Modeling

World models often use VAEs to learn stochastic latent representations:

**Encoder (Recognition Model):**
$$q_\phi(z_t | o_t) = \mathcal{N}(z_t; \mu_\phi(o_t), \sigma_\phi^2(o_t))$$

**Prior (Dynamics Model):**
$$p_\theta(z_{t+1} | z_t, a_t) = \mathcal{N}(z_{t+1}; \mu_\theta(z_t, a_t), \sigma_\theta^2(z_t, a_t))$$

**Decoder (Generative Model):**
$$p_\psi(o_t | z_t) = \mathcal{N}(o_t; \mu_\psi(z_t), \sigma_\psi^2(z_t))$$
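
As a worked special case, assume the decoder variance is fixed to an isotropic constant $\sigma^2$ (an assumption made here for illustration; the formula above allows a learned $\sigma_\psi^2(z_t)$). For an observation of dimension $d$, the decoder log-likelihood then reduces to a scaled squared error plus a constant:

$$\log p_\psi(o_t | z_t) = -\frac{1}{2\sigma^2} \lVert o_t - \mu_\psi(z_t) \rVert^2 - \frac{d}{2} \log(2\pi\sigma^2)$$

Under that assumption, maximizing the reconstruction likelihood is equivalent to minimizing the mean squared reconstruction error.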

**ELBO Objective:**
$$\mathcal{L}_{ELBO} = \mathbb{E}_{q_\phi(z|o)} [\log p_\psi(o|z)] - D_{KL}[q_\phi(z|o) || p(z)]$$
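
The two ELBO terms can be sketched for a single sample as follows. This assumes a standard-normal prior $p(z) = \mathcal{N}(0, I)$, a diagonal Gaussian encoder, and a unit-variance Gaussian decoder; the function names and the `beta` weight are hypothetical and not taken from the assignment code.

```python
import math


def gaussian_kl_to_standard_normal(mu, logvar):
    """KL[ N(mu, diag(exp(logvar))) || N(0, I) ], summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def negative_elbo(obs, recon, mu, logvar, beta=1.0):
    """Negative ELBO for one sample, assuming a unit-variance Gaussian decoder.

    The reconstruction term drops the additive constant of the Gaussian
    log-density, so it is half the squared error. `beta` is a hypothetical
    KL weight; beta = 1 recovers the plain ELBO.
    """
    recon_nll = 0.5 * sum((o - r) ** 2 for o, r in zip(obs, recon))
    return recon_nll + beta * gaussian_kl_to_standard_normal(mu, logvar)


loss = negative_elbo(obs=[1.0, 2.0], recon=[1.0, 1.0], mu=[0.0, 1.0], logvar=[0.0, 0.0])
print(loss)
```

Because the ELBO is maximized, a training loop would minimize this negative value. The KL term is zero exactly when the encoder output matches the prior (`mu = 0`, `logvar = 0`).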

## 2.3 Planning in Learned Latent Space

Once a world model is learned, planning can be performed in the compact latent space:

### Model Predictive Control (MPC) in Latent Space
1. **Imagination Rollout**: Use world model to simulate future trajectories
2. **Action Optimization**: Optimize action sequences to maximize predicted rewards
3. **Execution**: Execute only the first action, then replan

**Planning Objective:**
$$a^*_{1:H} = \arg\max_{a_{1:H}} \mathbb{E}_{z_{1:H} \sim p_\theta} \left[ \sum_{t=1}^H R(z_t, a_t) \right]$$
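
The three MPC steps and the planning objective can be sketched on a toy problem. The 1-D latent state, the action set, and the `toy_dynamics` / `toy_reward` functions are hypothetical stand-ins for a learned model. The planner enumerates every action sequence, which is only feasible for tiny discrete problems; larger problems would typically sample candidate sequences instead.

```python
import itertools

ACTIONS = (-1.0, 0.0, 1.0)  # hypothetical discrete action set
GOAL = 5.0                  # hypothetical goal in the 1-D latent space


def toy_dynamics(z, action):
    """Stand-in for the learned mean dynamics f(z, a)."""
    return z + action


def toy_reward(z, action):
    """Stand-in for the learned reward R(z, a): negative distance to the goal."""
    return -abs(GOAL - z)


def plan_first_action(z0, horizon=5):
    """Score every action sequence of length `horizon`; return the best one's first action."""
    best_return, best_first = float("-inf"), None
    for sequence in itertools.product(ACTIONS, repeat=horizon):
        z, total = z0, 0.0
        for a in sequence:
            total += toy_reward(z, a)  # accumulate R(z_t, a_t)
            z = toy_dynamics(z, a)     # imagined z_{t+1}
        if total > best_return:
            best_return, best_first = total, sequence[0]
    return best_first, best_return


# Closed-loop MPC: execute only the first planned action, then replan.
z = 0.0
for _ in range(7):
    action, predicted_return = plan_first_action(z)
    z = toy_dynamics(z, action)  # in this sketch the "real" system equals the model
print(z)
```

Replanning at every step is what keeps MPC robust: model errors made late in an imagined sequence never get executed, because only the first action is applied.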

### Dreamer Algorithm
Dreamer combines a learned world model with actor-critic learning on imagined latent trajectories:
1. **Collect Experience**: Gather real environment data
2. **Learn World Model**: Train VAE-based world model
3. **Imagine Trajectories**: Generate synthetic experience in latent space  
4. **Learn Behaviors**: Train actor-critic in imagined trajectories
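
For step 4, the critic needs value targets computed along imagined trajectories. One common choice is the λ-return, which blends one-step bootstrapped targets with longer multi-step returns. The sketch below shows that recursion in plain Python. The function name and argument layout are hypothetical, and the exact target used by any particular Dreamer version should be checked against its paper.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute lambda-returns for one imagined trajectory.

    rewards: [r_0, ..., r_{H-1}] predicted by the reward model.
    values:  [v_0, ..., v_H] predicted by the critic (one extra bootstrap value).
    """
    assert len(values) == len(rewards) + 1
    returns = [0.0] * len(rewards)
    next_return = values[-1]  # bootstrap from the critic at the horizon
    for t in reversed(range(len(rewards))):
        # Blend the one-step bootstrap with the longer lambda-return.
        next_return = rewards[t] + gamma * ((1.0 - lam) * values[t + 1] + lam * next_return)
        returns[t] = next_return
    return returns


td_targets = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 10.0], gamma=0.5, lam=0.0)
mc_targets = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 10.0], gamma=0.5, lam=1.0)
print(td_targets, mc_targets)
```

With `lam=0` each target relies only on the next critic value (low variance, more bias from the critic); with `lam=1` it relies on the full imagined reward sequence plus a single bootstrap at the horizon (less critic bias, more exposure to model error).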

## 2.4 Advantages and Challenges

### Advantages of World Models:
- **Sample Efficiency**: Learn from imagined experience
- **Transfer Learning**: Models can generalize across tasks
- **Interpretability**: Learned representations can be visualized
- **Planning**: Enable sophisticated planning algorithms

### Challenges:
- **Model Bias**: Errors compound during long rollouts
- **Representation Learning**: High-dimensional observations are challenging
- **Stochasticity**: Modeling complex stochastic dynamics
- **Computational Cost**: Training and maintaining world models
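
The model-bias point can be illustrated numerically. The sketch below assumes a hypothetical scalar system $x_{t+1} = c \, x_t$ and a model whose coefficient is off by 5%; both numbers are made up for illustration.

```python
def rollout(coef, x0, horizon):
    """Iterate x_{t+1} = coef * x_t and return the visited states."""
    states, x = [], x0
    for _ in range(horizon):
        x = coef * x
        states.append(x)
    return states


true_states = rollout(coef=1.00, x0=1.0, horizon=20)   # hypothetical true dynamics
model_states = rollout(coef=1.05, x0=1.0, horizon=20)  # model with a 5% one-step error
errors = [abs(m - t) for m, t in zip(model_states, true_states)]
print(f"1-step error: {errors[0]:.3f}, 20-step error: {errors[-1]:.3f}")
```

A one-step error that looks negligible grows with every imagined step because each prediction is fed back in as the next input, which is one reason short planning horizons and frequent replanning are common in practice.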

## 2.5 Modern Approaches

### MuZero
Combines tree search with learned models:
- Learns value, policy, and dynamics jointly
- Uses tree search for planning
- Achieves superhuman performance in Go, Chess, and Shogi

### Dreamer V2/V3
Improvements to original Dreamer:
- Better regularization techniques
- Improved world model architectures
- Enhanced policy learning in imagination

### Model-Based Meta-Learning
Using world models for few-shot adaptation:
- Learn generalizable world model components
- Quickly adapt to new environments
- Transfer dynamics knowledge across domains

In [None]:

class VariationalWorldModel(nn.Module):
    """VAE-based world model for learning environment dynamics."""
    
    def __init__(self, obs_dim, action_dim, latent_dim=64, hidden_dim=128):
        super().__init__()
        self.obs_dim = obs_dim
        self.action_dim = action_dim
        self.latent_dim = latent_dim
        self.hidden_dim = hidden_dim
        
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        
        self.encoder_mu = nn.Linear(hidden_dim, latent_dim)
        self.encoder_logvar = nn.Linear(hidden_dim, latent_dim)
        
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        
        self.dynamics_mu = nn.Linear(hidden_dim, latent_dim)
        self.dynamics_logvar = nn.Linear(hidden_dim, latent_dim)
        
        self.reward_model = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, 1)
        )
        
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, obs_dim)
        )
    
    def encode(self, obs):
        """Encode observation to latent distribution parameters."""
        h = self.encoder(obs)
        mu = self.encoder_mu(h)
        logvar = self.encoder_logvar(h)
        return mu, logvar
    
    def reparameterize(self, mu, logvar):
        """Reparameterization trick for VAE."""
        if self.training:
            std = torch.exp(0.5 * logvar)
            eps = torch.randn_like(std)
            return mu + eps * std
        else:
            return mu
    
    def dynamics_forward(self, latent_state, action):
        """Predict next latent state given current state and action."""
        if len(action.shape) == 1:
            action_one_hot = F.one_hot(action.long(), self.action_dim).float()
        else:
            action_one_hot = action
        
        dynamics_input = torch.cat([latent_state, action_one_hot], dim=-1)
        h = self.dynamics(dynamics_input)
        
        mu = self.dynamics_mu(h)
        logvar = self.dynamics_logvar(h)
        
        return mu, logvar
    
    def predict_reward(self, latent_state, action):
        """Predict reward given latent state and action."""
        if len(action.shape) == 1:
            action_one_hot = F.one_hot(action.long(), self.action_dim).float()
        else:
            action_one_hot = action
        
        reward_input = torch.cat([latent_state, action_one_hot], dim=-1)
        return self.reward_model(reward_input)
    
    def decode(self, latent_state):
        """Decode latent state to observation."""
        return self.decoder(latent_state)
    
    def forward(self, obs, action=None):
        """Full forward pass through world model."""
        mu_enc, logvar_enc = self.encode(obs)
        latent_state = self.reparameterize(mu_enc, logvar_enc)
        
        recon_obs = self.decode(latent_state)
        
        results = {
            'latent_state': latent_state,
            'mu_enc': mu_enc,
            'logvar_enc': logvar_enc,
            'recon_obs': recon_obs
        }
        
        if action is not None:
            mu_dyn, logvar_dyn = self.dynamics_forward(latent_state, action)
            next_latent = self.reparameterize(mu_dyn, logvar_dyn)
            pred_reward = self.predict_reward(latent_state, action)
            
            results.update({
                'mu_dyn': mu_dyn,
                'logvar_dyn': logvar_dyn,
                'next_latent': next_latent,
                'pred_reward': pred_reward
            })
        
        return results
    
    def imagine_trajectory(self, initial_obs, actions):
        """Imagine trajectory given initial observation and action sequence."""
        batch_size = initial_obs.shape[0]
        sequence_length = len(actions)
        
        mu_enc, logvar_enc = self.encode(initial_obs)
        current_latent = self.reparameterize(mu_enc, logvar_enc)
        
        trajectory = {
            'latent_states': [current_latent],
            'observations': [self.decode(current_latent)],
            'rewards': [],
            'actions': []
        }
        
        for t in range(sequence_length):
            action = actions[t]
            trajectory['actions'].append(action)
            
            pred_reward = self.predict_reward(current_latent, action)
            trajectory['rewards'].append(pred_reward)
            
            mu_dyn, logvar_dyn = self.dynamics_forward(current_latent, action)
            next_latent = self.reparameterize(mu_dyn, logvar_dyn)
            
            current_latent = next_latent
            trajectory['latent_states'].append(current_latent)
            trajectory['observations'].append(self.decode(current_latent))
        
        return trajectory

class WorldModelLoss:
    """Loss functions for training world models."""
    
    def __init__(self, recon_weight=1.0, kl_weight=1.0, reward_weight=1.0, dynamics_weight=1.0):
        self.recon_weight = recon_weight
        self.kl_weight = kl_weight
        self.reward_weight = reward_weight
        self.dynamics_weight = dynamics_weight
    
    def reconstruction_loss(self, recon_obs, target_obs):
        """Reconstruction loss between predicted and actual observations."""
        return F.mse_loss(recon_obs, target_obs)
    
    def kl_divergence_loss(self, mu, logvar):
        """KL divergence loss for VAE regularization."""
        return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / mu.shape[0]
    
    def reward_loss(self, pred_reward, target_reward):
        """Reward prediction loss."""
        # Squeeze only the trailing feature dimension so a batch of size 1 keeps its batch axis.
        return F.mse_loss(pred_reward.squeeze(-1), target_reward)
    
    def dynamics_loss(self, pred_next_latent, target_next_latent):
        """Dynamics prediction loss in latent space."""
        return F.mse_loss(pred_next_latent, target_next_latent)
    
    def compute_total_loss(self, model_output, target_obs, target_reward=None, target_next_obs=None):
        """Compute total loss for world model training."""
        losses = {}
        
        recon_loss = self.reconstruction_loss(model_output['recon_obs'], target_obs)
        losses['reconstruction'] = recon_loss
        
        kl_loss = self.kl_divergence_loss(model_output['mu_enc'], model_output['logvar_enc'])
        losses['kl_divergence'] = kl_loss
        
        total_loss = self.recon_weight * recon_loss + self.kl_weight * kl_loss
        
        if target_reward is not None and 'pred_reward' in model_output:
            reward_loss = self.reward_loss(model_output['pred_reward'], target_reward)
            losses['reward'] = reward_loss
            total_loss += self.reward_weight * reward_loss
        
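        if target_next_obs is not None and 'mu_dyn' in model_output:
            # The regression target is the encoder mean of the next observation.
            # This class has no access to the encoder, so the caller is expected to
            # supply it as model_output['target_next_latent']. The term is skipped
            # otherwise, because comparing mu_dyn with its own sample carries no signal.
            target_next_latent = model_output.get('target_next_latent')
            if target_next_latent is not None:
                dynamics_loss = self.dynamics_loss(model_output['mu_dyn'], target_next_latent.detach())
                losses['dynamics'] = dynamics_loss
                total_loss += self.dynamics_weight * dynamics_loss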
        
        losses['total'] = total_loss
        return losses

class ImaginationBasedAgent:
    """Agent that uses world model for planning and learning."""
    
    def __init__(self, obs_dim, action_dim, latent_dim=64, planning_horizon=8):
        self.obs_dim = obs_dim
        self.action_dim = action_dim
        self.latent_dim = latent_dim
        self.planning_horizon = planning_horizon
        
        self.world_model = VariationalWorldModel(obs_dim, action_dim, latent_dim)
        self.world_model_optimizer = optim.Adam(self.world_model.parameters(), lr=1e-3)
        self.world_model_loss = WorldModelLoss()
        
        self.policy_net = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, action_dim),
            nn.Softmax(dim=-1)
        )
        
        self.value_net = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )
        
        self.policy_optimizer = optim.Adam(self.policy_net.parameters(), lr=1e-3)
        self.value_optimizer = optim.Adam(self.value_net.parameters(), lr=1e-3)
        
        self.experience_buffer = ReplayBuffer(10000)
        
        self.world_model_losses = []
        self.policy_losses = []
        
    def act(self, obs, epsilon=0.1):
        """Select action using imagination-based planning."""
        if np.random.random() < epsilon:
            return np.random.randint(self.action_dim)
        
        return self.plan_with_world_model(obs)
    
    def plan_with_world_model(self, obs):
        """Plan action using world model imagination."""
        obs_tensor = torch.FloatTensor(obs).unsqueeze(0)
        
        with torch.no_grad():
            mu, logvar = self.world_model.encode(obs_tensor)
            current_latent = self.world_model.reparameterize(mu, logvar)
        
        best_action = 0
        best_value = float('-inf')
        
        for action in range(self.action_dim):
            imagined_value = self.imagine_value(current_latent, action)
            if imagined_value > best_value:
                best_value = imagined_value
                best_action = action
        
        return best_action
    
    def imagine_value(self, initial_latent, initial_action):
        """Estimate value of taking initial action using imagination."""
        current_latent = initial_latent
        total_value = 0.0
        gamma = 0.99
        
        with torch.no_grad():
            for step in range(self.planning_horizon):
                if step == 0:
                    action = initial_action
                else:
                    action_probs = self.policy_net(current_latent)
                    action = action_probs.argmax().item()
                
                action_tensor = torch.tensor([action])
                pred_reward = self.world_model.predict_reward(current_latent, action_tensor)
                total_value += (gamma ** step) * pred_reward.item()
                
                mu_dyn, logvar_dyn = self.world_model.dynamics_forward(current_latent, action_tensor)
                current_latent = self.world_model.reparameterize(mu_dyn, logvar_dyn)
            
            terminal_value = self.value_net(current_latent)
            total_value += (gamma ** self.planning_horizon) * terminal_value.item()
        
        return total_value
    
    def update_world_model(self, batch_size=32):
        """Update world model from experience buffer."""
        if len(self.experience_buffer) < batch_size:
            return None
        
        states, actions, rewards, next_states, dones = self.experience_buffer.sample(batch_size)
        
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        
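        model_output = self.world_model(states, actions)

        # Encoder mean of the next observations, exposed as a fixed target for the dynamics head.
        with torch.no_grad():
            next_mu, _ = self.world_model.encode(next_states)
        model_output['target_next_latent'] = next_mu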
        
        losses = self.world_model_loss.compute_total_loss(
            model_output, states, rewards, next_states
        )
        
        self.world_model_optimizer.zero_grad()
        losses['total'].backward()
        self.world_model_optimizer.step()
        
        self.world_model_losses.append(losses['total'].item())
        
        return losses
    
    def update_policy_with_imagination(self, num_imagination_episodes=10):
        """Update policy using imagined trajectories."""
        if len(self.experience_buffer) == 0:
            return None
        
        total_policy_loss = 0.0
        total_value_loss = 0.0
        
        for _ in range(num_imagination_episodes):
            states, _, _, _, _ = self.experience_buffer.sample(1)
            initial_obs = torch.FloatTensor(states[0]).unsqueeze(0)
            
            actions = [torch.randint(0, self.action_dim, (1,)) for _ in range(self.planning_horizon)]
            
            trajectory = self.world_model.imagine_trajectory(initial_obs, actions)
            
            policy_loss = 0.0
            value_loss = 0.0
            
            for t, (latent_state, action, reward) in enumerate(zip(
                trajectory['latent_states'][:-1], 
                trajectory['actions'], 
                trajectory['rewards']
            )):
                action_probs = self.policy_net(latent_state)
                log_prob = torch.log(action_probs.gather(1, action.unsqueeze(1)))
                policy_loss -= log_prob * reward.detach()
                
                value_pred = self.value_net(latent_state)
                value_loss += F.mse_loss(value_pred, reward.detach())
            
            total_policy_loss += policy_loss.item()
            total_value_loss += value_loss.item()
            
            self.policy_optimizer.zero_grad()
            policy_loss.backward(retain_graph=True)
            self.policy_optimizer.step()
            
            self.value_optimizer.zero_grad()
            value_loss.backward()
            self.value_optimizer.step()
        
        avg_policy_loss = total_policy_loss / num_imagination_episodes
        avg_value_loss = total_value_loss / num_imagination_episodes
        
        self.policy_losses.append(avg_policy_loss)
        
        return {
            'policy_loss': avg_policy_loss,
            'value_loss': avg_value_loss
        }

def demonstrate_world_model_learning():
    """Demonstrate world model learning and imagination-based planning."""
    print("🌍 Demonstrating World Model Learning and Imagination")
    
    env = SimpleGridWorld(size=4)
    
    agent = ImaginationBasedAgent(obs_dim=2, action_dim=4, latent_dim=16, planning_horizon=5)
    
    num_episodes = 150
    batch_size = 16
    world_model_updates = 5
    imagination_updates = 3
    
    episode_rewards = []
    world_model_loss_history = []
    
    print("Starting training...")
    
    for episode in range(num_episodes):
        state = env.reset()
        episode_reward = 0
        episode_steps = 0
        max_steps = 50
        
        for step in range(max_steps):
            epsilon = max(0.1, 1.0 - episode / 80)
            action = agent.act(state, epsilon=epsilon)
            
            next_state, reward, done, _ = env.step(action)
            episode_reward += reward
            episode_steps += 1
            
            agent.experience_buffer.push(state, action, reward, next_state, done)
            
            if len(agent.experience_buffer) > batch_size:
                for _ in range(world_model_updates):
                    losses = agent.update_world_model(batch_size)
                    if losses:
                        world_model_loss_history.append(losses['total'].item())
            
            if episode > 20 and len(agent.experience_buffer) > batch_size:
                agent.update_policy_with_imagination(imagination_updates)
            
            if done:
                break
            
            state = next_state
        
        episode_rewards.append(episode_reward)
        
        if episode % 30 == 0 and episode > 0:
            recent_reward = np.mean(episode_rewards[-10:])
            recent_wm_loss = np.mean(world_model_loss_history[-50:]) if world_model_loss_history else 0
            print(f"Episode {episode}: Avg Reward: {recent_reward:.2f}, WM Loss: {recent_wm_loss:.4f}")
    
    return agent, episode_rewards, world_model_loss_history

def visualize_world_model_performance(agent, episode_rewards, world_model_losses):
    """Visualize world model learning performance."""
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    window_size = 10
    if len(episode_rewards) >= window_size:
        smoothed_rewards = pd.Series(episode_rewards).rolling(window_size).mean()
        axes[0,0].plot(smoothed_rewards, linewidth=2, color='blue')
    else:
        axes[0,0].plot(episode_rewards, linewidth=2, color='blue')
    
    axes[0,0].set_title('Learning Curve (Episode Rewards)')
    axes[0,0].set_xlabel('Episode')
    axes[0,0].set_ylabel('Episode Reward')
    axes[0,0].grid(True, alpha=0.3)
    
    if world_model_losses:
        axes[0,1].plot(world_model_losses, linewidth=1, alpha=0.7, color='red')
        if len(world_model_losses) >= 20:
            smoothed_losses = pd.Series(world_model_losses).rolling(20).mean()
            axes[0,1].plot(smoothed_losses, linewidth=2, color='darkred')
    
    axes[0,1].set_title('World Model Training Loss')
    axes[0,1].set_xlabel('Update Step')
    axes[0,1].set_ylabel('Loss')
    axes[0,1].grid(True, alpha=0.3)
    
    if len(agent.experience_buffer) > 50:
        states, _, _, _, _ = agent.experience_buffer.sample(50)
        states_tensor = torch.FloatTensor(states)
        
        with torch.no_grad():
            mu, _ = agent.world_model.encode(states_tensor)
            latent_states = mu.numpy()
        
        if latent_states.shape[1] >= 2:
            axes[1,0].scatter(latent_states[:, 0], latent_states[:, 1], alpha=0.6, c=range(len(latent_states)))
            axes[1,0].set_title('Learned Latent Space Representation')
            axes[1,0].set_xlabel('Latent Dimension 1')
            axes[1,0].set_ylabel('Latent Dimension 2')
            axes[1,0].grid(True, alpha=0.3)
    
    planning_horizons = [1, 3, 5, 8, 10]
    planning_performance = []
    
    for horizon in planning_horizons:
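        # Match the trained agent's latent size so the freshly built networks are compatible.
        test_agent = ImaginationBasedAgent(obs_dim=2, action_dim=4, latent_dim=agent.latent_dim, planning_horizon=horizon)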
        test_agent.world_model = agent.world_model  # Use trained world model
        test_agent.policy_net = agent.policy_net    # Use trained policy
        test_agent.value_net = agent.value_net      # Use trained value function
        
        test_env = SimpleGridWorld(size=4)
        test_rewards = []
        
        for _ in range(10):  # Quick test
            state = test_env.reset()
            episode_reward = 0
            
            for _ in range(20):
                action = test_agent.plan_with_world_model(state)
                next_state, reward, done, _ = test_env.step(action)
                episode_reward += reward
                
                if done:
                    break
                state = next_state
            
            test_rewards.append(episode_reward)
        
        planning_performance.append(np.mean(test_rewards))
    
    axes[1,1].plot(planning_horizons, planning_performance, 'o-', linewidth=2, markersize=8)
    axes[1,1].set_title('Planning Horizon vs Performance')
    axes[1,1].set_xlabel('Planning Horizon')
    axes[1,1].set_ylabel('Average Reward')
    axes[1,1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return planning_performance

print("🚀 Starting World Model Learning Demonstration!")
trained_agent, episode_rewards, wm_losses = demonstrate_world_model_learning()
planning_analysis = visualize_world_model_performance(trained_agent, episode_rewards, wm_losses)

print("\n🌍 World Model Learning Results:")
print(f"  • Final average reward: {np.mean(episode_rewards[-10:]):.2f}")
print(f"  • Final world model loss (mean of last 20 updates): {np.mean(wm_losses[-20:]):.4f}")
print(f"  • Optimal planning horizon: {[1,3,5,8,10][np.argmax(planning_analysis)]}")

print("\n💡 Key Insights from World Model Learning:")
print("  • World models enable sample-efficient learning through imagination")
print("  • Planning horizon affects performance - too short lacks foresight, too long accumulates errors")
print("  • Learned latent representations capture environment structure")
print("  • Imagination-based policy updates improve without real environment interaction")


# Section 3: Sample Efficiency and Transfer Learning

## 3.1 Sample Efficiency Challenges in Deep RL

Sample efficiency is one of the most critical challenges in deep reinforcement learning, particularly for real-world applications where data collection is expensive or dangerous.

### Why is Sample Efficiency Important?

**Real-World Constraints:**
- **Cost**: Real-world interactions can be expensive (robotics, autonomous vehicles)
- **Time**: Learning from millions of samples is often impractical
- **Safety**: Exploratory actions in safety-critical domains can be dangerous
- **Reproducibility**: Methods that need fewer samples are cheaper to rerun across random seeds, which makes results easier to verify

**Sample Complexity Factors:**
- **Environment Complexity**: High-dimensional state/action spaces
- **Sparse Rewards**: Learning signals are infrequent
- **Stochasticity**: Environmental noise requires more samples
- **Exploration**: Discovering good policies requires extensive exploration

## 3.2 Sample Efficiency Techniques

### 3.2.1 Experience Replay and Prioritization

**Experience Replay Benefits:**
- Reuse past experiences multiple times
- Break temporal correlations in data
- Enable off-policy learning

**Prioritized Experience Replay:**
Prioritize experiences based on temporal difference (TD) error:
$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

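Where $p_i = |\delta_i| + \epsilon$ is the priority of transition $i$, $\delta_i$ is its TD error, $\epsilon > 0$ keeps every transition sampleable, and $\alpha \ge 0$ controls how strongly sampling favors high-error transitions ($\alpha = 0$ recovers uniform sampling).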
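
As a worked instance of the formula above, the following sketch computes sampling probabilities for four hypothetical TD errors. The values of `alpha`, `beta`, and `eps` are assumed, and the importance-sampling weights $w_i = (N \cdot P(i))^{-\beta}$, normalized by their maximum, follow common practice and are not prescribed by this assignment.

```python
import numpy as np

# Hypothetical TD errors for four stored transitions.
td_errors = np.array([0.5, -2.0, 0.1, 1.0])
alpha, beta, eps = 0.6, 0.4, 1e-6  # assumed hyperparameters

# p_i = |delta_i| + eps, then P(i) = p_i^alpha / sum_k p_k^alpha
priorities = np.abs(td_errors) + eps
probs = priorities ** alpha
probs /= probs.sum()

# Importance-sampling weights correct the bias from non-uniform sampling.
weights = (len(td_errors) * probs) ** (-beta)
weights /= weights.max()

print(probs.round(3), weights.round(3))
```

The transition with the largest absolute TD error is sampled most often and receives the smallest importance weight, so its larger sampling frequency does not dominate the gradient.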

### 3.2.2 Data Augmentation

**Techniques:**
- **Random Crops**: For image-based environments
- **Color Jittering**: Robustness to lighting variations
- **Random Shifts**: Translation invariance
- **Gaussian Noise**: Regularization effect

### 3.2.3 Auxiliary Tasks

Learn multiple tasks simultaneously to improve sample efficiency:
- **Pixel Control**: Predict pixel changes
- **Feature Control**: Control learned feature representations
- **Reward Prediction**: Predict future rewards
- **Value Function Replay**: Re-train the value function on replayed experience

## 3.3 Transfer Learning in Reinforcement Learning

Transfer learning enables agents to leverage knowledge from previous tasks to learn new tasks more efficiently.

### 3.3.1 Types of Transfer in RL

**Policy Transfer:**
$$\pi_{target}(a|s) = f(\pi_{source}(a|s), s, \theta_{adapt})$$

**Value Function Transfer:**
$$Q_{target}(s,a) = g(Q_{source}(s,a), s, a, \phi_{adapt})$$

**Representation Transfer:**
$$\phi_{target}(s) = h(\phi_{source}(s), \psi_{adapt})$$

### 3.3.2 Transfer Learning Approaches

#### Fine-tuning
1. Pre-train on source task
2. Initialize target model with source weights
3. Fine-tune on target task with lower learning rate
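
The three steps above can be sketched in PyTorch as follows. The network shapes, the replaced output head, and the two learning rates are illustrative assumptions; pre-training itself is assumed to have already happened.

```python
import copy
import torch.nn as nn
import torch.optim as optim

# Hypothetical source network: shared body plus a task-specific head.
source_net = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),   # body (indices 0-1)
    nn.Linear(64, 2),              # head (index 2)
)
# Step 1 (pre-training on the source task) is assumed to have happened here.

# Step 2: initialize the target model from the source weights.
target_net = copy.deepcopy(source_net)
# Replace the head because the assumed target task has 3 actions instead of 2.
target_net[2] = nn.Linear(64, 3)

# Step 3: fine-tune, using a lower learning rate for the transferred body.
optimizer = optim.Adam([
    {"params": target_net[0].parameters(), "lr": 1e-4},
    {"params": target_net[2].parameters(), "lr": 1e-3},
])
```

Using per-parameter-group learning rates lets the new head adapt quickly while the transferred features change slowly.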

#### Progressive Networks
- Freeze source network columns
- Add new columns for target tasks
- Use lateral connections between columns

#### Universal Value Functions (UVF)
Learn value functions conditioned on goals:
$$Q(s, a, g) = \text{Value of action } a \text{ in state } s \text{ for goal } g$$

## 3.4 Meta-Learning and Few-Shot Adaptation

Meta-learning enables agents to quickly adapt to new tasks with limited experience.

### 3.4.1 Model-Agnostic Meta-Learning (MAML)

**Objective:**
$$\min_\theta \sum_{\tau \sim p(\mathcal{T})} \mathcal{L}_\tau(f_{\theta_\tau'})$$

Where $\theta_\tau' = \theta - \alpha \nabla_\theta \mathcal{L}_\tau(f_\theta)$ are the task-adapted parameters after one inner gradient step with step size $\alpha$.

**MAML Algorithm:**
1. Sample batch of tasks
2. For each task, compute adapted parameters via gradient descent
3. Update meta-parameters using gradient through adaptation process
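
A minimal sketch of these three steps on a toy regression problem is shown below. The linear model, the task distribution in `sample_task`, the single inner step, and both step sizes are assumptions chosen for brevity; an RL version would replace `loss_fn` with a policy-gradient loss.

```python
import torch

torch.manual_seed(0)

# Meta-parameters of a tiny linear model y = w * x + b.
theta = [torch.zeros(1, requires_grad=True), torch.zeros(1, requires_grad=True)]
alpha, meta_lr = 0.1, 0.01  # inner and outer step sizes (assumed values)

def loss_fn(params, x, y):
    w, b = params
    return ((w * x + b - y) ** 2).mean()

def sample_task():
    """Hypothetical task distribution: lines through the origin with random slope."""
    slope = torch.rand(1) * 4 - 2
    x_support, x_query = torch.randn(10), torch.randn(10)
    return x_support, slope * x_support, x_query, slope * x_query

for _ in range(100):
    meta_loss = 0.0
    for _ in range(4):  # 1. sample a batch of tasks
        x_s, y_s, x_q, y_q = sample_task()
        # 2. one inner gradient step; create_graph keeps the second-order terms
        grads = torch.autograd.grad(loss_fn(theta, x_s, y_s), theta, create_graph=True)
        adapted = [p - alpha * g for p, g in zip(theta, grads)]
        # Evaluate the adapted parameters on held-out query data from the same task.
        meta_loss = meta_loss + loss_fn(adapted, x_q, y_q)
    # 3. differentiate through the adaptation step and update theta
    meta_grads = torch.autograd.grad(meta_loss, theta)
    with torch.no_grad():
        for p, g in zip(theta, meta_grads):
            p -= meta_lr * g
```

Dropping `create_graph=True` would give the first-order approximation, which is cheaper but ignores how the inner step depends on $\theta$.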

### 3.4.2 Gradient-Based Meta-Learning

**Reptile Algorithm:**
A simpler, first-order alternative to MAML that avoids differentiating through the adaptation process:
$$\theta \leftarrow \theta + \beta \frac{1}{n} \sum_{i=1}^n (\phi_i - \theta)$$

Where $\phi_i$ is the result of training on task $i$.
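
The update rule above can be sketched with NumPy on a toy regression problem. The linear model, the task distribution (slopes drawn from $[1, 3]$), and the step sizes are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)            # meta-parameters [w, b] of y = w * x + b
beta, inner_lr = 0.1, 0.05     # assumed outer and inner step sizes

def train_on_task(params, slope, steps=5):
    """Run a few SGD steps on one hypothetical task and return phi_i."""
    phi = params.copy()
    for _ in range(steps):
        x = rng.normal(size=10)
        err = phi[0] * x + phi[1] - slope * x
        grad = np.array([2 * np.mean(err * x), 2 * np.mean(err)])
        phi -= inner_lr * grad
    return phi

for _ in range(200):
    slopes = rng.uniform(1.0, 3.0, size=4)           # batch of n = 4 tasks
    phis = [train_on_task(theta, s) for s in slopes]
    # theta <- theta + beta * (1/n) * sum_i (phi_i - theta)
    theta = theta + beta * np.mean([phi - theta for phi in phis], axis=0)

print(theta)
```

No gradient is taken through `train_on_task`; the meta-update only moves $\theta$ toward the task-trained parameters, which is what makes Reptile cheaper than MAML.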

## 3.5 Domain Adaptation and Sim-to-Real Transfer

### 3.5.1 Domain Randomization

**Technique:**
Randomize simulation parameters during training:
- Physical properties (mass, friction, damping)
- Visual appearance (textures, lighting, colors)
- Sensor characteristics (noise, resolution, field of view)

**Benefits:**
- Learned policies are robust to domain variations
- Improved transfer from simulation to real world
- Reduced need for domain-specific engineering
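
A minimal sketch of per-episode randomization is given below. The `SimParams` fields, their ranges, and the `sample_sim_params` helper are hypothetical; a real setup would pass these values to whatever simulator is in use.

```python
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    """Hypothetical simulator parameters; names and ranges are illustrative."""
    mass: float
    friction: float
    light_intensity: float
    sensor_noise_std: float

def sample_sim_params(rng: random.Random) -> SimParams:
    """Draw a fresh set of parameters at the start of each training episode."""
    return SimParams(
        mass=rng.uniform(0.8, 1.2),
        friction=rng.uniform(0.5, 1.5),
        light_intensity=rng.uniform(0.3, 1.0),
        sensor_noise_std=rng.uniform(0.0, 0.05),
    )

rng = random.Random(0)
for episode in range(3):
    params = sample_sim_params(rng)
    # A real training loop would rebuild or reconfigure the simulator here.
    print(episode, params)
```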

### 3.5.2 Domain Adversarial Training

**Objective:**
$$\min_\theta \mathcal{L}_{task}(\theta) + \lambda \mathcal{L}_{domain}(\theta)$$

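Where $\mathcal{L}_{domain}$ penalizes features from which the domain (for example, simulation vs. real) can be predicted, which encourages domain-invariant features, and $\lambda$ sets the trade-off with the task loss.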
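
One common way to obtain such a domain term is a gradient reversal layer, as used in domain-adversarial neural networks. This is an assumed technique here, not something the text above specifies: a domain classifier is trained normally, while the feature extractor receives the negated gradient and therefore learns features that confuse the classifier. The layer sizes and the value of `lambd` below are illustrative.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambd in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # No gradient is defined for lambd, hence the trailing None.
        return -ctx.lambd * grad_output, None

features = nn.Linear(4, 8)      # hypothetical shared feature extractor
domain_clf = nn.Linear(8, 2)    # predicts source (0) vs. target (1) domain

x = torch.randn(6, 4)
domain_labels = torch.tensor([0, 0, 0, 1, 1, 1])

z = features(x)
logits = domain_clf(GradReverse.apply(z, 0.5))
domain_loss = nn.functional.cross_entropy(logits, domain_labels)
domain_loss.backward()  # features receive the reversed gradient
```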

## 3.6 Curriculum Learning

Structure learning to progress from simple to complex tasks.

### 3.6.1 Curriculum Design Principles

**Manual Curriculum:**
- Hand-designed progression of tasks
- Expert knowledge of difficulty ordering
- Fixed curriculum regardless of agent performance

**Automatic Curriculum:**
- Adaptive task selection based on agent performance
- Learning progress as curriculum signal
- Self-paced learning approaches

### 3.6.2 Curriculum Learning Algorithms

**Teacher-Student Framework:**
- Teacher selects appropriate tasks for student
- Task difficulty based on student's current capability
- Optimize task selection for maximum learning progress
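
A minimal teacher can be sketched as follows. Measuring learning progress as the change in mean score between the older and newer half of a score window, and mixing in epsilon-random task choices, are assumptions made for illustration; `select_task` and the example scores are hypothetical.

```python
import random

def select_task(progress_history, rng, epsilon=0.1):
    """Pick the task with the largest recent absolute learning progress.

    progress_history maps a task name to a list of recent student scores.
    With probability epsilon a random task is chosen to keep estimates fresh.
    """
    if rng.random() < epsilon:
        return rng.choice(list(progress_history))

    def learning_progress(scores):
        if len(scores) < 2:
            return float("inf")  # try rarely seen tasks first
        half = len(scores) // 2
        older, newer = scores[:half], scores[half:]
        return abs(sum(newer) / len(newer) - sum(older) / len(older))

    return max(progress_history, key=lambda t: learning_progress(progress_history[t]))

history = {
    "easy":   [0.9, 0.9, 0.9, 0.9],   # mastered: no progress
    "medium": [0.2, 0.3, 0.5, 0.6],   # improving quickly
    "hard":   [0.0, 0.0, 0.0, 0.0],   # too difficult: no progress
}
print(select_task(history, random.Random(0), epsilon=0.0))
```

Tasks that are already mastered or still out of reach both show near-zero progress, so the teacher concentrates on tasks at the edge of the student's current capability.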

**Self-Play Curriculum:**
- Agent plays against previous versions of itself
- Automatic difficulty adjustment
- Helps prevent catastrophic forgetting of simpler strategies


class PrioritizedReplayBuffer:
    """Prioritized experience replay buffer for improved sample efficiency."""
    
    def __init__(self, capacity, alpha=0.6, beta=0.4, beta_increment=1e-4):
        self.capacity = capacity
        self.alpha = alpha  # Priority exponent
        self.beta = beta    # Importance sampling exponent
        self.beta_increment = beta_increment
        
        self.buffer = []
        self.priorities = np.zeros(capacity, dtype=np.float32)
        self.position = 0
        self.max_priority = 1.0
        
    def push(self, state, action, reward, next_state, done):
        """Add experience to buffer with maximum priority."""
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        
        self.buffer[self.position] = (state, action, reward, next_state, done)
        self.priorities[self.position] = self.max_priority
        
        self.position = (self.position + 1) % self.capacity
    
    def sample(self, batch_size):
        """Sample batch with prioritized sampling."""
        if len(self.buffer) < batch_size:
            return None
        
        valid_priorities = self.priorities[:len(self.buffer)]
        probs = valid_priorities ** self.alpha
        probs /= probs.sum()
        
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        
        experiences = [self.buffer[idx] for idx in indices]
        states, actions, rewards, next_states, dones = zip(*experiences)
        
        total = len(self.buffer)
        weights = (total * probs[indices]) ** (-self.beta)
        weights /= weights.max()
        
        self.beta = min(1.0, self.beta + self.beta_increment)
        
        return (states, actions, rewards, next_states, dones), indices, weights
    
    def update_priorities(self, indices, td_errors):
        """Update priorities based on TD errors."""
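        for idx, td_error in zip(indices, td_errors):
            # Store the raw priority p_i = |delta_i| + eps; sample() applies the
            # exponent alpha, so applying it here as well would exponentiate twice.
            priority = abs(td_error) + 1e-6
            self.priorities[idx] = priority
            self.max_priority = max(self.max_priority, priority)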
    
    def __len__(self):
        return len(self.buffer)

class DataAugmentationDQN(nn.Module):
    """DQN with data augmentation for improved sample efficiency."""
    
    def __init__(self, input_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.input_dim = input_dim
        self.action_dim = action_dim
        
        self.q_network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
        
        self.reward_predictor = nn.Sequential(
            nn.Linear(input_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
        
        self.next_state_predictor = nn.Sequential(
            nn.Linear(input_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim)
        )
    
    def forward(self, state, action=None):
        """Forward pass with optional auxiliary predictions."""
        q_values = self.q_network(state)
        
        if action is not None:
            if len(action.shape) == 1:
                action_one_hot = F.one_hot(action.long(), self.action_dim).float()
            else:
                action_one_hot = action
            
            aux_input = torch.cat([state, action_one_hot], dim=-1)
            reward_pred = self.reward_predictor(aux_input)
            next_state_pred = self.next_state_predictor(aux_input)
            
            return q_values, reward_pred, next_state_pred
        
        return q_values
    
    def apply_augmentation(self, state, augmentation_type='noise'):
        """Apply data augmentation to state."""
        if augmentation_type == 'noise':
            noise = torch.randn_like(state) * 0.1
            return state + noise
        
        elif augmentation_type == 'dropout':
            dropout_mask = torch.rand_like(state) > 0.1
            return state * dropout_mask.float()
        
        elif augmentation_type == 'scaling':
            scale = torch.rand(1).item() * 0.4 + 0.8  # Scale between 0.8 and 1.2
            return state * scale
        
        return state

class SampleEfficientAgent:
    """Agent with multiple sample efficiency techniques."""
    
    def __init__(self, state_dim, action_dim, lr=1e-3):
        self.state_dim = state_dim
        self.action_dim = action_dim
        
        self.network = DataAugmentationDQN(state_dim, action_dim)
        self.target_network = copy.deepcopy(self.network)
        self.optimizer = optim.Adam(self.network.parameters(), lr=lr)
        
        self.replay_buffer = PrioritizedReplayBuffer(capacity=10000)
        
        self.gamma = 0.99
        self.target_update_freq = 100
        self.update_count = 0
        
        self.aux_reward_weight = 0.1
        self.aux_dynamics_weight = 0.1
        
        self.losses = []
        self.td_errors = []
    
    def act(self, state, epsilon=0.1):
        """Select action with epsilon-greedy policy."""
        if np.random.random() < epsilon:
            return np.random.randint(self.action_dim)
        
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            q_values = self.network(state_tensor)
            return q_values.argmax().item()
    
    def update(self, batch_size=32, use_aux_tasks=True, augmentation=True):
        """Update agent with prioritized replay and auxiliary tasks."""
        sample_result = self.replay_buffer.sample(batch_size)
        if sample_result is None:
            return None
        
        experiences, indices, weights = sample_result
        states, actions, rewards, next_states, dones = experiences
        
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        dones = torch.BoolTensor(dones)
        weights = torch.FloatTensor(weights)
        
        if augmentation:
            aug_type = np.random.choice(['noise', 'dropout', 'scaling'])
            states = self.network.apply_augmentation(states, aug_type)
            next_states = self.network.apply_augmentation(next_states, aug_type)
        
        current_q_values = self.network(states).gather(1, actions.unsqueeze(1))
        
        with torch.no_grad():
            next_q_values = self.target_network(next_states).max(1)[0]
            target_q_values = rewards + (self.gamma * next_q_values * (~dones))
        
        td_errors = (current_q_values.squeeze() - target_q_values).detach().numpy()
        
        q_loss = (weights * F.mse_loss(current_q_values.squeeze(), target_q_values, reduction='none')).mean()
        
        total_loss = q_loss
        
        if use_aux_tasks:
            q_values, reward_pred, next_state_pred = self.network(states, actions)
            
            aux_reward_loss = F.mse_loss(reward_pred.squeeze(), rewards)
            aux_dynamics_loss = F.mse_loss(next_state_pred, next_states)
            
            total_loss += self.aux_reward_weight * aux_reward_loss
            total_loss += self.aux_dynamics_weight * aux_dynamics_loss
        
        self.optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.network.parameters(), max_norm=1.0)
        self.optimizer.step()
        
        self.replay_buffer.update_priorities(indices, td_errors)
        
        self.update_count += 1
        if self.update_count % self.target_update_freq == 0:
            self.target_network.load_state_dict(self.network.state_dict())
        
        self.losses.append(total_loss.item())
        self.td_errors.extend(td_errors.tolist())
        
        return {
            'total_loss': total_loss.item(),
            'q_loss': q_loss.item(),
            'aux_reward_loss': aux_reward_loss.item() if use_aux_tasks else 0,
            'aux_dynamics_loss': aux_dynamics_loss.item() if use_aux_tasks else 0
        }

class TransferLearningAgent:
    """Agent with transfer learning capabilities."""
    
    def __init__(self, state_dim, action_dim, lr=1e-3):
        self.state_dim = state_dim
        self.action_dim = action_dim
        
        self.feature_extractor = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU()
        )
        
        self.policy_heads = {}
        self.value_heads = {}
        
        self.feature_optimizer = optim.Adam(self.feature_extractor.parameters(), lr=lr)
        self.head_optimizers = {}
        
        self.transfer_performance = {}
    
    def add_task(self, task_name, action_dim=None):
        """Add a new task with its own policy and value heads."""
        if action_dim is None:
            action_dim = self.action_dim
        
        self.policy_heads[task_name] = nn.Sequential(
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, action_dim),
            nn.Softmax(dim=-1)
        )
        
        self.value_heads[task_name] = nn.Sequential(
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )
        
        task_params = list(self.policy_heads[task_name].parameters()) + \
                     list(self.value_heads[task_name].parameters())
        self.head_optimizers[task_name] = optim.Adam(task_params, lr=1e-3)
        
        self.transfer_performance[task_name] = []
    
    def get_action(self, state, task_name):
        """Get action for specific task."""
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            features = self.feature_extractor(state_tensor)
            action_probs = self.policy_heads[task_name](features)
            return Categorical(action_probs).sample().item()
    
    def get_value(self, state, task_name):
        """Get value estimate for specific task."""
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            features = self.feature_extractor(state_tensor)
            return self.value_heads[task_name](features).item()
    
    def update(self, states, actions, rewards, task_name, update_features=True):
        """Update agent for specific task."""
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        
        features = self.feature_extractor(states)
        action_probs = self.policy_heads[task_name](features)
        values = self.value_heads[task_name](features).squeeze()
        
        log_probs = torch.log(action_probs.gather(1, actions.unsqueeze(1))).squeeze()
        advantages = rewards - values.detach()
        policy_loss = -(log_probs * advantages).mean()
        
        value_loss = F.mse_loss(values, rewards)
        
        total_loss = policy_loss + 0.5 * value_loss
        
        self.head_optimizers[task_name].zero_grad()
        if update_features:
            self.feature_optimizer.zero_grad()
        
        total_loss.backward()
        
        self.head_optimizers[task_name].step()
        if update_features:
            self.feature_optimizer.step()
        
        return {
            'policy_loss': policy_loss.item(),
            'value_loss': value_loss.item()
        }
    
    def fine_tune_for_task(self, source_task, target_task, fine_tune_lr=1e-4):
        """Fine-tune from source task to target task."""
        self.policy_heads[target_task] = copy.deepcopy(self.policy_heads[source_task])
        self.value_heads[target_task] = copy.deepcopy(self.value_heads[source_task])
        
        task_params = list(self.policy_heads[target_task].parameters()) + \
                     list(self.value_heads[target_task].parameters())
        self.head_optimizers[target_task] = optim.Adam(task_params, lr=fine_tune_lr)
        
        self.transfer_performance[target_task] = []

class CurriculumLearningFramework:
    """Framework for curriculum learning with automatic difficulty adjustment."""
    
    def __init__(self, environments, agent, difficulty_measure='success_rate'):
        self.environments = environments  # List of environments with increasing difficulty
        self.agent = agent
        self.difficulty_measure = difficulty_measure
        
        self.current_level = 0
        self.level_performance = [[] for _ in environments]
        self.progression_threshold = 0.8  # Success rate threshold to advance
        self.regression_threshold = 0.3   # Success rate threshold to regress
        
        self.curriculum_history = []
    
    def get_current_environment(self):
        """Get current environment based on curriculum level."""
        return self.environments[self.current_level]
    
    def evaluate_performance(self, episode_rewards, episode_successes=None):
        """Evaluate performance on the current level over the last 10 episodes."""
        if self.difficulty_measure == 'success_rate' and episode_successes is not None:
            history = episode_successes
        else:
            # 'reward' and any unrecognized measure both fall back to mean episode reward
            history = episode_rewards
        return np.mean(history[-10:]) if len(history) >= 10 else 0
    
    def update_curriculum(self, performance):
        """Update curriculum level based on performance."""
        old_level = self.current_level
        
        if performance >= self.progression_threshold and self.current_level < len(self.environments) - 1:
            self.current_level += 1
            print(f"📈 Advanced to level {self.current_level} (performance: {performance:.2f})")
        
        elif performance < self.regression_threshold and self.current_level > 0:
            self.current_level = max(0, self.current_level - 1)
            print(f"📉 Regressed to level {self.current_level} (performance: {performance:.2f})")
        
        if old_level != self.current_level:
            self.curriculum_history.append({
                'episode': len(self.level_performance[old_level]),
                'old_level': old_level,
                'new_level': self.current_level,
                'performance': performance
            })
        
        return self.current_level != old_level
    
    def train_with_curriculum(self, num_episodes=1000):
        """Train agent using curriculum learning."""
        episode_rewards = []
        episode_successes = []
        
        for episode in range(num_episodes):
            env = self.get_current_environment()
            
            state = env.reset()
            episode_reward = 0
            episode_success = False
            
            for step in range(100):  # Max episode length
                action = self.agent.act(state, epsilon=max(0.1, 1.0 - episode/500))
                next_state, reward, done, info = env.step(action)
                
                self.agent.replay_buffer.push(state, action, reward, next_state, done)
                
                episode_reward += reward
                if done and reward > 5:  # Define success condition
                    episode_success = True
                
                if done:
                    break
                
                state = next_state
            
            if len(self.agent.replay_buffer) > 32:
                self.agent.update(32)
            
            episode_rewards.append(episode_reward)
            episode_successes.append(episode_success)
            self.level_performance[self.current_level].append(episode_reward)
            
            if episode % 20 == 0:
                performance = self.evaluate_performance(episode_rewards, episode_successes)
                self.update_curriculum(performance)
            
            if episode % 100 == 0:
                recent_reward = np.mean(episode_rewards[-10:])
                recent_success = np.mean(episode_successes[-10:])
                print(f"Episode {episode}: Level {self.current_level}, "
                      f"Reward: {recent_reward:.2f}, Success: {recent_success:.2f}")
        
        return episode_rewards, episode_successes

def compare_sample_efficiency():
    """Compare sample efficiency of different techniques."""
    print("⚡ Comparing Sample Efficiency Techniques")
    
    env = SimpleGridWorld(size=6)
    
    baseline_agent = DQNAgent(state_dim=2, action_dim=4)
    efficient_agent = SampleEfficientAgent(state_dim=2, action_dim=4)
    
    agents = {
        'Baseline DQN': baseline_agent,
        'Sample Efficient': efficient_agent
    }
    
    results = {name: {'rewards': [], 'episodes': []} for name in agents.keys()}
    
    num_episodes = 300
    
    for episode in range(num_episodes):
        for agent_name, agent in agents.items():
            state = env.reset()
            episode_reward = 0
            
            for step in range(50):
                action = agent.act(state, epsilon=max(0.1, 1.0 - episode/200))
                next_state, reward, done, _ = env.step(action)
                episode_reward += reward
                
                if agent_name == 'Baseline DQN':
                    agent.replay_buffer.push(state, action, reward, next_state, done)
                    if len(agent.replay_buffer) > 32:
                        batch = agent.replay_buffer.sample(32)
                        agent.update(batch)
                else:
                    agent.replay_buffer.push(state, action, reward, next_state, done)
                    if len(agent.replay_buffer) > 32:
                        agent.update(32)
                
                if done:
                    break
                
                state = next_state
            
            results[agent_name]['rewards'].append(episode_reward)
            results[agent_name]['episodes'].append(episode)
    
    return results

def demonstrate_transfer_learning():
    """Demonstrate transfer learning between related tasks."""
    print("🔄 Demonstrating Transfer Learning")
    
    agent = TransferLearningAgent(state_dim=2, action_dim=4)
    
    def create_task_env(goal_position, reward_scale=1.0):
        env = SimpleGridWorld(size=4)
        env.goal = goal_position
        env.reward_scale = reward_scale
        return env
    
    tasks = {
        'task_1': create_task_env([3, 3], 1.0),     # Original goal
        'task_2': create_task_env([3, 0], 1.0),     # Different goal
        'task_3': create_task_env([0, 3], 1.0),     # Another goal
    }
    
    for task_name in tasks.keys():
        agent.add_task(task_name)
    
    print("Training on Task 1...")
    task_1_env = tasks['task_1']
    
    for episode in range(200):
        state = task_1_env.reset()
        episode_states, episode_actions, episode_rewards = [], [], []
        
        for step in range(30):
            action = agent.get_action(state, 'task_1')
            next_state, reward, done, _ = task_1_env.step(action)
            
            episode_states.append(state)
            episode_actions.append(action)
            episode_rewards.append(reward)
            
            if done:
                break
            
            state = next_state
        
        if episode_rewards:
            agent.update(episode_states, episode_actions, episode_rewards, 'task_1')
    
    transfer_results = {}
    
    for new_task in ['task_2', 'task_3']:
        print(f"Transferring to {new_task}...")
        
        agent.fine_tune_for_task('task_1', new_task)
        
        task_env = tasks[new_task]
        task_rewards = []
        
        for episode in range(50):  # Limited training
            state = task_env.reset()
            episode_reward = 0
            episode_states, episode_actions, episode_rewards = [], [], []
            
            for step in range(30):
                action = agent.get_action(state, new_task)
                next_state, reward, done, _ = task_env.step(action)
                
                episode_states.append(state)
                episode_actions.append(action)
                episode_rewards.append(reward)
                episode_reward += reward
                
                if done:
                    break
                
                state = next_state
            
            if episode_rewards:
                agent.update(episode_states, episode_actions, episode_rewards, 
                           new_task, update_features=False)
            
            task_rewards.append(episode_reward)
        
        transfer_results[new_task] = task_rewards
        print(f"  Final performance on {new_task}: {np.mean(task_rewards[-10:]):.2f}")
    
    return transfer_results

print("🚀 Starting Sample Efficiency and Transfer Learning Demonstrations!")

efficiency_results = compare_sample_efficiency()

transfer_results = demonstrate_transfer_learning()

fig, axes = plt.subplots(1, 2, figsize=(15, 6))

for agent_name, data in efficiency_results.items():
    window_size = 20
    if len(data['rewards']) >= window_size:
        smoothed_rewards = pd.Series(data['rewards']).rolling(window_size).mean()
        axes[0].plot(data['episodes'], smoothed_rewards, label=agent_name, linewidth=2)

axes[0].set_title('Sample Efficiency Comparison')
axes[0].set_xlabel('Episode')
axes[0].set_ylabel('Episode Reward (Smoothed)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

for task_name, rewards in transfer_results.items():
    axes[1].plot(rewards, label=f'Transfer to {task_name}', linewidth=2)

axes[1].set_title('Transfer Learning Performance')
axes[1].set_xlabel('Episode (Limited Training)')
axes[1].set_ylabel('Episode Reward')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Sample Efficiency Results:")
for agent_name, data in efficiency_results.items():
    final_perf = np.mean(data['rewards'][-20:])
    print(f"  {agent_name}: {final_perf:.2f} final performance")

print("\n🔄 Transfer Learning Results:")
for task_name, rewards in transfer_results.items():
    final_perf = np.mean(rewards[-10:])
    print(f"  {task_name}: {final_perf:.2f} final performance with limited training")

print("\n💡 Key Insights:")
print("  • Prioritized replay and auxiliary tasks improve sample efficiency")
print("  • Data augmentation provides regularization benefits")
print("  • Transfer learning enables rapid adaptation to new tasks")
print("  • Shared representations capture generalizable knowledge")
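
The level-switching rule in `CurriculumLearningFramework.update_curriculum` (advance at a performance of 0.8, regress below 0.3) can be checked in isolation. The sketch below is a standalone restatement of that rule; the helper name `next_level` and the sample performance values are hypothetical and not part of the assignment code.

```python
def next_level(level, performance, num_levels, advance_at=0.8, regress_below=0.3):
    """Return the curriculum level to use after observing `performance`."""
    if performance >= advance_at and level < num_levels - 1:
        return level + 1          # good enough: move to a harder environment
    if performance < regress_below and level > 0:
        return level - 1          # struggling: fall back to an easier environment
    return level                  # in between: stay where we are


# Walk a made-up sequence of evaluation scores through a 3-level curriculum.
levels = []
level = 0
for perf in [0.5, 0.85, 0.9, 0.2, 0.6, 0.95]:
    level = next_level(level, perf, num_levels=3)
    levels.append(level)

print(levels)
```

The gap between the two thresholds acts as a dead band: scores between 0.3 and 0.8 leave the level unchanged, which keeps the agent from oscillating between neighbouring levels.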


# Section 4: Hierarchical Reinforcement Learning

## 4.1 Theory: Hierarchical Decision Making

Hierarchical Reinforcement Learning (HRL) addresses the challenge of learning complex behaviors by decomposing tasks into hierarchical structures. This approach enables agents to:

1. **Learn at Multiple Time Scales**: High-level policies select goals or skills, while low-level policies execute primitive actions
2. **Achieve Better Generalization**: Skills learned in one context can be reused in others
3. **Improve Sample Efficiency**: By leveraging temporal abstractions and skill composition

### Key Components

#### Options Framework
An **option** $\omega$ is defined by a tuple $(I_\omega, \pi_\omega, \beta_\omega)$:
- **Initiation Set** $I_\omega \subseteq \mathcal{S}$: States where the option can be initiated
- **Policy** $\pi_\omega: \mathcal{S} \times \mathcal{A} \rightarrow [0,1]$: Action selection within the option
- **Termination Condition** $\beta_\omega: \mathcal{S} \rightarrow [0,1]$: Probability of termination
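
To make the three components concrete, here is a minimal sketch of an option as a plain data structure, assuming integer states, a deterministic intra-option policy, and a toy corridor environment. The names `Option`, `run_option`, and `corridor_step` are illustrative only and do not appear in the assignment code.

```python
import random
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Option:
    name: str
    initiation_set: FrozenSet[int]        # I_omega: states where the option may start
    policy: Callable[[int], int]          # pi_omega: state -> action (deterministic here)
    termination: Callable[[int], float]   # beta_omega: probability of stopping in a state

    def can_start(self, state: int) -> bool:
        return state in self.initiation_set


def run_option(option, state, step, rng, max_steps=50):
    """Execute `option` from `state` until beta fires; return (final_state, steps_taken)."""
    assert option.can_start(state), "option started outside its initiation set"
    steps = 0
    while steps < max_steps:
        state = step(state, option.policy(state))
        steps += 1
        if rng.random() < option.termination(state):
            break
    return state, steps


def corridor_step(state, action):
    """Toy 1-D corridor with states 0..5; actions are +1 (right) or -1 (left)."""
    return min(5, max(0, state + action))


go_right = Option(
    name="go_right",
    initiation_set=frozenset(range(0, 5)),          # cannot start at the right wall
    policy=lambda s: +1,                            # always move right
    termination=lambda s: 1.0 if s == 5 else 0.0,   # stop only at the right wall
)

final_state, steps = run_option(go_right, 2, corridor_step, random.Random(0))
print(final_state, steps)
```

A primitive action is the special case of an option that can start anywhere and terminates with probability 1 after one step.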

#### Hierarchical Value Functions
The value function for options follows the Bellman equation:
$$Q^\pi(s,\omega) = \mathbb{E}_\pi\left[\sum_{t=0}^{\tau-1} \gamma^t r_{t+1} + \gamma^\tau Q^\pi(s_\tau, \omega') \mid s_0=s, \omega_0=\omega\right]$$

where $\tau$ is the termination time and $\omega'$ is the next option selected.
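
As a worked instance of this equation, the sketch below computes a one-sample backup target for an option that ran for $\tau = 3$ steps. The function name `option_backup` and the numbers are made up for illustration.

```python
def option_backup(rewards, gamma, next_option_value):
    """One-sample estimate of Q(s, omega) for an option that ran len(rewards) steps.

    rewards[t] plays the role of r_{t+1}; next_option_value is Q(s_tau, omega').
    """
    tau = len(rewards)
    discounted = sum((gamma ** t) * r for t, r in enumerate(rewards))
    return discounted + (gamma ** tau) * next_option_value


# Rewards 1, 0, 2 collected while the option ran, then hand-over to an option valued at 10.
target = option_backup([1.0, 0.0, 2.0], gamma=0.9, next_option_value=10.0)
print(round(target, 4))
```

Here the in-option rewards contribute $1 + 0 + 0.81 \cdot 2 = 2.62$ and the bootstrap term contributes $0.9^3 \cdot 10 = 7.29$, so the target is $9.91$. The longer an option runs, the more heavily the bootstrap value is discounted.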

#### Feudal Networks
Feudal Networks implement a manager-worker hierarchy:
- **Manager Network**: Sets goals $g_t$ for workers: $g_t = f_{manager}(s_t, h_{t-1}^{manager})$
- **Worker Network**: Executes actions conditioned on goals: $a_t = \pi_{worker}(s_t, g_t)$
- **Intrinsic Motivation**: Workers receive intrinsic rewards based on goal achievement

### Mathematical Framework

#### Intrinsic Reward Signal
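The worker's intrinsic reward measures how well the state change aligns with the goal direction $g_t$ set by the manager, scaled by the size of the step and a constant $\alpha$ (`intrinsic_reward_scale` in the implementation):
$$r_t^{intrinsic} = \alpha \cdot \cos\angle(s_{t+1} - s_t,\; g_t) \cdot ||s_{t+1} - s_t||, \qquad \cos\angle(u, v) = \frac{u^\top v}{||u||\,||v||}$$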
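
One way to read this reward is as the cosine similarity between the state change and the goal direction, scaled by the step length, which is what `FeudalAgent.compute_intrinsic_reward` in the implementation computes. The dependency-free sketch below follows that reading and assumes the goal and the state live in the same space; the function name and the sample vectors are hypothetical.

```python
import math


def directional_intrinsic_reward(state, next_state, goal, scale=1.0, eps=1e-6):
    """scale * cos(angle between state change and goal) * length of state change."""
    diff = [b - a for a, b in zip(state, next_state)]
    diff_norm = math.sqrt(sum(d * d for d in diff))
    goal_norm = math.sqrt(sum(g * g for g in goal))
    if diff_norm <= eps or goal_norm <= eps:
        return 0.0  # no movement or no goal direction: nothing to reward
    cosine = sum(d * g for d, g in zip(diff, goal)) / (diff_norm * goal_norm)
    return scale * cosine * diff_norm


goal = [1.0, 0.0]
aligned = directional_intrinsic_reward([0.0, 0.0], [1.0, 0.0], goal)    # moves along the goal
opposed = directional_intrinsic_reward([0.0, 0.0], [-1.0, 0.0], goal)   # moves against it
sideways = directional_intrinsic_reward([0.0, 0.0], [0.0, 2.0], goal)   # moves orthogonally
print(aligned, opposed, sideways)
```

Moving along the goal direction is rewarded, moving against it is penalized, and orthogonal movement earns nothing regardless of step size.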

#### Hierarchical Policy Gradient
The gradient for the manager policy:
$$\nabla_{\theta_m} J_m = \mathbb{E}\left[\nabla_{\theta_m} \log \pi_m(g_t|s_t) \cdot A_m(s_t, g_t)\right]$$

And for the worker policy:
$$\nabla_{\theta_w} J_w = \mathbb{E}\left[\nabla_{\theta_w} \log \pi_w(a_t|s_t, g_t) \cdot A_w(s_t, a_t, g_t)\right]$$
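
Both gradients share the score-function form: the gradient of a log-probability weighted by an advantage. For a softmax policy over logits $z$, the inner term has the closed form $\nabla_z \log \pi(a) = e_a - \pi$, where $e_a$ is the one-hot vector for the chosen action. The sketch below evaluates a single-sample estimate with respect to the logits, assuming a discrete softmax policy (as in the worker's `Softmax` head); a manager that emits continuous goals would need a different density. The function names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_wrt_logits(logits, action, advantage):
    """Single-sample estimate of advantage * d log pi(action) / d logits."""
    probs = softmax(logits)
    return [advantage * ((1.0 if i == action else 0.0) - p) for i, p in enumerate(probs)]


# Uniform policy over 3 actions; action 1 was taken and turned out better than expected.
grad = policy_gradient_wrt_logits([0.0, 0.0, 0.0], action=1, advantage=2.0)
print([round(g, 4) for g in grad])
```

A positive advantage raises the logit of the chosen action and lowers the others; the components always sum to zero because the probabilities sum to one.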

## 4.2 Implementation: Hierarchical RL Architectures

We'll implement several HRL approaches:
1. **Option-Critic Architecture**: Learn intra-option policies, termination functions, and the policy over options jointly
2. **Feudal Networks**: Manager-worker hierarchies
3. **Hindsight Experience Replay with Goals**: Sample efficiency for goal-conditioned tasks
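
Before the full implementations, the core idea of hindsight relabeling with the `future` strategy can be sketched on its own: each transition is copied up to `k` times with its goal replaced by a goal that was actually achieved at the same or a later step of the episode, and the reward is recomputed for that goal. The tuple layout, the sparse reward, and the names `relabel_future` and `sparse_reward` are assumptions made for this sketch, not the assignment's API.

```python
import random


def relabel_future(episode, k, reward_fn, rng):
    """episode: list of (state, action, next_state, achieved_goal) tuples."""
    relabeled = []
    for t, (state, action, next_state, achieved) in enumerate(episode):
        future_steps = list(range(t, len(episode)))
        for idx in rng.sample(future_steps, min(k, len(future_steps))):
            new_goal = episode[idx][3]             # a goal the agent really reached
            reward = reward_fn(achieved, new_goal)
            relabeled.append((state, action, reward, next_state, new_goal))
    return relabeled


def sparse_reward(achieved, goal):
    """0 when the goal is reached, -1 otherwise."""
    return 0.0 if achieved == goal else -1.0


# Toy 1-D episode walking right from 0 to 3; the achieved goal equals the next state.
episode = [
    (0, +1, 1, 1),
    (1, +1, 2, 2),
    (2, +1, 3, 3),
]
extra = relabel_future(episode, k=2, reward_fn=sparse_reward, rng=random.Random(0))
print(len(extra))
```

Even if the original goal was never reached, some relabeled transitions carry a success reward, which gives the learner a usable signal in sparse-reward tasks.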

In [None]:

class OptionsCriticNetwork(nn.Module):
    """Options-Critic architecture for learning hierarchical policies."""
    
    def __init__(self, state_dim, action_dim, num_options=4, hidden_dim=128):
        super().__init__()
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.num_options = num_options
        
        self.feature_net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        
        self.option_net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_options),
            nn.Softmax(dim=-1)
        )
        
        self.intra_option_nets = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, action_dim),
                nn.Softmax(dim=-1)
            ) for _ in range(num_options)
        ])
        
        self.termination_nets = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim // 2),
                nn.ReLU(),
                nn.Linear(hidden_dim // 2, 1),
                nn.Sigmoid()
            ) for _ in range(num_options)
        ])
        
        self.value_net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_options)
        )
    
    def forward(self, state):
        """Forward pass through the Options-Critic architecture."""
        features = self.feature_net(state)
        
        option_probs = self.option_net(features)
        
        action_probs = torch.stack([net(features) for net in self.intra_option_nets], dim=1)
        
        termination_probs = torch.stack([net(features) for net in self.termination_nets], dim=1).squeeze(-1)
        
        option_values = self.value_net(features)
        
        return option_probs, action_probs, termination_probs, option_values

class OptionsCriticAgent:
    """Agent using Options-Critic for hierarchical learning."""
    
    def __init__(self, state_dim, action_dim, num_options=4, lr=1e-3):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.num_options = num_options
        
        self.network = OptionsCriticNetwork(state_dim, action_dim, num_options)
        self.optimizer = optim.Adam(self.network.parameters(), lr=lr)
        
        self.current_option = None
        self.option_length = 0
        self.max_option_length = 10
        
        self.gamma = 0.99
        self.beta_reg = 0.01  # Regularization for termination
        
        self.option_usage = np.zeros(num_options)
        self.option_lengths = []
        self.losses = []
    
    def select_option(self, state):
        """Select an option by sampling from the policy over options."""
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            option_probs, _, _, _ = self.network(state_tensor)
            return Categorical(option_probs).sample().item()
    
    def select_action(self, state, option):
        """Select action using the intra-option policy."""
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            _, action_probs, _, _ = self.network(state_tensor)
            return Categorical(action_probs[0, option]).sample().item()
    
    def should_terminate(self, state, option):
        """Check if current option should terminate."""
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            _, _, termination_probs, _ = self.network(state_tensor)
            return np.random.random() < termination_probs[0, option].item()
    
    def act(self, state):
        """Full action selection with option management."""
        if self.current_option is None or self.should_terminate(state, self.current_option) or \
           self.option_length >= self.max_option_length:
            self.current_option = self.select_option(state)
            self.option_usage[self.current_option] += 1
            if self.option_length > 0:
                self.option_lengths.append(self.option_length)
            self.option_length = 0
        
        action = self.select_action(state, self.current_option)
        self.option_length += 1
        
        return action, self.current_option
    
    def update(self, trajectory):
        """Update using Options-Critic learning algorithm."""
        if len(trajectory) < 2:
            return None
        
        states, actions, rewards, options = zip(*trajectory)
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        options = torch.LongTensor(options)
        
        option_probs, action_probs, termination_probs, option_values = self.network(states)
        
        returns = torch.zeros_like(rewards)
        G = 0
        for t in reversed(range(len(rewards))):
            G = rewards[t] + self.gamma * G
            returns[t] = G
        
        current_option_values = option_values.gather(1, options.unsqueeze(1)).squeeze()
        value_loss = F.mse_loss(current_option_values, returns.detach())
        
        advantages = returns - current_option_values.detach()
        
        # Pick pi_omega(a_t | s_t) for the option and action actually taken at each step
        step_idx = torch.arange(len(actions))
        selected_action_probs = action_probs[step_idx, options, actions]
        
        policy_loss = -(torch.log(selected_action_probs) * advantages).mean()
        
        termination_reg = self.beta_reg * termination_probs.mean()
        
        total_loss = value_loss + policy_loss + termination_reg
        
        self.optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.network.parameters(), max_norm=1.0)
        self.optimizer.step()
        
        self.losses.append(total_loss.item())
        
        return {
            'total_loss': total_loss.item(),
            'value_loss': value_loss.item(),
            'policy_loss': policy_loss.item(),
            'termination_reg': termination_reg.item()
        }

class FeudalNetwork(nn.Module):
    """Feudal Network with Manager-Worker hierarchy."""
    
    def __init__(self, state_dim, action_dim, goal_dim=8, hidden_dim=128, temporal_horizon=10):
        super().__init__()
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.goal_dim = goal_dim
        self.temporal_horizon = temporal_horizon
        
        self.manager_net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.LSTM(hidden_dim, hidden_dim),
        )
        self.manager_goal_net = nn.Linear(hidden_dim, goal_dim)
        
        self.worker_net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)
        )
        
        self.manager_value_net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
        
        self.worker_value_net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
        
        self.manager_hidden = None
    
    def forward(self, state, goal=None):
        """Forward pass through Feudal Network."""
        batch_size = state.size(0) if len(state.shape) > 1 else 1
        if len(state.shape) == 1:
            state = state.unsqueeze(0)
        
        manager_features = self.manager_net[0](state)  # First layer
        manager_features = self.manager_net[1](manager_features)  # ReLU
        
        if self.manager_hidden is None or self.manager_hidden[0].size(1) != batch_size:
            self.manager_hidden = (
                torch.zeros(1, batch_size, self.manager_net[2].hidden_size),
                torch.zeros(1, batch_size, self.manager_net[2].hidden_size)
            )
        
        lstm_out, self.manager_hidden = self.manager_net[2](
            manager_features.unsqueeze(0), self.manager_hidden
        )
        manager_features = lstm_out.squeeze(0)
        
        goals = self.manager_goal_net(manager_features)
        goals = F.normalize(goals, p=2, dim=-1)  # Unit normalize goals
        
        manager_value = self.manager_value_net(manager_features)
        
        if goal is None:
            goal = goals
        
        worker_input = torch.cat([state, goal], dim=-1)
        action_probs = self.worker_net(worker_input)
        worker_value = self.worker_value_net(worker_input)
        
        return goals, action_probs, manager_value, worker_value
    
    def reset_manager_state(self):
        """Reset manager LSTM state."""
        self.manager_hidden = None

class FeudalAgent:
    """Feudal Networks agent with hierarchical learning."""
    
    def __init__(self, state_dim, action_dim, goal_dim=8, lr=1e-3, temporal_horizon=10):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.goal_dim = goal_dim
        self.temporal_horizon = temporal_horizon
        
        self.network = FeudalNetwork(state_dim, action_dim, goal_dim, temporal_horizon=temporal_horizon)
        self.optimizer = optim.Adam(self.network.parameters(), lr=lr)
        
        self.current_goal = None
        self.goal_step_count = 0
        
        self.gamma = 0.99
        self.intrinsic_reward_scale = 0.5
        
        self.manager_losses = []
        self.worker_losses = []
        self.goal_changes = []
    
    def compute_intrinsic_reward(self, state, next_state, goal):
        """Compute intrinsic reward as goal-direction alignment scaled by step size."""
        # np.dot needs the goal and the state difference to have the same length,
        # so this assumes goal_dim == state_dim.
        state_diff = next_state - state
        state_diff_norm = np.linalg.norm(state_diff)
        goal_norm = np.linalg.norm(goal)
        
        if state_diff_norm > 1e-6 and goal_norm > 1e-6:
            cosine_sim = np.dot(state_diff, goal) / (state_diff_norm * goal_norm)
            return self.intrinsic_reward_scale * cosine_sim * state_diff_norm
        return 0.0
    
    def act(self, state):
        """Select action using feudal hierarchy."""
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            
            if self.current_goal is None or self.goal_step_count >= self.temporal_horizon:
                goals, _, _, _ = self.network(state_tensor)
                self.current_goal = goals[0].numpy()
                self.goal_step_count = 0
                self.goal_changes.append(len(self.goal_changes))
            
            goal_tensor = torch.FloatTensor(self.current_goal).unsqueeze(0)
            _, action_probs, _, _ = self.network(state_tensor, goal_tensor)
            action = Categorical(action_probs).sample().item()
            
            self.goal_step_count += 1
        
        return action
    
    def update(self, trajectories):
        """Update feudal networks using hierarchical returns."""
        if not trajectories:
            return None
        
        total_manager_loss = 0
        total_worker_loss = 0
        num_updates = 0
        
        for traj in trajectories:
            if len(traj) < 2:
                continue
            
            states, actions, rewards, next_states = zip(*traj)
            states = torch.FloatTensor(states)
            actions = torch.LongTensor(actions)
            rewards = torch.FloatTensor(rewards)
            next_states = torch.FloatTensor(next_states)
            
            self.network.reset_manager_state()
            
            goals, action_probs, manager_values, worker_values = self.network(states)
            
            # Goals are detached before conversion: .numpy() fails on tensors that require grad.
            intrinsic_rewards = []
            for i in range(len(states)-1):
                intrinsic_reward = self.compute_intrinsic_reward(
                    states[i].numpy(), next_states[i].numpy(), goals[i].detach().numpy()
                )
                intrinsic_rewards.append(intrinsic_reward)
            intrinsic_rewards = torch.FloatTensor(intrinsic_rewards)
            
            manager_returns = torch.zeros_like(rewards)
            G = 0
            for t in reversed(range(len(rewards))):
                G = rewards[t] + self.gamma * G
                manager_returns[t] = G
            
            manager_advantages = manager_returns - manager_values.squeeze()
            manager_loss = (manager_advantages ** 2).mean()
            
            total_rewards = rewards[:-1] + intrinsic_rewards
            worker_returns = torch.zeros_like(total_rewards)
            G = 0
            for t in reversed(range(len(total_rewards))):
                G = total_rewards[t] + self.gamma * G
                worker_returns[t] = G
            
            worker_advantages = worker_returns - worker_values[:-1].squeeze()
            
            selected_action_probs = action_probs[:-1].gather(1, actions[:-1].unsqueeze(1)).squeeze()
            worker_policy_loss = -(torch.log(selected_action_probs) * worker_advantages.detach()).mean()
            worker_value_loss = (worker_advantages ** 2).mean()
            worker_loss = worker_policy_loss + 0.5 * worker_value_loss
            
            total_manager_loss += manager_loss
            total_worker_loss += worker_loss
            num_updates += 1
        
        if num_updates == 0:
            return None
        
        avg_manager_loss = total_manager_loss / num_updates
        avg_worker_loss = total_worker_loss / num_updates
        total_loss = avg_manager_loss + avg_worker_loss
        
        self.optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.network.parameters(), max_norm=1.0)
        self.optimizer.step()
        
        self.manager_losses.append(avg_manager_loss.item())
        self.worker_losses.append(avg_worker_loss.item())
        
        return {
            'manager_loss': avg_manager_loss.item(),
            'worker_loss': avg_worker_loss.item(),
            'total_loss': total_loss.item()
        }

class HindsightExperienceReplay:
    """Hindsight Experience Replay for goal-conditioned RL."""
    
    def __init__(self, capacity, goal_dim, strategy='future', k=4):
        self.capacity = capacity
        self.goal_dim = goal_dim
        self.strategy = strategy  # 'future', 'final', 'episode', 'random'
        self.k = k  # Number of additional goals to sample
        
        self.buffer = []
        self.position = 0
    
    def push_episode(self, episode_trajectory):
        """Store an entire episode and generate hindsight goals."""
        if not episode_trajectory:
            return
        
        # Store the original transitions before adding relabeled copies
        for transition in episode_trajectory:
            self._store_transition(transition)
        
        if self.strategy == 'future':
            self._generate_future_goals(episode_trajectory)
        elif self.strategy == 'final':
            self._generate_final_goals(episode_trajectory)
        elif self.strategy == 'episode':
            self._generate_episode_goals(episode_trajectory)
        elif self.strategy == 'random':
            self._generate_random_goals(episode_trajectory)
    
    def _store_transition(self, transition):
        """Store a single transition in the buffer."""
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        
        self.buffer[self.position] = transition
        self.position = (self.position + 1) % self.capacity
    
    def _generate_future_goals(self, episode_trajectory):
        """Generate goals from future achieved goals in the episode."""
        num_steps = len(episode_trajectory)
        for i in range(num_steps):
            # Sample up to k distinct time steps from the current step onward.
            future_indices = np.random.choice(
                np.arange(i, num_steps),
                size=min(self.k, num_steps - i),
                replace=False
            )
            
            state, action, _, next_state, _, achieved_goal = episode_trajectory[i]
            for future_idx in future_indices:
                new_goal = episode_trajectory[future_idx][5]
                
                # Compare what this step achieved against the relabeled goal.
                reward = self._compute_goal_reward(achieved_goal, new_goal)
                
                hindsight_transition = (state, action, reward, next_state, new_goal, achieved_goal)
                self._store_transition(hindsight_transition)
    
    def _generate_final_goals(self, episode_trajectory):
        """Use the final achieved goal for all transitions."""
        if not episode_trajectory:
            return
        
        final_achieved_goal = episode_trajectory[-1][5]  # achieved_goal from last transition
        
        for transition in episode_trajectory:
            state, action, _, next_state, _, achieved_goal = transition
            reward = self._compute_goal_reward(achieved_goal, final_achieved_goal)
            
            hindsight_transition = (state, action, reward, next_state, final_achieved_goal, achieved_goal)
            self._store_transition(hindsight_transition)
    
    def _compute_goal_reward(self, achieved_goal, desired_goal, threshold=0.1):
        """Compute reward based on goal achievement."""
        # np.asarray lets goals be passed as lists or tuples as well as arrays.
        distance = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal))
        return 1.0 if distance < threshold else -1.0
    
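    def sample(self, batch_size):
        """Sample a batch of transitions."""
        if len(self.buffer) < batch_size:
            return None
        
        indices = np.random.choice(len(self.buffer), batch_size, replace=False)
        batch = [self.buffer[i] for i in indices]
        
        states, actions, rewards, next_states, goals, achieved_goals = zip(*batch)
        
        # Stack into NumPy arrays first; building tensors from tuples of arrays is slow.
        return (
            torch.as_tensor(np.array(states), dtype=torch.float32),
            torch.as_tensor(np.array(actions), dtype=torch.long),
            torch.as_tensor(np.array(rewards), dtype=torch.float32),
            torch.as_tensor(np.array(next_states), dtype=torch.float32),
            torch.as_tensor(np.array(goals), dtype=torch.float32),
            torch.as_tensor(np.array(achieved_goals), dtype=torch.float32)
        )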
    
    def __len__(self):
        return len(self.buffer)

def demonstrate_hierarchical_rl():
    """Demonstrate hierarchical RL approaches."""
    print("🏗️ Demonstrating Hierarchical Reinforcement Learning")
    
    env = SimpleGridWorld(size=8)
    
    agents = {
        'Options-Critic': OptionsCriticAgent(state_dim=2, action_dim=4, num_options=4),
        'Feudal Network': FeudalAgent(state_dim=2, action_dim=4, goal_dim=4)
    }
    
    results = {name: {'rewards': [], 'episode_lengths': []} for name in agents.keys()}
    
    num_episodes = 200
    
    for episode in range(num_episodes):
        for agent_name, agent in agents.items():
            state = env.reset()
            episode_reward = 0
            episode_length = 0
            trajectory = []
            
            for step in range(100):  # Max episode length
                if agent_name == 'Options-Critic':
                    action, option = agent.act(state)
                else:
                    action = agent.act(state)
                
                next_state, reward, done, _ = env.step(action)
                episode_reward += reward
                episode_length += 1
                
                # Options-Critic records the active option; the feudal agent records next_state.
                if agent_name == 'Options-Critic':
                    trajectory.append((state, action, reward, option))
                else:
                    trajectory.append((state, action, reward, next_state))
                
                if done:
                    break
                
                state = next_state
            
            if agent_name == 'Options-Critic' and len(trajectory) > 1:
                agent.update(trajectory)
            elif agent_name == 'Feudal Network' and len(trajectory) > 1:
                agent.update([trajectory])
            
            results[agent_name]['rewards'].append(episode_reward)
            results[agent_name]['episode_lengths'].append(episode_length)
    
    return results, agents

print("🚀 Starting Hierarchical RL Demonstration!")
hierarchical_results, hierarchical_agents = demonstrate_hierarchical_rl()

fig, axes = plt.subplots(2, 2, figsize=(15, 12))

for agent_name, data in hierarchical_results.items():
    window_size = 20
    if len(data['rewards']) >= window_size:
        smoothed_rewards = pd.Series(data['rewards']).rolling(window_size).mean()
        axes[0, 0].plot(smoothed_rewards, label=agent_name, linewidth=2)

axes[0, 0].set_title('Hierarchical RL Learning Curves')
axes[0, 0].set_xlabel('Episode')
axes[0, 0].set_ylabel('Episode Reward (Smoothed)')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

for agent_name, data in hierarchical_results.items():
    window_size = 20
    if len(data['episode_lengths']) >= window_size:
        smoothed_lengths = pd.Series(data['episode_lengths']).rolling(window_size).mean()
        axes[0, 1].plot(smoothed_lengths, label=agent_name, linewidth=2)

axes[0, 1].set_title('Episode Length Over Time')
axes[0, 1].set_xlabel('Episode')
axes[0, 1].set_ylabel('Episode Length (Smoothed)')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

if 'Options-Critic' in hierarchical_agents:
    agent = hierarchical_agents['Options-Critic']
    axes[1, 0].bar(range(agent.num_options), agent.option_usage)
    axes[1, 0].set_title('Option Usage Distribution')
    axes[1, 0].set_xlabel('Option ID')
    axes[1, 0].set_ylabel('Usage Count')
    axes[1, 0].grid(True, alpha=0.3)

if 'Options-Critic' in hierarchical_agents:
    agent = hierarchical_agents['Options-Critic']
    if agent.option_lengths:
        axes[1, 1].hist(agent.option_lengths, bins=20, alpha=0.7, edgecolor='black')
        axes[1, 1].set_title('Option Length Distribution')
        axes[1, 1].set_xlabel('Option Length')
        axes[1, 1].set_ylabel('Frequency')
        axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Hierarchical RL Results:")
for agent_name, data in hierarchical_results.items():
    final_perf = np.mean(data['rewards'][-20:])
    avg_length = np.mean(data['episode_lengths'][-20:])
    print(f"  {agent_name}:")
    print(f"    Final Performance: {final_perf:.2f}")
    print(f"    Avg Episode Length: {avg_length:.2f}")

if 'Options-Critic' in hierarchical_agents:
    agent = hierarchical_agents['Options-Critic']
    # Convert once so the comparisons below also work if option_usage is a plain list.
    option_usage = np.asarray(agent.option_usage)
    print("\n🎯 Options-Critic Analysis:")
    print(f"  Options Used: {np.sum(option_usage > 0)} / {agent.num_options}")
    print(f"  Most Used Option: {np.argmax(option_usage)}")
    if agent.option_lengths:
        print(f"  Avg Option Length: {np.mean(agent.option_lengths):.2f}")

print("\n💡 Key Insights:")
print("  • Hierarchical methods learn temporal abstractions")
print("  • Options provide reusable behavioral primitives")
print("  • Feudal networks enable goal-directed exploration")
print("  • HRL can help scale to complex, long-horizon tasks")


# Section 5: Comprehensive Evaluation and Advanced Techniques Integration

## 5.1 Multi-Method Performance Analysis

This section compares all of the advanced Deep RL techniques implemented in this assignment:

### Performance Metrics
1. **Sample Efficiency**: Episodes to convergence
2. **Final Performance**: Asymptotic reward
3. **Robustness**: Performance variance
4. **Computational Efficiency**: Training time and memory usage
5. **Transfer Capability**: Performance on related tasks
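
The first three metrics can be computed directly from per-episode reward logs. The sketch below is one hypothetical way to do that and is not the evaluator defined later in this notebook: the function names `episodes_to_convergence` and `summarize_runs`, the `target` threshold, and the 20-episode window are all illustrative assumptions.

```python
import numpy as np

def episodes_to_convergence(rewards, target, window=20):
    """First episode index whose trailing `window`-episode mean reaches `target`, else None."""
    rewards = np.asarray(rewards, dtype=float)
    if len(rewards) < window:
        return None
    # Moving average of every full window, computed from cumulative sums.
    csum = np.cumsum(np.insert(rewards, 0, 0.0))
    means = (csum[window:] - csum[:-window]) / window
    hits = np.nonzero(means >= target)[0]
    return int(hits[0]) + window - 1 if hits.size else None

def summarize_runs(runs, target, window=20):
    """Aggregate sample efficiency, final performance, and robustness over independent runs."""
    convergence, final = [], []
    for run in runs:
        episode = episodes_to_convergence(run, target, window)
        # Runs that never converge are charged their full length (an assumption).
        convergence.append(len(run) if episode is None else episode)
        final.append(float(np.mean(run[-window:])))
    return {
        "episodes_to_convergence": float(np.mean(convergence)),
        "final_performance": float(np.mean(final)),
        "robustness_std": float(np.std(final)),  # spread of final reward across runs
    }
```

Reporting the standard deviation across seeds next to the mean keeps a single lucky run from dominating the comparison.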

### Evaluation Framework
We evaluate methods across multiple dimensions:
- **Simple Tasks**: Basic navigation and control
- **Complex Tasks**: Multi-step reasoning and planning
- **Transfer Tasks**: Adaptation to new environments
- **Long-Horizon Tasks**: Extended episode planning

## 5.2 Practical Implementation Considerations

### When to Use Each Method:

#### Model-Free Methods (DQN, Policy Gradient)
- ✅ **Use when**: Simple tasks, abundant data, unknown dynamics
- ❌ **Avoid when**: Sample efficiency critical, complex planning needed

#### Model-Based Methods  
- ✅ **Use when**: Sample efficiency critical, dynamics learnable
- ❌ **Avoid when**: Dynamics are hard to learn accurately (raw high-dimensional observations, highly stochastic transitions), since model errors compound over rollouts

#### World Models
- ✅ **Use when**: Rich sensory input, imagination beneficial
- ❌ **Avoid when**: Simple state spaces, real-time constraints

#### Hierarchical Methods
- ✅ **Use when**: Long-horizon tasks, reusable skills needed
- ❌ **Avoid when**: Simple tasks, flat action spaces

#### Sample Efficiency Techniques
- ✅ **Use when**: Limited data, expensive environments
- ❌ **Avoid when**: Abundant cheap data, simple tasks
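
The guidance for model-based methods depends on whether the dynamics are learnable, because small one-step model errors grow when the model is rolled out over many steps. The toy sketch below illustrates this with a scalar linear system; the system, the 2% parameter error, and the function name `rollout_error` are assumptions chosen for illustration and are not taken from the assignment's world model.

```python
import numpy as np

def rollout_error(true_a, model_a, x0=1.0, horizon=20):
    """Gap between true and model state at each step of an open-loop rollout of x_{t+1} = a * x_t."""
    x_true, x_model = x0, x0
    errors = []
    for _ in range(horizon):
        x_true = true_a * x_true      # real environment step
        x_model = model_a * x_model   # imagined step with a slightly wrong model
        errors.append(abs(x_true - x_model))
    return np.array(errors)

# A 2% error in the dynamics parameter grows with every imagined step.
errors = rollout_error(true_a=1.0, model_a=1.02)
print(f"1-step error: {errors[0]:.3f}, 20-step error: {errors[-1]:.3f}")
```

This is one reason imagination-based agents commonly keep imagined rollouts short or restart them from real states.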

## 5.3 Advanced Techniques Summary

This comprehensive assignment covered cutting-edge Deep RL methods:

### Core Contributions:
1. **Sample Efficiency**: Prioritized replay, data augmentation, auxiliary tasks
2. **World Models**: VAE-based dynamics, imagination planning
3. **Transfer Learning**: Shared representations, meta-learning
4. **Hierarchical Learning**: Options framework, feudal networks
5. **Integration**: Multi-method evaluation and practical guidelines
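
As a compact reminder of the first item, proportional prioritized replay samples transition $i$ with probability $P(i) = p_i^{\alpha} / \sum_k p_k^{\alpha}$ and corrects the resulting bias with importance weights $w_i = (N \cdot P(i))^{-\beta}$, normalized by the largest weight. The sketch below computes both quantities from TD errors. It assumes the standard proportional variant; the `PrioritizedReplayBuffer` used in this notebook may differ in detail, and the function name and default values of `alpha`, `beta`, and `eps` are illustrative.

```python
import numpy as np

def prioritized_probs_and_weights(td_errors, alpha=0.6, beta=0.4, eps=1e-6):
    """Sampling probabilities and normalized importance weights for proportional prioritization."""
    # eps keeps transitions with zero TD error sampleable.
    priorities = (np.abs(np.asarray(td_errors, dtype=float)) + eps) ** alpha
    probs = priorities / priorities.sum()
    # Importance weights undo the sampling bias; dividing by the max keeps them <= 1.
    weights = (len(probs) * probs) ** (-beta)
    weights = weights / weights.max()
    return probs, weights

probs, weights = prioritized_probs_and_weights([0.1, 1.0, 2.0])
```

Transitions with large TD error are drawn more often but receive smaller weights in the loss, so the expected update stays close to that of uniform sampling as `beta` approaches 1.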

In [None]:

class AdvancedRLEvaluator:
    """Comprehensive evaluation framework for advanced RL methods."""
    
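    def __init__(self, environments, agents, metrics=None):
        self.environments = environments
        self.agents = agents
        # Avoid a mutable default argument; fall back to the standard metric set.
        self.metrics = metrics if metrics is not None else ['reward', 'sample_efficiency', 'robustness']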
        self.results = {}
        
        self.num_trials = 5
        self.num_episodes = 300
        self.evaluation_interval = 50
        
    def evaluate_sample_efficiency(self, agent, env, convergence_threshold=0.8):
        """Measure episodes to convergence."""
        max_rewards = []
        convergence_episodes = []
        
        for trial in range(self.num_trials):
            episode_rewards = []
            
            if hasattr(agent, 'reset'):
                agent.reset()
            
            for episode in range(self.num_episodes):
                state = env.reset()
                episode_reward = 0
                
                for step in range(100):
                    if hasattr(agent, 'act'):
                        action = agent.act(state)
                        if isinstance(action, tuple):  # e.g. Options-Critic returns (action, option)
                            action = action[0]
                    else:
                        action = env.action_space.sample()
                    
                    next_state, reward, done, _ = env.step(action)
                    episode_reward += reward
                    
                    if hasattr(agent, 'replay_buffer'):
                        agent.replay_buffer.push(state, action, reward, next_state, done)
                        if len(agent.replay_buffer) > 32:
                            if hasattr(agent, 'update'):
                                agent.update(32)
                    
                    if done:
                        break
                    
                    state = next_state
                
                episode_rewards.append(episode_reward)
                
                if len(episode_rewards) >= 20:
                    recent_performance = np.mean(episode_rewards[-20:])
                    if recent_performance >= convergence_threshold * np.max(episode_rewards[:max(1, episode-20)]):
                        convergence_episodes.append(episode)
                        break
            
            max_rewards.append(np.max(episode_rewards))
            if not convergence_episodes or len(convergence_episodes) <= trial:
                convergence_episodes.append(self.num_episodes)
        
        return {
            'convergence_episodes': np.mean(convergence_episodes),
            'convergence_std': np.std(convergence_episodes),
            'max_reward': np.mean(max_rewards),
            'max_reward_std': np.std(max_rewards)
        }
    
    def evaluate_transfer_capability(self, agent, source_env, target_envs):
        """Evaluate transfer learning capability."""
        source_performance = []
        
        for episode in range(100):  # Limited training
            state = source_env.reset()  # Start every episode from a fresh state
            episode_reward = 0
            for step in range(50):
                action = agent.act(state) if hasattr(agent, 'act') else source_env.action_space.sample()
                if isinstance(action, tuple):  # e.g. Options-Critic returns (action, option)
                    action = action[0]
                next_state, reward, done, _ = source_env.step(action)
                episode_reward += reward
                
                if hasattr(agent, 'replay_buffer'):
                    agent.replay_buffer.push(state, action, reward, next_state, done)
                    if len(agent.replay_buffer) > 32 and hasattr(agent, 'update'):
                        agent.update(32)
                
                if done:
                    break
                state = next_state
            
            source_performance.append(episode_reward)
        
        transfer_results = {}
        for target_name, target_env in target_envs.items():
            target_rewards = []
            
            for episode in range(20):  # Quick evaluation
                state = target_env.reset()
                episode_reward = 0
                
                for step in range(50):
                    action = agent.act(state) if hasattr(agent, 'act') else target_env.action_space.sample()
                    if isinstance(action, tuple):  # e.g. Options-Critic returns (action, option)
                        action = action[0]
                    next_state, reward, done, _ = target_env.step(action)
                    episode_reward += reward
                    
                    if done:
                        break
                    state = next_state
                
                target_rewards.append(episode_reward)
            
            transfer_results[target_name] = {
                'mean_reward': np.mean(target_rewards),
                'std_reward': np.std(target_rewards)
            }
        
        return {
            'source_performance': np.mean(source_performance[-20:]),
            'transfer_results': transfer_results
        }
    
    def comprehensive_evaluation(self):
        """Run comprehensive evaluation across all agents and environments."""
        print("🔬 Starting Comprehensive Evaluation...")
        
        for agent_name, agent in self.agents.items():
            print(f"\n📊 Evaluating {agent_name}...")
            self.results[agent_name] = {}
            
            if 'sample_efficiency' in self.metrics:
                env = self.environments[0] if self.environments else SimpleGridWorld(size=5)
                efficiency_results = self.evaluate_sample_efficiency(agent, env)
                self.results[agent_name]['sample_efficiency'] = efficiency_results
                print(f"  Sample Efficiency: {efficiency_results['convergence_episodes']:.1f} ± {efficiency_results['convergence_std']:.1f} episodes")
            
            if 'transfer' in self.metrics and len(self.environments) > 1:
                source_env = self.environments[0]
                target_envs = {f'env_{i}': env for i, env in enumerate(self.environments[1:])}
                transfer_results = self.evaluate_transfer_capability(agent, source_env, target_envs)
                self.results[agent_name]['transfer'] = transfer_results
                print(f"  Transfer Capability: Source performance {transfer_results['source_performance']:.2f}")
        
        return self.results
    
    def generate_report(self):
        """Generate comprehensive evaluation report."""
        if not self.results:
            self.comprehensive_evaluation()
        
        print("\n" + "=" * 60)
        print("🏆 COMPREHENSIVE EVALUATION REPORT")
        print("=" * 60)
        
        # Collect both rankings up front so the recommendations below never
        # reference an undefined list when a metric was not evaluated.
        efficiency_scores = []
        performance_scores = []
        for agent_name, results in self.results.items():
            if 'sample_efficiency' in results:
                efficiency_scores.append((agent_name, results['sample_efficiency']['convergence_episodes']))
                performance_scores.append((agent_name, results['sample_efficiency']['max_reward']))
        
        if efficiency_scores:
            print("\n📈 Sample Efficiency Ranking:")
            efficiency_scores.sort(key=lambda x: x[1])  # Lower is better
            for rank, (agent_name, score) in enumerate(efficiency_scores, 1):
                print(f"  {rank}. {agent_name}: {score:.1f} episodes to convergence")
        
        print("\n🎯 Final Performance Comparison:")
        performance_scores.sort(key=lambda x: x[1], reverse=True)  # Higher is better
        for rank, (agent_name, score) in enumerate(performance_scores, 1):
            print(f"  {rank}. {agent_name}: {score:.2f} max reward")
        
        print("\n💡 Method Recommendations:")
        
        if efficiency_scores:
            best_efficiency = efficiency_scores[0][0]
            print(f"  • Best Sample Efficiency: {best_efficiency}")
        
        if performance_scores:
            best_performance = performance_scores[0][0]
            print(f"  • Best Final Performance: {best_performance}")
        
        print("\n🔧 Implementation Guidelines:")
        print("  • Use prioritized replay for sample efficiency")
        print("  • Apply data augmentation for robustness")
        print("  • Consider world models for planning tasks")
        print("  • Employ hierarchical methods for long-horizon problems")
        print("  • Leverage transfer learning for related domains")

class IntegratedAdvancedAgent:
    """Agent integrating multiple advanced RL techniques."""
    
    def __init__(self, state_dim, action_dim, config=None):
        self.state_dim = state_dim
        self.action_dim = action_dim
        
        default_config = {
            'use_prioritized_replay': True,
            'use_auxiliary_tasks': True,
            'use_data_augmentation': True,
            'use_world_model': False,
            'use_hierarchical': False,
            'lr': 1e-3,
            'buffer_size': 10000
        }
        self.config = {**default_config, **(config or {})}
        
        self._initialize_components()
        
        self.training_stats = {
            'episode_rewards': [],
            'losses': [],
            'sample_efficiency': [],
            'component_usage': {}
        }
    
    def _initialize_components(self):
        """Initialize RL components based on configuration."""
        if self.config['use_auxiliary_tasks']:
            self.network = DataAugmentationDQN(self.state_dim, self.action_dim)
        else:
            self.network = DQNAgent(self.state_dim, self.action_dim).network
        
        self.target_network = copy.deepcopy(self.network)
        
        if self.config['use_prioritized_replay']:
            self.replay_buffer = PrioritizedReplayBuffer(self.config['buffer_size'])
        else:
            self.replay_buffer = ReplayBuffer(self.config['buffer_size'])
        
        trainable_params = list(self.network.parameters())
        if self.config['use_world_model']:
            self.world_model = VariationalWorldModel(self.state_dim, self.action_dim)
            # update() adds the world-model loss to the total loss, so its parameters
            # must be optimized too (assumes VariationalWorldModel is an nn.Module).
            trainable_params += list(self.world_model.parameters())
        
        if self.config['use_hierarchical']:
            self.hierarchical_agent = OptionsCriticAgent(self.state_dim, self.action_dim)
        
        self.optimizer = optim.Adam(trainable_params, lr=self.config['lr'])
        
        self.gamma = 0.99
        self.update_count = 0
        self.target_update_freq = 100
    
    def act(self, state, epsilon=0.1):
        """Select action using integrated approach."""
        if self.config['use_hierarchical']:
            action, option = self.hierarchical_agent.act(state)
            self.training_stats['component_usage']['hierarchical'] = \
                self.training_stats['component_usage'].get('hierarchical', 0) + 1
            return action
        
        if np.random.random() < epsilon:
            return np.random.randint(self.action_dim)
        
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            
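            # Only augment when the active network actually provides apply_augmentation.
            if (self.config['use_data_augmentation'] and hasattr(self.network, 'apply_augmentation')
                    and np.random.random() < 0.1):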
                state_tensor = self.network.apply_augmentation(state_tensor, 'noise')
            
            q_values = self.network(state_tensor)
            if isinstance(q_values, tuple):
                q_values = q_values[0]  # Extract Q-values from auxiliary network
            
            return q_values.argmax().item()
    
    def update(self, batch_size=32):
        """Update agent using integrated advanced techniques."""
        if isinstance(self.replay_buffer, PrioritizedReplayBuffer):
            sample_result = self.replay_buffer.sample(batch_size)
            if sample_result is None:
                return None
            experiences, indices, weights = sample_result
        else:
            batch = self.replay_buffer.sample(batch_size)
            if batch is None:
                return None
            experiences = batch
            weights = torch.ones(batch_size)
            indices = None
        
        states, actions, rewards, next_states, dones = experiences
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        dones = torch.BoolTensor(dones)
        weights = torch.FloatTensor(weights) if not isinstance(weights, torch.Tensor) else weights
        
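        # The plain DQN network (used when auxiliary tasks are off) may lack apply_augmentation.
        if self.config['use_data_augmentation'] and hasattr(self.network, 'apply_augmentation'):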
            aug_type = np.random.choice(['noise', 'dropout', 'scaling'])
            states = self.network.apply_augmentation(states, aug_type)
            next_states = self.network.apply_augmentation(next_states, aug_type)
            self.training_stats['component_usage']['augmentation'] = \
                self.training_stats['component_usage'].get('augmentation', 0) + 1
        
        if self.config['use_auxiliary_tasks']:
            current_q_values, reward_pred, next_state_pred = self.network(states, actions)
            current_q_values = current_q_values.gather(1, actions.unsqueeze(1)).squeeze()
        else:
            current_q_values = self.network(states).gather(1, actions.unsqueeze(1)).squeeze()
        
        with torch.no_grad():
            if self.config['use_auxiliary_tasks'] and hasattr(self.target_network, 'forward'):
                next_q_values = self.target_network(next_states)
                if isinstance(next_q_values, tuple):
                    next_q_values = next_q_values[0]
            else:
                next_q_values = self.target_network(next_states)
            
            max_next_q_values = next_q_values.max(1)[0]
            target_q_values = rewards + (self.gamma * max_next_q_values * (~dones))
        
        td_errors = (current_q_values - target_q_values).detach()
        q_loss = (weights * F.mse_loss(current_q_values, target_q_values, reduction='none')).mean()
        
        # Build total_loss out of place: `total_loss += ...` would also modify q_loss,
        # because both names refer to the same tensor, and corrupt the reported Q-loss.
        total_loss = q_loss
        
        if self.config['use_auxiliary_tasks']:
            aux_reward_loss = F.mse_loss(reward_pred.squeeze(), rewards)
            aux_dynamics_loss = F.mse_loss(next_state_pred, next_states)
            total_loss = total_loss + 0.1 * aux_reward_loss + 0.1 * aux_dynamics_loss
            self.training_stats['component_usage']['auxiliary'] = \
                self.training_stats['component_usage'].get('auxiliary', 0) + 1
        
        if self.config['use_world_model']:
            world_model_loss = self.world_model.compute_loss(states, actions, next_states)
            total_loss = total_loss + 0.1 * world_model_loss
            self.training_stats['component_usage']['world_model'] = \
                self.training_stats['component_usage'].get('world_model', 0) + 1
        
        self.optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.network.parameters(), max_norm=1.0)
        self.optimizer.step()
        
        if indices is not None:
            self.replay_buffer.update_priorities(indices, td_errors.numpy())
        
        self.update_count += 1
        if self.update_count % self.target_update_freq == 0:
            self.target_network.load_state_dict(self.network.state_dict())
        
        self.training_stats['losses'].append(total_loss.item())
        
        return {
            'total_loss': total_loss.item(),
            'q_loss': q_loss.item()
        }

def comprehensive_advanced_rl_demo():
    """Comprehensive demonstration of all advanced RL techniques."""
    print("🎓 COMPREHENSIVE ADVANCED DEEP RL DEMONSTRATION")
    print("=" * 55)
    
    environments = [
        SimpleGridWorld(size=5),
        SimpleGridWorld(size=6),
        SimpleGridWorld(size=7)
    ]
    
    agents = {
        'Baseline DQN': DQNAgent(state_dim=2, action_dim=4),
        'Sample Efficient': SampleEfficientAgent(state_dim=2, action_dim=4),
        'Options-Critic': OptionsCriticAgent(state_dim=2, action_dim=4),
        'Feudal Network': FeudalAgent(state_dim=2, action_dim=4),
        'Integrated Advanced': IntegratedAdvancedAgent(
            state_dim=2, 
            action_dim=4, 
            config={
                'use_prioritized_replay': True,
                'use_auxiliary_tasks': True,
                'use_data_augmentation': True
            }
        )
    }
    
    evaluator = AdvancedRLEvaluator(
        environments=environments,
        agents=agents,
        metrics=['sample_efficiency', 'reward', 'transfer']
    )
    
    evaluator.generate_report()
    
    print("\n🎯 ADVANCED DEEP RL ASSIGNMENT COMPLETED!")
    print("\n📚 Concepts Covered:")
    print("  ✓ Model-Free vs Model-Based RL Comparison")
    print("  ✓ World Models with VAE Architecture")
    print("  ✓ Imagination-Based Planning")
    print("  ✓ Sample Efficiency Techniques")
    print("  ✓ Prioritized Experience Replay")
    print("  ✓ Data Augmentation & Auxiliary Tasks")
    print("  ✓ Transfer Learning & Meta-Learning")
    print("  ✓ Hierarchical Reinforcement Learning")
    print("  ✓ Options-Critic Architecture")
    print("  ✓ Feudal Networks")
    print("  ✓ Comprehensive Evaluation Framework")
    
    print("\n🔬 Key Takeaways:")
    print("  • Advanced RL methods address sample efficiency and scalability")
    print("  • World models enable planning and imagination")
    print("  • Hierarchical methods tackle long-horizon tasks")
    print("  • Transfer learning accelerates adaptation")
    print("  • Integration of techniques often yields best results")
    
    print("\n🚀 Ready for Real-World Advanced RL Applications!")
    
    return evaluator.results

print("Starting final comprehensive demonstration...")
final_results = comprehensive_advanced_rl_demo()

print("\n" + "=" * 60)
print("📖 ASSIGNMENT 13: ADVANCED DEEP RL - COMPLETE! ✅")
print("=" * 60)


# CA13: Advanced Deep Reinforcement Learning - Model-Free vs Model-Based Methods and Real-World Applications

## Deep Reinforcement Learning - Session 13

**Advanced Deep RL Topics: Model-Free vs Model-Based Methods, World Models, and Real-World Deployment**

This notebook explores advanced deep reinforcement learning concepts, including the comparison between model-free and model-based approaches, world models, sample efficiency techniques, transfer learning, and practical considerations for real-world deployment.

### Learning Objectives:
1. Understand the fundamental differences between model-free and model-based RL
2. Implement and compare various world modeling approaches
3. Master sample-efficient learning techniques and transfer learning
4. Explore hierarchical reinforcement learning and temporal abstraction
5. Understand safe reinforcement learning and constrained optimization
6. Implement real-world deployment strategies and robustness techniques
7. Analyze offline reinforcement learning and batch methods
8. Apply meta-learning and few-shot adaptation in RL contexts

### Notebook Structure:
1. **Model-Free vs Model-Based RL** - Theoretical foundations and trade-offs
2. **World Models and Imagination** - Learning environment dynamics
3. **Sample Efficiency Techniques** - Maximizing learning from limited data
4. **Hierarchical Reinforcement Learning** - Temporal abstraction and skills
5. **Safe and Constrained RL** - Safety-aware learning algorithms
6. **Transfer Learning and Meta-Learning** - Knowledge reuse and adaptation
7. **Offline and Batch RL** - Learning from pre-collected data
8. **Real-World Applications** - Deployment strategies and case studies

---

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Normal, Categorical

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import deque, defaultdict
import random
import gym
import copy
from typing import List, Tuple, Dict, Optional, Union
import pickle
import json
import time
from dataclasses import dataclass
from abc import ABC, abstractmethod

import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import pandas as pd

plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"🚀 Using device: {device}")
print(f"📊 PyTorch version: {torch.__version__}")
print(f"🤖 Starting Advanced Deep RL Session 13!")
