In [3]:
# Setup sys.path for CA13 package imports
import sys
import os
sys.path.insert(0, os.path.abspath("."))
sys.path.insert(0, os.path.abspath(".."))
print("Configured sys.path for CA13 imports")


Configured sys.path for CA13 imports


# Computer Assignment 13: Advanced Model-Based RL and World Models

## Course Information
- **Course**: Deep Reinforcement Learning (DRL)
- **Instructor**: Dr. [Instructor Name]
- **Institution**: Sharif University of Technology
- **Semester**: Fall 2024
- **Assignment Number**: CA13

## Learning Objectives

By completing this assignment, students will be able to:

1. **Understand Model-Based vs Model-Free RL Trade-offs**: Analyze the fundamental differences between model-free and model-based reinforcement learning approaches, including their respective advantages, limitations, and appropriate use cases.

2. **Master World Model Architectures**: Design and implement variational world models using VAEs for learning compact latent representations of environment dynamics, including encoder-decoder architectures and stochastic dynamics modeling.

3. **Implement Imagination-Based Learning**: Develop agents that leverage learned world models for planning and decision-making in latent space, enabling sample-efficient learning through imagined trajectories.

4. **Apply Sample Efficiency Techniques**: Utilize advanced techniques such as prioritized experience replay, data augmentation, and auxiliary tasks to improve learning efficiency in deep RL.

5. **Design Transfer Learning Systems**: Build agents capable of transferring knowledge across related tasks through shared representations, fine-tuning, and meta-learning approaches.

6. **Develop Hierarchical RL Frameworks**: Implement hierarchical decision-making systems using the options framework, enabling temporal abstraction and skill composition for solving complex tasks.

## Prerequisites

Before starting this assignment, ensure you have:

- **Mathematical Background**:
  - Probability theory and stochastic processes
  - Linear algebra and matrix operations
  - Optimization and gradient-based methods
  - Information theory (KL divergence, entropy)

- **Technical Skills**:
  - Python programming with PyTorch
  - Deep learning fundamentals (neural networks, autoencoders)
  - Basic reinforcement learning concepts (MDPs, value functions, policies)
  - Experience with Gymnasium environments

- **Prior Knowledge**:
  - Completion of CA1-CA12 assignments
  - Understanding of model-free RL algorithms (DQN, policy gradients)
  - Familiarity with neural network architectures

## Roadmap

This assignment is structured as follows:

### Section 1: Model-free vs. Model-based Reinforcement Learning
- Theoretical foundations of model-free and model-based approaches
- Mathematical formulations and trade-off analysis
- Hybrid algorithms combining both paradigms
- Practical implementation and comparison

### Section 2: World Models and Imagination-based Learning
- Variational autoencoders for world modeling
- Stochastic dynamics prediction in latent space
- Imagination-based planning and policy optimization
- Dreamer algorithm and modern variants

### Section 3: Sample Efficiency and Transfer Learning
- Prioritized experience replay and data augmentation
- Auxiliary tasks for improved learning
- Transfer learning techniques and meta-learning
- Domain adaptation and curriculum learning

### Section 4: Hierarchical Reinforcement Learning
- Options framework and temporal abstraction
- Hierarchical policy architectures
- Skill discovery and composition
- Applications to complex task domains

## Project Structure

```
CA13/
├── CA13.ipynb              # Main assignment notebook
├── agents/                 # RL agent implementations
│   ├── model_free_agent.py   # Model-free RL agents
│   ├── model_based_agent.py  # Model-based RL agents
│   ├── world_model_agent.py  # World model-based agents
│   └── hierarchical_agent.py # Hierarchical RL agents
├── models/                 # Neural network architectures
│   ├── world_model.py      # VAE-based world models
│   ├── dynamics_model.py   # Environment dynamics models
│   └── policy_networks.py  # Hierarchical policy networks
├── environments/           # Custom environments
│   ├── wrappers.py         # Environment wrappers
│   └── complex_tasks.py    # Complex task environments
├── experiments/            # Training and evaluation scripts
│   ├── train_world_model.py  # World model training
│   ├── compare_efficiency.py # Sample efficiency comparison
│   └── transfer_learning.py  # Transfer learning experiments
└── utils/                  # Utility functions
    ├── visualization.py    # Plotting and analysis tools
    ├── data_augmentation.py # Data augmentation utilities
    └── evaluation.py       # Performance evaluation metrics
```

## Contents Overview

### Theoretical Foundations
- **Model-Based RL Mathematics**: Transition and reward model learning, planning algorithms
- **World Model Theory**: Variational inference, latent space dynamics, imagination-based learning
- **Sample Efficiency**: Experience replay, prioritization, auxiliary learning objectives
- **Transfer Learning**: Representation learning, fine-tuning, meta-learning algorithms

### Implementation Components
- **VAE World Models**: Encoder-decoder architectures with stochastic latent variables
- **Imagination-Based Agents**: Planning in learned latent space using world models
- **Sample-Efficient Algorithms**: Prioritized replay, data augmentation, auxiliary tasks
- **Transfer Learning Systems**: Multi-task learning, fine-tuning, domain adaptation

### Advanced Topics
- **Hierarchical RL**: Options framework, skill hierarchies, temporal abstraction
- **Meta-Learning**: Few-shot adaptation, gradient-based meta-learning
- **Curriculum Learning**: Automatic difficulty progression, teacher-student frameworks

## Evaluation Criteria

Your implementation will be evaluated based on:

1. **Correctness (40%)**: Accurate implementation of algorithms and mathematical formulations
2. **Efficiency (25%)**: Sample efficiency improvements and computational performance
3. **Innovation (20%)**: Creative extensions and novel approaches to the problems
4. **Analysis (15%)**: Quality of experimental analysis and insights

## Getting Started

1. **Environment Setup**: Ensure all dependencies are installed
2. **Code Review**: Understand the provided base implementations
3. **Incremental Development**: Start with simpler components and build complexity
4. **Testing**: Validate each component before integration
5. **Experimentation**: Run comprehensive experiments and analyze results

## Expected Outcomes

By the end of this assignment, you will have:

- **Comprehensive Understanding**: Deep knowledge of advanced model-based RL techniques
- **Practical Skills**: Ability to implement complex RL systems from scratch
- **Research Perspective**: Insight into current challenges and future directions
- **Portfolio Piece**: High-quality implementation demonstrating advanced RL capabilities

---

**Note**: This assignment represents the culmination of the Deep RL course, integrating concepts from model-free and model-based learning, advanced architectures, and practical deployment considerations. Focus on understanding the theoretical foundations while developing robust, efficient implementations.

Let's begin our exploration of advanced model-based reinforcement learning and world models! 🚀

In [2]:
# Import implementations from CA13 package modules
from agents.model_free import ModelFreeAgent, DQNAgent
from agents.model_based import ModelBasedAgent
from buffers.replay_buffer import ReplayBuffer
from environments.grid_world import SimpleGridWorld

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import deque
import random

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
print("✅ CA13 modules imported successfully")


Using device: cpu
✅ CA13 modules imported successfully


# Section 2: World Models and Imagination-based Learning

## 2.1 Theoretical Foundations of World Models

World models represent learned internal representations of environment dynamics that enable agents to "imagine" and plan without direct interaction with the environment.

### Core Concepts

**World Model Components:**
1. **Representation Learning**: Encode high-dimensional observations into compact latent states
2. **Dynamics Model**: Predict next latent state given current state and action
3. **Reward Model**: Predict rewards in the latent space
4. **Decoder Model**: Reconstruct observations from latent states

**Mathematical Framework:**
- **Encoder**: $z_t = \text{Encode}(o_t)$ maps observation $o_t$ to latent state $z_t$
- **Dynamics**: $z_{t+1} = f(z_t, a_t) + \epsilon_t$ where $\epsilon_t \sim \mathcal{N}(0, \Sigma)$
- **Reward**: $r_t = R(z_t, a_t)$
- **Decoder**: $\hat{o}_t = \text{Decode}(z_t)$
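
The four components above can be sketched as a single PyTorch module. This is only an illustrative sketch: the class name `TinyWorldModel`, the helper `mlp`, and all layer sizes are assumptions made for this example, not the `VariationalWorldModel` shipped with the assignment. The dynamics head returns the mean prediction $f(z_t, a_t)$ only, and the noise term $\epsilon_t$ is omitted.

```python
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden_dim=32):
    """Small two-layer MLP used for every component (hypothetical helper)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, out_dim),
    )


class TinyWorldModel(nn.Module):
    """Deterministic world model with encoder, dynamics, reward, and decoder."""

    def __init__(self, obs_dim=8, action_dim=2, latent_dim=4):
        super().__init__()
        self.encoder = mlp(obs_dim, latent_dim)                    # z_t = Encode(o_t)
        self.dynamics = mlp(latent_dim + action_dim, latent_dim)   # f(z_t, a_t)
        self.reward = mlp(latent_dim + action_dim, 1)              # R(z_t, a_t)
        self.decoder = mlp(latent_dim, obs_dim)                    # Decode(z_t)

    def forward(self, obs, action):
        z = self.encoder(obs)
        za = torch.cat([z, action], dim=-1)
        z_next = self.dynamics(za)          # mean of the next latent state
        r = self.reward(za).squeeze(-1)     # one scalar reward per sample
        obs_recon = self.decoder(z)
        return z, z_next, r, obs_recon


model = TinyWorldModel()
obs = torch.randn(5, 8)       # batch of 5 observations
action = torch.randn(5, 2)    # batch of 5 continuous actions
z, z_next, r, obs_recon = model(obs, action)
```

Keeping the reward and dynamics heads on the same latent input means a planner can roll the model forward without ever calling the decoder.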

## 2.2 Variational World Models

### Variational Autoencoders (VAEs) for World Modeling

World models often use VAEs to learn stochastic latent representations:

**Encoder (Recognition Model):**
$$q_\phi(z_t | o_t) = \mathcal{N}(z_t; \mu_\phi(o_t), \sigma_\phi^2(o_t))$$

**Prior (Dynamics Model):**
$$p_\theta(z_{t+1} | z_t, a_t) = \mathcal{N}(z_{t+1}; \mu_\theta(z_t, a_t), \sigma_\theta^2(z_t, a_t))$$

**Decoder (Generative Model):**
$$p_\psi(o_t | z_t) = \mathcal{N}(o_t; \mu_\psi(z_t), \sigma_\psi^2(z_t))$$

**ELBO Objective:**
$$\mathcal{L}_{ELBO} = \mathbb{E}_{q_\phi(z|o)} [\log p_\psi(o|z)] - D_{KL}[q_\phi(z|o) || p(z)]$$
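
As a rough illustration of how the ELBO turns into a training loss, the sketch below computes its negative for a diagonal Gaussian encoder. It assumes a standard normal prior $p(z) = \mathcal{N}(0, I)$ and a unit-variance Gaussian decoder, so the reconstruction term reduces to half the squared error up to a constant. The function names `reparameterize` and `negative_elbo` are chosen for this example only.

```python
import torch


def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps so gradients flow through mu and logvar."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)


def negative_elbo(obs, recon, mu, logvar):
    """Return (loss, reconstruction term, KL term), each averaged over the batch.

    Assumes a unit-variance Gaussian decoder and a standard normal prior.
    """
    # -log p(o|z) up to an additive constant: 0.5 * ||o - recon||^2
    recon_term = 0.5 * ((recon - obs) ** 2).sum(dim=-1)
    # Closed-form KL[N(mu, sigma^2) || N(0, I)] for diagonal Gaussians
    kl_term = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return (recon_term + kl_term).mean(), recon_term.mean(), kl_term.mean()


# Sanity check: perfect reconstruction, mu = 1 and sigma = 1 in 3 latent dims
mu = torch.ones(2, 3)
logvar = torch.zeros(2, 3)
obs = torch.randn(2, 5)
z_sample = reparameterize(mu, logvar)
loss, recon_term, kl_term = negative_elbo(obs, obs.clone(), mu, logvar)
```

With $\mu = 1$ and $\sigma = 1$ in three dimensions, the KL term is $0.5 \cdot 3 = 1.5$, which makes a convenient check that the sign conventions are right.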

## 2.3 Planning in Learned Latent Space

Once a world model is learned, planning can be performed in the compact latent space:

### Model Predictive Control (MPC) in Latent Space
1. **Imagination Rollout**: Use world model to simulate future trajectories
2. **Action Optimization**: Optimize action sequences to maximize predicted rewards
3. **Execution**: Execute only the first action, then replan

**Planning Objective:**
$$a^*_{1:H} = \arg\max_{a_{1:H}} \mathbb{E}_{z_{1:H} \sim p_\theta} \left[ \sum_{t=1}^H R(z_t, a_t) \right]$$
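
One simple way to approximate this objective is random shooting: sample many candidate action sequences, roll each through the model, and keep the first action of the best one. The sketch below uses NumPy and a hand-written toy model (`toy_dynamics`, `toy_reward`) in place of a learned one. All names, the action range $[-1, 1]$, and the hyperparameters are assumptions for illustration.

```python
import numpy as np


def random_shooting_mpc(z0, dynamics_fn, reward_fn, action_dim,
                        horizon=10, num_candidates=256, rng=None):
    """Return the first action of the best sampled sequence and its return."""
    rng = np.random.default_rng() if rng is None else rng
    # Candidate action sequences, sampled uniformly in [-1, 1]
    actions = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))
    z = np.repeat(z0[None, :], num_candidates, axis=0)
    returns = np.zeros(num_candidates)
    for t in range(horizon):
        returns += reward_fn(z, actions[:, t])   # accumulate predicted reward
        z = dynamics_fn(z, actions[:, t])        # imagined latent transition
    best = int(np.argmax(returns))
    return actions[best, 0], returns[best]


# Toy stand-ins for a learned model: drive the latent state to the origin
def toy_dynamics(z, a):
    return z + 0.1 * a


def toy_reward(z, a):
    return -np.sum(z ** 2, axis=-1)


z0 = np.array([1.0, -1.0])
first_action, best_return = random_shooting_mpc(
    z0, toy_dynamics, toy_reward, action_dim=2, rng=np.random.default_rng(0)
)
```

In a receding-horizon loop, only `first_action` is executed before the planner is called again from the newly encoded latent state.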

### Dreamer Algorithm
Dreamer combines a learned world model with actor-critic learning on imagined latent trajectories:
1. **Collect Experience**: Gather real environment data
2. **Learn World Model**: Train VAE-based world model
3. **Imagine Trajectories**: Generate synthetic experience in latent space  
4. **Learn Behaviors**: Train actor-critic in imagined trajectories

## 2.4 Advantages and Challenges

### Advantages of World Models:
- **Sample Efficiency**: Learn from imagined experience
- **Transfer Learning**: Models can generalize across tasks
- **Interpretability**: Learned representations can be visualized
- **Planning**: Enable sophisticated planning algorithms

### Challenges:
- **Model Bias**: Errors compound during long rollouts
- **Representation Learning**: Compressing high-dimensional observations (e.g., images) into latents that keep task-relevant detail is difficult
- **Stochasticity**: Modeling complex stochastic dynamics
- **Computational Cost**: Training and maintaining world models

## 2.5 Modern Approaches

### MuZero
Combines tree search with learned models:
- Learns value, policy, and dynamics jointly
- Uses tree search for planning
- Achieves superhuman performance in Go, Chess, and Shogi

### DreamerV2/V3
Improvements to original Dreamer:
- Better regularization techniques
- Improved world model architectures
- Enhanced policy learning in imagination

### Model-based Meta-learning
Using world models for few-shot adaptation:
- Learn generalizable world model components
- Quickly adapt to new environments
- Transfer dynamics knowledge across domains

In [None]:
# Import VariationalWorldModel from package
from models.world_model import VariationalWorldModel

print("VariationalWorldModel imported from models.world_model package")
print("This model provides VAE-based world modeling for learning environment dynamics")


✅ VariationalWorldModel imported from models.world_model package
💡 This model provides VAE-based world modeling for learning environment dynamics


# Section 3: Sample Efficiency and Transfer Learning

## 3.1 Sample Efficiency Challenges in Deep RL

Sample efficiency is one of the most critical challenges in deep reinforcement learning, particularly for real-world applications where data collection is expensive or dangerous.

### Why Is Sample Efficiency Important?

**Real-World Constraints:**
- **Cost**: Real-world interactions can be expensive (robotics, autonomous vehicles)
- **Time**: Learning from millions of samples is often impractical
- **Safety**: Exploratory actions in safety-critical domains can be dangerous
- **Reproducibility**: Lower sample requirements make experiments cheaper to repeat across seeds, so results are easier to verify

**Sample Complexity Factors:**
- **Environment Complexity**: High-dimensional state/action spaces
- **Sparse Rewards**: Learning signals are infrequent
- **Stochasticity**: Environmental noise requires more samples
- **Exploration**: Discovering good policies requires extensive exploration

## 3.2 Sample Efficiency Techniques

### 3.2.1 Experience Replay and Prioritization

**Experience Replay Benefits:**
- Reuse past experiences multiple times
- Break temporal correlations in data
- Enable off-policy learning

**Prioritized Experience Replay:**
Prioritize experiences based on temporal difference (TD) error:
$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

Where $p_i = |\delta_i| + \epsilon$ and $\delta_i$ is the TD error.
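
A small worked example may make the formula concrete. The TD errors below are made up for illustration. The importance-sampling weights $w_i = (N \cdot P(i))^{-\beta}$, normalized by their maximum, are the usual companion to prioritized sampling and correspond to the `beta` parameter of the `PrioritizedReplayBuffer` class in this notebook.

```python
import numpy as np

# Hypothetical TD errors for four stored transitions
td_errors = np.array([0.5, -2.0, 0.1, 1.0])
alpha, beta, eps = 0.6, 0.4, 1e-6

# p_i = |delta_i| + eps
priorities = np.abs(td_errors) + eps

# P(i) = p_i^alpha / sum_k p_k^alpha
probs = priorities ** alpha / np.sum(priorities ** alpha)

# Importance-sampling weights correct the bias from non-uniform sampling
weights = (len(td_errors) * probs) ** (-beta)
weights /= weights.max()
```

The transition with the largest absolute TD error (index 1) is sampled most often, while the rarely sampled transition (index 2) receives the largest correction weight.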

### 3.2.2 Data Augmentation

**Techniques:**
- **Random Crops**: For image-based environments
- **Color Jittering**: Robust to lighting variations  
- **Random Shifts**: Translation invariance
- **Gaussian Noise**: Regularization effect

### 3.2.3 Auxiliary Tasks

Learn multiple tasks simultaneously to improve sample efficiency:
- **Pixel Control**: Predict pixel changes
- **Feature Control**: Control learned feature representations
- **Reward Prediction**: Predict future rewards
- **Value Function Replay**: Replay value function updates

## 3.3 Transfer Learning in Reinforcement Learning

Transfer learning enables agents to leverage knowledge from previous tasks to learn new tasks more efficiently.

### 3.3.1 Types of Transfer in RL

**Policy Transfer:**
$$\pi_{target}(a|s) = f(\pi_{source}(a|s), s, \theta_{adapt})$$

**Value Function Transfer:**
$$Q_{target}(s,a) = g(Q_{source}(s,a), s, a, \phi_{adapt})$$

**Representation Transfer:**
$$\phi_{target}(s) = h(\phi_{source}(s), \psi_{adapt})$$

### 3.3.2 Transfer Learning Approaches

#### Fine-tuning
1. Pre-train on source task
2. Initialize target model with source weights
3. Fine-tune on target task with lower learning rate

#### Progressive Networks
- Freeze source network columns
- Add new columns for target tasks
- Use lateral connections between columns

#### Universal Value Functions (UVF)
Learn value functions conditioned on goals:
$$Q(s, a, g) = \text{Value of action } a \text{ in state } s \text{ for goal } g$$

## 3.4 Meta-learning and Few-shot Adaptation

Meta-learning enables agents to quickly adapt to new tasks with limited experience.

### 3.4.1 Model-agnostic Meta-learning (MAML)

**Objective:**
$$\min_\theta \sum_{\tau \sim p(\mathcal{T})} \mathcal{L}_\tau(f_{\theta_\tau'})$$

Where $\theta_\tau' = \theta - \alpha \nabla_\theta \mathcal{L}_\tau(f_\theta)$

**MAML Algorithm:**
1. Sample batch of tasks
2. For each task, compute adapted parameters via gradient descent
3. Update meta-parameters using gradient through adaptation process
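
The three steps can be sketched on a toy problem. The example below is an assumption-laden illustration, not an RL implementation: each "task" is a 1-D regression $y = a x$ with a different slope $a$, the model is $f_\theta(x) = \theta x$, and the function names are invented here. The key detail is `create_graph=True`, which lets the meta-gradient flow through the inner gradient step.

```python
import torch


def task_loss(theta, slope, x):
    """Mean squared error of f_theta(x) = theta * x against y = slope * x."""
    return ((theta * x - slope * x) ** 2).mean()


def maml_meta_gradient(theta, slopes, x_support, x_query, inner_lr=0.1):
    """One MAML meta-gradient over a batch of tasks (single inner step)."""
    meta_loss = 0.0
    for slope in slopes:
        # Step 2: adapt parameters with one gradient step on the support set
        support_loss = task_loss(theta, slope, x_support)
        (grad,) = torch.autograd.grad(support_loss, theta, create_graph=True)
        theta_adapted = theta - inner_lr * grad
        # Evaluate the adapted parameters on held-out query data
        meta_loss = meta_loss + task_loss(theta_adapted, slope, x_query)
    meta_loss = meta_loss / len(slopes)
    # Step 3: differentiate through the adaptation step
    (meta_grad,) = torch.autograd.grad(meta_loss, theta)
    return meta_loss.detach(), meta_grad


torch.manual_seed(0)
theta = torch.zeros(1, requires_grad=True)
slopes = [1.0, 2.0, 3.0]                      # Step 1: a fixed batch of tasks
x_support, x_query = torch.randn(20), torch.randn(20)

history = []
for step in range(100):
    loss, meta_grad = maml_meta_gradient(theta, slopes, x_support, x_query)
    history.append(loss.item())
    with torch.no_grad():
        theta -= 0.05 * meta_grad             # plain SGD on the meta-parameters
```

Dropping `create_graph=True` and treating `grad` as a constant would give the cheaper first-order approximation of MAML.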

### 3.4.2 Gradient-based Meta-learning

**Reptile Algorithm:**
A simpler, first-order alternative to MAML:
$$\theta \leftarrow \theta + \beta \frac{1}{n} \sum_{i=1}^n (\phi_i - \theta)$$

Where $\phi_i$ denotes the parameters obtained after several gradient steps on task $i$, starting from $\theta$.
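
The update can be illustrated with a scalar toy problem in NumPy. Each "task" here is a 1-D regression $y = a x$ with its own slope, and the function names, learning rates, and step counts are assumptions for this sketch. No second-order gradients are needed, because the meta-update only uses the difference between the adapted and the initial parameters.

```python
import numpy as np


def inner_sgd(theta, slope, x, lr=0.1, steps=5):
    """Run a few gradient steps on one task, starting from theta."""
    phi = theta
    for _ in range(steps):
        # Gradient of mean((phi * x - slope * x)^2) with respect to phi
        grad = 2.0 * np.mean((phi * x - slope * x) * x)
        phi = phi - lr * grad
    return phi


def reptile_step(theta, slopes, x, beta=0.5):
    """theta <- theta + beta * mean_i(phi_i - theta)."""
    phis = np.array([inner_sgd(theta, s, x) for s in slopes])
    return theta + beta * np.mean(phis - theta)


rng = np.random.default_rng(0)
x = rng.normal(size=50)
theta = 0.0
for _ in range(100):
    theta = reptile_step(theta, [1.0, 2.0, 3.0], x)
```

On this symmetric toy problem the meta-parameters settle at the average slope, from which every task is reachable in a few inner steps.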

## 3.5 Domain Adaptation and Sim-to-real Transfer

### 3.5.1 Domain Randomization

**Technique:**
Randomize simulation parameters during training:
- Physical properties (mass, friction, damping)
- Visual appearance (textures, lighting, colors)
- Sensor characteristics (noise, resolution, field of view)

**Benefits:**
- Learned policies are robust to domain variations
- Improved transfer from simulation to real world
- Reduced need for domain-specific engineering
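
A minimal sketch of per-episode parameter sampling is shown below. The `SimParams` fields and their ranges are purely illustrative assumptions. A real setup would pass the sampled values to whatever simulator is being randomized.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    """Hypothetical simulator parameters randomized at each episode reset."""
    mass: float
    friction: float
    sensor_noise_std: float


def sample_sim_params(rng):
    """Draw one set of parameters from illustrative uniform ranges."""
    return SimParams(
        mass=rng.uniform(0.8, 1.2),               # physical property
        friction=rng.uniform(0.5, 1.5),           # physical property
        sensor_noise_std=rng.uniform(0.0, 0.05),  # sensor characteristic
    )


rng = random.Random(0)
# One fresh parameter set per training episode
episode_params = [sample_sim_params(rng) for _ in range(3)]
```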

### 3.5.2 Domain Adversarial Training

**Objective:**
$$\min_\theta \mathcal{L}_{task}(\theta) + \lambda \mathcal{L}_{domain}(\theta)$$

Where $\mathcal{L}_{domain}$ encourages domain-invariant features.
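
One common way to realize such a domain term is a gradient reversal layer placed between the feature extractor and a domain classifier: the classifier learns to tell domains apart, while the reversed gradient pushes the features toward being indistinguishable. The sketch below shows only the layer itself. The class name and the choice of this technique are assumptions, not necessarily what the assignment expects.

```python
import torch


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lam in backward."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # No gradient is returned for lam, hence the trailing None
        return -ctx.lam * grad_output, None


# Tiny check: d(sum(x))/dx is 1 everywhere, so the reversed gradient is -lam
x = torch.ones(3, requires_grad=True)
y = GradientReversal.apply(x, 0.5).sum()
y.backward()
```

After `backward()`, `x.grad` holds $-0.5$ in every entry, confirming that the upstream gradient was flipped and scaled by $\lambda$.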

## 3.6 Curriculum Learning

Structure learning to progress from simple to complex tasks.

### 3.6.1 Curriculum Design Principles

**Manual Curriculum:**
- Hand-designed progression of tasks
- Expert knowledge of difficulty ordering
- Fixed curriculum regardless of agent performance

**Automatic Curriculum:**
- Adaptive task selection based on agent performance
- Learning progress as curriculum signal
- Self-paced learning approaches

### 3.6.2 Curriculum Learning Algorithms

**Teacher-Student Framework:**
- Teacher selects appropriate tasks for student
- Task difficulty based on student's current capability
- Optimize task selection for maximum learning progress
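
A teacher of this kind can be sketched as a function that picks the task with the largest recent change in reward. The function below is a simplified illustration under stated assumptions: learning progress is measured as the absolute difference between two sliding-window means, and `select_task`, `window`, and `epsilon` are names invented for this example.

```python
import numpy as np


def select_task(reward_history, window=5, epsilon=0.1, rng=None):
    """Return the index of the task with the highest absolute learning progress."""
    rng = np.random.default_rng() if rng is None else rng
    progress = []
    for rewards in reward_history:
        if len(rewards) < 2 * window:
            progress.append(np.inf)   # prioritize tasks with too little data
        else:
            recent = np.mean(rewards[-window:])
            older = np.mean(rewards[-2 * window:-window])
            progress.append(abs(recent - older))
    # Occasionally pick a random task so no task is starved of data
    if rng.random() < epsilon:
        return int(rng.integers(len(reward_history)))
    return int(np.argmax(progress))


history = [
    [1.0] * 10,                     # mastered: rewards are flat and high
    [0.1 * i for i in range(10)],   # improving: rewards still rising
    [0.0] * 10,                     # too hard: rewards are flat and low
]
chosen = select_task(history, epsilon=0.0)
```

With exploration disabled, the teacher selects the middle task, the only one where the student is currently making progress.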

**Self-Play Curriculum:**
- Agent plays against previous versions of itself
- Automatic difficulty adjustment
- Prevents catastrophic forgetting of simpler strategies

In [None]:

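# Imports needed by the classes in this cell (target-network copies and
# categorical action sampling); the earlier import cell does not provide them.
import copy

from torch.distributions import Categorical


class PrioritizedReplayBuffer: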
    """Prioritized experience replay buffer for improved sample efficiency."""
    
    def __init__(self, capacity, alpha=0.6, beta=0.4, beta_increment=1e-4):
        self.capacity = capacity
        self.alpha = alpha  # Priority exponent
        self.beta = beta    # Importance sampling exponent
        self.beta_increment = beta_increment
        
        self.buffer = []
        self.priorities = np.zeros(capacity, dtype=np.float32)
        self.position = 0
        self.max_priority = 1.0
        
    def push(self, state, action, reward, next_state, done):
        """Add experience to buffer with maximum priority."""
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        
        self.buffer[self.position] = (state, action, reward, next_state, done)
        self.priorities[self.position] = self.max_priority
        
        self.position = (self.position + 1) % self.capacity
    
    def sample(self, batch_size):
        """Sample batch with prioritized sampling."""
        if len(self.buffer) < batch_size:
            return None
        
        valid_priorities = self.priorities[:len(self.buffer)]
        probs = valid_priorities ** self.alpha
        probs /= probs.sum()
        
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        
        experiences = [self.buffer[idx] for idx in indices]
        states, actions, rewards, next_states, dones = zip(*experiences)
        
        total = len(self.buffer)
        weights = (total * probs[indices]) ** (-self.beta)
        weights /= weights.max()
        
        self.beta = min(1.0, self.beta + self.beta_increment)
        
        return (states, actions, rewards, next_states, dones), indices, weights
    
    def update_priorities(self, indices, td_errors):
        """Update priorities from TD errors: p_i = |delta_i| + eps."""
        for idx, td_error in zip(indices, td_errors):
            # Store the raw priority; sample() already applies the alpha
            # exponent, so raising to alpha here would apply it twice.
            priority = abs(td_error) + 1e-6
            self.priorities[idx] = priority
            self.max_priority = max(self.max_priority, priority)
    
    def __len__(self):
        return len(self.buffer)

class DataAugmentationDQN(nn.Module):
    """DQN with data augmentation for improved sample efficiency."""
    
    def __init__(self, input_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.input_dim = input_dim
        self.action_dim = action_dim
        
        self.q_network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
        
        self.reward_predictor = nn.Sequential(
            nn.Linear(input_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
        
        self.next_state_predictor = nn.Sequential(
            nn.Linear(input_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim)
        )
    
    def forward(self, state, action=None):
        """Forward pass with optional auxiliary predictions."""
        q_values = self.q_network(state)
        
        if action is not None:
            if len(action.shape) == 1:
                action_one_hot = F.one_hot(action.long(), self.action_dim).float()
            else:
                action_one_hot = action
            
            aux_input = torch.cat([state, action_one_hot], dim=-1)
            reward_pred = self.reward_predictor(aux_input)
            next_state_pred = self.next_state_predictor(aux_input)
            
            return q_values, reward_pred, next_state_pred
        
        return q_values
    
    def apply_augmentation(self, state, augmentation_type='noise'):
        """Apply data augmentation to state."""
        if augmentation_type == 'noise':
            noise = torch.randn_like(state) * 0.1
            return state + noise
        
        elif augmentation_type == 'dropout':
            dropout_mask = torch.rand_like(state) > 0.1
            return state * dropout_mask.float()
        
        elif augmentation_type == 'scaling':
            scale = torch.rand(1).item() * 0.4 + 0.8  # Scale between 0.8 and 1.2
            return state * scale
        
        return state

class SampleEfficientAgent:
    """Agent with multiple sample efficiency techniques."""
    
    def __init__(self, state_dim, action_dim, lr=1e-3):
        self.state_dim = state_dim
        self.action_dim = action_dim
        
        self.network = DataAugmentationDQN(state_dim, action_dim)
        self.target_network = copy.deepcopy(self.network)
        self.optimizer = optim.Adam(self.network.parameters(), lr=lr)
        
        self.replay_buffer = PrioritizedReplayBuffer(capacity=10000)
        
        self.gamma = 0.99
        self.target_update_freq = 100
        self.update_count = 0
        
        self.aux_reward_weight = 0.1
        self.aux_dynamics_weight = 0.1
        
        self.losses = []
        self.td_errors = []
    
    def act(self, state, epsilon=0.1):
        """Select action with epsilon-greedy policy."""
        if np.random.random() < epsilon:
            return np.random.randint(self.action_dim)
        
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            q_values = self.network(state_tensor)
            return q_values.argmax().item()
    
    def update(self, batch_size=32, use_aux_tasks=True, augmentation=True):
        """Update agent with prioritized replay and auxiliary tasks."""
        sample_result = self.replay_buffer.sample(batch_size)
        if sample_result is None:
            return None
        
        experiences, indices, weights = sample_result
        states, actions, rewards, next_states, dones = experiences
        
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        dones = torch.BoolTensor(dones)
        weights = torch.FloatTensor(weights)
        
        if augmentation:
            aug_type = np.random.choice(['noise', 'dropout', 'scaling'])
            states = self.network.apply_augmentation(states, aug_type)
            next_states = self.network.apply_augmentation(next_states, aug_type)
        
        current_q_values = self.network(states).gather(1, actions.unsqueeze(1))
        
        with torch.no_grad():
            next_q_values = self.target_network(next_states).max(1)[0]
            target_q_values = rewards + (self.gamma * next_q_values * (~dones))
        
        td_errors = (current_q_values.squeeze() - target_q_values).detach().numpy()
        
        q_loss = (weights * F.mse_loss(current_q_values.squeeze(), target_q_values, reduction='none')).mean()
        
        total_loss = q_loss
        
        if use_aux_tasks:
            q_values, reward_pred, next_state_pred = self.network(states, actions)
            
            aux_reward_loss = F.mse_loss(reward_pred.squeeze(), rewards)
            aux_dynamics_loss = F.mse_loss(next_state_pred, next_states)
            
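            # Out-of-place adds: `+=` would modify q_loss in place (total_loss
            # aliases it), making the logged q_loss equal to the total loss.
            total_loss = total_loss + self.aux_reward_weight * aux_reward_loss
            total_loss = total_loss + self.aux_dynamics_weight * aux_dynamics_loss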
        
        self.optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.network.parameters(), max_norm=1.0)
        self.optimizer.step()
        
        self.replay_buffer.update_priorities(indices, td_errors)
        
        self.update_count += 1
        if self.update_count % self.target_update_freq == 0:
            self.target_network.load_state_dict(self.network.state_dict())
        
        self.losses.append(total_loss.item())
        self.td_errors.extend(td_errors.tolist())
        
        return {
            'total_loss': total_loss.item(),
            'q_loss': q_loss.item(),
            'aux_reward_loss': aux_reward_loss.item() if use_aux_tasks else 0,
            'aux_dynamics_loss': aux_dynamics_loss.item() if use_aux_tasks else 0
        }

class TransferLearningAgent:
    """Agent with transfer learning capabilities."""
    
    def __init__(self, state_dim, action_dim, lr=1e-3):
        self.state_dim = state_dim
        self.action_dim = action_dim
        
        self.feature_extractor = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU()
        )
        
        self.policy_heads = {}
        self.value_heads = {}
        
        self.feature_optimizer = optim.Adam(self.feature_extractor.parameters(), lr=lr)
        self.head_optimizers = {}
        
        self.transfer_performance = {}
    
    def add_task(self, task_name, action_dim=None):
        """Add a new task with its own policy and value heads."""
        if action_dim is None:
            action_dim = self.action_dim
        
        self.policy_heads[task_name] = nn.Sequential(
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, action_dim),
            nn.Softmax(dim=-1)
        )
        
        self.value_heads[task_name] = nn.Sequential(
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )
        
        task_params = list(self.policy_heads[task_name].parameters()) + \
                     list(self.value_heads[task_name].parameters())
        self.head_optimizers[task_name] = optim.Adam(task_params, lr=1e-3)
        
        self.transfer_performance[task_name] = []
    
    def get_action(self, state, task_name):
        """Get action for specific task."""
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            features = self.feature_extractor(state_tensor)
            action_probs = self.policy_heads[task_name](features)
            return Categorical(action_probs).sample().item()
    
    def get_value(self, state, task_name):
        """Get value estimate for specific task."""
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            features = self.feature_extractor(state_tensor)
            return self.value_heads[task_name](features).item()
    
    def update(self, states, actions, rewards, task_name, update_features=True):
        """Update agent for specific task."""
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        
        features = self.feature_extractor(states)
        action_probs = self.policy_heads[task_name](features)
        values = self.value_heads[task_name](features).squeeze()
        
        # Small constant guards against log(0) when a probability underflows
        log_probs = torch.log(action_probs.gather(1, actions.unsqueeze(1)) + 1e-8).squeeze()
        advantages = rewards - values.detach()
        policy_loss = -(log_probs * advantages).mean()
        
        value_loss = F.mse_loss(values, rewards)
        
        total_loss = policy_loss + 0.5 * value_loss
        
        self.head_optimizers[task_name].zero_grad()
        if update_features:
            self.feature_optimizer.zero_grad()
        
        total_loss.backward()
        
        self.head_optimizers[task_name].step()
        if update_features:
            self.feature_optimizer.step()
        
        return {
            'policy_loss': policy_loss.item(),
            'value_loss': value_loss.item()
        }
    
    def fine_tune_for_task(self, source_task, target_task, fine_tune_lr=1e-4):
        """Fine-tune from source task to target task."""
        self.policy_heads[target_task] = copy.deepcopy(self.policy_heads[source_task])
        self.value_heads[target_task] = copy.deepcopy(self.value_heads[source_task])
        
        task_params = list(self.policy_heads[target_task].parameters()) + \
                     list(self.value_heads[target_task].parameters())
        self.head_optimizers[target_task] = optim.Adam(task_params, lr=fine_tune_lr)
        
        self.transfer_performance[target_task] = []

class CurriculumLearningFramework:
    """Framework for curriculum learning with automatic difficulty adjustment."""
    
    def __init__(self, environments, agent, difficulty_measure='success_rate'):
        self.environments = environments  # List of environments with increasing difficulty
        self.agent = agent
        self.difficulty_measure = difficulty_measure
        
        self.current_level = 0
        self.level_performance = [[] for _ in environments]
        self.progression_threshold = 0.8  # Success rate threshold to advance
        self.regression_threshold = 0.3   # Success rate threshold to regress
        
        self.curriculum_history = []
    
    def get_current_environment(self):
        """Get current environment based on curriculum level."""
        return self.environments[self.current_level]
    
    def evaluate_performance(self, episode_rewards, episode_successes=None):
        """Evaluate performance on current level."""
        # Use successes when requested and available; otherwise fall back to rewards
        if self.difficulty_measure == 'success_rate' and episode_successes is not None:
            history = episode_successes
        else:
            history = episode_rewards
        # Require at least 10 episodes before reporting a non-zero score
        return np.mean(history[-10:]) if len(history) >= 10 else 0
    
    def update_curriculum(self, performance):
        """Update curriculum level based on performance."""
        old_level = self.current_level
        
        if performance >= self.progression_threshold and self.current_level < len(self.environments) - 1:
            self.current_level += 1
            print(f"📈 Advanced to level {self.current_level} (performance: {performance:.2f})")
        
        elif performance < self.regression_threshold and self.current_level > 0:
            self.current_level = max(0, self.current_level - 1)
            print(f"📉 Regressed to level {self.current_level} (performance: {performance:.2f})")
        
        if old_level != self.current_level:
            self.curriculum_history.append({
                'episode': len(self.level_performance[old_level]),
                'old_level': old_level,
                'new_level': self.current_level,
                'performance': performance
            })
        
        return self.current_level != old_level
    
    def train_with_curriculum(self, num_episodes=1000):
        """Train agent using curriculum learning."""
        episode_rewards = []
        episode_successes = []
        
        for episode in range(num_episodes):
            env = self.get_current_environment()
            
            state = env.reset()
            episode_reward = 0
            episode_success = False
            
            for step in range(100):  # Max episode length
                action = self.agent.act(state, epsilon=max(0.1, 1.0 - episode/500))
                next_state, reward, done, info = env.step(action)
                
                self.agent.replay_buffer.push(state, action, reward, next_state, done)
                
                episode_reward += reward
                if done and reward > 5:  # Define success condition
                    episode_success = True
                
                if done:
                    break
                
                state = next_state
            
            if len(self.agent.replay_buffer) > 32:
                self.agent.update(32)
            
            episode_rewards.append(episode_reward)
            episode_successes.append(episode_success)
            self.level_performance[self.current_level].append(episode_reward)
            
            if episode % 20 == 0:
                performance = self.evaluate_performance(episode_rewards, episode_successes)
                self.update_curriculum(performance)
            
            if episode % 100 == 0:
                recent_reward = np.mean(episode_rewards[-10:])
                recent_success = np.mean(episode_successes[-10:])
                print(f"Episode {episode}: Level {self.current_level}, "
                      f"Reward: {recent_reward:.2f}, Success: {recent_success:.2f}")
        
        return episode_rewards, episode_successes

def compare_sample_efficiency():
    """Compare sample efficiency of different techniques."""
    print("⚡ Comparing Sample Efficiency Techniques")
    
    env = SimpleGridWorld(size=6)
    
    baseline_agent = DQNAgent(state_dim=2, action_dim=4)
    efficient_agent = SampleEfficientAgent(state_dim=2, action_dim=4)
    
    agents = {
        'Baseline DQN': baseline_agent,
        'Sample Efficient': efficient_agent
    }
    
    results = {name: {'rewards': [], 'episodes': []} for name in agents.keys()}
    
    num_episodes = 300
    
    for episode in range(num_episodes):
        for agent_name, agent in agents.items():
            state = env.reset()
            episode_reward = 0
            
            for step in range(50):
                action = agent.act(state, epsilon=max(0.1, 1.0 - episode/200))
                next_state, reward, done, _ = env.step(action)
                episode_reward += reward
                
                if agent_name == 'Baseline DQN':
                    agent.replay_buffer.push(state, action, reward, next_state, done)
                    if len(agent.replay_buffer) > 32:
                        batch = agent.replay_buffer.sample(32)
                        agent.update(batch)
                else:
                    agent.replay_buffer.push(state, action, reward, next_state, done)
                    if len(agent.replay_buffer) > 32:
                        agent.update(32)
                
                if done:
                    break
                
                state = next_state
            
            results[agent_name]['rewards'].append(episode_reward)
            results[agent_name]['episodes'].append(episode)
    
    return results

def demonstrate_transfer_learning():
    """Demonstrate transfer learning between related tasks."""
    print("🔄 Demonstrating Transfer Learning")
    
    agent = TransferLearningAgent(state_dim=2, action_dim=4)
    
    def create_task_env(goal_position, reward_scale=1.0):
        env = SimpleGridWorld(size=4)
        env.goal = goal_position
        env.reward_scale = reward_scale
        return env
    
    tasks = {
        'task_1': create_task_env([3, 3], 1.0),     # Original goal
        'task_2': create_task_env([3, 0], 1.0),     # Different goal
        'task_3': create_task_env([0, 3], 1.0),     # Another goal
    }
    
    for task_name in tasks.keys():
        agent.add_task(task_name)
    
    print("Training on Task 1...")
    task_1_env = tasks['task_1']
    
    for episode in range(200):
        state = task_1_env.reset()
        episode_states, episode_actions, episode_rewards = [], [], []
        
        for step in range(30):
            action = agent.get_action(state, 'task_1')
            next_state, reward, done, _ = task_1_env.step(action)
            
            episode_states.append(state)
            episode_actions.append(action)
            episode_rewards.append(reward)
            
            if done:
                break
            
            state = next_state
        
        if episode_rewards:
            agent.update(episode_states, episode_actions, episode_rewards, 'task_1')
    
    transfer_results = {}
    
    for new_task in ['task_2', 'task_3']:
        print(f"Transferring to {new_task}...")
        
        agent.fine_tune_for_task('task_1', new_task)
        
        task_env = tasks[new_task]
        task_rewards = []
        
        for episode in range(50):  # Limited training
            state = task_env.reset()
            episode_reward = 0
            episode_states, episode_actions, episode_rewards = [], [], []
            
            for step in range(30):
                action = agent.get_action(state, new_task)
                next_state, reward, done, _ = task_env.step(action)
                
                episode_states.append(state)
                episode_actions.append(action)
                episode_rewards.append(reward)
                episode_reward += reward
                
                if done:
                    break
                
                state = next_state
            
            if episode_rewards:
                agent.update(episode_states, episode_actions, episode_rewards, 
                           new_task, update_features=False)
            
            task_rewards.append(episode_reward)
        
        transfer_results[new_task] = task_rewards
        print(f"  Final performance on {new_task}: {np.mean(task_rewards[-10:]):.2f}")
    
    return transfer_results

print("🚀 Starting Sample Efficiency and Transfer Learning Demonstrations!")

efficiency_results = compare_sample_efficiency()

transfer_results = demonstrate_transfer_learning()

fig, axes = plt.subplots(1, 2, figsize=(15, 6))

for agent_name, data in efficiency_results.items():
    window_size = 20
    if len(data['rewards']) >= window_size:
        smoothed_rewards = pd.Series(data['rewards']).rolling(window_size).mean()
        axes[0].plot(data['episodes'], smoothed_rewards, label=agent_name, linewidth=2)

axes[0].set_title('Sample Efficiency Comparison')
axes[0].set_xlabel('Episode')
axes[0].set_ylabel('Episode Reward (Smoothed)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

for task_name, rewards in transfer_results.items():
    axes[1].plot(rewards, label=f'Transfer to {task_name}', linewidth=2)

axes[1].set_title('Transfer Learning Performance')
axes[1].set_xlabel('Episode (Limited Training)')
axes[1].set_ylabel('Episode Reward')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Sample Efficiency Results:")
for agent_name, data in efficiency_results.items():
    final_perf = np.mean(data['rewards'][-20:])
    print(f"  {agent_name}: {final_perf:.2f} final performance")

print("\n🔄 Transfer Learning Results:")
for task_name, rewards in transfer_results.items():
    final_perf = np.mean(rewards[-10:])
    print(f"  {task_name}: {final_perf:.2f} final performance with limited training")

print("\n💡 Key Insights:")
print("  • Prioritized replay and auxiliary tasks improve sample efficiency")
print("  • Data augmentation provides regularization benefits")
print("  • Transfer learning enables rapid adaptation to new tasks")
print("  • Shared representations capture generalizable knowledge")


# Section 4: Hierarchical Reinforcement Learning

## 4.1 Theory: Hierarchical Decision Making

Hierarchical Reinforcement Learning (HRL) addresses the challenge of learning complex behaviors by decomposing tasks into hierarchical structures. This approach enables agents to:

1. **Learn at Multiple Time Scales**: High-level policies select goals or skills, while low-level policies execute primitive actions
2. **Achieve Better Generalization**: Skills learned in one context can be reused in others
3. **Improve Sample Efficiency**: By leveraging temporal abstractions and skill composition

### Key Components

#### Options Framework
An **option** $\omega$ is defined by a tuple $(I_\omega, \pi_\omega, \beta_\omega)$:
- **Initiation Set** $I_\omega \subseteq \mathcal{S}$: States where the option can be initiated
- **Policy** $\pi_\omega: \mathcal{S} \times \mathcal{A} \rightarrow [0,1]$: Action selection within the option
- **Termination Condition** $\beta_\omega: \mathcal{S} \rightarrow [0,1]$: Probability of termination
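
The three components above map naturally onto a small data structure. The following is a minimal sketch, not the implementation in `agents.hierarchical`: the `Option` class, the `run_option` helper, and the five-state corridor are hypothetical names introduced only for illustration, and the intra-option policy is deterministic for simplicity.

```python
import random
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable


@dataclass
class Option:
    """An option (I, pi, beta) over a discrete state space (illustrative only)."""
    initiation_set: Set[State]             # I_omega: states where the option may start
    policy: Callable[[State], int]         # pi_omega: deterministic here for simplicity
    termination: Callable[[State], float]  # beta_omega(s): termination probability

    def can_start(self, state: State) -> bool:
        return state in self.initiation_set

    def should_terminate(self, state: State, rng: random.Random) -> bool:
        return rng.random() < self.termination(state)


def run_option(option, state, step_fn, rng, max_steps=20):
    """Follow the option's policy until beta fires; return (final state, duration)."""
    assert option.can_start(state), "option started outside its initiation set"
    for k in range(1, max_steps + 1):
        state = step_fn(state, option.policy(state))
        if option.should_terminate(state, rng):
            return state, k
    return state, max_steps


# Hypothetical corridor with states 0..4; action 1 moves right, action 0 moves left.
def corridor_step(state, action):
    return min(4, state + 1) if action == 1 else max(0, state - 1)


walk_right = Option(
    initiation_set={0, 1, 2, 3},
    policy=lambda s: 1,
    termination=lambda s: 1.0 if s == 4 else 0.0,
)

final_state, duration = run_option(walk_right, 0, corridor_step, random.Random(0))
print(final_state, duration)
```

Because the termination probability is 1 only at state 4, the option behaves like a temporally extended "go to the right end" action that a higher-level policy can select as a single decision.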

#### Hierarchical Value Functions
The value function for options follows the Bellman equation:
$$Q^\pi(s,\omega) = \mathbb{E}_\pi\left[\sum_{t=0}^{\tau-1} \gamma^t r_{t+1} + \gamma^\tau Q^\pi(s_\tau, \omega') \mid s_0=s, \omega_0=\omega\right]$$

where $\tau$ is the termination time and $\omega'$ is the next option selected.
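
A single sample of the bracketed quantity can be computed from the rewards collected while an option ran. The sketch below assumes a tabular setting and a greedy choice of the next option $\omega'$; the function names and the dictionary-based Q-table are illustrative assumptions rather than part of the assignment code.

```python
def option_return(rewards, gamma):
    """Discounted in-option return: sum over t of gamma^t * r_{t+1}."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def smdp_target(rewards, gamma, next_option_values):
    """One-sample target: in-option return plus gamma^tau * Q(s_tau, omega')."""
    tau = len(rewards)                   # number of steps the option lasted
    bootstrap = max(next_option_values)  # assumption: omega' is chosen greedily
    return option_return(rewards, gamma) + (gamma ** tau) * bootstrap


def smdp_q_update(q_table, state, option, target, alpha):
    """Move Q(s, omega) a fraction alpha toward the sampled target."""
    old = q_table.get((state, option), 0.0)
    q_table[(state, option)] = old + alpha * (target - old)
    return q_table[(state, option)]


# The option ran for tau = 3 steps and received a reward only on the last step.
target = smdp_target(rewards=[0.0, 0.0, 1.0], gamma=0.5, next_option_values=[2.0, 4.0])
q_table = {}
new_q = smdp_q_update(q_table, state="s0", option="walk_right", target=target, alpha=0.5)
print(target, new_q)
```

The discount applied to the bootstrap term is $\gamma^\tau$ rather than $\gamma$, which is what distinguishes this update from a one-step Q-learning backup.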

#### Feudal Networks
Feudal Networks implement a manager-worker hierarchy:
- **Manager Network**: Sets goals $g_t$ for workers: $g_t = f_{manager}(s_t, h_{t-1}^{manager})$
- **Worker Network**: Executes actions conditioned on goals: $a_t = \pi_{worker}(s_t, g_t)$
- **Intrinsic Motivation**: Workers receive intrinsic rewards based on goal achievement
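
The two time scales of the manager-worker hierarchy can be illustrated without any neural networks. In this sketch the manager and worker are hand-written stand-ins for $f_{manager}$ and $\pi_{worker}$ (the recurrent state $h_{t-1}^{manager}$ is omitted), and the grid moves, the `horizon` parameter, and all function names are assumptions made for the example.

```python
import math

# Hypothetical move set on a 2-D grid: action id -> (dx, dy)
ACTIONS = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}


def manager_goal(state, target):
    """Stand-in for f_manager: unit direction from the state toward a target."""
    dx, dy = target[0] - state[0], target[1] - state[1]
    norm = math.hypot(dx, dy)
    return (0.0, 0.0) if norm == 0 else (dx / norm, dy / norm)


def worker_action(state, goal):
    """Stand-in for pi_worker: pick the move best aligned with the current goal."""
    return max(ACTIONS, key=lambda a: ACTIONS[a][0] * goal[0] + ACTIONS[a][1] * goal[1])


def rollout(start, target, horizon=2, max_steps=20):
    """The manager refreshes the goal every `horizon` steps; the worker acts every step."""
    state, goal, path = start, (0.0, 0.0), [start]
    for t in range(max_steps):
        if t % horizon == 0:  # slower time scale for the manager
            goal = manager_goal(state, target)
        dx, dy = ACTIONS[worker_action(state, goal)]
        state = (state[0] + dx, state[1] + dy)
        path.append(state)
        if state == target:
            break
    return path


path = rollout(start=(0, 0), target=(2, 1))
print(path)
```

In a learned Feudal Network both functions would be trained, with the worker rewarded for moving in the direction the manager requested.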

### Mathematical Framework

#### Intrinsic Reward Signal
The intrinsic reward for achieving subgoals:
$$r_t^{intrinsic} = \cos(\text{achieved\_goal}_t - \text{desired\_goal}_t) \cdot ||s_{t+1} - s_t||$$
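
The expression above does not spell out how the cosine term is evaluated for vector-valued goals. One possible reading, assumed here and not confirmed by the assignment code, treats it as the cosine of the angle between the achieved displacement $s_{t+1} - s_t$ and the desired goal direction, scaled by the displacement norm. The function name and the zero-reward convention for degenerate inputs are illustrative choices.

```python
import math


def intrinsic_reward(state, next_state, desired_goal):
    """Cosine alignment between displacement and goal, scaled by displacement norm.

    Assumed interpretation: the 'achieved goal' is the direction actually moved.
    """
    displacement = [n - s for s, n in zip(state, next_state)]
    disp_norm = math.sqrt(sum(d * d for d in displacement))
    goal_norm = math.sqrt(sum(g * g for g in desired_goal))
    if disp_norm == 0.0 or goal_norm == 0.0:
        return 0.0  # no movement or no goal: no directional signal
    dot = sum(d * g for d, g in zip(displacement, desired_goal))
    cosine = dot / (disp_norm * goal_norm)
    return cosine * disp_norm


aligned = intrinsic_reward((0.0, 0.0), (2.0, 0.0), (1.0, 0.0))
orthogonal = intrinsic_reward((0.0, 0.0), (2.0, 0.0), (0.0, 1.0))
opposed = intrinsic_reward((0.0, 0.0), (2.0, 0.0), (-1.0, 0.0))
print(aligned, orthogonal, opposed)
```

Under this reading the worker is rewarded for large steps in the requested direction, receives nothing for sideways motion, and is penalized for moving away from the goal.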

#### Hierarchical Policy Gradient
The gradient for the manager policy:
$$\nabla_{\theta_m} J_m = \mathbb{E}\left[\nabla_{\theta_m} \log \pi_m(g_t|s_t) \cdot A_m(s_t, g_t)\right]$$

And for the worker policy:
$$\nabla_{\theta_w} J_w = \mathbb{E}\left[\nabla_{\theta_w} \log \pi_w(a_t|s_t, g_t) \cdot A_w(s_t, a_t, g_t)\right]$$
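
Both gradients share the same score-function form: the gradient of the log-probability of the chosen output, weighted by an advantage. The sketch below estimates it for a softmax policy whose parameters are the logits themselves, which is a simplifying assumption; for the manager the "choice" would index a discrete goal, for the worker an action. All names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, chosen):
    """Gradient of log pi(chosen) with respect to the logits: onehot(chosen) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == chosen else 0.0) - p for i, p in enumerate(probs)]


def policy_gradient_estimate(samples):
    """Monte Carlo average of grad log pi * advantage.

    `samples` is a list of (logits, chosen_index, advantage) triples.
    """
    dim = len(samples[0][0])
    grad = [0.0] * dim
    for logits, chosen, advantage in samples:
        score = grad_log_softmax(logits, chosen)
        for i in range(dim):
            grad[i] += advantage * score[i] / len(samples)
    return grad


# One sample: uniform two-way policy, first output chosen, advantage +2.
grad = policy_gradient_estimate([([0.0, 0.0], 0, 2.0)])
print(grad)
```

A positive advantage pushes probability toward the chosen output and away from the others, so the components of the estimate sum to zero.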

## 4.2 Implementation: Hierarchical RL Architectures

We'll implement several HRL approaches:
1. **Options-Critic Architecture**: Learn options and policies jointly
2. **Feudal Networks**: Manager-worker hierarchies
3. **Hindsight Experience Replay with Goals**: Sample efficiency for goal-conditioned tasks
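
The third approach relies on goal relabeling: transitions from a failed episode are stored again with a goal that was actually reached later in that episode, so sparse rewards become informative. The following is a minimal sketch of the "future" relabeling strategy under assumed data layouts; the tuple format, `her_relabel`, and `sparse_reward` are hypothetical and do not come from the assignment package.

```python
import random


def sparse_reward(achieved, goal):
    """0 when the goal is reached, -1 otherwise (a common sparse-reward convention)."""
    return 0.0 if achieved == goal else -1.0


def her_relabel(episode, reward_fn, k=2, rng=None):
    """Return original transitions plus k relabeled copies per step.

    `episode` is a list of (state, action, next_state, goal) tuples.
    Output tuples are (state, action, reward, next_state, goal).
    """
    rng = rng or random.Random()
    relabeled = []
    for t, (state, action, next_state, goal) in enumerate(episode):
        # Keep the transition with its original goal.
        relabeled.append((state, action, reward_fn(next_state, goal), next_state, goal))
        for _ in range(k):
            # Substitute a state achieved at step t or later as the new goal.
            future = rng.randrange(t, len(episode))
            new_goal = episode[future][2]
            relabeled.append(
                (state, action, reward_fn(next_state, new_goal), next_state, new_goal)
            )
    return relabeled


# A three-step episode that never reaches its goal (state 5).
episode = [(0, 1, 1, 5), (1, 1, 2, 5), (2, 1, 3, 5)]
buffer_entries = her_relabel(episode, sparse_reward, k=2, rng=random.Random(0))
successes = sum(1 for entry in buffer_entries if entry[2] == 0.0)
print(len(buffer_entries), successes)
```

Even though the original episode contains only failures, the relabeled copies include successful transitions, which is the source of the sample-efficiency gain for goal-conditioned tasks.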

In [None]:
# Import hierarchical RL classes from package
from agents.hierarchical import (
    OptionsCriticNetwork,
    OptionsCriticAgent,
    FeudalNetwork,
    FeudalAgent
)

print("Hierarchical RL classes imported from agents.hierarchical package")
print("Includes Options-Critic and Feudal Networks for hierarchical decision making")


✅ Hierarchical RL classes imported from agents.hierarchical package
💡 Includes Options-Critic and Feudal Networks for hierarchical decision making


# Section 5: Comprehensive Evaluation and Advanced Techniques Integration

## 5.1 Multi-method Performance Analysis

This section evaluates and compares the advanced deep RL techniques implemented in this assignment.

### Performance Metrics
1. **Sample Efficiency**: Episodes to convergence
2. **Final Performance**: Asymptotic reward
3. **Robustness**: Performance variance
4. **Computational Efficiency**: Training time and memory usage
5. **Transfer Capability**: Performance on related tasks
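
The first three metrics can be computed directly from a list of episode rewards. This sketch assumes a trailing-window definition of convergence and uses the standard deviation of the final window as a robustness proxy; the function names, the window size, and the threshold are illustrative choices, and computational cost and transfer capability are not covered here.

```python
from statistics import mean, pstdev


def episodes_to_convergence(rewards, threshold, window=5):
    """Index of the first episode whose trailing-window mean reaches the threshold."""
    for end in range(window, len(rewards) + 1):
        if mean(rewards[end - window:end]) >= threshold:
            return end - 1  # last episode included in the window
    return None  # never converged within the recorded episodes


def summarize(rewards, threshold, window=5):
    """Sample efficiency, final performance, and robustness for one training run."""
    tail = rewards[-window:]
    return {
        "episodes_to_convergence": episodes_to_convergence(rewards, threshold, window),
        "final_performance": mean(tail),
        "robustness_std": pstdev(tail),
    }


rewards = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
summary = summarize(rewards, threshold=0.75)
print(summary)
```

Averaging these summaries over several independent trials, as the evaluator class later in this section does, gives a more reliable comparison than a single run.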

### Evaluation Framework
We evaluate methods across multiple dimensions:
- **Simple Tasks**: Basic navigation and control
- **Complex Tasks**: Multi-step reasoning and planning
- **Transfer Tasks**: Adaptation to new environments
- **Long-Horizon Tasks**: Extended episode planning

## 5.2 Practical Implementation Considerations

### When to Use Each Method:

#### Model-free Methods (DQN, Policy Gradient)
- ✅ **Use when**: Simple tasks, abundant data, unknown dynamics
- ❌ **Avoid when**: Sample efficiency critical, complex planning needed

#### Model-based Methods
- ✅ **Use when**: Sample efficiency critical, dynamics learnable
- ❌ **Avoid when**: High-dimensional observations, stochastic dynamics

#### World Models
- ✅ **Use when**: Rich sensory input, imagination beneficial
- ❌ **Avoid when**: Simple state spaces, real-time constraints

#### Hierarchical Methods
- ✅ **Use when**: Long-horizon tasks, reusable skills needed
- ❌ **Avoid when**: Simple tasks, flat action spaces

#### Sample Efficiency Techniques
- ✅ **Use when**: Limited data, expensive environments
- ❌ **Avoid when**: Abundant cheap data, simple tasks

## 5.3 Advanced Techniques Summary

This comprehensive assignment covered cutting-edge Deep RL methods:

### Core Contributions:
1. **Sample Efficiency**: Prioritized replay, data augmentation, auxiliary tasks
2. **World Models**: VAE-based dynamics, imagination planning
3. **Transfer Learning**: Shared representations, meta-learning
4. **Hierarchical Learning**: Options framework, feudal networks
5. **Integration**: Multi-method evaluation and practical guidelines

In [None]:

class AdvancedRLEvaluator:
    """Comprehensive evaluation framework for advanced RL methods."""
    
    def __init__(self, environments, agents, metrics=('reward', 'sample_efficiency', 'robustness')):
        self.environments = environments
        self.agents = agents
        self.metrics = metrics
        self.results = {}
        
        self.num_trials = 5
        self.num_episodes = 300
        self.evaluation_interval = 50
        
    def evaluate_sample_efficiency(self, agent, env, convergence_threshold=0.8):
        """Measure episodes to convergence."""
        max_rewards = []
        convergence_episodes = []
        
        for trial in range(self.num_trials):
            episode_rewards = []
            
            if hasattr(agent, 'reset'):
                agent.reset()
            
            for episode in range(self.num_episodes):
                state = env.reset()
                episode_reward = 0
                
                for step in range(100):
                    if hasattr(agent, 'act'):
                        if 'Options' in str(type(agent)):
                            action, _ = agent.act(state)
                        else:
                            action = agent.act(state)
                    else:
                        action = env.action_space.sample()
                    
                    next_state, reward, done, _ = env.step(action)
                    episode_reward += reward
                    
                    if hasattr(agent, 'replay_buffer'):
                        agent.replay_buffer.push(state, action, reward, next_state, done)
                        if len(agent.replay_buffer) > 32:
                            if hasattr(agent, 'update'):
                                agent.update(32)
                    
                    if done:
                        break
                    
                    state = next_state
                
                episode_rewards.append(episode_reward)
                
                if len(episode_rewards) >= 20:
                    recent_performance = np.mean(episode_rewards[-20:])
                    if recent_performance >= convergence_threshold * np.max(episode_rewards[:max(1, episode-20)]):
                        convergence_episodes.append(episode)
                        break
            
            max_rewards.append(np.max(episode_rewards))
            if not convergence_episodes or len(convergence_episodes) <= trial:
                convergence_episodes.append(self.num_episodes)
        
        return {
            'convergence_episodes': np.mean(convergence_episodes),
            'convergence_std': np.std(convergence_episodes),
            'max_reward': np.mean(max_rewards),
            'max_reward_std': np.std(max_rewards)
        }
    
    def evaluate_transfer_capability(self, agent, source_env, target_envs):
        """Evaluate transfer learning capability."""
        source_performance = []
        
        for episode in range(100):  # Limited training
            state = source_env.reset()  # start every training episode from a fresh state
            episode_reward = 0
            for step in range(50):
                action = agent.act(state) if hasattr(agent, 'act') else source_env.action_space.sample()
                next_state, reward, done, _ = source_env.step(action)
                episode_reward += reward
                
                if hasattr(agent, 'replay_buffer'):
                    agent.replay_buffer.push(state, action, reward, next_state, done)
                    if len(agent.replay_buffer) > 32 and hasattr(agent, 'update'):
                        agent.update(32)
                
                if done:
                    break
                state = next_state
            
            source_performance.append(episode_reward)
        
        transfer_results = {}
        for target_name, target_env in target_envs.items():
            target_rewards = []
            
            for episode in range(20):  # Quick evaluation
                state = target_env.reset()
                episode_reward = 0
                
                for step in range(50):
                    action = agent.act(state) if hasattr(agent, 'act') else target_env.action_space.sample()
                    next_state, reward, done, _ = target_env.step(action)
                    episode_reward += reward
                    
                    if done:
                        break
                    state = next_state
                
                target_rewards.append(episode_reward)
            
            transfer_results[target_name] = {
                'mean_reward': np.mean(target_rewards),
                'std_reward': np.std(target_rewards)
            }
        
        return {
            'source_performance': np.mean(source_performance[-20:]),
            'transfer_results': transfer_results
        }
    
    def comprehensive_evaluation(self):
        """Run comprehensive evaluation across all agents and environments."""
        print("🔬 Starting Comprehensive Evaluation...")
        
        for agent_name, agent in self.agents.items():
            print(f"\n📊 Evaluating {agent_name}...")
            self.results[agent_name] = {}
            
            if 'sample_efficiency' in self.metrics:
                env = self.environments[0] if self.environments else SimpleGridWorld(size=5)
                efficiency_results = self.evaluate_sample_efficiency(agent, env)
                self.results[agent_name]['sample_efficiency'] = efficiency_results
                print(f"  Sample Efficiency: {efficiency_results['convergence_episodes']:.1f} ± {efficiency_results['convergence_std']:.1f} episodes")
            
            if 'transfer' in self.metrics and len(self.environments) > 1:
                source_env = self.environments[0]
                target_envs = {f'env_{i}': env for i, env in enumerate(self.environments[1:])}
                transfer_results = self.evaluate_transfer_capability(agent, source_env, target_envs)
                self.results[agent_name]['transfer'] = transfer_results
                print(f"  Transfer Capability: Source performance {transfer_results['source_performance']:.2f}")
        
        return self.results
    
    def generate_report(self):
        """Generate comprehensive evaluation report."""
        if not self.results:
            self.comprehensive_evaluation()
        
        print("\n" + "=" * 60)
        print("🏆 COMPREHENSIVE EVALUATION REPORT")
        print("=" * 60)
        
        # Defined before the conditional so the recommendations below cannot raise NameError
        efficiency_scores = []
        if any('sample_efficiency' in results for results in self.results.values()):
            print("\n📈 Sample Efficiency Ranking:")
            for agent_name, results in self.results.items():
                if 'sample_efficiency' in results:
                    score = results['sample_efficiency']['convergence_episodes']
                    efficiency_scores.append((agent_name, score))
            
            efficiency_scores.sort(key=lambda x: x[1])  # Lower is better
            for rank, (agent_name, score) in enumerate(efficiency_scores, 1):
                print(f"  {rank}. {agent_name}: {score:.1f} episodes to convergence")
        
        print("\n🎯 Final Performance Comparison:")
        performance_scores = []
        for agent_name, results in self.results.items():
            if 'sample_efficiency' in results:
                score = results['sample_efficiency']['max_reward']
                performance_scores.append((agent_name, score))
        
        performance_scores.sort(key=lambda x: x[1], reverse=True)  # Higher is better
        for rank, (agent_name, score) in enumerate(performance_scores, 1):
            print(f"  {rank}. {agent_name}: {score:.2f} max reward")
        
        print("\n💡 Method Recommendations:")
        
        if efficiency_scores:
            best_efficiency = efficiency_scores[0][0]
            print(f"  • Best Sample Efficiency: {best_efficiency}")
        
        if performance_scores:
            best_performance = performance_scores[0][0]
            print(f"  • Best Final Performance: {best_performance}")
        
        print("\n🔧 Implementation Guidelines:")
        print("  • Use prioritized replay for sample efficiency")
        print("  • Apply data augmentation for robustness")
        print("  • Consider world models for planning tasks")
        print("  • Employ hierarchical methods for long-horizon problems")
        print("  • Leverage transfer learning for related domains")

class IntegratedAdvancedAgent:
    """Agent integrating multiple advanced RL techniques."""
    
    def __init__(self, state_dim, action_dim, config=None):
        self.state_dim = state_dim
        self.action_dim = action_dim
        
        default_config = {
            'use_prioritized_replay': True,
            'use_auxiliary_tasks': True,
            'use_data_augmentation': True,
            'use_world_model': False,
            'use_hierarchical': False,
            'lr': 1e-3,
            'buffer_size': 10000
        }
        self.config = {**default_config, **(config or {})}
        
        self._initialize_components()
        
        self.training_stats = {
            'episode_rewards': [],
            'losses': [],
            'sample_efficiency': [],
            'component_usage': {}
        }
    
    def _initialize_components(self):
        """Initialize RL components based on configuration."""
        if self.config['use_auxiliary_tasks']:
            self.network = DataAugmentationDQN(self.state_dim, self.action_dim)
        else:
            self.network = DQNAgent(self.state_dim, self.action_dim).network
        
        self.target_network = copy.deepcopy(self.network)
        self.optimizer = optim.Adam(self.network.parameters(), lr=self.config['lr'])
        
        if self.config['use_prioritized_replay']:
            self.replay_buffer = PrioritizedReplayBuffer(self.config['buffer_size'])
        else:
            self.replay_buffer = ReplayBuffer(self.config['buffer_size'])
        
        if self.config['use_world_model']:
            self.world_model = VariationalWorldModel(self.state_dim, self.action_dim)
        
        if self.config['use_hierarchical']:
            self.hierarchical_agent = OptionsCriticAgent(self.state_dim, self.action_dim)
        
        self.gamma = 0.99
        self.update_count = 0
        self.target_update_freq = 100
    
    def act(self, state, epsilon=0.1):
        """Select action using integrated approach."""
        if self.config['use_hierarchical']:
            action, option = self.hierarchical_agent.act(state)
            self.training_stats['component_usage']['hierarchical'] = \
                self.training_stats['component_usage'].get('hierarchical', 0) + 1
            return action
        
        if np.random.random() < epsilon:
            return np.random.randint(self.action_dim)
        
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            
            if self.config['use_data_augmentation'] and np.random.random() < 0.1:
                state_tensor = self.network.apply_augmentation(state_tensor, 'noise')
            
            q_values = self.network(state_tensor)
            if isinstance(q_values, tuple):
                q_values = q_values[0]  # Extract Q-values from auxiliary network
            
            return q_values.argmax().item()
    
    def update(self, batch_size=32):
        """Update agent using integrated advanced techniques."""
        if isinstance(self.replay_buffer, PrioritizedReplayBuffer):
            sample_result = self.replay_buffer.sample(batch_size)
            if sample_result is None:
                return None
            experiences, indices, weights = sample_result
        else:
            batch = self.replay_buffer.sample(batch_size)
            if batch is None:
                return None
            experiences = batch
            weights = torch.ones(batch_size)
            indices = None
        
        states, actions, rewards, next_states, dones = experiences
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        dones = torch.BoolTensor(dones)
        weights = torch.FloatTensor(weights) if not isinstance(weights, torch.Tensor) else weights
        
        if self.config['use_data_augmentation']:
            aug_type = np.random.choice(['noise', 'dropout', 'scaling'])
            states = self.network.apply_augmentation(states, aug_type)
            next_states = self.network.apply_augmentation(next_states, aug_type)
            self.training_stats['component_usage']['augmentation'] = \
                self.training_stats['component_usage'].get('augmentation', 0) + 1
        
        if self.config['use_auxiliary_tasks']:
            current_q_values, reward_pred, next_state_pred = self.network(states, actions)
            current_q_values = current_q_values.gather(1, actions.unsqueeze(1)).squeeze()
        else:
            current_q_values = self.network(states).gather(1, actions.unsqueeze(1)).squeeze()
        
        with torch.no_grad():
            if self.config['use_auxiliary_tasks'] and hasattr(self.target_network, 'forward'):
                next_q_values = self.target_network(next_states)
                if isinstance(next_q_values, tuple):
                    next_q_values = next_q_values[0]
            else:
                next_q_values = self.target_network(next_states)
            
            max_next_q_values = next_q_values.max(1)[0]
            target_q_values = rewards + (self.gamma * max_next_q_values * (~dones))
        
        td_errors = (current_q_values - target_q_values).detach()
        q_loss = (weights * F.mse_loss(current_q_values, target_q_values, reduction='none')).mean()
        
        total_loss = q_loss
        
        if self.config['use_auxiliary_tasks']:
            aux_reward_loss = F.mse_loss(reward_pred.squeeze(), rewards)
            aux_dynamics_loss = F.mse_loss(next_state_pred, next_states)
            total_loss += 0.1 * aux_reward_loss + 0.1 * aux_dynamics_loss
            self.training_stats['component_usage']['auxiliary'] = \
                self.training_stats['component_usage'].get('auxiliary', 0) + 1
        
        if self.config['use_world_model']:
            world_model_loss = self.world_model.compute_loss(states, actions, next_states)
            total_loss += 0.1 * world_model_loss
            self.training_stats['component_usage']['world_model'] = \
                self.training_stats['component_usage'].get('world_model', 0) + 1
        
        self.optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.network.parameters(), max_norm=1.0)
        self.optimizer.step()
        
        if indices is not None:
            self.replay_buffer.update_priorities(indices, td_errors.numpy())
        
        self.update_count += 1
        if self.update_count % self.target_update_freq == 0:
            self.target_network.load_state_dict(self.network.state_dict())
        
        self.training_stats['losses'].append(total_loss.item())
        
        return {
            'total_loss': total_loss.item(),
            'q_loss': q_loss.item()
        }

def comprehensive_advanced_rl_demo():
    """Comprehensive demonstration of all advanced RL techniques."""
    print("🎓 COMPREHENSIVE ADVANCED DEEP RL DEMONSTRATION")
    print("=" * 55)
    
    environments = [
        SimpleGridWorld(size=5),
        SimpleGridWorld(size=6),
        SimpleGridWorld(size=7)
    ]
    
    agents = {
        'Baseline DQN': DQNAgent(state_dim=2, action_dim=4),
        'Sample Efficient': SampleEfficientAgent(state_dim=2, action_dim=4),
        'Options-Critic': OptionsCriticAgent(state_dim=2, action_dim=4),
        'Feudal Network': FeudalAgent(state_dim=2, action_dim=4),
        'Integrated Advanced': IntegratedAdvancedAgent(
            state_dim=2, 
            action_dim=4, 
            config={
                'use_prioritized_replay': True,
                'use_auxiliary_tasks': True,
                'use_data_augmentation': True
            }
        )
    }
    
    evaluator = AdvancedRLEvaluator(
        environments=environments,
        agents=agents,
        metrics=['sample_efficiency', 'reward', 'transfer']
    )
    
    evaluator.generate_report()
    
    print("\n🎯 ADVANCED DEEP RL ASSIGNMENT COMPLETED!")
    print("\n📚 Concepts Covered:")
    print("  ✓ Model-Free vs Model-Based RL Comparison")
    print("  ✓ World Models with VAE Architecture")
    print("  ✓ Imagination-Based Planning")
    print("  ✓ Sample Efficiency Techniques")
    print("  ✓ Prioritized Experience Replay")
    print("  ✓ Data Augmentation & Auxiliary Tasks")
    print("  ✓ Transfer Learning & Meta-Learning")
    print("  ✓ Hierarchical Reinforcement Learning")
    print("  ✓ Options-Critic Architecture")
    print("  ✓ Feudal Networks")
    print("  ✓ Comprehensive Evaluation Framework")
    
    print("\n🔬 Key Takeaways:")
    print("  • Advanced RL methods address sample efficiency and scalability")
    print("  • World models enable planning and imagination")
    print("  • Hierarchical methods tackle long-horizon tasks")
    print("  • Transfer learning accelerates adaptation")
    print("  • Integration of techniques often yields best results")
    
    print("\n🚀 Ready for Real-World Advanced RL Applications!")
    
    return evaluator.results

print("Starting final comprehensive demonstration...")
final_results = comprehensive_advanced_rl_demo()

print("\n" + "=" * 60)
print("📖 ASSIGNMENT 13: ADVANCED DEEP RL - COMPLETE! ✅")
print("=" * 60)
