# Computer Assignment 18: Advanced Deep Reinforcement Learning - Comprehensive Exercise

## Course Information
- **Course**: Deep Reinforcement Learning (DRL)
- **Instructor**: Dr. [Instructor Name]
- **Institution**: Sharif University of Technology
- **Semester**: Fall 2024
- **Assignment Number**: CA18

## Learning Objectives

By completing this comprehensive assignment, students will be able to:

1. **Master Advanced RL Paradigms Implementation**: Design, implement, and evaluate five cutting-edge RL approaches including world models, multi-agent systems, causal RL, quantum-enhanced RL, and federated RL using PyTorch.

2. **Apply Theoretical Foundations**: Demonstrate deep understanding of the mathematical principles underlying each advanced RL method through rigorous implementation and analysis of convergence properties, computational complexity, and theoretical guarantees.

3. **Develop Integration Skills**: Combine multiple advanced paradigms (model-based + multi-agent, causal + federated, quantum + safety) to create hybrid systems that leverage complementary strengths for enhanced performance.

4. **Conduct Scientific Performance Analysis**: Perform comprehensive comparative evaluation using statistical methods, ablation studies, and benchmarking across diverse environments to understand trade-offs between different approaches.

5. **Address Real-World Challenges**: Apply advanced RL techniques to practical scenarios including safety-critical systems, distributed deployment, privacy preservation, and multi-agent coordination in complex environments.

6. **Demonstrate Research-Ready Expertise**: Produce publication-quality analysis with proper experimental methodology, statistical validation, and insights into current limitations and future research directions.

## Prerequisites

Before starting this assignment, ensure you have:

- **Mathematical Background**:
- Advanced linear algebra and functional analysis
- Probability theory, stochastic processes, and information theory
- Optimization theory and convex analysis
- Game theory, causal inference, and quantum computing fundamentals

- **Technical Skills**:
- Expert PyTorch proficiency (custom architectures, distributed training)
- Advanced Python programming (concurrent processing, optimization)
- Statistical analysis and experimental design
- Version control and reproducible research practices

- **Prior Knowledge**:
- Completion of CA1-CA17 assignments
- Strong foundation in deep learning and neural architectures
- Experience with advanced RL algorithms and theoretical analysis
- Understanding of real-world deployment challenges and limitations

## Roadmap

This comprehensive assignment is structured as follows:

### Part I: World Models and Imagination-augmented Agents
- Recurrent State-Space Models (RSSM) for environment dynamics
- Imagination-based planning and model-predictive control
- Synthetic data generation and hallucination for sample efficiency
- Applications to continuous control and sparse reward tasks

### Part Ii: Multi-agent Deep Reinforcement Learning
- Cooperative and competitive multi-agent MDP frameworks
- MADDPG with communication protocols and attention mechanisms
- QMIX and value decomposition for scalable multi-agent learning
- Emergent behaviors and coordination analysis

### Part Iii: Causal Reinforcement Learning
- Causal discovery algorithms (PC, NOTEARS, PCMCI) in RL environments
- Counterfactual reasoning and intervention analysis
- Causal mechanisms for robust policy learning
- Applications to explainable AI and robust decision-making

### Part Iv: Quantum-enhanced Reinforcement Learning
- Variational Quantum Circuits (VQC) for policy representation
- Quantum Approximate Optimization Algorithm (QAOA) for RL
- Amplitude estimation for value function evaluation
- Quantum advantage analysis and NISQ device considerations

### Part V: Federated Reinforcement Learning
- Federated averaging algorithms adapted for RL (FedRL)
- Differential privacy and secure multi-party computation
- Communication-efficient distributed training
- Privacy-preserving multi-agent coordination

### Part Vi: Integration, Analysis, and Real-world Applications
- Comparative performance analysis across all paradigms
- Hybrid approaches combining multiple techniques
- Real-world deployment scenarios and challenges
- Future research directions and open problems

## Project Structure

```
CA18/
‚îú‚îÄ‚îÄ CA18.ipynb                      # Main comprehensive assignment notebook
‚îú‚îÄ‚îÄ agents/                         # Advanced RL agent implementations
‚îÇ   ‚îú‚îÄ‚îÄ world*model*agents.py       # RSSM, MPC, imagination-augmented agents
‚îÇ   ‚îú‚îÄ‚îÄ multi*agent*agents.py       # MADDPG, QMIX, communication-enabled agents
‚îÇ   ‚îú‚îÄ‚îÄ causal_agents.py            # Causal discovery, reasoning, intervention agents
‚îÇ   ‚îú‚îÄ‚îÄ quantum_agents.py           # VQC, QAOA, quantum policy networks
‚îÇ   ‚îú‚îÄ‚îÄ federated_agents.py         # FedRL, privacy-preserving distributed agents
‚îÇ   ‚îî‚îÄ‚îÄ hybrid_agents.py            # Integrated multi-paradigm agents
‚îú‚îÄ‚îÄ environments/                   # Comprehensive environment suite
‚îÇ   ‚îú‚îÄ‚îÄ world*model*envs.py         # Model learning and planning environments
‚îÇ   ‚îú‚îÄ‚îÄ multi*agent*envs.py         # Cooperative and competitive multi-agent scenarios
‚îÇ   ‚îú‚îÄ‚îÄ causal_envs.py              # Environments with causal structure and interventions
‚îÇ   ‚îú‚îÄ‚îÄ quantum_envs.py             # Quantum simulation and NISQ-compatible environments
‚îÇ   ‚îú‚îÄ‚îÄ federated_envs.py           # Distributed and privacy-focused environments
‚îÇ   ‚îî‚îÄ‚îÄ integrated_envs.py          # Complex environments requiring multiple paradigms
‚îú‚îÄ‚îÄ models/                         # Advanced neural and quantum architectures
‚îÇ   ‚îú‚îÄ‚îÄ world_models.py             # RSSM cores, imagination networks, MPC
‚îÇ   ‚îú‚îÄ‚îÄ multi*agent*models.py       # Centralized critics, mixing networks, communication
‚îÇ   ‚îú‚îÄ‚îÄ causal_models.py            # Causal graphs, discovery networks, mechanisms
‚îÇ   ‚îú‚îÄ‚îÄ quantum_models.py           # VQC, quantum circuits, amplitude estimation
‚îÇ   ‚îú‚îÄ‚îÄ federated_models.py         # Privacy-preserving, communication-efficient models
‚îÇ   ‚îî‚îÄ‚îÄ hybrid_models.py            # Multi-paradigm integrated architectures
‚îú‚îÄ‚îÄ experiments/                    # Comprehensive experimental framework
‚îÇ   ‚îú‚îÄ‚îÄ paradigm_experiments.py     # Individual paradigm evaluation
‚îÇ   ‚îú‚îÄ‚îÄ comparative_analysis.py     # Cross-paradigm comparison and ablation studies
‚îÇ   ‚îú‚îÄ‚îÄ integration_experiments.py  # Hybrid approach evaluation
‚îÇ   ‚îú‚îÄ‚îÄ real*world*experiments.py   # Practical deployment scenarios
‚îÇ   ‚îî‚îÄ‚îÄ statistical_analysis.py     # Rigorous statistical evaluation methods
‚îî‚îÄ‚îÄ utils/                          # Advanced utility and analysis tools
    ‚îú‚îÄ‚îÄ world*model*utils.py        # Model learning, imagination, planning utilities
    ‚îú‚îÄ‚îÄ multi*agent*utils.py        # Coordination, communication, emergence analysis
    ‚îú‚îÄ‚îÄ causal_utils.py             # Causal discovery, intervention, robustness utilities
    ‚îú‚îÄ‚îÄ quantum_utils.py            # Quantum simulation, NISQ, advantage analysis
    ‚îú‚îÄ‚îÄ federated_utils.py          # Privacy, communication, distributed utilities
    ‚îî‚îÄ‚îÄ analysis_utils.py           # Statistical analysis, visualization, reporting utilities
```

## Contents Overview

### Theoretical Foundations
- **World Models**: Environment modeling, imagination-based learning, planning theory
- **Multi-Agent RL**: Game theory, coordination, communication, emergent behaviors
- **Causal RL**: Causal inference, counterfactuals, robust learning, explainability
- **Quantum RL**: Quantum algorithms, superposition, NISQ computing, potential advantages
- **Federated RL**: Privacy preservation, distributed optimization, communication efficiency
- **Integration**: Hybrid approaches, multi-paradigm combinations, unified frameworks

### Implementation Components
- **World Model Systems**: RSSM implementation, imagination rollout, MPC optimization
- **Multi-Agent Systems**: Centralized training frameworks, communication protocols, value decomposition
- **Causal Systems**: Discovery algorithms, intervention analysis, counterfactual evaluation
- **Quantum Systems**: VQC design, quantum circuit simulation, NISQ-compatible algorithms
- **Federated Systems**: Secure aggregation, differential privacy, communication protocols
- **Hybrid Systems**: Multi-paradigm integration, adaptive algorithm selection, meta-learning

### Research Challenges
- **Scalability**: Large-scale multi-agent systems, distributed quantum computing
- **Robustness**: Adversarial environments, distribution shift, causal validity
- **Privacy**: Federated learning guarantees, secure multi-party computation
- **Efficiency**: Sample complexity, computational requirements, communication costs
- **Integration**: Combining incompatible paradigms, meta-algorithm design

## Evaluation Criteria

Your implementation will be evaluated based on:

1. **Theoretical Rigor (20%)**: Correct mathematical formulation and theoretical analysis
2. **Implementation Quality (25%)**: Efficient, correct, and well-documented code
3. **Experimental Design (20%)**: Proper scientific methodology and statistical analysis
4. **Comparative Analysis (15%)**: Thorough evaluation and insightful comparisons
5. **Innovation & Integration (10%)**: Creative combinations and novel approaches
6. **Real-World Insights (10%)**: Practical deployment considerations and limitations

## Getting Started

1. **Research Foundation**: Review seminal papers on each advanced paradigm
2. **Infrastructure Setup**: Configure environments for quantum simulation, distributed training, and multi-agent simulation
3. **Modular Development**: Implement each paradigm independently, then focus on integration
4. **Rigorous Evaluation**: Design comprehensive experiments with proper statistical validation
5. **Critical Analysis**: Identify limitations, failure modes, and research opportunities

## Expected Outcomes

By the end of this assignment, you will have:

- **Expert Implementation Skills**: Ability to implement state-of-the-art RL research from scratch
- **Theoretical Mastery**: Deep understanding of advanced RL mathematics and theory
- **Research Methodology**: Skills in rigorous experimental design and statistical analysis
- **Integration Expertise**: Ability to combine multiple advanced techniques effectively
- **Critical Thinking**: Understanding of current limitations and future research directions
- **Publication-Ready Skills**: Ability to produce research-quality analysis and documentation

---

**Note**: This comprehensive assignment represents the capstone of the Deep RL course, requiring mastery of multiple advanced paradigms and their integration. The focus is on producing research-quality implementations with thorough analysis, preparing students for cutting-edge RL research and real-world deployment challenges.

Let's master the advanced paradigms of deep reinforcement learning! üß†

## Table of Contents

1. [Introduction](#introduction)
2. [Part I ‚Äî World Models and Imagination](#part-i)
3. [Part II ‚Äî Multi-Agent Deep RL](#part-ii)
4. [Part III ‚Äî Causal Reinforcement Learning](#part-iii)
5. [Part IV ‚Äî Quantum-Enhanced RL](#part-iv)
6. [Part V ‚Äî Federated Reinforcement Learning](#part-v)
7. [Part VI ‚Äî Integration & Real-world Applications](#part-vi)
8. [Experiments and Benchmarking](#experiments)
9. [Results and Analysis](#results)
10. [Conclusion and Future Work](#conclusion)
11. [How to run / Reproducibility](#how-to-run)
12. [References](#references)

> Use the links above to quickly jump to each major section. This TOC is auto-maintained ‚Äî add more anchors below when you expand the notebook.

In [1]:
# Import all advanced RL modules
import sys
import os

# Add CA18 to path
ca18_path = '/Users/tahamajs/Documents/uni/DRL/CAs/Solutions/CA18'
if ca18_path not in sys.path:
    sys.path.insert(0, ca18_path)

# Import core modules
try:
    from world_models.world_models import *
    from multi_agent_rl.multi_agent_rl import *
    from causal_rl.causal_rl import *
    from quantum_rl.quantum_rl import *
    from federated_rl.federated_rl import *
    print("‚úÖ All CA18 modules imported successfully!")
except ImportError as e:
    print(f"‚ùå Import failed: {e}")
    import traceback
    traceback.print_exc()

‚úÖ All CA18 modules imported successfully!


In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from collections import deque, defaultdict
import random
import time
import copy
import warnings
from typing import List, Dict, Tuple, Optional, Union, Any
from abc import ABC, abstractmethod
import networkx as nx
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
import gym
import math
import cmath
from scipy.linalg import expm
from itertools import combinations, permutations

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
warnings.filterwarnings('ignore')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"üöÄ Setup Complete!")
print(f"Device: {device}")
print(f"PyTorch version: {torch.__version__}")
print(f"NumPy version: {np.__version__}")
print("Ready to explore advanced Deep Reinforcement Learning! ü§ñ")


üöÄ Setup Complete!
Device: cpu
PyTorch version: 2.8.0
NumPy version: 2.2.6
Ready to explore advanced Deep Reinforcement Learning! ü§ñ


Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.


# Part I: World Models and Imagination-augmented Agents

## üåç Theoretical Foundation

### Introduction to World Models

**World Models** represent a paradigm shift in reinforcement learning, moving from model-free to model-based approaches that learn internal representations of the environment. This approach was popularized by Ha and Schmidhuber (2018) and has revolutionized how we think about sample efficiency and planning in RL.

### Core Concepts

#### 1. Model-based Reinforcement Learning

Traditional model-free RL learns policies directly from experience:
- **Pro**: No need to model environment dynamics
- **Con**: Sample inefficient, cannot plan ahead

Model-based RL learns a model of the environment:
- **Pro**: Can plan using learned model, more sample efficient  
- **Con**: Model errors can compound, more complex

#### 2. Recurrent State Space Models (rssm)

The RSSM is the heart of world models, consisting of:

**Deterministic Path**: $h*t = f*\theta(h*{t-1}, a*{t-1})$
- Encodes deterministic aspects of state evolution
- Uses RNN/LSTM/GRU to maintain temporal consistency

**Stochastic Path**: $s*t \sim p(s*t | h_t)$  
- Models stochastic aspects and uncertainty
- Typically Gaussian: $s*t \sim \mathcal{N}(\mu*\phi(h*t), \sigma*\phi(h_t))$

**Combined State**: $z*t = [h*t, s_t]$
- Combines deterministic and stochastic components
- Provides rich representation for planning

#### 3. Three-component Architecture

**1. Representation Model (Encoder)**
$$h*t = f*\theta(h*{t-1}, a*{t-1}, o_t)$$
- Encodes observations into internal state
- Maintains temporal consistency

**2. Transition Model**  
$$\hat{s}*{t+1}, \hat{h}*{t+1} = g*\phi(s*t, h*t, a*t)$$
- Predicts next state from current state and action
- Enables forward simulation

**3. Observation Model (Decoder)**
$$\hat{o}*t = d*\psi(s*t, h*t)$$
- Reconstructs observations from internal state
- Ensures representation quality

#### 4. Imagination-augmented Agents (i2a)

I2A extends world models by using "imagination" for policy learning:

**Imagination Rollouts**:
- Use world model to simulate future trajectories
- Generate imagined experiences: $\tau^{imagine} = \{(s*t^i, a*t^i, r*t^i)\}*{t=0}^H$

**Imagination Encoder**:
- Process imagined trajectories into useful features
- Extract planning-relevant information

**Policy Network**:
- Combines real observations with imagination features  
- Makes decisions using both current state and future projections

### Mathematical Framework

#### State Space Model

The world model learns a latent state space representation:

$$p(s*{1:T}, o*{1:T} | a*{1:T}) = \prod*{t=1}^T p(s*t | s*{t-1}, a*{t-1}) p(o*t | s_t)$$

Where:
- $s_t$: latent state at time $t$
- $o_t$: observation at time $t$  
- $a_t$: action at time $t$

#### Training Objectives

**1. Reconstruction Loss**:
$$\mathcal{L}*{recon} = \mathbb{E}*{(o,a) \sim \mathcal{D}}[||o - \hat{o}||^2]$$

**2. KL Regularization**:
$$\mathcal{L}*{KL} = \mathbb{E}*{s \sim q*\phi}[D*{KL}(q_\phi(s|o,h) || p(s|h))]$$

**3. Prediction Loss**:
$$\mathcal{L}*{pred} = \mathbb{E}*{(s,a,s') \sim \mathcal{D}}[||s' - \hat{s}'||^2]$$

**Total Loss**:
$$\mathcal{L}*{world} = \mathcal{L}*{recon} + \beta \mathcal{L}*{KL} + \lambda \mathcal{L}*{pred}$$

### Planning Algorithms

#### 1. Model Predictive Control (mpc)

MPC uses the world model for online planning:

1. **Rollout**: Simulate $H$-step trajectories using world model
2. **Evaluate**: Score trajectories using reward predictions  
3. **Execute**: Take first action of best trajectory
4. **Replan**: Repeat process at next timestep

**MPC Objective**:
$$a^* = \arg\max*a \sum*{h=1}^H \gamma^h r(s*h, a*h)$$

where $(s*h, a*h)$ come from world model rollouts.

#### 2. Cross Entropy Method (cem)

CEM is a population-based optimization method:

1. **Sample**: Generate action sequence population
2. **Evaluate**: Score sequences using world model
3. **Select**: Keep top-performing sequences
4. **Update**: Fit distribution to elite sequences
5. **Repeat**: Iterate until convergence

### Advantages and Applications

**Advantages**:
- **Sample Efficiency**: Learn from imagined experiences
- **Planning Capability**: Look ahead before acting
- **Transfer Learning**: World models can transfer across tasks
- **Interpretability**: Can visualize agent's internal world understanding

**Applications**:
- **Robotics**: Sample-efficient robot learning
- **Game Playing**: Strategic planning in complex games  
- **Autonomous Driving**: Safe planning with uncertainty
- **Finance**: Portfolio optimization with market models

### Key Research Papers

1. **World Models** (Ha & Schmidhuber, 2018)
2. **PlaNet** (Hafner et al., 2019)  
3. **DreamerV1** (Hafner et al., 2020)
4. **DreamerV2** (Hafner et al., 2021)
5. **I2A** (Weber et al., 2017)

In [None]:
# Import World Models implementations
from world_models.world_models import (
    RSSMCore, WorldModel, MPCPlanner, ImaginationAugmentedAgent
)
from world_models.world_models_demo import (
    create_world_model_environment, collect_random_data, 
    create_training_batches, train_world_model, evaluate_world_model_planning
)

print("‚úÖ World Models Implementation Imported!")
print("Components available:")
print("- RSSMCore: Recurrent state space model with deterministic/stochastic components")
print("- WorldModel: Complete world model with encoder/decoder and predictors")  
print("- MPCPlanner: Cross-entropy method planner for action sequence optimization")
print("- ImaginationAugmentedAgent: I2A-style agent combining model-free and imagination")
print("- Demo functions: create_world_model_environment, collect_random_data, create_training_batches, train_world_model, evaluate_world_model_planning")

‚úÖ World Models Implementation Imported!
Components available:
- RSSMCore: Recurrent state space model with deterministic/stochastic components
- WorldModel: Complete world model with encoder/decoder and predictors
- MPCPlanner: Cross-entropy method planner for action sequence optimization
- ImaginationAugmentedAgent: I2A-style agent combining model-free and imagination


In [None]:

print("üöÄ Starting Exercise 1: World Models Training and Evaluation")
print("="*70)

env = create_world_model_environment()
print(f"Environment: {env.state_dim}D state, {env.action_dim}D action")

random_data = collect_random_data(env, n_episodes=50)
print(f"Collected {len(random_data['observations'])} episodes")

training_batches = create_training_batches(random_data, batch_size=16, seq_length=15)
print(f"Created {len(training_batches)} training batches")

world_model = WorldModel(
    obs_dim=env.state_dim,
    action_dim=env.action_dim,
    state_dim=20,
    hidden_dim=100,
    embed_dim=256
).to(device)

print(f"World model parameters: {sum(p.numel() for p in world_model.parameters()):,}")

training_losses = train_world_model(world_model, training_batches, n_epochs=30)

planner = MPCPlanner(
    world_model=world_model,
    action_dim=env.action_dim,
    horizon=8,
    n_candidates=500,
    n_iterations=5,
    n_elite=50
)

planning_rewards = evaluate_world_model_planning(env, world_model, planner, n_episodes=10)

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

ax1.plot(training_losses['total'], label='Total Loss')
ax1.plot(training_losses['reconstruction'], label='Reconstruction')
ax1.plot(training_losses['reward'], label='Reward Prediction')
ax1.set_title('World Model Training Losses')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.legend()
ax1.grid(True)

ax2.plot(training_losses['kl'])
ax2.set_title('KL Divergence Loss')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('KL Loss')
ax2.grid(True)

ax3.bar(range(len(planning_rewards)), planning_rewards, alpha=0.7)
ax3.set_title('MPC Planning Episode Rewards')
ax3.set_xlabel('Episode')
ax3.set_ylabel('Total Reward')
ax3.grid(True)

ax4.hist(planning_rewards, bins=5, alpha=0.7, edgecolor='black')
ax4.axvline(np.mean(planning_rewards), color='red', linestyle='--', 
           label=f'Mean: {np.mean(planning_rewards):.2f}')
ax4.set_title('Reward Distribution')
ax4.set_xlabel('Episode Reward')
ax4.set_ylabel('Frequency')
ax4.legend()
ax4.grid(True)

plt.tight_layout()
plt.show()

print("\n‚úÖ Exercise 1 Complete!")
print("Key learnings:")
print("- World models can learn environment dynamics from observation sequences")
print("- MPC planning uses learned models for lookahead decision making")
print("- RSSM balances deterministic and stochastic state evolution")
print("- Imagination enables sample-efficient learning through internal simulation")


üöÄ Starting Exercise 1: World Models Training and Evaluation
Environment: 4D state, 2D action
Collecting 50 episodes of random data...
Episode 0/50
Episode 20/50
Episode 40/50
Collected 50 episodes
Created 49 training batches
World model parameters: 714,190
Training world model for 30 epochs...


RuntimeError: input has inconsistent input_size: got 22 expected 278

# Part Ii: Multi-agent Deep Reinforcement Learning

## üë• Theoretical Foundation

### Introduction to Multi-agent Rl

**Multi-Agent Reinforcement Learning (MARL)** extends single-agent RL to environments with multiple learning agents. This creates fundamentally new challenges due to **non-stationarity** - each agent's environment changes as other agents learn and adapt their policies.

### Core Challenges in Marl

#### 1. Non-stationarity Problem
- **Single-Agent RL**: Environment is stationary (fixed transition dynamics)
- **Multi-Agent RL**: Environment is non-stationary (other agents change their behavior)
- **Consequence**: Standard RL convergence guarantees no longer hold

#### 2. Credit Assignment Problem
- **Challenge**: Which agent is responsible for team success/failure?
- **Example**: In cooperative tasks, global reward must be decomposed
- **Solutions**: Difference rewards, counterfactual reasoning, attention mechanisms

#### 3. Scalability Issues
- **Joint Action Space**: Grows exponentially with number of agents
- **Joint Observation Space**: Exponential growth in state complexity
- **Communication**: Bandwidth limitations, partial observability

#### 4. Coordination Vs Competition
- **Cooperative**: Agents share common objectives (team sports, rescue operations)
- **Competitive**: Agents have opposing objectives (adversarial games, auctions)
- **Mixed-Motive**: Combination of cooperation and competition (negotiation, markets)

### Game Theoretic Foundations

#### Nash Equilibrium
A strategy profile where no agent can unilaterally improve by changing strategy:

$$\pi^**i \in \arg\max*{\pi*i} J*i(\pi*i, \pi^**{-i})$$

where $\pi^*_{-i}$ represents the strategies of all agents except $i$.

#### Solution Concepts
1. **Nash Equilibrium**: Stable but not necessarily optimal
2. **Pareto Optimal**: Efficient outcomes that cannot be improved for all agents
3. **Correlated Equilibrium**: Allows for coordination through external signals
4. **Stackelberg Equilibrium**: Leader-follower dynamics

### Marl Algorithm Categories

#### 1. Independent Learning (il)
Each agent treats others as part of the environment:
- **Pros**: Simple, scalable, no communication needed
- **Cons**: No convergence guarantees, ignores other agents' adaptation
- **Examples**: Independent Q-learning, Independent Actor-Critic

#### 2. Joint Action Learning (jal)
Agents learn joint action-value functions:
- **Pros**: Can achieve coordination, theoretically sound
- **Cons**: Exponential complexity in number of agents
- **Examples**: Multi-Agent Q-learning, Nash-Q learning

#### 3. Agent Modeling (am)
Agents maintain models of other agents:
- **Pros**: Handles non-stationarity explicitly
- **Cons**: Computational overhead, modeling errors
- **Examples**: MAAC, MADDPG with opponent modeling

#### 4. Communication-based
Agents can exchange information:
- **Pros**: Direct coordination, shared knowledge
- **Cons**: Communication overhead, protocol design
- **Examples**: CommNet, I2C, TarMAC

### Deep Marl Algorithms

#### 1. Multi-agent Deep Deterministic Policy Gradient (maddpg)

**Key Idea**: Centralized training, decentralized execution
- **Training**: Critics have access to all agents' observations and actions
- **Execution**: Actors only use local observations

**Actor Update**: 
$$\nabla*{\theta*i} J*i = \mathbb{E}[\nabla*{\theta*i} \mu*i(o*i) \nabla*{a*i} Q*i^{\mu}(x, a*1, ..., a*N)|*{a*i=\mu*i(o*i)}]$$

**Critic Update**:
$$Q*i^{\mu}(x, a*1, ..., a*N) = \mathbb{E}[r*i + \gamma Q*i^{\mu'}(x', a'*1, ..., a'_N)]$$

where $x$ is the global state and $a_i$ are individual actions.

#### 2. Multi-agent Actor-critic (maac)

Extends single-agent AC to multi-agent setting:
- **Centralized Critic**: Uses global information during training
- **Decentralized Actors**: Use only local observations
- **Attention Mechanism**: Selectively focus on relevant agents

#### 3. Counterfactual Multi-agent Policy Gradient (coma)

Addresses credit assignment through counterfactual reasoning:

**Counterfactual Advantage**:
$$A*i(s, a) = Q(s, a) - \sum*{a'*i} \pi*i(a'*i|o*i) Q(s, (a*{-i}, a'*i))$$

This measures how much better the taken action is compared to marginalizing over all possible actions.

### Communication in Marl

#### 1. Communication Protocols
- **Broadcast**: All-to-all communication
- **Targeted**: Agent-specific messages
- **Hierarchical**: Tree-structured communication

#### 2. Communication Learning
- **What to Communicate**: Message content learning
- **When to Communicate**: Communication scheduling
- **Who to Communicate With**: Network topology learning

#### 3. Differentiable Communication

**Gumbel-Softmax Trick** for discrete communication:
$$\text{softmax}\left(\frac{\log(\pi*i) + G*i}{\tau}\right)$$

where $G_i$ are Gumbel random variables and $\tau$ is temperature.

### Cooperative Multi-agent Rl

#### 1. Team Reward Structure
- **Global Reward**: Same reward for all agents
- **Local Rewards**: Individual agent rewards
- **Shaped Rewards**: Carefully designed to promote cooperation

#### 2. Value Decomposition Methods

**VDN (Value Decomposition Networks)**:
$$Q*{tot}(s, a) = \sum*{i=1}^n Q*i(s*i, a_i)$$

**QMIX**: Monotonic value decomposition
$$\frac{\partial Q*{tot}}{\partial Q*i} \geq 0$$

#### 3. Policy Gradient Methods
- **Multi-Agent Policy Gradient (MAPG)**
- **Trust Region Methods**: MADDPG-TR
- **Proximal Policy Optimization**: MAPPO

### Competitive Multi-agent Rl

#### 1. Self-play Training
Agents learn by playing against copies of themselves:
- **Advantages**: Always improving opponents, no human data needed
- **Challenges**: Exploitability, strategy diversity

#### 2. Population-based Training
Maintain population of diverse strategies:
- **League Play**: Different skill levels and strategies
- **Diversity Metrics**: Behavioral diversity, policy diversity
- **Meta-Game Analysis**: Strategy effectiveness matrix

#### 3. Adversarial Training
- **Minimax Objective**: $\min*{\pi*1} \max*{\pi*2} J(\pi*1, \pi*2)$
- **Nash-AC**: Nash equilibrium seeking
- **PSRO**: Policy Space Response Oracles

### Theoretical Guarantees

#### 1. Convergence Results
- **Independent Learning**: Generally no convergence guarantees
- **Joint Action Learning**: Convergence to Nash under restrictive assumptions
- **Two-Timescale Algorithms**: Convergence through different learning rates

#### 2. Sample Complexity
Multi-agent sample complexity often exponentially worse than single-agent due to:
- Larger state-action spaces
- Non-stationarity
- Coordination requirements

#### 3. Regret Bounds
**Multi-Agent Regret**: 
$$R*i(T) = \max*{\pi*i} \sum*{t=1}^T J*i(\pi*i, \pi*{-i}^t) - \sum*{t=1}^T J*i(\pi*i^t, \pi_{-i}^t)$$

### Applications

#### 1. Robotics
- **Multi-Robot Systems**: Coordination and task allocation
- **Swarm Robotics**: Large-scale coordination
- **Human-Robot Interaction**: Mixed human-AI teams

#### 2. Autonomous Vehicles
- **Traffic Management**: Intersection control, highway merging
- **Platooning**: Vehicle following and coordination
- **Mixed Autonomy**: Human and autonomous vehicles

#### 3. Game Playing
- **Real-Time Strategy Games**: StarCraft, Dota
- **Board Games**: Multi-player poker, diplomacy
- **Sports Simulation**: Team coordination

#### 4. Economics and Finance
- **Algorithmic Trading**: Multi-agent market making
- **Auction Design**: Bidding strategies
- **Resource Allocation**: Cloud computing, network resources

### Key Research Papers

1. **MADDPG** (Lowe et al., 2017)
2. **COMA** (Foerster et al., 2018)
3. **QMIX** (Rashid et al., 2018)
4. **CommNet** (Sukhbaatar et al., 2016)
5. **OpenAI Five** (OpenAI, 2019)
6. **AlphaStar** (Vinyals et al., 2019)
   

In [3]:
# Import Multi-Agent RL implementations
from multi_agent_rl.multi_agent_rl import (
    MultiAgentReplayBuffer, Actor, Critic, AttentionCritic, 
    CommunicationNetwork, MADDPGAgent, MultiAgentEnvironment
)

print("‚úÖ Multi-Agent RL Implementation Imported!")
print("Components available:")
print("- MultiAgentReplayBuffer: Experience storage for multi-agent systems")
print("- Actor/Critic: Individual agent networks with centralized training")
print("- AttentionCritic: Attention mechanism for selective agent focus")
print("- CommunicationNetwork: Neural communication between agents")
print("- MADDPGAgent: Complete MADDPG implementation with extensions")
print("- MultiAgentEnvironment: Configurable multi-agent test environment")

‚úÖ Multi-Agent RL Implementation Imported!
Components available:
- MultiAgentReplayBuffer: Experience storage for multi-agent systems
- Actor/Critic: Individual agent networks with centralized training
- AttentionCritic: Attention mechanism for selective agent focus
- CommunicationNetwork: Neural communication between agents
- MADDPGAgent: Complete MADDPG implementation with extensions
- MultiAgentEnvironment: Configurable multi-agent test environment


In [None]:

def train_maddpg(env: MultiAgentEnvironment, agents: List[MADDPGAgent],
                 buffer: MultiAgentReplayBuffer, episodes: int = 1000,
                 batch_size: int = 64, update_interval: int = 4):
    """Train MADDPG agents"""
    
    episode_rewards = []
    losses = {f'agent_{i}': {'actor': [], 'critic': []} for i in range(env.n_agents)}
    
    for episode in range(episodes):
        obs = env.reset()
        episode_reward = np.zeros(env.n_agents)
        done = False
        step = 0
        
        while not done:
            messages = []
            if agents[0].use_communication:
                for i, agent in enumerate(agents):
                    obs_tensor = torch.FloatTensor(obs[i]).unsqueeze(0).to(device)
                    _, message = agent.act(obs_tensor, explore=True)
                    messages.append(message)
                messages = torch.stack(messages, dim=1)
            
            actions = np.zeros((env.n_agents, env.action_dim))
            for i, agent in enumerate(agents):
                obs_tensor = torch.FloatTensor(obs[i]).unsqueeze(0).to(device)
                
                agent_messages = None
                if agent.use_communication:
                    agent_messages = torch.cat([messages[:, :i], messages[:, i+1:]], dim=1)
                
                action_tensor, _ = agent.act(obs_tensor, agent_messages, explore=True)
                actions[i] = action_tensor.cpu().numpy()[0]
            
            next_obs, rewards, dones, _ = env.step(actions)
            
            buffer.add(obs, actions, rewards, next_obs, dones)
            
            episode_reward += rewards
            obs = next_obs
            done = np.all(dones)
            step += 1
            
            if buffer.size >= batch_size and step % update_interval == 0:
                batch = buffer.sample(batch_size)
                
                target_actors = [agent.actor_target for agent in agents]
                
                for i, agent in enumerate(agents):
                    update_info = agent.update(batch, target_actors)
                    losses[f'agent_{i}']['actor'].append(update_info['actor_loss'])
                    losses[f'agent_{i}']['critic'].append(update_info['critic_loss'])
        
        episode_rewards.append(episode_reward.copy())
        
        if episode % 100 == 0:
            mean_reward = np.mean([np.sum(r) for r in episode_rewards[-100:]])
            print(f"Episode {episode}, Mean Reward: {mean_reward:.2f}")
            for i in range(env.n_agents):
                noise_std = agents[i].noise_std
                print(f"  Agent {i}: Noise={noise_std:.3f}")
    
    return episode_rewards, losses


def evaluate_maddpg(env: MultiAgentEnvironment, agents: List[MADDPGAgent],
                   episodes: int = 100) -> Dict[str, float]:
    """Evaluate trained MADDPG agents"""
    
    episode_rewards = []
    coordination_scores = []
    
    for episode in range(episodes):
        obs = env.reset()
        episode_reward = np.zeros(env.n_agents)
        positions_history = []
        done = False
        
        while not done:
            messages = []
            if agents[0].use_communication:
                for i, agent in enumerate(agents):
                    obs_tensor = torch.FloatTensor(obs[i]).unsqueeze(0).to(device)
                    _, message = agent.act(obs_tensor, explore=False)
                    messages.append(message)
                messages = torch.stack(messages, dim=1)
            
            actions = np.zeros((env.n_agents, env.action_dim))
            for i, agent in enumerate(agents):
                obs_tensor = torch.FloatTensor(obs[i]).unsqueeze(0).to(device)
                
                agent_messages = None
                if agent.use_communication:
                    agent_messages = torch.cat([messages[:, :i], messages[:, i+1:]], dim=1)
                
                action_tensor, _ = agent.act(obs_tensor, agent_messages, explore=False)
                actions[i] = action_tensor.cpu().numpy()[0]
            
            next_obs, rewards, dones, _ = env.step(actions)
            
            episode_reward += rewards
            positions_history.append(env.agent_states[:, :2].copy())
            obs = next_obs
            done = np.all(dones)
        
        episode_rewards.append(episode_reward.copy())
        
        positions = np.array(positions_history)
        mean_positions = np.mean(positions, axis=1)  # [timesteps, 2]
        agent_variances = []
        
        for t in range(len(positions)):
            distances_from_center = [
                np.linalg.norm(positions[t, i] - mean_positions[t])
                for i in range(env.n_agents)
            ]
            agent_variances.append(np.var(distances_from_center))
        
        coordination_scores.append(np.mean(agent_variances))
    
    results = {
        'mean_total_reward': np.mean([np.sum(r) for r in episode_rewards]),
        'std_total_reward': np.std([np.sum(r) for r in episode_rewards]),
        'mean_individual_reward': np.mean(episode_rewards),
        'coordination_score': np.mean(coordination_scores),
        'success_rate': np.mean([np.sum(r) > 0 for r in episode_rewards])
    }
    
    return results


print("üöÄ Starting Multi-Agent RL Training...")

env_configs = [
    {'env_type': 'cooperative', 'name': 'Cooperative'},
    {'env_type': 'competitive', 'name': 'Competitive'},
    {'env_type': 'mixed', 'name': 'Mixed'}
]

results_summary = {}

for config in env_configs[:1]:  # Train on cooperative first
    print(f"\n{'='*50}")
    print(f"Training: {config['name']} Environment")
    print(f"{'='*50}")
    
    env = MultiAgentEnvironment(
        n_agents=3,
        obs_dim=6,
        action_dim=2,
        env_type=config['env_type']
    )
    
    agents = []
    for i in range(env.n_agents):
        agent = MADDPGAgent(
            agent_idx=i,
            obs_dim=env.obs_dim,
            action_dim=env.action_dim,
            n_agents=env.n_agents,
            use_attention=True,
            use_communication=True
        )
        agents.append(agent)
    
    buffer = MultiAgentReplayBuffer(
        capacity=50000,
        n_agents=env.n_agents,
        obs_dim=env.obs_dim,
        action_dim=env.action_dim
    )
    
    print("Training agents...")
    episode_rewards, losses = train_maddpg(
        env, agents, buffer,
        episodes=500,  # Reduced for demo
        batch_size=64
    )
    
    print("Evaluating agents...")
    eval_results = evaluate_maddpg(env, agents, episodes=50)
    
    results_summary[config['name']] = {
        'training_rewards': episode_rewards,
        'evaluation': eval_results,
        'losses': losses
    }
    
    print(f"\nResults for {config['name']} Environment:")
    print(f"Mean Total Reward: {eval_results['mean_total_reward']:.3f} ¬± {eval_results['std_total_reward']:.3f}")
    print(f"Mean Individual Reward: {eval_results['mean_individual_reward']:.3f}")
    print(f"Coordination Score: {eval_results['coordination_score']:.3f}")
    print(f"Success Rate: {eval_results['success_rate']:.3f}")

print("\n‚úÖ Multi-Agent Training Complete!")


üöÄ Starting Multi-Agent RL Training...

Training: Cooperative Environment
Training agents...


RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.

# Part Iii: Causal Reinforcement Learning

## Theoretical Foundations

### Introduction to Causality in Rl

Causal Reinforcement Learning represents a paradigm shift from traditional correlation-based learning to understanding cause-effect relationships in sequential decision making. This approach addresses fundamental limitations in standard RL:

**Key Limitations of Standard RL:**
- **Spurious Correlations**: Agents may learn policies based on correlations that don't reflect true causal relationships
- **Distribution Shift**: Policies trained on specific environments may fail when deployed in different conditions
- **Sample Inefficiency**: Without causal understanding, agents require extensive exploration
- **Interpretability**: Standard RL policies are often black boxes without clear causal reasoning

### Causal Inference Framework

#### 1. Structural Causal Models (scms)

A Structural Causal Model is defined by a tuple $(U, V, F, P(U))$:

- **U**: Set of exogenous (external) variables
- **V**: Set of endogenous (internal) variables
- **F**: Set of functions $f*i$ where $V*i = f*i(PA*i, U_i)$
- **P(U)**: Probability distribution over exogenous variables

**Causal Graph Representation:**
```
Exogenous Variables (U) ‚Üí Endogenous Variables (V)
      ‚Üì                           ‚Üì
Environmental Factors    ‚Üí    Agent States/Actions
```

#### 2. Causal Hierarchy (pearl's Ladder)

**Level 1: Association** ($P(y|x)$)
- "What is the probability of Y given that we observe X?"
- Standard statistical/ML approaches operate here
- Example: "What's the probability of success given this policy?"

**Level 2: Intervention** ($P(y|do(x))$)
- "What is the probability of Y if we set X to a specific value?"
- Requires understanding of causal mechanisms
- Example: "What happens if we force the agent to take action A?"

**Level 3: Counterfactuals** ($P(y_x|x', y')$)
- "What would have happened if X had been different?"
- Enables reasoning about alternative scenarios
- Example: "Would the agent have succeeded if it had chosen a different action?"

### Causal Rl Mathematical Framework

#### 1. Causal Markov Decision Process (causal-mdp)

A Causal-MDP extends traditional MDPs with causal structure:

**Causal-MDP Definition:**
$$\mathcal{M}*C = \langle \mathcal{S}, \mathcal{A}, \mathcal{G}, T*C, R_C, \gamma \rangle$$

Where:
- $\mathcal{G}$: Causal graph over state variables
- $T_C$: Causal transition function respecting $\mathcal{G}$
- $R_C$: Causal reward function

**Causal Factorization:**
$$P(s*{t+1}|s*t, a*t) = \prod*{i=1}^{|\mathcal{S}|} P(s*{t+1}^i | PA*C(s*{t+1}^i), a*t)$$

#### 2. Interventional Policy Learning

**Interventional Value Function:**
$$V^{\pi}*{do(X=x)}(s) = \mathbb{E}\left[\sum*{t=0}^{\infty} \gamma^t R*t | S*0 = s, do(X=x), \pi\right]$$

**Causal Policy Gradient:**
$$\nabla*\theta J(\theta) = \mathbb{E}*{s \sim d^\pi, a \sim \pi*\theta}\left[\nabla*\theta \log \pi*\theta(a|s) \cdot \frac{\partial Q^{\pi}(s,a)}{\partial do(\pi*\theta)}\right]$$

#### 3. Counterfactual Reasoning in Rl

**Counterfactual Q-Function:**
$$Q*{CF}(s, a, s', a') = \mathbb{E}[R | S=s, A=a, S'*{do(A=a')} = s']$$

This captures: "What would the Q-value be if we had taken action $a'$ instead of $a$?"

### Causal Discovery in Rl

#### 1. Structure Learning

**Constraint-Based Methods:**
- Use conditional independence tests
- Build causal graph from statistical dependencies
- Example: PC Algorithm adapted for sequential data

**Score-Based Methods:**
- Optimize causal graph structure score
- Balance model fit with complexity
- Example: BIC score with causal constraints

#### 2. Causal Effect Estimation

**Backdoor Criterion:**
For estimating causal effect of action $A$ on reward $R$:
$$P(R|do(A)) = \sum_z P(R|A,Z) P(Z)$$

Where $Z$ blocks all backdoor paths from $A$ to $R$.

**Front-door Criterion:**
When backdoor adjustment isn't possible:
$$P(R|do(A)) = \sum*m P(M|A) \sum*{a'} P(R|A',M) P(A')$$

### Advanced Causal Rl Techniques

#### 1. Causal World Models

**Causal Representation Learning:**
Learn latent representations that respect causal structure:
$$z*{t+1} = f*c(z*t, a*t, u_t)$$

Where $f_c$ respects the causal graph structure.

**Interventional Consistency:**
$$\mathbb{E}[z*{t+1} | do(z*t^i = v)] = \mathbb{E}[f*c(z*t^{-i}, v, a*t, u*t)]$$

#### 2. Causal Meta-learning

**Task-Invariant Causal Features:**
Learn features that are causally relevant across tasks:
$$\phi^*(s) = \arg\min*\phi \sum*{T} L_T(\phi(s)) + \lambda \cdot \text{Causal-Reg}(\phi)$$

**Causal Transfer:**
Transfer causal knowledge between domains:
$$\pi*{new}(a|s) = \pi*{old}(a|\phi_{causal}(s))$$

#### 3. Confounded Rl

**Hidden Confounders:**
When unobserved variables affect both states and rewards:
$$H*t \rightarrow S*t, H*t \rightarrow R*t$$

**Instrumental Variables:**
Use variables correlated with actions but not directly with outcomes:
$$IV \rightarrow A*t \not\rightarrow R*t$$

### Applications and Benefits

#### 1. Robust Policy Learning
- Policies that generalize across environments
- Reduced sensitivity to spurious correlations
- Better performance under distribution shift

#### 2. Sample Efficient Exploration
- Focus exploration on causally relevant factors
- Avoid learning from misleading correlations
- Faster convergence to optimal policies

#### 3. Interpretable Decision Making
- Understand why certain actions are taken
- Provide causal explanations for policy decisions
- Enable human oversight and validation

#### 4. Safe Rl Applications
- Predict consequences of interventions
- Avoid actions with negative causal effects
- Enable counterfactual safety analysis

### Research Challenges

#### 1. Causal Discovery
- Identifying causal structure from observational RL data
- Handling non-stationarity and temporal dependencies
- Scalability to high-dimensional state spaces

#### 2. Identifiability
- When can causal effects be estimated from data?
- Addressing unmeasured confounders
- Validation of causal assumptions

#### 3. Computational Complexity
- Efficient inference in causal graphical models
- Scalable algorithms for large state spaces
- Real-time causal reasoning during policy execution
   

In [4]:
# Import Causal RL implementations
from causal_rl.causal_rl import (
    CausalGraph, CausalDiscovery, CausalWorldModel, 
    InterventionalDataset, CausalPolicyGradient
)

print("‚úÖ Causal RL Implementation Imported!")
print("Components available:")
print("- CausalGraph: Causal relationship representation and d-separation")
print("- CausalDiscovery: PC algorithm for causal structure learning")
print("- CausalWorldModel: World model with causal structure constraints")
print("- InterventionalDataset: Dataset with observational/interventional data")
print("- CausalPolicyGradient: Policy gradient with causal regularization")

‚úÖ Causal RL Implementation Imported!
Components available:
- CausalGraph: Causal relationship representation and d-separation
- CausalDiscovery: PC algorithm for causal structure learning
- CausalWorldModel: World model with causal structure constraints
- InterventionalDataset: Dataset with observational/interventional data
- CausalPolicyGradient: Policy gradient with causal regularization


In [None]:
def create_causal_environment():
    """Create a simple environment with causal structure for demonstration"""
    
    class CausalEnvironment:
        def __init__(self):
            # Causal graph: X -> Y -> Z, with confounding
            self.state_dim = 3  # [X, Y, Z]
            self.action_dim = 1
            self.max_steps = 100
            
            # Causal parameters
            self.x_noise = 0.1
            self.y_noise = 0.05
            self.z_noise = 0.02
            
            self.reset()
        
        def reset(self):
            # Initialize with causal dependencies
            self.x = np.random.normal(0, 1)
            self.y = self.x + np.random.normal(0, self.y_noise)
            self.z = self.y + np.random.normal(0, self.z_noise)
            
            self.steps = 0
            return np.array([self.x, self.y, self.z])
        
        def step(self, action):
            action = np.clip(action, -1, 1)[0]
            
            # Causal transitions
            self.x += action * 0.1 + np.random.normal(0, self.x_noise)
            self.y = self.x + action * 0.05 + np.random.normal(0, self.y_noise)
            self.z = self.y + np.random.normal(0, self.z_noise)
            
            # Reward based on causal understanding
            reward = -abs(self.z) - 0.1 * abs(action)
            
            self.steps += 1
            done = self.steps >= self.max_steps
            
            return np.array([self.x, self.y, self.z]), reward, done, {}
    
    return CausalEnvironment()

def demonstrate_causal_discovery(env, n_samples=1000):
    """Demonstrate causal structure learning from observational data"""
    
    print("üîç Learning Causal Structure from Observational Data")
    
    # Collect observational data
    observations = []
    actions = []
    
    obs = env.reset()
    for _ in range(n_samples):
        action = np.random.uniform(-1, 1, 1)
        next_obs, reward, done, _ = env.step(action)
        
        observations.append(obs)
        actions.append(action)
        
        obs = next_obs
        if done:
            obs = env.reset()
    
    observations = np.array(observations)
    actions = np.array(actions)
    
    # Create causal discovery object
    causal_discovery = CausalDiscovery(
        variables=['X', 'Y', 'Z', 'A'],
        alpha=0.05
    )
    
    # Prepare data for causal discovery
    data = np.column_stack([observations, actions])
    
    # Learn causal graph
    graph = causal_discovery.discover_structure(data)
    
    print(f"Discovered causal graph with {len(graph.edges)} edges:")
    for edge in graph.edges:
        print(f"  {edge[0]} ‚Üí {edge[1]}")
    
    return graph, data

def train_causal_world_model(env, graph, data, n_epochs=100):
    """Train a world model that respects causal structure"""
    
    print("üèóÔ∏è Training Causal World Model")
    
    # Create causal world model
    world_model = CausalWorldModel(
        graph=graph,
        obs_dim=3,
        action_dim=1,
        hidden_dim=64,
        latent_dim=16
    )
    
    optimizer = torch.optim.Adam(world_model.parameters(), lr=1e-3)
    
    losses = {'total': [], 'reconstruction': [], 'causal': []}
    
    batch_size = 32
    n_batches = len(data) // batch_size
    
    for epoch in range(n_epochs):
        epoch_losses = {'total': 0, 'reconstruction': 0, 'causal': 0}
        
        # Shuffle data
        indices = np.random.permutation(len(data))
        
        for i in range(n_batches):
            batch_indices = indices[i*batch_size:(i+1)*batch_size]
            batch_data = data[batch_indices]
            
            obs_batch = torch.FloatTensor(batch_data[:, :3]).to(device)
            action_batch = torch.FloatTensor(batch_data[:, 3:]).to(device)
            
            # Forward pass
            output = world_model(obs_batch, action_batch)
            
            # Losses
            recon_loss = F.mse_loss(output['reconstruction'], obs_batch)
            causal_loss = output['causal_constraint']
            total_loss = recon_loss + 0.1 * causal_loss
            
            # Backward pass
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()
            
            epoch_losses['total'] += total_loss.item()
            epoch_losses['reconstruction'] += recon_loss.item()
            epoch_losses['causal'] += causal_loss.item()
        
        # Average losses
        for key in epoch_losses:
            epoch_losses[key] /= n_batches
            losses[key].append(epoch_losses[key])
        
        if epoch % 20 == 0:
            print(f"Epoch {epoch}: Total={epoch_losses['total']:.4f}, "
                  f"Recon={epoch_losses['reconstruction']:.4f}, "
                  f"Causal={epoch_losses['causal']:.4f}")
    
    return world_model, losses

def demonstrate_interventional_reasoning(world_model, env):
    """Demonstrate interventional reasoning capabilities"""
    
    print("üî¨ Demonstrating Interventional Reasoning")
    
    # Test interventions on different variables
    interventions = {
        'X': lambda obs: np.array([1.0, obs[1], obs[2]]),  # Force X to 1.0
        'Y': lambda obs: np.array([obs[0], 0.5, obs[2]]),  # Force Y to 0.5
        'Z': lambda obs: np.array([obs[0], obs[1], -0.2])   # Force Z to -0.2
    }
    
    results = {}
    
    for var, intervention_fn in interventions.items():
        print(f"\nIntervening on {var}:")
        
        # Get baseline prediction
        obs = env.reset()
        obs_tensor = torch.FloatTensor(obs).unsqueeze(0).to(device)
        action_tensor = torch.zeros(1, 1).to(device)
        
        with torch.no_grad():
            baseline_pred = world_model(obs_tensor, action_tensor)['reconstruction'].cpu().numpy()[0]
        
        # Apply intervention
        intervened_obs = intervention_fn(obs)
        intervened_tensor = torch.FloatTensor(intervened_obs).unsqueeze(0).to(device)
        
        with torch.no_grad():
            intervened_pred = world_model.predict_intervention(
                intervened_tensor, action_tensor
            ).cpu().numpy()[0]
        
        print(f"  Baseline: {baseline_pred}")
        print(f"  After intervention: {intervened_pred}")
        print(f"  Change: {intervened_pred - baseline_pred}")
        
        results[var] = {
            'baseline': baseline_pred,
            'intervened': intervened_pred,
            'change': intervened_pred - baseline_pred
        }
    
    return results

print("üöÄ Starting Exercise 2: Causal Reinforcement Learning")
print("="*70)

# Create causal environment
env = create_causal_environment()
print(f"Environment: {env.state_dim}D state, {env.action_dim}D action")

# Demonstrate causal discovery
graph, data = demonstrate_causal_discovery(env, n_samples=500)

# Train causal world model
world_model, training_losses = train_causal_world_model(env, graph, data, n_epochs=50)

# Demonstrate interventional reasoning
intervention_results = demonstrate_interventional_reasoning(world_model, env)

# Visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

ax1.plot(training_losses['total'], label='Total Loss')
ax1.plot(training_losses['reconstruction'], label='Reconstruction')
ax1.plot(training_losses['causal'], label='Causal Constraint')
ax1.set_title('Causal World Model Training Losses')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.legend()
ax1.grid(True)

ax2.bar(range(len(intervention_results)), 
        [r['change'][0] for r in intervention_results.values()],
        alpha=0.7, color='blue', label='X change')
ax2.bar(range(len(intervention_results)), 
        [r['change'][1] for r in intervention_results.values()],
        alpha=0.7, color='orange', label='Y change', bottom=[r['change'][0] for r in intervention_results.values()])
ax2.bar(range(len(intervention_results)), 
        [r['change'][2] for r in intervention_results.values()],
        alpha=0.7, color='green', label='Z change', 
        bottom=[r['change'][0] + r['change'][1] for r in intervention_results.values()])
ax2.set_title('Effects of Interventions')
ax2.set_xlabel('Intervention Type')
ax2.set_ylabel('State Variable Changes')
ax2.set_xticks(range(len(intervention_results)))
ax2.set_xticklabels(list(intervention_results.keys()))
ax2.legend()
ax2.grid(True)

# Causal graph visualization (simplified)
ax3.axis('off')
ax3.text(0.5, 0.5, 'Causal Graph:\nX ‚Üí Y ‚Üí Z\nA ‚Üí X', 
         ha='center', va='center', fontsize=12, 
         bbox=dict(boxstyle="round,pad=0.3", facecolor="lightblue"))
ax3.set_title('Learned Causal Structure')

ax4.hist(data[:, 0], bins=30, alpha=0.7, label='X', density=True)
ax4.hist(data[:, 1], bins=30, alpha=0.7, label='Y', density=True)
ax4.hist(data[:, 2], bins=30, alpha=0.7, label='Z', density=True)
ax4.set_title('Observational Data Distributions')
ax4.set_xlabel('Value')
ax4.set_ylabel('Density')
ax4.legend()
ax4.grid(True)

plt.tight_layout()
plt.show()

print("\n‚úÖ Exercise 2 Complete!")
print("Key learnings:")
print("- Causal discovery can identify relationships from observational data")
print("- Causal world models respect structural constraints during learning")
print("- Interventions allow testing 'what-if' scenarios")
print("- Causal understanding improves robustness and interpretability")

# Part Iv: Quantum Reinforcement Learning

## Theoretical Foundations

### Introduction to Quantum Computing for Rl

Quantum Reinforcement Learning (QRL) leverages quantum mechanical phenomena to enhance reinforcement learning algorithms. This emerging field promises exponential speedups for certain RL problems and enables exploration of vast state spaces that are intractable for classical computers.

**Key Quantum Phenomena:**
- **Superposition**: Quantum states can exist in multiple states simultaneously
- **Entanglement**: Quantum systems can be correlated in non-classical ways
- **Interference**: Quantum amplitudes can interfere constructively or destructively
- **Quantum Parallelism**: Process multiple inputs simultaneously

### Quantum Computing Fundamentals

#### 1. Quantum State Representation

**Qubit State:**
$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$$

Where $|\alpha|^2 + |\beta|^2 = 1$ and $\alpha, \beta \in \mathbb{C}$.

**Multi-qubit System:**
$$|\psi\rangle = \sum*{i=0}^{2^n-1} \alpha*i |i\rangle$$

For $n$ qubits with $\sum*{i=0}^{2^n-1} |\alpha*i|^2 = 1$.

#### 2. Quantum Operations

**Quantum Gates:**
- **Pauli-X**: $X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ (Bit flip)
- **Pauli-Y**: $Y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}$
- **Pauli-Z**: $Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$ (Phase flip)
- **Hadamard**: $H = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$ (Superposition)

**Rotation Gates:**
$$R_x(\theta) = \begin{pmatrix} \cos(\theta/2) & -i\sin(\theta/2) \\ -i\sin(\theta/2) & \cos(\theta/2) \end{pmatrix}$$

$$R_y(\theta) = \begin{pmatrix} \cos(\theta/2) & -\sin(\theta/2) \\ \sin(\theta/2) & \cos(\theta/2) \end{pmatrix}$$

#### 3. Quantum Measurement

**Born Rule:**
$$P(|i\rangle) = |\langle i | \psi \rangle|^2$$

The probability of measuring state $|i\rangle$ from state $|\psi\rangle$.

### Quantum Reinforcement Learning Framework

#### 1. Quantum Mdp (qmdp)

**Quantum State Space:**
States are represented as quantum states in Hilbert space $\mathcal{H}$:
$$|\psi*s\rangle \in \mathcal{H}, \quad \langle\psi*s|\psi_s\rangle = 1$$

**Quantum Action Space:**
Actions correspond to unitary operations:
$$\mathcal{A} = \{U*a : U*a^\dagger U_a = I\}$$

**Quantum Transition Dynamics:**
$$|\psi*{t+1}\rangle = U*{a*t} |\psi*t\rangle \otimes |\text{env}_t\rangle$$

#### 2. Quantum Value Functions

**Quantum Q-Function:**
$$Q(|\psi\rangle, U*a) = \langle\psi| U*a^\dagger \hat{R} U_a |\psi\rangle + \gamma \mathbb{E}[V(|\psi'\rangle)]$$

Where $\hat{R}$ is the reward operator.

**Quantum Bellman Equation:**
$$\hat{V}|\psi\rangle = \max*{U*a} \left(\hat{R}U*a|\psi\rangle + \gamma \sum*{|\psi'\rangle} P(|\psi'\rangle||\psi\rangle, U_a) \hat{V}|\psi'\rangle\right)$$

#### 3. Quantum Policy Representation

**Parameterized Quantum Circuit (PQC):**
$$|\psi(\theta)\rangle = U*L(\theta*L) \cdots U*2(\theta*2) U*1(\theta*1) |\psi_0\rangle$$

Where each $U*i(\theta*i)$ is a parameterized unitary gate.

**Quantum Policy:**
$$\pi_\theta(a|s) = |\langle a | U(\theta) |s \rangle|^2$$

### Variational Quantum Algorithms for Rl

#### 1. Variational Quantum Eigensolver (vqe) for Value Functions

**Objective:**
$$\theta^* = \arg\min_\theta \langle\psi(\theta)| \hat{H} |\psi(\theta)\rangle$$

Where $\hat{H}$ encodes the RL problem structure.

**Gradient Calculation:**
$$\nabla_\theta f(\theta) = \frac{1}{2}[f(\theta + \pi/2) - f(\theta - \pi/2)]$$

#### 2. Quantum Approximate Optimization Algorithm (qaoa)

**QAOA Ansatz:**
$$|\psi(\gamma, \beta)\rangle = \prod*{p=1}^P U*B(\beta*p) U*C(\gamma*p) |\psi*0\rangle$$

Where:
- $U*C(\gamma) = \exp(-i\gamma \hat{H}*C)$ (Cost Hamiltonian)
- $U*B(\beta) = \exp(-i\beta \hat{H}*B)$ (Mixer Hamiltonian)

### Quantum Advantage in Rl

#### 1. Exponential State Space

**Classical Scaling:**
Memory: $O(2^n)$ for $n$-qubit states
Operations: $O(2^{2n})$ for general operations

**Quantum Scaling:**
Memory: $O(n)$ qubits
Operations: $O(poly(n))$ for many quantum algorithms

#### 2. Quantum Speedups

**Grover's Algorithm for RL:**
- Search optimal actions in $O(\sqrt{N})$ instead of $O(N)$
- Applicable to unstructured action spaces

**Quantum Walk for Exploration:**
- Quadratic speedup over classical random walk
- Enhanced exploration capabilities

**Shor's Algorithm Applications:**
- Factoring in cryptographic environments
- Period finding in periodic MDPs

### Quantum Machine Learning Integration

#### 1. Quantum Neural Networks (qnns)

**Quantum Perceptron:**
$$f(x) = \langle 0^{\otimes n} | U^\dagger(\theta) M U(\theta) |x\rangle$$

Where $U(\theta)$ is a parameterized quantum circuit and $M$ is a measurement operator.

**Quantum Convolutional Neural Networks:**
- Quantum convolution using local unitaries
- Translation equivariance in quantum feature maps

#### 2. Quantum Kernel Methods

**Quantum Feature Map:**
$$\Phi(x) = |\phi(x)\rangle = U_\phi(x)|0\rangle^{\otimes n}$$

**Quantum Kernel:**
$$K(x*i, x*j) = |\langle\phi(x*i)|\phi(x*j)\rangle|^2$$

Potentially exponential advantage in feature space dimension.

### Advanced Qrl Techniques

#### 1. Quantum Actor-critic

**Quantum Actor:**
$$\pi*\theta(a|s) = \text{Tr}[\Pi*a U*\theta(s) \rho*s U_\theta(s)^\dagger]$$

Where $\Pi_a$ is the projector onto action $a$.

**Quantum Critic:**
$$V*\phi(s) = \text{Tr}[\hat{V}*\phi \rho_s]$$

**Quantum Policy Gradient:**
$$\nabla*\theta J(\theta) = \sum*{s,a} \rho^\pi(s) \nabla*\theta \pi*\theta(a|s) Q^\pi(s,a)$$

#### 2. Quantum Experience Replay

**Quantum Superposition of Experiences:**
$$|\text{memory}\rangle = \frac{1}{\sqrt{N}} \sum*{i=1}^N |s*i, a*i, r*i, s_i'\rangle$$

**Quantum Sampling:**
Use quantum interference to bias sampling towards important experiences.

#### 3. Quantum Multi-agent Rl

**Entangled Agent States:**
$$|\psi*{\text{agents}}\rangle = \frac{1}{\sqrt{2}}(|\psi*1\rangle \otimes |\psi*2\rangle + |\psi*1'\rangle \otimes |\psi_2'\rangle)$$

**Quantum Communication:**
Agents share quantum information through entanglement.

### Quantum Error Correction in Qrl

#### 1. Noisy Intermediate-scale Quantum (nisq) Era

**Noise Models:**
- Decoherence: $\rho(t) = e^{-\Gamma t} \rho(0)$
- Gate errors: Imperfect unitary operations
- Measurement errors: Probabilistic bit flips

**Error Mitigation:**
- Zero noise extrapolation
- Error amplification and cancellation
- Probabilistic error cancellation

#### 2. Fault-tolerant Qrl

**Quantum Error Correction Codes:**
- Surface codes for topological protection
- Stabilizer codes for syndrome detection
- Logical qubit operations

### Applications and Use Cases

#### 1. Quantum Chemistry Rl
- Molecular dynamics simulation
- Drug discovery optimization
- Catalyst design

#### 2. Quantum Finance
- Portfolio optimization with quantum speedup
- Risk analysis using quantum simulation
- Quantum Monte Carlo for derivatives pricing

#### 3. Quantum Cryptography Rl
- Quantum key distribution protocols
- Post-quantum cryptography
- Quantum-safe communications

#### 4. Quantum Optimization
- Traffic flow optimization
- Supply chain management
- Resource allocation problems

### Current Limitations and Challenges

#### 1. Hardware Limitations
- Limited qubit count and coherence time
- High error rates in current quantum devices
- Connectivity constraints in quantum architectures

#### 2. Algorithmic Challenges
- Barren plateaus in quantum optimization
- Classical simulation for algorithm development
- Quantum advantage verification

#### 3. Practical Implementation
- Quantum software development complexity
- Integration with classical systems
- Scalability to real-world problems

### Future Directions

#### 1. Near-term Applications
- Hybrid classical-quantum algorithms
- NISQ-era quantum advantage demonstrations
- Quantum-enhanced machine learning

#### 2. Long-term Vision
- Fault-tolerant quantum RL systems
- Universal quantum learning machines
- Quantum artificial general intelligence

#### 3. Theoretical Advances
- Quantum learning theory foundations
- Quantum-classical complexity separations
- Novel quantum algorithms for RL
   

In [5]:
# Import Quantum RL implementations
from quantum_rl.quantum_rl import (
    QuantumState, QuantumGate, QuantumCircuit, VariationalQuantumCircuit,
    QuantumQLearning, QuantumActorCritic, QuantumEnvironment
)

print("‚úÖ Quantum RL Implementation Imported!")
print("Components available:")
print("- QuantumState: Quantum state representation with amplitudes")
print("- QuantumGate: Quantum gate operations (Pauli, Hadamard, rotations)")
print("- QuantumCircuit: Quantum circuit implementation")
print("- VariationalQuantumCircuit: Parameterized quantum circuits for ML")
print("- QuantumQLearning: Quantum-enhanced Q-learning")
print("- QuantumActorCritic: Quantum actor-critic algorithm")
print("- QuantumEnvironment: Quantum-inspired test environment")

‚úÖ Quantum RL Implementation Imported!
Components available:
- QuantumState: Quantum state representation with amplitudes
- QuantumGate: Quantum gate operations (Pauli, Hadamard, rotations)
- QuantumCircuit: Quantum circuit implementation
- VariationalQuantumCircuit: Parameterized quantum circuits for ML
- QuantumQLearning: Quantum-enhanced Q-learning
- QuantumActorCritic: Quantum actor-critic algorithm
- QuantumEnvironment: Quantum-inspired test environment


In [None]:
def create_quantum_environment():
    """Create a simple environment suitable for quantum RL demonstration"""
    
    class QuantumInspiredEnvironment:
        def __init__(self, n_qubits=2):
            self.n_qubits = n_qubits
            self.state_dim = 2**n_qubits  # Full quantum state representation
            self.action_dim = n_qubits    # Actions are rotations on each qubit
            self.max_steps = 50
            
            # Initialize quantum state (simplified classical representation)
            self.reset()
        
        def reset(self):
            # Start with |00...0> state
            self.quantum_state = np.zeros(2**self.n_qubits)
            self.quantum_state[0] = 1.0
            
            self.steps = 0
            return self.quantum_state.copy()
        
        def step(self, action):
            # Apply quantum gates (simplified)
            action = np.clip(action, -np.pi, np.pi)
            
            # Apply rotation gates
            for i, theta in enumerate(action):
                # Simplified rotation on qubit i
                self.apply_rotation(i, theta)
            
            # Add some quantum-inspired dynamics
            self.apply_quantum_noise()
            
            # Compute reward based on quantum state properties
            reward = self.compute_quantum_reward()
            
            self.steps += 1
            done = self.steps >= self.max_steps
            
            return self.quantum_state.copy(), reward, done, {}
        
        def apply_rotation(self, qubit_idx, theta):
            """Apply a rotation gate to a specific qubit"""
            cos_theta = np.cos(theta/2)
            sin_theta = np.sin(theta/2)
            
            # Simplified single-qubit rotation (affects computational basis)
            new_state = np.zeros_like(self.quantum_state)
            
            for i in range(len(self.quantum_state)):
                bit = (i >> qubit_idx) & 1
                if bit == 0:
                    # |0> -> cos(Œ∏/2)|0> - sin(Œ∏/2)|1>
                    new_state[i] += cos_theta * self.quantum_state[i]
                    flipped_i = i | (1 << qubit_idx)
                    new_state[flipped_i] -= sin_theta * self.quantum_state[i]
                else:
                    # |1> -> sin(Œ∏/2)|0> + cos(Œ∏/2)|1>
                    flipped_i = i & ~(1 << qubit_idx)
                    new_state[flipped_i] += sin_theta * self.quantum_state[i]
                    new_state[i] += cos_theta * self.quantum_state[i]
            
            self.quantum_state = new_state
        
        def apply_quantum_noise(self):
            """Add quantum-inspired noise"""
            # Dephasing-like noise
            noise_strength = 0.01
            phase_noise = np.random.normal(0, noise_strength, len(self.quantum_state))
            self.quantum_state *= np.exp(1j * phase_noise)
            self.quantum_state = np.real(self.quantum_state)  # Keep real for simplicity
        
        def compute_quantum_reward(self):
            """Compute reward based on quantum state properties"""
            # Reward for creating superposition and entanglement-like states
            probabilities = np.abs(self.quantum_state)**2
            
            # Entropy (reward for mixed states)
            entropy = -np.sum(probabilities * np.log(probabilities + 1e-10))
            
            # Reward for having multiple non-zero amplitudes
            n_nonzero = np.sum(probabilities > 0.01)
            
            # Penalty for being in computational basis states
            basis_penalty = -np.sum(probabilities[[0, -1]]**2)
            
            reward = entropy + 0.1 * n_nonzero + basis_penalty
            
            return reward
    
    return QuantumInspiredEnvironment()

def demonstrate_quantum_circuit():
    """Demonstrate basic quantum circuit operations"""
    
    print("üî¨ Quantum Circuit Demonstration")
    
    # Create quantum circuit
    circuit = QuantumCircuit(n_qubits=2)
    
    print("Initial state: |00>")
    print(f"State vector: {circuit.get_state_vector()}")
    
    # Apply Hadamard gates
    circuit.apply_gate(QuantumGate.hadamard(), 0)
    circuit.apply_gate(QuantumGate.hadamard(), 1)
    
    print("\nAfter H‚äóH: (|00> + |01> + |10> + |11>)/2")
    print(f"State vector: {circuit.get_state_vector()}")
    
    # Apply CNOT gate
    circuit.apply_cnot(0, 1)
    
    print("\nAfter CNOT(0,1): Creates entanglement")
    print(f"State vector: {circuit.get_state_vector()}")
    
    # Measure
    measurements = []
    for _ in range(1000):
        result = circuit.measure()
        measurements.append(result)
    
    print(f"\nMeasurement statistics (1000 shots):")
    unique, counts = np.unique(measurements, return_counts=True)
    for state, count in zip(unique, counts):
        print(f"  |{state:02b}>: {count/10:.1f}%")
    
    return circuit

def train_quantum_q_learning(env, n_episodes=200):
    """Train quantum-enhanced Q-learning"""
    
    print("üß† Training Quantum Q-Learning Agent")
    
    # Create quantum Q-learning agent
    agent = QuantumQLearning(
        n_qubits=env.n_qubits,
        action_dim=env.action_dim,
        learning_rate=0.1,
        discount_factor=0.95,
        exploration_rate=1.0,
        exploration_decay=0.995
    )
    
    episode_rewards = []
    exploration_rates = []
    
    for episode in range(n_episodes):
        obs = env.reset()
        episode_reward = 0
        done = False
        
        while not done:
            # Select action using quantum state
            action = agent.select_action(obs)
            
            # Take action in environment
            next_obs, reward, done, _ = env.step(action)
            
            # Update quantum Q-function
            agent.update(obs, action, reward, next_obs, done)
            
            obs = next_obs
            episode_reward += reward
        
        episode_rewards.append(episode_reward)
        exploration_rates.append(agent.exploration_rate)
        
        # Decay exploration
        agent.exploration_rate *= agent.exploration_decay
        
        if episode % 50 == 0:
            print(f"Episode {episode}: Reward={episode_reward:.2f}, "
                  f"Exploration={agent.exploration_rate:.3f}")
    
    return agent, episode_rewards, exploration_rates

def demonstrate_quantum_actor_critic(env):
    """Demonstrate quantum actor-critic algorithm"""
    
    print("üé≠ Training Quantum Actor-Critic Agent")
    
    # Create quantum actor-critic agent
    agent = QuantumActorCritic(
        n_qubits=env.n_qubits,
        action_dim=env.action_dim,
        actor_hidden_dim=32,
        critic_hidden_dim=32,
        learning_rate=1e-3
    )
    
    optimizer = torch.optim.Adam(agent.parameters(), lr=1e-3)
    
    n_episodes = 100
    episode_rewards = []
    
    for episode in range(n_episodes):
        obs = env.reset()
        episode_reward = 0
        done = False
        log_probs = []
        values = []
        rewards = []
        
        while not done:
            obs_tensor = torch.FloatTensor(obs).to(device)
            
            # Get action and value from quantum actor-critic
            action_dist, value = agent(obs_tensor)
            action = action_dist.sample()
            
            next_obs, reward, done, _ = env.step(action.cpu().numpy())
            
            log_probs.append(action_dist.log_prob(action))
            values.append(value)
            rewards.append(reward)
            
            obs = next_obs
            episode_reward += reward
        
        # Compute returns and advantages
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + 0.99 * G
            returns.insert(0, G)
        
        returns = torch.FloatTensor(returns).to(device)
        values = torch.cat(values)
        log_probs = torch.cat(log_probs)
        
        advantages = returns - values
        
        # Compute losses
        actor_loss = -(log_probs * advantages.detach()).mean()
        critic_loss = F.mse_loss(values, returns)
        total_loss = actor_loss + critic_loss
        
        # Update
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
        
        episode_rewards.append(episode_reward)
        
        if episode % 20 == 0:
            print(f"Episode {episode}: Reward={episode_reward:.2f}, "
                  f"Actor Loss={actor_loss:.4f}, Critic Loss={critic_loss:.4f}")
    
    return agent, episode_rewards

print("üöÄ Starting Exercise 3: Quantum Reinforcement Learning")
print("="*70)

# Demonstrate quantum circuit
circuit = demonstrate_quantum_circuit()

# Create quantum environment
env = create_quantum_environment(n_qubits=2)
print(f"\nEnvironment: {env.state_dim}D quantum state, {env.action_dim}D actions")

# Train quantum Q-learning
q_agent, q_rewards, exploration_rates = train_quantum_q_learning(env, n_episodes=100)

# Train quantum actor-critic
ac_agent, ac_rewards = demonstrate_quantum_actor_critic(env)

# Visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

ax1.plot(q_rewards, alpha=0.7, label='Quantum Q-Learning')
ax1.plot(ac_rewards, alpha=0.7, label='Quantum Actor-Critic')
ax1.set_title('Training Performance Comparison')
ax1.set_xlabel('Episode')
ax1.set_ylabel('Episode Reward')
ax1.legend()
ax1.grid(True)

ax2.plot(exploration_rates)
ax2.set_title('Exploration Rate Decay (Q-Learning)')
ax2.set_xlabel('Episode')
ax2.set_ylabel('Exploration Rate')
ax2.grid(True)

# Quantum state visualization (simplified)
final_state = env.reset()
for _ in range(10):
    action = q_agent.select_action(final_state)
    final_state, _, _, _ = env.step(action)

ax3.bar(range(len(final_state)), np.abs(final_state)**2, alpha=0.7)
ax3.set_title('Final Quantum State Probabilities')
ax3.set_xlabel('Computational Basis State')
ax3.set_ylabel('Probability')
ax3.grid(True)

ax4.hist(q_rewards[-50:], bins=10, alpha=0.7, label='Q-Learning')
ax4.hist(ac_rewards[-50:], bins=10, alpha=0.7, label='Actor-Critic')
ax4.set_title('Reward Distribution (Last 50 Episodes)')
ax4.set_xlabel('Episode Reward')
ax4.set_ylabel('Frequency')
ax4.legend()
ax4.grid(True)

plt.tight_layout()
plt.show()

print("\n‚úÖ Exercise 3 Complete!")
print("Key learnings:")
print("- Quantum circuits can represent complex state spaces efficiently")
print("- Quantum Q-learning leverages superposition for exploration")
print("- Quantum actor-critic combines quantum policies with classical critics")
print("- Quantum RL shows promise for problems with exponential state spaces")

# Part V: Integration and Advanced Applications

## Synthesis of Advanced Rl Paradigms

The four paradigms we've explored‚ÄîWorld Models, Multi-Agent RL, Causal RL, and Quantum RL‚Äîrepresent the cutting edge of reinforcement learning research. Each addresses fundamental limitations of traditional RL approaches:

### Paradigm Integration Matrix

| Aspect | World Models | Multi-Agent RL | Causal RL | Quantum RL |
|--------|-------------|----------------|-----------|------------|
| **Sample Efficiency** | ‚úì Via planning | ‚úì Via sharing | ‚úì Via causal structure | ‚úì Via superposition |
| **Interpretability** | ‚úì Via explicit models | ‚úì Via agent interaction | ‚úì Via causal graphs | ‚óê Via quantum states |
| **Scalability** | ‚óê Model complexity | ‚úì Distributed learning | ‚óê Structure discovery | ‚óê Quantum advantage |
| **Robustness** | ‚óê Model uncertainty | ‚úì Via diversity | ‚úì Via interventions | ‚óê Quantum decoherence |

### Hybrid Approaches

#### 1. Causal World Models
Combining causal structure discovery with world model learning:
```python
class CausalWorldModel:
    def **init**(self, causal*graph, dynamics*model):
        self.causal*graph = causal*graph
        self.dynamics*model = dynamics*model
    
    def predict_intervention(self, state, action, intervention):
        # Use causal graph to modify dynamics
        return self.dynamics*model.predict*with_intervention(
            state, action, intervention, self.causal_graph
        )
```

#### 2. Multi-agent Causal Rl
Agents learning shared causal structures:
```python
class MultiAgentCausalRL:
    def **init**(self, agents, shared*causal*graph):
        self.agents = agents
        self.shared*graph = shared*causal_graph
    
    def collective*structure*learning(self, experiences):
        # Pool experiences for better causal discovery
        return update*shared*causal_structure(experiences)
```

#### 3. Quantum Multi-agent Systems
Leveraging quantum entanglement for coordination:
```python
class QuantumMultiAgentSystem:
    def **init**(self, n*agents, n*qubits):
        self.entangled*state = create*entangled*state(n*agents, n_qubits)
    
    def quantum*coordination(self, local*observations):
        return quantum*communication*protocol(
            local*observations, self.entangled*state
        )
```

## Real-world Applications

### 1. Autonomous Vehicle Networks
- **World Models**: Environmental prediction and planning
- **Multi-Agent**: Vehicle coordination and traffic optimization
- **Causal RL**: Understanding cause-effect in traffic patterns
- **Quantum RL**: Optimization of large-scale traffic systems

### 2. Financial Trading Systems
- **World Models**: Market dynamics modeling
- **Multi-Agent**: Multi-market trading strategies
- **Causal RL**: Understanding causal relationships in market movements
- **Quantum RL**: Portfolio optimization with quantum advantage

### 3. Healthcare and Drug Discovery
- **World Models**: Patient trajectory modeling
- **Multi-Agent**: Multi-specialist treatment planning
- **Causal RL**: Understanding treatment causality
- **Quantum RL**: Molecular interaction simulation

### 4. Climate and Environmental Management
- **World Models**: Climate system modeling
- **Multi-Agent**: Multi-region policy coordination
- **Causal RL**: Climate intervention analysis
- **Quantum RL**: Large-scale environmental optimization

## Research Frontiers

### 1. Theoretical Foundations
- **Sample Complexity**: Unified bounds across paradigms
- **Convergence Guarantees**: Multi-paradigm learning stability
- **Transfer Learning**: Cross-paradigm knowledge transfer
- **Meta-Learning**: Learning to choose appropriate paradigms

### 2. Algorithmic Advances
- **Hybrid Architectures**: Seamless paradigm integration
- **Adaptive Switching**: Dynamic paradigm selection
- **Federated Learning**: Distributed multi-paradigm training
- **Continual Learning**: Lifelong multi-paradigm adaptation

### 3. Implementation Challenges
- **Computational Efficiency**: Scalable implementations
- **Hardware Acceleration**: Specialized computing architectures
- **Software Frameworks**: Unified development platforms
- **Validation Methods**: Multi-paradigm evaluation metrics

## Future Directions

### Near-term (2-5 Years)
1. **Practical Hybrid Systems**: Working implementations combining 2-3 paradigms
2. **Industry Applications**: Deployment in specific domains
3. **Standardization**: Common interfaces and evaluation protocols
4. **Education**: Curriculum integration and training programs

### Medium-term (5-10 Years)
1. **Theoretical Unification**: Mathematical frameworks spanning all paradigms
2. **Quantum Advantage**: Demonstrated speedups in real applications
3. **Autonomous Systems**: Self-improving multi-paradigm agents
4. **Societal Integration**: Widespread adoption across industries

### Long-term (10+ Years)
1. **Artificial General Intelligence**: Multi-paradigm foundations for AGI
2. **Quantum-Classical Convergence**: Seamless quantum-classical computing
3. **Causal Discovery Automation**: Fully automated causal structure learning
4. **Multi-Agent Societies**: Complex artificial societies with emergent behavior

## Conclusion

This comprehensive exploration of advanced Deep Reinforcement Learning paradigms demonstrates the rich landscape of modern RL research. Each paradigm offers unique advantages:

- **World Models** provide sample efficiency through learned dynamics
- **Multi-Agent RL** enables coordination and emergence in complex systems
- **Causal RL** offers interpretability and robustness through causal understanding
- **Quantum RL** promises exponential advantages through quantum computation

The future of reinforcement learning lies not in choosing a single paradigm, but in their thoughtful integration. By combining the strengths of each approach while mitigating their individual limitations, we can build AI systems that are:

- **More Sample Efficient**: Learning faster with less data
- **More Interpretable**: Providing clear reasoning for decisions
- **More Robust**: Handling distribution shifts and uncertainties
- **More Scalable**: Operating in complex, real-world environments

The implementations provided in this notebook serve as stepping stones toward more sophisticated systems. While simplified for educational purposes, they demonstrate the core concepts that will drive the next generation of AI systems.

As we advance toward artificial general intelligence, these paradigms will play crucial roles in creating AI systems that can understand, reason about, and operate effectively in our complex world. The journey from today's specialized RL agents to tomorrow's general AI systems will be paved with innovations across all these dimensions.

## Key Takeaways

1. **Paradigm Diversity**: Multiple approaches are needed for different aspects of intelligence
2. **Integration Benefits**: Hybrid systems outperform single-paradigm approaches
3. **Practical Applications**: Real-world deployment requires careful paradigm selection
4. **Ongoing Research**: Many open questions remain in each paradigm
5. **Future Potential**: The combination of these paradigms may enable breakthrough capabilities

The field of reinforcement learning continues to evolve rapidly, and staying at the forefront requires understanding both the fundamental principles and the cutting-edge advances represented by these paradigms. This notebook provides a foundation for further exploration and implementation of these exciting directions in AI research.

In [None]:
# Import all modules for integration demo
from world_models.world_models import RSSMCore, WorldModel, MPCPlanner
from multi_agent_rl.multi_agent_rl import MADDPGAgent, MultiAgentEnvironment
from causal_rl.causal_rl import CausalGraph, CausalDiscovery, CausalWorldModel
from quantum_rl.quantum_rl import QuantumState, QuantumCircuit, QuantumQLearning
from federated_rl.federated_rl import FederatedAgent, FederatedServer

def create_integrated_environment():
    """Create an environment that combines multiple paradigms"""
    
    class IntegratedEnvironment:
        def __init__(self):
            # Multi-agent setup
            self.n_agents = 2
            self.obs_dim = 6  # Shared state + private observations
            self.action_dim = 2
            
            # Causal structure
            self.causal_graph = CausalGraph(['x1', 'x2', 'y1', 'y2', 'z'])
            
            # Quantum state representation
            self.quantum_state = QuantumState(n_qubits=3)
            
            self.max_steps = 100
            self.reset()
        
        def reset(self):
            # Initialize with causal dependencies
            self.shared_state = np.random.normal(0, 1, 2)
            self.agent_states = np.random.normal(0, 1, (self.n_agents, 2))
            
            # Update causal graph
            self.causal_graph.update_state({
                'x1': self.shared_state[0],
                'x2': self.shared_state[1],
                'y1': self.agent_states[0, 0],
                'y2': self.agent_states[1, 0],
                'z': np.mean(self.agent_states[:, 1])
            })
            
            self.steps = 0
            return self.get_global_observation()
        
        def get_global_observation(self):
            """Get observation for all agents"""
            obs = []
            for i in range(self.n_agents):
                agent_obs = np.concatenate([
                    self.shared_state,
                    self.agent_states[i],
                    [self.causal_graph.get_effect('z', f'y{i+1}')]
                ])
                obs.append(agent_obs)
            return np.array(obs)
        
        def step(self, actions):
            actions = np.array(actions)
            
            # Apply actions with causal dependencies
            for i in range(self.n_agents):
                action_effect = actions[i] * 0.1
                
                # Causal influence on shared state
                self.shared_state += action_effect * 0.05
                
                # Agent-specific dynamics
                self.agent_states[i] += action_effect + np.random.normal(0, 0.1, 2)
            
            # Update causal graph
            self.causal_graph.update_state({
                'x1': self.shared_state[0],
                'x2': self.shared_state[1],
                'y1': self.agent_states[0, 0],
                'y2': self.agent_states[1, 0],
                'z': np.mean(self.agent_states[:, 1])
            })
            
            # Compute rewards (cooperative objective)
            coordination_reward = -np.linalg.norm(self.shared_state)
            individual_rewards = [-np.linalg.norm(actions[i]) * 0.1 for i in range(self.n_agents)]
            
            rewards = np.array([coordination_reward + individual_rewards[i] 
                              for i in range(self.n_agents)])
            
            self.steps += 1
            done = self.steps >= self.max_steps
            
            return self.get_global_observation(), rewards, done, {}
    
    return IntegratedEnvironment()

def demonstrate_paradigm_integration():
    """Demonstrate integration of multiple RL paradigms"""
    
    print("üîó Demonstrating Paradigm Integration")
    print("="*50)
    
    # Create integrated environment
    env = create_integrated_environment()
    print(f"Integrated Environment: {env.n_agents} agents, {env.obs_dim}D obs, {env.action_dim}D actions")
    
    # Initialize agents with different paradigms
    agents = []
    
    # Agent 1: Multi-agent with causal reasoning
    agent1 = MADDPGAgent(
        agent_idx=0,
        obs_dim=env.obs_dim,
        action_dim=env.action_dim,
        n_agents=env.n_agents,
        use_attention=True,
        use_communication=True
    )
    agents.append(('MADDPG+Causal', agent1))
    
    # Agent 2: Quantum-enhanced agent
    agent2 = QuantumQLearning(
        n_qubits=2,
        action_dim=env.action_dim,
        learning_rate=0.1,
        discount_factor=0.95
    )
    agents.append(('Quantum Q-Learning', agent2))
    
    # Training loop
    n_episodes = 50
    results = {name: [] for name, _ in agents}
    
    for episode in range(n_episodes):
        obs = env.reset()
        episode_rewards = np.zeros(env.n_agents)
        done = False
        
        while not done:
            actions = []
            
            for i, (name, agent) in enumerate(agents):
                if name == 'MADDPG+Causal':
                    # Multi-agent action selection
                    obs_tensor = torch.FloatTensor(obs[i]).unsqueeze(0).to(device)
                    action_tensor, _ = agent.act(obs_tensor, explore=True)
                    action = action_tensor.cpu().numpy()[0]
                else:  # Quantum agent
                    # Convert observation to quantum state
                    quantum_obs = obs[i][:2]  # Use first 2 dimensions
                    action = agent.select_action(quantum_obs)
                
                actions.append(action)
            
            next_obs, rewards, done, _ = env.step(actions)
            
            # Update agents
            for i, (name, agent) in enumerate(agents):
                if name == 'MADDPG+Causal':
                    # Store experience for MADDPG
                    pass  # Would need full replay buffer implementation
                else:  # Quantum agent
                    agent.update(obs[i][:2], actions[i], rewards[i], next_obs[i][:2], done)
            
            episode_rewards += rewards
            obs = next_obs
        
        # Record results
        for i, (name, _) in enumerate(agents):
            results[name].append(episode_rewards[i])
        
        if episode % 10 == 0:
            print(f"Episode {episode}: Agent1={episode_rewards[0]:.2f}, Agent2={episode_rewards[1]:.2f}")
    
    return results

def demonstrate_federated_learning():
    """Demonstrate federated RL across distributed agents"""
    
    print("\nüåê Demonstrating Federated Reinforcement Learning")
    print("="*50)
    
    # Create federated setup
    n_clients = 3
    clients = []
    
    for i in range(n_clients):
        client = FederatedAgent(
            client_id=i,
            obs_dim=4,
            action_dim=2,
            local_epochs=5
        )
        clients.append(client)
    
    server = FederatedServer(
        global_model_dim=100,  # Simplified
        n_clients=n_clients
    )
    
    # Simulate federated training
    n_rounds = 10
    global_rewards = []
    
    for round_num in range(n_rounds):
        print(f"\nFederated Round {round_num + 1}")
        
        # Client local training
        client_updates = []
        client_rewards = []
        
        for client in clients:
            # Simulate local environment (different for each client)
            local_env = create_integrated_environment()
            local_env.n_agents = 1  # Single agent per client
            
            # Local training
            local_reward = client.train_local(local_env, episodes=20)
            client_rewards.append(local_reward)
            
            # Generate update
            update = client.generate_update()
            client_updates.append(update)
        
        # Server aggregation
        global_model = server.aggregate_updates(client_updates)
        
        # Distribute to clients
        for client in clients:
            client.receive_global_model(global_model)
        
        avg_reward = np.mean(client_rewards)
        global_rewards.append(avg_reward)
        
        print(f"Average client reward: {avg_reward:.3f}")
    
    return global_rewards

def create_hybrid_agent():
    """Create a hybrid agent combining multiple paradigms"""
    
    class HybridAgent:
        def __init__(self, obs_dim, action_dim):
            self.obs_dim = obs_dim
            self.action_dim = action_dim
            
            # Components from different paradigms
            self.world_model = WorldModel(
                obs_dim=obs_dim,
                action_dim=action_dim,
                state_dim=16,
                hidden_dim=32,
                embed_dim=64
            )
            
            self.causal_reasoner = CausalDiscovery(
                variables=[f'obs_{i}' for i in range(obs_dim)] + 
                         [f'action_{i}' for i in range(action_dim)],
                alpha=0.1
            )
            
            self.quantum_processor = QuantumCircuit(n_qubits=2)
            
            # Classical policy network
            self.policy_net = nn.Sequential(
                nn.Linear(obs_dim + 16, 64),  # obs + world model state
                nn.ReLU(),
                nn.Linear(64, action_dim),
                nn.Tanh()
            ).to(device)
        
        def select_action(self, obs, explore=True):
            obs_tensor = torch.FloatTensor(obs).unsqueeze(0).to(device)
            
            # Get world model prediction
            with torch.no_grad():
                wm_output = self.world_model.observe_sequence(obs_tensor, 
                    torch.zeros(1, self.action_dim).to(device))
                wm_state = wm_output['states'][:, -1]  # Last state
            
            # Combine with observation
            combined_input = torch.cat([obs_tensor, wm_state], dim=-1)
            
            # Get action
            with torch.no_grad():
                action = self.policy_net(combined_input).cpu().numpy()[0]
            
            if explore:
                action += np.random.normal(0, 0.1, self.action_dim)
                action = np.clip(action, -1, 1)
            
            return action
        
        def update_causal_model(self, experience_batch):
            """Update causal understanding from experience"""
            # Extract causal relationships from recent experiences
            data = np.array([exp['obs'] + exp['action'] for exp in experience_batch])
            self.causal_reasoner.discover_structure(data)
    
    return HybridAgent(obs_dim=6, action_dim=2)

print("üöÄ Starting Exercise 4: Paradigm Integration and Advanced Applications")
print("="*80)

# Demonstrate paradigm integration
integration_results = demonstrate_paradigm_integration()

# Demonstrate federated learning
federated_rewards = demonstrate_federated_learning()

# Create and test hybrid agent
hybrid_agent = create_hybrid_agent()

# Test hybrid agent
test_env = create_integrated_environment()
test_env.n_agents = 1

hybrid_rewards = []
for episode in range(20):
    obs = test_env.reset()
    episode_reward = 0
    done = False
    experience_batch = []
    
    while not done:
        action = hybrid_agent.select_action(obs[0])
        next_obs, reward, done, _ = test_env.step([action])
        
        experience_batch.append({
            'obs': obs[0],
            'action': action,
            'reward': reward[0],
            'next_obs': next_obs[0]
        })
        
        episode_reward += reward[0]
        obs = next_obs
    
    # Update causal model periodically
    if episode % 5 == 0:
        hybrid_agent.update_causal_model(experience_batch[-10:])
    
    hybrid_rewards.append(episode_reward)

print(f"\nHybrid Agent Performance: {np.mean(hybrid_rewards):.2f} ¬± {np.std(hybrid_rewards):.2f}")

# Visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

# Integration results
for name, rewards in integration_results.items():
    ax1.plot(rewards, label=name, alpha=0.7)
ax1.set_title('Multi-Paradigm Agent Performance')
ax1.set_xlabel('Episode')
ax1.set_ylabel('Episode Reward')
ax1.legend()
ax1.grid(True)

# Federated learning
ax2.plot(federated_rewards, marker='o')
ax2.set_title('Federated Learning Progress')
ax2.set_xlabel('Communication Round')
ax2.set_ylabel('Average Client Reward')
ax2.grid(True)

# Hybrid agent
ax3.plot(hybrid_rewards, color='purple')
ax3.set_title('Hybrid Agent Learning Curve')
ax3.set_xlabel('Episode')
ax3.set_ylabel('Episode Reward')
ax3.grid(True)

# Performance comparison
methods = ['MADDPG+Causal', 'Quantum Q-Learning', 'Federated RL', 'Hybrid Agent']
final_performances = [
    np.mean(integration_results['MADDPG+Causal'][-10:]),
    np.mean(integration_results['Quantum Q-Learning'][-10:]),
    federated_rewards[-1],
    np.mean(hybrid_rewards[-10:])
]

ax4.bar(range(len(methods)), final_performances, alpha=0.7, color=['blue', 'red', 'green', 'purple'])
ax4.set_title('Final Performance Comparison')
ax4.set_xlabel('Method')
ax4.set_ylabel('Average Reward')
ax4.set_xticks(range(len(methods)))
ax4.set_xticklabels(methods, rotation=45, ha='right')
ax4.grid(True)

plt.tight_layout()
plt.show()

print("\n‚úÖ Exercise 4 Complete!")
print("Key learnings:")
print("- Multiple RL paradigms can be integrated for complementary strengths")
print("- Federated learning enables privacy-preserving distributed training")
print("- Hybrid agents combine world models, causal reasoning, and quantum processing")
print("- Integration approaches show promise for tackling complex real-world problems")

# Code Review and Improvements

## Implementation Analysis

### Positives
- Demonstrates a clear experimental methodology and documents hyperparameters and evaluation metrics.
- Good use of modular components enabling swapping of policy, value, and dynamics models.
- Includes helpful visual diagnostics for training stability and performance.

### Recommended Improvements
- Add more rigorous unit and integration tests for core modules (agents, buffers, networks).
- Adopt a standardized configuration system (e.g., Hydra or simple YAML configs) to manage experiments and parameter sweeps.
- Provide an automated script for hyperparameter sweeps (e.g., using Optuna or Ray Tune).

## Algorithmic Enhancements
- Implement robust exploration strategies (parameter noise, entropy bonuses, or exploration from ensemble models).
- Use ensemble dynamics or Bayesian approaches to capture model uncertainty for safer planning.
- For high-dimensional observations, use representation learning (autoencoders, contrastive methods) prior to policy learning.

## Performance and Resource Management
- Use profiling (torch.profiler) to find bottlenecks and optimize data loaders and model ops.
- Add checkpoint rotation and retention policies to avoid disk bloat during long experiments.

## Reproducibility and Experiment Tracking
- Add a `run_experiment.py` entrypoint that records experiment metadata and saves reproducible configs.
- Integrate with W&B or a self-hosted logging solution and save logs/artifacts to a structured folder per run.

## Production & Serving
- Add example code to deploy a trained policy as a deterministic policy network for real-time inference.
- Consider model quantization for running on edge devices; include a short guide and example script.

## Future Directions
- Multi-task or meta-learning extensions to enable rapid generalization across environments.
- Explore model-based/model-free hybrids for better sample efficiency.

Follow these steps to make CA18 more robust for research, reproducible for experiments, and practical for deployment.