# CA16: Cutting-Edge Deep Reinforcement Learning - Foundation Models, Neurosymbolic RL, and Future Paradigms

## Deep Reinforcement Learning - Advanced Topics and Emerging Paradigms

This comprehensive notebook explores the latest frontiers in Deep Reinforcement Learning, covering foundation models, neurosymbolic approaches, continual learning, human-AI collaboration, and emerging paradigms that will shape the future of intelligent agents.

## Topics Covered:

### 🧠 **Foundation Models in RL**
- Large-scale pre-trained RL models
- Decision Transformer and Trajectory Transformers
- Multi-task and multi-modal RL agents
- In-context learning for RL

### 🔬 **Neurosymbolic Reinforcement Learning**
- Symbolic reasoning integration
- Logic-guided policy learning
- Interpretable and explainable RL
- Causal reasoning in RL

### 🔄 **Continual and Lifelong Learning**
- Catastrophic forgetting in RL
- Meta-learning and adaptation
- Progressive neural networks
- Memory systems for continual RL

### 🤝 **Human-AI Collaborative RL**
- Learning from human feedback (RLHF)
- Interactive learning and teaching
- Preference learning and reward modeling
- Constitutional AI and value alignment

### ⚡ **Advanced Computational Methods**
- Quantum-inspired RL algorithms
- Neuromorphic computing for RL
- Distributed and federated RL
- Energy-efficient RL architectures

### 🌍 **Real-World Deployment and Ethics**
- Production RL systems
- Ethical considerations and fairness
- Robustness and reliability
- Regulatory compliance and safety

## Learning Objectives:
1. Master foundation model architectures for reinforcement learning
2. Implement neurosymbolic RL algorithms with interpretability
3. Design continual learning systems that avoid catastrophic forgetting
4. Build human-AI collaborative learning frameworks
5. Explore quantum and neuromorphic computing paradigms
6. Apply advanced RL to real-world production systems
7. Address ethical considerations and societal impact
8. Analyze emerging paradigms and future research directions

### Session Structure:
- **Section 1**: Foundation Models and Large-Scale RL
- **Section 2**: Neurosymbolic RL and Interpretability
- **Section 3**: Continual Learning and Meta-Learning
- **Section 4**: Human-AI Collaborative Learning
- **Section 5**: Advanced Computational Paradigms
- **Section 6**: Real-World Deployment and Ethics
- **Section 7**: Future Directions and Research Frontiers

---
**Assignment Date**: Cutting-Edge Deep RL - Lesson 16  
**Estimated Time**: 4-5 hours  
**Difficulty**: Research-Level Advanced  
**Prerequisites**: CA1-CA15 completed

---

In [1]:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Normal, Categorical, MultivariateNormal
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from collections import deque, namedtuple, OrderedDict
import random
import copy
import math
import time
import gym
from typing import List, Dict, Tuple, Optional, Union, Any, Callable
import warnings
warnings.filterwarnings('ignore')

from dataclasses import dataclass
from abc import ABC, abstractmethod
import json
import pickle
from datetime import datetime
import logging
from pathlib import Path

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

print(f"🚀 Using device: {device}")
if torch.cuda.is_available():
    print(f"💫 GPU: {torch.cuda.get_device_name(0)}")
    print(f"🔢 CUDA Version: {torch.version.cuda}")

plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3
plt.rcParams['figure.dpi'] = 100

FOUNDATION_MODEL_CONFIG = {
    'model_dim': 512,
    'num_heads': 8,
    'num_layers': 6,
    'context_length': 1024,
    'dropout': 0.1,
    'layer_norm_eps': 1e-5,
    'max_position_embeddings': 2048
}

NEUROSYMBOLIC_CONFIG = {
    'logic_embedding_dim': 128,
    'symbolic_vocab_size': 1000,
    'reasoning_steps': 5,
    'symbolic_weight': 0.3,
    'neural_weight': 0.7,
    'interpretability_threshold': 0.8
}

CONTINUAL_LEARNING_CONFIG = {
    'ewc_lambda': 1000,
    'memory_size': 10000,
    'num_tasks': 10,
    'adaptation_lr': 1e-4,
    'meta_lr': 1e-3,
    'forgetting_threshold': 0.1
}

HUMAN_AI_CONFIG = {
    'preference_model_dim': 256,
    'reward_model_lr': 3e-4,
    'human_feedback_ratio': 0.1,
    'preference_batch_size': 64,
    'kl_penalty': 0.1,
    'value_alignment_weight': 1.0
}

QUANTUM_RL_CONFIG = {
    'num_qubits': 8,
    'circuit_depth': 10,
    'quantum_lr': 0.01,
    'entanglement_layers': 3,
    'measurement_shots': 1024,
    'quantum_advantage_threshold': 1.5
}

print("\n🧠 Cutting-Edge Deep RL Environment Initialized!")
print("🔬 Advanced Topics: Foundation Models, Neurosymbolic RL, Continual Learning")
print("🤝 Human-AI Collaboration, Quantum RL, Ethics & Future Paradigms")
print("⚡ Ready for next-generation reinforcement learning research!")
print("\n" + "="*80)


🚀 Using device: cpu

🧠 Cutting-Edge Deep RL Environment Initialized!
🔬 Advanced Topics: Foundation Models, Neurosymbolic RL, Continual Learning
🤝 Human-AI Collaboration, Quantum RL, Ethics & Future Paradigms
⚡ Ready for next-generation reinforcement learning research!



Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.


# Section 1: Foundation Models in Reinforcement Learning

Foundation models represent a paradigm shift in AI, where large-scale pre-trained models can be adapted to various downstream tasks. In RL, this concept translates to training massive models on diverse experiences that can then be fine-tuned for specific tasks.

## 1.1 Theoretical Foundations

### Decision Transformers
The Decision Transformer reframes RL as a sequence modeling problem, where the goal is to generate actions conditioned on desired returns.

**Key Insight**: Instead of learning value functions or policy gradients, we model:
$$P(a_t | s_{1:t}, a_{1:t-1}, R_{t:T})$$

Where $R_{t:T}$ represents the desired return-to-go from time $t$ to episode end $T$.

### Trajectory Transformers
Extend transformers to model entire trajectories:
$$P(\tau | g) = \prod_{t=0}^{T} P(s_{t+1}, r_t, a_t | s_{1:t}, a_{1:t-1}, g)$$

Where $g$ represents the goal or task specification.

### Multi-Task Pre-training
Foundation models in RL are trained on massive datasets containing:
- Multiple environments and tasks
- Diverse behavioral policies
- Various skill demonstrations
- Cross-modal experiences (vision, language, control)

**Training Objective**:
$$\mathcal{L} = \sum_{\mathcal{D}_i} \mathbb{E}_{\tau \sim \mathcal{D}_i} [-\log P(\tau | \text{context}_i)]$$

### In-Context Learning for RL
Similar to language models, RL foundation models can adapt to new tasks through in-context learning:
- Provide few-shot demonstrations
- Model infers task structure and optimal behavior
- No gradient updates required

## 1.2 Advantages and Challenges

### Advantages:
1. **Sample Efficiency**: Leverage pre-training for rapid adaptation
2. **Generalization**: Transfer knowledge across diverse tasks
3. **Few-Shot Learning**: Adapt to new tasks with minimal data
4. **Unified Architecture**: Single model for multiple domains

### Challenges:
1. **Computational Requirements**: Massive models need significant resources
2. **Data Requirements**: Need diverse, high-quality training data
3. **Task Distribution**: Performance depends on training task diversity
4. **Fine-tuning Complexity**: Avoiding catastrophic forgetting during adaptation

### Scaling Laws in RL
Similar to language models, RL foundation models exhibit scaling laws:
- **Model Size**: Larger models achieve better performance
- **Data Scale**: More diverse training data improves generalization
- **Compute**: Increased training compute enables larger models

**Empirical Scaling Relationship**:
$$\text{Performance} \propto \alpha N^{\beta} D^{\gamma} C^{\delta}$$

Where $N$ = model parameters, $D$ = dataset size, $C$ = compute budget.

In [5]:

class PositionalEncoding(nn.Module):
    """Positional encoding for transformer-based RL models."""
    
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        return x + self.pe[:x.size(0), :]

class DecisionTransformer(nn.Module):
    """Decision Transformer for sequence-based RL."""
    
    def __init__(self, state_dim, action_dim, model_dim=512, num_heads=8, num_layers=6, 
                 max_length=1024, dropout=0.1):
        super().__init__()
        
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.model_dim = model_dim
        self.max_length = max_length
        
        self.state_embedding = nn.Linear(state_dim, model_dim)
        self.action_embedding = nn.Linear(action_dim, model_dim)
        self.return_embedding = nn.Linear(1, model_dim)
        self.timestep_embedding = nn.Embedding(max_length, model_dim)
        
        self.pos_encoding = PositionalEncoding(model_dim, max_length * 3)  # 3x for s,a,r tokens
        
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=model_dim,
            nhead=num_heads,
            dim_feedforward=4 * model_dim,
            dropout=dropout,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        
        self.layer_norm = nn.LayerNorm(model_dim)
        
        self.action_head = nn.Linear(model_dim, action_dim)
        self.value_head = nn.Linear(model_dim, 1)
        self.return_head = nn.Linear(model_dim, 1)
        
        self.dropout = nn.Dropout(dropout)
        
        self.apply(self._init_weights)
    
    def _init_weights(self, module):
        """Initialize transformer weights."""
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
    
    def forward(self, states, actions, returns_to_go, timesteps, attention_mask=None):
        """
        Forward pass through Decision Transformer.
        
        Args:
            states: (batch_size, seq_len, state_dim)
            actions: (batch_size, seq_len, action_dim)
            returns_to_go: (batch_size, seq_len, 1)
            timesteps: (batch_size, seq_len)
            attention_mask: (batch_size, seq_len * 3)
        """
        batch_size, seq_len = states.shape[0], states.shape[1]
        
        state_embeddings = self.state_embedding(states)
        action_embeddings = self.action_embedding(actions)
        return_embeddings = self.return_embedding(returns_to_go)
        time_embeddings = self.timestep_embedding(timesteps)
        
        state_embeddings += time_embeddings
        action_embeddings += time_embeddings
        return_embeddings += time_embeddings
        
        stacked_inputs = torch.stack([
            return_embeddings, state_embeddings, action_embeddings
        ], dim=2).reshape(batch_size, 3 * seq_len, self.model_dim)
        
        stacked_inputs = self.pos_encoding(stacked_inputs.transpose(0, 1)).transpose(0, 1)
        stacked_inputs = self.layer_norm(stacked_inputs)
        stacked_inputs = self.dropout(stacked_inputs)
        
        transformer_output = self.transformer(stacked_inputs, src_key_padding_mask=attention_mask)
        
        transformer_output = transformer_output.reshape(batch_size, seq_len, 3, self.model_dim)
        
        return_preds = self.return_head(transformer_output[:, :, 0])  # Return tokens
        state_preds = transformer_output[:, :, 1]  # State tokens (for representation)
        action_preds = self.action_head(transformer_output[:, :, 2])  # Action tokens
        value_preds = self.value_head(transformer_output[:, :, 1])  # Value from state tokens
        
        return {
            'action_preds': action_preds,
            'value_preds': value_preds,
            'return_preds': return_preds,
            'state_representations': state_preds
        }
    
    def get_action(self, states, actions, returns_to_go, timesteps, temperature=1.0):
        """Get action for inference."""
        self.eval()
        with torch.no_grad():
            outputs = self.forward(states, actions, returns_to_go, timesteps)
            action_logits = outputs['action_preds'][:, -1] / temperature
            
            if self.action_dim > 1:
                action_probs = F.softmax(action_logits, dim=-1)
                action = torch.multinomial(action_probs, 1)
            else:
                action = torch.tanh(action_logits)  # For continuous actions
            
            return action

class MultiTaskRLFoundationModel(nn.Module):
    """Multi-task foundation model for RL."""
    
    def __init__(self, state_dim, action_dim, task_dim, model_dim=512, num_heads=8, num_layers=6):
        super().__init__()
        
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.task_dim = task_dim
        self.model_dim = model_dim
        
        self.task_embedding = nn.Embedding(task_dim, model_dim)
        
        self.decision_transformer = DecisionTransformer(
            state_dim, action_dim, model_dim, num_heads, num_layers
        )
        
        self.task_heads = nn.ModuleDict({
            f'task_{i}': nn.Linear(model_dim, action_dim)
            for i in range(task_dim)
        })
        
        self.context_encoder = nn.LSTM(model_dim, model_dim, batch_first=True)
        self.adaptation_network = nn.Sequential(
            nn.Linear(model_dim, model_dim),
            nn.ReLU(),
            nn.Linear(model_dim, model_dim)
        )
    
    def forward(self, states, actions, returns_to_go, timesteps, task_ids, context_length=10):
        """Forward pass with task conditioning."""
        batch_size = states.shape[0]
        
        task_embeds = self.task_embedding(task_ids)  # (batch_size, model_dim)
        task_embeds = task_embeds.unsqueeze(1).expand(-1, states.shape[1], -1)
        
        conditioned_states = states + task_embeds[:, :, :self.state_dim]
        
        outputs = self.decision_transformer(conditioned_states, actions, returns_to_go, timesteps)
        
        state_representations = outputs['state_representations']
        task_specific_actions = []
        
        for i, task_id in enumerate(task_ids):
            task_head = self.task_heads[f'task_{task_id.item()}']
            task_action = task_head(state_representations[i])
            task_specific_actions.append(task_action)
        
        outputs['task_specific_actions'] = torch.stack(task_specific_actions)
        
        return outputs
    
    def adapt_to_new_task(self, context_trajectories, num_adaptation_steps=5):
        """Few-shot adaptation to new task using in-context learning."""
        context_features = []
        
        for trajectory in context_trajectories:
            states, actions, returns = trajectory['states'], trajectory['actions'], trajectory['returns']
            timesteps = torch.arange(len(states))
            
            with torch.no_grad():
                outputs = self.decision_transformer(states, actions, returns, timesteps)
                context_features.append(outputs['state_representations'].mean(dim=1))
        
        context_features = torch.stack(context_features)
        context_encoding, _ = self.context_encoder(context_features.unsqueeze(0))
        
        adaptation_params = self.adaptation_network(context_encoding.squeeze(0).mean(dim=0))
        
        return adaptation_params

class InContextLearningRL:
    """In-context learning for RL foundation models."""
    
    def __init__(self, foundation_model, context_length=50):
        self.foundation_model = foundation_model
        self.context_length = context_length
        self.context_buffer = deque(maxlen=context_length)
    
    def add_context(self, state, action, reward, next_state, done):
        """Add experience to context buffer."""
        self.context_buffer.append({
            'state': state,
            'action': action,
            'reward': reward,
            'next_state': next_state,
            'done': done
        })
    
    def get_action(self, current_state, desired_return, temperature=1.0):
        """Get action using in-context learning."""
        if len(self.context_buffer) == 0:
            return np.random.randint(self.foundation_model.action_dim)
        
        context_states = []
        context_actions = []
        context_returns = []
        context_timesteps = []
        
        cumulative_return = 0
        for i, exp in enumerate(reversed(list(self.context_buffer))):
            context_states.append(exp['state'])
            context_actions.append(exp['action'])
            cumulative_return += exp['reward']
            context_returns.append([cumulative_return])
            context_timesteps.append(len(self.context_buffer) - i - 1)
        
        context_states.reverse()
        context_actions.reverse()
        context_returns.reverse()
        context_timesteps.reverse()
        
        context_states.append(current_state)
        context_actions.append(np.zeros(self.foundation_model.action_dim))  # Placeholder
        context_returns.append([desired_return])
        context_timesteps.append(len(self.context_buffer))
        
        states = torch.FloatTensor(context_states).unsqueeze(0).to(device)
        actions = torch.FloatTensor(context_actions).unsqueeze(0).to(device)
        returns_to_go = torch.FloatTensor(context_returns).unsqueeze(0).to(device)
        timesteps = torch.LongTensor(context_timesteps).unsqueeze(0).to(device)
        
        with torch.no_grad():
            action = self.foundation_model.get_action(states, actions, returns_to_go, timesteps, temperature)
        
        return action.cpu().numpy().flatten()

class FoundationModelTrainer:
    """Training framework for RL foundation models."""
    
    def __init__(self, model, learning_rate=1e-4, weight_decay=1e-2):
        self.model = model
        self.optimizer = optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
        self.scheduler = optim.lr_scheduler.CosineAnnealingLR(self.optimizer, T_max=1000)
        
        self.training_stats = {
            'losses': [],
            'action_losses': [],
            'value_losses': [],
            'return_losses': []
        }
    
    def train_step(self, batch):
        """Single training step."""
        self.model.train()
        self.optimizer.zero_grad()
        
        states = batch['states'].to(device)
        actions = batch['actions'].to(device)
        returns_to_go = batch['returns_to_go'].to(device)
        timesteps = batch['timesteps'].to(device)
        target_actions = batch['target_actions'].to(device)
        target_returns = batch['target_returns'].to(device)
        
        outputs = self.model(states, actions, returns_to_go, timesteps)
        
        action_loss = F.mse_loss(outputs['action_preds'], target_actions)
        value_loss = F.mse_loss(outputs['value_preds'], target_returns)
        return_loss = F.mse_loss(outputs['return_preds'], target_returns)
        
        total_loss = action_loss + 0.5 * value_loss + 0.1 * return_loss
        
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
        self.optimizer.step()
        self.scheduler.step()
        
        self.training_stats['losses'].append(total_loss.item())
        self.training_stats['action_losses'].append(action_loss.item())
        self.training_stats['value_losses'].append(value_loss.item())
        self.training_stats['return_losses'].append(return_loss.item())
        
        return total_loss.item()

print("🧠 Foundation Models Implementation Complete!")
print("📊 Key Components:")
print("  • DecisionTransformer: Sequence-based RL with transformers")
print("  • MultiTaskRLFoundationModel: Multi-task pre-training framework")
print("  • InContextLearningRL: Few-shot adaptation without gradient updates")
print("  • FoundationModelTrainer: Scalable training infrastructure")
print("\n✨ Ready for large-scale RL foundation model training!")


🧠 Foundation Models Implementation Complete!
📊 Key Components:
  • DecisionTransformer: Sequence-based RL with transformers
  • MultiTaskRLFoundationModel: Multi-task pre-training framework
  • InContextLearningRL: Few-shot adaptation without gradient updates
  • FoundationModelTrainer: Scalable training infrastructure

✨ Ready for large-scale RL foundation model training!


# Section 2: Neurosymbolic Reinforcement Learning

Neurosymbolic RL combines the learning capabilities of neural networks with the reasoning power of symbolic systems, creating interpretable and more robust intelligent agents.

## 2.1 Theoretical Foundations

### The Neurosymbolic Paradigm
Traditional RL systems struggle with:
- **Interpretability**: Understanding why decisions were made
- **Compositional Reasoning**: Combining learned concepts systematically
- **Sample Efficiency**: Learning abstract rules from limited data
- **Transfer**: Applying learned knowledge to new domains

**Neurosymbolic RL** addresses these challenges by integrating:
- **Neural Components**: Learning from raw sensory data
- **Symbolic Components**: Logical reasoning and rule-based inference
- **Hybrid Architectures**: Seamless integration of both paradigms

### Core Components

#### 1. Symbolic Knowledge Representation
Represent environment knowledge using formal logic:
- **Predicate Logic**: $\text{at}(\text{agent}, x, y) \land \text{obstacle}(x+1, y) \rightarrow \neg \text{move\_right}$
- **Temporal Logic**: $\square (\text{goal\_reached} \rightarrow \Diamond \text{reward})$
- **Probabilistic Logic**: $P(\text{success} | \text{action}, \text{state}) = 0.8$

#### 2. Neural-Symbolic Integration Patterns

**Pattern 1: Neural Perception + Symbolic Reasoning**
$$\pi(a|s) = \text{SymbolicPlanner}(\text{NeuralPerception}(s))$$

**Pattern 2: Symbolic-Guided Neural Learning**
$$\mathcal{L} = \mathcal{L}_{\text{RL}} + \lambda \mathcal{L}_{\text{logic}}$$

**Pattern 3: Hybrid Representations**
$$h = \text{Combine}(h_{\text{neural}}, h_{\text{symbolic}})$$

### Logical Policy Learning
Learn policies that satisfy logical constraints:

**Constraint Satisfaction**:
$$\pi^* = \arg\max_\pi \mathbb{E}_\pi[R] \text{ subject to } \phi \models \psi$$

Where $\phi$ represents the policy behavior and $\psi$ represents logical constraints.

**Logic-Regularized RL**:
$$\mathcal{L} = -\mathbb{E}_\pi[R] + \alpha \cdot \text{LogicViolation}(\pi, \psi)$$

### Compositional Learning
Enable agents to compose learned primitives:

**Hierarchical Composition**:
- **Skills**: $\pi_1, \pi_2, \ldots, \pi_k$
- **Meta-Policy**: $\pi_{\text{meta}}(k|s)$
- **Composition Rule**: $\pi(a|s) = \sum_k \pi_{\text{meta}}(k|s) \pi_k(a|s)$

**Logical Composition**:
- **Primitive Predicates**: $p_1, p_2, \ldots, p_n$
- **Logical Operators**: $\land, \lor, \neg, \rightarrow$
- **Complex Behaviors**: $\psi = p_1 \land (p_2 \lor \neg p_3) \rightarrow p_4$

## 2.2 Interpretability and Explainability

### Attention-Based Explanations
Use attention mechanisms to highlight decision factors:
$$\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}, \quad e_i = f_{\text{att}}(h_i)$$

### Counterfactual Reasoning
Generate explanations through counterfactuals:
- **Question**: "What if state $s$ were different?"
- **Counterfactual State**: $s' = s + \delta$
- **Action Change**: $\Delta a = \pi(s') - \pi(s)$
- **Explanation**: "If $x$ were true, agent would do $y$ instead"

### Causal Discovery in RL
Learn causal relationships between variables:
$$X \rightarrow Y \text{ if } I(Y; \text{do}(X)) > 0$$

Where $I$ is mutual information and $\text{do}(X)$ represents intervention.

### Logical Rule Extraction
Extract interpretable rules from trained policies:
1. **State Abstraction**: Group similar states
2. **Action Patterns**: Identify consistent action choices
3. **Rule Formation**: Convert patterns to logical rules
4. **Rule Validation**: Test rules on new data

## 2.3 Advanced Neurosymbolic Architectures

### Differentiable Neural Module Networks (dNMNs)
Compose neural modules based on language instructions:
- **Modules**: $\{m_1, m_2, \ldots, m_k\}$
- **Composition**: Dynamic module assembly
- **Training**: End-to-end differentiable

### Graph Neural Networks for Symbolic Reasoning
Represent knowledge as graphs and use GNNs:
- **Nodes**: Entities, concepts, states
- **Edges**: Relations, transitions, dependencies
- **Message Passing**: Propagate information through graph
- **Reasoning**: Multi-hop inference over graph structure

### Memory-Augmented Networks
External memory for symbolic knowledge storage:
- **Memory Matrix**: $M \in \mathbb{R}^{N \times D}$
- **Attention**: $w = \text{softmax}(q^T M)$
- **Read**: $r = w^T M$
- **Write**: $M \leftarrow M + w \odot \text{update}$

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from typing import List, Dict, Tuple, Any
from dataclasses import dataclass
from enum import Enum
import networkx as nx
import matplotlib.pyplot as plt
from collections import defaultdict

class LogicalOperator(Enum):
    AND = "and"
    OR = "or"
    NOT = "not"
    IMPLIES = "implies"

@dataclass
class LogicalPredicate:
    name: str
    args: List[str]
    truth_value: float = 0.0
    
    def __str__(self):
        if self.args:
            return f"{self.name}({', '.join(self.args)})"
        return self.name

@dataclass
class LogicalRule:
    premises: List[LogicalPredicate]
    conclusion: LogicalPredicate
    operator: LogicalOperator
    confidence: float = 1.0
    
    def evaluate(self, facts: Dict[str, float]) -> float:
        """Evaluate rule given current facts (fuzzy logic)"""
        premise_values = []
        for premise in self.premises:
            key = str(premise)
            premise_values.append(facts.get(key, 0.0))
        
        if self.operator == LogicalOperator.AND:
            premise_truth = min(premise_values) if premise_values else 0.0
        elif self.operator == LogicalOperator.OR:
            premise_truth = max(premise_values) if premise_values else 0.0
        elif self.operator == LogicalOperator.NOT:
            premise_truth = 1.0 - max(premise_values) if premise_values else 1.0
        elif self.operator == LogicalOperator.IMPLIES:
            premise_truth = min(premise_values) if premise_values else 0.0
        
        conclusion_key = str(self.conclusion)
        current_conclusion = facts.get(conclusion_key, 0.0)
        
        if self.operator == LogicalOperator.IMPLIES:
            return min(1.0, 1.0 - premise_truth + current_conclusion) * self.confidence
        
        return premise_truth * self.confidence

class SymbolicKnowledgeBase:
    """Knowledge base for storing and reasoning with logical rules"""
    
    def __init__(self):
        self.rules: List[LogicalRule] = []
        self.facts: Dict[str, float] = {}
        self.predicates: Dict[str, LogicalPredicate] = {}
    
    def add_rule(self, rule: LogicalRule):
        """Add a logical rule to the knowledge base"""
        self.rules.append(rule)
    
    def add_fact(self, predicate: LogicalPredicate, truth_value: float):
        """Add a fact to the knowledge base"""
        key = str(predicate)
        self.facts[key] = truth_value
        self.predicates[key] = predicate
    
    def forward_chain(self, max_iterations: int = 10) -> Dict[str, float]:
        """Forward chaining inference with fuzzy logic"""
        for iteration in range(max_iterations):
            changed = False
            for rule in self.rules:
                rule_activation = rule.evaluate(self.facts)
                conclusion_key = str(rule.conclusion)
                
                old_value = self.facts.get(conclusion_key, 0.0)
                new_value = max(old_value, rule_activation)
                
                if new_value != old_value:
                    self.facts[conclusion_key] = new_value
                    changed = True
            
            if not changed:
                break
        
        return self.facts
    
    def explain_decision(self, query: str) -> List[str]:
        """Generate explanation for why a fact is true"""
        explanations = []
        for rule in self.rules:
            if str(rule.conclusion) == query:
                activation = rule.evaluate(self.facts)
                if activation > 0.1:  # Threshold for meaningful activation
                    premise_str = f" {rule.operator.value} ".join([str(p) for p in rule.premises])
                    explanations.append(f"{query} because {premise_str} (confidence: {activation:.2f})")
        return explanations

class NeuralPerceptionModule(nn.Module):
    """Neural module for perceiving raw state and extracting symbolic predicates"""
    
    def __init__(self, state_dim: int, predicate_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.state_dim = state_dim
        self.predicate_dim = predicate_dim
        
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, predicate_dim),
            nn.Sigmoid()  # Output probabilities for predicates
        )
        
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads=4)
        
    def forward(self, state: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            state: Raw state representation [batch, state_dim]
        Returns:
            predicates: Predicate truth values [batch, predicate_dim]
            attention_weights: Attention weights for interpretability
        """
        features = self.encoder[:-1](state)  # All layers except final sigmoid
        
        features_expanded = features.unsqueeze(1)  # [batch, 1, hidden_dim]
        attended_features, attention_weights = self.attention(
            features_expanded, features_expanded, features_expanded
        )
        attended_features = attended_features.squeeze(1)
        
        predicates = torch.sigmoid(self.encoder[-1](attended_features))
        
        return predicates, attention_weights

class SymbolicReasoningModule:
    """Module for symbolic reasoning using knowledge base"""
    
    def __init__(self, knowledge_base: SymbolicKnowledgeBase):
        self.kb = knowledge_base
        self.predicate_names = [
            "near_goal", "obstacle_ahead", "low_energy", "high_reward_area",
            "safe_position", "explored_area", "time_pressure", "resource_available"
        ]
    
    def reason(self, neural_predicates: torch.Tensor) -> Dict[str, float]:
        """
        Perform symbolic reasoning given neural predicate activations
        
        Args:
            neural_predicates: Tensor of predicate activations [predicate_dim]
        Returns:
            Inferred facts and their truth values
        """
        self.kb.facts.clear()
        for i, pred_name in enumerate(self.predicate_names):
            if i < len(neural_predicates):
                pred = LogicalPredicate(pred_name, [])
                self.kb.add_fact(pred, float(neural_predicates[i]))
        
        inferred_facts = self.kb.forward_chain()
        
        return inferred_facts

class NeurosymbolicPolicy(nn.Module):
    """Policy that combines neural perception with symbolic reasoning"""
    
    def __init__(self, state_dim: int, action_dim: int, predicate_dim: int = 8):
        super().__init__()
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.predicate_dim = predicate_dim
        
        self.perception = NeuralPerceptionModule(state_dim, predicate_dim)
        
        self.kb = SymbolicKnowledgeBase()
        self._initialize_domain_knowledge()
        
        self.reasoning = SymbolicReasoningModule(self.kb)
        
        self.action_net = nn.Sequential(
            nn.Linear(predicate_dim * 2, 64),  # *2 for neural + symbolic features
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, action_dim)
        )
        
        self.value_net = nn.Sequential(
            nn.Linear(predicate_dim * 2, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )
    
    def _initialize_domain_knowledge(self):
        """Initialize knowledge base with domain-specific rules"""
        
        obstacle_pred = LogicalPredicate("obstacle_ahead", [])
        safe_pred = LogicalPredicate("safe_position", [])
        avoid_pred = LogicalPredicate("avoid_forward", [])
        
        rule1 = LogicalRule(
            premises=[obstacle_pred, LogicalPredicate("safe_position", [])],
            conclusion=avoid_pred,
            operator=LogicalOperator.AND,
            confidence=0.9
        )
        self.kb.add_rule(rule1)
        
        near_goal_pred = LogicalPredicate("near_goal", [])
        high_reward_pred = LogicalPredicate("high_reward_area", [])
        approach_pred = LogicalPredicate("approach_goal", [])
        
        rule2 = LogicalRule(
            premises=[near_goal_pred, high_reward_pred],
            conclusion=approach_pred,
            operator=LogicalOperator.AND,
            confidence=0.95
        )
        self.kb.add_rule(rule2)
        
        low_energy_pred = LogicalPredicate("low_energy", [])
        resource_pred = LogicalPredicate("resource_available", [])
        collect_pred = LogicalPredicate("collect_resource", [])
        
        rule3 = LogicalRule(
            premises=[low_energy_pred, resource_pred],
            conclusion=collect_pred,
            operator=LogicalOperator.AND,
            confidence=0.85
        )
        self.kb.add_rule(rule3)
        
        time_pred = LogicalPredicate("time_pressure", [])
        explored_pred = LogicalPredicate("explored_area", [])
        explore_pred = LogicalPredicate("explore_quickly", [])
        
        rule4 = LogicalRule(
            premises=[time_pred, explored_pred],
            conclusion=explore_pred,
            operator=LogicalOperator.AND,
            confidence=0.8
        )
        self.kb.add_rule(rule4)
    
    def forward(self, state: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, Dict]:
        """
        Forward pass through neurosymbolic policy
        
        Returns:
            action_logits: Action probability logits
            value: State value estimate
            explanations: Interpretability information
        """
        batch_size = state.shape[0]
        
        neural_predicates, attention_weights = self.perception(state)
        
        symbolic_features_list = []
        explanations = {"neural_predicates": [], "symbolic_inferences": [], "explanations": []}
        
        for i in range(batch_size):
            symbolic_facts = self.reasoning.reason(neural_predicates[i])
            
            symbolic_features = torch.zeros(self.predicate_dim)
            for j, pred_name in enumerate(self.reasoning.predicate_names):
                if j < self.predicate_dim:
                    symbolic_features[j] = symbolic_facts.get(pred_name, 0.0)
            
            symbolic_features_list.append(symbolic_features)
            
            explanations["neural_predicates"].append(neural_predicates[i].detach())
            explanations["symbolic_inferences"].append(symbolic_features)
            
            sample_explanations = []
            for fact_name, truth_value in symbolic_facts.items():
                if truth_value > 0.5:  # High activation threshold
                    fact_explanations = self.kb.explain_decision(fact_name)
                    sample_explanations.extend(fact_explanations)
            explanations["explanations"].append(sample_explanations)
        
        symbolic_features = torch.stack(symbolic_features_list).to(state.device)
        
        combined_features = torch.cat([neural_predicates, symbolic_features], dim=1)
        
        action_logits = self.action_net(combined_features)
        values = self.value_net(combined_features)
        
        explanations["attention_weights"] = attention_weights
        
        return action_logits, values, explanations
    
    def get_action(self, state: torch.Tensor, deterministic: bool = False) -> Tuple[torch.Tensor, Dict]:
        """Get action from policy with explanations"""
        action_logits, values, explanations = self.forward(state)
        
        if deterministic:
            actions = torch.argmax(action_logits, dim=-1)
        else:
            action_dist = torch.distributions.Categorical(logits=action_logits)
            actions = action_dist.sample()
        
        return actions, explanations

class NeurosymbolicAgent:
    """Complete neurosymbolic RL agent with training capabilities"""
    
    def __init__(self, state_dim: int, action_dim: int, lr: float = 3e-4):
        self.policy = NeurosymbolicPolicy(state_dim, action_dim)
        self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr)
        
        self.training_history = {
            'rewards': [],
            'losses': [],
            'explanations': []
        }
    
    def train_step(self, states: torch.Tensor, actions: torch.Tensor, 
                   rewards: torch.Tensor, next_states: torch.Tensor, 
                   dones: torch.Tensor) -> Dict[str, float]:
        """Training step with advantage actor-critic"""
        
        action_logits, values, explanations = self.policy(states)
        next_action_logits, next_values, _ = self.policy(next_states)
        
        with torch.no_grad():
            targets = rewards + 0.99 * next_values.squeeze() * (1 - dones.float())
            advantages = targets - values.squeeze()
        
        action_dist = torch.distributions.Categorical(logits=action_logits)
        log_probs = action_dist.log_prob(actions)
        policy_loss = -(log_probs * advantages.detach()).mean()
        
        value_loss = F.mse_loss(values.squeeze(), targets.detach())
        
        entropy = action_dist.entropy().mean()
        entropy_bonus = 0.01 * entropy
        
        total_loss = policy_loss + 0.5 * value_loss - entropy_bonus
        
        self.optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.policy.parameters(), 0.5)
        self.optimizer.step()
        
        train_info = {
            'total_loss': total_loss.item(),
            'policy_loss': policy_loss.item(),
            'value_loss': value_loss.item(),
            'entropy': entropy.item(),
            'avg_value': values.mean().item(),
            'explanations': explanations
        }
        
        self.training_history['losses'].append(total_loss.item())
        
        return train_info

print("✅ Neurosymbolic RL classes implemented successfully!")
print("Components: LogicalPredicate, LogicalRule, SymbolicKnowledgeBase")
print("Neural modules: NeuralPerceptionModule, SymbolicReasoningModule") 
print("Policy: NeurosymbolicPolicy with interpretable reasoning")
print("Agent: NeurosymbolicAgent with training capabilities")


✅ Neurosymbolic RL classes implemented successfully!
Components: LogicalPredicate, LogicalRule, SymbolicKnowledgeBase
Neural modules: NeuralPerceptionModule, SymbolicReasoningModule
Policy: NeurosymbolicPolicy with interpretable reasoning
Agent: NeurosymbolicAgent with training capabilities


In [7]:
import gymnasium as gym
from gymnasium import spaces
import random
from typing import Tuple, List
import seaborn as sns

class SymbolicGridWorld(gym.Env):
    """GridWorld environment with symbolic predicates for neurosymbolic RL"""
    
    def __init__(self, size=8):
        super().__init__()
        self.size = size
        self.agent_pos = [0, 0]
        self.goal_pos = [size-1, size-1]
        
        self.obstacles = set()
        for _ in range(size // 2):
            x, y = random.randint(1, size-2), random.randint(1, size-2)
            if [x, y] != self.goal_pos:
                self.obstacles.add((x, y))
        
        self.resources = set()
        for _ in range(size // 3):
            x, y = random.randint(0, size-1), random.randint(0, size-1)
            if [x, y] != self.goal_pos and (x, y) not in self.obstacles:
                self.resources.add((x, y))
        
        self.energy = 10
        self.max_energy = 10
        self.time_step = 0
        self.max_time = size * size
        self.collected_resources = set()
        self.visited_positions = set()
        
        self.action_space = spaces.Discrete(4)  # Up, Down, Left, Right
        self.observation_space = spaces.Box(
            low=0, high=1, shape=(12,), dtype=np.float32
        )
        
        self.actions = {
            0: [-1, 0],  # Up
            1: [1, 0],   # Down
            2: [0, -1],  # Left
            3: [0, 1]    # Right
        }
    
    def reset(self, seed=None):
        super().reset(seed=seed)
        self.agent_pos = [0, 0]
        self.energy = self.max_energy
        self.time_step = 0
        self.collected_resources = set()
        self.visited_positions = {tuple(self.agent_pos)}
        return self._get_observation(), {}
    
    def step(self, action):
        old_pos = self.agent_pos.copy()
        new_pos = [
            self.agent_pos[0] + self.actions[action][0],
            self.agent_pos[1] + self.actions[action][1]
        ]
        
        if 0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size:
            if tuple(new_pos) not in self.obstacles:
                self.agent_pos = new_pos
                self.energy -= 1  # Moving costs energy
        
        self.time_step += 1
        self.visited_positions.add(tuple(self.agent_pos))
        
        if tuple(self.agent_pos) in self.resources and tuple(self.agent_pos) not in self.collected_resources:
            self.collected_resources.add(tuple(self.agent_pos))
            self.energy = min(self.max_energy, self.energy + 3)
        
        reward = self._calculate_reward()
        
        terminated = (self.agent_pos == self.goal_pos or 
                     self.energy <= 0 or 
                     self.time_step >= self.max_time)
        
        return self._get_observation(), reward, terminated, False, {}
    
    def _calculate_reward(self):
        """Calculate reward based on current state"""
        reward = 0
        
        if self.agent_pos == self.goal_pos:
            reward += 100
        
        goal_dist = abs(self.agent_pos[0] - self.goal_pos[0]) + abs(self.agent_pos[1] - self.goal_pos[1])
        reward -= goal_dist * 0.1
        
        if self.energy <= 0:
            reward -= 50
        
        if tuple(self.agent_pos) in self.resources and tuple(self.agent_pos) not in self.collected_resources:
            reward += 10
        
        if tuple(self.agent_pos) not in self.visited_positions:
            reward += 1
        
        reward -= 0.01
        
        return reward
    
    def _get_observation(self):
        """Get observation with symbolic predicates"""
        obs = np.zeros(12, dtype=np.float32)
        
        obs[0] = self.agent_pos[0] / self.size
        obs[1] = self.agent_pos[1] / self.size
        
        obs[2] = self._near_goal()          # near_goal
        obs[3] = self._obstacle_ahead()     # obstacle_ahead  
        obs[4] = self._low_energy()         # low_energy
        obs[5] = self._high_reward_area()   # high_reward_area
        obs[6] = self._safe_position()      # safe_position
        obs[7] = self._explored_area()      # explored_area
        obs[8] = self._time_pressure()      # time_pressure
        obs[9] = self._resource_available() # resource_available
        
        obs[10] = self.energy / self.max_energy  # energy_level
        obs[11] = self.time_step / self.max_time # time_progress
        
        return obs
    
    def _near_goal(self) -> float:
        """Check if near goal"""
        dist = abs(self.agent_pos[0] - self.goal_pos[0]) + abs(self.agent_pos[1] - self.goal_pos[1])
        return max(0, 1.0 - dist / (2 * self.size))
    
    def _obstacle_ahead(self) -> float:
        """Check if obstacle is ahead in any direction"""
        for action in range(4):
            new_pos = [
                self.agent_pos[0] + self.actions[action][0],
                self.agent_pos[1] + self.actions[action][1]
            ]
            if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size and 
                tuple(new_pos) in self.obstacles):
                return 1.0
        return 0.0
    
    def _low_energy(self) -> float:
        """Check if energy is low"""
        return max(0, 1.0 - self.energy / (self.max_energy * 0.3))
    
    def _high_reward_area(self) -> float:
        """Check if in high reward area (near goal or resource)"""
        goal_reward = self._near_goal()
        
        resource_reward = 0.0
        for resource in self.resources:
            dist = abs(self.agent_pos[0] - resource[0]) + abs(self.agent_pos[1] - resource[1])
            resource_reward = max(resource_reward, max(0, 1.0 - dist / 3))
        
        return max(goal_reward, resource_reward)
    
    def _safe_position(self) -> float:
        """Check if in safe position (not near obstacles)"""
        min_dist = float('inf')
        for obstacle in self.obstacles:
            dist = abs(self.agent_pos[0] - obstacle[0]) + abs(self.agent_pos[1] - obstacle[1])
            min_dist = min(min_dist, dist)
        
        if min_dist == float('inf'):
            return 1.0
        return min(1.0, min_dist / 3)
    
    def _explored_area(self) -> float:
        """Check if current area has been explored"""
        return 1.0 if tuple(self.agent_pos) in self.visited_positions else 0.0
    
    def _time_pressure(self) -> float:
        """Check if under time pressure"""
        return max(0, (self.time_step - self.max_time * 0.7) / (self.max_time * 0.3))
    
    def _resource_available(self) -> float:
        """Check if resource is available at current position"""
        return 1.0 if (tuple(self.agent_pos) in self.resources and 
                      tuple(self.agent_pos) not in self.collected_resources) else 0.0
    
    def render(self, mode='human'):
        """Render the environment"""
        grid = np.zeros((self.size, self.size))
        
        for obs in self.obstacles:
            grid[obs[0], obs[1]] = -1
        
        for res in self.resources:
            if res not in self.collected_resources:
                grid[res[0], res[1]] = 0.5
        
        for res in self.collected_resources:
            grid[res[0], res[1]] = 0.3
        
        for pos in self.visited_positions:
            if grid[pos[0], pos[1]] == 0:
                grid[pos[0], pos[1]] = 0.1
        
        grid[self.goal_pos[0], self.goal_pos[1]] = 2
        
        grid[self.agent_pos[0], self.agent_pos[1]] = 1
        
        plt.figure(figsize=(8, 8))
        plt.imshow(grid, cmap='RdYlBu', vmin=-1, vmax=2)
        plt.colorbar(label='Cell Type')
        plt.title(f'Neurosymbolic GridWorld (Step: {self.time_step}, Energy: {self.energy})')
        
        legend_elements = [
            plt.Rectangle((0,0),1,1, facecolor='darkred', label='Obstacle'),
            plt.Rectangle((0,0),1,1, facecolor='orange', label='Resource'),
            plt.Rectangle((0,0),1,1, facecolor='yellow', label='Collected Resource'),
            plt.Rectangle((0,0),1,1, facecolor='lightblue', label='Visited'),
            plt.Rectangle((0,0),1,1, facecolor='blue', label='Goal'),
            plt.Rectangle((0,0),1,1, facecolor='red', label='Agent')
        ]
        plt.legend(handles=legend_elements, loc='center left', bbox_to_anchor=(1, 0.5))
        plt.grid(True, alpha=0.3)
        plt.show()

def train_neurosymbolic_agent(env, agent, episodes=1000, render_every=200):
    """Train neurosymbolic agent and collect interpretability data"""
    
    episode_rewards = []
    episode_explanations = []
    
    for episode in range(episodes):
        state, _ = env.reset()
        episode_reward = 0
        episode_explanation_log = []
        done = False
        
        while not done:
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            action, explanations = agent.policy.get_action(state_tensor, deterministic=False)
            action = action.item()
            
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_reward += reward
            
            if explanations['explanations'][0]:  # If there are explanations
                episode_explanation_log.append({
                    'step': env.time_step,
                    'state': state.copy(),
                    'action': action,
                    'reward': reward,
                    'explanations': explanations['explanations'][0].copy(),
                    'neural_predicates': explanations['neural_predicates'][0].numpy().copy(),
                    'symbolic_inferences': explanations['symbolic_inferences'][0].numpy().copy()
                })
            
            state = next_state
        
        episode_rewards.append(episode_reward)
        episode_explanations.append(episode_explanation_log)
        
        if episode % render_every == 0:
            print(f\"Episode {episode}: Reward = {episode_reward:.2f}\")\
            if episode_explanation_log:
                print(\"Sample explanations:\")
                for exp in episode_explanation_log[:3]:  # Show first 3 explanations
                    if exp['explanations']:
                        print(f\"  Step {exp['step']}: {exp['explanations'][0]}\"")
            print()
        
        if episode > 10 and episode % 10 == 0:
            train_states, train_actions, train_rewards, train_next_states, train_dones = [], [], [], [], []
            
            for _ in range(32):  # Small batch
                state, _ = env.reset()
                for _ in range(10):  # Short episodes for training
                    state_tensor = torch.FloatTensor(state).unsqueeze(0)
                    action, _ = agent.policy.get_action(state_tensor)
                    action = action.item()
                    
                    next_state, reward, terminated, truncated, _ = env.step(action)
                    done = terminated or truncated
                    
                    train_states.append(state)
                    train_actions.append(action)
                    train_rewards.append(reward)
                    train_next_states.append(next_state)
                    train_dones.append(done)
                    
                    if done:
                        break
                    state = next_state
            
            train_states = torch.FloatTensor(np.array(train_states))
            train_actions = torch.LongTensor(train_actions)
            train_rewards = torch.FloatTensor(train_rewards)
            train_next_states = torch.FloatTensor(np.array(train_next_states))
            train_dones = torch.BoolTensor(train_dones)
            
            train_info = agent.train_step(train_states, train_actions, train_rewards, train_next_states, train_dones)
            agent.training_history['rewards'].append(np.mean(episode_rewards[-10:]))
    
    return episode_rewards, episode_explanations

print("Creating Symbolic GridWorld Environment...")
env = SymbolicGridWorld(size=6)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

print(f"Environment created with state_dim={state_dim}, action_dim={action_dim}")

print("Creating Neurosymbolic Agent...")
agent = NeurosymbolicAgent(state_dim, action_dim, lr=1e-3)

print("✅ Environment and Agent ready!")
print("Next: Run training to see neurosymbolic reasoning in action")


SyntaxError: unexpected character after line continuation character (4116690020.py, line 297)

# Section 3: Human-AI Collaborative Learning

Human-AI collaborative learning represents a paradigm where AI agents learn not just from environment interaction, but also from human guidance, feedback, and collaboration to achieve superhuman performance.

## 3.1 Theoretical Foundations

### The Human-AI Collaboration Paradigm

Traditional RL assumes agents learn independently from environment feedback. **Human-AI Collaborative Learning** extends this by incorporating human intelligence:

- **Human Expertise Integration**: Leverage human domain knowledge and intuition
- **Interactive Learning**: Real-time human feedback during agent training
- **Shared Control**: Dynamic handoff between human and AI decision-making
- **Explanatory AI**: AI explains decisions to humans for better collaboration

### Learning from Human Feedback (RLHF)

**Preference-Based Learning**:
Instead of engineering reward functions, learn from human preferences:

$$r_{\theta}(s, a) = \text{RewardModel}_{\theta}(s, a)$$

Where the reward model is trained on human preference data:
$$\mathcal{D} = \{(s_i, a_i^1, a_i^2, y_i)\}$$

Where $y_i \in \{0, 1\}$ indicates whether human prefers action $a_i^1$ over $a_i^2$ in state $s_i$.

**Bradley-Terry Model** for preferences:
$$P(a^1 \succ a^2 | s) = \frac{\exp(r_{\theta}(s, a^1))}{\exp(r_{\theta}(s, a^1)) + \exp(r_{\theta}(s, a^2))}$$

**Training Objective**:
$$\mathcal{L}(\theta) = -\mathbb{E}_{(s,a^1,a^2,y) \sim \mathcal{D}}[y \log P(a^1 \succ a^2 | s) + (1-y) \log P(a^2 \succ a^1 | s)]$$

### Interactive Imitation Learning

**DAgger (Dataset Aggregation)**:
Iteratively collect expert demonstrations on learned policy trajectories:

1. Train policy $\pi_i$ on current dataset $\mathcal{D}_i$
2. Execute $\pi_i$ to collect states $\{s_t\}$
3. Query expert for optimal actions $\{a_t^*\}$ on $\{s_t\}$
4. Aggregate: $\mathcal{D}_{i+1} = \mathcal{D}_i \cup \{(s_t, a_t^*)\}$

**SMILe (Safe Multi-agent Imitation Learning)**:
Learn from multiple human experts with safety constraints:
$$\pi^* = \arg\min_\pi \sum_i w_i \mathcal{L}_{\text{imitation}}(\pi, \pi_i^{\text{expert}}) + \lambda \mathcal{L}_{\text{safety}}(\pi)$$

### Shared Autonomy and Control

**Arbitration Between Human and AI**:
Dynamic switching between human and AI control:

$$a_t = \begin{cases}
a_t^{\text{human}} & \text{if } \alpha_t > \tau \\
a_t^{\text{AI}} & \text{otherwise}
\end{cases}$$

Where $\alpha_t$ represents human authority level at time $t$.

**Confidence-Based Handoff**:
$$\alpha_t = f(\text{confidence}_{\text{AI}}(s_t), \text{urgency}(s_t), \text{human\_availability}(t))$$

**Blended Control**:
Combine human and AI actions based on context:
$$a_t = w_t \cdot a_t^{\text{human}} + (1 - w_t) \cdot a_t^{\text{AI}}$$

### Trust and Calibration

**Trust Modeling**:
Model human trust in AI decisions:
$$T_{t+1} = T_t + \alpha \cdot (\text{outcome}_t - T_t) \cdot \text{surprise}_t$$

Where:
- $T_t$: Trust level at time $t$
- $\text{outcome}_t$: Actual performance outcome
- $\text{surprise}_t$: Difference between expected and actual outcome

**Calibrated Confidence**:
Ensure AI confidence matches actual performance:
$$\text{Calibration Error} = \mathbb{E}[|\text{Confidence} - \text{Accuracy}|]$$

**Trust-Aware Policy**:
Modify policy to maintain appropriate human trust:
$$\pi_{\text{trust}}(a|s) = \pi(a|s) \cdot f_{\text{trust}}(a, s, T_t)$$

## 3.2 Human Feedback Integration Methods

### Critiquing and Advice
Allow humans to provide structured feedback:

**Action Critiquing**:
- Human observes AI action and provides feedback
- Types: "Good action", "Bad action", "Better action would be..."
- Update policy based on critique

**State-Action Advice**:
$$\mathcal{L}_{\text{advice}} = -\log \pi(a_{\text{advised}} | s) \cdot w_{\text{confidence}}$$

### Demonstration and Intervention

**Human Demonstrations**:
- Collect expert trajectories: $\tau_{\text{expert}} = \{(s_0, a_0), (s_1, a_1), \ldots\}$
- Learn via behavioral cloning or inverse RL
- Active learning: query human on uncertain states

**Intervention Learning**:
- Human takes control when AI makes mistakes
- Learn from intervention patterns
- Identify failure modes and correction strategies

### Preference Learning and Ranking

**Pairwise Preferences**:
Show human two action sequences and ask for preference
$$\mathcal{P} = \{(\tau_1, \tau_2, \text{preference})\}$$

**Trajectory Ranking**:
Rank multiple trajectories by performance
$$\tau_1 \succ \tau_2 \succ \ldots \succ \tau_k$$

**Active Preference Learning**:
Intelligently select which comparisons to show human:
$$\text{query}^* = \arg\max_{\text{query}} \text{InformationGain}(\text{query})$$

## 3.3 Collaborative Decision Making

### Shared Mental Models
Align human and AI understanding of the task:

**Common Ground**:
- Shared representation of environment
- Agreed-upon goal decomposition  
- Common terminology and concepts

**Theory of Mind**:
AI models human beliefs, intentions, and capabilities:
$$\text{AI\_Model}(\text{human\_belief}(s_t), \text{human\_goal}, \text{human\_capability})$$

### Communication Protocols

**Natural Language Interface**:
- AI explains decisions in natural language
- Human provides feedback via natural language
- Bidirectional communication for coordination

**Multimodal Communication**:
- Visual indicators (attention, confidence)
- Gestural input from humans
- Audio feedback and alerts

### Coordination Strategies

**Task Allocation**:
Divide tasks based on comparative advantage:
$$\text{Assign}(T_i) = \begin{cases}
\text{Human} & \text{if } \text{Advantage}_{\text{human}}(T_i) > \text{Advantage}_{\text{AI}}(T_i) \\
\text{AI} & \text{otherwise}
\end{cases}$$

**Dynamic Role Assignment**:
Roles change based on context, performance, and availability:
- **Leader-Follower**: One party leads, other assists
- **Peer Collaboration**: Equal partnership with negotiation
- **Hierarchical**: Clear command structure with delegation

## 3.4 Advanced Collaborative Learning Paradigms

### Constitutional AI
Train AI systems to follow high-level principles:

1. **Constitutional Training**: Define principles in natural language
2. **Self-Critiquing**: AI evaluates its own responses against principles
3. **Iterative Refinement**: Improve responses based on principle violations

**Constitutional Loss**:
$$\mathcal{L}_{\text{constitutional}} = \mathcal{L}_{\text{task}} + \lambda \sum_i \text{Violation}(\text{principle}_i)$$

### Cooperative Inverse Reinforcement Learning (Co-IRL)
Learn shared reward functions through interaction:

$$R^* = \arg\max_R \log P(\tau_{\text{human}} | R) + \log P(\tau_{\text{AI}} | R) + \text{Cooperation}(R)$$

### Multi-Agent Human-AI Teams
Extend collaboration to multi-agent settings:

**Team Formation**:
- Optimal team composition (humans + AI agents)
- Role specialization and capability matching
- Communication network topology

**Collective Intelligence**:
$$\text{Team\_Performance} > \max(\text{Individual\_Performance})$$

### Continual Human-AI Co-Evolution
Humans and AI systems improve together over time:

**Co-Adaptation**:
- AI adapts to human preferences and style
- Humans develop better collaboration skills with AI
- Mutual model updates and learning

**Lifelong Collaboration**:
- Maintain collaboration quality over extended periods
- Handle changes in human capabilities and preferences
- Evolve communication and coordination protocols

In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from typing import List, Dict, Tuple, Optional, Callable
from dataclasses import dataclass
from collections import deque
import matplotlib.pyplot as plt
import random

@dataclass
class HumanPreference:
    """Represents a human preference between two trajectories or actions"""
    state: np.ndarray
    action1: int
    action2: int 
    preference: int  # 0 if action1 preferred, 1 if action2 preferred
    confidence: float = 1.0
    timestamp: float = 0.0

@dataclass 
class HumanFeedback:
    """Different types of human feedback"""
    feedback_type: str  # 'preference', 'critique', 'demonstration', 'intervention'
    content: any
    confidence: float = 1.0
    context: Dict = None

class PreferenceRewardModel(nn.Module):
    """Neural network that learns to predict human preferences"""
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.state_dim = state_dim
        self.action_dim = action_dim
        
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + 1, hidden_dim),  # +1 for action (discrete)
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, 1)  # Output scalar reward
        )
        
        self.confidence_net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, 1),
            nn.Sigmoid()  # Confidence between 0 and 1
        )
    
    def forward(self, states: torch.Tensor, actions: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Forward pass through reward model
        
        Args:
            states: State tensors [batch, state_dim]
            actions: Action tensors [batch] (discrete)
        
        Returns:
            rewards: Predicted rewards [batch]
            confidences: Prediction confidences [batch]
        """
        actions_normalized = actions.float().unsqueeze(1) / self.action_dim
        
        state_action = torch.cat([states, actions_normalized], dim=1)
        
        rewards = self.encoder(state_action).squeeze(-1)
        confidences = self.confidence_net(state_action).squeeze(-1)
        
        return rewards, confidences
    
    def preference_probability(self, state: torch.Tensor, action1: torch.Tensor, action2: torch.Tensor) -> torch.Tensor:
        """Calculate probability of preferring action1 over action2 using Bradley-Terry model"""
        reward1, conf1 = self.forward(state, action1)
        reward2, conf2 = self.forward(state, action2)
        
        prob = torch.sigmoid(reward1 - reward2)
        return prob, (conf1 + conf2) / 2  # Average confidence

class HumanFeedbackCollector:
    """Simulates and manages human feedback collection"""
    
    def __init__(self, true_reward_fn: Optional[Callable] = None):
        self.preferences: List[HumanPreference] = []
        self.feedback_history: List[HumanFeedback] = []
        self.true_reward_fn = true_reward_fn
        self.noise_level = 0.1  # Human feedback noise
        
    def collect_preference(self, state: np.ndarray, action1: int, action2: int, 
                          use_true_reward: bool = True) -> HumanPreference:
        """Simulate human preference collection"""
        
        if use_true_reward and self.true_reward_fn is not None:
            reward1 = self.true_reward_fn(state, action1)
            reward2 = self.true_reward_fn(state, action2)
            
            reward1 += np.random.normal(0, self.noise_level)
            reward2 += np.random.normal(0, self.noise_level)
            
            preference = 0 if reward1 > reward2 else 1
            confidence = min(1.0, max(0.1, abs(reward1 - reward2)))
        else:
            preference = random.choice([0, 1])
            confidence = random.uniform(0.5, 1.0)
        
        pref = HumanPreference(
            state=state,
            action1=action1,
            action2=action2,
            preference=preference,
            confidence=confidence
        )
        
        self.preferences.append(pref)
        return pref
    
    def collect_critique(self, state: np.ndarray, action: int, ai_reward: float) -> HumanFeedback:
        """Simulate human critique of AI action"""
        
        if self.true_reward_fn is not None:
            true_reward = self.true_reward_fn(state, action)
            
            reward_diff = true_reward - ai_reward
            
            if reward_diff > 0.5:
                critique = "good_action"
                confidence = min(1.0, reward_diff)
            elif reward_diff < -0.5:
                critique = "bad_action"
                confidence = min(1.0, abs(reward_diff))
            else:
                critique = "neutral"
                confidence = 0.5
        else:
            critique = random.choice(["good_action", "bad_action", "neutral"])
            confidence = random.uniform(0.3, 1.0)
        
        feedback = HumanFeedback(
            feedback_type="critique",
            content=critique,
            confidence=confidence,
            context={"state": state, "action": action, "ai_reward": ai_reward}
        )
        
        self.feedback_history.append(feedback)
        return feedback
    
    def get_preference_dataset(self) -> List[HumanPreference]:
        """Get all collected preferences"""
        return self.preferences
    
    def clear_history(self):
        """Clear feedback history"""
        self.preferences.clear()
        self.feedback_history.clear()

class CollaborativeAgent:
    """RL Agent that learns from human feedback"""
    
    def __init__(self, state_dim: int, action_dim: int, lr: float = 3e-4):
        self.state_dim = state_dim
        self.action_dim = action_dim
        
        self.policy = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, action_dim)
        )
        
        self.value_net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )
        
        self.reward_model = PreferenceRewardModel(state_dim, action_dim)
        
        self.policy_optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr)
        self.value_optimizer = torch.optim.Adam(self.value_net.parameters(), lr=lr)
        self.reward_optimizer = torch.optim.Adam(self.reward_model.parameters(), lr=lr)
        
        self.human_trust = 0.8  # Initial trust level
        self.ai_confidence_history = deque(maxlen=100)
        self.collaboration_history = []
        
    def get_action(self, state: torch.Tensor, use_learned_reward: bool = True) -> Tuple[int, Dict]:
        """Get action with collaboration information"""
        with torch.no_grad():
            logits = self.policy(state)
            action_probs = F.softmax(logits, dim=-1)
            
            action_dist = torch.distributions.Categorical(action_probs)
            action = action_dist.sample().item()
            
            if use_learned_reward:
                reward, confidence = self.reward_model(state.unsqueeze(0), torch.tensor([action]))
                ai_confidence = confidence.item()
                predicted_reward = reward.item()
            else:
                ai_confidence = action_probs.max().item()
                predicted_reward = None
            
            self.ai_confidence_history.append(ai_confidence)
            
            intervention_threshold = self._compute_intervention_threshold()
            should_request_human = ai_confidence < intervention_threshold
            
            collab_info = {
                'action': action,
                'ai_confidence': ai_confidence,
                'predicted_reward': predicted_reward,
                'action_probs': action_probs.numpy(),
                'should_request_human': should_request_human,
                'human_trust': self.human_trust,
                'intervention_threshold': intervention_threshold
            }
            
            return action, collab_info
    
    def _compute_intervention_threshold(self) -> float:
        """Compute when to request human intervention based on trust and performance"""
        base_threshold = 0.5
        trust_adjustment = (1.0 - self.human_trust) * 0.3  # Lower trust = lower threshold
        
        if len(self.ai_confidence_history) > 10:
            recent_avg_confidence = np.mean(list(self.ai_confidence_history)[-10:])
            performance_adjustment = (0.7 - recent_avg_confidence) * 0.2
        else:
            performance_adjustment = 0
        
        threshold = base_threshold + trust_adjustment + performance_adjustment
        return np.clip(threshold, 0.2, 0.8)
    
    def train_reward_model(self, preferences: List[HumanPreference], epochs: int = 10):
        """Train reward model from human preferences"""
        if len(preferences) < 2:
            return
        
        total_loss = 0
        for epoch in range(epochs):
            epoch_loss = 0
            random.shuffle(preferences)
            
            for pref in preferences:
                state = torch.FloatTensor(pref.state).unsqueeze(0)
                action1 = torch.tensor([pref.action1])
                action2 = torch.tensor([pref.action2])
                
                prob, avg_conf = self.reward_model.preference_probability(state, action1, action2)
                
                if pref.preference == 0:  # action1 preferred
                    loss = -torch.log(prob + 1e-8)
                else:  # action2 preferred
                    loss = -torch.log(1 - prob + 1e-8)
                
                loss = loss * pref.confidence
                
                self.reward_optimizer.zero_grad()
                loss.backward()
                self.reward_optimizer.step()
                
                epoch_loss += loss.item()
            
            total_loss += epoch_loss
        
        return total_loss / (len(preferences) * epochs)
    
    def update_trust(self, predicted_outcome: float, actual_outcome: float, surprise_factor: float = 1.0):
        """Update human trust based on AI performance"""
        learning_rate = 0.1
        prediction_error = actual_outcome - predicted_outcome
        
        normalized_error = np.tanh(prediction_error / 2.0)  # Bound between -1 and 1
        
        trust_update = learning_rate * normalized_error * surprise_factor
        self.human_trust = np.clip(self.human_trust + trust_update, 0.0, 1.0)
        
        self.collaboration_history.append({
            'predicted_outcome': predicted_outcome,
            'actual_outcome': actual_outcome,
            'prediction_error': prediction_error,
            'trust_update': trust_update,
            'new_trust': self.human_trust
        })
    
    def train_policy_from_rewards(self, states: torch.Tensor, actions: torch.Tensor, 
                                 rewards: torch.Tensor, next_states: torch.Tensor, 
                                 dones: torch.Tensor) -> Dict[str, float]:
        """Train policy using learned rewards (Reinforcement Learning from Human Feedback)"""
        
        action_logits = self.policy(states)
        action_dist = torch.distributions.Categorical(logits=action_logits)
        log_probs = action_dist.log_prob(actions)
        
        values = self.value_net(states).squeeze()
        next_values = self.value_net(next_states).squeeze()
        
        with torch.no_grad():
            targets = rewards + 0.99 * next_values * (1 - dones.float())
            advantages = targets - values
        
        policy_loss = -(log_probs * advantages.detach()).mean()
        
        value_loss = F.mse_loss(values, targets.detach())
        
        entropy = action_dist.entropy().mean()
        entropy_bonus = 0.01 * entropy
        
        total_loss = policy_loss + 0.5 * value_loss - entropy_bonus
        
        self.policy_optimizer.zero_grad()
        total_loss.backward()
        self.policy_optimizer.step()
        
        value_loss_separate = F.mse_loss(self.value_net(states).squeeze(), targets.detach())
        self.value_optimizer.zero_grad()
        value_loss_separate.backward()
        self.value_optimizer.step()
        
        return {
            'policy_loss': policy_loss.item(),
            'value_loss': value_loss.item(),
            'entropy': entropy.item(),
            'total_loss': total_loss.item()
        }
    
    def get_collaboration_stats(self) -> Dict:
        """Get collaboration statistics"""
        if not self.collaboration_history:
            return {}
        
        recent_history = self.collaboration_history[-50:]  # Last 50 interactions
        
        return {
            'current_trust': self.human_trust,
            'avg_prediction_error': np.mean([h['prediction_error'] for h in recent_history]),
            'trust_volatility': np.std([h['new_trust'] for h in recent_history]),
            'collaboration_count': len(self.collaboration_history),
            'recent_performance': np.mean([1 - abs(h['prediction_error']) for h in recent_history])
        }

print("✅ Human-AI Collaborative Learning implementation complete!")
print("Components implemented:")
print("- PreferenceRewardModel: Learn from human preferences") 
print("- HumanFeedbackCollector: Simulate human feedback")
print("- CollaborativeAgent: Agent that learns from human feedback")
print("- Trust modeling and intervention mechanisms")


✅ Human-AI Collaborative Learning implementation complete!
Components implemented:
- PreferenceRewardModel: Learn from human preferences
- HumanFeedbackCollector: Simulate human feedback
- CollaborativeAgent: Agent that learns from human feedback
- Trust modeling and intervention mechanisms


In [9]:
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import seaborn as sns

class CollaborativeGridWorld(gym.Env):
    """GridWorld specifically designed for human-AI collaboration testing"""
    
    def __init__(self, size=6):
        super().__init__()
        self.size = size
        self.reset()
        
        self.action_space = spaces.Discrete(4)
        self.observation_space = spaces.Box(low=0, high=1, shape=(8,), dtype=np.float32)
        
        self.actions = {0: [-1, 0], 1: [1, 0], 2: [0, -1], 3: [0, 1]}  # Up, Down, Left, Right
        
        self.true_reward_weights = {
            'goal_distance': -0.1,
            'obstacle_penalty': -5.0,
            'goal_reward': 10.0,
            'efficiency_bonus': 0.5,
            'exploration_bonus': 0.1
        }
    
    def reset(self, seed=None):
        super().reset(seed=seed)
        self.agent_pos = [0, 0]
        self.goal_pos = [self.size-1, self.size-1]
        
        self.obstacles = set()
        for i in range(2, 5):
            self.obstacles.add((i, 2))
        for j in range(1, 4):
            self.obstacles.add((2, j))
        
        self.obstacles.discard(tuple(self.goal_pos))
        
        self.visited_positions = {tuple(self.agent_pos)}
        self.step_count = 0
        self.max_steps = self.size * self.size * 2
        
        return self._get_observation(), {}
    
    def step(self, action):
        old_pos = self.agent_pos.copy()
        new_pos = [
            self.agent_pos[0] + self.actions[action][0],
            self.agent_pos[1] + self.actions[action][1]
        ]
        
        reward = 0
        
        if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size and 
            tuple(new_pos) not in self.obstacles):
            self.agent_pos = new_pos
            
            old_dist = abs(old_pos[0] - self.goal_pos[0]) + abs(old_pos[1] - self.goal_pos[1])
            new_dist = abs(new_pos[0] - self.goal_pos[0]) + abs(new_pos[1] - self.goal_pos[1])
            reward += self.true_reward_weights['goal_distance'] * (new_dist - old_dist)
            
            if tuple(new_pos) not in self.visited_positions:
                reward += self.true_reward_weights['exploration_bonus']
                self.visited_positions.add(tuple(new_pos))
        else:
            reward += self.true_reward_weights['obstacle_penalty']
        
        if self.agent_pos == self.goal_pos:
            reward += self.true_reward_weights['goal_reward']
            efficiency = max(0, 1 - self.step_count / (self.size * 2))
            reward += self.true_reward_weights['efficiency_bonus'] * efficiency
        
        self.step_count += 1
        
        terminated = (self.agent_pos == self.goal_pos or self.step_count >= self.max_steps)
        
        return self._get_observation(), reward, terminated, False, {}
    
    def _get_observation(self):
        """Get observation with useful features for learning"""
        obs = np.zeros(8, dtype=np.float32)
        
        obs[0] = self.agent_pos[0] / self.size
        obs[1] = self.agent_pos[1] / self.size
        
        goal_dx = (self.goal_pos[0] - self.agent_pos[0]) / self.size
        goal_dy = (self.goal_pos[1] - self.agent_pos[1]) / self.size
        obs[2] = goal_dx
        obs[3] = goal_dy
        
        goal_dist = abs(self.agent_pos[0] - self.goal_pos[0]) + abs(self.agent_pos[1] - self.goal_pos[1])
        obs[4] = goal_dist / (2 * self.size)
        
        for i, (dx, dy) in enumerate([[0, 1], [0, -1], [1, 0], [-1, 0]]):
            next_pos = [self.agent_pos[0] + dx, self.agent_pos[1] + dy]
            if (next_pos[0] < 0 or next_pos[0] >= self.size or 
                next_pos[1] < 0 or next_pos[1] >= self.size or
                tuple(next_pos) in self.obstacles):
                obs[5] = 1.0  # Obstacle detected
                break
        
        obs[6] = self.step_count / self.max_steps
        
        obs[7] = len(self.visited_positions) / (self.size * self.size)
        
        return obs
    
    def true_reward_function(self, state, action):
        """True reward function for simulating human feedback"""
        old_pos = self.agent_pos.copy()
        old_step = self.step_count
        old_visited = self.visited_positions.copy()
        
        obs, reward, done, truncated, _ = self.step(action)
        
        self.agent_pos = old_pos
        self.step_count = old_step
        self.visited_positions = old_visited
        
        return reward
    
    def render_collaboration(self, collab_info=None):
        """Render environment with collaboration information"""
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
        
        grid = np.zeros((self.size, self.size))
        for obs in self.obstacles:
            grid[obs[0], obs[1]] = -1
        for pos in self.visited_positions:
            if grid[pos[0], pos[1]] == 0:
                grid[pos[0], pos[1]] = 0.3
        grid[self.goal_pos[0], self.goal_pos[1]] = 2
        grid[self.agent_pos[0], self.agent_pos[1]] = 1
        
        im = ax1.imshow(grid, cmap='RdYlBu', vmin=-1, vmax=2)
        ax1.set_title(f'Collaborative GridWorld (Step: {self.step_count})')
        ax1.grid(True, alpha=0.3)
        
        legend_elements = [
            Rectangle((0,0),1,1, facecolor='darkred', label='Obstacle'),
            Rectangle((0,0),1,1, facecolor='lightblue', label='Visited'),
            Rectangle((0,0),1,1, facecolor='blue', label='Goal'),
            Rectangle((0,0),1,1, facecolor='red', label='Agent')
        ]
        ax1.legend(handles=legend_elements, loc='center left', bbox_to_anchor=(1, 0.5))
        
        if collab_info:
            ax2.axis('off')
            info_text = f"""Collaboration Status:
            
AI Confidence: {collab_info['ai_confidence']:.3f}
Human Trust: {collab_info['human_trust']:.3f}
Intervention Threshold: {collab_info['intervention_threshold']:.3f}
Request Human Help: {collab_info['should_request_human']}

Action Probabilities:
"""
            for i, prob in enumerate(collab_info['action_probs']):
                action_names = ['Up', 'Down', 'Left', 'Right']
                info_text += f"  {action_names[i]}: {prob:.3f}\n"
            
            if collab_info['predicted_reward'] is not None:
                info_text += f"\nPredicted Reward: {collab_info['predicted_reward']:.3f}"
                
            ax2.text(0.1, 0.9, info_text, transform=ax2.transAxes, fontsize=10,
                    verticalalignment='top', fontfamily='monospace')
        
        plt.tight_layout()
        plt.show()

def train_collaborative_agent(env, agent, feedback_collector, episodes=500, 
                             feedback_frequency=10, render_frequency=100):
    """Train agent with human feedback integration"""
    
    episode_rewards = []
    trust_history = []
    collaboration_events = []
    
    print("Starting Collaborative Training...")
    print(f"Episodes: {episodes}, Feedback every: {feedback_frequency} episodes")
    
    for episode in range(episodes):
        state, _ = env.reset()
        episode_reward = 0
        episode_steps = []
        done = False
        
        while not done:
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            action, collab_info = agent.get_action(state_tensor, use_learned_reward=True)
            
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_reward += reward
            
            episode_steps.append({
                'state': state.copy(),
                'action': action,
                'reward': reward,
                'next_state': next_state.copy(),
                'done': done,
                'collab_info': collab_info
            })
            
            if collab_info['predicted_reward'] is not None:
                agent.update_trust(
                    predicted_outcome=collab_info['predicted_reward'],
                    actual_outcome=reward,
                    surprise_factor=1.0 - collab_info['ai_confidence']
                )
            
            state = next_state
        
        episode_rewards.append(episode_reward)
        trust_history.append(agent.human_trust)
        
        if episode % feedback_frequency == 0 and episode > 0:
            print(f"\n--- Episode {episode}: Collecting Human Feedback ---")
            
            feedback_count = 0
            for step_data in episode_steps[-10:]:  # Last 10 steps of episode
                if random.random() < 0.3:  # 30% chance of feedback per step
                    available_actions = list(range(env.action_space.n))
                    if step_data['action'] in available_actions:
                        available_actions.remove(step_data['action'])
                    
                    if available_actions:
                        alt_action = random.choice(available_actions)
                        
                        pref = feedback_collector.collect_preference(
                            state=step_data['state'],
                            action1=step_data['action'],
                            action2=alt_action,
                            use_true_reward=True
                        )
                        feedback_count += 1
                        
                        critique = feedback_collector.collect_critique(
                            state=step_data['state'],
                            action=step_data['action'],
                            ai_reward=step_data['collab_info'].get('predicted_reward', 0)
                        )
            
            print(f"Collected {feedback_count} preference comparisons")
            
            if len(feedback_collector.preferences) > 5:
                reward_loss = agent.train_reward_model(
                    feedback_collector.preferences,
                    epochs=5
                )
                print(f"Reward model loss: {reward_loss:.4f}")
            
            if len(episode_steps) > 10:
                states = torch.FloatTensor([step['state'] for step in episode_steps])
                actions = torch.LongTensor([step['action'] for step in episode_steps])
                next_states = torch.FloatTensor([step['next_state'] for step in episode_steps])
                dones = torch.BoolTensor([step['done'] for step in episode_steps])
                
                with torch.no_grad():
                    learned_rewards, _ = agent.reward_model(states, actions)
                
                policy_stats = agent.train_policy_from_rewards(
                    states, actions, learned_rewards, next_states, dones
                )
                print(f"Policy loss: {policy_stats['policy_loss']:.4f}")
                
            collaboration_events.append({
                'episode': episode,
                'reward': episode_reward,
                'trust': agent.human_trust,
                'feedback_count': feedback_count,
                'total_preferences': len(feedback_collector.preferences)
            })
        
        if episode % render_frequency == 0 and episode > 0:
            print(f"\nEpisode {episode}:")
            print(f"  Reward: {episode_reward:.2f}")
            print(f"  Human Trust: {agent.human_trust:.3f}")
            print(f"  Total Preferences Collected: {len(feedback_collector.preferences)}")
            
            collab_stats = agent.get_collaboration_stats()
            if collab_stats:
                print(f"  Avg Prediction Error: {collab_stats['avg_prediction_error']:.3f}")
                print(f"  Recent Performance: {collab_stats['recent_performance']:.3f}")
    
    return {
        'episode_rewards': episode_rewards,
        'trust_history': trust_history,
        'collaboration_events': collaboration_events,
        'final_preferences': feedback_collector.preferences
    }

def setup_collaborative_experiment():
    """Setup and run collaborative learning experiment"""
    
    print("Setting up Collaborative Learning Experiment...")
    
    env = CollaborativeGridWorld(size=6)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    agent = CollaborativeAgent(state_dim, action_dim, lr=1e-3)
    
    feedback_collector = HumanFeedbackCollector(
        true_reward_fn=lambda state, action: env.true_reward_function(state, action)
    )
    
    print(f"Environment: {state_dim}D state, {action_dim} actions")
    print(f"Agent: CollaborativeAgent with preference learning")
    print(f"Feedback: Simulated human with {feedback_collector.noise_level} noise level")
    
    return env, agent, feedback_collector

env, agent, feedback_collector = setup_collaborative_experiment()

print("\n🤖 Testing single episode with collaboration...")
state, _ = env.reset()
action, collab_info = agent.get_action(torch.FloatTensor(state).unsqueeze(0))
print(f"AI Action: {action}, Confidence: {collab_info['ai_confidence']:.3f}")
print(f"Should request human help: {collab_info['should_request_human']}")

print("\n✅ Collaborative Learning setup complete!")
print("Ready to run: train_collaborative_agent(env, agent, feedback_collector)")
print("This will train the agent with simulated human feedback")


Setting up Collaborative Learning Experiment...


NameError: name 'spaces' is not defined

# Section 4: Foundation Models in Reinforcement Learning

Foundation models represent a paradigm shift in RL, leveraging pre-trained large models to achieve sample-efficient learning and strong generalization across diverse tasks and domains.

## 4.1 Theoretical Foundations

### The Foundation Model Paradigm in RL

**Traditional RL Limitations**:
- **Sample Inefficiency**: Learning from scratch on each task
- **Poor Generalization**: Overfitting to specific environments
- **Limited Transfer**: Difficulty sharing knowledge across domains
- **Representation Learning**: Learning both policy and representations simultaneously

**Foundation Model Advantages**:
- **Pre-trained Representations**: Rich features learned from large datasets
- **Few-Shot Learning**: Rapid adaptation to new tasks with minimal data
- **Cross-Domain Transfer**: Knowledge sharing across different environments
- **Compositional Reasoning**: Understanding of complex task structures

### Mathematical Framework

**Foundation Model as Universal Approximator**:
$$f_{\theta}: \mathcal{X} \rightarrow \mathcal{Z}$$

Where $\mathcal{X}$ is input space (observations, language, etc.) and $\mathcal{Z}$ is latent representation space.

**Task-Specific Adaptation**:
$$\pi_{\phi}^{(i)}(a|s) = g_{\phi}(f_{\theta}(s), \text{context}_i)$$

Where $g_{\phi}$ is a task-specific head and $\text{context}_i$ provides task information.

**Multi-Task Objective**:
$$\mathcal{L} = \sum_{i=1}^{T} w_i \mathcal{L}_i(\pi_{\phi}^{(i)}) + \lambda \mathcal{L}_{\text{reg}}(\theta, \phi)$$

Where $T$ is number of tasks, $w_i$ are task weights, and $\mathcal{L}_{\text{reg}}$ is regularization.

### Transfer Learning in RL

**Three Paradigms**:

1. **Feature Transfer**: Use pre-trained features
   $$\pi(a|s) = \text{Head}(\text{FrozenFoundationModel}(s))$$

2. **Fine-Tuning**: Adapt entire model
   $$\theta^{*} = \arg\min_{\theta} \mathcal{L}_{\text{task}}(\theta) + \lambda ||\theta - \theta_0||^2$$

3. **Prompt-Based Learning**: Task specification through prompts
   $$\pi(a|s, p) = \text{FoundationModel}(s, p)$$
   
   Where $p$ is a task-specific prompt.

### Cross-Modal Learning

**Vision-Language-Action Models**:
$$\pi(a|v, l) = f(v, l) \text{ where } v \in \mathcal{V}, l \in \mathcal{L}, a \in \mathcal{A}$$

**Unified Representations**:
- Visual observations $\rightarrow$ Vision transformer features
- Language instructions $\rightarrow$ Language model embeddings  
- Actions $\rightarrow$ Shared action space representations

**Cross-Modal Alignment**:
$$\mathcal{L}_{\text{align}} = ||\text{Embed}_V(v) - \text{Embed}_L(\text{describe}(v))||^2$$

## 4.2 Large Language Models for RL

### LLMs as World Models

**Chain-of-Thought Reasoning**:
```
Thought: I need to navigate to the goal while avoiding obstacles.
Action: Move right to avoid the wall on the left.
Observation: I see a clear path ahead.
Thought: The goal is north of my position.
Action: Move up toward the goal.
```

**Structured Reasoning**:
$$\text{Action} = \text{LLM}(\text{State}, \text{Goal}, \text{History}, \text{Reasoning Template})$$

### Prompt Engineering for RL

**Task Specification Prompts**:
```
Task: Navigate a robot to collect all gems in a maze.
Rules: 
- Avoid obstacles (marked as #)
- Collect gems (marked as *)  
- Reach exit (marked as E)
Current state: [ASCII representation]
Choose action: [up, down, left, right]
```

**Few-Shot Learning Prompts**:
```
Example 1:
State: Agent at (0,0), Goal at (1,1), No obstacles
Action: right (move toward goal)
Result: Reached (1,0)

Example 2: 
State: Agent at (1,0), Goal at (1,1)
Action: up (move toward goal)
Result: Reached goal, +10 reward

Current situation:
State: [current state]
Action: [your choice]
```

### LLM-Based Hierarchical Planning

**High-Level Planning**:
$$\text{Subgoals} = \text{LLM}_{\text{planner}}(\text{Task}, \text{Environment})$$

**Low-Level Execution**:
$$a_t = \pi_{\text{low}}(s_t, \text{current\_subgoal})$$

**Plan Refinement**:
$$\text{Updated\_Plan} = \text{LLM}_{\text{planner}}(\text{Original\_Plan}, \text{Execution\_Feedback})$$

## 4.3 Vision Transformers in RL

### ViT for State Representation

**Patch Embedding**:
$$\text{Patches} = \text{Reshape}(\text{Image}_{H \times W \times C}) \rightarrow \mathbb{R}^{N \times P^2 \cdot C}$$

Where $N = HW/P^2$ is number of patches and $P$ is patch size.

**Spatial-Temporal Attention**:
- **Spatial**: Attend to important regions in current frame
- **Temporal**: Attend to relevant frames in history
- **Action**: Attend to action-relevant features

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

**Action Prediction Head**:
$$\pi(a|s) = \text{MLP}(\text{ViT}(s)[\text{CLS}])$$

Where $[\text{CLS}]$ is the classification token embedding.

### Multi-Modal Fusion

**Visual-Language Fusion**:
$$h_{\text{fused}} = \text{Attention}(h_{\text{vision}}, h_{\text{language}}, h_{\text{language}})$$

**Hierarchical Feature Integration**:
- **Low-level**: Pixel features, edge detection
- **Mid-level**: Objects, spatial relationships  
- **High-level**: Scene understanding, semantic concepts

### Attention-Based Policy Networks

**Self-Attention for State Processing**:
$$A_{\text{state}} = \text{SelfAttention}(\text{StateFeatures})$$

**Cross-Attention for Action Selection**:
$$A_{\text{action}} = \text{CrossAttention}(\text{ActionQueries}, \text{StateFeatures})$$

**Multi-Head Architecture**:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

## 4.4 Foundation Model Training Strategies

### Pre-Training Objectives

**Masked Language Modeling (MLM)**:
$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \text{masked}} \log p(x_i | x_{\setminus i})$$

**Masked Image Modeling (MIM)**:  
$$\mathcal{L}_{\text{MIM}} = ||\text{Reconstruct}(\text{Mask}(\text{Image})) - \text{Image}||^2$$

**Contrastive Learning**:
$$\mathcal{L}_{\text{contrastive}} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k} \exp(\text{sim}(z_i, z_k)/\tau)}$$

### Multi-Task Pre-Training

**Joint Training Objective**:
$$\mathcal{L}_{\text{joint}} = \sum_{t=1}^{T} \lambda_t \mathcal{L}_t + \mathcal{L}_{\text{reg}}$$

**Task Sampling Strategies**:
- **Uniform Sampling**: Equal probability for all tasks
- **Importance Sampling**: Weight by task difficulty/importance
- **Curriculum Learning**: Gradually increase task complexity

**Parameter Sharing Strategies**:
- **Shared Encoder**: Common feature extraction
- **Task-Specific Heads**: Specialized output layers
- **Adapter Layers**: Small task-specific modifications

### Fine-Tuning Approaches

**Full Fine-Tuning**:
- Update all parameters for target task
- Risk of catastrophic forgetting
- Requires substantial computational resources

**Parameter-Efficient Fine-Tuning**:

**LoRA (Low-Rank Adaptation)**:
$$W' = W + AB$$
where $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times d}$ with $r << d$.

**Adapter Layers**:
$$h' = h + \text{Adapter}(h) = h + W_2 \sigma(W_1 h + b_1) + b_2$$

**Prefix Tuning**:
Add learnable prefix vectors to transformer inputs.

### Continual Learning for Foundation Models

**Elastic Weight Consolidation (EWC)**:
$$\mathcal{L}_{\text{EWC}} = \mathcal{L}_{\text{task}} + \lambda \sum_i F_i (\theta_i - \theta_i^*)^2$$

Where $F_i$ is Fisher information matrix diagonal.

**Progressive Networks**:
- Freeze previous task parameters
- Add new columns for new tasks
- Lateral connections for knowledge transfer

**Meta-Learning for Rapid Adaptation**:
$$\theta' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\text{support}}(\theta)$$
$$\mathcal{L}_{\text{meta}} = \mathbb{E}_{\text{tasks}} [\mathcal{L}_{\text{query}}(\theta')]$$

## 4.5 Emergent Capabilities

### Few-Shot Task Learning
Foundation models demonstrate remarkable ability to adapt to new tasks with minimal examples:

**In-Context Learning**:
- Provide examples in input prompt
- Model adapts without parameter updates
- Emergent capability from scale and diversity

**Meta-Learning Through Pre-Training**:
- Learn to learn from pre-training data distribution
- Transfer learning strategies emerge naturally
- Rapid adaptation to distribution shifts

### Compositional Reasoning
Combine primitive skills to solve complex tasks:

**Skill Composition**:
$$\text{ComplexTask} = \text{Compose}(\text{Skill}_1, \text{Skill}_2, \ldots, \text{Skill}_k)$$

**Hierarchical Planning**:
- Decompose complex goals into subgoals
- Learn primitive skills for subgoal achievement
- Compose skills dynamically based on context

### Cross-Domain Transfer
Knowledge learned in one domain transfers to related domains:

**Domain Adaptation**:
$$\mathcal{L}_{\text{adapt}} = \mathcal{L}_{\text{target}} + \lambda \mathcal{L}_{\text{domain}}$$

**Universal Policies**:
Single policy that works across multiple environments with different dynamics, observation spaces, and action spaces.

In [10]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import MultiheadAttention, LayerNorm
import math
import numpy as np
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass

class PositionalEncoding(nn.Module):
    """Positional encoding for transformer models"""
    
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                           (-math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        return x + self.pe[:x.size(0), :]

class TransformerBlock(nn.Module):
    """Transformer block with multi-head attention and feedforward layers"""
    
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attention = MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        
        self.feedforward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        
    def forward(self, x, mask=None):
        attn_out, attn_weights = self.attention(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)
        
        ff_out = self.feedforward(x)
        x = self.norm2(x + ff_out)
        
        return x, attn_weights

class VisionTransformer(nn.Module):
    """Vision Transformer for processing visual observations"""
    
    def __init__(self, img_size: int = 84, patch_size: int = 16, in_channels: int = 3,
                 d_model: int = 256, n_heads: int = 8, n_layers: int = 6, 
                 d_ff: int = 1024, dropout: float = 0.1):
        super().__init__()
        
        self.img_size = img_size
        self.patch_size = patch_size
        self.n_patches = (img_size // patch_size) ** 2
        self.d_model = d_model
        
        self.patch_embed = nn.Conv2d(in_channels, d_model, 
                                   kernel_size=patch_size, stride=patch_size)
        
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))
        
        self.pos_embed = nn.Parameter(torch.randn(1, self.n_patches + 1, d_model))
        
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        
        self.norm = LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        """
        Args:
            x: Input images [batch, channels, height, width]
        Returns:
            features: Encoded features [batch, n_patches + 1, d_model]
            attentions: List of attention weights for visualization
        """
        batch_size = x.shape[0]
        
        x = self.patch_embed(x)
        
        x = x.flatten(2).transpose(1, 2)
        
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)
        
        x = x + self.pos_embed
        x = self.dropout(x)
        
        attentions = []
        for layer in self.layers:
            x, attn = layer(x)
            attentions.append(attn)
        
        x = self.norm(x)
        
        return x, attentions

class LanguageEncoder(nn.Module):
    """Transformer-based language encoder for processing instructions"""
    
    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 8,
                 n_layers: int = 4, max_seq_len: int = 128, dropout: float = 0.1):
        super().__init__()
        
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_seq_len)
        
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_model * 4, dropout)
            for _ in range(n_layers)
        ])
        
        self.norm = LayerNorm(d_model)
        
    def forward(self, tokens, attention_mask=None):
        """
        Args:
            tokens: Input token indices [batch, seq_len]
            attention_mask: Attention mask [batch, seq_len]
        Returns:
            encoded: Encoded language features [batch, seq_len, d_model]
        """
        x = self.embedding(tokens) * math.sqrt(self.d_model)
        x = self.pos_encoding(x.transpose(0, 1)).transpose(0, 1)
        
        if attention_mask is not None:
            mask = attention_mask.unsqueeze(1).repeat(1, attention_mask.size(1), 1)
            mask = mask.masked_fill(mask == 0, float('-inf'))
        else:
            mask = None
        
        for layer in self.layers:
            x, _ = layer(x, mask)
        
        x = self.norm(x)
        return x

class CrossModalFusion(nn.Module):
    """Fuse visual and language representations using cross-attention"""
    
    def __init__(self, d_model: int, n_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        
        self.vision_to_lang = MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.lang_to_vision = MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        
        self.fusion_net = nn.Sequential(
            nn.Linear(d_model * 2, d_model),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_model, d_model)
        )
        
    def forward(self, vision_features, lang_features, lang_mask=None):
        """
        Args:
            vision_features: Visual features [batch, n_patches, d_model]
            lang_features: Language features [batch, seq_len, d_model]
            lang_mask: Language attention mask
        Returns:
            fused_features: Fused multi-modal features [batch, d_model]
        """
        vision_attended, _ = self.vision_to_lang(
            vision_features, lang_features, lang_features, key_padding_mask=lang_mask
        )
        vision_features = self.norm1(vision_features + vision_attended)
        
        lang_attended, _ = self.lang_to_vision(
            lang_features, vision_features, vision_features
        )
        lang_features = self.norm2(lang_features + lang_attended)
        
        vision_pooled = vision_features[:, 0]  # Use CLS token
        lang_pooled = lang_features.mean(dim=1)  # Average pooling
        
        combined = torch.cat([vision_pooled, lang_pooled], dim=1)
        fused = self.fusion_net(combined)
        
        return fused

class FoundationPolicy(nn.Module):
    """Foundation model-based policy with vision and language understanding"""
    
    def __init__(self, 
                 img_size: int = 84,
                 patch_size: int = 16, 
                 in_channels: int = 3,
                 vocab_size: int = 1000,
                 action_dim: int = 4,
                 d_model: int = 256,
                 n_heads: int = 8,
                 n_layers: int = 6,
                 max_seq_len: int = 64,
                 dropout: float = 0.1):
        super().__init__()
        
        self.d_model = d_model
        self.action_dim = action_dim
        
        self.vision_encoder = VisionTransformer(
            img_size, patch_size, in_channels, d_model, n_heads, n_layers, 
            d_model * 4, dropout
        )
        
        self.language_encoder = LanguageEncoder(
            vocab_size, d_model, n_heads, n_layers // 2, max_seq_len, dropout
        )
        
        self.fusion = CrossModalFusion(d_model, n_heads, dropout)
        
        self.policy_head = nn.Sequential(
            nn.Linear(d_model, d_model // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_model // 2, action_dim)
        )
        
        self.value_head = nn.Sequential(
            nn.Linear(d_model, d_model // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_model // 2, 1)
        )
        
        self.apply(self._init_weights)
    
    def _init_weights(self, module):
        """Initialize model weights"""
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, images, instructions=None, instruction_mask=None):
        """
        Args:
            images: Visual observations [batch, channels, height, width]
            instructions: Language instructions [batch, seq_len] (optional)
            instruction_mask: Instruction attention mask [batch, seq_len]
        Returns:
            action_logits: Action probabilities [batch, action_dim]
            values: State values [batch]
            attention_info: Dictionary with attention weights for visualization
        """
        vision_features, vision_attentions = self.vision_encoder(images)
        
        if instructions is not None:
            lang_features = self.language_encoder(instructions, instruction_mask)
            
            fused_features = self.fusion(vision_features, lang_features, instruction_mask)
        else:
            fused_features = vision_features[:, 0]
        
        action_logits = self.policy_head(fused_features)
        values = self.value_head(fused_features).squeeze(-1)
        
        attention_info = {
            'vision_attentions': vision_attentions,
            'fused_features': fused_features
        }
        
        return action_logits, values, attention_info
    
    def get_action(self, images, instructions=None, instruction_mask=None, 
                   deterministic=False):
        """Get action from policy"""
        with torch.no_grad():
            action_logits, values, attention_info = self.forward(
                images, instructions, instruction_mask
            )
            
            if deterministic:
                actions = torch.argmax(action_logits, dim=-1)
            else:
                action_dist = torch.distributions.Categorical(logits=action_logits)
                actions = action_dist.sample()
        
        return actions, values, attention_info

class FewShotLearner:
    """Few-shot learning using foundation models"""
    
    def __init__(self, foundation_model: FoundationPolicy):
        self.foundation_model = foundation_model
        self.task_examples = []
    
    def add_example(self, image, instruction, action, reward):
        """Add a few-shot example"""
        self.task_examples.append({
            'image': image,
            'instruction': instruction,
            'action': action,
            'reward': reward
        })
    
    def adapt_to_task(self, support_data, lr=1e-4, steps=10):
        """Adapt foundation model to new task using few examples"""
        optimizer = torch.optim.Adam(self.foundation_model.parameters(), lr=lr)
        
        for step in range(steps):
            total_loss = 0
            
            for example in support_data:
                images = example['images'].unsqueeze(0)
                instructions = example['instructions'].unsqueeze(0)
                actions = example['actions'].unsqueeze(0)
                rewards = example['rewards'].unsqueeze(0)
                
                action_logits, values, _ = self.foundation_model(images, instructions)
                
                action_loss = F.cross_entropy(action_logits, actions)
                value_loss = F.mse_loss(values, rewards)
                
                loss = action_loss + 0.5 * value_loss
                total_loss += loss
            
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()
        
        return total_loss.item()

class PromptTemplate:
    """Template for generating task prompts"""
    
    def __init__(self, task_type: str):
        self.task_type = task_type
        self.templates = {
            'navigation': "Navigate to {goal} while avoiding {obstacles}. Current position: {position}.",
            'collection': "Collect all {objects} in the environment. Collected: {collected}/{total}.",
            'interaction': "Interact with {target} to {action}. Available actions: {actions}.",
            'puzzle': "Solve the puzzle by {instruction}. Current state: {state}."
        }
    
    def generate_prompt(self, **kwargs) -> str:
        """Generate prompt from template"""
        if self.task_type in self.templates:
            return self.templates[self.task_type].format(**kwargs)
        else:
            return f"Complete the task: {kwargs.get('instruction', 'Unknown task')}"
    
    def tokenize_prompt(self, prompt: str, tokenizer, max_length: int = 64) -> Dict:
        """Tokenize prompt for model input"""
        words = prompt.lower().split()
        
        vocab = {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3}
        for word in words:
            if word not in vocab:
                vocab[word] = len(vocab)
        
        tokens = [vocab.get(word, vocab['[UNK]']) for word in words]
        
        if len(tokens) < max_length:
            tokens = tokens + [vocab['[PAD]']] * (max_length - len(tokens))
            mask = [1] * len(words) + [0] * (max_length - len(words))
        else:
            tokens = tokens[:max_length]
            mask = [1] * max_length
        
        return {
            'tokens': torch.tensor(tokens),
            'mask': torch.tensor(mask, dtype=torch.bool),
            'vocab': vocab
        }

print("✅ Foundation Models in RL implementation complete!")
print("Components implemented:")
print("- VisionTransformer: Process visual observations with attention")
print("- LanguageEncoder: Process text instructions") 
print("- CrossModalFusion: Fuse vision and language representations")
print("- FoundationPolicy: Multi-modal policy with interpretable attention")
print("- FewShotLearner: Rapid task adaptation")
print("- PromptTemplate: Task specification through natural language")


✅ Foundation Models in RL implementation complete!
Components implemented:
- VisionTransformer: Process visual observations with attention
- LanguageEncoder: Process text instructions
- CrossModalFusion: Fuse vision and language representations
- FoundationPolicy: Multi-modal policy with interpretable attention
- FewShotLearner: Rapid task adaptation
- PromptTemplate: Task specification through natural language


In [11]:

class MultiModalGridWorld(gym.Env):
    """Enhanced GridWorld with visual rendering and language instructions"""
    
    def __init__(self, size=8, render_size=84):
        super().__init__()
        self.size = size
        self.render_size = render_size
        self.reset()
        
        self.action_space = spaces.Discrete(4)  # Up, Down, Left, Right
        
        self.observation_space = spaces.Dict({
            'image': spaces.Box(low=0, high=255, shape=(3, render_size, render_size), dtype=np.uint8),
            'instruction': spaces.Box(low=0, high=1000, shape=(64,), dtype=np.int32),
            'instruction_mask': spaces.Box(low=0, high=1, shape=(64,), dtype=np.bool_)
        })
        
        self.actions = {0: [-1, 0], 1: [1, 0], 2: [0, -1], 3: [0, 1]}
        self.action_names = ['Up', 'Down', 'Left', 'Right']
        
        self.tasks = [
            'navigation', 'collection', 'avoidance', 'exploration'
        ]
        
        self.prompt_template = PromptTemplate('navigation')
    
    def reset(self, task_type='navigation', seed=None):
        super().reset(seed=seed)
        
        self.agent_pos = [0, 0]
        self.goal_pos = [self.size-1, self.size-1]
        
        self.obstacles = set()
        self.treasures = set()
        self.visited = {tuple(self.agent_pos)}
        
        for _ in range(self.size // 2):
            x, y = np.random.randint(1, self.size-1), np.random.randint(1, self.size-1)
            if [x, y] not in [self.agent_pos, self.goal_pos]:
                self.obstacles.add((x, y))
        
        if task_type == 'collection':
            for _ in range(3):
                x, y = np.random.randint(0, self.size), np.random.randint(0, self.size)
                if ([x, y] not in [self.agent_pos, self.goal_pos] and 
                    (x, y) not in self.obstacles):
                    self.treasures.add((x, y))
        
        self.collected_treasures = set()
        self.step_count = 0
        self.max_steps = self.size * self.size
        self.task_type = task_type
        
        return self._get_observation(), {}
    
    def step(self, action):
        old_pos = self.agent_pos.copy()
        new_pos = [
            self.agent_pos[0] + self.actions[action][0],
            self.agent_pos[1] + self.actions[action][1]
        ]
        
        reward = -0.1  # Step penalty
        
        if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size and 
            tuple(new_pos) not in self.obstacles):
            self.agent_pos = new_pos
            self.visited.add(tuple(new_pos))
            
            if self.task_type == 'navigation':
                old_dist = abs(old_pos[0] - self.goal_pos[0]) + abs(old_pos[1] - self.goal_pos[1])
                new_dist = abs(new_pos[0] - self.goal_pos[0]) + abs(new_pos[1] - self.goal_pos[1])
                reward += (old_dist - new_dist) * 0.1
                
                if new_pos == self.goal_pos:
                    reward += 10
                    
            elif self.task_type == 'collection':
                if tuple(new_pos) in self.treasures and tuple(new_pos) not in self.collected_treasures:
                    self.collected_treasures.add(tuple(new_pos))
                    reward += 5
                    
                if len(self.collected_treasures) == len(self.treasures):
                    reward += 20
                    
            elif self.task_type == 'exploration':
                if tuple(new_pos) not in self.visited:
                    reward += 1
        else:
            reward -= 1  # Collision penalty
        
        self.step_count += 1
        
        terminated = self._check_termination()
        
        return self._get_observation(), reward, terminated, False, {}
    
    def _check_termination(self):
        if self.task_type == 'navigation':
            return self.agent_pos == self.goal_pos or self.step_count >= self.max_steps
        elif self.task_type == 'collection':
            return (len(self.collected_treasures) == len(self.treasures) or 
                   self.step_count >= self.max_steps)
        elif self.task_type == 'exploration':
            return (len(self.visited) >= self.size * self.size * 0.8 or 
                   self.step_count >= self.max_steps)
        return self.step_count >= self.max_steps
    
    def _get_observation(self):
        """Get multi-modal observation"""
        image = self._render_image()
        
        instruction_text = self._generate_instruction()
        instruction_data = self.prompt_template.tokenize_prompt(
            instruction_text, None, max_length=64
        )
        
        return {
            'image': image,
            'instruction': instruction_data['tokens'],
            'instruction_mask': instruction_data['mask']
        }
    
    def _render_image(self):
        """Render environment as RGB image"""
        image = np.ones((3, self.render_size, self.render_size), dtype=np.uint8) * 255
        
        cell_size = self.render_size // self.size
        
        for i in range(self.size + 1):
            x = i * cell_size
            image[:, x:x+1, :] = 200
            y = i * cell_size
            image[:, :, y:y+1] = 200
        
        for obs_x, obs_y in self.obstacles:
            x1, x2 = obs_x * cell_size, (obs_x + 1) * cell_size
            y1, y2 = obs_y * cell_size, (obs_y + 1) * cell_size
            image[:, x1:x2, y1:y2] = 0
        
        for treasure_x, treasure_y in self.treasures:
            if (treasure_x, treasure_y) not in self.collected_treasures:
                x1, x2 = treasure_x * cell_size, (treasure_x + 1) * cell_size
                y1, y2 = treasure_y * cell_size, (treasure_y + 1) * cell_size
                image[0, x1:x2, y1:y2] = 255  # Red
                image[1, x1:x2, y1:y2] = 255  # Green  
                image[2, x1:x2, y1:y2] = 0    # Blue (Yellow = Red + Green)
        
        goal_x, goal_y = self.goal_pos
        x1, x2 = goal_x * cell_size, (goal_x + 1) * cell_size
        y1, y2 = goal_y * cell_size, (goal_y + 1) * cell_size
        image[0, x1:x2, y1:y2] = 0    # Red
        image[1, x1:x2, y1:y2] = 255  # Green
        image[2, x1:x2, y1:y2] = 0    # Blue
        
        agent_x, agent_y = self.agent_pos
        x1, x2 = agent_x * cell_size, (agent_x + 1) * cell_size  
        y1, y2 = agent_y * cell_size, (agent_y + 1) * cell_size
        image[0, x1:x2, y1:y2] = 255  # Red
        image[1, x1:x2, y1:y2] = 0    # Green
        image[2, x1:x2, y1:y2] = 0    # Blue
        
        return image
    
    def _generate_instruction(self):
        """Generate natural language instruction"""
        if self.task_type == 'navigation':
            return f"Navigate to the goal at position ({self.goal_pos[0]}, {self.goal_pos[1]}). " \
                   f"Current position: ({self.agent_pos[0]}, {self.agent_pos[1]}). " \
                   f"Avoid obstacles and find the shortest path."
        elif self.task_type == 'collection':
            total = len(self.treasures)
            collected = len(self.collected_treasures)
            return f"Collect all {total} treasures. Progress: {collected}/{total} collected. " \
                   f"Yellow squares are treasures. Current position: ({self.agent_pos[0]}, {self.agent_pos[1]})."
        elif self.task_type == 'exploration':
            explored = len(self.visited)
            total_cells = self.size * self.size
            return f"Explore the environment. Visit at least 80% of cells. " \
                   f"Progress: {explored}/{total_cells} cells visited."
        else:
            return f"Complete the task. Current position: ({self.agent_pos[0]}, {self.agent_pos[1]})."

def visualize_attention_maps(model, observation, save_path=None):
    """Visualize attention maps from foundation model"""
    
    with torch.no_grad():
        images = observation['image'].unsqueeze(0).float() / 255.0
        instructions = observation['instruction'].unsqueeze(0)
        instruction_mask = observation['instruction_mask'].unsqueeze(0)
        
        action_logits, values, attention_info = model(images, instructions, instruction_mask)
        
        vision_attention = attention_info['vision_attentions'][-1][0]  # [n_heads, n_patches+1, n_patches+1]
        
        cls_attention = vision_attention.mean(0)[0, 1:]  # [n_patches]
        
        n_patches_per_dim = int(np.sqrt(len(cls_attention)))
        attention_map = cls_attention.reshape(n_patches_per_dim, n_patches_per_dim)
        
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))
        
        original_image = images[0].permute(1, 2, 0).numpy()
        axes[0].imshow(original_image)
        axes[0].set_title('Original Image')
        axes[0].axis('off')
        
        im = axes[1].imshow(attention_map.numpy(), cmap='hot', interpolation='bilinear')
        axes[1].set_title('Vision Attention Map')
        axes[1].axis('off')
        plt.colorbar(im, ax=axes[1])
        
        from scipy.ndimage import zoom
        attention_resized = zoom(attention_map.numpy(), 
                               (original_image.shape[0] / attention_map.shape[0],
                                original_image.shape[1] / attention_map.shape[1]))
        
        axes[2].imshow(original_image)
        axes[2].imshow(attention_resized, alpha=0.6, cmap='hot')
        axes[2].set_title('Attention Overlay')
        axes[2].axis('off')
        
        plt.tight_layout()
        
        if save_path:
            plt.savefig(save_path)
        plt.show()
        
        action_probs = F.softmax(action_logits[0], dim=0)
        action_names = ['Up', 'Down', 'Left', 'Right']
        
        print("Action Predictions:")
        for i, (action, prob) in enumerate(zip(action_names, action_probs)):
            print(f"  {action}: {prob:.3f}")
        print(f"Predicted Value: {values[0]:.3f}")

def compare_models_performance(environments, models, episodes_per_env=100):
    """Compare performance of different models across environments"""
    
    results = {model_name: {env_name: [] for env_name in environments.keys()} 
              for model_name in models.keys()}
    
    for env_name, env in environments.items():
        print(f"\nTesting environment: {env_name}")
        
        for model_name, model in models.items():
            print(f"  Testing model: {model_name}")
            episode_rewards = []
            
            for episode in range(episodes_per_env):
                obs, _ = env.reset()
                episode_reward = 0
                done = False
                
                while not done:
                    if hasattr(model, 'get_action'):
                        images = torch.FloatTensor(obs['image']).unsqueeze(0) / 255.0
                        instructions = obs['instruction'].unsqueeze(0)
                        instruction_mask = obs['instruction_mask'].unsqueeze(0)
                        
                        action, _, _ = model.get_action(images, instructions, instruction_mask)
                        action = action.item()
                    else:
                        action = env.action_space.sample()
                    
                    obs, reward, terminated, truncated, _ = env.step(action)
                    done = terminated or truncated
                    episode_reward += reward
                
                episode_rewards.append(episode_reward)
            
            results[model_name][env_name] = episode_rewards
            avg_reward = np.mean(episode_rewards)
            std_reward = np.std(episode_rewards)
            print(f"    Average reward: {avg_reward:.2f} ± {std_reward:.2f}")
    
    return results

def plot_learning_curves(training_histories, save_path=None):
    """Plot learning curves for different approaches"""
    
    plt.figure(figsize=(15, 5))
    
    plt.subplot(1, 3, 1)
    for name, history in training_histories.items():
        episodes = range(len(history['rewards']))
        plt.plot(episodes, history['rewards'], label=name, alpha=0.7)
        
        if len(history['rewards']) > 10:
            window = min(50, len(history['rewards']) // 10)
            moving_avg = np.convolve(history['rewards'], np.ones(window)/window, mode='valid')
            plt.plot(range(window-1, len(history['rewards'])), moving_avg, 
                    label=f'{name} (MA)', linewidth=2)
    
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.title('Learning Curves')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.subplot(1, 3, 2)
    for name, history in training_histories.items():
        if 'losses' in history and history['losses']:
            episodes = range(len(history['losses']))
            plt.plot(episodes, history['losses'], label=name, alpha=0.7)
    
    plt.xlabel('Training Step')
    plt.ylabel('Loss')
    plt.title('Training Loss')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.yscale('log')
    
    plt.subplot(1, 3, 3)
    for name, history in training_histories.items():
        if 'success_rate' in history and history['success_rate']:
            episodes = range(len(history['success_rate']))
            plt.plot(episodes, history['success_rate'], label=name, alpha=0.7)
    
    plt.xlabel('Episode')
    plt.ylabel('Success Rate')
    plt.title('Success Rate Over Time')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    
    if save_path:
        plt.savefig(save_path)
    plt.show()

print("Setting up comprehensive experiments...")

mm_env = MultiModalGridWorld(size=6, render_size=84)

print("Testing multi-modal environment...")
obs, _ = mm_env.reset(task_type='navigation')
print(f"Image shape: {obs['image'].shape}")
print(f"Instruction tokens: {obs['instruction'].shape}")
print(f"Instruction mask: {obs['instruction_mask'].shape}")

print("Creating foundation model...")
foundation_model = FoundationPolicy(
    img_size=84,
    patch_size=16,
    in_channels=3,
    vocab_size=1000,
    action_dim=4,
    d_model=128,  # Smaller for demo
    n_heads=4,
    n_layers=3,
    max_seq_len=64,
    dropout=0.1
)

print("Testing foundation model...")
images = torch.FloatTensor(obs['image']).unsqueeze(0) / 255.0
instructions = obs['instruction'].unsqueeze(0)
instruction_mask = obs['instruction_mask'].unsqueeze(0)

action_logits, values, attention_info = foundation_model(images, instructions, instruction_mask)
print(f"Action logits shape: {action_logits.shape}")
print(f"Values shape: {values.shape}")
print(f"Number of attention layers: {len(attention_info['vision_attentions'])}")

print("\n✅ Comprehensive experimental setup complete!")
print("Available functions:")
print("- visualize_attention_maps(): Visualize model attention")  
print("- compare_models_performance(): Compare different approaches")
print("- plot_learning_curves(): Plot training progress")
print("\nEnvironment supports multiple task types: 'navigation', 'collection', 'exploration'")


Setting up comprehensive experiments...


AttributeError: 'MultiModalGridWorld' object has no attribute 'prompt_template'

# Conclusion and Future Directions

## Summary of Advanced Deep RL Concepts

This notebook has explored cutting-edge topics in Deep Reinforcement Learning that represent the current frontier of research and applications. We covered four major paradigms:

### 1. Continual Learning in RL
- **Key Insight**: Agents must learn new tasks while retaining knowledge from previous experiences
- **Main Challenges**: Catastrophic forgetting, interference between tasks, scalability
- **Solutions**: Elastic Weight Consolidation, Progressive Networks, Meta-learning approaches
- **Applications**: Robotics, adaptive systems, lifelong learning agents

### 2. Neurosymbolic Reinforcement Learning  
- **Key Insight**: Combining neural learning with symbolic reasoning for interpretable and robust agents
- **Main Challenges**: Integration of continuous and discrete representations, knowledge representation
- **Solutions**: Differentiable programming, logic-based constraints, hybrid architectures
- **Applications**: Autonomous systems, healthcare, safety-critical domains

### 3. Human-AI Collaborative Learning
- **Key Insight**: Leverage human expertise and feedback to improve agent learning and performance
- **Main Challenges**: Trust modeling, preference learning, real-time collaboration
- **Solutions**: RLHF, preference-based rewards, shared autonomy frameworks
- **Applications**: Human-robot interaction, personalized AI, assisted decision-making

### 4. Foundation Models in RL
- **Key Insight**: Pre-trained large models enable sample-efficient learning and strong generalization
- **Main Challenges**: Transfer learning, multi-modal integration, computational efficiency
- **Solutions**: Vision transformers, cross-modal attention, prompt engineering
- **Applications**: General-purpose AI agents, few-shot learning, multi-task systems

## Interconnections Between Paradigms

These four approaches are not isolated but can be combined synergistically:

**Continual + Neurosymbolic**: Symbolic knowledge provides structure for continual learning, preventing catastrophic forgetting through logical constraints.

**Human-AI + Foundation Models**: Foundation models provide better initialization for human-AI collaboration, while human feedback can guide foundation model fine-tuning.

**Neurosymbolic + Foundation Models**: Foundation models can learn to perform symbolic reasoning, while symbolic structures can guide foundation model architectures.

**All Four Combined**: A truly advanced RL system might use foundation models as initialization, incorporate human feedback for alignment, use symbolic reasoning for interpretability, and support continual learning for adaptation.

## Current Research Frontiers

### Emerging Challenges
1. **Scalability**: How do these methods scale to real-world complexity?
2. **Sample Efficiency**: Can we achieve superhuman performance with minimal data?
3. **Robustness**: How do agents handle distribution shifts and adversarial conditions?
4. **Alignment**: How do we ensure AI systems pursue intended objectives?
5. **Interpretability**: Can we understand and verify agent decision-making?

### Promising Directions
1. **Unified Architectures**: Single models that combine multiple paradigms
2. **Meta-Learning**: Learning to learn across paradigms and domains
3. **Causal Reasoning**: Understanding cause-and-effect relationships
4. **Compositional Learning**: Building complex behaviors from simple primitives
5. **Multi-Agent Collaboration**: Scaling human-AI collaboration to teams

## Practical Implementation Insights

### Key Lessons Learned
1. **Start Simple**: Begin with simplified versions before adding complexity
2. **Modular Design**: Build components that can be combined and reused
3. **Interpretability First**: Design for explainability from the beginning
4. **Human-Centered**: Consider human factors in system design
5. **Robust Evaluation**: Test across diverse scenarios and failure modes

### Implementation Best Practices
1. **Gradual Integration**: Introduce new paradigms incrementally
2. **Ablation Studies**: Understand the contribution of each component
3. **Multi-Metric Evaluation**: Use diverse evaluation criteria beyond reward
4. **Failure Analysis**: Learn from failures and edge cases
5. **Ethical Considerations**: Address bias, fairness, and safety concerns

## Future Applications

### Near-Term (1-3 years)
- **Personalized AI Assistants**: Agents that adapt to individual preferences and learn continuously
- **Robotic Process Automation**: Intelligent automation that can handle exceptions and learn from feedback
- **Educational AI**: Tutoring systems that adapt teaching strategies based on student progress
- **Healthcare Support**: AI systems that assist medical professionals with decision-making

### Medium-Term (3-7 years)
- **Autonomous Vehicles**: Self-driving cars that learn from human drivers and adapt to new environments
- **Smart Cities**: Urban systems that optimize resource allocation through continuous learning
- **Scientific Discovery**: AI agents that collaborate with researchers to generate and test hypotheses
- **Creative AI**: Systems that collaborate with humans in creative endeavors

### Long-Term (7+ years)
- **General Intelligence**: AI systems that can perform any cognitive task that humans can do
- **Scientific AI**: Autonomous systems capable of conducting independent scientific research
- **Collaborative Societies**: Seamless integration of human and AI capabilities in all aspects of society
- **Space Exploration**: AI systems capable of autonomous operation in extreme and unknown environments

## Conclusion

The field of Deep Reinforcement Learning continues to evolve rapidly, with these advanced paradigms representing the current cutting edge. Each approach addresses fundamental limitations of traditional RL and opens new possibilities for creating more capable, reliable, and aligned AI systems.

The key to success in this field is not just understanding individual techniques, but recognizing how they can be combined to create systems that are greater than the sum of their parts. As we move forward, the most impactful advances will likely come from principled integration of these paradigms with careful attention to real-world constraints and human values.

### Final Recommendations for Further Learning

1. **Hands-On Implementation**: Build and experiment with these systems yourself
2. **Stay Current**: Follow recent papers and conferences (NeurIPS, ICML, ICLR, AAAI)
3. **Interdisciplinary Learning**: Study cognitive science, philosophy, and domain-specific knowledge
4. **Community Engagement**: Participate in research communities and open-source projects
5. **Ethical Reflection**: Consider the societal implications of your work

The future of AI lies not just in more powerful algorithms, but in systems that can learn, reason, collaborate, and adapt in ways that align with human values and capabilities. These advanced RL paradigms provide the building blocks for that future.

---

**Congratulations! You have completed CA16 - Advanced Topics in Deep Reinforcement Learning**

This comprehensive exploration has covered the most cutting-edge approaches in modern RL research. You now have the theoretical foundations and practical implementation skills to contribute to the next generation of intelligent systems.

*"The best way to predict the future is to invent it."* - Alan Kay