# Computer Assignment 15: Advanced Deep Reinforcement Learning
## Model-Based RL and Hierarchical RL

**Course:** Deep Reinforcement Learning  
**Institution:** Sharif University of Technology  
**Semester:** Fall 2024

---

## Table of Contents

1. [Environment Setup](#setup)
2. [Model-Based RL](#model-based)
3. [Hierarchical RL](#hierarchical)
4. [Planning Algorithms](#planning)
5. [Conclusion](#conclusion)

## 1. Environment Setup <a name='setup'></a>

In [1]:
# Basic imports
import sys
import os
import warnings
warnings.filterwarnings('ignore')

# Add current directory to path
sys.path.insert(0, os.path.abspath("."))

# Standard libraries
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
import seaborn as sns
from collections import deque
import random
from typing import List, Dict, Tuple, Optional

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
random.seed(42)

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print("✅ Environment setup complete!")
print(f"PyTorch version: {torch.__version__}")
print(f"Device: {'CUDA' if torch.cuda.is_available() else 'CPU'}")

✅ Environment setup complete!
PyTorch version: 2.8.0
Device: CPU


### Import CA15 Components from Local Modules

In [2]:
# Import Model-Based RL components
from model_based_rl.algorithms import (
    DynamicsModel,
    ModelEnsemble,
    ModelPredictiveController,
    DynaQAgent
)

# Import Hierarchical RL components
from hierarchical_rl.algorithms import (
    Option,
    HierarchicalActorCritic,
    GoalConditionedAgent,
    FeudalNetwork
)

from hierarchical_rl.environments import HierarchicalRLEnvironment

# Import Planning algorithms
from planning.algorithms import (
    MCTSNode,
    MonteCarloTreeSearch,
    ModelBasedValueExpansion,
    LatentSpacePlanner,
    WorldModel
)

# Import environments
from environments.grid_world import SimpleGridWorld

# Import utilities
from utils import (
    ReplayBuffer,
    PrioritizedReplayBuffer,
    RunningStats,
    Logger,
    VisualizationUtils,
    EnvironmentUtils
)

print("✅ Successfully imported all CA15 components")

✅ Successfully imported all CA15 components


## 2. Model-Based Reinforcement Learning <a name='model-based'></a>

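Model-based RL learns an explicit model of the environment's dynamics and uses it to plan or to generate synthetic experience, instead of learning only from real transitions.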

### Key Concepts:
- **Dynamics Model**: $\hat{s}_{t+1} = f_\theta(s_t, a_t)$
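- **Model Ensemble**: Several independently trained models whose disagreement serves as an uncertainty estimate
- **MPC**: Model Predictive Control, which replans at every step and executes only the first action of the best candidate sequence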
- **Dyna-Q**: Combines model-free and model-based learning

### 2.1 Dynamics Model Demo

In [3]:
# Create a simple dynamics model
state_dim = 4
action_dim = 2
hidden_dim = 128

dynamics_model = DynamicsModel(
    state_dim=state_dim,
    action_dim=action_dim,
    hidden_dim=hidden_dim
)

print("Dynamics Model Architecture:")
print(dynamics_model)

# Test forward pass - note that forward returns a dictionary
test_state = torch.randn(1, state_dim)
test_action = torch.randn(1, action_dim)
prediction = dynamics_model(test_state, test_action)

print(f"\n✅ Dynamics model test:")
print(f"  Input state shape: {test_state.shape}")
print(f"  Predicted next state mean shape: {prediction['next_state_mean'].shape}")
print(f"  Predicted reward mean shape: {prediction['reward_mean'].shape}")
print(f"  Prediction includes uncertainty (std): {list(prediction.keys())}")

# To get a sample prediction
next_state_sample, reward_sample = dynamics_model.sample_prediction(test_state, test_action)
print(f"\n  Sampled next state shape: {next_state_sample.shape}")
print(f"  Sampled reward shape: {reward_sample.shape}")

Dynamics Model Architecture:
DynamicsModel(
  (transition_net): Sequential(
    (0): Linear(in_features=6, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=128, bias=True)
    (3): ReLU()
    (4): Linear(in_features=128, out_features=128, bias=True)
    (5): ReLU()
    (6): Linear(in_features=128, out_features=5, bias=True)
  )
  (uncertainty_net): Sequential(
    (0): Linear(in_features=6, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=5, bias=True)
    (3): Softplus(beta=1.0, threshold=20.0)
  )
)

✅ Dynamics model test:
  Input state shape: torch.Size([1, 4])
  Predicted next state mean shape: torch.Size([1, 4])
  Predicted reward mean shape: torch.Size([1, 1])
  Prediction includes uncertainty (std): ['next_state_mean', 'reward_mean', 'next_state_std', 'reward_std']

  Sampled next state shape: torch.Size([4])
  Sampled reward shape: torch.Size([])
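
The printed architecture suggests a mean network with `state_dim + 1` outputs (next state plus reward) and a Softplus-terminated network for the standard deviations. The following is a minimal sketch of such a model and of a Gaussian negative log-likelihood loss, written under that reading. `TinyGaussianDynamics` and `gaussian_nll` are hypothetical names, and the course's `DynamicsModel` may be built and trained differently.

```python
import torch
import torch.nn as nn


class TinyGaussianDynamics(nn.Module):
    """Hypothetical stand-in: predicts mean and std of (next_state, reward)."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 32):
        super().__init__()
        out_dim = state_dim + 1  # next-state components plus one reward value
        self.state_dim = state_dim
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )
        self.std_net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim), nn.Softplus(),  # keeps std positive
        )

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        mean = self.mean_net(x)
        std = self.std_net(x) + 1e-4  # small floor for numerical stability
        return {
            "next_state_mean": mean[..., : self.state_dim],
            "reward_mean": mean[..., self.state_dim:],
            "next_state_std": std[..., : self.state_dim],
            "reward_std": std[..., self.state_dim:],
        }


def gaussian_nll(pred, next_state, reward):
    """Negative log-likelihood of the targets under the predicted Gaussians."""
    state_dist = torch.distributions.Normal(pred["next_state_mean"], pred["next_state_std"])
    reward_dist = torch.distributions.Normal(pred["reward_mean"], pred["reward_std"])
    log_prob = state_dist.log_prob(next_state).sum(-1) + reward_dist.log_prob(reward).sum(-1)
    return -log_prob.mean()


torch.manual_seed(0)
model = TinyGaussianDynamics(state_dim=4, action_dim=2)
s, a = torch.randn(8, 4), torch.randn(8, 2)
s_next, r = torch.randn(8, 4), torch.randn(8, 1)  # random placeholder targets
loss = gaussian_nll(model(s, a), s_next, r)
loss.backward()  # gradients reach both the mean and the std networks
```

Training on the likelihood rather than on plain MSE lets the std head learn where predictions are noisy, which is what makes the `*_std` outputs meaningful.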


### 2.2 Model Ensemble

In [4]:
# Create model ensemble
ensemble_size = 5
model_ensemble = ModelEnsemble(
    state_dim=state_dim,
    action_dim=action_dim,
    ensemble_size=ensemble_size
)

print(f"Model Ensemble with {ensemble_size} models created")

# Generate dummy training data
batch_size = 32
states = torch.randn(batch_size, state_dim)
actions = torch.randn(batch_size, action_dim)
next_states = torch.randn(batch_size, state_dim)
rewards = torch.randn(batch_size)

# Train ensemble
print("\nTraining ensemble for 10 steps...")
for i in range(10):
    loss = model_ensemble.train_step(states, actions, next_states, rewards)
    if i % 3 == 0:
        print(f"  Step {i}: Loss = {loss:.4f}")

print("\n✅ Ensemble training complete")

Model Ensemble with 5 models created

Training ensemble for 10 steps...
  Step 0: Loss = 15.3016
  Step 3: Loss = 10.7525
  Step 6: Loss = 8.1383
  Step 9: Loss = 6.1376

✅ Ensemble training complete
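
To illustrate why an ensemble gives an uncertainty signal, here is a minimal NumPy sketch that does not use `ModelEnsemble` at all. It fits five least-squares models on bootstrap resamples of toy data and compares their disagreement inside and outside the training range. The toy dynamics, the quadratic features, and all names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D dynamics: s' = 0.9 * s + a + noise; the data only covers s in [-1, 1].
states = rng.uniform(-1.0, 1.0, size=(200, 1))
actions = rng.uniform(-1.0, 1.0, size=(200, 1))
next_states = 0.9 * states + actions + 0.05 * rng.normal(size=(200, 1))


def features(s, a):
    """Quadratic features so that members can disagree when extrapolating."""
    return np.concatenate([s, a, s**2, np.ones_like(s)], axis=1)


# Each member is fit on its own bootstrap resample of the data.
members = []
for _ in range(5):
    idx = rng.integers(0, len(states), size=len(states))
    w, *_ = np.linalg.lstsq(features(states[idx], actions[idx]), next_states[idx], rcond=None)
    members.append(w)


def predict(s, a):
    """Return the ensemble mean and std (disagreement) of the next-state prediction."""
    preds = np.stack([features(s, a) @ w for w in members])  # (ensemble, batch, 1)
    return preds.mean(axis=0), preds.std(axis=0)


mean_in, std_in = predict(np.array([[0.0]]), np.array([[0.0]]))     # inside the data range
mean_out, std_out = predict(np.array([[10.0]]), np.array([[0.0]]))  # far outside it
print(f"disagreement in-distribution: {std_in.item():.4f}, "
      f"out-of-distribution: {std_out.item():.4f}")
```

The members agree where they have seen data and diverge where they extrapolate, so a planner can penalize or avoid regions with high disagreement.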


### 2.3 Model Predictive Control

In [5]:
# Create MPC controller
mpc_controller = ModelPredictiveController(
    model_ensemble=model_ensemble,
    action_dim=action_dim,
    horizon=10,
    num_samples=100
)

print("MPC Controller created")
print("  Planning horizon: 10")
print("  Action samples: 100")

# Plan an action
current_state = torch.randn(state_dim)
planned_action = mpc_controller.plan_action(current_state)

print(f"\n✅ Planned action shape: {planned_action.shape}")
print(f"  Planned action values: {planned_action}")

MPC Controller created
  Planning horizon: 10
  Action samples: 100


IndexError: Dimension specified as 0 but tensor has no dimensions
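
The `horizon` and `num_samples` arguments point to a sampling-based planner. As a reference for the idea, here is a minimal random-shooting MPC sketch in NumPy. It uses a hand-written toy model (`toy_dynamics`, hypothetical) in place of the learned ensemble and is not the `ModelPredictiveController` implementation.

```python
import numpy as np


def toy_dynamics(state, action):
    """Hypothetical known model: a point moves by 0.1 * action; reward is -distance to origin."""
    next_state = state + 0.1 * action
    reward = -np.linalg.norm(next_state, axis=-1)
    return next_state, reward


def random_shooting_mpc(state, action_dim, horizon=10, num_samples=100, rng=None):
    """Sample action sequences, roll each through the model, return the best first action."""
    rng = np.random.default_rng() if rng is None else rng
    # Candidate sequences in [-1, 1], shape (num_samples, horizon, action_dim).
    candidates = rng.uniform(-1.0, 1.0, size=(num_samples, horizon, action_dim))
    # Give the single state an explicit batch dimension, one copy per candidate.
    states = np.repeat(state[None, :], num_samples, axis=0)
    returns = np.zeros(num_samples)
    for t in range(horizon):
        states, rewards = toy_dynamics(states, candidates[:, t])
        returns += rewards
    best = int(np.argmax(returns))
    return candidates[best, 0]  # execute only the first action, then replan


state = np.array([1.0, -1.0])
action = random_shooting_mpc(state, action_dim=2, rng=np.random.default_rng(0))
print("planned first action:", action)
```

Only the first action of the best sequence is returned; in a control loop the planner is called again from the next observed state.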

### 2.4 Dyna-Q Agent

In [6]:
# Create Dyna-Q agent
dyna_agent = DynaQAgent(
    state_dim=state_dim,
    action_dim=action_dim,
    lr=1e-3
)

print("Dyna-Q Agent created")
print(f"  Q-Network: Discrete action space with {action_dim} actions")
print(f"  Integrated dynamics model for planning")

# Test action selection
test_state = np.random.randn(state_dim)
action = dyna_agent.get_action(test_state, epsilon=0.1)

print(f"\n✅ Selected action: {action}")

# Store experience
next_state = np.random.randn(state_dim)
dyna_agent.store_experience(test_state, action, 1.0, next_state, False)
print(f"  Buffer size: {len(dyna_agent.buffer)}")

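# Perform planning steps once the buffer holds at least 5 transitions
# (with the single transition stored above, this branch is skipped)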
if len(dyna_agent.buffer) >= 5:
    dyna_agent.planning_step(num_planning_steps=5)
    print(f"  Performed 5 planning steps using the learned model")

Dyna-Q Agent created
  Q-Network: Discrete action space with 2 actions
  Integrated dynamics model for planning

✅ Selected action: 0
  Buffer size: 1
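
The three ingredients of Dyna-Q (direct RL, model learning, planning) are easiest to see in the tabular case. The sketch below runs tabular Dyna-Q on a hypothetical six-state chain. It is independent of `DynaQAgent`, which uses neural networks, and every name in it is made up for this example.

```python
import random
from collections import defaultdict


def dyna_q(num_episodes=50, planning_steps=5, alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q on a chain: states 0..5, actions 0 (left) and 1 (right)."""
    rng = random.Random(seed)
    goal = 5
    q = defaultdict(float)  # (state, action) -> estimated value
    model = {}              # (state, action) -> (reward, next_state, done)

    def env_step(s, a):
        s2 = max(0, min(goal, s + (1 if a == 1 else -1)))
        return (1.0 if s2 == goal else 0.0), s2, s2 == goal

    def q_update(s, a, r, s2, done):
        target = r if done else r + gamma * max(q[(s2, 0)], q[(s2, 1)])
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(num_episodes):
        s, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:  # greedy with random tie-breaking
                a = max((0, 1), key=lambda act: (q[(s, act)], rng.random()))
            r, s2, done = env_step(s, a)
            q_update(s, a, r, s2, done)      # 1) direct RL from the real transition
            model[(s, a)] = (r, s2, done)    # 2) model learning (deterministic table)
            for _ in range(planning_steps):  # 3) planning with simulated transitions
                ps, pa = rng.choice(list(model))
                q_update(ps, pa, *model[(ps, pa)])
            s = s2
    return q


q = dyna_q()
greedy = [max((0, 1), key=lambda a: q[(s, a)]) for s in range(5)]
print("greedy action per state:", greedy)  # -> [1, 1, 1, 1, 1]
```

Each real step triggers `planning_steps` extra updates from the learned model, which is where the sample-efficiency gain over plain Q-learning comes from.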


## 3. Hierarchical Reinforcement Learning <a name='hierarchical'></a>

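Hierarchical RL tackles long-horizon tasks through temporal abstraction: high-level policies choose subgoals or skills, and low-level policies carry them out over multiple steps.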

### Key Concepts:
- **Options Framework**: Temporally extended actions
- **Goal-Conditioned RL**: Policies conditioned on goals
- **HAC**: Hierarchical Actor-Critic
- **Feudal Networks**: Manager-worker hierarchy

### 3.1 Options Framework

In [7]:
# Define option components
def simple_initiation(state):
    """Option can be initiated anywhere"""
    return True

def simple_policy(state):
    """Simple random policy for the option"""
    return np.random.randint(0, action_dim)

def simple_termination(state):
    """Terminate with low probability"""
    return np.random.random() < 0.1

# Create option
option = Option(
    initiation_set=simple_initiation,
    policy=simple_policy,
    termination_condition=simple_termination,
    option_id=0
)

print(f"Option created (ID: {option.option_id})")

# Test option
test_state = np.random.randn(state_dim)
print(f"\n✅ Option test:")
print(f"  Can initiate: {option.can_initiate(test_state)}")
print(f"  Action: {option.get_action(test_state)}")
print(f"  Should terminate: {option.should_terminate(test_state)}")

# Simulate option execution
print(f"\n  Simulating option execution:")
steps = 0
while not option.should_terminate(test_state) and steps < 10:
    action = option.get_action(test_state)
    test_state = np.random.randn(state_dim)  # Simulate state transition
    steps += 1
print(f"  Option executed for {steps} steps before terminating")

TypeError: Option.__init__() got an unexpected keyword argument 'option_id'
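
For reference, an option is usually described as a triple of an initiation set, an intra-option policy, and a termination condition. The sketch below encodes that triple in a small dataclass and runs it on an integer line. `MiniOption` and `run_option` are hypothetical names, and the constructor shown here is not the signature of the course's `Option` class.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class MiniOption:
    """Hypothetical option triple (I, pi, beta)."""
    can_initiate: Callable[[int], bool]       # I: states where the option may start
    policy: Callable[[int], int]              # pi: state -> primitive action
    termination_prob: Callable[[int], float]  # beta: state -> probability of stopping


def run_option(option, state, step_fn, rng, max_steps=100):
    """Follow the option's policy until beta fires; return final state and duration."""
    if not option.can_initiate(state):
        raise ValueError("option is not available in this state")
    steps = 0
    while steps < max_steps:
        state = step_fn(state, option.policy(state))
        steps += 1
        if rng.random() < option.termination_prob(state):
            break
    return state, steps


# "Walk right until position 5" on an integer line.
walk_right = MiniOption(
    can_initiate=lambda s: s < 5,
    policy=lambda s: +1,
    termination_prob=lambda s: 1.0 if s >= 5 else 0.0,
)
final_state, duration = run_option(walk_right, 0, lambda s, a: s + a, random.Random(0))
print(final_state, duration)  # -> 5 5
```

From the point of view of a higher-level policy, the whole call to `run_option` counts as a single, temporally extended action.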

### 3.2 Hierarchical Actor-Critic

In [8]:
# Create HAC agent
hac_agent = HierarchicalActorCritic(
    state_dim=state_dim,
    action_dim=action_dim,
    num_levels=3,
    subgoal_dims=[8, 6, 4],
    lr=1e-3
)

print("Hierarchical Actor-Critic created")
print(f"  Number of levels: 3")
print(f"  Subgoal dimensions: [8, 6, 4]")

# Test action selection at different levels
test_state = np.random.randn(state_dim)
print(f"\n✅ Hierarchical action selection:")
action = hac_agent.select_action(test_state, level=0)
print(f"  Level 0 (highest) output: {action.shape if hasattr(action, 'shape') else action}")

# The lowest level produces primitive actions
if hac_agent.num_levels > 1:
    primitive_action = hac_agent.select_action(test_state, level=hac_agent.num_levels-1)
    print(f"  Level {hac_agent.num_levels-1} (lowest) action: {primitive_action}")

TypeError: HierarchicalActorCritic.__init__() got an unexpected keyword argument 'subgoal_dims'

### 3.3 Goal-Conditioned Agent with HER

In [9]:
# Create goal-conditioned agent
goal_dim = 4
gc_agent = GoalConditionedAgent(
    state_dim=state_dim,
    action_dim=action_dim,
    goal_dim=goal_dim,
    lr=1e-3,
    her_k=4
)

print("Goal-Conditioned Agent with Hindsight Experience Replay")
print(f"  State dimension: {state_dim}")
print(f"  Goal dimension: {goal_dim}")
print(f"  HER replay ratio k: 4 (generates 4 additional goals per transition)")

# Test goal-conditioned action
test_state = np.random.randn(state_dim)
test_goal = np.random.randn(goal_dim)
action = gc_agent.get_action(test_state, test_goal, noise_scale=0.1)

print(f"\n✅ Goal-conditioned action: {action}")
print(f"  Action shape: {action.shape}")

# Store experience with HER
next_state = np.random.randn(state_dim)
reward = -np.linalg.norm(next_state - test_goal)
gc_agent.store_experience(test_state, action, reward, next_state, False, test_goal)

print(f"\n  Experience stored with HER augmentation")
print(f"  Buffer size: {len(gc_agent.buffer)}")
print(f"  (HER creates {gc_agent.her_k} additional synthetic transitions per real transition)")

TypeError: GoalConditionedAgent.__init__() got an unexpected keyword argument 'lr'
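
Hindsight Experience Replay relabels stored transitions with goals that were actually achieved later in the same episode, so that even failed episodes yield some successful examples. The sketch below shows the common "future" relabeling strategy in plain Python. It assumes that the achieved goal equals the next state and that the reward is sparse (0 on success, -1 otherwise). `her_relabel` is a hypothetical helper, not a method of `GoalConditionedAgent`.

```python
import random


def her_relabel(episode, k=4, tol=1e-6, rng=None):
    """Relabel an episode with the 'future' strategy.

    episode: list of (state, action, next_state, goal) tuples, where states and
    goals are tuples of floats and the achieved goal is taken to be next_state.
    Returns a list of (state, action, reward, next_state, goal) tuples.
    """
    if rng is None:
        rng = random.Random()

    def reward(achieved, goal):
        dist = sum((x - g) ** 2 for x, g in zip(achieved, goal)) ** 0.5
        return 0.0 if dist <= tol else -1.0  # sparse reward

    out = []
    for t, (s, a, s2, g) in enumerate(episode):
        out.append((s, a, reward(s2, g), s2, g))     # original transition
        for _ in range(k):
            future = rng.randrange(t, len(episode))  # a step at or after t
            new_goal = episode[future][2]            # state actually reached there
            out.append((s, a, reward(s2, new_goal), s2, new_goal))
    return out


# A 3-step episode that never reaches the intended goal (9.0,).
episode = [
    ((0.0,), 1, (1.0,), (9.0,)),
    ((1.0,), 1, (2.0,), (9.0,)),
    ((2.0,), 1, (3.0,), (9.0,)),
]
relabeled = her_relabel(episode, k=4, rng=random.Random(0))
successes = sum(1 for tr in relabeled if tr[2] == 0.0)
print(len(relabeled), "transitions,", successes, "with reward 0")
```

With `k=4`, each real transition produces four extra relabeled ones, so the three-step episode yields 15 stored transitions.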

### 3.4 Feudal Networks

In [10]:
# Create feudal network
feudal_agent = FeudalNetwork(
    state_dim=state_dim,
    action_dim=action_dim,
    manager_dim=16,
    worker_dim=32,
    lr=1e-3
)

print("Feudal Network (Manager-Worker Hierarchy)")
print(f"  Manager goal dimension: 16")
print(f"  Worker hidden dimension: 32")

# Test action selection
test_state = np.random.randn(state_dim)
action, goal = feudal_agent.select_action(test_state)

print(f"\n✅ Feudal network output:")
print(f"  Worker action: {action}")
print(f"  Manager goal embedding shape: {goal.shape if hasattr(goal, 'shape') else 'scalar'}")
print(f"\n  The manager produces goals for the worker to achieve")
print(f"  The worker selects primitive actions to reach the manager's goal")

TypeError: FeudalNetwork.__init__() got an unexpected keyword argument 'manager_dim'
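
In the FeUdal Networks formulation, the worker is typically rewarded for moving the latent state in the direction the manager's goal points to, measured by cosine similarity. The sketch below shows a one-step simplification of that intrinsic reward in NumPy. The function name and the one-step horizon are assumptions for illustration and say nothing about how `FeudalNetwork` computes it.

```python
import numpy as np


def cosine_intrinsic_reward(latent_prev, latent_now, goal, eps=1e-8):
    """Cosine similarity between the latent-state change and the manager's goal direction."""
    delta = latent_now - latent_prev
    denom = np.linalg.norm(delta) * np.linalg.norm(goal) + eps  # eps avoids division by zero
    return float(delta @ goal / denom)


goal = np.array([1.0, 0.0])  # manager says: "move along +x in latent space"
start = np.zeros(2)
r_toward = cosine_intrinsic_reward(start, np.array([2.0, 0.0]), goal)   # close to +1
r_away = cosine_intrinsic_reward(start, np.array([-1.0, 0.0]), goal)    # close to -1
r_sideways = cosine_intrinsic_reward(start, np.array([0.0, 3.0]), goal)  # 0
print(r_toward, r_away, r_sideways)
```

Because only the direction matters, the manager can set goals without knowing how large a step the worker is able to take.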

## 4. Planning Algorithms <a name='planning'></a>

Planning methods choose actions by simulating future trajectories with a known or learned model.

### Key Concepts:
- **MCTS**: Monte Carlo Tree Search for discrete action spaces
- **World Models**: Learned environment simulators that combine a VAE encoder with latent dynamics
- **Latent Space Planning**: Planning in learned low-dimensional representations

### 4.1 Monte Carlo Tree Search

In [11]:
# Create simple environment for MCTS
class SimpleMCTSEnv:
    def __init__(self):
        self.state = 0
        self.goal = 10
    
    def reset(self):
        self.state = 0
        return self.state
    
    def step(self, action):
        # Action indices 0, 1, 2 map to moves of -1, 0, +1
        self.state += (action - 1)
        reward = -abs(self.state - self.goal)
        done = abs(self.state - self.goal) < 0.5
        return self.state, reward, done, {}

# Create MCTS
mcts_env = SimpleMCTSEnv()
mcts = MonteCarloTreeSearch(
    env=mcts_env,
    num_simulations=50,
    exploration_constant=1.4,
    max_depth=20
)

print("Monte Carlo Tree Search")
print(f"  Simulations per search: 50")
print(f"  Exploration constant (UCB): 1.4")
print(f"  Max depth: 20")

# Run MCTS
root_state = mcts_env.reset()
print(f"\n  Starting state: {root_state}, Goal: {mcts_env.goal}")
best_action = mcts.get_best_action(root_state)

print(f"\n✅ MCTS best action: {best_action}")
print(f"  (MCTS explored the tree and selected the most promising action)")

TypeError: MonteCarloTreeSearch.__init__() got an unexpected keyword argument 'env'
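
The exploration constant enters MCTS through the UCB1 rule used in the selection phase: each child is scored by its mean value plus a bonus that shrinks as the child is visited more often. Here is a minimal, self-contained sketch of that rule. The dictionary layout and function names are assumptions for this example and are unrelated to `MCTSNode`.

```python
import math


def ucb_score(value_sum, visits, parent_visits, c=1.4):
    """UCB1 score: mean value plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, c=1.4):
    """children maps action -> (value_sum, visits); returns the action to descend into."""
    parent_visits = sum(visits for _, visits in children.values())
    return max(children, key=lambda a: ucb_score(*children[a], parent_visits, c))


# An unvisited child wins regardless of the others' statistics.
pick_unvisited = select_child({0: (5.0, 10), 1: (1.0, 2), 2: (0.0, 0)})
# With equal visit counts, the higher mean value wins.
pick_best_mean = select_child({0: (9.0, 10), 1: (1.0, 10)})
# A rarely visited child can beat a heavily visited one through its bonus.
pick_rare = select_child({0: (5.0, 100), 1: (0.4, 1)})
print(pick_unvisited, pick_best_mean, pick_rare)  # -> 2 0 1
```

Raising `c` makes the search spread its simulations more evenly; lowering it makes the search commit sooner to the branch that currently looks best.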

### 4.2 World Model

In [12]:
# Create world model with VAE-style architecture
latent_dim = 16
world_model = WorldModel(
    state_dim=state_dim,
    action_dim=action_dim,
    latent_dim=latent_dim,
    hidden_dim=128
)

print("World Model (VAE-based)")
print(f"  State dimension: {state_dim}")
print(f"  Latent dimension: {latent_dim}")
print(f"  Hidden dimension: 128")

# Test encoding
test_state = torch.randn(1, state_dim)
latent_mean, latent_log_var = world_model.encode(test_state)
latent = world_model.reparameterize(latent_mean, latent_log_var)

print(f"\n✅ World model encoding:")
print(f"  Input state shape: {test_state.shape}")
print(f"  Latent mean shape: {latent_mean.shape}")
print(f"  Latent log_var shape: {latent_log_var.shape}")
print(f"  Sampled latent shape: {latent.shape}")

# Test decoding
reconstructed = world_model.decode(latent)
print(f"\n  Reconstruction:")
print(f"  Reconstructed state shape: {reconstructed.shape}")
reconstruction_error = F.mse_loss(reconstructed, test_state)
print(f"  Reconstruction MSE: {reconstruction_error.item():.4f}")

# Test dynamics prediction in latent space
test_action = torch.randn(1, action_dim)
next_latent = world_model.predict_next_latent(latent, test_action)
predicted_reward = world_model.predict_reward(latent, test_action)

print(f"\n✅ Latent dynamics prediction:")
print(f"  Current latent shape: {latent.shape}")
print(f"  Action shape: {test_action.shape}")
print(f"  Next latent shape: {next_latent.shape}")
print(f"  Predicted reward shape: {predicted_reward.shape}")

# Decode next latent to see predicted next state
predicted_next_state = world_model.decode(next_latent)
print(f"\n  Predicted next state shape: {predicted_next_state.shape}")

TypeError: WorldModel.__init__() got an unexpected keyword argument 'state_dim'
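
The `encode` and `reparameterize` calls in the cell above follow the usual VAE pattern: the encoder outputs a mean and a log-variance, and a latent sample is formed as mean plus scaled standard-normal noise. The following NumPy sketch shows that sampling step together with the closed-form KL term against a standard normal prior. Both functions are illustrative stand-ins, not the `WorldModel` methods.

```python
import numpy as np


def reparameterize(mean, log_var, rng):
    """Draw z = mean + std * eps with eps ~ N(0, I)."""
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mean.shape)
    return mean + std * eps


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over the latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mean**2 - np.exp(log_var), axis=-1)


rng = np.random.default_rng(0)
mean = np.full((100_000, 1), 2.0)
log_var = np.full((100_000, 1), np.log(0.25))  # variance 0.25, so std 0.5
z = reparameterize(mean, log_var, rng)
print(f"sample mean {z.mean():.3f}, sample std {z.std():.3f}")

kl_zero = kl_to_standard_normal(np.zeros(3), np.zeros(3))          # posterior equals prior
kl_shift = kl_to_standard_normal(np.array([1.0]), np.array([0.0]))  # mean shifted by 1
```

In an autodiff framework the same expression keeps `z` a differentiable function of `mean` and `log_var`, which is the reason for sampling this way instead of drawing from the distribution directly.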

### 4.3 Latent Space Planner

In [13]:
# Create components for latent planner
encoder = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, latent_dim)
)

decoder = nn.Sequential(
    nn.Linear(latent_dim, 64),
    nn.ReLU(),
    nn.Linear(64, state_dim)
)

latent_dynamics = DynamicsModel(
    state_dim=latent_dim,
    action_dim=action_dim,
    hidden_dim=64
)

reward_model = nn.Sequential(
    nn.Linear(latent_dim + action_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 1)
)

# Create latent planner
latent_planner = LatentSpacePlanner(
    encoder=encoder,
    decoder=decoder,
    dynamics_model=latent_dynamics,
    reward_model=reward_model,
    latent_dim=latent_dim,
    planning_horizon=10
)

print("Latent Space Planner")
print(f"  Latent dimension: {latent_dim}")
print(f"  Planning horizon: 10 steps")
print(f"  Plans in learned latent space (more efficient)")

# Test planning in latent space
test_state = torch.randn(state_dim)
planned_actions = latent_planner.plan(test_state, num_candidates=50)

print(f"\n✅ Latent space planning:")
print(f"  Input state shape: {test_state.shape}")
print(f"  Number of candidate action sequences: 50")
print(f"  Planned actions shape: {planned_actions.shape}")
print(f"  (Actions for next step from the best sequence)")

TypeError: LatentSpacePlanner.__init__() got an unexpected keyword argument 'dynamics_model'

## 5. Conclusion <a name='conclusion'></a>

This notebook walked through how to use the CA15 implementations of three families of advanced deep RL algorithms:

### ✅ Model-Based RL:
- **Dynamics Models**: Neural networks that learn environment transitions
- **Model Ensembles**: Multiple models for uncertainty quantification
- **MPC**: Model Predictive Control for planning with learned models
- **Dyna-Q**: Integrates model-free Q-learning with model-based planning

### ✅ Hierarchical RL:
- **Options Framework**: Temporal abstraction with initiation, policy, and termination
- **Hierarchical Actor-Critic**: Multi-level policies with subgoal generation
- **Goal-Conditioned RL**: Policies conditioned on desired goals with HER
- **Feudal Networks**: Manager-worker hierarchy with goal embeddings

### ✅ Planning Algorithms:
- **MCTS**: Tree-based search for discrete action spaces
- **World Models**: VAE-based environment simulators with latent dynamics
- **Latent Space Planning**: Efficient planning in learned representations

### Key Advantages:

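1. **Sample Efficiency**: Model-based methods can reduce the number of real environment interactions, because the learned model supplies additional simulated experience
2. **Long-Horizon Tasks**: Hierarchical methods break long tasks into subgoals, which shortens the horizon each level has to reason over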
3. **Generalization**: Goal-conditioned agents transfer knowledge across different goals
4. **Planning**: Learned models enable sophisticated look-ahead and strategic decision-making
5. **Uncertainty Awareness**: Ensemble methods and probabilistic models quantify prediction uncertainty

### When to Use Each Method:

- **Model-Based RL**: When environment interactions are expensive (robotics, real-world applications)
- **Hierarchical RL**: For tasks requiring long sequences of actions or natural skill decomposition
- **Planning**: When you need explicit reasoning about future outcomes or exploration
- **Goal-Conditioned**: For multi-task learning or when goals change frequently

### Future Directions:

- Combining model-based and hierarchical approaches to improve sample efficiency on long-horizon tasks
- Meta-learning for rapid adaptation to new environments
- World models for imagination and offline learning
- Hierarchical planning in learned latent spaces

---

**All implementations are modular and importable from the corresponding Python modules.**

**Sharif University of Technology - Deep Reinforcement Learning - Fall 2024**