# HW12: Hierarchical Reinforcement Learning

> - Full Name: **[Your Full Name]**
> - Student ID: **[Your Student ID]**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DeepRLCourse/Homework-12-Questions/blob/main/HW12_Notebook.ipynb)
[![Open In kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://raw.githubusercontent.com/DeepRLCourse/Homework-12-Questions/main/HW12_Notebook.ipynb)

## Overview
This assignment focuses on **Hierarchical Reinforcement Learning (HRL)**, exploring methods to structure policies across multiple levels of abstraction. We'll implement and experiment with:

1. **Options Framework** - Semi-Markov decision processes
2. **Feudal Hierarchies** - Manager-worker architectures
3. **Goal-Conditioned RL** - Universal Value Function Approximators (UVFA)
4. **Hindsight Experience Replay (HER)** - Learning from failed trajectories
5. **Skill Discovery** - DIAYN for unsupervised skill learning

The goal is to understand how temporal abstraction enables agents to solve complex, long-horizon tasks by decomposing them into simpler subtasks.


In [None]:
# @title Imports and Setup

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import gymnasium as gym
from gymnasium import spaces
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import trange
from collections import defaultdict, deque
import random
from typing import Tuple, List, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")


## 1. Hierarchical Environment Setup

First, let's create a hierarchical environment that benefits from temporal abstraction. We'll use a multi-room navigation task where the agent must navigate through multiple rooms to reach a goal.


In [None]:
class MultiRoomEnv(gym.Env):
    """
    Multi-room navigation environment for hierarchical RL.
    Agent must navigate through rooms to reach a goal.
    """
    
    def __init__(self, num_rooms=4, room_size=5):
        super().__init__()
        
        self.num_rooms = num_rooms
        self.room_size = room_size
        self.grid_size = num_rooms * room_size
        
        # State: (x, y, room_id, goal_x, goal_y, goal_room)
        self.observation_space = spaces.Box(
            low=0, high=self.grid_size, shape=(6,), dtype=np.float32
        )
        
        # Actions: 0=up, 1=down, 2=left, 3=right
        self.action_space = spaces.Discrete(4)
        
        # Room layout
        self.rooms = self._create_rooms()
        self.doors = self._create_doors()
        
        # Goal and agent positions
        self.goal_pos = None
        self.goal_room = None
        self.agent_pos = None
        self.agent_room = None
        
    def _create_rooms(self):
        """Create room boundaries."""
        rooms = {}
        for i in range(self.num_rooms):
            x_start = i * self.room_size
            x_end = (i + 1) * self.room_size
            rooms[i] = (x_start, x_end, 0, self.room_size)
        return rooms
    
    def _create_doors(self):
        """Create doors between rooms."""
        doors = []
        for i in range(self.num_rooms - 1):
            door_x = (i + 1) * self.room_size - 1
            door_y = self.room_size // 2
            doors.append((door_x, door_y))
        return doors
    
    def reset(self, **kwargs):
        """Reset environment."""
        # Random goal in last room
        self.goal_room = self.num_rooms - 1
        goal_x = np.random.randint(
            self.goal_room * self.room_size + 1,
            (self.goal_room + 1) * self.room_size - 1
        )
        goal_y = np.random.randint(1, self.room_size - 1)
        self.goal_pos = (goal_x, goal_y)
        
        # Random start in first room
        self.agent_room = 0
        start_x = np.random.randint(1, self.room_size - 1)
        start_y = np.random.randint(1, self.room_size - 1)
        self.agent_pos = (start_x, start_y)
        
        return self._get_observation(), {}
    
    def step(self, action):
        """Execute action."""
        x, y = self.agent_pos
        
        # Action effects
        if action == 0:  # up
            new_y = max(0, y - 1)
            new_pos = (x, new_y)
        elif action == 1:  # down
            new_y = min(self.room_size - 1, y + 1)  # rooms only span y in [0, room_size)
            new_pos = (x, new_y)
        elif action == 2:  # left
            new_x = max(0, x - 1)
            new_pos = (new_x, y)
        elif action == 3:  # right
            new_x = min(self.grid_size - 1, x + 1)
            new_pos = (new_x, y)
        
        # Check if move is valid (not through walls, but can go through doors)
        if self._is_valid_move(self.agent_pos, new_pos):
            self.agent_pos = new_pos
            self.agent_room = self._get_room_from_pos(new_pos)
        
        # Compute reward
        reward = self._compute_reward()
        
        # Check if done
        done = self._is_goal_reached()
        
        return self._get_observation(), reward, done, False, {}
    
    def _is_valid_move(self, old_pos, new_pos):
        """Check if move is valid."""
        old_x, old_y = old_pos
        new_x, new_y = new_pos
        
        # Check room boundaries
        old_room = self._get_room_from_pos(old_pos)
        new_room = self._get_room_from_pos(new_pos)
        
        # Can move within room
        if old_room == new_room:
            return True
        
        # Can move through doors
        if (old_x, old_y) in self.doors or (new_x, new_y) in self.doors:
            return True
        
        return False
    
    def _get_room_from_pos(self, pos):
        """Get room ID from position."""
        x, y = pos
        return min(x // self.room_size, self.num_rooms - 1)
    
    def _compute_reward(self):
        """Compute a sparse reward: a goal bonus, otherwise a small step penalty."""
        # Sparse reward: only when reaching goal
        if self._is_goal_reached():
            return 100.0
        
        # Small negative reward for each step (encourage efficiency)
        return -0.1
    
    def _is_goal_reached(self):
        """Check if agent reached goal."""
        return (self.agent_pos == self.goal_pos and 
                self.agent_room == self.goal_room)
    
    def _get_observation(self):
        """Get current observation."""
        agent_x, agent_y = self.agent_pos
        goal_x, goal_y = self.goal_pos
        
        return np.array([
            agent_x, agent_y, self.agent_room,
            goal_x, goal_y, self.goal_room
        ], dtype=np.float32)
    
    def render(self, mode='human'):
        """Render environment."""
        if mode == 'human':
            grid = np.zeros((self.grid_size, self.grid_size))
            
            # Mark rooms
            for room_id, (x_start, x_end, y_start, y_end) in self.rooms.items():
                grid[y_start:y_end, x_start:x_end] = room_id + 1
            
            # Mark doors
            for door_x, door_y in self.doors:
                grid[door_y, door_x] = 0.5
            
            # Mark agent
            agent_x, agent_y = self.agent_pos
            grid[agent_y, agent_x] = -1
            
            # Mark goal
            goal_x, goal_y = self.goal_pos
            grid[goal_y, goal_x] = -2
            
            plt.figure(figsize=(8, 6))
            plt.imshow(grid, cmap='tab10')
            plt.title('Multi-Room Environment')
            plt.xlabel('X Position')
            plt.ylabel('Y Position')
            plt.colorbar()
            plt.show()

# Test the environment
print("Testing Multi-Room Environment...")
env = MultiRoomEnv(num_rooms=3, room_size=4)

# Run a few random episodes
for episode in range(3):
    obs, info = env.reset()
    episode_reward = 0
    
    for step in range(50):  # Max 50 steps
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        
        if terminated or truncated:
            break
    
    print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}, "
          f"Final Room = {obs[2]:.0f}, Goal Room = {obs[5]:.0f}")

print("Environment test completed!")


## 2. Options Framework Implementation

The Options Framework provides temporal abstraction by allowing agents to choose temporally extended actions. An option consists of:
- **Initiation Set**: States where the option can be executed
- **Policy**: How to behave while executing the option
- **Termination Function**: When to stop executing the option
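
The three components above can be illustrated with a minimal call-and-return sketch in plain Python. Everything here is hypothetical and only for illustration: `ToyOption`, `run_option`, and the 1-D corridor `corridor_step` are not part of the assignment code, and the corridor stands in for a single row of the multi-room grid with a door at position 4.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyOption:
    """Hypothetical option: initiation set, intra-option policy, termination."""
    can_initiate: Callable[[int], bool]        # state -> bool
    policy: Callable[[int], int]               # state -> primitive action
    termination_prob: Callable[[int], float]   # state -> probability in [0, 1]


def run_option(state, option, step_fn, gamma=0.99, max_steps=100, rng=None):
    """Run one option until it terminates (call-and-return execution).

    Returns the final state, the discounted reward accumulated while the
    option was active, and the number of primitive steps taken.
    """
    if rng is None:
        rng = random.Random(0)
    assert option.can_initiate(state), "option not available in this state"

    total, discount, k = 0.0, 1.0, 0
    while k < max_steps:
        action = option.policy(state)
        state, reward = step_fn(state, action)
        total += discount * reward
        discount *= gamma
        k += 1
        # Sample termination in the state we just arrived in
        if rng.random() < option.termination_prob(state):
            break
    return state, total, k


def corridor_step(pos, action):
    """Toy dynamics: positions 0..9, action is -1 or +1, step cost -0.1."""
    return min(9, max(0, pos + action)), -0.1


# Option "walk right until the door at position 4"
go_to_door = ToyOption(
    can_initiate=lambda s: s < 4,
    policy=lambda s: 1,
    termination_prob=lambda s: 1.0 if s == 4 else 0.0,
)

final_state, option_return, k = run_option(0, go_to_door, corridor_step, gamma=1.0)
print(final_state, k, round(option_return, 2))  # 4 4 -0.4
```

From the point of view of a policy over options, the whole call looks like a single decision that lasted `k` steps, which is why the resulting decision process is semi-Markov.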


In [None]:
class Option:
    """
    An option represents a temporally extended action.
    """
    
    def __init__(self, option_id, initiation_set, policy, termination_function):
        self.option_id = option_id
        self.initiation_set = initiation_set  # Function: state -> bool
        self.policy = policy  # Function: state -> action
        self.termination_function = termination_function  # Function: state -> termination_prob
    
    def can_initiate(self, state):
        """Check if option can be initiated in given state."""
        return self.initiation_set(state)
    
    def get_action(self, state):
        """Get action from option policy."""
        return self.policy(state)
    
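    def should_terminate(self, state):
        """Sample whether the option terminates in the given state."""
        # termination_function returns a probability, so draw a Bernoulli sample
        return np.random.random() < self.termination_function(state)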


class OptionCritic(nn.Module):
    """
    Option-Critic architecture for learning options end-to-end.
    """
    
    def __init__(self, state_dim, num_options, action_dim, hidden_dim=128):
        super().__init__()
        
        self.state_dim = state_dim
        self.num_options = num_options
        self.action_dim = action_dim
        
        # Shared state encoder
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        
        # Option policies (one for each option)
        self.option_policies = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, action_dim)
            ) for _ in range(num_options)
        ])
        
        # Termination functions
        self.termination_functions = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_options),
            nn.Sigmoid()
        )
        
        # Q-value over options
        self.q_omega = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_options)
        )
        
        # Value function for options
        self.value_function = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
    
    def forward(self, state, current_option=None):
        """Forward pass through the network."""
        features = self.encoder(state)
        
        if current_option is not None:
            # Get action from current option
            action_logits = self.option_policies[current_option](features)
            
            # Get termination probability
            termination_probs = self.termination_functions(features)
            termination_prob = termination_probs[:, current_option]
            
            return action_logits, termination_prob
        else:
            # Select option
            q_omega = self.q_omega(features)
            return q_omega
    
    def get_value(self, state):
        """Get state value."""
        features = self.encoder(state)
        return self.value_function(features)
    
    def select_option(self, state, epsilon=0.1):
        """Select option using epsilon-greedy policy."""
        if np.random.random() < epsilon:
            return np.random.randint(self.num_options)
        
        with torch.no_grad():
            q_omega = self.forward(state)
            return q_omega.argmax(dim=-1).item()
    
    def get_action_and_termination(self, state, option):
        """Get action and termination probability for given option."""
        action_logits, termination_prob = self.forward(state, option)
        
        # Sample action
        action_probs = F.softmax(action_logits, dim=-1)
        action = torch.multinomial(action_probs, 1).item()
        
        # Sample termination
        should_terminate = np.random.random() < termination_prob.item()
        
        return action, should_terminate


class OptionLearningAgent:
    """
    Agent that learns options using the Option-Critic algorithm.
    """
    
    def __init__(self, state_dim, num_options, action_dim, lr=3e-4, gamma=0.99):
        self.state_dim = state_dim
        self.num_options = num_options
        self.action_dim = action_dim
        self.gamma = gamma
        
        # Networks
        self.option_critic = OptionCritic(state_dim, num_options, action_dim)
        self.optimizer = optim.Adam(self.option_critic.parameters(), lr=lr)
        
        # Experience buffer
        self.buffer = []
        
    def select_action(self, state, current_option=None, epsilon=0.1):
        """Select action using current option or select new option."""
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        
        if current_option is None:
            # Select new option
            option = self.option_critic.select_option(state_tensor, epsilon)
            return option, None
        else:
            # Get action from current option
            action, should_terminate = self.option_critic.get_action_and_termination(
                state_tensor, current_option
            )
            return action, should_terminate
    
    def store_transition(self, state, option, action, reward, next_state, terminated, truncated):
        """Store transition in buffer."""
        self.buffer.append({
            'state': state,
            'option': option,
            'action': action,
            'reward': reward,
            'next_state': next_state,
            'terminated': terminated,
            'truncated': truncated
        })
    
    def update(self, batch_size=32):
        """Update networks using stored experiences."""
        if len(self.buffer) < batch_size:
            return
        
        # Sample batch
        batch = random.sample(self.buffer, batch_size)
        
        states = torch.FloatTensor([t['state'] for t in batch])
        options = torch.LongTensor([t['option'] for t in batch])
        actions = torch.LongTensor([t['action'] for t in batch])
        rewards = torch.FloatTensor([t['reward'] for t in batch])
        next_states = torch.FloatTensor([t['next_state'] for t in batch])
        terminated = torch.BoolTensor([t['terminated'] for t in batch])
        
        # Compute Q-values
        q_omega = self.option_critic.q_omega(self.option_critic.encoder(states))
        q_values = q_omega.gather(1, options.unsqueeze(1)).squeeze(1)
        
        # Compute target Q-values
        with torch.no_grad():
            next_q_omega = self.option_critic.q_omega(self.option_critic.encoder(next_states))
            next_values = self.option_critic.get_value(next_states).squeeze(1)
            
            # Target for Q-learning
            targets = rewards + self.gamma * next_values * (~terminated)
        
        # Q-learning loss
        q_loss = F.mse_loss(q_values, targets)
        
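        # Policy gradient loss (simplified)
        # forward() expects a single integer option, so for a batch we evaluate
        # every option head and pick the one that was active for each sample.
        features = self.option_critic.encoder(states)
        all_logits = torch.stack(
            [head(features) for head in self.option_critic.option_policies], dim=1
        )  # (batch, num_options, action_dim)
        action_logits = all_logits[torch.arange(options.size(0)), options]
        log_probs = F.log_softmax(action_logits, dim=-1)
        termination_probs = self.option_critic.termination_functions(features)  # (batch, num_options)
        
        # Advantage estimation (simplified)
        advantages = targets - q_values.detach()
        
        # Policy loss
        policy_loss = -(log_probs.gather(1, actions.unsqueeze(1)).squeeze(1) * advantages).mean()
        
        # Termination loss (encourage appropriate termination)
        termination_loss = -(termination_probs.gather(1, options.unsqueeze(1)).squeeze(1) * advantages).mean()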
        
        # Total loss
        total_loss = q_loss + policy_loss + termination_loss
        
        # Update
        self.optimizer.zero_grad()
        total_loss.backward()
        self.optimizer.step()
        
        return {
            'q_loss': q_loss.item(),
            'policy_loss': policy_loss.item(),
            'termination_loss': termination_loss.item(),
            'total_loss': total_loss.item()
        }


# Test the Option-Critic implementation
print("Testing Option-Critic Implementation...")

# Create environment and agent
env = MultiRoomEnv(num_rooms=3, room_size=4)
state_dim = env.observation_space.shape[0]
num_options = 3  # Navigate to room 1, 2, and 3
action_dim = env.action_space.n

agent = OptionLearningAgent(state_dim, num_options, action_dim)

# Test action selection
obs, _ = env.reset()
print(f"Initial observation: {obs}")

# Select initial option
option, _ = agent.select_action(obs)
print(f"Selected option: {option}")

# Get action from option
action, should_terminate = agent.select_action(obs, option)
print(f"Action from option {option}: {action}, Should terminate: {should_terminate}")

print("Option-Critic test completed!")


## 3. Feudal Hierarchies

Feudal RL implements a manager-worker hierarchy where:
- **Manager**: Sets high-level goals/commands
- **Worker**: Executes low-level actions to achieve goals
- **Communication**: Manager communicates goals to worker through embeddings
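
To make the manager-worker communication concrete, here is a small sketch of a directional intrinsic reward: the worker is rewarded when the change in its state embedding points the same way as the manager's goal. It is written in plain Python for readability. The helper names `cosine_similarity` and `intrinsic_rewards` are hypothetical, and the sketch assumes that goals and state embeddings share the same dimension.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)


def intrinsic_rewards(embeddings, goals):
    """Reward at step t: cos(embedding[t+1] - embedding[t], goal[t]).

    Assumes goals and embeddings have the same dimension (an assumption of
    this sketch). Returns one reward per transition.
    """
    rewards = []
    for t in range(len(embeddings) - 1):
        transition = [b - a for a, b in zip(embeddings[t], embeddings[t + 1])]
        rewards.append(cosine_similarity(transition, goals[t]))
    return rewards


# The manager asks for movement along +x for the whole segment
embeddings = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
goals = [[1.0, 0.0]] * 4

rewards = intrinsic_rewards(embeddings, goals)
print(rewards)  # [1.0, 0.0, -1.0]
```

The first transition follows the goal direction (reward 1), the second is orthogonal to it (reward 0), and the third moves against it (reward -1).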


In [None]:
class FeudalNet(nn.Module):
    """
    Feudal Networks implementation with Manager-Worker hierarchy.
    """
    
    def __init__(self, state_dim, action_dim, goal_dim=16, c=10, hidden_dim=256):
        super().__init__()
        
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.goal_dim = goal_dim
        self.c = c  # Manager update frequency
        
        # Shared perception module
        self.perception = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        
        # Manager network (LSTM for temporal dependencies)
        self.manager = nn.LSTM(hidden_dim, goal_dim, batch_first=True)
        
        # Worker network (receives state + goal)
        self.worker = nn.LSTM(hidden_dim + goal_dim, hidden_dim, batch_first=True)
        self.worker_policy = nn.Linear(hidden_dim, action_dim)
        
        # Value functions
        self.manager_value = nn.Linear(goal_dim, 1)
        self.worker_value = nn.Linear(hidden_dim, 1)
        
    def forward(self, state, manager_hidden=None, worker_hidden=None, t=0):
        """
        Forward pass through feudal network.
        
        Args:
            state: Current state
            manager_hidden: Manager LSTM hidden state
            worker_hidden: Worker LSTM hidden state
            t: Current timestep
        """
        batch_size = state.size(0)
        
        # Shared perception
        z = self.perception(state)
        
        # Manager operates every c timesteps
        if t % self.c == 0 or manager_hidden is None:
            # Manager sets goal
            if manager_hidden is None:
                manager_hidden = (torch.zeros(1, batch_size, self.goal_dim),
                                torch.zeros(1, batch_size, self.goal_dim))
            
            goal, manager_hidden = self.manager(z.unsqueeze(1), manager_hidden)
            goal = F.normalize(goal, dim=-1)  # Normalize goal vector
        else:
            # Reuse the previous goal. For a single-layer LSTM the last output
            # equals the hidden state h, so normalizing h recovers the goal
            # that the manager emitted at its last update.
            goal = F.normalize(manager_hidden[0].transpose(0, 1), dim=-1)
        
        # Worker receives state + goal
        worker_input = torch.cat([z, goal.squeeze(1)], dim=-1)
        
        if worker_hidden is None:
            # hidden_dim is not in scope here, so read the size from the LSTM itself
            worker_hidden = (torch.zeros(1, batch_size, self.worker.hidden_size),
                           torch.zeros(1, batch_size, self.worker.hidden_size))
        
        worker_out, worker_hidden = self.worker(worker_input.unsqueeze(1), worker_hidden)
        
        # Worker action
        action_logits = self.worker_policy(worker_out.squeeze(1))
        
        # Value estimates
        manager_value = self.manager_value(goal.squeeze(1))
        worker_value = self.worker_value(worker_out.squeeze(1))
        
        return {
            'action_logits': action_logits,
            'goal': goal.squeeze(1),
            'manager_value': manager_value,
            'worker_value': worker_value,
            'manager_hidden': manager_hidden,
            'worker_hidden': worker_hidden
        }


class FeudalAgent:
    """
    Feudal RL agent with manager-worker hierarchy.
    """
    
    def __init__(self, state_dim, action_dim, goal_dim=16, c=10, lr=3e-4, gamma=0.99):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.goal_dim = goal_dim
        self.c = c
        self.gamma = gamma
        
        # Networks
        self.feudal_net = FeudalNet(state_dim, action_dim, goal_dim, c)
        self.optimizer = optim.Adam(self.feudal_net.parameters(), lr=lr)
        
        # Hidden states
        self.manager_hidden = None
        self.worker_hidden = None
        
        # Experience buffer
        self.buffer = []
        
    def select_action(self, state, t=0):
        """Select action using feudal hierarchy."""
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        
        with torch.no_grad():
            output = self.feudal_net(state_tensor, self.manager_hidden, self.worker_hidden, t)
            
            # Update hidden states
            self.manager_hidden = output['manager_hidden']
            self.worker_hidden = output['worker_hidden']
            
            # Sample action
            action_probs = F.softmax(output['action_logits'], dim=-1)
            action = torch.multinomial(action_probs, 1).item()
            
            return action, output['goal'].squeeze(0).numpy()
    
    def store_transition(self, state, action, reward, next_state, goal, manager_value, worker_value, done):
        """Store transition in buffer."""
        self.buffer.append({
            'state': state,
            'action': action,
            'reward': reward,
            'next_state': next_state,
            'goal': goal,
            'manager_value': manager_value,
            'worker_value': worker_value,
            'done': done
        })
    
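    def compute_manager_reward(self, goals, states, c):
        """
        Compute manager reward based on transition embedding similarity.
        Manager is rewarded for setting goals that align with state transitions.
        
        Note: the cosine similarity below requires goal_dim to equal the size
        of the perception embedding (hidden_dim in FeudalNet).
        """
        # Need more than c steps to form at least one c-step transition
        if len(goals) <= c:
            return torch.zeros(1)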
        
        # Compute state embeddings
        state_tensors = torch.FloatTensor(states)
        embeddings = self.feudal_net.perception(state_tensors)
        
        # Compute transition directions
        transitions = embeddings[c:] - embeddings[:-c]
        
        # Compute cosine similarity between goals and transitions
        goal_tensors = torch.FloatTensor(goals[:-c])
        similarities = F.cosine_similarity(goal_tensors, transitions, dim=-1)
        
        return similarities.mean()
    
    def compute_worker_reward(self, goals, states, rewards, alpha=0.1):
        """
        Compute worker reward: extrinsic + alpha * intrinsic.
        Intrinsic reward encourages progress toward manager's goal.
        As in compute_manager_reward, goal_dim must match the embedding size.
        """
        if len(goals) < 2:
            return torch.FloatTensor(rewards)
        
        # Compute state embeddings
        state_tensors = torch.FloatTensor(states)
        embeddings = self.feudal_net.perception(state_tensors)
        
        # Compute progress toward goal
        goal_tensors = torch.FloatTensor(goals[:-1])
        transitions = embeddings[1:] - embeddings[:-1]
        
        # Intrinsic reward: progress toward goal
        intrinsic_rewards = F.cosine_similarity(goal_tensors, transitions, dim=-1)
        
        # Combine extrinsic and intrinsic rewards
        extrinsic_rewards = torch.FloatTensor(rewards[:-1])
        total_rewards = extrinsic_rewards + alpha * intrinsic_rewards
        
        return total_rewards
    
    def update(self, batch_size=32):
        """Update feudal networks."""
        if len(self.buffer) < batch_size:
            return
        
        # Sample batch
        batch = random.sample(self.buffer, batch_size)
        
        states = torch.FloatTensor([t['state'] for t in batch])
        actions = torch.LongTensor([t['action'] for t in batch])
        rewards = torch.FloatTensor([t['reward'] for t in batch])
        next_states = torch.FloatTensor([t['next_state'] for t in batch])
        goals = torch.FloatTensor([t['goal'] for t in batch])
        manager_values = torch.FloatTensor([t['manager_value'] for t in batch])
        worker_values = torch.FloatTensor([t['worker_value'] for t in batch])
        dones = torch.BoolTensor([t['done'] for t in batch])
        
        # Forward pass
        output = self.feudal_net(states)
        
        # Compute targets
        with torch.no_grad():
            next_output = self.feudal_net(next_states)
            # Value heads return (batch, 1); squeeze so the targets stay (batch,)
            # instead of broadcasting to (batch, batch)
            next_manager_values = next_output['manager_value'].squeeze(-1)
            next_worker_values = next_output['worker_value'].squeeze(-1)
            
            manager_targets = rewards + self.gamma * next_manager_values * (~dones)
            worker_targets = rewards + self.gamma * next_worker_values * (~dones)
        
        # Manager loss
        manager_loss = F.mse_loss(output['manager_value'].squeeze(-1), manager_targets)
        
        # Worker loss
        worker_loss = F.mse_loss(output['worker_value'].squeeze(-1), worker_targets)
        
        # Policy loss (simplified)
        action_logits = output['action_logits']
        log_probs = F.log_softmax(action_logits, dim=-1)
        selected_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        
        # Advantage estimation
        advantages = worker_targets - worker_values.detach()
        policy_loss = -(selected_log_probs * advantages).mean()
        
        # Total loss
        total_loss = manager_loss + worker_loss + policy_loss
        
        # Update
        self.optimizer.zero_grad()
        total_loss.backward()
        self.optimizer.step()
        
        return {
            'manager_loss': manager_loss.item(),
            'worker_loss': worker_loss.item(),
            'policy_loss': policy_loss.item(),
            'total_loss': total_loss.item()
        }


# Test Feudal Networks
print("Testing Feudal Networks...")

# Create environment and agent
env = MultiRoomEnv(num_rooms=3, room_size=4)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

agent = FeudalAgent(state_dim, action_dim, goal_dim=8, c=5)

# Test action selection
obs, _ = env.reset()
print(f"Initial observation: {obs}")

# Select action
action, goal = agent.select_action(obs, t=0)
print(f"Selected action: {action}")
print(f"Manager goal: {goal}")

# Test multiple steps
for t in range(3):
    action, goal = agent.select_action(obs, t=t)
    print(f"Step {t}: Action = {action}, Goal norm = {np.linalg.norm(goal):.3f}")

print("Feudal Networks test completed!")


## 4. Goal-Conditioned RL with Hindsight Experience Replay

Goal-conditioned RL trains a single policy to reach many different goals. Hindsight Experience Replay (HER) improves sample efficiency under sparse rewards: it relabels failed trajectories with goals the agent actually achieved, so those trajectories still provide successful examples.
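
The relabeling idea can be sketched independently of the agent classes in the next cell. The function below is a hypothetical helper (`her_relabel_future` is not part of the assignment code) that illustrates the `future` strategy: each transition is copied `k` times, each time with the goal replaced by something achieved at the same or a later step of the episode. It assumes `achieved_goal` is the goal reached after the transition and uses a sparse 0/1 reward.

```python
import random


def her_relabel_future(trajectory, k=4, rng=None):
    """Return hindsight copies of each transition ('future' strategy).

    trajectory: list of dicts with keys 'state', 'action', 'reward',
    'next_state', 'goal', 'achieved_goal', 'done' (a layout assumed for
    this sketch).
    """
    if rng is None:
        rng = random.Random(0)

    relabeled = []
    last = len(trajectory) - 1
    for t, transition in enumerate(trajectory):
        for _ in range(k):
            # Pick a goal that was actually achieved at step t or later
            future_t = rng.randint(t, last)  # both ends inclusive
            new_goal = trajectory[future_t]['achieved_goal']
            # Recompute the sparse reward against the substituted goal
            reached = transition['achieved_goal'] == new_goal
            relabeled.append({
                **transition,
                'goal': new_goal,
                'reward': 1.0 if reached else 0.0,
                'done': reached,
            })
    return relabeled


# A failed episode: the agent walks 0 -> 1 -> 2 -> 3 but the goal was 9
positions = [0, 1, 2, 3]
trajectory = [
    {'state': positions[i], 'action': 1, 'reward': 0.0,
     'next_state': positions[i + 1], 'goal': 9,
     'achieved_goal': positions[i + 1], 'done': False}
    for i in range(3)
]

relabeled = her_relabel_future(trajectory, k=2, rng=random.Random(0))
print(len(relabeled), sum(tr['reward'] for tr in relabeled))
```

Every original transition has reward 0, yet the relabeled copies contain successes: the final transition always does, because the only future goal available to it is the one it just achieved.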


In [None]:
class GoalConditionedPolicy(nn.Module):
    """
    Universal Value Function Approximator (UVFA) for goal-conditioned RL.
    """
    
    def __init__(self, state_dim, goal_dim, action_dim, hidden_dim=256):
        super().__init__()
        
        self.state_dim = state_dim
        self.goal_dim = goal_dim
        self.action_dim = action_dim
        
        # Policy network (state + goal -> action)
        self.policy = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
        
        # Q-value network (state + goal + action -> Q-value)
        self.q_network = nn.Sequential(
            nn.Linear(state_dim + goal_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
        
        # Value network (state + goal -> value)
        self.value_network = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
    
    def forward(self, state, goal):
        """Forward pass through policy network."""
        x = torch.cat([state, goal], dim=-1)
        return self.policy(x)
    
    def get_q_value(self, state, goal, action):
        """Get Q-value for state-goal-action tuple."""
        x = torch.cat([state, goal, action], dim=-1)
        return self.q_network(x)
    
    def get_value(self, state, goal):
        """Get value for state-goal pair."""
        x = torch.cat([state, goal], dim=-1)
        return self.value_network(x)


class HERBuffer:
    """
    Experience buffer with Hindsight Experience Replay support.
    """
    
    def __init__(self, capacity=100000):
        self.capacity = capacity
        self.buffer = []
        self.position = 0
    
    def add(self, state, action, reward, next_state, goal, achieved_goal, done):
        """Add transition to buffer."""
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        
        self.buffer[self.position] = {
            'state': state,
            'action': action,
            'reward': reward,
            'next_state': next_state,
            'goal': goal,
            'achieved_goal': achieved_goal,
            'done': done
        }
        
        self.position = (self.position + 1) % self.capacity
    
    def sample(self, batch_size):
        """Sample batch from buffer."""
        return random.sample(self.buffer, batch_size)
    
    def __len__(self):
        return len(self.buffer)


class HERAgent:
    """
    Goal-conditioned RL agent with Hindsight Experience Replay.
    """
    
    def __init__(self, state_dim, goal_dim, action_dim, lr=3e-4, gamma=0.99, 
                 her_ratio=0.8, target_update_freq=100):
        self.state_dim = state_dim
        self.goal_dim = goal_dim
        self.action_dim = action_dim
        self.gamma = gamma
        self.her_ratio = her_ratio
        self.target_update_freq = target_update_freq
        
        # Networks
        self.policy_net = GoalConditionedPolicy(state_dim, goal_dim, action_dim)
        self.target_policy_net = GoalConditionedPolicy(state_dim, goal_dim, action_dim)
        self.target_policy_net.load_state_dict(self.policy_net.state_dict())
        
        # Optimizers
        self.policy_optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)
        
        # Experience buffer
        self.buffer = HERBuffer()
        
        # Update counter
        self.update_count = 0
        
    def select_action(self, state, goal, epsilon=0.1):
        """Select a random action with probability epsilon; otherwise sample from the softmax policy."""
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        goal_tensor = torch.FloatTensor(goal).unsqueeze(0)
        
        if np.random.random() < epsilon:
            return np.random.randint(self.action_dim)
        
        with torch.no_grad():
            action_logits = self.policy_net(state_tensor, goal_tensor)
            action_probs = F.softmax(action_logits, dim=-1)
            action = torch.multinomial(action_probs, 1).item()
        
        return action
    
    def store_transition(self, state, action, reward, next_state, goal, achieved_goal, done):
        """Store transition in buffer."""
        self.buffer.add(state, action, reward, next_state, goal, achieved_goal, done)
    
    def compute_reward(self, achieved_goal, desired_goal):
        """Compute reward based on goal achievement."""
        # Sparse reward: 1 if goal achieved, 0 otherwise
        if np.array_equal(achieved_goal, desired_goal):
            return 1.0
        return 0.0
    
    def hindsight_experience_replay(self, trajectory, strategy='future'):
        """
        Generate hindsight experiences from a trajectory.
        
        The original transitions are expected to be in the buffer already
        (via store_transition), so only relabeled copies are added here.
        
        Args:
            trajectory: List of (state, action, reward, next_state, goal, achieved_goal, done)
            strategy: 'future', 'final', or 'random'
        """
        if strategy not in ('future', 'final', 'random'):
            raise ValueError(f"Unknown HER strategy: {strategy}")
        if not trajectory:
            return
        
        states, actions, _, next_states, _, achieved_goals, dones = zip(*trajectory)
        T = len(trajectory)
        
        # Relabel each transition once with a goal that was actually achieved
        for t in range(T):
            if strategy == 'future':
                # Goal achieved at step t or later in the same episode
                new_goal = achieved_goals[np.random.randint(t, T)]
            elif strategy == 'final':
                # Goal achieved at the end of the episode
                new_goal = achieved_goals[-1]
            else:  # 'random'
                # Goal achieved at any step of the episode
                new_goal = achieved_goals[np.random.randint(T)]
            
            # Recompute the reward of transition t under the substituted goal
            new_reward = self.compute_reward(achieved_goals[t], new_goal)
            
            # Add the relabeled transition
            self.buffer.add(states[t], actions[t], new_reward, next_states[t],
                            new_goal, achieved_goals[t], dones[t])
    
    def update(self, batch_size=32):
        """Update policy using DQN with HER."""
        if len(self.buffer) < batch_size:
            return
        
        # Sample batch
        batch = self.buffer.sample(batch_size)
        
        states = torch.FloatTensor([t['state'] for t in batch])
        actions = torch.LongTensor([t['action'] for t in batch])
        rewards = torch.FloatTensor([t['reward'] for t in batch])
        next_states = torch.FloatTensor([t['next_state'] for t in batch])
        goals = torch.FloatTensor([t['goal'] for t in batch])
        dones = torch.BoolTensor([t['done'] for t in batch])
        
        # Current Q-values
        current_q_values = self.policy_net.get_q_value(states, goals, actions.unsqueeze(1).float())
        
        # Target Q-values
        with torch.no_grad():
            # Get next actions from target policy
            next_action_logits = self.target_policy_net(next_states, goals)
            next_actions = next_action_logits.argmax(dim=-1, keepdim=True)
            
            # Get target Q-values
            next_q_values = self.target_policy_net.get_q_value(next_states, goals, next_actions.float())
            target_q_values = rewards.unsqueeze(1) + self.gamma * next_q_values * (~dones).unsqueeze(1)
        
        # Q-learning loss
        q_loss = F.mse_loss(current_q_values, target_q_values)
        
        # Update policy
        self.policy_optimizer.zero_grad()
        q_loss.backward()
        self.policy_optimizer.step()
        
        # Update target network
        self.update_count += 1
        if self.update_count % self.target_update_freq == 0:
            self.target_policy_net.load_state_dict(self.policy_net.state_dict())
        
        return {'q_loss': q_loss.item()}


# Test Goal-Conditioned RL with HER
print("Testing Goal-Conditioned RL with HER...")

# Create environment and agent
env = MultiRoomEnv(num_rooms=3, room_size=4)
state_dim = env.observation_space.shape[0]
goal_dim = 2  # Goal is (x, y) position
action_dim = env.action_space.n

agent = HERAgent(state_dim, goal_dim, action_dim)

# Test action selection
obs, _ = env.reset()
goal = np.array([obs[3], obs[4]])  # Goal position from observation

action = agent.select_action(obs, goal)
print(f"Selected action: {action}")

# Test reward computation
achieved_goal = np.array([obs[0], obs[1]])  # Current position
reward = agent.compute_reward(achieved_goal, goal)
print(f"Reward for current position: {reward}")

# Test HER
trajectory = [
    (obs, action, reward, obs, goal, achieved_goal, False)
]
agent.hindsight_experience_replay(trajectory, strategy='future')
print(f"Buffer size after HER: {len(agent.buffer)}")

print("Goal-Conditioned RL with HER test completed!")
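
The 'future' relabeling strategy used by `hindsight_experience_replay` can be illustrated without any networks or buffers. The following is a minimal sketch under stated assumptions: `relabel_future` is a hypothetical helper (not part of `HERAgent`), goals are plain `(x, y)` tuples, and the reward is the same sparse 1/0 signal as `compute_reward`.

```python
import random

def relabel_future(achieved_goals, rng=random):
    """For each step t, substitute a goal achieved at step t or later
    and recompute the sparse reward under that substituted goal."""
    relabeled = []
    T = len(achieved_goals)
    for t in range(T):
        # Sample an index in [t, T): the goal must come from the future of step t
        new_goal = achieved_goals[rng.randrange(t, T)]
        reward = 1.0 if achieved_goals[t] == new_goal else 0.0
        relabeled.append((t, new_goal, reward))
    return relabeled

# Positions reached in a hypothetical failed episode
visited = [(0, 1), (0, 2), (1, 2), (2, 2)]
for t, goal, reward in relabel_future(visited):
    print(t, goal, reward)
```

The last transition can only be relabeled with its own achieved goal, so it always receives reward 1. This is how a failed episode still produces at least one positive learning signal.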


## 5. Skill Discovery with DIAYN

DIAYN (Diversity is All You Need) learns diverse skills without external rewards by maximizing mutual information between skills and states.
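
To make the objective concrete: with a uniform prior p(z) over K skills and a learned discriminator q(z | s), the per-step pseudo-reward is log q(z | s) - log p(z). The block below is a minimal, framework-free sketch of that formula. `diayn_reward` is a hypothetical helper name introduced for illustration, and it assumes the discriminator output is available as a plain list of logits.

```python
import math

def diayn_reward(skill_logits, skill_id):
    """Pseudo-reward log q(z|s) - log p(z) for a uniform prior over skills."""
    # Numerically stable log-softmax over the discriminator logits
    m = max(skill_logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in skill_logits))
    log_q = skill_logits[skill_id] - log_norm
    # Uniform prior: p(z) = 1 / K
    log_p = -math.log(len(skill_logits))
    return log_q - log_p

# An uninformed discriminator (all logits equal) gives a reward of zero
print(diayn_reward([0.0, 0.0, 0.0, 0.0], skill_id=2))
# Confidence in the active skill pushes the reward toward its upper bound log(K)
print(diayn_reward([0.0, 0.0, 8.0, 0.0], skill_id=2), math.log(4))
# Confidence in a different skill gives a negative reward
print(diayn_reward([8.0, 0.0, 0.0, 0.0], skill_id=2))
```

The reward is positive only when the visited state makes the active skill easier to identify than chance, which is what pushes different skills toward different regions of the state space.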


In [None]:
class SkillConditionedPolicy(nn.Module):
    """
    Policy conditioned on skill (latent variable).
    """
    
    def __init__(self, state_dim, skill_dim, action_dim, hidden_dim=256):
        super().__init__()
        
        self.state_dim = state_dim
        self.skill_dim = skill_dim
        self.action_dim = action_dim
        
        # Policy network (state + skill -> action)
        self.policy = nn.Sequential(
            nn.Linear(state_dim + skill_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
    
    def forward(self, state, skill):
        """Forward pass through policy network."""
        x = torch.cat([state, skill], dim=-1)
        return self.policy(x)


class SkillDiscriminator(nn.Module):
    """
    Discriminator that predicts skill from state.
    """
    
    def __init__(self, state_dim, skill_dim, hidden_dim=256):
        super().__init__()
        
        self.state_dim = state_dim
        self.skill_dim = skill_dim
        
        # Discriminator network (state -> skill_logits)
        self.discriminator = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, skill_dim)
        )
    
    def forward(self, state):
        """Forward pass through discriminator network."""
        return self.discriminator(state)


class DIAYNAgent:
    """
    DIAYN agent for unsupervised skill discovery.
    """
    
    def __init__(self, state_dim, skill_dim, action_dim, lr=3e-4, gamma=0.99):
        self.state_dim = state_dim
        self.skill_dim = skill_dim
        self.action_dim = action_dim
        self.gamma = gamma
        
        # Networks
        self.policy = SkillConditionedPolicy(state_dim, skill_dim, action_dim)
        self.discriminator = SkillDiscriminator(state_dim, skill_dim)
        
        # Optimizers
        self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        self.discriminator_optimizer = optim.Adam(self.discriminator.parameters(), lr=lr)
        
        # Experience buffer
        self.buffer = []
        
    def sample_skill(self):
        """Sample random skill (one-hot vector)."""
        skill_id = np.random.randint(self.skill_dim)
        skill = np.zeros(self.skill_dim)
        skill[skill_id] = 1.0
        return skill, skill_id
    
    def select_action(self, state, skill, epsilon=0.1):
        """Select action using skill-conditioned policy."""
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        skill_tensor = torch.FloatTensor(skill).unsqueeze(0)
        
        if np.random.random() < epsilon:
            return np.random.randint(self.action_dim)
        
        with torch.no_grad():
            action_logits = self.policy(state_tensor, skill_tensor)
            action_probs = F.softmax(action_logits, dim=-1)
            action = torch.multinomial(action_probs, 1).item()
        
        return action
    
    def compute_intrinsic_reward(self, state, skill_id):
        """
        Compute intrinsic reward based on mutual information.
        Reward = log p(skill|state) - log p(skill)
        """
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        
        with torch.no_grad():
            skill_logits = self.discriminator(state_tensor)
            skill_probs = F.softmax(skill_logits, dim=-1)
            
            # log p(skill|state)
            log_p_skill_given_state = torch.log(skill_probs[0, skill_id] + 1e-8)
            
            # log p(skill) = log(1/num_skills)
            log_p_skill = np.log(1.0 / self.skill_dim)
            
            # Intrinsic reward
            intrinsic_reward = log_p_skill_given_state - log_p_skill
        
        return intrinsic_reward.item()
    
    def store_transition(self, state, skill_id, action, reward, next_state, done):
        """Store transition in buffer."""
        self.buffer.append({
            'state': state,
            'skill_id': skill_id,
            'action': action,
            'reward': reward,
            'next_state': next_state,
            'done': done
        })
    
    def update_policy(self, batch_size=32):
        """Update policy using intrinsic rewards."""
        if len(self.buffer) < batch_size:
            return
        
        # Sample batch
        batch = random.sample(self.buffer, batch_size)
        
        states = torch.FloatTensor([t['state'] for t in batch])
        skills = torch.LongTensor([t['skill_id'] for t in batch])
        actions = torch.LongTensor([t['action'] for t in batch])
        rewards = torch.FloatTensor([t['reward'] for t in batch])
        next_states = torch.FloatTensor([t['next_state'] for t in batch])
        dones = torch.BoolTensor([t['done'] for t in batch])
        
        # Convert skills to one-hot
        skill_one_hot = F.one_hot(skills, num_classes=self.skill_dim).float()
        
        # Policy loss (REINFORCE with intrinsic rewards)
        action_logits = self.policy(states, skill_one_hot)
        log_probs = F.log_softmax(action_logits, dim=-1)
        action_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        
        # Policy gradient loss
        policy_loss = -(action_log_probs * rewards).mean()
        
        # Update policy
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()
        
        return {'policy_loss': policy_loss.item()}
    
    def update_discriminator(self, batch_size=32):
        """Update discriminator to predict skills from states."""
        if len(self.buffer) < batch_size:
            return
        
        # Sample batch
        batch = random.sample(self.buffer, batch_size)
        
        states = torch.FloatTensor([t['state'] for t in batch])
        skills = torch.LongTensor([t['skill_id'] for t in batch])
        
        # Discriminator loss (cross-entropy)
        skill_logits = self.discriminator(states)
        discriminator_loss = F.cross_entropy(skill_logits, skills)
        
        # Update discriminator
        self.discriminator_optimizer.zero_grad()
        discriminator_loss.backward()
        self.discriminator_optimizer.step()
        
        return {'discriminator_loss': discriminator_loss.item()}
    
    def update(self, batch_size=32):
        """Update both policy and discriminator."""
        # Each update returns None until the buffer holds a full batch
        policy_losses = self.update_policy(batch_size) or {}
        discriminator_losses = self.update_discriminator(batch_size) or {}
        
        return {**policy_losses, **discriminator_losses}


# Test DIAYN
print("Testing DIAYN Skill Discovery...")

# Create environment and agent
env = MultiRoomEnv(num_rooms=3, room_size=4)
state_dim = env.observation_space.shape[0]
skill_dim = 4  # Number of skills to discover
action_dim = env.action_space.n

agent = DIAYNAgent(state_dim, skill_dim, action_dim)

# Test skill sampling
skill, skill_id = agent.sample_skill()
print(f"Sampled skill: {skill}, ID: {skill_id}")

# Test action selection
obs, _ = env.reset()
action = agent.select_action(obs, skill)
print(f"Action from skill {skill_id}: {action}")

# Test intrinsic reward
intrinsic_reward = agent.compute_intrinsic_reward(obs, skill_id)
print(f"Intrinsic reward: {intrinsic_reward:.3f}")

# Test updates
agent.store_transition(obs, skill_id, action, 0.0, obs, False)
losses = agent.update()
print(f"Update losses: {losses}")

print("DIAYN test completed!")


## 6. Training and Evaluation Functions

Let's implement training and evaluation functions for all hierarchical RL methods.
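
Two of the training loops (Option-Critic and HER) anneal exploration with a multiplicative epsilon decay clipped at a floor. As a rough sketch of what that schedule looks like over a run, the snippet below replays the same rule in isolation. `epsilon_schedule` is a hypothetical helper, and its defaults mirror the `epsilon_start`, `epsilon_end`, and `epsilon_decay` defaults of those loops.

```python
def epsilon_schedule(num_episodes, start=1.0, end=0.1, decay=0.995):
    """Return the epsilon used in each episode under multiplicative decay."""
    epsilons = []
    epsilon = start
    for _ in range(num_episodes):
        epsilons.append(epsilon)
        # Same rule as the training loops: decay, then clip at the floor
        epsilon = max(end, epsilon * decay)
    return epsilons

eps = epsilon_schedule(1000)
first_floor = next(i for i, e in enumerate(eps) if e <= 0.1)
print(f"epsilon reaches its floor at episode {first_floor}")
```

With these defaults, a bit less than half of a 1000-episode run is spent above the exploration floor. A smaller `epsilon_decay` shortens that phase if learning curves flatten early.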


In [None]:
def train_option_critic(env, agent, num_episodes=1000, max_steps=100, 
                        epsilon_start=1.0, epsilon_end=0.1, epsilon_decay=0.995):
    """
    Train Option-Critic agent.
    """
    epsilon = epsilon_start
    episode_rewards = []
    episode_lengths = []
    losses = []
    
    for episode in trange(num_episodes, desc="Training Option-Critic"):
        obs, _ = env.reset()
        episode_reward = 0
        episode_length = 0
        
        # Select initial option
        current_option = None
        option_steps = 0
        
        for step in range(max_steps):
            if current_option is None or option_steps >= 10:  # Option termination
                current_option, _ = agent.select_action(obs, epsilon=epsilon)
                option_steps = 0
            
            # Get action from current option
            action, should_terminate = agent.select_action(obs, current_option, epsilon=epsilon)
            
            # Execute action
            next_obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            # Store transition
            agent.store_transition(obs, current_option, action, reward, next_obs, terminated, truncated)
            
            # Update agent
            if len(agent.buffer) >= 32:
                loss_info = agent.update()
                losses.append(loss_info)
            
            episode_reward += reward
            episode_length += 1
            
            # Option termination
            if should_terminate or done:
                current_option = None
                option_steps = 0
            else:
                option_steps += 1
            
            obs = next_obs
            
            if done:
                break
        
        episode_rewards.append(episode_reward)
        episode_lengths.append(episode_length)
        
        # Decay epsilon
        epsilon = max(epsilon_end, epsilon * epsilon_decay)
    
    return episode_rewards, episode_lengths, losses


def train_feudal_agent(env, agent, num_episodes=1000, max_steps=100):
    """
    Train Feudal agent.
    """
    episode_rewards = []
    episode_lengths = []
    losses = []
    
    for episode in trange(num_episodes, desc="Training Feudal Agent"):
        obs, _ = env.reset()
        episode_reward = 0
        episode_length = 0
        
        # Reset hidden states
        agent.manager_hidden = None
        agent.worker_hidden = None
        
        trajectory = []
        
        for step in range(max_steps):
            # Select action
            action, goal = agent.select_action(obs, t=step)
            
            # Execute action
            next_obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            # Get value estimates
            state_tensor = torch.FloatTensor(obs).unsqueeze(0)
            with torch.no_grad():
                output = agent.feudal_net(state_tensor, agent.manager_hidden, agent.worker_hidden, step)
                manager_value = output['manager_value'].item()
                worker_value = output['worker_value'].item()
            
            # Store transition
            agent.store_transition(obs, action, reward, next_obs, goal, manager_value, worker_value, done)
            
            trajectory.append((obs, action, reward, next_obs, goal, done))
            
            episode_reward += reward
            episode_length += 1
            
            obs = next_obs
            
            if done:
                break
        
        # Update agent
        if len(agent.buffer) >= 32:
            loss_info = agent.update()
            losses.append(loss_info)
        
        episode_rewards.append(episode_reward)
        episode_lengths.append(episode_length)
    
    return episode_rewards, episode_lengths, losses


def train_her_agent(env, agent, num_episodes=1000, max_steps=100, 
                   epsilon_start=1.0, epsilon_end=0.1, epsilon_decay=0.995):
    """
    Train HER agent.
    """
    epsilon = epsilon_start
    episode_rewards = []
    episode_lengths = []
    losses = []
    
    for episode in trange(num_episodes, desc="Training HER Agent"):
        obs, _ = env.reset()
        episode_reward = 0
        episode_length = 0
        
        # Set goal (target position)
        goal = np.array([obs[3], obs[4]])  # Goal position from observation
        
        trajectory = []
        
        for step in range(max_steps):
            # Select action
            action = agent.select_action(obs, goal, epsilon=epsilon)
            
            # Execute action
            next_obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            # Compute achieved goal and reward
            achieved_goal = np.array([next_obs[0], next_obs[1]])  # Current position
            her_reward = agent.compute_reward(achieved_goal, goal)
            
            # Store transition
            agent.store_transition(obs, action, her_reward, next_obs, goal, achieved_goal, done)
            
            trajectory.append((obs, action, her_reward, next_obs, goal, achieved_goal, done))
            
            episode_reward += her_reward
            episode_length += 1
            
            obs = next_obs
            
            if done:
                break
        
        # Apply HER
        agent.hindsight_experience_replay(trajectory, strategy='future')
        
        # Update agent
        if len(agent.buffer) >= 32:
            loss_info = agent.update()
            losses.append(loss_info)
        
        episode_rewards.append(episode_reward)
        episode_lengths.append(episode_length)
        
        # Decay epsilon
        epsilon = max(epsilon_end, epsilon * epsilon_decay)
    
    return episode_rewards, episode_lengths, losses


def train_diayn_agent(env, agent, num_episodes=1000, max_steps=100):
    """
    Train DIAYN agent.
    """
    episode_rewards = []
    episode_lengths = []
    losses = []
    
    for episode in trange(num_episodes, desc="Training DIAYN Agent"):
        obs, _ = env.reset()
        episode_reward = 0
        episode_length = 0
        
        # Sample skill
        skill, skill_id = agent.sample_skill()
        
        for step in range(max_steps):
            # Select action
            action = agent.select_action(obs, skill)
            
            # Execute action
            next_obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            # Compute intrinsic reward
            intrinsic_reward = agent.compute_intrinsic_reward(next_obs, skill_id)
            
            # Store transition
            agent.store_transition(obs, skill_id, action, intrinsic_reward, next_obs, done)
            
            episode_reward += intrinsic_reward
            episode_length += 1
            
            obs = next_obs
            
            if done:
                break
        
        # Update agent
        if len(agent.buffer) >= 32:
            loss_info = agent.update()
            losses.append(loss_info)
        
        episode_rewards.append(episode_reward)
        episode_lengths.append(episode_length)
    
    return episode_rewards, episode_lengths, losses


def evaluate_agent(env, agent, num_episodes=100, method='option_critic'):
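    """
    Evaluate agent performance.
    
    Args:
        method: 'option_critic', 'feudal', 'her', or 'diayn'
    
    Returns a dict with the mean and standard deviation of episode reward
    and length, plus the success rate. An episode counts as a success only
    if it ends with a positive reward on its final step. Success is never
    recorded for 'diayn', which has no task goal, so its rate is always 0.
    """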
    episode_rewards = []
    episode_lengths = []
    success_rate = 0
    
    for episode in range(num_episodes):
        obs, _ = env.reset()
        episode_reward = 0
        episode_length = 0
        
        if method == 'option_critic':
            current_option = None
            option_steps = 0
            
            for step in range(100):  # Max steps
                if current_option is None or option_steps >= 10:
                    current_option, _ = agent.select_action(obs, epsilon=0.0)
                    option_steps = 0
                
                action, should_terminate = agent.select_action(obs, current_option, epsilon=0.0)
                obs, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
                
                episode_reward += reward
                episode_length += 1
                
                if should_terminate or done:
                    current_option = None
                    option_steps = 0
                else:
                    option_steps += 1
                
                if done:
                    if reward > 0:  # Success
                        success_rate += 1
                    break
        
        elif method == 'feudal':
            agent.manager_hidden = None
            agent.worker_hidden = None
            
            for step in range(100):
                action, goal = agent.select_action(obs, t=step)
                obs, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
                
                episode_reward += reward
                episode_length += 1
                
                if done:
                    if reward > 0:
                        success_rate += 1
                    break
        
        elif method == 'her':
            goal = np.array([obs[3], obs[4]])
            
            for step in range(100):
                action = agent.select_action(obs, goal, epsilon=0.0)
                obs, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
                
                achieved_goal = np.array([obs[0], obs[1]])
                her_reward = agent.compute_reward(achieved_goal, goal)
                episode_reward += her_reward
                episode_length += 1
                
                if done:
                    if her_reward > 0:
                        success_rate += 1
                    break
        
        elif method == 'diayn':
            skill, skill_id = agent.sample_skill()
            
            for step in range(100):
                action = agent.select_action(obs, skill, epsilon=0.0)
                obs, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
                
                intrinsic_reward = agent.compute_intrinsic_reward(obs, skill_id)
                episode_reward += intrinsic_reward
                episode_length += 1
                
                if done:
                    break
        
        episode_rewards.append(episode_reward)
        episode_lengths.append(episode_length)
    
    success_rate /= num_episodes
    
    return {
        'mean_reward': np.mean(episode_rewards),
        'std_reward': np.std(episode_rewards),
        'mean_length': np.mean(episode_lengths),
        'std_length': np.std(episode_lengths),
        'success_rate': success_rate
    }


def plot_training_results(results, title="Training Results"):
    """
    Plot training results for different methods.
    """
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Plot rewards
    ax1 = axes[0, 0]
    for method, data in results.items():
        rewards = data['rewards']
        # Smooth rewards
        window = min(50, len(rewards) // 10)
        if window > 1:
            smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
            ax1.plot(smoothed, label=f"{method} (smoothed)")
        else:
            ax1.plot(rewards, label=method, alpha=0.3)
    
    ax1.set_title('Episode Rewards')
    ax1.set_xlabel('Episode')
    ax1.set_ylabel('Reward')
    ax1.legend()
    ax1.grid(True)
    
    # Plot episode lengths
    ax2 = axes[0, 1]
    for method, data in results.items():
        lengths = data['lengths']
        window = min(50, len(lengths) // 10)
        if window > 1:
            smoothed = np.convolve(lengths, np.ones(window)/window, mode='valid')
            ax2.plot(smoothed, label=f"{method} (smoothed)")
        else:
            ax2.plot(lengths, label=method, alpha=0.3)
    
    ax2.set_title('Episode Lengths')
    ax2.set_xlabel('Episode')
    ax2.set_ylabel('Length')
    ax2.legend()
    ax2.grid(True)
    
    # Plot evaluation results
    ax3 = axes[1, 0]
    methods = list(results.keys())
    mean_rewards = [results[method]['eval']['mean_reward'] for method in methods]
    std_rewards = [results[method]['eval']['std_reward'] for method in methods]
    
    ax3.bar(methods, mean_rewards, yerr=std_rewards, capsize=5)
    ax3.set_title('Evaluation Rewards')
    ax3.set_ylabel('Mean Reward')
    ax3.tick_params(axis='x', rotation=45)
    
    # Plot success rates
    ax4 = axes[1, 1]
    success_rates = [results[method]['eval']['success_rate'] for method in methods]
    
    ax4.bar(methods, success_rates)
    ax4.set_title('Success Rates')
    ax4.set_ylabel('Success Rate')
    ax4.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()


print("Training and evaluation functions defined!")


In [None]:
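# Training and Evaluation Functions
# Note: these definitions replace the ones from the previous cell. num_episodes and
# max_steps are required, exploration uses each agent's default epsilon, and the
# feudal, HER, and DIAYN agents are updated after every step instead of once per episode.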

def train_option_critic(env, agent, num_episodes, max_steps):
    """Train Option-Critic agent."""
    rewards = []
    lengths = []
    losses = []
    
    for episode in trange(num_episodes, desc="Training Option-Critic"):
        obs, _ = env.reset()
        episode_reward = 0
        episode_length = 0
        current_option = None
        
        for step in range(max_steps):
            if current_option is None:
                # Select new option
                current_option, _ = agent.select_action(obs)
            
            # Get action from current option
            action, should_terminate = agent.select_action(obs, current_option)
            
            # Execute action
            next_obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            
            # Store transition
            agent.store_transition(obs, current_option, action, reward, next_obs, terminated, truncated)
            
            # Update agent
            if len(agent.buffer) >= 32:
                loss_info = agent.update()
                losses.append(loss_info)
            
            episode_reward += reward
            episode_length += 1
            
            # Check termination
            if should_terminate or done:
                current_option = None
            
            obs = next_obs
            
            if done:
                break
        
        rewards.append(episode_reward)
        lengths.append(episode_length)
    
    return rewards, lengths, losses


def train_feudal_agent(env, agent, num_episodes, max_steps):
    """Train Feudal agent."""
    rewards = []
    lengths = []
    losses = []
    
    for episode in trange(num_episodes, desc="Training Feudal"):
        obs, _ = env.reset()
        episode_reward = 0
        episode_length = 0
        
        # Reset hidden states
        agent.manager_hidden = None
        agent.worker_hidden = None
        
        for step in range(max_steps):
            # Select action
            action, goal = agent.select_action(obs, t=step)
            
            # Execute action
            next_obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            
            # Get value estimates
            state_tensor = torch.FloatTensor(obs).unsqueeze(0)
            with torch.no_grad():
                output = agent.feudal_net(state_tensor, agent.manager_hidden, agent.worker_hidden, step)
                manager_value = output['manager_value'].item()
                worker_value = output['worker_value'].item()
            
            # Store transition
            agent.store_transition(obs, action, reward, next_obs, goal, manager_value, worker_value, done)
            
            # Update agent
            if len(agent.buffer) >= 32:
                loss_info = agent.update()
                losses.append(loss_info)
            
            episode_reward += reward
            episode_length += 1
            obs = next_obs
            
            if done:
                break
        
        rewards.append(episode_reward)
        lengths.append(episode_length)
    
    return rewards, lengths, losses


def train_her_agent(env, agent, num_episodes, max_steps):
    """Train HER agent."""
    rewards = []
    lengths = []
    losses = []
    
    for episode in trange(num_episodes, desc="Training HER"):
        obs, _ = env.reset()
        episode_reward = 0
        episode_length = 0
        
        # Set goal (target position)
        goal = np.array([obs[3], obs[4]])  # Goal position from observation
        
        trajectory = []
        
        for step in range(max_steps):
            # Select action
            action = agent.select_action(obs, goal)
            
            # Execute action
            next_obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            
            # Compute reward based on goal achievement
            achieved_goal = np.array([next_obs[0], next_obs[1]])  # Current position
            her_reward = agent.compute_reward(achieved_goal, goal)
            
            # Store transition
            agent.store_transition(obs, action, her_reward, next_obs, goal, achieved_goal, done)
            
            # Add to trajectory for HER
            trajectory.append((obs, action, her_reward, next_obs, goal, achieved_goal, done))
            
            # Update agent
            if len(agent.buffer) >= 32:
                loss_info = agent.update()
                losses.append(loss_info)
            
            episode_reward += her_reward
            episode_length += 1
            obs = next_obs
            
            if done:
                break
        
        # Apply HER to trajectory
        if len(trajectory) > 1:
            agent.hindsight_experience_replay(trajectory, strategy='future')
        
        rewards.append(episode_reward)
        lengths.append(episode_length)
    
    return rewards, lengths, losses


def train_diayn_agent(env, agent, num_episodes, max_steps):
    """Train DIAYN agent."""
    rewards = []
    lengths = []
    losses = []
    
    for episode in trange(num_episodes, desc="Training DIAYN"):
        obs, _ = env.reset()
        episode_reward = 0
        episode_length = 0
        
        # Sample skill for this episode
        skill, skill_id = agent.sample_skill()
        
        for step in range(max_steps):
            # Select action
            action = agent.select_action(obs, skill)
            
            # Execute action
            next_obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            
            # Compute intrinsic reward
            intrinsic_reward = agent.compute_intrinsic_reward(next_obs, skill_id)
            
            # Store transition
            agent.store_transition(obs, skill_id, action, intrinsic_reward, next_obs, done)
            
            # Update agent
            if len(agent.buffer) >= 32:
                loss_info = agent.update()
                losses.append(loss_info)
            
            episode_reward += intrinsic_reward
            episode_length += 1
            obs = next_obs
            
            if done:
                break
        
        rewards.append(episode_reward)
        lengths.append(episode_length)
    
    return rewards, lengths, losses


def evaluate_agent(env, agent, num_episodes, agent_type, max_steps=50):
    """Evaluate agent performance.

    Returns reward, episode-length, and success statistics. An episode counts
    as a success only when the environment reports `terminated`.
    """
    rewards = []
    lengths = []
    successes = []
    
    for episode in range(num_episodes):
        obs, _ = env.reset()
        episode_reward = 0
        episode_length = 0
        success = False
        
        # Reset agent state if needed
        if agent_type == 'feudal':
            agent.manager_hidden = None
            agent.worker_hidden = None
        
        for step in range(max_steps):  # Cap on evaluation steps (default 50)
            if agent_type == 'option_critic':
                if step == 0:
                    current_option, _ = agent.select_action(obs, epsilon=0.0)
                action, should_terminate = agent.select_action(obs, current_option, epsilon=0.0)
                if should_terminate:
                    current_option = None
            elif agent_type == 'feudal':
                action, goal = agent.select_action(obs, t=step)
            elif agent_type == 'her':
                goal = np.array([obs[3], obs[4]])
                action = agent.select_action(obs, goal, epsilon=0.0)
            elif agent_type == 'diayn':
                # Use first skill for evaluation
                skill = np.zeros(agent.skill_dim)
                skill[0] = 1.0
                action = agent.select_action(obs, skill, epsilon=0.0)
            else:
                action = env.action_space.sample()
            
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
            episode_length += 1
            
            # Stop on either signal; a time-limit truncation is not a success
            if terminated or truncated:
                success = bool(terminated)
                break
        
        rewards.append(episode_reward)
        lengths.append(episode_length)
        successes.append(success)
    
    return {
        'mean_reward': np.mean(rewards),
        'std_reward': np.std(rewards),
        'mean_length': np.mean(lengths),
        'std_length': np.std(lengths),
        'success_rate': np.mean(successes),
        'rewards': rewards,
        'lengths': lengths,
        'successes': successes
    }


def plot_results(results):
    """Plot training results."""
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Plot rewards
    ax = axes[0, 0]
    for method, data in results.items():
        rewards = data['rewards']
        # Smooth rewards
        window = max(1, len(rewards) // 20)
        smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
        ax.plot(smoothed, label=method, alpha=0.8)
    ax.set_title('Training Rewards')
    ax.set_xlabel('Episode')
    ax.set_ylabel('Reward')
    ax.legend()
    ax.grid(True)
    
    # Plot episode lengths
    ax = axes[0, 1]
    for method, data in results.items():
        lengths = data['lengths']
        # Smooth lengths
        window = max(1, len(lengths) // 20)
        smoothed = np.convolve(lengths, np.ones(window)/window, mode='valid')
        ax.plot(smoothed, label=method, alpha=0.8)
    ax.set_title('Episode Lengths')
    ax.set_xlabel('Episode')
    ax.set_ylabel('Steps')
    ax.legend()
    ax.grid(True)
    
    # Plot evaluation results
    ax = axes[1, 0]
    methods = list(results.keys())
    mean_rewards = [results[method]['eval']['mean_reward'] for method in methods]
    ax.bar(methods, mean_rewards, alpha=0.7)
    ax.set_title('Mean Evaluation Rewards')
    ax.set_ylabel('Reward')
    ax.tick_params(axis='x', rotation=45)
    
    # Plot success rates
    ax = axes[1, 1]
    success_rates = [results[method]['eval']['success_rate'] for method in methods]
    ax.bar(methods, success_rates, alpha=0.7, color='green')
    ax.set_title('Success Rates')
    ax.set_ylabel('Success Rate')
    ax.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()


print("Training and evaluation functions defined!")


## 7. Experiments and Comparison

Let's run experiments comparing different hierarchical RL methods on the multi-room navigation task.


In [None]:
# @title Run Experiments

# Set experiment parameters
NUM_EPISODES = 500  # Reduced for faster execution
MAX_STEPS = 50
EVAL_EPISODES = 50

# Create environment
env = MultiRoomEnv(num_rooms=3, room_size=4)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

print("Starting Hierarchical RL Experiments...")
print(f"Environment: {state_dim}D state, {action_dim} actions")
print(f"Training: {NUM_EPISODES} episodes, {MAX_STEPS} max steps")
print(f"Evaluation: {EVAL_EPISODES} episodes")
print()

# Initialize agents
agents = {}

# Option-Critic
print("Initializing Option-Critic...")
num_options = 3
agents['Option-Critic'] = OptionLearningAgent(state_dim, num_options, action_dim)

# Feudal Networks
print("Initializing Feudal Networks...")
agents['Feudal'] = FeudalAgent(state_dim, action_dim, goal_dim=8, c=5)

# HER
print("Initializing HER...")
goal_dim = 2
agents['HER'] = HERAgent(state_dim, goal_dim, action_dim)

# DIAYN
print("Initializing DIAYN...")
skill_dim = 4
agents['DIAYN'] = DIAYNAgent(state_dim, skill_dim, action_dim)

print("All agents initialized!")
print()

# Training results storage
results = {}

# Train Option-Critic
print("Training Option-Critic...")
rewards, lengths, losses = train_option_critic(env, agents['Option-Critic'], 
                                                NUM_EPISODES, MAX_STEPS)
eval_results = evaluate_agent(env, agents['Option-Critic'], EVAL_EPISODES, 'option_critic')
results['Option-Critic'] = {
    'rewards': rewards,
    'lengths': lengths,
    'losses': losses,
    'eval': eval_results
}
print(f"Option-Critic - Mean Reward: {eval_results['mean_reward']:.2f}, "
      f"Success Rate: {eval_results['success_rate']:.2f}")
print()

# Train Feudal Networks
print("Training Feudal Networks...")
rewards, lengths, losses = train_feudal_agent(env, agents['Feudal'], 
                                              NUM_EPISODES, MAX_STEPS)
eval_results = evaluate_agent(env, agents['Feudal'], EVAL_EPISODES, 'feudal')
results['Feudal'] = {
    'rewards': rewards,
    'lengths': lengths,
    'losses': losses,
    'eval': eval_results
}
print(f"Feudal - Mean Reward: {eval_results['mean_reward']:.2f}, "
      f"Success Rate: {eval_results['success_rate']:.2f}")
print()

# Train HER
print("Training HER...")
rewards, lengths, losses = train_her_agent(env, agents['HER'], 
                                          NUM_EPISODES, MAX_STEPS)
eval_results = evaluate_agent(env, agents['HER'], EVAL_EPISODES, 'her')
results['HER'] = {
    'rewards': rewards,
    'lengths': lengths,
    'losses': losses,
    'eval': eval_results
}
print(f"HER - Mean Reward: {eval_results['mean_reward']:.2f}, "
      f"Success Rate: {eval_results['success_rate']:.2f}")
print()

# Train DIAYN
print("Training DIAYN...")
rewards, lengths, losses = train_diayn_agent(env, agents['DIAYN'], 
                                            NUM_EPISODES, MAX_STEPS)
eval_results = evaluate_agent(env, agents['DIAYN'], EVAL_EPISODES, 'diayn')
results['DIAYN'] = {
    'rewards': rewards,
    'lengths': lengths,
    'losses': losses,
    'eval': eval_results
}
print(f"DIAYN - Mean Reward: {eval_results['mean_reward']:.2f}, "
      f"Success Rate: {eval_results['success_rate']:.2f}")
print()

print("All experiments completed!")

# Plot results
print("\nPlotting results...")
plot_results(results)

# Print summary
print("\n=== EXPERIMENT SUMMARY ===")
print("Method Comparison:")
for method, data in results.items():
    eval_data = data['eval']
    print(f"{method:15} - Reward: {eval_data['mean_reward']:6.2f} ± {eval_data['std_reward']:5.2f}, "
          f"Success: {eval_data['success_rate']:5.2f}")

print("\nKey Insights:")
print("1. Hierarchical methods can learn complex behaviors through abstraction")
print("2. Different methods excel in different scenarios:")
print("   - Options: Good for temporal abstraction")
print("   - Feudal: Good for spatial abstraction") 
print("   - HER: Good for sparse reward problems")
print("   - DIAYN: Good for skill discovery without rewards")
print("3. The choice of method depends on the specific problem structure")


In [None]:
# @title Plot Results

# Plot training results (plot_results is defined in the training/evaluation cell above)
plot_results(results)

# Print detailed results
print("\n" + "="*60)
print("EXPERIMENTAL RESULTS SUMMARY")
print("="*60)

for method, data in results.items():
    eval_data = data['eval']
    print(f"\n{method}:")
    print(f"  Mean Reward: {eval_data['mean_reward']:.3f} ± {eval_data['std_reward']:.3f}")
    if 'mean_length' in eval_data:  # Only printed when evaluation reports episode lengths
        print(f"  Mean Length: {eval_data['mean_length']:.1f} ± {eval_data['std_length']:.1f}")
    print(f"  Success Rate: {eval_data['success_rate']:.3f}")

print("\n" + "="*60)
print("ANALYSIS")
print("="*60)

# Find best performing method
best_method = max(results.keys(), key=lambda k: results[k]['eval']['success_rate'])
print(f"\nBest performing method: {best_method}")
print(f"Success rate: {results[best_method]['eval']['success_rate']:.3f}")

# Compare methods
print("\nMethod Comparison:")
print("- Option-Critic: Learns temporally extended actions (options)")
print("- Feudal Networks: Manager-worker hierarchy with goal communication")
print("- HER: Goal-conditioned RL with hindsight experience replay")
print("- DIAYN: Unsupervised skill discovery through mutual information")

print("\nKey Insights:")
print("1. Hierarchical methods can learn complex behaviors through abstraction")
print("2. Different methods excel in different scenarios:")
print("   - Options: Good for temporal abstraction")
print("   - Feudal: Good for spatial abstraction")
print("   - HER: Good for sparse reward problems")
print("   - DIAYN: Good for skill discovery without rewards")
print("3. The choice of method depends on the specific problem structure")


## 8. Analysis and Questions

### Key Concepts Demonstrated

1. **Temporal Abstraction**: Options framework allows agents to plan at multiple time scales
2. **Spatial Abstraction**: Feudal hierarchies enable planning across different spatial scales
3. **Goal-Conditioned Learning**: HER enables learning from failed trajectories by relabeling them with goals that were actually achieved
4. **Skill Discovery**: DIAYN learns diverse skills without external rewards
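
To make the third concept concrete, here is a minimal sketch of "future"-strategy hindsight relabeling. It assumes the transition layout used in `train_her_agent` (`obs, action, reward, next_obs, goal, achieved_goal, done`). The helper names `sparse_reward` and `relabel_future`, the tolerance `tol`, the 0/-1 reward, and the choice to mark a relabeled transition as done when its new goal is reached are all illustrative assumptions; `HERAgent` may implement these details differently.

```python
import numpy as np


def sparse_reward(achieved_goal, goal, tol=0.5):
    """Assumed sparse reward: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if np.linalg.norm(achieved_goal - goal) <= tol else -1.0


def relabel_future(trajectory, k=4, rng=None):
    """Build extra transitions whose goals were achieved later in the episode."""
    if rng is None:
        rng = np.random.default_rng(0)
    relabeled = []
    for t, (obs, action, _, next_obs, _, achieved, _) in enumerate(trajectory):
        # Sample k time steps from t (inclusive) to the end of the episode.
        future_ids = rng.integers(t, len(trajectory), size=k)
        for idx in future_ids:
            new_goal = trajectory[idx][5]  # achieved_goal at the sampled step
            new_reward = sparse_reward(achieved, new_goal)
            reached = new_reward == 0.0
            relabeled.append(
                (obs, action, new_reward, next_obs, new_goal, achieved, reached)
            )
    return relabeled


# Toy episode: the agent walks along the x-axis and never reaches (10, 10).
original_goal = np.array([10.0, 10.0])
traj = []
for t in range(3):
    position = np.array([float(t), 0.0])
    next_position = np.array([float(t + 1), 0.0])
    traj.append((position, 0, -1.0, next_position, original_goal, next_position, False))

extra = relabel_future(traj, k=2, rng=np.random.default_rng(0))
print(len(extra), "relabeled transitions")
```

Every original transition in the toy episode has reward -1, yet the relabeled set contains transitions with reward 0, which is what gives the critic a learning signal under sparse rewards.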

### Discussion Questions

**Answer the following questions based on your experiments:**

1. **Temporal Abstraction**: How does the Options framework help with long-horizon planning compared to flat RL? What are the trade-offs?

2. **Hierarchical Communication**: In Feudal Networks, how does the manager-worker communication work? What happens when the communication is noisy or delayed?

3. **Sample Efficiency**: Why is HER so effective for sparse reward problems? How does it compare to other exploration strategies?

4. **Skill Discovery**: What kinds of skills does DIAYN discover? How could these skills be used for downstream tasks?

5. **Method Selection**: Given a new hierarchical RL problem, how would you choose between these methods? What factors would influence your decision?
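
For question 4, it may help to recall how the DIAYN intrinsic reward is formed: `r = log q(z | s') - log p(z)`, where `q` is the skill discriminator and `p(z)` is the skill prior. The sketch below assumes a uniform prior over the skills and takes raw discriminator logits as input. The function name `diayn_intrinsic_reward` is hypothetical, and `DIAYNAgent.compute_intrinsic_reward` may be implemented differently.

```python
import numpy as np


def diayn_intrinsic_reward(logits, skill_id):
    """Return log q(z | s') - log p(z), assuming a uniform prior p(z) = 1/K."""
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable log-softmax over the discriminator logits.
    shifted = logits - logits.max()
    log_q = shifted - np.log(np.sum(np.exp(shifted)))
    log_p = -np.log(len(logits))
    return float(log_q[skill_id] - log_p)


# Discriminator cannot tell the skills apart.
uninformative = diayn_intrinsic_reward([0.0, 0.0, 0.0, 0.0], skill_id=0)
# Discriminator confidently identifies skill 0, and skill 0 was active.
distinguishable = diayn_intrinsic_reward([10.0, 0.0, 0.0, 0.0], skill_id=0)
# Same prediction, but skill 1 was the active skill.
mistaken = diayn_intrinsic_reward([10.0, 0.0, 0.0, 0.0], skill_id=1)

print(uninformative, distinguishable, mistaken)
```

Under these assumptions the reward is zero when the visited state says nothing about the active skill, positive when the state identifies the skill better than chance, and negative when the state looks like it belongs to a different skill.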

### Extensions and Future Work

- **Combining Methods**: How could you combine multiple hierarchical approaches?
- **Automatic Hierarchy**: How could the hierarchy structure be learned automatically?
- **Transfer Learning**: How could skills learned with DIAYN be transferred to new tasks?
- **Real-World Applications**: What real-world problems would benefit from hierarchical RL?

### Conclusion

Hierarchical Reinforcement Learning provides powerful tools for solving complex, long-horizon tasks by decomposing them into simpler subtasks. The methods explored in this assignment demonstrate different approaches to temporal and spatial abstraction, each with its own strengths and applications.

The key insight is that **abstraction enables agents to reason at multiple levels**, from high-level strategic planning to low-level tactical execution, making complex problems more tractable and enabling better sample efficiency and generalization.
