# HW12: Hierarchical Reinforcement Learning

**Course:** Deep Reinforcement Learning  
**Assignment:** Homework 12 - Hierarchical RL  
**Date:** 2024

---

## Overview

Hierarchical Reinforcement Learning (HRL) structures policies across multiple levels of abstraction, enabling agents to solve complex, long-horizon tasks by decomposing them into simpler subtasks. This assignment explores temporal abstraction, the options framework, feudal architectures, and goal-conditioned policies.

## Learning Objectives

1. **Temporal Abstraction**: Understand multi-scale decision making
2. **Options Framework**: Master semi-Markov decision processes
3. **Feudal Hierarchies**: Learn manager-worker architectures
4. **Goal-Conditioned RL**: Train policies with diverse goals
5. **Skill Discovery**: Learn reusable primitives automatically
6. **Credit Assignment**: Address challenges across temporal scales

## Table of Contents

1. [Introduction to Hierarchical RL](#introduction)
2. [Options Framework](#options-framework)
3. [Feudal Hierarchies](#feudal-hierarchies)
4. [Goal-Conditioned RL](#goal-conditioned-rl)
5. [Skill Discovery](#skill-discovery)
6. [HAM Framework](#ham-framework)
7. [Evaluation and Comparison](#evaluation)
8. [Conclusion](#conclusion)


In [None]:
# Import necessary libraries
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict, deque
import gym
import random
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")


## 1. Introduction to Hierarchical RL

### Motivation for Hierarchy

**Challenges in Flat RL:**
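- **Long Horizons**: Credit assignment is difficult when rewards arrive 1000+ steps after the actions that caused them
- **Sparse Rewards**: Random exploration rarely reaches a rewarding state
- **Complex Tasks**: Planning over atomic actions alone does not scale
- **Transfer**: Learned behaviors are hard to reuse across tasks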

**Benefits of Hierarchy:**
- Temporal abstraction (plan at multiple scales)
- Reusable skills/subpolicies
- Exploration structure
- Transfer learning
- Compositional generalization

### Human Example
```
Task: Make dinner
├─ Shop for ingredients
│  ├─ Drive to store
│  ├─ Find items
│  └─ Checkout
├─ Prepare food
│  ├─ Chop vegetables
│  ├─ Cook proteins
│  └─ Mix ingredients
└─ Serve meal
```


In [None]:
# Exercise 1.1: Create a Simple GridWorld Environment for HRL
class GridWorld:
    """
    A simple grid world environment for testing hierarchical RL algorithms.
    The agent must navigate from start to goal, potentially using hierarchical actions.
    """
    
    def __init__(self, width=10, height=10, start=(0, 0), goal=(9, 9)):
        self.width = width
        self.height = height
        self.start = start
        self.goal = goal
        self.state = start
        
        # Define atomic actions: up, down, left, right
        self.atomic_actions = [(0, 1), (0, -1), (-1, 0), (1, 0)]
        self.action_names = ['up', 'down', 'left', 'right']
        
        # Define hierarchical actions (options)
        self.options = {}
        
    def reset(self):
        """Reset environment to initial state"""
        self.state = self.start
        return self.state
        
    def step(self, action):
        """Take a step in the environment"""
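        if isinstance(action, (int, np.integer)):
            # Atomic action: index into self.atomic_actions
            # (NumPy integer types are accepted as well as Python ints)
            dx, dy = self.atomic_actions[action]
            new_x = max(0, min(self.width-1, self.state[0] + dx))
            new_y = max(0, min(self.height-1, self.state[1] + dy))
            self.state = (new_x, new_y)
        else:
            # Hierarchical action (option): `action` is the (x, y) cell where
            # the option ends, so the agent jumps there in a single step
            self.state = action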
            
        # Calculate reward
        reward = 1.0 if self.state == self.goal else -0.01
        
        # Check if done
        done = self.state == self.goal
        
        return self.state, reward, done, {}
    
    def get_distance_to_goal(self, state):
        """Calculate Manhattan distance to goal"""
        return abs(state[0] - self.goal[0]) + abs(state[1] - self.goal[1])
    
    def render(self):
        """Render the current state"""
        grid = np.zeros((self.height, self.width))
        grid[self.start[1], self.start[0]] = 1  # Start
        grid[self.goal[1], self.goal[0]] = 2     # Goal
        grid[self.state[1], self.state[0]] = 3   # Current position
        
        plt.figure(figsize=(8, 8))
        # origin='lower' puts y=0 at the bottom, so the 'up' action (+y) is drawn upward
        plt.imshow(grid, cmap='viridis', origin='lower')
        plt.title(f"GridWorld - Current: {self.state}, Goal: {self.goal}")
        plt.show()

# Test the environment
env = GridWorld()
print("GridWorld Environment Created!")
print(f"Start: {env.start}, Goal: {env.goal}")
print(f"Atomic actions: {env.action_names}")

# Test a few steps
state = env.reset()
print(f"Initial state: {state}")

for i in range(5):
    action = np.random.randint(4)  # Random atomic action
    next_state, reward, done, _ = env.step(action)
    print(f"Step {i+1}: Action {env.action_names[action]} -> State {next_state}, Reward {reward:.2f}")
    if done:
        print("Goal reached!")
        break


## 2. Options Framework

### Formal Definition

An **Option** is a temporally extended action defined as:
```
Option ω = (I_ω, π_ω, β_ω)

where:
- I_ω ⊆ S: Initiation set (where option can start)
- π_ω: S × A → [0,1]: Option policy
- β_ω: S → [0,1]: Termination function
```

### Semi-Markov Decision Process (SMDP)

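Instead of choosing a primitive action at every step, the agent chooses an option and executes it until it terminates. Because an option can last a variable number of steps, the resulting decision process is semi-Markov rather than Markov.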
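
The call-and-return execution described here can be sketched as follows. This is a minimal illustration, not the assignment's reference implementation: `corridor_step` is a hypothetical 1-D environment, and `run_option`, `policy`, and `beta` are names chosen for this example. The function returns the state where the option stopped, the discounted reward accumulated along the way, and the number of primitive steps `k`, which is what an SMDP-level learner needs.

```python
import random
from typing import Callable, Optional, Tuple


def corridor_step(state: int, action: int, length: int = 6) -> Tuple[int, float, bool]:
    """Hypothetical 1-D corridor: action 1 moves right, any other action moves left."""
    delta = 1 if action == 1 else -1
    next_state = max(0, min(length - 1, state + delta))
    done = next_state == length - 1      # the rightmost cell is the goal
    reward = 1.0 if done else -0.01      # same reward scheme as GridWorld
    return next_state, reward, done


def run_option(
    state: int,
    policy: Callable[[int], int],        # option policy: state -> primitive action
    beta: Callable[[int], float],        # termination probability in each state
    step_fn: Callable[[int, int], Tuple[int, float, bool]] = corridor_step,
    gamma: float = 0.99,
    max_steps: int = 100,
    rng: Optional[random.Random] = None,
) -> Tuple[int, float, int, bool]:
    """Run one option until it terminates; return (state, discounted return, k, done)."""
    rng = rng or random.Random(0)
    total, discount, k, done = 0.0, 1.0, 0, False
    while k < max_steps:
        action = policy(state)
        state, reward, done = step_fn(state, action)
        total += discount * reward       # accumulates r_1 + gamma*r_2 + ...
        discount *= gamma
        k += 1
        # Stop when the episode ends or the termination function fires
        if done or rng.random() < beta(state):
            break
    return state, total, k, done


# Usage: an option that walks right and terminates once it reaches cell 3
end_state, option_return, k, done = run_option(
    0, policy=lambda s: 1, beta=lambda s: 1.0 if s >= 3 else 0.0
)
print(end_state, k, done)  # 3 3 False
```

The `max_steps` cap is an added safeguard for options whose termination function never fires; it is not part of the formal definition.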

**Option-Value Functions:**
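- Q(s, ω) = Expected return from initiating option ω in state s, following it until termination, and then continuing under the policy over options
- Intra-option learning: Q(s, ω) can be updated after every primitive step while the option is still executing, rather than only at termination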
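
The two ways of learning Q(s, ω) can be sketched in tabular form. This is a hedged illustration under assumptions: the function names, the `(state, option)` dictionary keys, and the hyperparameter values are chosen for this example. `smdp_q_update` applies one update per completed option using the discounted reward and duration `k`; `intra_option_target` computes the standard one-step intra-option Q-learning target, which mixes "continue with ω" and "switch to the best option" according to the termination probability at the next state.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def smdp_q_update(q: QTable, state, option, cum_reward: float, k: int,
                  next_state, options: Sequence, alpha: float = 0.1,
                  gamma: float = 0.99, done: bool = False) -> float:
    """SMDP Q-learning: Q(s,w) += alpha * (R + gamma^k * max_w' Q(s',w') - Q(s,w))."""
    bootstrap = 0.0 if done else max(q[(next_state, o)] for o in options)
    target = cum_reward + (gamma ** k) * bootstrap
    q[(state, option)] += alpha * (target - q[(state, option)])
    return q[(state, option)]


def intra_option_target(q: QTable, reward: float, next_state, option,
                        beta_next: float, options: Sequence,
                        gamma: float = 0.99, done: bool = False) -> float:
    """One-step target r + gamma * U(s', w), usable while the option is still running."""
    if done:
        return reward
    continue_value = q[(next_state, option)]                  # option keeps running
    switch_value = max(q[(next_state, o)] for o in options)   # option terminates
    u = (1.0 - beta_next) * continue_value + beta_next * switch_value
    return reward + gamma * u


# Usage with two hypothetical options "a" and "b"
q: QTable = defaultdict(float)
q[(3, "a")], q[(3, "b")] = 1.0, 2.0
new_q = smdp_q_update(q, state=0, option="a", cum_reward=-0.03, k=3,
                      next_state=3, options=["a", "b"], alpha=0.5, gamma=0.9)
target = intra_option_target(q, reward=-0.01, next_state=3, option="a",
                             beta_next=0.25, options=["a", "b"], gamma=0.9)
print(round(new_q, 3), round(target, 3))
```

The SMDP update treats the option as a black box and learns only at termination, while the intra-option target lets every primitive transition contribute, which generally makes better use of the same experience.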


In [None]:
# Exercise 2.1: Implement Handcrafted Options
class Option:
    """
    A handcrafted option for navigation tasks.
    Each option represents a skill like "move towards goal" or "explore area".
    """
    
    def __init__(self, name, initiation_set, policy_func, termination_func):
        self.name = name
        self.initiation_set = initiation_set
        self.policy_func = policy_func
        self.termination_func = termination_func
        
    def can_initiate(self, state):
        """Check if option can be initiated in given state"""
        return state in self.initiation_set
    
    def get_action(self, state):
        """Get action from option policy"""
        return self.policy_func(state)
    
    def should_terminate(self, state):
        """Check if option should terminate"""
        return self.termination_func(state)

class NavigateToGoalOption(Option):
    """Option that navigates towards a specific goal location"""
    
    def __init__(self, goal_location, env):
        self.goal = goal_location
        self.env = env
        
        # Can initiate from any state
        initiation_set = set()
        for x in range(env.width):
            for y in range(env.height):
                initiation_set.add((x, y))
        
        super().__init__(
            name=f"NavigateTo{goal_location}",
            initiation_set=initiation_set,
            policy_func=self._navigate_policy,
            termination_func=self._termination_condition
        )
    
    def _navigate_policy(self, state):
        """Policy: step along the axis with the larger remaining distance to the goal"""
        current_x, current_y = state
        goal_x, goal_y = self.goal
        
        # Calculate direction to goal
        dx = goal_x - current_x
        dy = goal_y - current_y
        
        # Choose action that moves towards goal
        if abs(dx) > abs(dy):
            return 3 if dx > 0 else 2  # right or left
        else:
            return 0 if dy > 0 else 1  # up or down
    
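    def _termination_condition(self, state):
        """Terminate within one step (Manhattan distance) of this option's goal"""
        # Measure distance to self.goal rather than env.goal, so the option also
        # terminates correctly when its target is an intermediate subgoal
        distance = abs(state[0] - self.goal[0]) + abs(state[1] - self.goal[1])
        return distance <= 1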

class ExploreOption(Option):
    """Option that explores the environment randomly"""
    
    def __init__(self, env, exploration_steps=5):
        self.env = env
        self.exploration_steps = exploration_steps
        self.steps_taken = 0
        
        # Can initiate from any state
        initiation_set = set()
        for x in range(env.width):
            for y in range(env.height):
                initiation_set.add((x, y))
        
        super().__init__(
            name="Explore",
            initiation_set=initiation_set,
            policy_func=self._explore_policy,
            termination_func=self._termination_condition
        )
    
    def _explore_policy(self, state):
        """Policy: random exploration"""
        self.steps_taken += 1
        return np.random.randint(4)  # Random atomic action
    
    def _termination_condition(self, state):
        """Terminate after exploration_steps"""
        if self.steps_taken >= self.exploration_steps:
            self.steps_taken = 0  # Reset for next use
            return True
        return False

# Test handcrafted options
env = GridWorld()
goal_option = NavigateToGoalOption((9, 9), env)
explore_option = ExploreOption(env, exploration_steps=3)

print("Handcrafted Options Created:")
print(f"1. {goal_option.name}")
print(f"2. {explore_option.name}")

# Test goal navigation option
state = env.reset()
print(f"\nTesting {goal_option.name}:")
print(f"Initial state: {state}")

# The initiation set only restricts where an option may start, so check it once
if goal_option.can_initiate(state):
    # From (0, 0), the option needs 17 steps to get within distance 1 of (9, 9)
    for step in range(20):
        action = goal_option.get_action(state)
        next_state, reward, done, _ = env.step(action)
        print(f"Step {step+1}: Action {env.action_names[action]} -> State {next_state}")

        if goal_option.should_terminate(next_state):
            print("Option terminated!")
            break
        state = next_state
else:
    print("Cannot initiate option from current state")
