# Tutorial 8 - Options

Please complete this tutorial to get an overview of options and an implementation of SMDP Q-Learning and Intra-Option Q-Learning.


### References:

 [Recent Advances in Hierarchical Reinforcement
Learning](https://people.cs.umass.edu/~mahadeva/papers/hrl.pdf) is a strong recommendation for topics in HRL that was covered in class. Watch Prof. Ravi's lectures on moodle or nptel for further understanding the core concepts. Contact the TAs for further resources if needed.


In [1]:
!pip install numpy==1.23



In [2]:
!pip install gym==0.22

Collecting gym==0.22
  Downloading gym-0.22.0.tar.gz (631 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m631.1/631.1 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Building wheels for collected packages: gym
  Building wheel for gym (pyproject.toml) ... [?25l[?25hdone
  Created wheel for gym: filename=gym-0.22.0-py3-none-any.whl size=708362 sha256=6639a8a65776e73e52aa3b1e41b5d6fd26e9447de6f482e35aff13fbce3e440d
  Stored in directory: /root/.cache/pip/wheels/42/e8/e8/6dfbc92a1dcd76c1a5e2bb982750fd6b7e792239f46039e6b1
Successfully built gym
Installing collected packages: gym
  Attempting uninstall: gym
    Found existing installation: gym 0.25.2
    Uninstalling gym-0.25.2:
      Successfully uninstalled gym-0.25.2
Successfully installed gym-0.22.0


In [3]:
'''
A bunch of imports, you don't have to worry about these
'''

import numpy as np
import random
import gym
from gym.wrappers import Monitor
import glob
import io
import matplotlib.pyplot as plt
from IPython.display import HTML




In [4]:
'''
The environment used here is extremely similar to the openai gym ones.
At first glance it might look slightly different.
The usual commands we use for our experiments are added to this cell to aid you
work using this environment.
'''

#Setting up the environment
from gym.envs.toy_text.cliffwalking import CliffWalkingEnv
env = CliffWalkingEnv()

env.reset()

#Current State
print(env.s)

# 4x12 grid = 48 states
print ("Number of states:", env.nS)

# Primitive Actions
action = ["up", "right", "down", "left"]
#correspond to [0,1,2,3] that's actually passed to the environment

# either go left, up, down or right
print ("Number of actions that an agent can take:", env.nA)

# Example Transitions
rnd_action = random.randint(0, 3)
print ("Action taken:", action[rnd_action])
next_state, reward, is_terminal, t_prob = env.step(rnd_action)
print ("Transition probability:", t_prob)
print ("Next state:", next_state)
print ("Reward recieved:", reward)
print ("Terminal state:", is_terminal)
env.render()

36
Number of states: 48
Number of actions that an agent can take: 4
Action taken: left
Transition probability: {'prob': 1.0}
Next state: 36
Reward recieved: -1
Terminal state: False
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
x  C  C  C  C  C  C  C  C  C  C  T



#### Options
We custom define very simple options here. They might not be the logical options for this settings deliberately chosen to visualise the Q Table better.


In [5]:
# We are defining two more options here
# Option 1 ["Away"] - > Away from Cliff (ie keep going up)
# Option 2 ["Close"] - > Close to Cliff (ie keep going down)

def Away(env,state):

    optdone = False
    optact = 0

    if (int(state/12) == 0):
        optdone = True

    return [optact,optdone]

def Close(env,state):

    optdone = False
    optact = 2

    if (int(state/12) == 2):
        optdone = True

    return [optact,optdone]


'''
Now the new action space will contain
Primitive Actions: ["up", "right", "down", "left"]
Options: ["Away","Close"]
Total Actions :["up", "right", "down", "left", "Away", "Close"]
Corresponding to [0,1,2,3,4,5]
'''

'\nNow the new action space will contain\nPrimitive Actions: ["up", "right", "down", "left"]\nOptions: ["Away","Close"]\nTotal Actions :["up", "right", "down", "left", "Away", "Close"]\nCorresponding to [0,1,2,3,4,5]\n'

# Task 1
Complete the code cell below


In [6]:
# epsilon-greedy action selection function
def egreedy_policy(q_values, state, epsilon):
    if np.random.rand() < epsilon:
        return np.random.randint(6)
    else:
        return np.argmax(q_values[state])

# Task 2
Below is an incomplete code cell with the flow of SMDP Q-Learning. Complete the cell and train the agent using SMDP Q-Learning algorithm.
Keep the **final Q-table** and **Update Frequency** table handy (You'll need it in TODO 4)

In [7]:
#### SMDP Q-Learning
update_frequency_smdp = np.zeros((env.nS, 6))  # For tracking update frequency
cumulative_rewards_smdp = []  # For tracking cumulative rewards
# Parameters
gamma = 0.9
alpha = 0.1
epsilon = 0.1

# Q-Values initialization
q_values_SMDP = np.zeros((48, 6))

# SMDP Q-Learning with Update Frequency
for episode in range(1000):
    state = env.reset()
    done = False
    total_reward = 0

    for _ in range(1000):
        action = egreedy_policy(q_values_SMDP, state, epsilon)

        if action < 4:  # Primitive action
            next_state, reward, done, _ = env.step(action)
            best_next_action = np.argmax(q_values_SMDP[next_state])
            q_values_SMDP[state, action] += alpha * (reward + gamma * q_values_SMDP[next_state, best_next_action] - q_values_SMDP[state, action])
            update_frequency_smdp[state, action] += 1  # Update frequency
            state = next_state
        else:
            reward_bar = 0
            beta = 1  # Discounting over steps within the option
            optdone = False
            # while not optdone and not done:
            for _ in range(1000):
                if action == 4:  # "Away" option
                    optact, optdone = Away(env, state)
                elif action == 5:  # "Close" option
                    optact, optdone = Close(env, state)

                next_state, reward, done, _ = env.step(optact)
                reward_bar += beta * reward
                beta *= gamma
                state = next_state
                if done or optdone:
                  break

            q_values_SMDP[state, action] += alpha * (reward_bar - q_values_SMDP[state, action])
            update_frequency_smdp[state, action] += 1  # Update frequency

        total_reward += reward
        if done:
          break

    cumulative_rewards_smdp.append(total_reward)


# Task 3
Using the same options and the SMDP code, implement Intra Option Q-Learning (In the code cell below). You *might not* always have to search through options to find the options with similar policies, think about it. Keep the **final Q-table** and **Update Frequency** table handy (You'll need it in TODO 4)



In [8]:
#### Intra-Option Q-Learning
q_values_IOQL = np.zeros((48, 6))
update_frequency_ioql = np.zeros((env.nS, 6))  # For tracking update frequency
cumulative_rewards_intra_option = []  # For tracking cumulative rewards

# Intra-Option Q-Learning with Update Frequency
for episode in range(1000):
    state = env.reset()
    done = False
    total_reward = 0

    for i in range(1000):
        action = egreedy_policy(q_values_IOQL, state, epsilon)

        if action < 4:  # Primitive action
            next_state, reward, done, _ = env.step(action)
            best_next_action = np.argmax(q_values_IOQL[next_state])
            q_values_IOQL[state, action] += alpha * (reward + gamma * q_values_IOQL[next_state, best_next_action] - q_values_IOQL[state, action])
            update_frequency_ioql[state, action] += 1  # Update frequency
            state = next_state
        else:
            optdone = False
            # while not optdone and not done:
            for _ in range(1000):
                if action == 4:  # "Away" option
                    optact, optdone = Away(env, state)
                elif action == 5:  # "Close" option
                    optact, optdone = Close(env, state)

                next_state, reward, done, _ = env.step(optact)
                best_next_action = np.argmax(q_values_IOQL[next_state])
                q_values_IOQL[state, action] += alpha * (reward + gamma * q_values_IOQL[next_state, best_next_action] - q_values_IOQL[state, action])
                update_frequency_ioql[state, action] += 1  # Update frequency
                state = next_state
                if done or optdone:
                  break

        total_reward += reward
        if done:
          break

    cumulative_rewards_intra_option.append(total_reward)


# Task 4
Compare the two Q-Tables and Update Frequencies and provide comments.

In [11]:
# Use this cell for Task 4 Code# Task 4
# Compare the two Q-Tables and Update Frequencies and provide comments.

# Compare final Q-tables
print("Comparison of final Q-tables:")
print("SMDP Q-Learning Q-table:")
print(q_values_SMDP)
print("\nIntra-Option Q-Learning Q-table:")
print(q_values_IOQL)

# Compare update frequencies
print("\nComparison of update frequencies:")
print("SMDP Q-Learning Update Frequency:")
print(update_frequency_smdp)
print("\nIntra-Option Q-Learning Update Frequency:")
print(update_frequency_ioql )

Comparison of final Q-tables:
SMDP Q-Learning Q-table:
[[  -1.           -1.           -1.           -1.           -3.42462733
     0.        ]
 [  -0.99999983   -0.99999999   -0.99999994   -0.99999952   -1.98603141
     0.        ]
 [  -0.271        -0.19         -0.3439       -0.5217031    -1.19542509
     0.        ]
 [  -0.1          -0.1          -0.1          -0.1          -0.1
     0.        ]
 [  -0.1          -0.1          -0.1          -0.1           0.
     0.        ]
 [  -0.1          -0.1          -0.1          -0.1           0.
     0.        ]
 [  -0.1          -0.1          -0.1          -0.1           0.
     0.        ]
 [  -0.1          -0.1          -0.1          -0.1           0.
     0.        ]
 [  -0.1          -0.1          -0.1          -0.1           0.
     0.        ]
 [  -0.1          -0.1          -0.1          -0.1           0.
     0.        ]
 [  -0.1          -0.1          -0.1          -0.1           0.
     0.        ]
 [  -0.1          -0.1       

Use this text cell for your comments - Task 4


# **Inference**



### Final Q-Tables Analysis:

- **SMDP Q-Learning Q-table** suggests a strong distinction between the values assigned to primitive actions (up, right, down, left) and the two options ("Away" and "Close"). For many states, the options have significantly different Q-values compared to primitive actions, indicating that the algorithm has learned when it's advantageous to execute these options.

- **Intra-Option Q-Learning Q-table** shows a more varied distribution of Q-values across both primitive actions and options. This indicates a more dynamic use of both options and primitive actions across different states.

### Update Frequencies Analysis:

- **SMDP Q-Learning Update Frequency** shows a high frequency of updates for the options compared to primitive actions in certain states, which indicates that the options were utilized extensively during the learning process.

- **Intra-Option Q-Learning Update Frequency** displays a more balanced update frequency between primitive actions and options, which indicates more exploratory behaviore not just with options but also with primitive actions.

### Inference:
- **SMDP Q-Learning** seems to prioritize options, potentially at the expense of exploring primitive actions, which might limit policy flexibility.
- **Intra-Option Q-Learning** provides a more balanced and nuanced approach, leveraging both options and primitive actions for a potentially more adaptable and refined policy.
- The significant difference in update frequencies and Q-values across the two methods highlights the trade-off between focusing on high-level strategies (options) versus detailed, action-level decisions in hierarchical reinforcement learning.