# Tutorial 8 - Options

Please complete this tutorial to get an overview of options and an implementation of SMDP Q-Learning and Intra-Option Q-Learning.


### References:

 [Recent Advances in Hierarchical Reinforcement
Learning](https://people.cs.umass.edu/~mahadeva/papers/hrl.pdf) is a strong recommendation for topics in HRL that was covered in class. Watch Prof. Ravi's lectures on moodle or nptel for further understanding the core concepts. Contact the TAs for further resources if needed.


In [1]:
# !pip install --upgrade gym

In [2]:
'''
A bunch of imports, you don't have to worry about these
'''

import numpy as np
from tqdm import tqdm
import random
import gym
#from gym.wrappers import Monitor
import glob
import io
import matplotlib.pyplot as plt
from IPython.display import HTML

In [3]:
# '''
# The environment used here is extremely similar to the openai gym ones.
# At first glance it might look slightly different.
# The usual commands we use for our experiments are added to this cell to aid you
# work using this environment.
# '''

# #Setting up the environment
# from gym.envs.toy_text.cliffwalking import CliffWalkingEnv
# env = CliffWalkingEnv()

# env.reset()

# #Current State
# print(env.s)

# # 4x12 grid = 48 states
# print ("Number of states:", env.nS)

# # Primitive Actions
# action = ["up", "right", "down", "left"]
# #correspond to [0,1,2,3] that's actually passed to the environment

# # either go left, up, down or right
# print ("Number of actions that an agent can take:", env.nA)

# # Example Transitions
# rnd_action = random.randint(0, 3)
# print ("Action taken:", action[rnd_action])
# next_state, reward, is_terminal, t_prob, *extras = env.step(rnd_action)
# print ("Transition probability:", t_prob)
# print ("Next state:", next_state)
# print ("Reward recieved:", reward)
# print ("Terminal state:", is_terminal)
# env.render()

  and should_run_async(code)


In [4]:
# Setting up the environment with rendering mode
from gym.envs.toy_text.cliffwalking import CliffWalkingEnv
env = CliffWalkingEnv(render_mode='ansi')  # Specify the rendering mode here

env.reset()

# Current State
print(env.s)

# 4x12 grid = 48 states
print("Number of states:", env.nS)

# Primitive Actions
action = ["up", "right", "down", "left"]
# Correspond to [0,1,2,3] that's actually passed to the environment

# Either go left, up, down or right
print("Number of actions that an agent can take:", env.nA)

# Example Transitions
rnd_action = random.randint(0, 3)
print("Action taken:", action[rnd_action])
next_state, reward, is_terminal, t_prob, *extras = env.step(rnd_action)
print("Transition probability:", t_prob)
print("Next state:", next_state)
print("Reward received:", reward)
print("Terminal state:", is_terminal)
env.render()


36
Number of states: 48
Number of actions that an agent can take: 4
Action taken: left
Transition probability: False
Next state: 36
Reward received: -1
Terminal state: False


['o  o  o  o  o  o  o  o  o  o  o  o\no  o  o  o  o  o  o  o  o  o  o  o\no  o  o  o  o  o  o  o  o  o  o  o\nx  C  C  C  C  C  C  C  C  C  C  T\n\n',
 'o  o  o  o  o  o  o  o  o  o  o  o\no  o  o  o  o  o  o  o  o  o  o  o\no  o  o  o  o  o  o  o  o  o  o  o\nx  C  C  C  C  C  C  C  C  C  C  T\n\n']

#### Options
We custom define very simple options here. They might not be the logical options for this settings deliberately chosen to visualise the Q Table better.


In [5]:
# We are defining two more options here
# Option 1 ["Away"] - > Away from Cliff (ie keep going up)
# Option 2 ["Close"] - > Close to Cliff (ie keep going down)

def Away(env,state):

    optdone = False
    optact = 0

    if (int(state/12) == 0):
        optdone = True

    return [optact,optdone]

def Close(env,state):

    optdone = False
    optact = 2

    if (int(state/12) == 2):
        optdone = True

    if (int(state/12) == 3):
        optdone = True

    return [optact,optdone]


'''
Now the new action space will contain
Primitive Actions: ["up", "right", "down", "left"]
Options: ["Away","Close"]
Total Actions :["up", "right", "down", "left", "Away", "Close"]
Corresponding to [0,1,2,3,4,5]
'''

'\nNow the new action space will contain\nPrimitive Actions: ["up", "right", "down", "left"]\nOptions: ["Away","Close"]\nTotal Actions :["up", "right", "down", "left", "Away", "Close"]\nCorresponding to [0,1,2,3,4,5]\n'

# Task 1
Complete the code cell below


In [13]:
#Q-Table: (States x Actions) === (env.ns(48) x total actions(6))
q_values_SMDP = np.zeros((48,6))
#Update_Frequency Data structure? Check TODO 4
option_freq_SMDP = np.zeros(q_values_SMDP.shape)
# TODO: epsilon-greedy action selection function
def egreedy_policy(q_values, state, epsilon):
    if np.random.rand() >= epsilon:
        # Choose the action with the highest Q-value
        return np.argmax(q_values[state])
    else:
        # Choose a random action
        return np.random.randint(q_values.shape[1])


# Task 2
Below is an incomplete code cell with the flow of SMDP Q-Learning. Complete the cell and train the agent using SMDP Q-Learning algorithm.
Keep the **final Q-table** and **Update Frequency** table handy (You'll need it in TODO 4)

In [22]:
#### SMDP Q-Learning

# Add parameters you might need here
gamma = 0.9
alpha = 0.1

# Iterate over 1000 episodes
for _ in range(1000):
    state = env.reset()
    done = False
    print(f"\rEPISODE: {_}", end = "")
    # While episode is not over
    while not done:
        _state = state

        # Choose action
        action = egreedy_policy(q_values_SMDP, state, epsilon=0.1)

        # Checking if primitive action
        if action < 4:
            # Perform regular Q-Learning update for state-action pair
            # Perform regular Q-Learning update for state-action pair
            next_state, reward, done, _,a = env.step(action)
            q_values_SMDP[state, action] += alpha * (reward + gamma * np.max(q_values_SMDP[next_state]) - q_values_SMDP[state, action])
            state = next_state
            option_freq_SMDP[state][action] += 1


        # Checking if action chosen is an option
        reward_bar = 0
        tau = 0
        if action == 4: # action => Away option

            optdone = False
            while (optdone == False):

                # Think about what this function might do?
                optact,optdone = Away(env,state)
                next_state, reward, done,_,a = env.step(optact)

                # Is this formulation right? What is this term?
                reward_bar = gamma*reward_bar + reward
                reward_bar = gamma*reward_bar + reward
                tau += 1
                # Complete SMDP Q-Learning Update
                # Remember SMDP Updates. When & What do you update?
                if optdone or done:
                    q_values_SMDP[_state, action] += alpha * (reward_bar + (gamma**tau) * np.max(q_values_SMDP[next_state]) - q_values_SMDP[_state,action])
                    option_freq_SMDP[_state][action] += 1
                    state = next_state

                state = next_state

        if action == 5: # action => Close option\
            optdone = False
            while (optdone == False):
                # Think about what this function might do?
                optact ,optdone = Close(env,state)
                next_state, reward, opt,_,a = env.step(optact)
                # Is this formulation right? What is this term?
                reward_bar = gamma*reward_bar + reward
                tau += 1
                # Complete SMDP Q-Learning Update
                # Remember SMDP Updates. When & What do you update?
                if optdone:
                    q_values_SMDP[_state, action] += alpha * (reward_bar +(gamma**tau) * np.max(q_values_SMDP[next_state]) - q_values_SMDP[_state,action])
                    option_freq_SMDP[_state][action] += 1
                state = next_state

EPISODE: 999

# Task 3
Using the same options and the SMDP code, implement Intra Option Q-Learning (In the code cell below). You *might not* always have to search through options to find the options with similar policies, think about it. Keep the **final Q-table** and **Update Frequency** table handy (You'll need it in TODO 4)



In [21]:
 #Q-Table: (States x Actions) === (env.ns(48) x total actions(6))
q_values_IO = np.zeros((48,6))
#Update_Frequency Data structure? Check TODO 4
option_freq_IO = np.zeros(q_values_SMDP.shape)

In [27]:
#### Intra-Option Q-Learning

# Add parameters you might need here
gamma = 0.99
alpha = 0.1
epsilon =0.1

# Iterate over 1000 episodes
for _ in range(1000):
    state = env.reset()
    done = False
    print(f"\rEPISODE: {_}", end = "")
    # While episode is not over
    while not done:


        # Choose action
        action = egreedy_policy(q_values_SMDP, state, epsilon)

        # Checking if primitive action
        if action < 4:
          # Perform regular Q-Learning update for state-action pair
          next_state, reward, done, _,a = env.step(action)
          q_values_IO[state, action] += alpha * (reward + gamma * np.max(q_values_IO[next_state]) - q_values_IO[state, action])
          state = next_state
          option_freq_IO[state][action] += 1


        # Checking if action chosen is an option
        if action == 4: # action => Away option
            optdone = False
            while (optdone == False):
                # Think about what this function might do?
                optact,optdone = Away(env,state)
                next_state, reward, done,_,a = env.step(optact)

                # Complete IO Q Learning Update
                q_values_IO[state, optact] += alpha * (reward + gamma * np.max(q_values_IO[next_state]) - q_values_IO[state, optact])
                option_freq_IO[state, optact] += 1

                # options qvalues
                if optdone:
                    q_values_IO[state, action] += alpha * (reward + gamma * (np.max(q_values_IO[next_state])) - q_values_IO[state, action] )
                else:
                    q_values_IO[state, action] += alpha * (reward + gamma *(q_values_IO[next_state, action]) - q_values_IO[state, action] )



                option_freq_IO[state, action] += 1

                state = next_state


        if action == 5: # action => Close option\
            optdone = False
            while (optdone == False):
                # Think about what this function might do?
                optact,optdone = Close(env,state)
                next_state, reward, done,_,a = env.step(optact)
                # Is this formulation right? What is this term?
                reward_bar = gamma*reward_bar + reward
                tau += 1
                q_values_IO[state, optact] += alpha * (reward + gamma * np.max(q_values_IO[next_state]) - q_values_IO[state, optact])
                option_freq_IO[state][optact] += 1

                 #options q_values

                if optdone:
                    q_values_IO[state, action] += alpha * (reward + gamma * (np.max(q_values_IO[next_state])) - q_values_IO[state, action] )
                else:
                    q_values_IO[state, action] += alpha * (reward + gamma *(q_values_IO[next_state, action]) - q_values_IO[state, action] )

                state = next_state







EPISODE: 999

# Task 4
Compare the two Q-Tables and Update Frequencies and provide comments.

In [28]:
q_values_SMDP

array([[  -9.99999999,   -9.99999999,   -9.99999999,   -9.99999999,
         -10.89999974,   -9.99999999],
       [  -9.99999999,   -9.99999999,   -9.99999999,   -9.99999999,
         -10.8999995 , -108.99999995],
       [  -9.99999999,   -9.99999999,   -9.99999999,   -9.99999999,
         -10.89999569, -108.99999988],
       [  -9.99999999,   -9.99999999,   -9.99999999,   -9.99999999,
         -10.89999983, -108.99999995],
       [  -9.99999999,   -9.99999999,   -9.99999999,   -9.99999999,
         -10.89999992, -108.99999991],
       [  -9.99999999,   -9.99999999,   -9.99999999,   -9.99999999,
         -10.89999975, -108.99999994],
       [  -9.99999999,   -9.99999999,   -9.99999999,   -9.99999999,
         -10.89999986, -108.99999978],
       [  -9.99999999,   -9.99999999,   -9.99999999,   -9.99999999,
         -10.89999987, -108.99999995],
       [  -9.99999999,   -9.99999999,   -9.99999999,   -9.99999999,
         -10.89999982, -108.99999913],
       [  -9.99999999,   -9.99999999,

Use this text cell for your comments - Task 4


In [29]:
q_values_IO

array([[  -3.15966547,   -3.8108354 ,   -2.18148102,   -3.15803174,
          -3.15964612,   -6.0174502 ],
       [  -3.83894237,   -3.68764781,   -2.86761859,   -3.15953706,
          -3.82622423, -103.05716065],
       [  -3.67371278,   -4.64906761,   -2.77210764,   -3.04821188,
          -3.67366402,  -80.26697171],
       [  -4.65007594,   -5.59188778,   -4.12588661,   -3.69185836,
          -4.64731518, -103.06551979],
       [  -5.60457481,   -4.99225139,   -4.9748281 ,   -4.65404957,
          -5.60406491, -103.06537724],
       [  -5.02275338,   -4.07105245,   -4.07252156,   -5.60335347,
          -5.01755161, -103.06569798],
       [  -4.06382949,   -3.10207384,   -3.11260132,   -5.02953898,
          -4.05828491, -103.06568474],
       [  -3.12278193,   -2.14662614,   -2.47990669,   -3.99484429,
          -3.10032758, -103.01673621],
       [  -2.1283847 ,   -2.19491196,   -1.20846008,   -1.51515327,
          -2.12837194,  -13.90701779],
       [  -1.86360406,   -1.62022821,

In [30]:
print(np.sum(option_freq_SMDP),np.sum(option_freq_IO))

168762.0 832948.0


  and should_run_async(code)
