In [1]:
import mdptoolbox
import numpy as np
from helpers import *

# Setup
In this notebook, we present a pymdptoolbox implementation of the gridworld environment defined in Lesson 2. To define an MDP in mdptoolbox, we need to construct NumPy arrays for both the transition matrix and the reward matrix.

We are going to use the transitions defined in Lesson 2, Video 5 "Quiz: The World - 2":
<img src="images/theworld.png" width="240" height="240">

And the rewards defined in Lesson 2, Video 12 "Quiz: More About Rewards - 3":
<img src="images/rewards.png" width="240" height="240">

In [2]:
# reward inputs
r_s = +2  # most states
r_g = +1  # good terminal state
r_b = -1  # bad terminal state

# transition inputs
p_intended = 0.8
p_opposite = 0.0
p_right = 0.1
p_left = 0.1

# create_mdp can be found in helpers.py
T, R = create_mdp([r_s, r_g, r_b],
                  [p_intended, p_opposite, p_right, p_left])

# Value Iteration
Once we have both matrices, we can run value iteration using mdptoolbox.
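
To make the computation concrete, here is a minimal pure-Python sketch of value iteration on a hypothetical two-state MDP (not the gridworld). The function `value_iteration`, the toy arrays `T_toy` and `R_toy`, and the stopping rule are assumptions made for illustration; mdptoolbox's own stopping criterion and bookkeeping may differ. The nested lists are indexed as `T_toy[action][state][next_state]` and `R_toy[state][action]`.

```python
def value_iteration(T, R, discount=0.9, epsilon=1e-6, max_iter=1000):
    """Return (V, policy) for an MDP given as nested lists.

    T[a][s][s2] is the probability of moving from s to s2 under action a.
    R[s][a] is the reward for taking action a in state s.
    """
    n_actions, n_states = len(T), len(T[0])
    V = [0.0] * n_states
    for _ in range(max_iter):
        # Bellman backup: Q(s, a) = R(s, a) + discount * sum over s2 of T * V
        Q = [[R[s][a] + discount * sum(T[a][s][s2] * V[s2]
                                       for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        V_new = [max(q) for q in Q]
        delta = max(abs(new - old) for new, old in zip(V_new, V))
        V = V_new
        if delta < epsilon:  # values have stopped changing
            break
    # Greedy policy with respect to the last set of Q-values
    policy = [q.index(max(q)) for q in Q]
    return V, policy


# Hypothetical toy MDP: action 0 stays put, action 1 switches state,
# and only staying in state 1 pays a reward.
T_toy = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
]
R_toy = [
    [0.0, 0.0],  # state 0: no reward for either action
    [1.0, 0.0],  # state 1: reward 1 for staying
]

V_toy, policy_toy = value_iteration(T_toy, R_toy)
print(policy_toy)
```

In this toy problem the greedy policy switches out of state 0 and then stays in state 1, which is the only way to keep collecting reward.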

Note: we've defined the transition matrix T such that up is 0, right is 1, down is 2 and left is 3.
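
One way to picture how this action encoding combines with the transition inputs is sketched below. It assumes that "right" and "left" are relative to the intended direction, so that with the clockwise ordering up, right, down, left they correspond to adding 1 and 3 modulo 4. The function `outcome_distribution` is hypothetical and is not taken from helpers.py, whose implementation may differ.

```python
ACTIONS = ["up", "right", "down", "left"]  # indices 0..3, as in the note above


def outcome_distribution(action, p_intended, p_opposite, p_right, p_left):
    """Map an intended action index to {actual direction index: probability}.

    Assumption: right/left are relative to the intended direction.
    """
    return {
        action: p_intended,
        (action + 1) % 4: p_right,     # one step clockwise
        (action + 2) % 4: p_opposite,  # half turn
        (action + 3) % 4: p_left,      # one step counter-clockwise
    }


# Intending to go up (0) with the probabilities used in this notebook
dist = outcome_distribution(0, 0.8, 0.0, 0.1, 0.1)
for direction, prob in sorted(dist.items()):
    print(ACTIONS[direction], prob)
```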

In [3]:
#create object, undiscounted for now
vi = mdptoolbox.mdp.ValueIteration(T, R, discount=1)

#run value iteration silently
vi.setSilent()
vi.run()

#print policy found by value iteration
print(np.array(vi.policy).reshape((3,4)))

[[0 0 3 0]
 [0 0 3 0]
 [1 0 3 2]]


# Finding Q-values
However, the policy alone doesn't tell us when the choice of action doesn't matter, because it reports a single action per state even when every action is equally good. We need to look at the Q (state-action) values to see if all actions for a given state have the same value.

To explore this we will use the function get_q_values (found in helpers.py). This function outputs a policy where "-1" denotes that all actions have the same Q-value, with equality judged to the precision of the mdp object's epsilon value.
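
The tie check itself can be sketched in a few lines. This is only an illustration of the idea, not the code in helpers.py: the function `policy_with_ties` and its `epsilon` default are assumptions, and the sample rows are the Q-values that this notebook prints for states 0, 2 and 11.

```python
def policy_with_ties(q_values, epsilon=0.01):
    """Return one entry per state: the best action index, or -1 on a full tie.

    q_values is a list with one list of Q-values per state.
    A state counts as tied when its Q-values span at most epsilon.
    """
    policy = []
    for q in q_values:
        if max(q) - min(q) <= epsilon:
            policy.append(-1)  # every action is equally good
        else:
            policy.append(q.index(max(q)))
    return policy


sample_q = [
    [2002.0, 2002.0, 2002.0, 2002.0],  # state 0
    [1902.0, 1202.0, 1902.0, 2002.0],  # state 2
    [-398.0, 1702.0, 2002.0, 1702.0],  # state 11
]
print(policy_with_ties(sample_q))
```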

With verbose on, Q-values for each state are also presented. State numbers are as follows:

|  |  |   |   |
|---|---|----|----|
| 0 | 1 | 2  | 3  |  
| 4 | 5 | 6  | 7  |
| 8 | 9 | 10 | 11 |
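
The numbering in the table is row-major, which is also the order `reshape((3,4))` uses. A small sketch of the conversion between a state number and its grid cell, using hypothetical helper names `state_to_cell` and `cell_to_state`:

```python
N_ROWS, N_COLS = 3, 4  # grid dimensions used throughout this notebook


def state_to_cell(state):
    """Convert a state number into a zero-based (row, column) pair."""
    return divmod(state, N_COLS)


def cell_to_state(row, col):
    """Convert a zero-based (row, column) pair back into a state number."""
    return row * N_COLS + col


print(state_to_cell(10))
print(cell_to_state(1, 3))
```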

In [4]:
#print policy with -1 where all actions have same Q value
#in verbose mode we will also see the Q values themselves
vi.setVerbose()
print("Q-values:\n")
policy = get_q_values(vi).reshape((3,4))
print("\n=====================\nPolicy:\n")
print(policy)

Q-values:

State 0: [ 2002.  2002.  2002.  2002.]
State 1: [ 2002.  2002.  2002.  2002.]
State 2: [ 1902.  1202.  1902.  2002.]
State 3: [ 1001.  1001.  1001.  1001.]
State 4: [ 2002.  2002.  2002.  2002.]
State 5: [ 100100.  100100.  100100.  100100.]
State 6: [ 1702.  -398.  1702.  2002.]
State 7: [-1001. -1001. -1001. -1001.]
State 8: [ 2002.  2002.  2002.  2002.]
State 9: [ 2002.  2002.  2002.  2002.]
State 10: [ 2002.  2002.  2002.  2002.]
State 11: [ -398.  1702.  2002.  1702.]

Policy:

[[-1 -1  3 -1]
 [-1 -1  3 -1]
 [-1 -1 -1  2]]


# Conclusions

State 10's Q-values are all the same. But what if we:

1. Change the transition matrix by making p_intended = .7 and p_opposite = .1?
2. Take it further and make p_intended = .7999 and p_opposite = .0001?

Feel free to play around with this notebook and the functions in helpers.py to build more intuition about MDPs.