In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np

from simple_grid import simple_grid as gridworld
from simple_grid_agent import GridworldAgent as Agent

Read through all the classes and functions defined in the `simple_grid` environment and in `GridworldAgent` to familiarize yourself with the details of this assignment.

Consider a simple gridworld where actions do not result in deterministic state changes. We specify a $20\%$ probability that the selected action results in a stochastic state transition.

In [2]:
#stochastic environment
env = gridworld(wind_p=0.2)
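
The exact mechanics of `wind_p` live inside `simple_grid` and are not shown in this notebook. As a rough sketch, assuming the wind simply replaces the chosen action with a uniformly random one (this is an assumption, and `windy_action` is a hypothetical helper, not part of the assignment code), the effect could look like this:

```python
import numpy as np

def windy_action(action, n_actions, wind_p, rng):
    """Return the action actually executed.

    ASSUMPTION: with probability wind_p the intended action is replaced
    by a uniformly random action; otherwise it is executed as chosen.
    """
    if rng.random() < wind_p:
        return int(rng.integers(n_actions))
    return action

rng = np.random.default_rng(0)
taken = [windy_action(3, 4, 0.2, rng) for _ in range(10_000)]
# Fraction of steps on which the intended action (3) was executed.
frac_intended = float(np.mean(np.array(taken) == 3))
print(round(frac_intended, 2))
```

Under this assumption the intended action is still executed somewhat more often than $80\%$ of the time, because the random replacement can pick the same action again.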

The following commands will help you familiarize yourself with the different components of the gridworld.

In [3]:
print('\n Reward For each Tile \n')
env.print_reward()


 Reward For each Tile 


----------
0 |0 |0 |
----------
0 |-5 |5 |
----------
0 |0 |0 |

Check out the set of possible actions for the grid.

In [4]:
print('\n Set of possible actions in numerical form. These are actual inputs to the gridworld agent \n')
print(env.action_space)

print('\n Set of possible actions in the grid in text form. They map 1 to 1 from numbers above to direction \n')
print(env.action_text)


 Set of possible actions in numerical form. These are actual inputs to the gridworld agent 

[0 1 2 3]

 Set of possible actions in the grid in text form. They map 1 to 1 from numbers above to direction 

['U' 'L' 'D' 'R']
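
To make the index-to-direction mapping concrete, here is a small sketch of how each action might move the agent. It assumes states are `(row, col)` tuples with row 0 at the top of the printed grid, which is consistent with the policy used later in this notebook sending `(0, 2)` down and `(2, 2)` up toward the +5 tile at `(1, 2)`. Clamping at the walls is an assumption, and `ACTION_DELTAS` and `move` are hypothetical names.

```python
# (row, col) offsets for actions 0..3, i.e. 'U', 'L', 'D', 'R'.
# ASSUMPTION: row 0 is the top row, so 'U' decreases the row index.
ACTION_DELTAS = {0: (-1, 0), 1: (0, -1), 2: (1, 0), 3: (0, 1)}

def move(state, action, n_rows=3, n_cols=3):
    """Apply an action to a (row, col) state.

    ASSUMPTION: moving into a wall leaves the agent in place.
    """
    d_row, d_col = ACTION_DELTAS[action]
    row = min(max(state[0] + d_row, 0), n_rows - 1)
    col = min(max(state[1] + d_col, 0), n_cols - 1)
    return (row, col)

print(move((0, 2), 2))  # 'D' from the top-right corner
print(move((0, 0), 0))  # 'U' from the top-left corner hits the wall
```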


Consider a policy that tries to reach the goal state (+5) as fast as possible. Below we define this policy so that we can evaluate its state values.

In [5]:
#stochastic environment
env = gridworld(wind_p=0.2)

#initial policy
policy_fast = {(0, 0): 3,
          (0, 1): 3,
          (0, 2): 2,
          (1, 0): 3,
          (1, 1): 3,
          (1, 2): 0,
          (2, 0): 3,
          (2, 1): 0,
          (2, 2): 0}

#stochastic agent - epsilon greedy with decays
a = Agent(env, policy = policy_fast, gamma = 0.9, 
            start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)

print('\n Policy: Fastest Path to Goal State(Does not take reward into consideration) \n')
a.print_policy()


 Policy: Fastest Path to Goal State(Does not take reward into consideration) 


----------
R |R |D |
----------
R |R |U |
----------
R |U |U |
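
The agent is created with `start_epsilon`, `end_epsilon`, and `epsilon_decay`, but the decay rule itself is defined in `simple_grid_agent.py` and is not shown here. One common choice is a multiplicative decay with a floor; the sketch below assumes that rule, and `epsilon_schedule` is a hypothetical helper used only for illustration.

```python
def epsilon_schedule(start_epsilon, end_epsilon, epsilon_decay, n_episodes):
    """List the exploration rate used in each episode.

    ASSUMPTION: epsilon is multiplied by epsilon_decay after every
    episode and never drops below end_epsilon.
    """
    eps = start_epsilon
    schedule = []
    for _ in range(n_episodes):
        schedule.append(eps)
        eps = max(end_epsilon, eps * epsilon_decay)
    return schedule

# Same hyperparameters as the agent above.
eps = epsilon_schedule(0.9, 0.3, 0.9, 15)
print([round(e, 3) for e in eps])
```

With these settings the exploration rate would reach its floor of 0.3 after roughly a dozen episodes, so most of a long run would use a constant exploration rate.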

**Q1**

Implement the `get_v` and `get_q` methods in `simple_grid_agent.py` to estimate the state value and the state-action value. These may be used later on for debugging your code.

**Q2** 

The Monte Carlo rollout itself has been implemented in `simple_grid_agent.py` inside the `run_episode` method.

**Implement** 

First-visit as well as any-visit Monte Carlo state-value estimation equations inside `mc_predict_v` in `simple_grid_agent.py`.
These have been discussed in class. Refer to Sutton and Barto Chapter 5 for further details to implement them.
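
As a standalone illustration of the difference between the two estimators, the sketch below averages discounted returns over a list of recorded episodes. It is not the assignment's `mc_predict_v`: the function name `mc_state_values`, the episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state), and the toy episode are all assumptions.

```python
from collections import defaultdict

def mc_state_values(episodes, gamma=0.9, first_visit=True):
    """Average the discounted return observed from each state.

    episodes: list of episodes, each a list of (state, reward) pairs.
    first_visit=True counts only the first occurrence of a state per
    episode; first_visit=False counts every occurrence.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Time step at which each state first appears in this episode.
        first_idx = {}
        for t, (state, _) in enumerate(episode):
            first_idx.setdefault(state, t)
        g = 0.0
        # Walk backwards so the return accumulates as G_t = r + gamma * G_{t+1}.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            g = reward + gamma * g
            if first_visit and first_idx[state] != t:
                continue
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Toy episode in which state "A" is visited twice.
toy = [[("A", 0.0), ("B", 0.0), ("A", 5.0)]]
v_first = mc_state_values(toy, first_visit=True)
v_every = mc_state_values(toy, first_visit=False)
print(v_first, v_every)
```

In the toy episode the two estimates differ only for the revisited state "A": first-visit uses the return from its first occurrence, while every-visit averages the returns from both occurrences.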

Test and report the results inside this notebook using the following commands. Are there significant differences in the state values under any-visit and first-visit MC prediction? Why?

**ANS: With the agent reset before each run, first-visit and any-visit MC yield nearly identical values for every state (the estimates below differ by at most about 0.1). Both estimators converge to the same state values as the number of visits to each state goes to infinity, and 10,000 iterations appear to be large enough for a problem this small, so both methods have effectively converged.**

NB: assume any-visit and every-visit are interchangeable terms.

In [6]:
# reset agent to compute first visit
a = Agent(env, policy = policy_fast, gamma = 0.9, 
            start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)

# evaluate state values for policy_fast for both first-visit and any-visit
print('\n State Values for first_visit MC state estimation \n')
a.mc_predict_v()
a.print_v()


#Reset the agent to compute anyvisit.
a = Agent(env, policy = policy_fast, gamma = 0.9, 
            start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)

print('\n State Values for any_visit MC state estimation \n')
a.mc_predict_v(first_visit=False)
a.print_v()


 State Values for first_visit MC state estimation 


---------------
-1.0 |0.8 |2.8 |
---------------
-3.6 |1.9 |0 |
---------------
-3.9 |-3.3 |2.5 |
 State Values for any_visit MC state estimation 


---------------
-1.1 |0.8 |2.9 |
---------------
-3.7 |1.9 |0 |
---------------
-4.0 |-3.4 |2.5 |

**Q3** 

The Monte Carlo rollout itself has been implemented in `simple_grid_agent.py` inside the `run_episode` method.

**Implement** 

First-visit as well as any-visit Monte Carlo state-action value estimation equations inside `mc_predict_q` in `simple_grid_agent.py`.
These have been discussed in class. Refer to Sutton and Barto Chapter 5 for further details to implement them.
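
For state-action values the same idea applies to `(state, action)` pairs. The sketch below keeps one array of action values per state, matching how `a.q` is printed in this notebook, and uses an incremental running mean instead of storing all returns. The function name `mc_action_values`, the `(state, action, reward)` episode format, and the toy episode are assumptions made for illustration only.

```python
from collections import defaultdict
import numpy as np

def mc_action_values(episodes, n_actions=4, gamma=0.9, first_visit=True):
    """Estimate Q(s, a) as a running mean of discounted returns.

    episodes: list of episodes, each a list of (state, action, reward).
    """
    q = defaultdict(lambda: np.zeros(n_actions))
    n = defaultdict(lambda: np.zeros(n_actions, dtype=int))
    for episode in episodes:
        # Time step at which each (state, action) pair first appears.
        first_idx = {}
        for t, (state, action, _) in enumerate(episode):
            first_idx.setdefault((state, action), t)
        g = 0.0
        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            g = reward + gamma * g
            if first_visit and first_idx[(state, action)] != t:
                continue
            n[state][action] += 1
            # Incremental mean: Q <- Q + (G - Q) / N
            q[state][action] += (g - q[state][action]) / n[state][action]
    return dict(q)

# Hypothetical episode: R, R, then D into the goal with reward 5.
toy = [[((0, 0), 3, 0.0), ((0, 1), 3, 0.0), ((0, 2), 2, 5.0)]]
q_toy = mc_action_values(toy)
for state, values in q_toy.items():
    print(state, values)
```

Pairs that never occur in any episode keep a value of zero, which is why good coverage of the action space (here provided by the epsilon-greedy exploration) matters for Q-value prediction.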

Test and report the results inside this notebook using the following commands. Are there significant differences in the state-action values under any-visit and first-visit MC Q-value prediction? Why?

**ANS: With the agent reset before each run, first-visit and any-visit MC yield broadly similar values for most state-action pairs. Both estimators converge to the same state-action values as the number of visits to each pair goes to infinity, and 10,000 iterations appear to be large enough for a problem this small, so both methods have effectively converged.**

In [7]:
#Reset Agent for the first value prediction
a = Agent(env, policy = policy_fast, gamma = 0.9, 
            start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)

# evaluate state action values for policy_fast
print('\n State action Values for first_visit MC state action estimation \n')
a.mc_predict_q()
print('\n Actions', env.action_text, '\n')
for i in a.q: print(i,a.q[i])

#reset agent for any visit or multi visit
a = Agent(env, policy = policy_fast, gamma = 0.9, 
            start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)
# evaluate state action values for policy_fast
print('\n State action Values for any_visit MC state action estimation \n')
a.mc_predict_q(first_visit=False)
print('\n Actions', env.action_text, '\n')
for i in a.q: print(i,a.q[i])


 State action Values for first_visit MC state action estimation 


 Actions ['U' 'L' 'D' 'R'] 

(2, 0) [-4.16750273 -4.51888798 -4.36521357 -3.79427285]
(2, 1) [-3.66475661 -4.47381768 -4.02167861  1.0279243 ]
(1, 1) [-0.4765078  -4.22739671 -3.98085441  3.332586  ]
(0, 0) [-1.22337933 -2.66102216 -4.68203987 -0.82394693]
(0, 1) [-0.83928958 -2.35159725 -3.97763362  1.48407266]
(0, 2) [ 1.45533301 -0.47113593  3.30069162  1.58107833]
(2, 2) [ 3.31022766 -4.0348658   0.94273806  1.30462188]
(1, 0) [-2.47429108 -4.26871391 -4.52783806 -3.70494007]
(1, 2) [0. 0. 0. 0.]

 State action Values for any_visit MC state action estimation 


 Actions ['U' 'L' 'D' 'R'] 

(2, 0) [-3.95821625 -4.45990965 -4.6544496  -3.7249584 ]
(1, 0) [-1.75911673 -4.03550175 -4.62940037 -3.39973448]
(2, 1) [-3.55405506 -4.54431308 -3.80355254  1.0367748 ]
(2, 2) [ 3.28200222 -4.02699107  1.07404017  0.95232936]
(1, 1) [-0.47115175 -4.19427461 -3.80856445  3.33069783]
(0, 2) [ 1.50983023 -0.6416251   3.35934537  1

**Q4**

Now we implement Monte Carlo control using state-action values. 

**Implement**

Complete the snippet in `mc_control_q` inside `simple_grid_agent.py`
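
Monte Carlo control alternates between estimating Q and improving the policy from it. As a minimal sketch of the improvement side only (the helper names `epsilon_greedy_action` and `greedy_policy` are hypothetical and not taken from `simple_grid_agent.py`), action selection and greedy policy extraction could be written as:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise pick the best action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def greedy_policy(q):
    """Map every state to the index of its highest-valued action."""
    return {state: int(np.argmax(values)) for state, values in q.items()}

# Two Q rows copied from the MC control output reported in this notebook.
q_example = {
    (2, 0): np.array([-4.04658836, -5.34450892, -4.90007443, -3.94344805]),
    (0, 0): np.array([-3.77986127, -0.715942, -2.45271267, -0.97200872]),
}
policy_example = greedy_policy(q_example)
action_text = ['U', 'L', 'D', 'R']
for state, action in policy_example.items():
    print(state, action_text[action])
```

For these two rows the greedy actions are `R` at `(2, 0)` and `L` at `(0, 0)`, which agrees with the policy grid printed after the control run.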

Test and report the results inside this notebook using the following commands.

In [8]:
#stochastic environment
env = gridworld(wind_p=0.2)

#initial policy
policy_fast = {(0, 0): 3,
          (0, 1): 3,
          (0, 2): 2,
          (1, 0): 3,
          (1, 1): 3,
          (1, 2): 0,
          (2, 0): 3,
          (2, 1): 0,
          (2, 2): 0}

#stochastic agent - epsilon greedy with decays
a = Agent(env, policy = policy_fast, gamma = 0.9, 
        start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)

# Run MC Control
a.mc_control_q(n_episode = 1000,first_visit=False)
a.print_policy()

print(f'\n Actions: {env.action_text} \n')
for i in a.q: print(i,a.q[i])


----------
L |R |D |
----------
U |R |U |
----------
R |R |U |
 Actions: ['U' 'L' 'D' 'R'] 

(2, 0) [-4.04658836 -5.34450892 -4.90007443 -3.94344805]
(2, 1) [-3.65231798 -4.7209081  -4.8003984   0.88024116]
(1, 1) [-0.27084665 -4.68888426 -3.92718024  3.28968687]
(1, 0) [-1.95307692 -4.10897386 -6.21430029 -3.74355583]
(0, 1) [-0.12830397 -0.9415     -2.51577778  1.51963865]
(0, 2) [ 1.36764705 -1.78515905  3.34775114  1.46954031]
(0, 0) [-3.77986127 -0.715942   -2.45271267 -0.97200872]
(2, 2) [ 3.29859775 -4.68125002  1.06604     1.78871429]
(1, 2) [0. 0. 0. 0.]


**Q5**

Bonus!

**Implement**

Greedy in the Limit with Infinite Exploration (GLIE) MC control in the `mc_control_glie` function inside `simple_grid_agent.py`.

Test and report the results inside this notebook using the following commands.
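
GLIE requires that every state-action pair keeps being explored while the policy becomes greedy in the limit, which means the exploration rate has to decay to zero. A textbook schedule that satisfies this is $\epsilon_k = 1/k$ for episode $k$. The sketch below shows that schedule together with the resulting epsilon-greedy action probabilities. Both helper names are hypothetical, and whether `mc_control_glie` is expected to use exactly this schedule is an assumption.

```python
import numpy as np

def glie_epsilon(k):
    """Exploration rate for episode k (k >= 1), decaying as 1 / k."""
    return 1.0 / k

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state."""
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[int(np.argmax(q_values))] += 1.0 - epsilon
    return probs

q_row = np.array([0.0, 1.0, 0.0, 0.0])  # made-up Q values for one state
for k in (1, 4, 100):
    print(k, epsilon_greedy_probs(q_row, glie_epsilon(k)))
```

At $k = 1$ the policy is uniform, and as $k$ grows the probability mass concentrates on the greedy action. A schedule with a positive floor, such as `end_epsilon=0.3`, never becomes fully greedy.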

In [9]:
#stochastic environment
env = gridworld(wind_p=0.2)

#initial policy
policy_fast = {(0, 0): 3,
          (0, 1): 3,
          (0, 2): 2,
          (1, 0): 3,
          (1, 1): 3,
          (1, 2): 0,
          (2, 0): 3,
          (2, 1): 0,
          (2, 2): 0}

#stochastic agent - epsilon greedy with decays
a = Agent(env, policy = policy_fast, gamma = 0.9, 
        start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)

a.mc_control_glie(n_episode = 1000)
a.print_policy()
print('\n Actions', env.action_text, '\n')
for i in a.q: print(i,a.q[i])


----------
R |R |D |
----------
U |R |U |
----------
R |R |U |
 Actions ['U' 'L' 'D' 'R'] 

(2, 0) [-4.80224638 -4.5433438  -4.61854381 -3.90056043]
(2, 1) [-3.63059444 -5.04126388 -4.62309266  1.0182127 ]
(2, 2) [ 3.20808258 -3.88123942  1.41190246  0.56157875]
(1, 0) [-2.81703112 -5.26846742 -5.04917858 -4.06274592]
(1, 1) [-0.45919111 -4.72471858 -4.25052735  3.21130167]
(0, 0) [-5.17786799 -4.18227904 -4.35256601 -1.08375241]
(0, 1) [-0.80037442 -2.2623703  -3.08151107  1.58831274]
(0, 2) [ 1.44496469 -0.61447723  3.38559966  1.0171323 ]
(1, 2) [0. 0. 0. 0.]
