In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np

from simple_grid import simple_grid as gridworld
from simple_grid_agent import GridworldAgent as Agent

Read through all the classes and functions defined in the `simple_grid` environment and in `GridworldAgent` to familiarize yourself with the details of this assignment.

Consider a simple gridworld in which actions do not lead to deterministic state changes. We specify a $20\%$ probability that the selected action results in a stochastic state transition.

In [2]:
#stochastic environment
env = gridworld(wind_p=0.2)
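
To make the role of `wind_p` concrete, here is a minimal sketch of one way a stochastic ("windy") transition could work. It is not the code in `simple_grid`: the function name `windy_step`, the `MOVES` table, the clamping at the grid border, and the rule that wind replaces the chosen action with a uniformly random one are all assumptions made for illustration. Only the action numbering (0=U, 1=L, 2=D, 3=R) and the `(row, column)` state layout follow the notebook.

```python
import random

# Action numbering follows the notebook: 0='U', 1='L', 2='D', 3='R'.
# The (row, col) offsets themselves are an assumption of this sketch.
MOVES = {0: (-1, 0), 1: (0, -1), 2: (1, 0), 3: (0, 1)}


def windy_step(state, action, wind_p=0.2, shape=(3, 3), rng=random):
    """Return the next (row, col) state.

    With probability wind_p the chosen action is replaced by a uniformly
    random one (hypothetical wind model). Moves that would leave the grid
    are clamped to the border (also an assumption).
    """
    if rng.random() < wind_p:
        action = rng.choice(list(MOVES))
    d_row, d_col = MOVES[action]
    row = min(max(state[0] + d_row, 0), shape[0] - 1)
    col = min(max(state[1] + d_col, 0), shape[1] - 1)
    return (row, col)
```

With `wind_p=0.0` the sketch is deterministic, which is a convenient way to check the move table before turning the wind back on.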

The following commands will help you familiarize yourself with the different components of the gridworld.

In [3]:
print('\n Reward For each Tile \n')
env.print_reward()


 Reward For each Tile 


----------
0 |0 |0 |
----------
0 |-5 |5 |
----------
0 |0 |0 |

Check out the set of possible actions for the grid.

In [4]:
print('\n Set of possible actions in numerical form. These are actual inputs to the gridworld agent \n')
print(env.action_space)

print('\n Set of possible actions in the grid in text form. They map 1 to 1 from numbers above to direction \n')
print(env.action_text)


 Set of possible actions in numerical form. These are actual inputs to the gridworld agent 

[0 1 2 3]

 Set of possible actions in the grid in text form. They map 1 to 1 from numbers above to direction 

['U' 'L' 'D' 'R']


Consider a policy that tries to reach the goal state (+5) as fast as possible. Below we define this policy so that its state values can be evaluated.

In [5]:
#stochastic environment
env = gridworld(wind_p=0.2)

#initial policy
policy_fast = {(0, 0): 3,
          (0, 1): 3,
          (0, 2): 2,
          (1, 0): 3,
          (1, 1): 3,
          (1, 2): 0,
          (2, 0): 3,
          (2, 1): 0,
          (2, 2): 0}

#stochastic agent - epsilon greedy with decays
a = Agent(env, policy = policy_fast, gamma = 0.9, 
            start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)

print('\n Policy: Fastest Path to Goal State(Does not take reward into consideration) \n')
a.print_policy()


 Policy: Fastest Path to Goal State(Does not take reward into consideration) 


----------
R |R |D |
----------
R |R |U |
----------
R |U |U |

**Q1**

Implement the `get_v` and `get_q` methods in `simple_grid_agent.py` to estimate the state value and the state-action value. These may be useful later for debugging your code.

**Q2** 

The Monte Carlo rollout itself has been implemented in `simple_grid_agent.py` inside the `run_episode` method.

**Implement** 

First-visit as well as any-visit Monte Carlo state-value estimation equations inside `mc_predict_v` in `simple_grid_agent.py`.
These have been discussed in class. Refer to Sutton and Barto Chapter 5 for further details to implement them.
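
As a reference point, the difference between the two estimators can be sketched on plain Python data. This is a standalone illustration, not the assignment's `mc_predict_v`: the function name `mc_state_values` and the episode format (a list of `(state, reward)` pairs, where the reward is the one received on leaving that state) are assumptions of this sketch.

```python
from collections import defaultdict


def mc_state_values(episodes, gamma=0.9, first_visit=True):
    """Estimate V(s) by averaging sampled returns over episodes.

    episodes: list of episodes; each episode is a list of (state, reward)
    pairs (hypothetical format chosen for this sketch).
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Scan backwards so that G_t = R_{t+1} + gamma * G_{t+1}.
        g = 0.0
        returns = []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()  # back to chronological order

        seen = set()
        for state, g in returns:
            if first_visit and state in seen:
                continue  # first-visit: ignore later occurrences
            seen.add(state)
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

For the single episode `[('A', 0), ('B', 0), ('A', 1)]` with `gamma=0.5`, the returns are 0.25, 0.5 and 1 in time order. First-visit therefore gives `V(A) = 0.25`, while every-visit averages both occurrences of `A` and gives `V(A) = 0.625`. The two estimates differ only in states that are revisited within an episode.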

Test and report the results inside this notebook using the following commands. Are there significant differences in the state values under any-visit and first-visit MC prediction? Why?

NB: assume any-visit and every-visit to be interchangeable terms.

In [6]:
# evaluate state values for policy_fast for both first-visit and any-visit
print('\n State Values for first_visit MC state estimation \n')
a.mc_predict_v()
a.print_v()

print('\n State Values for any_visit MC state estimation \n')
a.mc_predict_v(first_visit=False)
a.print_v()


 State Values for first_visit MC state estimation 


---------------
-0.7 |0.7 |2.8 |
---------------
-3.7 |1.9 |0 |
---------------
-4.0 |-3.4 |2.4 |
 State Values for any_visit MC state estimation 


---------------
-0.5 |0.5 |1.6 |
---------------
-1.9 |1.1 |0 |
---------------
-2.3 |-1.9 |1.4 |

**Q3** 

The Monte Carlo rollout itself has been implemented in `simple_grid_agent.py` inside the `run_episode` method.

**Implement** 

First-visit as well as any-visit Monte Carlo state-action value estimation equations inside `mc_predict_q` in `simple_grid_agent.py`.
These have been discussed in class. Refer to Sutton and Barto Chapter 5 for further details to implement them.

Test and report the results inside this notebook using the following commands. Are there significant differences in the state-action values under any-visit and first-visit MC Q-value prediction? Why?

**My Answer**

If the agent is not reset between the two runs, the difference looks large, but the comparison is unclear because the second run keeps overwriting the estimates left by the first.
If we reset the agent before each run, the first-visit and any-visit methods give nearly identical values. Both estimates converge as the number of episodes goes to *infinity*, and 10000 iterations seem to be large enough for this problem. Thus both MC methods appear to be near convergence, which is why their values are similar.
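
The convergence claim above can be written out explicitly. The notation is the standard one from Sutton and Barto and is not defined elsewhere in this notebook: $G_t$ is the sampled return from time $t$, $T$ is the episode length, $N(s,a)$ counts the returns collected for the pair $(s,a)$, and $q_\pi$ is the true action value under the evaluated policy.

```latex
G_t = \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1},
\qquad
Q_N(s,a) = \frac{1}{N(s,a)} \sum_{i=1}^{N(s,a)} G^{(i)}(s,a)
\;\xrightarrow[\;N(s,a)\to\infty\;]{}\; q_\pi(s,a)
```

First-visit MC averages independent, unbiased samples of the return, while every-visit MC also uses correlated samples from the same episode and is biased for finite $N$. Both are consistent, so with enough episodes the two tables should agree up to sampling noise.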


In [9]:
# Reset the agent before running the first-visit method
a = Agent(env, policy = policy_fast, gamma = 0.9, 
            start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)

# evaluate state action values for policy_fast
print('\n State action Values for first_visit MC state action estimation \n')
a.mc_predict_q()
print('\n Actions', env.action_text, '\n')
for i in a.q: print(i,a.q[i])

# Reset the agent before running the any-visit method
a = Agent(env, policy = policy_fast, gamma = 0.9, 
            start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)
    
# evaluate state action values for policy_fast
print('\n State action Values for any_visit MC state action estimation \n')
a.mc_predict_q(first_visit=False)
print('\n Actions', env.action_text, '\n')
for i in a.q: print(i,a.q[i])


 State action Values for first_visit MC state action estimation 


 Actions ['U' 'L' 'D' 'R'] 

(2, 0) [-3.90329919 -3.90329919 -3.90329919 -3.90329919]
(2, 1) [-3.34172175 -3.34172175 -3.34172175 -3.34172175]
(2, 2) [2.57401957 2.57401957 2.57401957 2.57401957]
(1, 1) [1.96939246 1.96939246 1.96939246 1.96939246]
(0, 1) [0.79994935 0.79994935 0.79994935 0.79994935]
(0, 0) [-0.99826627 -0.99826627 -0.99826627 -0.99826627]
(0, 2) [2.72404822 2.72404822 2.72404822 2.72404822]
(1, 0) [-3.56870279 -3.56870279 -3.56870279 -3.56870279]
(1, 2) [0. 0. 0. 0.]

 State action Values for any_visit MC state action estimation 


 Actions ['U' 'L' 'D' 'R'] 

(2, 0) [-3.96960571 -3.96960571 -3.96960571 -3.96960571]
(1, 0) [-3.6943788 -3.6943788 -3.6943788 -3.6943788]
(1, 1) [1.92188501 1.92188501 1.92188501 1.92188501]
(0, 1) [0.80356688 0.80356688 0.80356688 0.80356688]
(0, 2) [2.68179496 2.68179496 2.68179496 2.68179496]
(2, 1) [-3.39222104 -3.39222104 -3.39222104 -3.39222104]
(2, 2) [2.5771943 2.5

**Q4**

Now we implement Monte Carlo control using state-action values. 

**Implement**

Complete the snippet in `mc_control_q` inside `simple_grid_agent.py`
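
For orientation, the policy-improvement half of MC control can be sketched as an epsilon-greedy action choice plus an epsilon schedule. This is a hedged illustration, not the contents of `mc_control_q`: the helper names `epsilon_greedy` and `decay_epsilon` are hypothetical, and the multiplicative decay floored at `end_epsilon` is only a guess at what the agent's `start_epsilon`, `end_epsilon` and `epsilon_decay` arguments mean.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one.

    q_values: sequence of action values for one state, indexed by action.
    Ties in the greedy branch are broken in favour of the lowest index.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, end_epsilon=0.3, epsilon_decay=0.9):
    """Multiplicative decay floored at end_epsilon (assumed schedule)."""
    return max(end_epsilon, epsilon * epsilon_decay)
```

One consequence of the tie-break in this sketch: if all four action values of a state are equal, the greedy choice is always action 0, which the notebook's action table maps to `'U'`. Checking whether the entries of `a.q[state]` actually differ per action is therefore a quick sanity test for a control implementation.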

Test and report inside this notebook the results using the following commands

In [10]:
#stochastic environment
env = gridworld(wind_p=0.2)

#initial policy
policy_fast = {(0, 0): 3,
          (0, 1): 3,
          (0, 2): 2,
          (1, 0): 3,
          (1, 1): 3,
          (1, 2): 0,
          (2, 0): 3,
          (2, 1): 0,
          (2, 2): 0}

#stochastic agent - epsilon greedy with decays
a = Agent(env, policy = policy_fast, gamma = 0.9, 
        start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)

# Run MC Control
a.mc_control_q(n_episode = 1000,first_visit=False)
a.print_policy()

print(f'\n Actions: {env.action_text} \n')
for i in a.q: print(i,a.q[i])


----------
U |U |U |
----------
U |U |U |
----------
U |U |U |
 Actions: {env.action_text} 

(2, 0) [-4.06428327 -4.06428327 -4.06428327 -4.06428327]
(2, 1) [-3.61741734 -3.61741734 -3.61741734 -3.61741734]
(1, 1) [1.83840261 1.83840261 1.83840261 1.83840261]
(1, 0) [-4.09648402 -4.09648402 -4.09648402 -4.09648402]
(0, 1) [0.00152062 0.00152062 0.00152062 0.00152062]
(0, 0) [-2.77851795 -2.77851795 -2.77851795 -2.77851795]
(0, 2) [2.76265248 2.76265248 2.76265248 2.76265248]
(2, 2) [2.16922656 2.16922656 2.16922656 2.16922656]
(1, 2) [0. 0. 0. 0.]


**Q5**

Bonus!

**Implement**

Greedy in the Limit with Infinite Exploration (GLIE) MC control in the `mc_control_glie` function inside `simple_grid_agent.py`.
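
The two ingredients usually associated with GLIE MC control are an incremental mean update of the action values and an exploration rate that shrinks to zero, for example $\epsilon_k = 1/k$ for episode $k$. The sketch below shows both in isolation. The names `incremental_mean_update` and `glie_epsilon` and the dictionary-based storage are assumptions of this sketch, and `mc_control_glie` in `simple_grid_agent.py` may be organized differently.

```python
def incremental_mean_update(q, counts, key, g):
    """Update the running mean of returns for key = (state, action).

    q and counts are plain dicts (hypothetical storage for this sketch);
    g is the sampled return observed for that state-action pair.
    """
    counts[key] = counts.get(key, 0) + 1
    old = q.get(key, 0.0)
    # Q <- Q + (G - Q) / N, which equals the average of all returns seen so far.
    q[key] = old + (g - old) / counts[key]


def glie_epsilon(k):
    """Exploration rate for episode k (1-based): epsilon_k = 1 / k."""
    return 1.0 / k
```

Feeding the returns 2.0 and then 4.0 for the same pair leaves its value at 3.0, the plain average. Note that an epsilon schedule with a positive floor never becomes fully greedy, so it does not meet the "greedy in the limit" half of the GLIE condition.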

Test and report inside this notebook the results using the following commands

In [12]:
#stochastic environment
env = gridworld(wind_p=0.2)

#initial policy
policy_fast = {(0, 0): 3,
          (0, 1): 3,
          (0, 2): 2,
          (1, 0): 3,
          (1, 1): 3,
          (1, 2): 0,
          (2, 0): 3,
          (2, 1): 0,
          (2, 2): 0}

#stochastic agent - epsilon greedy with decays
a = Agent(env, policy = policy_fast, gamma = 0.9, 
        start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)

a.mc_control_glie(n_episode = 1000)
a.print_policy()
print('\n Actions', env.action_text, '\n')
for i in a.q: print(i,a.q[i])


----------
R |R |D |
----------
R |R |U |
----------
R |U |U |
 Actions ['U' 'L' 'D' 'R'] 

(2, 0) [-4.43721661 -4.7530688  -4.30619662 -3.81638594]
(2, 1) [-3.46260196 -5.2509598  -4.58948807  0.47365521]
(1, 1) [-0.61335505 -4.59878555 -4.0423211   3.42311793]
(1, 0) [-1.5817005  -3.15445273 -3.70110562 -3.28002722]
(2, 2) [ 3.24556312 -2.35457034  2.348       2.30352941]
(0, 1) [-0.77147403 -2.32595881 -3.04926663  1.70161609]
(0, 2) [ 0.59276    -0.70399232  3.42073377  1.30634498]
(0, 0) [-1.93911085 -3.15227375 -2.038      -1.22695815]
