In [1]:
import numpy as np
import MDP
import RL2

In [2]:
''' Construct simple MDP as described in Lecture 2a Slides 13-14'''
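# T[a, s, s']: probability of moving from state s to state s' under action a
T = np.array([
    [[0.5,0.5,0,0],[0,1,0,0],[0.5,0.5,0,0],[0,1,0,0]],    # action 0
    [[1,0,0,0],[0.5,0,0,0.5],[0.5,0,0.5,0],[0,0,0.5,0.5]], # action 1
])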
R = np.array([[0,0,10,10],[0,0,10,10]])
discount = 0.9
mdp = MDP.MDP(T,R,discount)
rlProblem = RL2.RL2(mdp,np.random.normal)

## Question 0: REINFORCE for Company Env

The following cell shows the results of the REINFORCE algorithm on the Company environment. The argument optionlr=2 selects one of several learning-rate cool-down schedules. This option was chosen because it performed best in the Maze environment (as can be seen from the graph).
The learning rate schedule is:<br>
episodes 0-59: 0.004<br>
episodes 60-119: 0.003<br>
episodes 120-179: 0.002<br>
episodes 180 onward: 0.001<br>
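
The schedule above can be written as a small lookup. This is only an illustrative sketch: the function name `step_learning_rate` and its arguments are hypothetical and are not taken from RL2, whose internal handling of optionlr is not shown here.

```python
def step_learning_rate(episode, rates=(0.004, 0.003, 0.002, 0.001), period=60):
    """Return the learning rate for a given episode (hypothetical helper).

    The rate drops to the next entry of `rates` every `period` episodes
    and stays at the last entry afterwards.
    """
    index = min(episode // period, len(rates) - 1)
    return rates[index]


for episode in (0, 59, 60, 120, 180, 999):
    print(episode, step_learning_rate(episode))
```

A table lookup is used instead of subtracting 0.001 repeatedly so that the returned values are exactly the ones listed above.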

The resulting policy matches the one obtained with the other methods: Q-learning, value iteration and policy iteration. The stochastic policy learned by REINFORCE still assigns a tiny probability (about $10^{-32}$ or less in the output below) to the bad actions. This is expected, because a softmax never outputs a probability of exactly zero, and values this small are negligible in practice. The policy is derived by applying a softmax over the policy parameters of all actions in a given state. Since this transformation is non-linear but strictly increasing in each parameter, the policy parameters can be read much like action values: within a state, the action with the larger parameter gets the higher probability.
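
As a minimal sketch of this softmax step (the helper name `softmax_policy` is hypothetical, and it is an assumption that RL2 computes the policy in this way), the parameters printed below can be turned into a policy as follows. The array is assumed to be indexed [action, state], matching the shape of initialPolicyParams.

```python
import numpy as np


def softmax_policy(policy_params):
    """Softmax over actions (axis 0) for each state (column)."""
    # Subtract the per-state maximum for numerical stability;
    # this does not change the resulting probabilities.
    shifted = policy_params - policy_params.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=0, keepdims=True)


# Policy parameters copied from the REINFORCE output in this notebook
theta = np.array([[1.08779021, -107.64104214, -72.11244365, -118.98323027],
                  [-190.65790366, 1.39528767, 1.55156471, 1.55799232]])

policy = softmax_policy(theta)
print(policy)
print(policy.argmax(axis=0))  # greedy action per state: [0 1 1 1]
```

The greedy action per state depends only on which parameter is larger, which is why the parameters themselves can be compared directly.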

In [19]:
# Test REINFORCE
[policyParams,policy] = rlProblem.reinforce(
    s0=0,initialPolicyParams=np.random.rand(mdp.nActions,mdp.nStates),
    nEpisodes=1000,nSteps=100,optionlr=2)
print("\nREINFORCE results")
print(policyParams)
print(policy)
print("last 10 episode rewards: {}".format(rlProblem.get_reinforce_cumulative_reward()[-10:]))


REINFORCE results
[[   1.08779021 -107.64104214  -72.11244365 -118.98323027]
 [-190.65790366    1.39528767    1.55156471    1.55799232]]
[[  1.00000000e+00   4.42714387e-48   1.01889083e-32   4.46284897e-53]
 [  5.31989696e-84   1.00000000e+00   1.00000000e+00   1.00000000e+00]]
last 10 episode rewards: [ 61.83802954  41.71160069  36.4315967    3.13309967   4.96709418
  40.15740254  66.25189017  10.63413635  33.92608282  54.1812685 ]


## Question 0: Model Based RL for Company Env

The following results show that model-based RL computes the correct policy in the Company environment (it agrees with the policy from value iteration). The model-based RL algorithm learns a model of the environment (the transition matrix) by averaging over the accumulated samples. The value function produced by model-based RL is $[30.35689505, 37.17731415, 42.57629584, 52.68348209]$, while the value function computed for the Company environment with the real transition matrix is $[31.58404185, 38.60295392, 44.0231138, 54.2005363]$. The values are comparable, differing by roughly 1.2 to 1.5 in every state. The gap arises because model-based RL does not have access to the real transition matrix and therefore evaluates the best policy on an estimate of it, built from (s,a,s') samples drawn from the environment. The final deterministic policy is exactly the same as with the other approaches.

In [20]:
# Test model-based RL
[V,policy] = rlProblem.modelBasedRL(
    s0=0,
    defaultT=np.ones([mdp.nActions,mdp.nStates,mdp.nStates])/mdp.nStates,
    initialR=np.zeros([mdp.nActions,mdp.nStates]),
    nEpisodes=100,nSteps=100,epsilon=0.05)
print("\nmodel-based RL results")
print(V)
print(policy)


model-based RL results
[ 30.35689505  37.17731415  42.57629584  52.68348209]
[0 1 1 1]
