## ACTOR CRITIC

In this demo, let's use the actor-critic (AC) method to learn an optimal policy.

We can summarise the steps of the AC algorithm as follows:

1. Produce the action (a_t) for the current state (s_t)<br>
2. Observe the next state (s_t+1) and the reward r<br>
3. Update the utility of the state s_t (critic)<br>
4. Update the probability of the action using the error (actor)
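
Steps 3 and 4 can be written compactly. The following is a sketch in the notation of the code in this notebook, where $U$ is the utility table (`value_matrix`), $p$ is the table of action preferences (`state_action_matrix`), $\alpha$ is the critic step size, $\beta$ is the actor step size and $\gamma$ is the discount factor:

$$\delta = r + \gamma\, U(s_{t+1}) - U(s_t)$$

$$U(s_t) \leftarrow U(s_t) + \alpha\, \delta$$

$$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta\, \delta$$

The same error $\delta$ drives both updates: the critic uses it to correct its estimate of the state, and the actor uses it to raise or lower its preference for the action just taken.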

In [1]:
import numpy as np
np.set_printoptions(precision=3,suppress=True)
from gridworld import GridWorld



In [2]:
def softmax(x):
    # Subtract the maximum before exponentiating for numerical stability
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)
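
As a quick check of why the maximum is subtracted, the sketch below compares the function above with a naive softmax on deliberately large inputs. The input values are hypothetical and chosen only to force an overflow; `softmax` is repeated so the snippet runs on its own.

```python
import numpy as np

def softmax(x):
    # Same formula as in the cell above
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

# Hypothetical preferences, large enough that exp() overflows
preferences = np.array([1000.0, 1001.0, 1002.0])

probs = softmax(preferences)

# Naive version without the shift: exp(1000) is inf, and inf / inf is nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(preferences) / np.sum(np.exp(preferences))

print(probs)
print(naive)
```

Subtracting a constant from every input leaves the result unchanged, so the shifted version returns the same distribution as it would for `[0, 1, 2]`.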

In [3]:
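def update_critic(value_matrix, observation, new_observation,
                  reward, alpha, gamma, done):
    # Utility of the current state and of the next state
    u = value_matrix[observation[0], observation[1]]
    u_t1 = value_matrix[new_observation[0], new_observation[1]]
    # TD error: delta = r + gamma * U(s_t+1) - U(s_t)
    delta = reward + ((gamma * u_t1) - u)
    # Move the utility of the current state a step of size alpha along the error
    value_matrix[observation[0], observation[1]] += alpha * delta
    # Note: the done flag is currently unused
    return value_matrix, delta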
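
To make the critic update concrete, here is a minimal worked example of a single step. The starting utility of 0.5 and the chosen transition are hypothetical values for illustration; the function is repeated from the cell above, without the unused `done` argument, so the snippet is self-contained.

```python
import numpy as np

def update_critic(value_matrix, observation, new_observation,
                  reward, alpha, gamma):
    # Same update as the cell above
    u = value_matrix[observation[0], observation[1]]
    u_t1 = value_matrix[new_observation[0], new_observation[1]]
    delta = reward + ((gamma * u_t1) - u)
    value_matrix[observation[0], observation[1]] += alpha * delta
    return value_matrix, delta

value_matrix = np.zeros((3, 4))
value_matrix[0, 2] = 0.5  # hypothetical utility already learned for state (0, 2)

# One move from (0, 1) to (0, 2) with the step reward of -0.04
value_matrix, delta = update_critic(value_matrix, (0, 1), (0, 2),
                                    reward=-0.04, alpha=0.001, gamma=0.999)

print(delta)               # -0.04 + 0.999 * 0.5 - 0
print(value_matrix[0, 1])  # alpha * delta
```

The error is positive because the next state looks better than the current estimate predicted, so the utility of (0, 1) is nudged upwards by `alpha * delta`.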

In [4]:
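def update_actor(state_action_matrix, observation, action, delta, beta_matrix=None):
    # Flatten the (row, column) observation into a column index of the 4x12 matrix
    col = observation[1] + (observation[0]*4)
    # Step size: 1 by default, otherwise the inverse of the beta_matrix entry
    # (that entry must be non-zero)
    if beta_matrix is None: beta = 1
    else: beta = 1 / beta_matrix[action,col]
    # Raise or lower the preference for the action taken, according to the error
    state_action_matrix[action, col] += beta * delta
    return state_action_matrix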
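
The sketch below illustrates two details of the actor update: how a grid position maps to a column of the 4x12 preference matrix, and how a positive error shifts the softmax distribution towards the action taken. The state, action and error used here are hypothetical values, and the preferences start at zero only to make the effect easy to read.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

# State (row 1, column 2) of the 3x4 grid maps to column 1*4 + 2 = 6
observation = (1, 2)
col = observation[1] + (observation[0] * 4)

# Preferences initialised to zero here, so every action starts equally likely
state_action_matrix = np.zeros((4, 12))
before = softmax(state_action_matrix[:, col])

# Hypothetical step: action RIGHT (1) produced a positive error
action, delta, beta = 1, 0.5, 1
state_action_matrix[action, col] += beta * delta
after = softmax(state_action_matrix[:, col])

print(col)
print(before)
print(after)
```

Only the preference of the chosen action changes, but because softmax normalises over the column, the probabilities of the other three actions drop accordingly.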

1. Let's create a grid world. The states marked 1 are terminal states and those marked -1 are obstacles. <br>
2. The agent receives a reward of -0.04 for every move from a non-terminal state. <br>
3. The actions are UP(0), RIGHT(1), DOWN(2) and LEFT(3).

In [5]:
env = GridWorld(3, 4)

#Define the state matrix
state_matrix = np.zeros((3,4))
state_matrix[0, 3] = 1
state_matrix[1, 3] = 1
state_matrix[1, 1] = -1
print("State Matrix:")
print(state_matrix)

State Matrix:
[[ 0.  0.  0.  1.]
 [ 0. -1.  0.  1.]
 [ 0.  0.  0.  0.]]


In [6]:
#Define the reward matrix
reward_matrix = np.full((3,4), -0.04)
reward_matrix[0, 3] = 1
reward_matrix[1, 3] = -1
print("Reward Matrix:")
print(reward_matrix)

Reward Matrix:
[[-0.04 -0.04 -0.04  1.  ]
 [-0.04 -0.04 -0.04 -1.  ]
 [-0.04 -0.04 -0.04 -0.04]]


In [7]:
#Define the transition matrix
transition_matrix = np.array([[0.8, 0.1, 0.0, 0.1],
                              [0.1, 0.8, 0.1, 0.0],
                              [0.0, 0.1, 0.8, 0.1],
                              [0.1, 0.0, 0.1, 0.8]])

state_action_matrix = np.random.random((4,12))
print("State-Action Matrix:")
print(state_action_matrix)

State-Action Matrix:
[[0.877 0.623 0.738 0.287 0.449 0.701 0.547 0.125 0.596 0.672 0.95  0.988]
 [0.578 0.422 0.19  0.539 0.258 0.735 0.949 0.284 0.44  0.645 0.407 0.563]
 [0.084 0.558 0.357 0.663 0.431 0.784 0.354 0.663 0.265 0.199 0.958 0.022]
 [0.119 0.29  0.228 0.241 0.368 0.774 0.768 0.4   0.732 0.31  0.006 0.724]]
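
One way to read the transition matrix defined above is sketched below. It assumes that row `i` holds the probabilities of the action actually executed when the agent intends action `i`; this interpretation is an assumption about how `GridWorld` uses the matrix, not something shown in this notebook. The random generator and its seed are used only to make the illustration reproducible.

```python
import numpy as np

transition_matrix = np.array([[0.8, 0.1, 0.0, 0.1],
                              [0.1, 0.8, 0.1, 0.0],
                              [0.0, 0.1, 0.8, 0.1],
                              [0.1, 0.0, 0.1, 0.8]])

# Assumption: row = intended action, column = executed action
rng = np.random.default_rng(0)
intended = 0  # UP
executed = rng.choice(4, size=10000, p=transition_matrix[intended])

# Empirical frequency of each executed action
frequencies = np.bincount(executed, minlength=4) / executed.size
print(frequencies)
```

Under this reading, the intended action is carried out about 80% of the time, the agent slips to either side about 10% of the time each, and it never moves in the opposite direction.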


In [8]:
env.setStateMatrix(state_matrix)
env.setRewardMatrix(reward_matrix)
env.setTransitionMatrix(transition_matrix)

value_matrix = np.zeros((3,4))
print("Utility Matrix:")
print(value_matrix)

Utility Matrix:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


In [9]:
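gamma = 0.999  #Discount factor
alpha = 0.001  #Learning rate of the critic
#Allocated but not used below: the training loop passes beta_matrix=None
beta_matrix = np.zeros((4,12))
tot_epoch = 30000  #Number of training episodes
print_epoch = 1000  #Print the value matrix every print_epoch episodes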

In [10]:
for epoch in range(tot_epoch):
    #Reset and return the first observation
    observation = env.reset(exploring_starts=True)
    for step in range(1000):
        #Sample an action from the softmax distribution over the action preferences
        col = observation[1] + (observation[0]*4)
        action_array = state_action_matrix[:, col]
        action_distribution = softmax(action_array)
        action = np.random.choice(4, 1, p=action_distribution)
        new_observation, reward, done = env.step(action)
        value_matrix, delta = update_critic(value_matrix, observation, 
                                              new_observation, reward, alpha, gamma, done)
        state_action_matrix = update_actor(state_action_matrix, observation, 
                                           action, delta, beta_matrix=None)
        observation = new_observation
        if done: break


    if(epoch % print_epoch == 0):
        print("")
        print("State Value matrix after " + str(epoch+1) + " iterations:") 
        print(value_matrix)
        print("")
        #print("State-Action matrix after " + str(epoch+1) + " iterations:") 
        #print(state_action_matrix)
#Time to check the utility matrix obtained
print("value matrix after " + str(tot_epoch) + " iterations:")
print(value_matrix)
print("State-Action matrix after  " + str(tot_epoch) + " iterations:")
print(state_action_matrix)


State Value matrix after 1 iterations:
[[ 0.     0.    -0.     0.   ]
 [-0.     0.    -0.     0.   ]
 [-0.    -0.    -0.    -0.001]]


State Value matrix after 1001 iterations:
[[-0.021  0.095  0.55   0.   ]
 [-0.053  0.     0.062  0.   ]
 [-0.061 -0.058 -0.041 -0.106]]


State Value matrix after 2001 iterations:
[[ 0.021  0.277  0.764  0.   ]
 [-0.06   0.     0.214  0.   ]
 [-0.077 -0.07  -0.009 -0.129]]


State Value matrix after 3001 iterations:
[[ 0.087  0.435  0.863  0.   ]
 [-0.05   0.     0.369  0.   ]
 [-0.082 -0.062  0.062 -0.14 ]]


State Value matrix after 4001 iterations:
[[ 0.171  0.566  0.904  0.   ]
 [-0.026  0.     0.502  0.   ]
 [-0.085 -0.04   0.146 -0.127]]


State Value matrix after 5001 iterations:
[[ 0.266  0.665  0.927  0.   ]
 [ 0.008  0.     0.578  0.   ]
 [-0.083 -0.011  0.231 -0.106]]


State Value matrix after 6001 iterations:
[[ 0.363  0.74   0.938  0.   ]
 [ 0.066  0.     0.63   0.   ]
 [-0.075  0.023  0.302 -0.073]]


State Value matrix after 7001 iterat