# Policy Evaluation

This is a full implementation of the policy-evaluation algorithm. All we need is the policy we’re trying to evaluate (pi), the MDP (P), the discount factor gamma (which defaults to 1), and theta, a small threshold that we use to check for convergence.
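
For reference, the update applied in each sweep can be written as below. This is a sketch under two assumptions: the policy is deterministic (as in the code, where pi(s) returns a single action), and terminal states have a value of zero. The symbols v_k (the estimate after k sweeps) and p (the transition probabilities stored in the MDP) are notation introduced here, not names from the notebook.

```latex
v_{k+1}(s) = \sum_{s', r} p(s', r \mid s, \pi(s)) \left[ r + \gamma \, v_k(s') \right],
\qquad \text{stop when } \max_{s} \left| v_{k+1}(s) - v_k(s) \right| < \theta
```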

In [6]:
import numpy as np

def policy_evaluation(pi, P, gamma=1.0, theta=1e-10):
    
    # initialize the first-iteration estimates to zero.
    prev_V = np.zeros(len(P))
    
    # looping forever...
    while True:
        # initialize the current-iteration estimates to zero as well.
        V = np.zeros(len(P))
        
        # loop through all states to estimate the state-value function
        for s in range(len(P)):
            
            # we use the policy pi to get the possible transitions,
            # each transition tuple has a probability, next state, 
            # reward, and a done flag indicating whether the next_state 
            # is terminal or not
            for prob, next_state, reward, done in P[s][pi(s)]:
                
                # calculate the value of that state by summing up the 
                # weighted value of each transition; (not done) zeroes 
                # the bootstrap term when next_state is terminal
                V[s] += prob * (reward + gamma * prev_V[next_state] * (not done))
        
        # at the end of each iteration (a state sweep), we check 
        # whether the state-value function is still changing; if the 
        # largest change is below theta, we call it converged
        if np.max(np.abs(prev_V - V)) < theta:
            break
        
        # finally, copy to get ready for the next iteration
        prev_V = V.copy()
        
    # return the latest state-value function    
    return V
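
As a sanity check on the iterative approach, the same value function can also be obtained by solving the linear system V = r_pi + gamma * P_pi V directly. The sketch below uses that direct solve, which is a different technique from the sweep-based algorithm above. The helper name exact_policy_value and the three-state toy MDP are hypothetical and introduced only for illustration; the MDP uses the same (prob, next_state, reward, done) transition format as P.

```python
import numpy as np

def exact_policy_value(pi, P, gamma=0.99):
    """Solve V = r_pi + gamma * P_pi V directly for a deterministic policy pi."""
    n = len(P)
    P_pi = np.zeros((n, n))  # continuation probabilities under pi
    r_pi = np.zeros(n)       # expected one-step reward under pi
    for s in range(n):
        for prob, next_state, reward, done in P[s][pi(s)]:
            r_pi[s] += prob * reward
            # transitions into terminal states contribute no future value
            if not done:
                P_pi[s, next_state] += prob
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Hypothetical 3-state chain (not part of the original notebook):
# states 0 and 1 are non-terminal, state 2 is a terminal goal.
# The single action 0 moves right with probability 0.5 and stays put otherwise.
toy_P = {
    0: {0: [(0.5, 0, 0.0, False), (0.5, 1, 0.0, False)]},
    1: {0: [(0.5, 1, 0.0, False), (0.5, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)]},
}
toy_pi = lambda s: 0

V_exact = exact_policy_value(toy_pi, toy_P, gamma=0.99)
print(V_exact)
```

Running policy_evaluation(toy_pi, toy_P, gamma=0.99) on the same toy MDP should agree with V_exact up to roughly theta. The direct solve is only practical for small state spaces, which is one reason the iterative version is the one usually shown.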

We can also write a simple function to print the value function in a readable way:

In [9]:
def print_state_value_function(V, P, n_cols=4, prec=3, title='State-value function:'):
    print(title)
    for s in range(len(P)):
        v = V[s]
        print("| ", end="")
        if np.all([done for action in P[s].values() for _, _, _, done in action]):
            print("".rjust(9), end=" ")
        else:
            print(str(s).zfill(2), '{}'.format(np.round(v, prec)).rjust(6), end=" ")
        if (s + 1) % n_cols == 0: print("|")

Let’s now run policy evaluation for the two policies, "Go-get-it" and "Careful", in the FrozenLake (FL) environment.

In [2]:
import gymnasium as gym

frozen_lake = gym.make('FrozenLake-v1')
P = frozen_lake.unwrapped.P
goal_state = 15

LEFT, DOWN, RIGHT, UP = range(4)

In [3]:
go_get_pi = lambda s: {
    0:RIGHT, 1:RIGHT, 2:DOWN, 3:LEFT,
    4:DOWN, 5:LEFT, 6:DOWN, 7:LEFT,
    8:RIGHT, 9:RIGHT, 10:DOWN, 11:LEFT,
    12:LEFT, 13:RIGHT, 14:RIGHT, 15:LEFT
}[s]

In [4]:
careful_pi = lambda s: {
    0:LEFT, 1:UP, 2:UP, 3:UP,
    4:LEFT, 5:LEFT, 6:UP, 7:LEFT,
    8:UP, 9:DOWN, 10:LEFT, 11:LEFT,
    12:LEFT, 13:RIGHT, 14:RIGHT, 15:LEFT
}[s]
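
Because each policy is just a mapping from state to action, it can help to view it laid out on the 4x4 grid. The sketch below is one possible way to do that; render_policy and ARROWS are hypothetical names introduced here, and the mapping is copied from the Careful policy above. Arrows are also printed for the holes (states 5, 7, 11, 12) and the goal (state 15), where the chosen action does not matter.

```python
LEFT, DOWN, RIGHT, UP = range(4)

# Hypothetical display symbols for the four actions.
ARROWS = {LEFT: '<', DOWN: 'v', RIGHT: '>', UP: '^'}

def render_policy(pi, n_states=16, n_cols=4):
    """Return the policy as a text grid with one arrow per state."""
    rows = []
    for start in range(0, n_states, n_cols):
        stop = min(start + n_cols, n_states)
        cells = [ARROWS[pi(s)] for s in range(start, stop)]
        rows.append('| ' + ' | '.join(cells) + ' |')
    return '\n'.join(rows)

# Same state-to-action mapping as the Careful policy.
careful_actions = {
    0: LEFT, 1: UP, 2: UP, 3: UP,
    4: LEFT, 5: LEFT, 6: UP, 7: LEFT,
    8: UP, 9: DOWN, 10: LEFT, 11: LEFT,
    12: LEFT, 13: RIGHT, 14: RIGHT, 15: LEFT,
}

grid = render_policy(lambda s: careful_actions[s])
print(grid)
```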

In [10]:
V = policy_evaluation(go_get_pi, P, gamma=0.99)
print_state_value_function(V, P, prec=4)

State-value function:
| 00 0.0342 | 01 0.0231 | 02 0.0468 | 03 0.0231 |
| 04 0.0463 |           | 06 0.0957 |           |
| 08  0.094 | 09 0.2386 | 10 0.2901 |           |
|           | 13 0.4329 | 14 0.6404 |           |


In [11]:
V = policy_evaluation(careful_pi, P, gamma=0.99)
print_state_value_function(V, P, prec=4)

State-value function:
| 00 0.4079 | 01 0.3754 | 02 0.3543 | 03 0.3438 |
| 04 0.4203 |           | 06 0.1169 |           |
| 08 0.4454 | 09  0.484 | 10 0.4328 |           |
|           | 13 0.5884 | 14 0.7107 |           |


The Go-get-it policy doesn’t pay well in the FL environment! Its value for the start state (state 0) is only 0.0342, whereas the Careful policy is much better at 0.4079.