# DAT257x: Reinforcement Learning Explained

## Lab 4: Dynamic Programming

### Exercise 4.2 Policy Iteration

Implement the algorithm for Policy Iteration in the code cell below.  

Note a subtle difference: the standalone Policy Evaluation algorithm assumes a stochastic policy, whereas the Policy Evaluation step inside Policy Iteration assumes a deterministic one. You therefore cannot call your previous code directly, but you can reuse large parts of it for the Policy Evaluation step.
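
The difference comes down to how the policy enters the Bellman backup. As a minimal sketch, assuming a hypothetical two-state `get_transitions` and a stochastic policy stored as a dict of action probabilities (neither is part of the lab's API), the two single-state backups might look like this:

```python
def get_transitions(state, action):
    """Hypothetical two-state MDP using the lab's (next_state, reward, probability) tuples."""
    if action == "stay":
        return [(state, 0.0, 1.0)]
    # "move" reaches the other state 80% of the time and slips back otherwise
    other = 1 - state
    return [(other, 1.0, 0.8), (state, 0.0, 0.2)]

def q_value(V, state, action, gamma):
    """Expected return of taking `action` in `state`, then continuing with values V."""
    return sum(p * (r + gamma * V[s2]) for s2, r, p in get_transitions(state, action))

def backup_stochastic(V, state, action_probs, gamma):
    """Policy Evaluation backup: average over actions, weighted by pi(a|s)."""
    return sum(prob * q_value(V, state, a, gamma) for a, prob in action_probs.items())

def backup_deterministic(V, state, action, gamma):
    """Policy Iteration backup: pi(s) is a single action, so there is nothing to average."""
    return q_value(V, state, action, gamma)

V = [0.0, 10.0]
gamma = 0.9
v_stochastic = backup_stochastic(V, 0, {"stay": 0.5, "move": 0.5}, gamma)
v_deterministic = backup_deterministic(V, 0, "move", gamma)
```

A deterministic policy is the special case of a stochastic one that puts probability 1 on a single action, which is why most of the earlier evaluation code carries over once the sum over actions is dropped.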


In [41]:
import tester       # required for testing and grading your code
import numpy as np

def policy_iteration(state_count, gamma, theta, get_available_actions, get_transitions):
    """
    This function computes the optimal value function and policy for the specified MDP, using the Policy Iteration algorithm.
    'state_count' is the total number of states in the MDP. States are represented as 0-relative numbers.
    'gamma' is the MDP discount factor for rewards.
    'theta' is the small number threshold to signal convergence of the value function (see Iterative Policy Evaluation algorithm).
    'get_available_actions' returns a list of the MDP available actions for the specified state parameter.
    'get_transitions' is the MDP state / reward transition function.  It accepts two parameters, state and action, and returns
        a list of tuples, where each tuple is of the form: (next_state, reward, probability).
    """
    V = state_count*[0]                # init all state value estimates to 0
    pi = state_count*[0]
    
    # start from the policy that picks the first available action in each state
    for s in range(state_count):
        avail_actions = get_available_actions(s)
        pi[s] = avail_actions[0]       # store the action itself, matching the improvement step below
    
    while True:
        V = policy_eval_in_place(V, state_count, gamma, theta, pi, get_transitions)
        is_stable = True
        for s in range(state_count):
            prev_action = pi[s]
            avail_actions = get_available_actions(s)
            # one-step lookahead: expected return of each available action under the current V
            action_values = [sum(t[2] * (t[1] + gamma * V[t[0]]) for t in get_transitions(s, a)) for a in avail_actions]
            max_index = np.argmax(action_values)
            pi[s] = avail_actions[max_index]
            is_stable = is_stable and (pi[s] == prev_action)
        if is_stable:
            break
        
    return (V, pi)        # return both the final value function and the final policy


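def policy_eval_in_place(V, state_count, gamma, theta, pi, get_transitions):
    """
    Evaluates the deterministic policy 'pi' by sweeping over all states and updating V in place,
    stopping once the largest change within a sweep falls below 'theta'. Returns the updated V.
    """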
    while True:
        delta = 0
        for s in range(state_count):
            prev_v = V[s]
            V[s] = sum(t[2] * (t[1] + gamma * V[t[0]]) for t in get_transitions(s, pi[s]))
            delta = max(delta, np.absolute(prev_v - V[s]))
        if delta < theta:
            break
    return V

tester.policy_iteration_test(policy_iteration)  


Testing: Policy Iteration
passed test: return value is tuple
passed test: length of tuple = 2
passed test: v is list of length=15
passed test: values of v elements
passed test: pi is list of length=15
passed test: values of pi elements
PASSED: Policy Iteration passcode = 9970-010
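
The grader reports only pass or fail, so it can help to run the same algorithm on an MDP small enough to solve by hand. The sketch below is self-contained and rests on several assumptions: the three-state chain, the action names `left` and `right`, and the helper names `toy_actions`, `toy_transitions`, and `toy_policy_iteration` are all hypothetical and are not part of the lab or of `tester`.

```python
GAMMA = 0.9
THETA = 1e-8

def toy_actions(state):
    """Hypothetical chain 0 - 1 - 2: every state offers both moves."""
    return ["left", "right"]

def toy_transitions(state, action):
    """Deterministic moves; state 2 is absorbing, and entering it from state 1 pays +1."""
    if state == 2:
        return [(2, 0.0, 1.0)]
    next_state = max(state - 1, 0) if action == "left" else state + 1
    reward = 1.0 if next_state == 2 else 0.0
    return [(next_state, reward, 1.0)]

def toy_policy_iteration(state_count, gamma, theta, get_available_actions, get_transitions):
    V = [0.0] * state_count
    pi = [get_available_actions(s)[0] for s in range(state_count)]
    while True:
        # policy evaluation for the current deterministic policy, updating V in place
        while True:
            delta = 0.0
            for s in range(state_count):
                old = V[s]
                V[s] = sum(p * (r + gamma * V[s2]) for s2, r, p in get_transitions(s, pi[s]))
                delta = max(delta, abs(old - V[s]))
            if delta < theta:
                break
        # policy improvement: act greedily with respect to V (first maximum wins ties)
        stable = True
        for s in range(state_count):
            actions = get_available_actions(s)
            q = [sum(p * (r + gamma * V[s2]) for s2, r, p in get_transitions(s, a)) for a in actions]
            best = actions[q.index(max(q))]
            if best != pi[s]:
                stable = False
            pi[s] = best
        if stable:
            return V, pi

V, pi = toy_policy_iteration(3, GAMMA, THETA, toy_actions, toy_transitions)
```

With a discount of 0.9, working the chain by hand gives V = [0.9, 1.0, 0.0] and the action `right` in states 0 and 1. The action reported for state 2 is arbitrary, because that state is absorbing and both actions have the same value.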
