# CS486 - Artificial Intelligence
## Lesson 15 - MDP Policy Iteration

Markov decision processes (MDPs) model decision making in situations where outcomes are partially non-deterministic. While value iteration provides a simple, non-linear approach to determining an optimal policy for an MDP, it has a few problems:

* Value iteration is slow: each iteration costs $O(S^2A)$, where $S$ is the number of states being modeled and $A$ is the number of actions.
* The action that maximizes the expected value at each state seldom changes from one iteration to the next.
* Policies converge faster than values.
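
To see where the $O(S^2A)$ cost comes from, here is a minimal sketch of a single value iteration sweep that counts the terms it adds up. The function name `value_iteration_sweep`, the callables `T` and `R`, and the toy uniform MDP are all made up for illustration; they are not part of `aima.mdp`.

```python
def value_iteration_sweep(states, actions, T, R, V, gamma=0.9):
    """One Bellman backup per state. Returns the new values and the
    number of (s, a, s') terms that were summed."""
    terms = 0
    new_V = {}
    for s in states:                 # S states
        best = float('-inf')
        for a in actions:            # A actions per state
            total = 0.0
            for s2 in states:        # S possible successors per action
                total += T(s, a, s2) * (R(s, a, s2) + gamma * V[s2])
                terms += 1
            best = max(best, total)
        new_V[s] = best
    return new_V, terms

# Hypothetical dense MDP: 4 states, 3 actions, uniform transitions, reward 1.
states = list(range(4))
actions = ['x', 'y', 'z']
T = lambda s, a, s2: 1.0 / len(states)
R = lambda s, a, s2: 1.0
V0 = {s: 0.0 for s in states}

new_V, terms = value_iteration_sweep(states, actions, T, R, V0)
print(terms)  # S * A * S = 4 * 3 * 4 terms
```

The three nested loops are the whole story: every sweep touches each (state, action, successor) triple once.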

Today we will look at **policy iteration**, which produces an optimal policy by alternating **policy evaluation** and **policy extraction**.

In [None]:
from helpers import *
from aima.mdp import *
from aima.notebook import psource

## Policy Extraction 

Last time we used value iteration to generate the expected values for *Draw HiLo*:

```python
V = {  # expected value V(s) of each state s
    1: 2.5054945054945055,
    2: 2.2197802197802194,
    3: 1.9340659340659339,
    4: 1.648351648351648,
    5: 1.3626373626373625,
    6: 1.0769230769230766,
    7: 0.7912087912087912,
    8: 1.0769230769230766,
    9: 1.3626373626373625,
    10: 1.648351648351648,
    11: 1.9340659340659339,
    12: 2.2197802197802194,
    13: 2.5054945054945055,
    'bet': 0.714285714285714,
    'lose': -1,
    'win': 2.714285714285714
}
```

Those values were passed to the `best_policy` function to determine which actions were optimal at each state. But how exactly does `best_policy` use the expected values to choose actions?

In [None]:
psource(best_policy)
psource(expected_utility)

These functions implement the update below: they visit each state, compute a depth-1 `expectimax` over the current expected values, and pick the action with the highest expected value.

$$ \pi_{i+1}(s) \leftarrow \arg \max_{a} \sum_{s'}T(s,a,s')\left[R(s,a,s')+\gamma V^{\pi_i}(s')\right] $$

But notice that policy extraction yields the best action for the *next iteration of expected values at each state*. If the current policy $\pi_i$ is not optimal, policy extraction can produce a different, better policy $\pi_{i+1}$.
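
As a rough sketch of the extraction step (not the AIMA code itself), the following uses a made-up two-state MDP. The dictionary layout of `T`, the constant `GAMMA`, and the names `q_value` and `extract_policy` are assumptions chosen for this example.

```python
# Hypothetical two-state MDP, for illustration only.
# T[s][a] is a list of (probability, next_state, reward) triples.
T = {
    'A': {'stay': [(1.0, 'A', 0.0)],
          'go':   [(0.8, 'B', 1.0), (0.2, 'A', 0.0)]},
    'B': {'stay': [(1.0, 'B', 2.0)],
          'go':   [(1.0, 'A', 0.0)]},
}
GAMMA = 0.9

def q_value(s, a, V):
    """Depth-1 expectimax: expected reward of taking a in s plus the
    discounted value of wherever we land."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in T[s][a])

def extract_policy(V):
    """For every state, pick the action with the largest one-step lookahead."""
    return {s: max(T[s], key=lambda a: q_value(s, a, V)) for s in T}

V = {'A': 0.0, 'B': 0.0}
policy = extract_policy(V)
print(policy)  # {'A': 'go', 'B': 'stay'}
```

With all values at zero, only the immediate rewards matter: `go` is worth 0.8 in `A` and `stay` is worth 2 in `B`.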

## Policy Evaluation 

So if policy extraction yields a different policy, how much better is it? To answer that, we should compute the expected value at every state under the new policy. We don't need to consider every action at every state, just the one our new policy picks, so each sweep costs $O(S^2)$ rather than $O(S^2A)$:

$$ V^{\pi_i}_{k+1}(s) \leftarrow \sum_{s'}T(s,\pi_i(s),s')\left[R(s,\pi_i(s),s')+\gamma V^{\pi_i}_{k}(s')\right] $$

This definitely looks complicated, but notice that there is no $\max$ over the actions, since we already know which action we're taking. We're just plugging in the policy. The AIMA version of policy evaluation does not wait for the values to converge. It simply iterates 20 times, which in practice is usually sufficient.
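
A minimal sketch of this fixed-policy update, again on a made-up two-state MDP. The layout of `T`, the constant `GAMMA`, and the name `evaluate_policy` are assumptions for illustration; only the default of 20 sweeps is taken from the description above.

```python
# Hypothetical two-state MDP, for illustration only.
# T[s][a] is a list of (probability, next_state, reward) triples.
T = {
    'A': {'stay': [(1.0, 'A', 0.0)],
          'go':   [(0.8, 'B', 1.0), (0.2, 'A', 0.0)]},
    'B': {'stay': [(1.0, 'B', 2.0)],
          'go':   [(1.0, 'A', 0.0)]},
}
GAMMA = 0.9

def evaluate_policy(pi, V, k=20):
    """Run k sweeps of the fixed-policy update. There is no max:
    each state only looks at the action pi[s]."""
    for _ in range(k):
        V = {s: sum(p * (r + GAMMA * V[s2]) for p, s2, r in T[s][pi[s]])
             for s in T}
    return V

pi = {'A': 'go', 'B': 'stay'}
V = evaluate_policy(pi, {s: 0.0 for s in T})
print(V)
```

After 20 sweeps the values are still short of their limits (for example, `V['B']` approaches 20 but has not reached it), which is fine when all we need is a good enough estimate to improve the policy.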

In [None]:
psource(policy_evaluation)

## Policy Iteration

Each **policy iteration** runs **policy evaluation** to find the expected values for a policy and **policy extraction** to improve it. Note that the AIMA version inlines policy extraction in its `policy_iteration` function.

So where do we get our first policy? We pick a random one!
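
Putting the two steps together, the loop might be sketched as follows. This is not the AIMA implementation: the toy MDP, the `lookahead` helper, and the name `policy_iteration_sketch` are hypothetical, and stopping when the policy no longer changes is the standard termination test rather than something taken from the source above.

```python
import random

# Hypothetical two-state MDP, for illustration only.
# T[s][a] is a list of (probability, next_state, reward) triples.
T = {
    'A': {'stay': [(1.0, 'A', 0.0)],
          'go':   [(0.8, 'B', 1.0), (0.2, 'A', 0.0)]},
    'B': {'stay': [(1.0, 'B', 2.0)],
          'go':   [(1.0, 'A', 0.0)]},
}
GAMMA = 0.9

def lookahead(s, a, V):
    """Expected one-step return of taking action a in state s."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in T[s][a])

def policy_iteration_sketch(k=20, seed=0):
    rng = random.Random(seed)
    # Start from a random policy and all-zero values.
    pi = {s: rng.choice(sorted(T[s])) for s in T}
    V = {s: 0.0 for s in T}
    while True:
        # Policy evaluation: k sweeps with the action fixed by pi.
        for _ in range(k):
            V = {s: lookahead(s, pi[s], V) for s in T}
        # Policy extraction: greedy one-step lookahead on the new values.
        new_pi = {s: max(T[s], key=lambda a: lookahead(s, a, V)) for s in T}
        if new_pi == pi:  # policy is stable, so we stop
            return pi, V
        pi = new_pi

pi, V = policy_iteration_sketch()
print(pi)  # {'A': 'go', 'B': 'stay'}
```

On this toy problem every starting policy settles on the same answer within a couple of rounds, even though the values themselves are still only approximate.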

In [None]:
psource(policy_iteration)

So how much better is policy iteration? Well, let's see how it performs on *Draw HiLo*:

In [None]:
from helpers import HiLo
hilo = HiLo()

In [None]:
%%time
expected_values = value_iteration(hilo)
pi1 = best_policy(hilo, expected_values)

In [None]:
%%time
pi2 = policy_iteration(hilo)

In [None]:
pi1 == pi2