#### Sutton and Barto, Reinforcement Learning 2nd. Edition, page 80.
![Sutton and Barto, Reinforcement Learning 2nd. Edition.](./Figures/PolicyIteration.png)

Policy Iteration (using iterative policy evaluation) for estimating π ≈ π*

**Policy Iteration**

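Iterative policy evaluation computes the state values for a fixed policy. Once those values have been determined, we can examine, for each state, the rewards and the values of the states its actions lead to, and check whether some other action would do better than the one the policy prescribes. Choosing the best action in every state yields an improved policy. *Policy iteration* is the algorithm that alternates these two steps.

Evaluation and improvement are repeated until the policy no longer changes.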
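In the notation of Sutton and Barto, the improvement step can be written as a greedy choice with respect to the current value function. Here $v_\pi$ is the value function produced by policy evaluation, $p(s', r \mid s, a)$ is the environment's dynamics, and $\gamma$ is the discount factor:

$$
\pi'(s) = \arg\max_{a} \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]
$$

The loop stops when $\pi'(s) = \pi(s)$ for every state $s$.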

**Policy Iteration Algorithm**
```
given an initial policy
repeat until the policy no longer changes:
    compute values for the current policy using iterative policy evaluation
    compute a new (greedy) policy from those values
```
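The outline above can be made concrete on a toy problem. The following is a minimal, self-contained sketch that does not use `rlgridworld`. It assumes a deterministic four-state chain in which state 3 is terminal and entering it pays a reward of 1. All names in it (`evaluate_policy`, `greedy_policy`, `policy_iteration_sketch`, `TRANSITIONS`) are hypothetical and chosen for illustration only.

```python
# Toy deterministic MDP (an assumption for illustration, not the grid world
# used in this notebook): states 0..3 in a row, state 3 is terminal.
# TRANSITIONS[state][action] = (next_state, reward)
STATES = [0, 1, 2, 3]
TRANSITIONS = {
    0: {"left": (0, 0.0), "right": (1, 0.0)},
    1: {"left": (0, 0.0), "right": (2, 0.0)},
    2: {"left": (1, 0.0), "right": (3, 1.0)},
}


def evaluate_policy(policy, gamma=0.9, epsilon=1e-3):
    """Iterative policy evaluation: sweep until the largest update is below epsilon."""
    values = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in policy:  # terminal states have no action and keep value 0
            next_state, reward = TRANSITIONS[s][policy[s]]
            new_value = reward + gamma * values[next_state]
            delta = max(delta, abs(new_value - values[s]))
            values[s] = new_value
        if delta < epsilon:
            return values


def greedy_policy(values, gamma=0.9):
    """Policy improvement: pick the action with the best one-step lookahead."""
    def lookahead(s, a):
        next_state, reward = TRANSITIONS[s][a]
        return reward + gamma * values[next_state]
    return {s: max(TRANSITIONS[s], key=lambda a: lookahead(s, a)) for s in TRANSITIONS}


def policy_iteration_sketch(policy, gamma=0.9, epsilon=1e-3):
    """Alternate evaluation and improvement until the policy stops changing."""
    while True:
        values = evaluate_policy(policy, gamma, epsilon)
        new_policy = greedy_policy(values, gamma)
        if new_policy == policy:
            return policy, values
        policy = new_policy


final_policy, final_values = policy_iteration_sketch({0: "left", 1: "left", 2: "left"})
print(final_policy)
print(final_values)
```

Starting from a policy that always moves left, the sketch ends with every state moving right, toward the rewarding terminal state.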

In [1]:
from rlgridworld.standard_grid import create_standard_grid
from rlgridworld.algorithms import iterative_policy_evaluation
from rlgridworld.algorithms import compute_policy_from_values

In [2]:
gw = create_standard_grid()

In [3]:
policy = { 
    (0,0):'up', (0,1):'right',(0,2):'right',(0,3):'up',
    (1,0):'up', (1,1):'', (1,2):'right', (1,3):'',
    (2,0):'right', (2,1):'right', (2,2):'right', (2,3):''
    }
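Judging from the printed grid in the output further down, the keys of this dictionary appear to be `(row, column)` pairs with row 0 at the bottom of the grid. That convention is inferred from the output rather than documented. The sketch below lays the same dictionary out top row first to make the mapping visible; `policy_rows` is a hypothetical helper, not part of `rlgridworld`.

```python
# The same policy dictionary as in the cell above, repeated so that this
# block is self-contained.
policy = {
    (0, 0): 'up', (0, 1): 'right', (0, 2): 'right', (0, 3): 'up',
    (1, 0): 'up', (1, 1): '', (1, 2): 'right', (1, 3): '',
    (2, 0): 'right', (2, 1): 'right', (2, 2): 'right', (2, 3): '',
}


def policy_rows(policy, n_rows=3, n_cols=4):
    """Return the actions as a list of rows, top row first.

    Assumes keys are (row, column) with row 0 at the bottom of the grid.
    """
    return [
        [policy.get((row, col), '') for col in range(n_cols)]
        for row in reversed(range(n_rows))
    ]


rows = policy_rows(policy)
for row in rows:
    print(' | '.join(f'{action:>5}' for action in row))
```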

In [4]:
# from page 80 of Sutton and Barto, RL, 2nd. Ed.
def policy_iteration(gw, policy, gamma=0.9, epsilon=0.001):
    while True:
        # policy evaluation: update values for the current policy
        iterative_policy_evaluation(gw, policy, gamma, epsilon)
        # policy improvement: compute a new policy from the updated values
        new_policy = compute_policy_from_values(gw, gamma)
        # the policy is stable when no state's action has changed
        policy_stable = all(policy[state] == new_policy[state] for state in policy)
        # update policy
        policy = new_policy
        # repeat until policy does not change
        if policy_stable:
            # return the result; rebinding `policy` above does not
            # modify the dictionary held by the caller
            return policy
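One Python detail is worth keeping in mind when reading the function above: the assignment `policy = new_policy` rebinds a local name and leaves the dictionary passed in by the caller untouched. This is why the next cell recovers the final policy by calling `compute_policy_from_values` again. A minimal sketch of that behaviour, using a hypothetical `rebind` function and made-up state names:

```python
def rebind(policy):
    # Assigning to the parameter creates a new local binding;
    # the caller's dictionary is not modified.
    policy = {"s0": "right"}
    return policy


original = {"s0": "left"}
returned = rebind(original)

print(original)   # the caller's dictionary is unchanged
print(returned)   # the new dictionary is only available via the return value
```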

In [5]:
print("")
print("Initial Policy")
gw.print_policy(policy)
print("")

# note: this execution of iterative policy evaluation is not part 
# of the policy iteration algorithm.  It is for the purpose of 
# displaying the values associated with the input policy

iterative_policy_evaluation(gw, policy)
print("Initial Policy Values")
gw.print_values()

# run policy iteration algorithm
policy_iteration(gw, policy)
# compute policy from optimal values
new_policy = compute_policy_from_values(gw)

# print new policy and values
print("") 
print("New Policy")
gw.print_policy(new_policy)
print("")
print("New Policy Values")
gw.print_values()
print("")


Initial Policy
-------------------------------------
|  Right |  Right |  Right |        |
-------------------------------------
|     Up |        |  Right |        |
-------------------------------------
|     Up |  Right |  Right |     Up |
-------------------------------------

Initial Policy Values
-------------------------------------
|   0.81 |   0.90 |   1.00 |   0.00 |
-------------------------------------
|   0.73 |   0.00 |  -1.00 |   0.00 |
-------------------------------------
|   0.66 |  -0.81 |  -0.90 |  -1.00 |
-------------------------------------

New Policy
-------------------------------------
|  Right |  Right |  Right |        |
-------------------------------------
|     Up |        |     Up |        |
-------------------------------------
|  Right |  Right |     Up |   Left |
-------------------------------------

New Policy Values
-------------------------------------
|   0.81 |   0.90 |   1.00 |   0.00 |
-------------------------------------
|   0.73 |   0.00 