#### Sutton and Barto, Reinforcement Learning 2nd. Edition, page 80.

![Sutton and Barto, Reinforcement Learning 2nd. Edition.](./Figures/PolicyIteration.png)
Policy Iteration (using iterative policy evaluation) for estimating π ≈ π*



**Policy Iteration**



Iterative policy evaluation computes the state values for a fixed policy. Once these values have been determined, we can examine the rewards and values of the successor states to see whether a different action would do better in any state, and update the policy accordingly. *Policy Iteration* is the resulting algorithm.

The two steps, evaluation and improvement, are repeated until the policy no longer changes.
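
For reference, the improvement step can be written in the notation of Sutton and Barto as a greedy one-step lookahead over the current value estimate. Here $p(s', r \mid s, a)$ denotes the environment dynamics and $\gamma$ the discount factor; neither symbol appears in the code in this notebook, which hides them inside the library functions.

$$
\pi'(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_{\pi}(s') \,\bigr]
$$

If every action leads to exactly one successor state, as the values printed later in this notebook suggest, the sum collapses to a single term $r + \gamma\, v_{\pi}(s')$ per action.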



**Policy Iteration Algorithm**

```text
Given a policy

while not converged:
    compute values using iterative policy evaluation
    compute new policy from values
```
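
The loop above can be sketched end to end on a much smaller problem. The example below is a minimal, self-contained illustration and does not use `rlgridworld`: the four-state corridor, the `step` dynamics, the reward of 1 for reaching state 3, and all function names are assumptions made up for this sketch.

```python
GAMMA = 0.9
STATES = [0, 1, 2, 3]          # toy corridor; state 3 is terminal
ACTIONS = ["left", "right"]
TERMINAL = 3

def step(state, action):
    """Deterministic toy dynamics: return (next_state, reward)."""
    next_state = max(state - 1, 0) if action == "left" else state + 1
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward

def evaluate(policy, values, theta=1e-6):
    """Iterative policy evaluation, updating `values` in place."""
    while True:
        delta = 0.0
        for s in policy:
            next_state, reward = step(s, policy[s])
            new_value = reward + GAMMA * values[next_state]
            delta = max(delta, abs(new_value - values[s]))
            values[s] = new_value
        if delta < theta:
            return

def improve(values):
    """Greedy policy: pick the action with the best one-step lookahead."""
    def lookahead(s, a):
        next_state, reward = step(s, a)
        return reward + GAMMA * values[next_state]
    return {s: max(ACTIONS, key=lambda a: lookahead(s, a))
            for s in STATES if s != TERMINAL}

def toy_policy_iteration(policy):
    """Alternate evaluation and improvement until the policy stops changing."""
    values = {s: 0.0 for s in STATES}
    while True:
        evaluate(policy, values)
        new_policy = improve(values)
        if new_policy == policy:
            return policy, values
        policy = new_policy

# Start from a deliberately poor policy that always moves away from the goal.
toy_policy, toy_values = toy_policy_iteration({0: "left", 1: "left", 2: "left"})
print(toy_policy)                                       # {0: 'right', 1: 'right', 2: 'right'}
print({s: round(v, 2) for s, v in toy_values.items()})  # {0: 0.81, 1: 0.9, 2: 1.0, 3: 0.0}
```

Each pass repairs the policy one state further from the goal, so this toy run needs several evaluate/improve rounds before the policy stabilizes.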

### Import Necessary Libraries

Import functions required for creating the grid world environment and implementing the policy iteration algorithm components (policy evaluation and policy improvement).

In [1]:
from rlgridworld.standard_grid import create_standard_grid
from rlgridworld.algorithms import iterative_policy_evaluation
from rlgridworld.algorithms import compute_policy_from_values

### Create Grid World Environment

Instantiate the standard grid world environment used for the examples.

In [2]:
gw = create_standard_grid()

### Define Initial Policy

Define an arbitrary initial policy for the agent. The policy maps each state, given as a `(row, column)` coordinate, to one of the actions `'up'`, `'down'`, `'left'`, or `'right'`. An empty string marks a terminal state or a state with no action defined in this initial policy. Note that the grids printed further down draw row 0 at the bottom.

In [3]:
policy = {
    (0, 0): 'right', (0, 1): 'right', (0, 2): 'right', (0, 3): 'up',
    (1, 0): 'up', (1, 1): '', (1, 2): 'right', (1, 3): '',
    (2, 0): 'right', (2, 1): 'right', (2, 2): 'right', (2, 3): '',
}
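
To make the coordinate convention concrete, the sketch below lays such a dictionary out as a grid. `render_policy` is a hypothetical helper written for this illustration, not part of `rlgridworld`. It assumes the keys are `(row, column)` pairs on a 3 by 4 grid and draws row 0 at the bottom, which matches the orientation of the output printed later in this notebook.

```python
def render_policy(policy, rows=3, cols=4):
    """Return the policy as text, one grid row per line, row 0 at the bottom."""
    lines = []
    for r in reversed(range(rows)):
        # Show states without an action as "." so the columns stay aligned.
        cells = [policy.get((r, c), "") or "." for c in range(cols)]
        lines.append(" ".join(f"{cell:>5}" for cell in cells))
    return "\n".join(lines)

# Same mapping as the initial policy above, repeated so the sketch runs on its own.
example_policy = {
    (0, 0): "right", (0, 1): "right", (0, 2): "right", (0, 3): "up",
    (1, 0): "up", (1, 1): "", (1, 2): "right", (1, 3): "",
    (2, 0): "right", (2, 1): "right", (2, 2): "right", (2, 3): "",
}
print(render_policy(example_policy))
```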

### Define Policy Iteration Function

Implement the core Policy Iteration algorithm as described in Sutton and Barto (page 80). This function iteratively performs policy evaluation (calculating state values V(s) for the current policy) and policy improvement (updating the policy based on the calculated values) until the policy stabilizes (no longer changes between iterations).

In [4]:
# from page 80 of Sutton and Barto, RL, 2nd. Ed.

def policy_iteration(gw, policy, gamma=0.9, epsilon=0.001):

    while True:

        # perform iterative policy evaluation to update values
        # This function modifies gw.V internally
        iterative_policy_evaluation(gw, policy, gamma, epsilon)

        # update policy from new values stored in gw.V
        new_policy = compute_policy_from_values(gw, gamma)

        # see if policy has changed
        policy_stable = True # Assume stable initially
        current_states = set(policy.keys())
        new_states = set(new_policy.keys())
        all_states = current_states.union(new_states)

        for state in all_states:
            current_action = policy.get(state, '') # Get action or default to ''
            new_action = new_policy.get(state, '') # Get action or default to ''
            if current_action != new_action:
                policy_stable = False
                break

        # update policy for the next iteration
        policy = new_policy

        # repeat until policy does not change
        if policy_stable:
            break
    # Return the final stable policy 
    return policy
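
The stability check treats a state that is missing from one policy the same as a state whose action is the empty string, which a plain `policy == new_policy` comparison would not do. As a minimal sketch of that comparison in isolation (`policies_match` is a hypothetical helper name, not part of `rlgridworld`):

```python
def policies_match(policy_a, policy_b, default=""):
    """Return True if both policies choose the same action in every state.

    A state missing from one policy is treated as having the `default`
    action, so {(1, 1): ""} and {} compare as equal.
    """
    all_states = set(policy_a) | set(policy_b)
    return all(policy_a.get(s, default) == policy_b.get(s, default)
               for s in all_states)

print(policies_match({(0, 0): "up", (1, 1): ""}, {(0, 0): "up"}))  # True
print(policies_match({(0, 0): "up"}, {(0, 0): "right"}))           # False
```

If both functions are known to return dictionaries with identical keys, direct dictionary equality would be enough; the defaulting version is the safer choice when that is not guaranteed.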

### Policy Iteration

1.  Print the initial policy.
2.  Evaluate the initial policy with `iterative_policy_evaluation` on a temporary grid, purely to display the starting values (this step is *not* part of the policy iteration loop itself).
3.  Print the values associated with the initial policy.
4.  Re-create the grid so its values start fresh, then run `policy_iteration` to find the optimal policy. A copy of the initial policy is passed so the original dictionary stays unmodified in case it is needed elsewhere.
5.  `policy_iteration` returns the final, stable policy; the matching values are left in `gw`.
6.  Print the final policy and its values.
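
The values printed for the initial policy can be checked by hand along row 0, where the initial policy moves right three times and then up into `(1, 3)`. The sketch below rests on assumptions that the notebook does not state: moves are deterministic, the discount is 0.9, the only reward on this path is -1 for entering `(1, 3)`, and the episode ends there.

```python
gamma = 0.9
path = [(0, 0), (0, 1), (0, 2), (0, 3)]  # states visited under the initial policy
final_reward = -1.0                      # assumed reward for stepping from (0, 3) into (1, 3)

# Work backwards: the last state collects the reward directly, and each
# earlier state is one more discount step away from it.
hand_values = {}
value = final_reward
for state in reversed(path):
    hand_values[state] = value
    value *= gamma

for state in path:
    print(state, f"{hand_values[state]:.2f}")
```

Under these assumptions the four numbers come out as -0.73, -0.81, -0.90 and -1.00, which agrees with the bottom row of the initial value grid in the output.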

In [5]:
initial_policy_display = policy.copy()
print("")
print("Initial Policy")
gw.print_policy(initial_policy_display)
print("")

# note: this execution of iterative policy evaluation is not part 
# of the policy iteration algorithm.  It is for the purpose of 
# displaying the values associated with the input policy

# Create a temporary grid instance or reset values if needed for display
temp_gw_for_initial_values = create_standard_grid() 
iterative_policy_evaluation(temp_gw_for_initial_values, initial_policy_display) # Use default gamma/epsilon
print("Initial Policy Values (before iteration)")
temp_gw_for_initial_values.print_values()
del temp_gw_for_initial_values # Clean up temporary object

# run policy iteration algorithm on the main grid world object 'gw'
# Make sure gw's values are reset or start fresh if necessary before this step
gw = create_standard_grid() # Re-create to ensure clean state values
final_policy = policy_iteration(gw, policy.copy()) # Pass a copy of the original policy
# The policy_iteration function modifies the values in gw internally.

# print new policy and values
print("") 
print("Final Policy (after iteration)")
gw.print_policy(final_policy)
print("")
print("Final Policy Values (after iteration)")
gw.print_values()
print("")


Initial Policy
-------------------------------------
|  Right |  Right |  Right |        |
-------------------------------------
|     Up |        |  Right |        |
-------------------------------------
|  Right |  Right |  Right |     Up |
-------------------------------------

Initial Policy Values (before iteration)
-------------------------------------
|   0.81 |   0.90 |   1.00 |   0.00 |
-------------------------------------
|   0.73 |   0.00 |  -1.00 |   0.00 |
-------------------------------------
|  -0.73 |  -0.81 |  -0.90 |  -1.00 |
-------------------------------------

Final Policy (after iteration)
-------------------------------------
|  Right |  Right |  Right |        |
-------------------------------------
|     Up |        |     Up |        |
-------------------------------------
|  Right |  Right |     Up |   Left |
-------------------------------------

Final Policy Values (after iteration)
-------------------------------------
|   0.81 |   0.90 |   1.00 |   0.00