Iterative Policy Evaluation, for estimating V

**Iterative Policy Evaluation**

A **policy** specifies which action to take in every state. Iterative policy evaluation uses a given policy to compute the value of each state: loop over all states and update each state's value using the action the policy prescribes there.

Repeat this sweep over all states until the values no longer change.

**Iterative Policy Evaluation Algorithm**
```python
# Pseudocode: given a policy
while not converged:
    for state in states:
        # update each state's value; repeat sweeps until values stop changing
        value[state] = reward + gamma * value_at_dest
```
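
The same update can be written as an equation. This is a sketch that assumes a deterministic policy and deterministic transitions, as in the grid world used here. The symbols $\pi(s)$ (the action the policy picks in state $s$), $r(s, \pi(s))$ (the immediate reward for that action), and $s'$ (the destination state) are notation introduced for illustration; $\gamma$ is the discount factor `gamma`.

$$
V(s) \leftarrow r\big(s, \pi(s)\big) + \gamma\, V(s')
$$

Sweeps stop once the largest change $\max_s \lvert V_{\text{new}}(s) - V_{\text{old}}(s) \rvert$ in a full pass over the states falls below a small threshold.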

**Create standard grid world problem and its state space**

In [14]:
from rlgridworld.standard_grid import create_standard_grid
gw = create_standard_grid()

# penalize moving up from state (0,0) with a reward of -2
gw.set_reward((0,0), "up", -2)

**Specify a policy function for the state space**

In [15]:
policy = { 
    (0,0):'up', (0,1):'right',(0,2):'right',(0,3):'up',
    (1,0):'up', (1,1):'', (1,2):'right', (1,3):'',
    (2,0):'right', (2,1):'right', (2,2):'right', (2,3):''
    }
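
Since a policy has to cover every state, a quick sanity check can catch a missing entry before evaluation. The sketch below assumes a 3-row by 4-column grid indexed as `(row, col)`, which is inferred from the policy keys rather than from the `rlgridworld` API. The names `rows`, `cols`, `missing`, and `no_action` are hypothetical helpers.

```python
policy = {
    (0,0):'up', (0,1):'right', (0,2):'right', (0,3):'up',
    (1,0):'up', (1,1):'', (1,2):'right', (1,3):'',
    (2,0):'right', (2,1):'right', (2,2):'right', (2,3):''
    }

# Assumed grid shape: 3 rows x 4 columns, keyed by (row, col)
rows, cols = 3, 4

# states that the policy forgot to mention
missing = [(r, c) for r in range(rows) for c in range(cols)
           if (r, c) not in policy]

# states with an empty action (here: the barrier and the terminal states)
no_action = sorted(s for s, a in policy.items() if a == '')

print("missing:", missing)
print("no action:", no_action)
```

An empty `missing` list means every grid cell has a policy entry, even if that entry is the empty string.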

In [16]:
print("Policy")
gw.print_policy(policy)
print("Initial Values")
gw.print_values()

Policy
-------------------------------------
|  Right |  Right |  Right |        |
-------------------------------------
|     Up |        |  Right |        |
-------------------------------------
|     Up |  Right |  Right |     Up |
-------------------------------------
Initial Values
-------------------------------------
|   0.00 |   0.00 |   0.00 |   0.00 |
-------------------------------------
|   0.00 |   0.00 |   0.00 |   0.00 |
-------------------------------------
|   0.00 |   0.00 |   0.00 |   0.00 |
-------------------------------------


The code below implements the iterative policy evaluation algorithm. It loops over all states and skips barrier and terminal states, whose values are never updated.

For every other state, the code looks up the policy's action, the immediate reward for that action, and the current value of the destination state, then performs the key computation
```python
# at each state, compute a new value; repeat until values stop changing
value = reward + gamma*value_at_dest
```
at each state.

In [17]:
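def iterative_policy_evaluation(gw, policy, gamma=0.9, theta=0.001):
    """Evaluate policy on gw in place; stop when the largest value
    change in a full sweep is below theta."""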
    while True:
        biggest_change = 0
        for node in gw:
            state = node.state
            if not gw.is_terminal(state) and not gw.is_barrier(state):
                # get current (old) value
                old_value = gw.get_value(state)
                # get action from policy
                action = policy[state]
                # get immediate reward for action
                reward = gw.get_reward_for_action(state, action)
                # get value at destination state
                value_at_dest = gw.get_value_at_destination(state, action)
                # compute new value
                new_value = reward + gamma*value_at_dest
                # set new value for state
                gw.set_value(state, new_value)
                # see if |new_value-old_value| is larger than biggest_change
                biggest_change = max(
                    biggest_change, abs(new_value-old_value))
        # iterated over all states, so see if biggest_change is small enough
        if biggest_change < theta:
            break
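
To see the same loop without the grid world class, here is a minimal sketch on plain dictionaries. The function `evaluate_policy`, its arguments `next_state`, `reward`, and `terminals`, and the three-state chain `A -> B -> T` are all hypothetical and are not part of `rlgridworld`. Like the function above, it updates values in place during each sweep.

```python
def evaluate_policy(policy, next_state, reward, terminals,
                    gamma=0.9, theta=0.001):
    """In-place iterative policy evaluation on plain dictionaries."""
    # every state starts at value 0, including terminal states
    values = {s: 0.0 for s in set(policy) | set(terminals)}
    while True:
        biggest_change = 0.0
        for state, action in policy.items():
            if state in terminals:
                continue  # terminal values are never updated
            old_value = values[state]
            dest = next_state[(state, action)]
            # rewards default to 0 when not listed
            r = reward.get((state, action), 0.0)
            values[state] = r + gamma * values[dest]
            biggest_change = max(biggest_change,
                                 abs(values[state] - old_value))
        if biggest_change < theta:
            return values

# Hypothetical chain: A -> B -> T, where entering T pays +1
policy = {"A": "right", "B": "right"}
next_state = {("A", "right"): "B", ("B", "right"): "T"}
reward = {("B", "right"): 1.0}

values = evaluate_policy(policy, next_state, reward, terminals={"T"})
print(values)
```

On this chain `B` settles at 1.0 and `A` at 0.9, one discount step further from the reward.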

In [18]:
print("Policy")
gw.print_policy(policy)
iterative_policy_evaluation(gw, policy, gamma = 0.9)
print("Values for the policy")
gw.print_values()

Policy
-------------------------------------
|  Right |  Right |  Right |        |
-------------------------------------
|     Up |        |  Right |        |
-------------------------------------
|     Up |  Right |  Right |     Up |
-------------------------------------
Values for the policy
-------------------------------------
|   0.81 |   0.90 |   1.00 |   0.00 |
-------------------------------------
|   0.73 |   0.00 |  -1.00 |   0.00 |
-------------------------------------
|  -1.34 |  -0.81 |  -0.90 |  -1.00 |
-------------------------------------
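
The left column of the table can be checked by hand by chaining the update along the policy's path. The sketch below assumes, as the printed values suggest, that the move from (2,2) into the terminal state pays +1 and that all other rewards on this path are 0 except the -2 set for moving up from (0,0). The variable names are illustrative only.

```python
gamma = 0.9

# Assumption inferred from the printed table: V(2,2) = 1.0
v_22 = 1.0
v_21 = 0.0 + gamma * v_22    # right from (2,1)
v_20 = 0.0 + gamma * v_21    # right from (2,0)
v_10 = 0.0 + gamma * v_20    # up from (1,0)
v_00 = -2.0 + gamma * v_10   # up from (0,0), reward set to -2 earlier

for name, v in [("(2,1)", v_21), ("(2,0)", v_20),
                ("(1,0)", v_10), ("(0,0)", v_00)]:
    print(f"V{name} = {v:.2f}")
```

Rounded to two decimals, these match the 0.90, 0.81, 0.73, and -1.34 entries in the table.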
