#### Sutton and Barto, Reinforcement Learning 2nd. Edition, page 83.

![Sutton and Barto, Reinforcement Learning 2nd. Edition.](./Figures/ValueIteration.png)
Value Iteration, for estimating π ≈ π*



**Value Iteration**



Given a **policy**, one finds the associated **values** using the **iterative policy evaluation** algorithm. Given **values**, one can find the associated greedy **policy** by choosing, at each state, the decision with the best one-step lookahead. Alternating policy evaluation and policy improvement until the policy no longer changes is the **policy iteration** algorithm.
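As a minimal sketch of the "values to policy" step, the snippet below picks the greedy decision at each state of a toy deterministic problem. The `transitions` dictionary, the state names, and the `greedy_policy` helper are hypothetical and exist only for illustration; they are not part of `rlgridworld`.

```python
# Hypothetical toy problem: transitions[state][action] = (reward, next_state).
transitions = {
    "A": {"right": (0.0, "B")},
    "B": {"left": (0.0, "A"), "right": (1.0, "END")},
}
# Assumed state values, as if already computed by some evaluation step.
values = {"A": 0.9, "B": 1.0, "END": 0.0}


def greedy_policy(transitions, values, gamma=0.9):
    """Return, for each state, the action with the best one-step lookahead."""
    policy = {}
    for state, actions in transitions.items():
        policy[state] = max(
            actions,
            key=lambda a: actions[a][0] + gamma * values[actions[a][1]],
        )
    return policy


policy = greedy_policy(transitions, values)
print(policy)
```

At state `B`, moving right scores 1.0 and moving left scores 0.9 * 0.9 = 0.81, so the greedy choice is `right`.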



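**Value iteration** interleaves value calculations with policy updates: each sweep updates every state's value using the best available decision, which folds the policy improvement step into the value update. Convergence occurs when the values stop changing (in practice, when the largest change in a sweep falls below a small threshold).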



Note that for this algorithm one does not need an initial policy.
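The reason no policy is needed shows up in the update rule itself, which refers only to values. In the notation of Sutton and Barto, one sweep of value iteration applies

$$
v_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_k(s') \,\bigr]
$$

to every state $s$. Assuming deterministic moves, as the pseudocode in this notebook does, the sum has a single term and the rule reduces to

$$
v_{k+1}(s) = \max_{a} \bigl[\, r(s, a) + \gamma\, v_k\bigl(s'(s, a)\bigr) \,\bigr],
$$

where $r(s, a)$ and $s'(s, a)$ are shorthand introduced here for the reward and the destination of decision $a$ at state $s$.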



**Value Iteration Algorithm**

```python

initialize values (to zero, or randomly)

while not converged:

    for each state:

        value = max over decisions at that state of (reward + gamma * value at dest)

compute policy from values

```
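The pseudocode above can be sketched as runnable Python on a small deterministic chain. This is an illustration under stated assumptions, not the notebook's implementation: the `chain` dictionary, the state names, and `value_iteration_sketch` are hypothetical, and values are updated in place during each sweep.

```python
def value_iteration_sketch(transitions, gamma=0.9, epsilon=1e-3):
    """Value iteration on a deterministic problem given as
    transitions[state][action] = (reward, next_state)."""
    # Start every state at zero, including terminal states that only
    # appear as destinations.
    values = {state: 0.0 for state in transitions}
    for actions in transitions.values():
        for _, next_state in actions.values():
            values.setdefault(next_state, 0.0)

    while True:
        biggest_change = 0.0
        for state, actions in transitions.items():
            # Best one-step lookahead over all decisions at this state.
            new_value = max(
                reward + gamma * values[next_state]
                for reward, next_state in actions.values()
            )
            biggest_change = max(biggest_change, abs(new_value - values[state]))
            values[state] = new_value
        if biggest_change < epsilon:
            return values


# Hypothetical three-state chain; entering END pays a reward of 1.
chain = {
    "A": {"right": (0.0, "B")},
    "B": {"left": (0.0, "A"), "right": (0.0, "C")},
    "C": {"left": (0.0, "B"), "right": (1.0, "END")},
}
values = value_iteration_sketch(chain)
print({state: round(v, 2) for state, v in values.items()})
```

With `gamma=0.9`, the values settle at 1.0 for `C`, 0.9 for `B`, and 0.81 for `A`: each step further from the reward multiplies the value by `gamma`.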

### Import Necessary Libraries

Import functions required for creating the grid world environment and computing the policy from state values.

In [1]:
from rlgridworld.standard_grid import create_standard_grid
from rlgridworld.algorithms import compute_policy_from_values

### Define Value Iteration Function

Implement the Value Iteration algorithm as described in Sutton and Barto (page 83). This function iteratively updates the value of each state based on the maximum expected return achievable from that state, until the values converge (the maximum change in value across all states is below a small threshold `epsilon`). Unlike Policy Iteration, it doesn't explicitly store or iterate on a policy during the value calculation phase.

In [4]:
# from page 83 of Sutton and Barto, RL 2nd. Ed.
def value_iteration(gw, gamma=0.9, epsilon=0.001):
    """Update the values stored in gw in place; return the number of sweeps."""
    count = 0  # number of sweeps over the state space
    while True:
        count += 1
        biggest_change_in_value = 0
        for node in gw:
            state = node.state
            if not gw.is_terminal(state) and not gw.is_barrier(state):
                old_value = gw.get_value(state)
                new_value = float('-inf')
                # valid decisions and rewards at current state
                dr = gw.valid_decisions_and_rewards(state)
                for action in dr:
                    reward = gw.get_reward_for_action(state, action)
                    value_at_dest = gw.get_value_at_destination(state, action)
                    value = reward + gamma*value_at_dest
                    if value > new_value:
                        new_value = value
                # store the best value once, after all decisions are compared
                gw.set_value(state, new_value)
                biggest_change_in_value = max(biggest_change_in_value,
                                              abs(new_value - old_value))
        # stop when no state's value changed by epsilon or more in this sweep
        if biggest_change_in_value < epsilon:
            break
    return count

### Create Grid World and Run Value Iteration

1.  Instantiate the standard grid world environment.
2.  Print the initial state values (usually initialized to zero).
3.  Execute the `value_iteration` function to compute the optimal state values. This function modifies the values stored within the `gw` object.
4.  Print the converged state values after value iteration.
5.  Compute the optimal policy derived from the final state values using `compute_policy_from_values`.
6.  Print the resulting optimal policy.

In [5]:
gw = create_standard_grid()

print("")
print("Initial Values")
gw.print_values()

# compute values using Value Iteration
# This modifies the values within the 'gw' object directly
value_iteration(gw)

print("")
print("Values after Value Iteration")
gw.print_values()

# compute policy from the final values stored in 'gw'
policy = compute_policy_from_values(gw)

print("") 
print("Optimal Policy from Value Iteration")
gw.print_policy(policy)
print("") # Add a final newline for cleaner output


Initial Values
-------------------------------------
|   0.00 |   0.00 |   0.00 |   0.00 |
-------------------------------------
|   0.00 |   0.00 |   0.00 |   0.00 |
-------------------------------------
|   0.00 |   0.00 |   0.00 |   0.00 |
-------------------------------------

Values after Value Iteration
-------------------------------------
|   0.81 |   0.90 |   1.00 |   0.00 |
-------------------------------------
|   0.73 |   0.00 |   0.90 |   0.00 |
-------------------------------------
|   0.66 |   0.73 |   0.81 |   0.73 |
-------------------------------------

Optimal Policy from Value Iteration
-------------------------------------
|  Right |  Right |  Right |        |
-------------------------------------
|     Up |        |     Up |        |
-------------------------------------
|  Right |  Right |     Up |   Left |
-------------------------------------
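The converged values printed above follow a simple pattern that can be checked by hand. The sketch below assumes deterministic moves, zero reward for ordinary steps, and a reward of 1 for entering the top-right terminal state; the source does not state these assumptions, but they are consistent with the printed table. Under them, a state that is `k` steps away from collecting the reward has value `gamma**k`.

```python
gamma = 0.9

# Value of a state that needs k more steps after the current one
# before the reward of 1 is collected (k = 0 is the cell next to the goal).
discounted = [round(gamma ** k, 2) for k in range(5)]
print(discounted)
```

These numbers match the table: 1.00 next to the goal, then 0.90, 0.81, 0.73, and 0.66 in the bottom-left corner, which is five moves from the goal.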

