# Iterative Policy Evaluation

This notebook demonstrates the Iterative Policy Evaluation algorithm, used to estimate the state-value function V(s) for a given policy in a Reinforcement Learning environment.

Reference: Sutton and Barto, *Reinforcement Learning: An Introduction*, 2nd edition, page 75.

![Iterative Policy Evaluation Algorithm](./figures/IterativePolicyEvaluation.png)
*Figure: Pseudocode for Iterative Policy Evaluation from Sutton and Barto.*

## Concept

A **policy** π specifies which action to take in each state. Iterative Policy Evaluation computes the value of each state, i.e., the expected discounted return when the agent starts in that state and follows π from then on.

The core idea is to sweep repeatedly through all states, replacing each state's value with the immediate reward plus the discounted value of the next state reached under the policy. Sweeps continue until the values converge, i.e., until the largest change in any state's value during a sweep falls below a small threshold θ.
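
In the notation of Sutton and Barto, this update is the Bellman expectation backup. For a general (possibly stochastic) policy π and dynamics p, one sweep computes:

$$
V_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V_k(s')\bigr]
$$

When the policy and the transitions are both deterministic, which is what the code in this notebook assumes, each sum has a single term and the update reduces to:

$$
V_{k+1}(s) = r\bigl(s, \pi(s)\bigr) + \gamma\, V_k\bigl(s'\bigr), \qquad s' = \text{next state after taking } \pi(s) \text{ in } s
$$

The subscripts describe the two-array form. The in-place variant used here overwrites V(s) immediately, so states visited later in a sweep may already see updated values.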

**Algorithm Outline:**
```
Initialize V(s) arbitrarily for all non-terminal states (e.g., V(s)=0), and V(terminal)=0
Loop until convergence:
  Δ ← 0
  Loop for each state s:
    v ← V(s)  # Store old value
    # Update rule based on the expected value under the policy π
    V(s) ← Σ_{s',r} [ P(s',r|s,π(s)) * (r + γ * V(s')) ]
    # Simplified for deterministic policy/environment:
    # V(s) ← reward(s, π(s)) + γ * V(next_state(s, π(s)))
    Δ ← max(Δ, |v - V(s)|)
  If Δ < θ (a small threshold): break loop
```
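
The outline above can be sketched in a few lines of self-contained Python. This is a minimal illustration, not the notebook's implementation: the function `evaluate_policy`, the `step` callback, and the four-state corridor are hypothetical names and data introduced only for this example, and both the policy and the transitions are assumed to be deterministic.

```python
def evaluate_policy(states, policy, step, gamma=0.9, theta=1e-3):
    """In-place iterative policy evaluation for a deterministic toy MDP.

    `step(s, a)` must return `(next_state, reward)`. States missing from
    `policy` are treated as terminal and keep V = 0.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s not in policy:  # terminal state: value stays 0
                continue
            v = V[s]  # store old value
            s_next, r = step(s, policy[s])
            V[s] = r + gamma * V[s_next]  # deterministic Bellman backup
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V


# Hypothetical corridor 0 - 1 - 2 - 3: entering the terminal state 3
# pays +1, every other move pays 0.
def corridor_step(s, a):
    s_next = s + 1 if a == "right" else max(s - 1, 0)
    return s_next, (1.0 if s_next == 3 else 0.0)


values = evaluate_policy(
    states=[0, 1, 2, 3],
    policy={0: "right", 1: "right", 2: "right"},
    step=corridor_step,
)
print({s: round(v, 2) for s, v in values.items()})
# {0: 0.81, 1: 0.9, 2: 1.0, 3: 0.0}
```

Each step away from the reward multiplies the value by γ, which is the same pattern the grid world results in this notebook show.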

## 1. Setup: Grid World Environment

Import necessary libraries and create the standard grid world environment.

In [1]:
from rlgridworld.standard_grid import create_standard_grid

gw = create_standard_grid()

# Optional: modify rewards if needed (example commented out)
# gw.set_reward((0, 0), "up", -2)
# gw.set_reward((0, 1), "right", -2)

## 2. Define the Policy (π)

Specify the policy to evaluate as a dictionary mapping each state `(row, col)` to an action (`'up'`, `'down'`, `'left'`, or `'right'`). An empty string (`''`) marks a terminal or barrier state, where no action is taken. As the printed grids in this notebook show, row 0 is the bottom row, so `'up'` moves toward higher row indices.

In [2]:
policy = { 
    (0,0):'up', (0,1):'right',(0,2):'right',(0,3):'up',
    (1,0):'up', (1,1):'', (1,2):'right', (1,3):'',
    (2,0):'right', (2,1):'right', (2,2):'right', (2,3):''
}
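
A quick check can catch typos in a hand-written policy before it is evaluated. The sketch below repeats the dictionary and verifies that every state of the grid has an entry and that every entry is a known action. `VALID_ACTIONS` and the 3×4 grid size are assumptions based on the description and the printed grids in this notebook, not values read from `rlgridworld`.

```python
# Assumed action set and grid size (not taken from rlgridworld).
VALID_ACTIONS = {"up", "down", "left", "right", ""}
N_ROWS, N_COLS = 3, 4

policy = {
    (0, 0): "up", (0, 1): "right", (0, 2): "right", (0, 3): "up",
    (1, 0): "up", (1, 1): "", (1, 2): "right", (1, 3): "",
    (2, 0): "right", (2, 1): "right", (2, 2): "right", (2, 3): "",
}

# States that have no policy entry at all.
expected_states = {(r, c) for r in range(N_ROWS) for c in range(N_COLS)}
missing = expected_states - set(policy)

# Entries whose action is not one of the recognised strings.
unknown = {s: a for s, a in policy.items() if a not in VALID_ACTIONS}

print("missing states:", sorted(missing))
print("unknown actions:", unknown)
```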

## 3. Initial State

Print the policy and the initial state values, which are all zero before evaluation.

In [3]:
print("Policy to Evaluate:")
gw.print_policy(policy)

print("\nInitial Values (Before Evaluation):")
gw.print_values()

Policy to Evaluate:
-------------------------------------
|  Right |  Right |  Right |        |
-------------------------------------
|     Up |        |  Right |        |
-------------------------------------
|     Up |  Right |  Right |     Up |
-------------------------------------

Initial Values (Before Evaluation):
-------------------------------------
|   0.00 |   0.00 |   0.00 |   0.00 |
-------------------------------------
|   0.00 |   0.00 |   0.00 |   0.00 |
-------------------------------------
|   0.00 |   0.00 |   0.00 |   0.00 |
-------------------------------------


## 4. Iterative Policy Evaluation Implementation

The function `iterative_policy_evaluation` implements the algorithm described above.

- It repeatedly loops through all states (`while True` loop, breaks when converged).
- For each non-terminal, non-barrier state:
    - It gets the current value `old_value`.
    - It finds the action prescribed by the `policy` for that state.
    - It computes `new_value` with the Bellman expectation equation for V(s), which for a deterministic policy and deterministic transitions reduces to `new_value = reward + gamma * value_at_dest`.
    - It writes the new value back to the grid world with `gw.set_value(state, new_value)`, so later states in the same sweep already see it.
    - It keeps track of the `biggest_change` in values during the sweep.
- The outer loop terminates (`break`) when the `biggest_change` across all states in a sweep is less than a small threshold `theta`, indicating convergence.

In [4]:
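def iterative_policy_evaluation(gw, policy, gamma=0.9, theta=0.001):
    """Evaluate `policy` in place: overwrite the values stored in `gw` with V(s).

    Sweeps stop once the largest change in a sweep is below `theta`.
    Returns None; read the results with `gw.get_value` or `gw.print_values`.
    """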
    while True:
        biggest_change = 0
        # Loop through each state in the grid
        for node in gw:
            state = node.state
            # Only evaluate non-terminal and non-barrier states
            if not gw.is_terminal(state) and not gw.is_barrier(state):
                # Get current (old) value
                old_value = gw.get_value(state)
                
                # Get action from the fixed policy
                action = policy[state]
                
                # Get immediate reward for taking that action from this state
                reward = gw.get_reward_for_action(state, action)
                
                # Get the value of the state we'd land in (destination)
                value_at_dest = gw.get_value_at_destination(state, action)
                
                # Compute the updated value using the Bellman expectation equation
                new_value = reward + gamma * value_at_dest
                
                # Update the value function for the state
                gw.set_value(state, new_value)
                
                # Track the maximum change in value across all states
                biggest_change = max(
                    biggest_change, abs(new_value - old_value))
                    
        # Check for convergence after iterating over all states
        if biggest_change < theta:
            break # Values have converged

## 5. Run Evaluation (Gamma = 0.9)

Execute the iterative policy evaluation function with a discount factor `gamma = 0.9` and print the resulting converged values.

In [5]:
print("Policy:")
gw.print_policy(policy)

# Run the evaluation
iterative_policy_evaluation(gw, policy, gamma=0.9)

print("\nConverged Values for the policy (gamma=0.9):")
gw.print_values()

Policy:
-------------------------------------
|  Right |  Right |  Right |        |
-------------------------------------
|     Up |        |  Right |        |
-------------------------------------
|     Up |  Right |  Right |     Up |
-------------------------------------

Converged Values for the policy (gamma=0.9):
-------------------------------------
|   0.81 |   0.90 |   1.00 |   0.00 |
-------------------------------------
|   0.73 |   0.00 |  -1.00 |   0.00 |
-------------------------------------
|   0.66 |  -0.81 |  -0.90 |  -1.00 |
-------------------------------------
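
As a sanity check on the table above, the value of a state under a deterministic policy equals the discounted return of the single trajectory that starts there. The sketch below follows the policy and sums the discounted rewards. The move offsets and the reward layout (+1 for entering `(2, 3)`, -1 for entering `(1, 3)`, 0 elsewhere) are assumptions inferred from the printed grids, not read from `rlgridworld`, and `discounted_return` is a hypothetical helper.

```python
# Assumptions inferred from the printed grids (not read from rlgridworld):
# states are (row, col) with row 0 at the bottom, so "up" increases the row.
MOVES = {"up": (1, 0), "down": (-1, 0), "left": (0, -1), "right": (0, 1)}
# Assumed rewards: +1 for entering (2, 3), -1 for entering (1, 3), 0 otherwise.
ENTRY_REWARD = {(2, 3): 1.0, (1, 3): -1.0}

policy = {
    (0, 0): "up", (0, 1): "right", (0, 2): "right", (0, 3): "up",
    (1, 0): "up", (1, 1): "", (1, 2): "right", (1, 3): "",
    (2, 0): "right", (2, 1): "right", (2, 2): "right", (2, 3): "",
}


def discounted_return(start, policy, gamma):
    """Follow the policy from `start` and sum the discounted rewards.

    Hypothetical helper with no wall or barrier handling, which is enough
    here because this policy never moves into a wall or the barrier.
    """
    state, total, discount = start, 0.0, 1.0
    while policy[state] != "":  # an empty action marks the end of the episode
        d_row, d_col = MOVES[policy[state]]
        state = (state[0] + d_row, state[1] + d_col)
        total += discount * ENTRY_REWARD.get(state, 0.0)
        discount *= gamma
    return total


for start in [(2, 0), (0, 0), (0, 1)]:
    print(start, round(discounted_return(start, policy, 0.9), 2))
# (2, 0) 0.81
# (0, 0) 0.66
# (0, 1) -0.81
```

These match the corresponding cells of the table: the reward is reached after three steps from `(2, 0)` and `(0, 1)`, giving ±γ², and after five steps from `(0, 0)`, giving γ⁴.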


## 6. Run Evaluation (Gamma = 0.8)

Repeat the evaluation with a discount factor `gamma = 0.8`. Because `iterative_policy_evaluation` modifies `gw` in place, this run starts from the values left by the `gamma = 0.9` run rather than from zeros. For a discount factor below 1 the algorithm converges to the same values from any starting point, so the result is unaffected (to within `theta`); only the number of sweeps may differ. For simplicity, we reuse the same `gw` and overwrite the previous values.
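
If a clean start is preferred, the values can be reset before the second run. The helper below is a sketch that relies only on calls already used in this notebook (iterating over `gw`, `node.state`, and `gw.set_value`). `reset_values` and the `StubGrid` stand-in used to demonstrate it are hypothetical and not part of `rlgridworld`.

```python
from types import SimpleNamespace


def reset_values(gw, value=0.0):
    """Set every state's value to `value` (assumes iterating `gw` visits all states)."""
    for node in gw:
        gw.set_value(node.state, value)


class StubGrid:
    """Hypothetical stand-in exposing only iteration, set_value and get_value."""

    def __init__(self, values):
        self._values = dict(values)

    def __iter__(self):
        # Yield node-like objects with a `.state` attribute, as the loop above expects.
        return (SimpleNamespace(state=s) for s in list(self._values))

    def set_value(self, state, value):
        self._values[state] = value

    def get_value(self, state):
        return self._values[state]


stub = StubGrid({(0, 0): 0.66, (0, 1): -0.81})
reset_values(stub)
print(stub.get_value((0, 0)), stub.get_value((0, 1)))  # 0.0 0.0
```

With the real grid, the call would be `reset_values(gw)` before the evaluation cell.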

In [6]:
print("Policy:")
gw.print_policy(policy)

iterative_policy_evaluation(gw, policy, gamma=0.8)

print("\nConverged Values for the policy (gamma=0.8):")
gw.print_values()

Policy:
-------------------------------------
|  Right |  Right |  Right |        |
-------------------------------------
|     Up |        |  Right |        |
-------------------------------------
|     Up |  Right |  Right |     Up |
-------------------------------------

Converged Values for the policy (gamma=0.8):
-------------------------------------
|   0.64 |   0.80 |   1.00 |   0.00 |
-------------------------------------
|   0.51 |   0.00 |  -1.00 |   0.00 |
-------------------------------------
|   0.41 |  -0.64 |  -0.80 |  -1.00 |
-------------------------------------
