# Policy Iteration in Reinforcement Learning

Policy iteration is a dynamic programming algorithm that computes an optimal policy $\pi^*$ for a Markov Decision Process (MDP) whose transition probabilities and rewards are known. It starts from an arbitrary policy and alternates between evaluating that policy and improving it until the policy stops changing.

## Policy Evaluation

Policy evaluation is the process of determining the state-value function $V^{\pi}$ for a given policy $\pi$. This involves solving the Bellman expectation equation for $V^{\pi}$:

---


>$$V^{\pi}(s) = \mathbb{E}_{\pi} [R_{t+1} + \gamma V^{\pi}(S_{t+1}) | S_t = s]$$


---

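In practice, $V^{\pi}$ is usually computed iteratively: starting from an arbitrary estimate, the right-hand side of the equation is applied as an update to every state, and the sweeps are repeated until the largest change in any state's value falls below a small threshold.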
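
Written out for an MDP with transition probabilities $P(s'|s,a)$ and rewards $R(s,a,s')$, and with $\pi(a|s)$ denoting the probability that the policy selects action $a$ in state $s$, the expectation expands to a double sum. The iterate $V_k$ and the sweep index $k$ are notation introduced here for illustration:

>$$V^{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^{\pi}(s') \right]$$

Iterative policy evaluation turns this equation into an update rule that is applied to every state in each sweep:

>$$V_{k+1}(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V_{k}(s') \right]$$

For $\gamma < 1$ the sequence $V_0, V_1, \dots$ converges to $V^{\pi}$ from any starting point $V_0$.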

## Policy Improvement

Policy improvement uses the value function $V^{\pi}$ to construct a policy $\pi'$ that is at least as good as $\pi$. The new policy acts greedily with respect to $V^{\pi}$, choosing in each state the action with the highest expected one-step return:

---


>$$\pi'(s) = \arg\max_{a} \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^{\pi}(s') \right]$$


---

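The maximization is carried out once for every state. If $\pi'$ is identical to $\pi$, the policy is already greedy with respect to its own value function and is therefore optimal; otherwise $\pi'$ becomes the policy to evaluate next.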
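
The reason the greedy step cannot make the policy worse is the policy improvement theorem, sketched here. The action-value function $Q^{\pi}$ is notation introduced for this sketch and is not used elsewhere in this document. Define

>$$Q^{\pi}(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^{\pi}(s') \right]$$

so that $\pi'(s) = \arg\max_{a} Q^{\pi}(s,a)$. Because $V^{\pi}(s)$ is an average of $Q^{\pi}(s,a)$ over the actions that $\pi$ selects, the maximizing action satisfies

>$$Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s) \ \text{ for all } s \quad \Longrightarrow \quad V^{\pi'}(s) \ge V^{\pi}(s) \ \text{ for all } s$$

If equality holds in every state, $V^{\pi}$ satisfies the Bellman optimality equation and $\pi$ is already optimal.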


## Convergence of Policy Iteration

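For a finite MDP with $\gamma < 1$, policy iteration terminates after a finite number of iterations: there are only finitely many deterministic policies, and each improvement step yields a strictly better policy unless the current one is already optimal. The algorithm stops when the policy no longer changes between iterations, at which point it is an optimal policy $\pi^*$. Each iteration consists of two steps: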

1. **Policy Evaluation**: Compute the value function $V^{\pi}$ for the current policy $\pi$.
2. **Policy Improvement**: Use $V^{\pi}$ to find a better policy $\pi'$.

These steps are repeated until the policy converges to the optimal policy $\pi^*$.


## Policy Iteration Algorithm

The policy iteration algorithm for computing $\pi^*$ is as follows:

1. **Initialization**: Start with an arbitrary policy $\pi$.
2. **Policy Evaluation**: Compute the value function $V^{\pi}$ for the current policy $\pi$.
3. **Policy Improvement**: Update the policy $\pi$ using the value function $V^{\pi}$.
4. **Convergence Check**: If the policy $\pi$ has not changed, stop. Otherwise, go back to step 2.

### Pseudocode


1. Initialize $\pi$ arbitrarily
2. Repeat:
    - Policy Evaluation: Compute $V^\pi$
    - Policy Improvement: Update $\pi$ using $V^\pi$
    - If $\pi$ is unchanged: Converged

## Example Implementation of Policy Iteration in Python

Below is an example implementation of the Policy Iteration algorithm in Python.


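# Define the MDP: states, actions, transition probabilities, and rewards.
# transition_probs[s][a][s_prime] is the probability P(s_prime | s, a);
# rewards[s][a][s_prime] is the reward R(s, a, s_prime) for that transition.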
states = [0, 1, 2]
actions = ['a', 'b']
transition_probs = {
    0: {'a': {0: 0.5, 1: 0.5}, 'b': {1: 1.0}},
    1: {'a': {0: 1.0}, 'b': {2: 1.0}},
    2: {'a': {2: 1.0}, 'b': {0: 1.0}}
}
rewards = {
    0: {'a': {0: 0, 1: 1}, 'b': {1: 2}},
    1: {'a': {0: 0}, 'b': {2: 3}},
    2: {'a': {2: 0}, 'b': {0: 4}}
}
gamma = 0.9   # discount factor
theta = 0.01  # evaluation stops when the largest update in a sweep is below this

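def policy_evaluation(policy, states, actions, transition_probs, rewards, gamma, theta):
    """Iteratively estimate V^pi for a stochastic policy given as policy[s][a]."""
    V = {s: 0 for s in states}
    while True:
        delta = 0  # largest change to any state's value in this sweep
        for s in states:
            v = V[s]
            # Bellman expectation backup. V is updated in place, so states
            # visited later in the same sweep already see the new values.
            new_v = sum(
                policy[s][a] * sum(
                    transition_probs[s][a][s_prime]
                    * (rewards[s][a][s_prime] + gamma * V[s_prime])
                    for s_prime in transition_probs[s][a]
                )
                for a in actions
            )
            V[s] = new_v
            delta = max(delta, abs(v - new_v))
        if delta < theta:
            break
    return V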

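def policy_improvement(states, actions, transition_probs, rewards, V, gamma):
    """Return the deterministic policy that is greedy with respect to V."""
    policy = {}
    for s in states:
        # One-step lookahead: expected return of taking a in s, then following V.
        action_values = {}
        for a in actions:
            action_values[a] = sum(
                transition_probs[s][a][s_prime]
                * (rewards[s][a][s_prime] + gamma * V[s_prime])
                for s_prime in transition_probs[s][a]
            )
        # Ties are broken in favor of the action listed first in `actions`.
        best_action = max(action_values, key=action_values.get)
        policy[s] = {a: 1 if a == best_action else 0 for a in actions}
    return policy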

def policy_iteration(states, actions, transition_probs, rewards, gamma, theta):
    # Start from the uniform random policy.
    policy = {s: {a: 1 / len(actions) for a in actions} for s in states}
    while True:
        V = policy_evaluation(policy, states, actions, transition_probs, rewards, gamma, theta)
        new_policy = policy_improvement(states, actions, transition_probs, rewards, V, gamma)
        # Stop once the greedy policy equals the policy that was just evaluated,
        # so the returned V is the value estimate of the returned policy.
        if new_policy == policy:
            break
        policy = new_policy
    return policy, V

# Run policy iteration
optimal_policy, optimal_value_function = policy_iteration(states, actions, transition_probs, rewards, gamma, theta)

print("Optimal Policy:")
for state in optimal_policy:
    print(f"State {state}: {optimal_policy[state]}")

print("\nOptimal State-Value Function:")
for state in optimal_value_function:
    print(f"V({state}) = {optimal_value_function[state]}")


Optimal Policy:
State 0: {'a': 0, 'b': 1}
State 1: {'a': 0, 'b': 1}
State 2: {'a': 0, 'b': 1}

Optimal State-Value Function:
V(0) = 29.254688524862296
V(1) = 30.292367643612547
V(2) = 30.329219672376066
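
As a rough cross-check of the printed values, note that the Bellman expectation equation for a fixed deterministic policy is a linear system, $(I - \gamma P_{\pi}) V = r_{\pi}$, which can be solved directly. The sketch below assumes the all-`b` policy reported above; the names `P_pi`, `r_pi`, and `V_exact` are introduced here for illustration and are not part of the original code.

```python
import numpy as np

gamma = 0.9

# Transition matrix under the policy that picks 'b' in every state:
# 0 -> 1, 1 -> 2, 2 -> 0, each with probability 1.
P_pi = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
])

# Reward collected in each state under that policy.
r_pi = np.array([2.0, 3.0, 4.0])

# Solve (I - gamma * P_pi) V = r_pi for V.
V_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

for s, v in enumerate(V_exact):
    print(f"V({s}) = {v:.4f}")
```

The exact values come out at roughly 29.30, 30.33, and 30.37, slightly above the printed estimates. That gap is consistent with iterative evaluation stopping as soon as the largest update falls below `theta = 0.01`.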


## Explanation of the Policy Iteration Implementation

The implementation of the policy iteration algorithm involves the following steps:

1. **Initialization**:
   - We start with the uniform random policy, in which each action is equally likely in every state.

2. **Policy Evaluation**:
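   - We iteratively compute the value function $V^{\pi}$ for the current policy $\pi$ until the largest change in a sweep falls below the tolerance `theta`.
   - Each update replaces $V(s)$ with the expected one-step return: the immediate reward plus the discounted value of the next state, averaged over the policy's action probabilities and the transition probabilities.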

3. **Policy Improvement**:
   - Using the current value function $V^{\pi}$, we determine the best action for each state that maximizes the expected return.
   - The policy is updated to take this best action.

4. **Convergence Check**:
   - If the policy does not change after the improvement step, the algorithm has converged to the optimal policy $\pi^*$.
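
To make the improvement and convergence-check steps concrete, the sketch below recomputes the one-step lookahead values for the example MDP from the printed value function, rounded to four decimals. The helper name `action_value` and the dictionary `greedy` are assumptions introduced for this illustration.

```python
# Example MDP from the implementation above.
transition_probs = {
    0: {'a': {0: 0.5, 1: 0.5}, 'b': {1: 1.0}},
    1: {'a': {0: 1.0}, 'b': {2: 1.0}},
    2: {'a': {2: 1.0}, 'b': {0: 1.0}},
}
rewards = {
    0: {'a': {0: 0, 1: 1}, 'b': {1: 2}},
    1: {'a': {0: 0}, 'b': {2: 3}},
    2: {'a': {2: 0}, 'b': {0: 4}},
}
gamma = 0.9

# Printed state values, rounded to four decimals.
V = {0: 29.2547, 1: 30.2924, 2: 30.3292}


def action_value(s, a):
    """Expected return of taking action a in state s, then following V."""
    return sum(
        p * (rewards[s][a][s_prime] + gamma * V[s_prime])
        for s_prime, p in transition_probs[s][a].items()
    )


greedy = {}
for s in transition_probs:
    q = {a: action_value(s, a) for a in transition_probs[s]}
    greedy[s] = max(q, key=q.get)  # action with the largest lookahead value
    print(s, {a: round(val, 2) for a, val in q.items()}, "->", greedy[s])
```

In every state, action `b` has the larger lookahead value, so the greedy policy equals the policy that was just evaluated and the loop stops.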


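The optimal policy indicates the best action to take in each state, and the optimal state-value function gives the maximum expected discounted return obtainable from each state. Because policy evaluation stops at the tolerance `theta = 0.01`, the printed values are approximations of the optimal value function rather than exact solutions.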