# Policy Iteration Algorithm

Policy Iteration is a dynamic programming method for solving Markov Decision Processes (MDPs) whose transition probabilities and rewards are known. It alternates between two steps, policy evaluation and policy improvement, and repeats them until the policy converges to the optimal policy.

### Steps of the Policy Iteration Algorithm:

1. **Initialization**:
    - Start with an arbitrary policy $\pi$.
    - Initialize the value function $V(s)$ for all states $s$.

2. **Policy Evaluation**:
    - Compute the value function $V^\pi(s)$ for the current policy $\pi$ by solving the Bellman equation:  
        $$  
        V^\pi(s) = \sum_{s'} P(s'|s, \pi(s)) \left[ R(s, \pi(s), s') + \gamma V^\pi(s') \right]  
        $$  
        where:  
        - $P(s'|s, a)$: Transition probability.  
        - $R(s, a, s')$: Reward for transitioning from state $s$ to $s'$ using action $a$.  
        - $\gamma$: Discount factor. 

3. **Policy Improvement**:
    - Update the policy by acting greedily with respect to the value function:
      $$
      \pi'(s) = \arg\max_a \sum_{s'} P(s'|s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]
      $$

4. **Convergence**:
    - Repeat the policy evaluation and policy improvement steps until the policy no longer changes (i.e., it converges to the optimal policy $\pi^*$).

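For finite MDPs, Policy Iteration with exact policy evaluation is guaranteed to converge to the optimal policy in a finite number of iterations: there are only finitely many deterministic policies, and each improvement step produces a policy that is at least as good as the previous one.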
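
The steps above can be sketched end to end on a tiny example. The following is a minimal illustration, not the notebook's implementation: the 3-state, 2-action MDP `P`, the discount `gamma = 0.9`, and the helper names `evaluate_exactly` and `greedy` are all assumptions made up for this sketch. It stores transitions as `(prob, next_state, reward, done)` tuples and evaluates each policy exactly by solving the linear system $(I - \gamma P_\pi) V = R_\pi$.

```python
import numpy as np

# Hypothetical toy MDP: P[state][action] -> list of (prob, next_state, reward, done).
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(0.8, 2, 1.0, True), (0.2, 1, 0.0, False)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},  # absorbing goal state
}
n_states, n_actions, gamma = 3, 2, 0.9  # gamma < 1 keeps the linear system solvable


def evaluate_exactly(policy):
    """Policy evaluation: solve (I - gamma * P_pi) V = R_pi for V."""
    P_pi = np.zeros((n_states, n_states))
    R_pi = np.zeros(n_states)
    for s in range(n_states):
        for prob, s_next, reward, _ in P[s][policy[s]]:
            P_pi[s, s_next] += prob
            R_pi[s] += prob * reward
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)


def greedy(V):
    """Policy improvement: pick the action with the highest one-step lookahead value."""
    q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            q[s, a] = sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P[s][a])
    return q.argmax(axis=1)


policy = np.zeros(n_states, dtype=int)  # arbitrary initial policy
for iteration in range(100):
    V = evaluate_exactly(policy)
    new_policy = greedy(V)
    if np.array_equal(new_policy, policy):  # policy is stable: stop
        break
    policy = new_policy

print(policy, V)
```

In this toy problem the loop stops once action `1` (move toward the goal) is chosen in both non-terminal states.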


In [25]:
import numpy as np
import gym

In [26]:
def policy_evaluation(env, policy, v, gamma=1.0):
    """
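    Perform one sweep of iterative policy evaluation for a given policy.

    Each call applies a single Bellman backup to every state, using the values
    from before the sweep; it does not iterate until the values converge.
    The array v is updated in place and also returned.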

    Parameters:
    - env: The environment object (e.g., OpenAI Gym environment) that provides the transition dynamics.
    - policy: A numpy array representing the current policy, where policy[state] gives the action to take in that state.
    - v: A numpy array representing the current value function, where v[state] is the value of the state.
    - gamma: The discount factor (default is 1.0), which determines the importance of future rewards.

    Returns:
    - v: The updated value function after policy evaluation.
    """
    
    # Create a copy of the current value function to track updates
    v_ = np.copy(v)
    
    # Iterate over all states in the environment
    for state in range(env.observation_space.n):
        # Get the action dictated by the current policy for this state
        policy_action = policy[state]
        
        # Update the value of the current state using the Bellman equation
        # Sum over all possible transitions (probability, next state, reward, done)
        # env.unwrapped reaches the base environment that holds the transition table P
        v[state] = sum([
            prob * (reward + gamma * v_[state_]) 
            for prob, state_, reward, _ in env.unwrapped.P[state][policy_action]
        ])
        
    # Return the value function after this sweep
    return v



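def policy_improvement(env, v, gamma=1.0):
    """
    Compute the greedy policy with respect to the value function v.

    Returns a numpy array where policy[state] is the action with the highest Q-value.
    """
    # Initialize a new policy with zeros for all states
    policy = np.zeros(env.observation_space.n)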
    
    # Iterate over all states in the environment
    for state in range(env.observation_space.n):
        # Initialize an array to store the action-value function (Q-values) for all actions
        q = np.zeros(env.action_space.n)
        
        # Iterate over all possible actions for the current state
        for action in range(env.action_space.n):
            # Compute the Q-value for the current action by summing over all possible transitions
            # (probability, next state, reward, done)
            q[action] = sum([
                prob * (reward + gamma * v[state_]) 
                for prob, state_, reward, _ in env.unwrapped.P[state][action]
            ])
        
        # Select the action that maximizes the Q-value and update the policy for the current state
        policy[state] = np.argmax(q)
    
    # Return the improved policy
    return policy
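
Note that `policy_evaluation` above performs a single sweep per call. A minimal sketch of the alternative, repeating sweeps until the values stop changing, is shown below. The function name `evaluate_until_converged`, the tolerance `theta`, the sweep cap `max_sweeps`, and the two-state transition table `P` are assumptions introduced for illustration; the table uses `(prob, next_state, reward, done)` tuples.

```python
import numpy as np


def evaluate_until_converged(P, policy, gamma=0.9, theta=1e-8, max_sweeps=10_000):
    """Repeat Bellman backups for a fixed policy until the largest change is below theta."""
    n_states = len(P)
    v = np.zeros(n_states)
    for _ in range(max_sweeps):
        v_old = v.copy()
        for s in range(n_states):
            v[s] = sum(
                prob * (reward + gamma * v_old[s_next])
                for prob, s_next, reward, _ in P[s][int(policy[s])]
            )
        if np.max(np.abs(v - v_old)) < theta:
            break
    return v


# Hypothetical two-state example: state 0 reaches the terminal state 1 with
# probability 0.5 (reward 1) and otherwise stays where it is (reward 0).
P = {
    0: {0: [(0.5, 0, 0.0, False), (0.5, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)]},
}
v = evaluate_until_converged(P, policy=[0, 0])
print(v)  # v[0] approaches 0.5 / (1 - 0.5 * 0.9)
```

After a single sweep `v[0]` would be only `0.5`; the fixed point is reached after further sweeps, which is why the number of sweeps per evaluation step affects the values seen by the improvement step.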

In [45]:
def policy_iteration(env, gamma=1.0):
    """
    Policy-Iteration algorithm.

    Alternates one policy-evaluation sweep with a greedy policy-improvement step
    until the policy stops changing or max_iterations is reached.
    """

    # Initialize the value function for all states to 0
    v = np.zeros(env.observation_space.n)

    # Initialize the policy randomly
    policy = np.random.choice(env.action_space.n, size=(env.observation_space.n))
    max_iterations = 200000
    for i in range(max_iterations):
        new_v = policy_evaluation(env, policy, v, gamma)
        new_policy = policy_improvement(env, new_v, gamma)
        if np.all(policy == new_policy):
            print('Policy-Iteration converged at step %d.' % (i + 1))
            # The reshape assumes the 8x8 FrozenLake grid used below
            print(new_v.reshape(8, 8))
            break
        policy = np.copy(new_policy)
        # Carry the latest value estimate into the next evaluation sweep
        v = np.copy(new_v)
    return policy






# Testing the Policy Iteration Algorithm on FrozenLake Environment

The FrozenLake environment is a classic reinforcement learning problem where the agent must navigate a grid world to reach the goal while avoiding holes. The environment is stochastic, meaning the agent's actions may not always lead to the intended outcome.
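
To make the stochasticity concrete, here is a small standalone sketch of slippery movement on an 8x8 grid. It assumes the environment's default slippery dynamics, in which the intended direction and the two perpendicular directions are each taken with probability 1/3; the helper names `slippery_outcomes`, `move`, and `next_state_distribution` are made up for this illustration, and holes and the goal are not modelled.

```python
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3


def slippery_outcomes(action):
    """Return (probability, actual_direction) pairs for an intended action (assumed 1/3 each)."""
    return [(1 / 3, (action - 1) % 4), (1 / 3, action), (1 / 3, (action + 1) % 4)]


def move(row, col, direction, size=8):
    """Move one cell in the given direction, staying inside the grid."""
    if direction == LEFT:
        col = max(col - 1, 0)
    elif direction == DOWN:
        row = min(row + 1, size - 1)
    elif direction == RIGHT:
        col = min(col + 1, size - 1)
    elif direction == UP:
        row = max(row - 1, 0)
    return row, col


def next_state_distribution(state, action, size=8):
    """Map each reachable next state to its total probability."""
    row, col = divmod(state, size)
    dist = {}
    for prob, direction in slippery_outcomes(action):
        r, c = move(row, col, direction, size)
        s_next = r * size + c
        dist[s_next] = dist.get(s_next, 0.0) + prob
    return dist


# Intending to go Right from the top-left corner (state 0)
dist = next_state_distribution(0, RIGHT)
print(dist)
```

Under these assumptions, choosing Right in the top-left corner can move the agent right (state 1), move it down (state 8), or leave it in place (state 0, after bumping into the top wall).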

### Environment Details:
- **Environment Name**: `FrozenLake8x8-v1`
- **Grid Size**: 8x8
- **Objective**: Navigate from the starting point to the goal while avoiding holes.
- **Actions**: Discrete actions represented by numbers:
    - `0`: Left
    - `1`: Down
    - `2`: Right
    - `3`: Up
- **Rewards**: A reward of 1 is given for reaching the goal; otherwise, the reward is 0.

### Testing Workflow:
1. **Policy Iteration**:
    - Use the Policy Iteration algorithm to compute the optimal policy for the FrozenLake environment.
    - The algorithm alternates between policy evaluation and policy improvement until convergence.

2. **Evaluation**:
    - Evaluate the optimal policy by running multiple episodes and calculating the average score.

3. **Visualization**:
    - Test the optimal policy by running a single episode and rendering the environment to visualize the agent's behavior.

### Results:
- **Optimal Policy**: The computed policy is stored in the variable `optimal_policy`.
- **Average Score**: The average score over multiple episodes is stored in the variable `score`.
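
With `gamma = 1.0` and the reward scheme described above, each episode's return is either 0 or 1, so the average score is the fraction of episodes that reach the goal. The sketch below shows one way to report that average together with its standard error. The helper `summarize_scores` and the simulated `fake_scores` array are hypothetical stand-ins, not results from this notebook.

```python
import numpy as np


def summarize_scores(scores):
    """Return the mean score and its standard error over the given episodes."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    stderr = scores.std(ddof=1) / np.sqrt(len(scores))
    return mean, stderr


# Hypothetical 0/1 episode returns, simulated here instead of running the environment
rng = np.random.default_rng(0)
fake_scores = rng.binomial(1, 0.7, size=100)

mean, stderr = summarize_scores(fake_scores)
print(f"average score = {mean:.2f} +/- {stderr:.2f}")
```

With only 100 episodes the standard error of a success rate can be several percentage points, so small differences between two policies may not be meaningful without more episodes.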

In [46]:
def run_episode(env, policy, gamma = 1.0, render = False):
    """
    Runs a single episode using the given policy.

    Parameters:
    - env: The environment object (e.g., OpenAI Gym environment).
    - policy: A numpy array representing the policy, where policy[state] gives the action to take in that state.
    - gamma: The discount factor (default is 1.0), which determines the importance of future rewards.
    - render: A boolean flag to render the environment during the episode (default is False).

    Returns:
    - total_reward: The total discounted reward obtained during the episode.
    """
    # Reset the environment; in the gym API used here, reset() returns (observation, info).
    observation, _ = env.reset()
    total_reward = 0  # Initialize the total reward.
    step_idx = 0  # Initialize the step counter.

    while True:
        if render:
            env.render()  # Render the environment if the render flag is True.

        action = int(policy[observation])  # Select the action based on the policy.
        # Take the action and observe the next state, reward, and termination flags.
        observation, reward, done, truncated, _ = env.step(action)
        # Accumulate the discounted reward.
        total_reward += (gamma ** step_idx * reward)
        step_idx += 1  # Increment the step counter.
        # Stop when the episode terminates or reaches the environment's time limit.
        if done or truncated:
            break

    return total_reward  # Return the total discounted reward.



def evaluate_policy(env, policy, gamma = 1.0, n = 100):
    """
    Evaluates the given policy by running multiple episodes and averaging the total rewards.

    Parameters:
    - env: The environment object (e.g., OpenAI Gym environment).
    - policy: A numpy array representing the policy, where policy[state] gives the action to take in that state.
    - gamma: The discount factor (default is 1.0), which determines the importance of future rewards.
    - n: The number of episodes to run for evaluation (default is 100).

    Returns:
    - The average total discounted reward over n episodes.
    """
    scores = [run_episode(env, policy, gamma, False) for _ in range(n)]  # Run n episodes and collect scores.
    return np.mean(scores)  # Return the average score.

In [47]:
def test_episode(env, policy):
    """
    Runs a single episode to test the policy and renders the environment.

    Parameters:
    - env: The environment object (e.g., OpenAI Gym environment).
    - policy: A numpy array representing the policy, where policy[state] gives the action to take in that state.
    """
    # Reset the environment; in the gym API used here, reset() returns (observation, info).
    observation, _ = env.reset()
    while True:
        env.render()  # Render the environment.
        # Take the action based on the policy and observe the next state and termination flags.
        observation, _, done, truncated, _ = env.step(int(policy[observation]))
        # Stop when the episode terminates or reaches the environment's time limit.
        if done or truncated:
            break

In [48]:
env_name  = 'FrozenLake8x8-v1' # 'FrozenLake-v0'
env = gym.make(env_name)
optimal_policy = policy_iteration(env, gamma = 1.0)
score = evaluate_policy(env, optimal_policy, gamma = 1.0)
print('Average scores = ', score)

Policy-Iteration converged at step 14.
[[2.23710419e-05 1.50115964e-04 8.24174273e-04 3.11605616e-03
  9.23296806e-03 1.89907148e-02 3.18189392e-02 3.91687673e-02]
 [4.74600609e-05 2.48381288e-04 9.66345381e-04 3.84761014e-03
  1.36333729e-02 2.89550277e-02 5.46041590e-02 7.24123447e-02]
 [9.97288504e-05 2.66779902e-04 6.99147329e-04 0.00000000e+00
  1.93712316e-02 3.90834647e-02 9.70311537e-02 1.32345202e-01]
 [1.54924692e-04 5.73493159e-04 1.85826001e-03 6.41400770e-03
  2.17022105e-02 0.00000000e+00 1.49852320e-01 2.19497137e-01]
 [9.36656708e-05 2.29146373e-04 5.17042866e-04 0.00000000e+00
  5.34253933e-02 1.04591939e-01 1.84205041e-01 3.44963766e-01]
 [1.67260127e-05 0.00000000e+00 0.00000000e+00 1.76505430e-02
  5.37816574e-02 1.08324139e-01 0.00000000e+00 5.25233804e-01]
 [1.75623133e-05 0.00000000e+00 1.73636919e-03 5.51519360e-03
  0.00000000e+00 1.88091539e-01 0.00000000e+00 7.50029741e-01]
 [7.48489066e-05 2.27264697e-04 6.67158830e-04 0.00000000e+00
  2.29612611e-01 4.71167

In [43]:
# env = gym.make(env_name, render_mode='human')
# test_episode(env, optimal_policy)

In [49]:
print(optimal_policy.reshape(8, 8))

[[1. 2. 2. 2. 2. 2. 2. 2.]
 [1. 2. 2. 3. 2. 2. 2. 1.]
 [1. 2. 0. 0. 2. 3. 2. 1.]
 [3. 2. 3. 1. 0. 0. 2. 1.]
 [3. 3. 0. 0. 2. 1. 3. 2.]
 [0. 0. 0. 1. 3. 0. 0. 2.]
 [0. 0. 1. 0. 0. 0. 0. 2.]
 [1. 1. 0. 0. 1. 1. 1. 0.]]
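
The numeric grid is easier to read as arrows. The helper below is a small sketch that uses the action encoding listed earlier (`0`: Left, `1`: Down, `2`: Right, `3`: Up); the names `ARROWS` and `policy_to_arrows` are introduced here for illustration only.

```python
import numpy as np

# Action encoding from the environment details: 0 Left, 1 Down, 2 Right, 3 Up
ARROWS = {0: "←", 1: "↓", 2: "→", 3: "↑"}


def policy_to_arrows(policy, shape=(8, 8)):
    """Render a policy array as a grid of arrows, one row per line."""
    grid = np.asarray(policy).astype(int).reshape(shape)
    return "\n".join(" ".join(ARROWS[int(a)] for a in row) for row in grid)


# First row of the policy printed above
first_row = [1., 2., 2., 2., 2., 2., 2., 2.]
print(policy_to_arrows(first_row, shape=(1, 8)))  # prints: ↓ → → → → → → →
```

Calling `policy_to_arrows(optimal_policy)` would render the full 8x8 grid. Entries for hole and goal cells are still shown, although the action chosen there has no effect on the outcome.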
