# Policy Improvement
This notebook implements policy improvement.

Policy improvement takes the following parameters as **input**:
- `env`: This is an instance of an OpenAI Gym environment, where `env.P` returns the one-step dynamics.
- `V`: This is a 1D numpy array with `V.shape[0]` equal to the number of states (`env.nS`).  `V[s]` contains the estimated value of state `s`.
- `gamma`: This is the discount rate.  It must be a value between 0 and 1, inclusive (default value: `1`).

The algorithm returns the following as **output**:
- `policy`: This is a 2D numpy array with `policy.shape[0]` equal to the number of states (`env.nS`), and `policy.shape[1]` equal to the number of actions (`env.nA`).  `policy[s][a]` returns the probability that the agent takes action `a` while in state `s` under the policy.
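
To make these data layouts concrete, here is a minimal sketch that does not need Gym. The two-state MDP is hypothetical and written by hand in the same layout as `env.P`, assuming the usual toy-text convention that `P[s][a]` is a list of `(prob, next_state, reward, done)` tuples.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (not FrozenLake), laid out like `env.P`:
# P[s][a] is a list of (prob, next_state, reward, done) tuples.
P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)]},
}
nS, nA = 2, 2

# The outcome probabilities of every (state, action) pair sum to 1.
for s in range(nS):
    for a in range(nA):
        assert np.isclose(sum(prob for prob, _, _, _ in P[s][a]), 1.0)

# A valid policy is an (nS, nA) array whose rows are probability distributions.
policy = np.array([[0.0, 1.0],   # state 0: always take action 1
                   [0.5, 0.5]])  # state 1: choose between both actions uniformly
assert policy.shape == (nS, nA)
assert np.allclose(policy.sum(axis=1), 1.0)
```

Each row of `policy` sums to 1, which is the property the returned array should satisfy for every state.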

In [None]:
# Import the necessary packages
import numpy as np
from gym.envs.toy_text import frozen_lake

In [None]:
# Create the OpenAI Gym 'frozen_lake' environment
frozen_lake_env = frozen_lake.FrozenLakeEnv()

### Let's implement the Policy Improvement algorithm

<img src='./image/policy_improvement.jpg' width='600'>
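
For reference, the computation carried out by the code in this notebook can be written as follows. This assumes the standard notation in which $p(s', r \mid s, a)$ denotes the one-step dynamics stored in `env.P`; the symbols are not taken from the figure itself.

$$q(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V(s')\bigr]$$

$$\pi'(s) \in \arg\max_{a} q(s, a)$$

When several actions attain the maximum, the implementation here spreads the probability equally among them instead of picking one.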

In [None]:
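class Policy_Improvement(object):
    """Greedy policy improvement with respect to a state-value estimate."""

    def action_value(self, env, V, s, gamma=1):
        """Return q, where q[a] is the one-step lookahead value of action a in state s."""
        q = np.zeros(env.nA)
        for a in range(env.nA):
            # env.P[s][a] lists every possible outcome of taking action a in state s
            for prob, next_state, reward, done in env.P[s][a]:
                q[a] += prob * (reward + gamma * V[next_state])
        return q

    def policy_improvement(self, env, V, gamma=1):
        """Return a policy that is greedy with respect to V."""
        # Every row is overwritten below, so a zero initialization is sufficient
        policy = np.zeros([env.nS, env.nA])
        for s in range(env.nS):
            q = self.action_value(env, V, s, gamma)
            # Collect all maximizing actions so that ties are not broken arbitrarily
            best_a = np.argwhere(q == np.max(q)).flatten()
            # Split the probability evenly among the tied best actions
            policy[s] = np.sum([np.eye(env.nA)[i] for i in best_a], axis=0) / len(best_a)
        return policy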
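
As a quick sanity check, the sketch below restates the same greedy logic as a standalone function so that it runs without Gym. The function name `greedy_policy`, the three-state chain `toy_env`, and the value estimates `V` are all hypothetical and exist only for illustration.

```python
import numpy as np
from types import SimpleNamespace

def greedy_policy(env, V, gamma=1.0):
    """Greedy improvement with respect to V; tied actions share probability equally."""
    policy = np.zeros((env.nS, env.nA))
    for s in range(env.nS):
        # One-step lookahead value of every action in state s
        q = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, next_state, reward, done in env.P[s][a]:
                q[a] += prob * (reward + gamma * V[next_state])
        # Indices of all maximizing actions
        best = np.flatnonzero(q == q.max())
        policy[s, best] = 1.0 / len(best)
    return policy

# Hypothetical 3-state chain: state 2 is terminal, and entering it from state 1 pays 1.
toy_env = SimpleNamespace(
    nS=3, nA=2,
    P={
        0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
        1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
        2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
    },
)
V = np.array([0.5, 1.0, 0.0])  # assumed value estimates, chosen by hand

improved = greedy_policy(toy_env, V, gamma=0.9)
print(improved)
```

In states 0 and 1 the second action has the strictly larger lookahead value (0.9 versus 0.45, and 1.0 versus 0.45), so it receives probability 1. In the terminal state both actions are worth 0, so the tie is split and each receives probability 0.5.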