## Iterative Policy Evaluation / Policy Iteration
### Sean O'Malley

Reinforcement learning is a subset of machine learning that solves sequential decision-making problems by learning from a critic and a reward structure. Core to reinforcement learning is how an agent behaves in, and learns from, its environment. Two key tools for this are iterative policy evaluation and policy iteration.

__Conceptually__, in iterative policy evaluation and policy iteration, the agent has a reward signal (discrete or continuous) that it values. Through its interactions with an environment, the agent follows a policy intended to maximize this reward. The policy is valued by the reward it is expected to collect from each state. The policy can then be updated at each state by comparing that value with the values obtainable from other actions. Once an update leaves the policy unchanged, that is, the new policy matches the previous one, an optimal policy has been reached. Iterative policy evaluation measures how good a policy is, and policy iteration is the mechanism that updates the policy.

__Formally__, policy iteration manipulates the policy directly, rather than finding it indirectly via the optimal value function. In policy iteration, the policy acts greedily towards the best expected result. Eventually the policy would reach a point where continuing to iterate would no longer change anything (convergence).

How good is a policy? We know this from the value function.

Value iteration tracks only the value function and does not form a complete policy at every iteration. It is the same as policy iteration, except that it performs a single sweep of policy evaluation and then changes the policy right away.

While this requires fewer evaluation steps to find the optimal policy, its intermediate results cannot be used as a suboptimal policy, because the intermediate values do not correspond to a concrete policy. This is where policy iteration is key.
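To make the contrast concrete, here is a minimal value iteration sketch on a hypothetical four-state chain (states 0 to 3, state 3 terminal, reward 1 for entering it). The chain, `step`, `backup`, and `THETA` are illustrative assumptions and are not part of the `grid_world` module used later. Note that no policy exists inside the loop; it is only read off from the values at the end.

```python
import numpy as np

# Hypothetical toy problem (not the gridworld used later): a chain of four
# states 0..3 where state 3 is terminal and entering it pays a reward of 1.
N_STATES = 4
TERMINAL = 3
ACTIONS = {'L': -1, 'R': +1}
GAMMA = 0.9
THETA = 1e-6  # convergence threshold


def step(s, a):
    """Deterministic transition: return (next_state, reward)."""
    s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
    r = 1.0 if s2 == TERMINAL else 0.0
    return s2, r


def backup(V, s, a):
    """One-step lookahead value of taking action a in state s."""
    s2, r = step(s, a)
    return r + GAMMA * V[s2]


def value_iteration():
    V = np.zeros(N_STATES)
    while True:
        biggest_change = 0.0
        for s in range(N_STATES):
            if s == TERMINAL:
                continue
            # evaluation and improvement fused: back up the best action's value
            new_v = max(backup(V, s, a) for a in ACTIONS)
            biggest_change = max(biggest_change, abs(new_v - V[s]))
            V[s] = new_v
        if biggest_change < THETA:
            break
    # the policy is only read off at the end, greedily with respect to V
    policy = {s: max(ACTIONS, key=lambda a: backup(V, s, a))
              for s in range(N_STATES) if s != TERMINAL}
    return V, policy


V, policy = value_iteration()
print(V)
print(policy)
```

Under these assumptions the values for states 0 to 2 should come out near 0.81, 0.9, and 1.0, with every non-terminal state choosing 'R'.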

Policy iteration follows this algorithmic cadence:

* choose an arbitrary policy
* loop:
    * compute the value of the policy (by solving a system of linear equations)
    * improve the policy at each state
* until the new policy equals the old policy
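The cadence above can be sketched on a hypothetical four-state chain (states 0 to 3, state 3 terminal, reward 1 for entering it). This is an illustration under stated assumptions, not the gridworld code that follows: the chain, `step`, `evaluate`, and `improve` are made-up names, and the evaluation step solves the linear system (I - gamma * P) V = R directly with `np.linalg.solve` rather than iterating.

```python
import numpy as np

# Hypothetical toy problem: a chain of four states 0..3, state 3 terminal,
# reward 1 for entering state 3. None of these names come from grid_world.
N_STATES, TERMINAL, GAMMA = 4, 3, 0.9
ACTIONS = {'L': -1, 'R': +1}


def step(s, a):
    """Deterministic transition: return (next_state, reward)."""
    s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
    return s2, (1.0 if s2 == TERMINAL else 0.0)


def evaluate(policy):
    """Solve (I - gamma * P) V = R exactly for the given policy."""
    P = np.zeros((N_STATES, N_STATES))
    R = np.zeros(N_STATES)
    for s, a in policy.items():
        s2, r = step(s, a)
        P[s, s2] = 1.0
        R[s] = r
    # the terminal row of P stays zero, so V[TERMINAL] = 0
    return np.linalg.solve(np.eye(N_STATES) - GAMMA * P, R)


def improve(policy, V):
    """Greedy improvement at each state; returns the new policy."""
    new_policy = {}
    for s in policy:
        def q(a):
            s2, r = step(s, a)
            return r + GAMMA * V[s2]
        new_policy[s] = max(ACTIONS, key=q)
    return new_policy


# choose an arbitrary policy
policy = {s: 'L' for s in range(N_STATES) if s != TERMINAL}
while True:
    V = evaluate(policy)               # compute the value of the policy
    new_policy = improve(policy, V)    # improve the policy at each state
    if new_policy == policy:           # until new policy equals old policy
        break
    policy = new_policy

print(policy)
print(V)
```

Solving the linear system is practical here only because the state space is tiny; the gridworld code later in this notebook uses iterative evaluation instead.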
    
At the end of the day, the value function of a policy is just the expected infinite-horizon, discounted reward that will be gained from each state by executing that policy.
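Written out, under the standard assumptions of a policy $\pi(a \mid s)$ (what this notebook writes as p(a|s)), a discount factor $\gamma$, and environment dynamics $p(s', r \mid s, a)$, that expectation and its recursive (Bellman) form are:

$$
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s\right]
= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^{\pi}(s')\right]
$$

The second form is the one the evaluation code repeatedly applies as an update rule until the values stop changing.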

Once we know the value of each state under the current policy, we consider whether the value could be improved by changing the first action taken. 

If so, the policy is altered to take the new action whenever it is in that situation. This step is guaranteed to improve the performance of the policy (or leave it unchanged). When no improvement is possible, the policy is guaranteed to be optimal.
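One common way to write this improvement step, assuming environment dynamics $p(s', r \mid s, a)$ and a discount factor $\gamma$, is as a greedy one-step lookahead on the current values:

$$
\pi'(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^{\pi}(s')\right]
$$

If $\pi'$ equals $\pi$ at every state, no single first action can raise the value anywhere, which is the stopping condition of the loop.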

### Iterative Policy Evaluation

The code below implements iterative policy evaluation on a gridworld problem, together with helper functions for printing values and policies.

* given a policy, let's find its value function V(s)
* 2 sources of randomness
    * p(a|s) - deciding what action to take given the state
    * p(s',r|s,a) - the next state and reward given your action-state pair
    * here only p(a|s) is modeled as random (uniform); p(s',r|s,a) is deterministic
* __gamma__ - the discount factor. A discount factor is useful for teaching an agent to value present rewards over future rewards. The fixed-policy example below shows how the discount factor helps the agent not only choose the path that maximizes the reward, but also the path that does so as fast as possible.
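A small sketch of why discounting favors shorter paths: a single reward of 1 that arrives after `k` further steps is worth `gamma ** k` now. The helper name `discounted_value` is made up for this illustration.

```python
def discounted_value(reward, steps_away, gamma):
    """Present value of a single reward received after `steps_away` steps."""
    return (gamma ** steps_away) * reward


# With gamma = 1 the delay costs nothing; with gamma < 1 every extra step
# shrinks the value, so shorter paths to the reward score higher.
for gamma in (1.0, 0.99, 0.9):
    values = [round(discounted_value(1.0, k, gamma), 4) for k in range(4)]
    print(gamma, values)
```

The same powers of 0.99 show up in the per-sweep changes printed for the fixed-policy run further down (1.0, 0.99, 0.970299).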

How would the code change if p(s',r|s,a) is not deterministic?

* A non-deterministic environment means that taking the same action in the same state on two different occasions may result in different next states and/or different rewards.
* If the problem below were non-deterministic, we would compute each value update as an expected value over the possible next states and rewards, weighted by p(s',r|s,a).
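As a rough sketch of what that change could look like, the update for one state would sum over outcomes instead of calling a deterministic move. The `transitions` table, the state names, and `expected_backup` below are hypothetical; the `grid_world` module used in this notebook does not expose transition probabilities in this form.

```python
# Hypothetical transition model for one state: maps each action to a list of
# (probability, next_state, reward) outcomes. Illustration only.
transitions = {
    'U': [(0.8, 'goal', 1.0), (0.2, 'pit', -1.0)],
    'R': [(1.0, 'pit', -1.0)],
}
policy_probs = {'U': 0.5, 'R': 0.5}   # p(a|s), uniform here
V = {'goal': 0.0, 'pit': 0.0}         # both next states are terminal
gamma = 0.9


def expected_backup(transitions, policy_probs, V, gamma):
    """Bellman expectation backup for one state with stochastic transitions."""
    new_v = 0.0
    for a, p_a in policy_probs.items():
        for p_next, s2, r in transitions[a]:
            # weight each outcome by p(a|s) * p(s',r|s,a)
            new_v += p_a * p_next * (r + gamma * V[s2])
    return new_v


print(expected_backup(transitions, policy_probs, V, gamma))
```

With these made-up numbers the result is approximately -0.2: action 'U' contributes 0.5 * (0.8 - 0.2) and action 'R' contributes 0.5 * (-1).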

In [5]:
# https://deeplearningcourses.com/c/artificial-intelligence-reinforcement-learning-in-python
# https://www.udemy.com/artificial-intelligence-reinforcement-learning-in-python
from __future__ import print_function, division
from builtins import range
# Note: you may need to update your version of future
# sudo pip install -U future


import numpy as np

from grid_world import standard_grid

SMALL_ENOUGH = 1e-3 # threshold for convergence

def print_values(V, g):
  for i in range(g.width):
    print("---------------------------")
    for j in range(g.height):
      v = V.get((i,j), 0)
      if v >= 0:
        print(" %.2f|" % v, end="")
      else:
        print("%.2f|" % v, end="") # -ve sign takes up an extra space
    print("")


def print_policy(P, g):
  for i in range(g.width):
    print("---------------------------")
    for j in range(g.height):
      a = P.get((i,j), ' ')
      print("  %s  |" % a, end="")
    print("")

if __name__ == '__main__':
  # iterative policy evaluation
  # how would the code change if p(s',r|s,a) is not deterministic?
  grid = standard_grid()

  # states will be positions (i,j)
  # simpler than tic-tac-toe because we only have one "game piece"
  # that can only be at one position at a time
  states = grid.all_states()

  ### uniformly random actions ###
  # initialize V(s) = 0
  V = {}
  for s in states:
    V[s] = 0
  gamma = 1.0 # discount factor
  # repeat until convergence
  count = 0
  while True:
    count += 1
    biggest_change = 0
    for s in states:
      old_v = V[s]

      # V(s) only has value if it's not a terminal state
      if s in grid.actions:

        new_v = 0 # we will accumulate the answer
        p_a = 1.0 / len(grid.actions[s]) # each action has equal probability
        for a in grid.actions[s]:
          grid.set_state(s)
          r = grid.move(a)
          new_v += p_a * (r + gamma * V[grid.current_state()])
        V[s] = new_v
        biggest_change = max(biggest_change, np.abs(old_v - V[s]))

    print(count, " ", biggest_change)
    if biggest_change < SMALL_ENOUGH:
      break
  print("values for uniformly random actions:")
  print_values(V, grid)
  print("\n\n")

  ### fixed policy ###
  policy = {
    (2, 0): 'U',
    (1, 0): 'U',
    (0, 0): 'R',
    (0, 1): 'R',
    (0, 2): 'R',
    (1, 2): 'U',
    (2, 1): 'R',
    (2, 2): 'U',
    (2, 3): 'L',
  }
  print_policy(policy, grid)

  # initialize V(s) = 0
  V = {}
  for s in states:
    V[s] = 0

  # let's see how V(s) changes as we get further away from the reward
  gamma = 0.99 # discount factor

  # repeat until convergence
  count = 0
  while True:
    count += 1
    biggest_change = 0
    for s in states:
      old_v = V[s]

      # V(s) only has value if it's not a terminal state
      if s in policy:
        a = policy[s]
        grid.set_state(s)
        r = grid.move(a)
        V[s] = r + gamma * V[grid.current_state()]
        biggest_change = max(biggest_change, np.abs(old_v - V[s]))

    print(count, " ", biggest_change)
    if biggest_change < SMALL_ENOUGH:
      break
  print("values for fixed policy:")
  print_values(V, grid)


1   0.5
2   0.138888888889
3   0.0841049382716
4   0.0487825788752
5   0.0305968745237
6   0.020936465372
7   0.0154602885385
8   0.0125957267443
9   0.0106768890958
10   0.00911310419528
11   0.00780818790411
12   0.00670428088687
13   0.00576316028922
14   0.00495734073685
15   0.00426570920199
16   0.00367129309651
17   0.00316005081477
18   0.00272016465323
19   0.00234158959515
20   0.00201573935481
21   0.00173525141419
22   0.00149380152182
23   0.00128595197727
24   0.00110702482171
25   0.000952994480954
values for uniformly random actions:
---------------------------
-0.03| 0.09| 0.22| 0.00|
---------------------------
-0.16| 0.00|-0.44| 0.00|
---------------------------
-0.29|-0.41|-0.54|-0.77|



---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  U  |     |
---------------------------
  U  |  R  |  U  |  L  |
1   1.0
2   0.99
3   0.970299
4   0
values for fixed policy:
---------------------------
 0.98| 0.99| 1.00| 0.00|
---------

In iterative policy evaluation we can see the value of each cell in our grid under the policy being evaluated: first the uniformly random policy, then the fixed policy.

### Policy Iteration

This deterministic grid gives you a reward of -0.1 for every non-terminal state to encourage finding a shorter path to the goal.

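The algorithm follows these steps:
* print the rewards
* build the state -> action mapping (the policy)
    * randomly choose an action for each state and update it as we learn
* print the initial policy
* initialize V(s), setting terminal states to 0
* loop:
    * evaluate the policy
    * improve the policy
* repeat until convergence (the policy no longer changes)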

In [4]:
# https://deeplearningcourses.com/c/artificial-intelligence-reinforcement-learning-in-python
# https://www.udemy.com/artificial-intelligence-reinforcement-learning-in-python
from __future__ import print_function, division
from builtins import range
# Note: you may need to update your version of future
# sudo pip install -U future


import numpy as np
from grid_world import standard_grid, negative_grid
from iterative_policy_evaluation import print_values, print_policy

SMALL_ENOUGH = 1e-3
GAMMA = 0.9
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')

# this is deterministic
# all p(s',r|s,a) = 1 or 0

if __name__ == '__main__':
  # this grid gives you a reward of -0.1 for every non-terminal state
  # we want to see if this will encourage finding a shorter path to the goal
  grid = negative_grid()

  # print rewards
  print("rewards:")
  print_values(grid.rewards, grid)

  # state -> action
  # we'll randomly choose an action and update as we learn
  policy = {}
  for s in grid.actions.keys():
    policy[s] = np.random.choice(ALL_POSSIBLE_ACTIONS)

  # initial policy
  print("initial policy:")
  print_policy(policy, grid)

  # initialize V(s)
  V = {}
  states = grid.all_states()
  for s in states:
    # V[s] = 0
    if s in grid.actions:
      V[s] = np.random.random()
    else:
      # terminal state
      V[s] = 0

  # repeat until convergence - will break out when policy does not change
  count = 0
  while True:
     
    count += 1
    # policy evaluation step - we already know how to do this!
    while True:
      
      biggest_change = 0
      for s in states:
        old_v = V[s]

        # V(s) only has value if it's not a terminal state
        if s in policy:
          a = policy[s]
          grid.set_state(s)
          r = grid.move(a)
          V[s] = r + GAMMA * V[grid.current_state()]
          biggest_change = max(biggest_change, np.abs(old_v - V[s]))

      if biggest_change < SMALL_ENOUGH:
        break

    # policy improvement step
    is_policy_converged = True
    for s in states:
      if s in policy:
        old_a = policy[s]
        new_a = None
        best_value = float('-inf')
        # loop through all possible actions to find the best current action
        for a in ALL_POSSIBLE_ACTIONS:
          grid.set_state(s)
          r = grid.move(a)
          v = r + GAMMA * V[grid.current_state()]
          print("s", s, "a", a, "grid.current_state()", grid.current_state(), "r", r, "Value", V[grid.current_state()])
          if v > best_value:
            best_value = v
            new_a = a
        policy[s] = new_a
        if new_a != old_a:
          is_policy_converged = False

    print(count)    
    if is_policy_converged:
      break

  print("policy:")
  print_policy(policy, grid)


rewards:
---------------------------
-0.10|-0.10|-0.10| 1.00|
---------------------------
-0.10| 0.00|-0.10|-1.00|
---------------------------
-0.10|-0.10|-0.10|-0.10|
initial policy:
---------------------------
  U  |  L  |  D  |     |
---------------------------
  R  |     |  R  |     |
---------------------------
  D  |  L  |  R  |  U  |
s (0, 1) a U grid.current_state() (0, 1) r -0.1 Value -0.9918002192500368
s (0, 1) a D grid.current_state() (0, 1) r -0.1 Value -0.9918002192500368
s (0, 1) a L grid.current_state() (0, 0) r -0.1 Value -0.9918002192500368
s (0, 1) a R grid.current_state() (0, 2) r -0.1 Value -1.0
s (1, 2) a U grid.current_state() (0, 2) r -0.1 Value -1.0
s (1, 2) a D grid.current_state() (2, 2) r -0.1 Value -1.0
s (1, 2) a L grid.current_state() (1, 2) r -0.1 Value -1.0
s (1, 2) a R grid.current_state() (1, 3) r -1 Value 0
s (0, 0) a U grid.current_state() (0, 0) r -0.1 Value -0.9918002192500368
s (0, 0) a D grid.current_state() (1, 0) r -0.1 Value -0.99229866452955