## Value Iteration
### Sean O'Malley

As a reminder, reinforcement learning describes an agent interacting with an environment. At each step the agent takes an action that alters the environment, and it receives either a penalty or a reward based on that action. The primary goal is to maximize the total (discounted) reward collected over time.

A Markov Decision Process (MDP) is a formal framework used to describe the agent's interaction with its environment. An MDP has 5 elements:
1. set of states
2. set of actions
3. state transition model (how env changes due to agent action)
4. reward model (real value reward for action)
5. discount factor (determines importance of future rewards)
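
To make the five elements concrete, here is a minimal sketch of how they could be stored in Python. The `ToyMDP` class, its field names, and the two-state example are hypothetical and chosen only for illustration; the sketch also assumes deterministic transitions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToyMDP:
    """Hypothetical container for the five MDP elements (deterministic case)."""
    states: List[str]                        # 1. set of states
    actions: List[str]                       # 2. set of actions
    transitions: Dict[Tuple[str, str], str]  # 3. (state, action) -> next state
    rewards: Dict[Tuple[str, str], float]    # 4. (state, action) -> real-valued reward
    gamma: float                             # 5. discount factor


# A made-up two-state problem: staying costs a little, going reaches the goal.
mdp = ToyMDP(
    states=["start", "goal"],
    actions=["stay", "go"],
    transitions={("start", "stay"): "start", ("start", "go"): "goal"},
    rewards={("start", "stay"): -0.1, ("start", "go"): 1.0},
    gamma=0.9,
)

print(mdp.transitions[("start", "go")])  # goal
```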

Both value iteration and policy iteration can be used to solve Markov Decision Processes 'offline', that is, by planning against a known model of the environment rather than by learning from live interaction. This document covers value iteration.

__Value Function:__ The core idea behind value iteration is the value function. The value function represents the quality of a state for an agent: it equals the expected total reward the agent collects when starting from that state, where the total is the sum of the rewards at each subsequent step, each weighted by the discount factor. The optimal value function assigns every state the highest value achievable under any policy.
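
As a small worked example of "sum of rewards weighted by the discount factor", the sketch below computes the discounted total for one fixed sequence of rewards. The function name `discounted_return` and the reward sequence are assumptions made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Made-up path: two steps that each cost 0.1, then a final reward of 1.0.
path_rewards = [-0.1, -0.1, 1.0]
value = discounted_return(path_rewards, gamma=0.9)

# -0.1 + 0.9 * (-0.1) + 0.81 * 1.0
print(round(value, 2))  # 0.62
```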

__Q-Function:__ In value iteration, the Q-function is a key component of the algorithm. The Q-function maps a state-action pair to a real value. The optimal Q-function is the expected total reward an agent receives when it takes a specific action in a specific state and acts optimally afterwards. In other words, the optimal Q-function indicates how valuable an action is in a given state.

Therefore, if we know the optimal Q-function, the optimal policy can be easily extracted by choosing, in each state, the action with the maximum Q-value.
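
Extracting a policy from a Q-function is just an argmax per state. The sketch below assumes the Q-function is stored as a nested dictionary; the state names, actions, and Q-values are invented for illustration.

```python
def greedy_policy(Q):
    """Pick, for every state, the action with the largest Q-value.

    Q is assumed to be a dict: state -> {action: value}.
    """
    return {s: max(action_values, key=action_values.get)
            for s, action_values in Q.items()}


# Made-up Q-values for two states and two actions.
Q = {
    "s0": {"L": 0.1, "R": 0.7},
    "s1": {"L": 0.4, "R": 0.2},
}

policy = greedy_policy(Q)
print(policy)  # {'s0': 'R', 's1': 'L'}
```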

__Bellman Equation:__ The Bellman equation is the recursive definition of the Q-function. The Q-function equals the sum of the immediate reward after an action in a specific state and the discounted value of future rewards after the transition to a new state. Value iteration relies heavily on the Bellman equation and finds the optimal policy by iterating until the value function and Q-function converge.
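
Written out, one common form of this recursion for the optimal Q-function and value function is shown below. The notation is an assumption: $p(s', r \mid s, a)$ is the probability of landing in state $s'$ with reward $r$ after taking action $a$ in state $s$, and $\gamma$ is the discount factor.

$$
Q^*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma \max_{a'} Q^*(s', a') \,\bigr]
$$

$$
V^*(s) = \max_{a} Q^*(s, a)
$$

In a deterministic environment every $p(s', r \mid s, a)$ is either 1 or 0, so the sum collapses to a single term: $Q^*(s, a) = r + \gamma V^*(s')$.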

__Value Iteration__ is the idea that if we know the value of each state, our decision would be to always choose the action that maximizes that value. Value iteration computes the optimal state value function by iteratively improving the value function. 

Value iteration first initializes the value function with random values, then repeatedly updates the Q-function and the value function until they converge. For a discount factor below 1, this procedure is guaranteed to converge to the optimal values.

Algorithmically this looks like:

* assign each state a random value
* repeat:
    * for each state, calculate a new value based on the values of its neighboring states
    * update each state's value using the Bellman equation
    * if no value changed (by more than a small threshold), stop
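
The steps above can be sketched on a tiny problem. Everything in this example is a hypothetical stand-in: a four-state chain where state 3 is terminal, moving into it pays 1.0, and every other move costs 0.1. The helper `step` plays the role of a deterministic environment model, and the random initialization is seeded only so the run is repeatable.

```python
import random

GAMMA = 0.9
THRESHOLD = 1e-3

# Hypothetical chain of states 0..3; state 3 is terminal.
STATES = [0, 1, 2, 3]
TERMINAL = 3
ACTIONS = ("L", "R")


def step(s, a):
    """Deterministic model: return (next_state, reward) for action a in state s."""
    s2 = max(s - 1, 0) if a == "L" else s + 1
    r = 1.0 if s2 == TERMINAL else -0.1
    return s2, r


def value_iteration(seed=0):
    rng = random.Random(seed)
    # Assign each non-terminal state a random value; the terminal state is 0.
    V = {s: (0.0 if s == TERMINAL else rng.random()) for s in STATES}
    while True:
        biggest_change = 0.0
        for s in STATES:
            if s == TERMINAL:
                continue
            # Bellman update: best immediate reward plus discounted next value.
            candidates = []
            for a in ACTIONS:
                s2, r = step(s, a)
                candidates.append(r + GAMMA * V[s2])
            new_v = max(candidates)
            biggest_change = max(biggest_change, abs(new_v - V[s]))
            V[s] = new_v
        # Stop once no state's value moved by more than the threshold.
        if biggest_change < THRESHOLD:
            return V


V = value_iteration()
print({s: round(v, 2) for s, v in V.items()})  # {0: 0.62, 1: 0.8, 2: 1.0, 3: 0.0}
```

Updating `V` in place during a sweep, as done here, is one common variant; keeping a separate copy of the old values for each sweep reaches the same fixed point.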

### Value Iteration Grid World

Below we solve the gridworld problem by determining the optimal steps using value iteration. We discount future values using a gamma of 0.9 and allow the agent four possible actions: up, down, left and right.


In [1]:
from __future__ import print_function, division
from builtins import range
# Note: you may need to update your version of future
# sudo pip install -U future


import numpy as np
from grid_world import standard_grid, negative_grid
from iterative_policy_evaluation import print_values, print_policy

SMALL_ENOUGH = 1e-3
GAMMA = 0.9
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')

# this is deterministic
# all p(s',r|s,a) = 1 or 0

if __name__ == '__main__':
  # this grid gives you a reward of -0.1 for every non-terminal state
  # we want to see if this will encourage finding a shorter path to the goal
  grid = negative_grid()

  # print rewards
  print("rewards:")
  print_values(grid.rewards, grid)

  # state -> action
  # we'll randomly choose an action and update as we learn
  policy = {}
  for s in grid.actions.keys():
    policy[s] = np.random.choice(ALL_POSSIBLE_ACTIONS)

  # initial policy
  print("initial policy:")
  print_policy(policy, grid)

  # initialize V(s)
  V = {}
  states = grid.all_states()
  for s in states:
    # V[s] = 0
    if s in grid.actions:
      V[s] = np.random.random()
    else:
      # terminal state
      V[s] = 0

  # repeat until convergence
  # V[s] = max[a]{ sum[s',r] { p(s',r|s,a)[r + gamma*V[s']] } }
  count = 0
  while True:
    count += 1
    biggest_change = 0
    for s in states:
      old_v = V[s]

      # V(s) only has value if it's not a terminal state
      if s in policy:
        new_v = float('-inf')
        for a in ALL_POSSIBLE_ACTIONS:
          grid.set_state(s)
          r = grid.move(a)
          v = r + GAMMA * V[grid.current_state()]
          if v > new_v:
            new_v = v
        V[s] = new_v
        biggest_change = max(biggest_change, np.abs(old_v - V[s]))
        
    print(count,biggest_change)

    if biggest_change < SMALL_ENOUGH:
      break

  # find a policy that leads to optimal value function
  for s in policy.keys():
    best_a = None
    best_value = float('-inf')
    # loop through all possible actions to find the best current action
    for a in ALL_POSSIBLE_ACTIONS:
      grid.set_state(s)
      r = grid.move(a)
      v = r + GAMMA * V[grid.current_state()]
      if v > best_value:
        best_value = v
        best_a = a
    policy[s] = best_a

  # our goal here is to verify that we get the same answer as with policy iteration
  print("values:")
  print_values(V, grid)
  print("policy:")
  print_policy(policy, grid)

rewards:
---------------------------
-0.10|-0.10|-0.10| 1.00|
---------------------------
-0.10| 0.00|-0.10|-1.00|
---------------------------
-0.10|-0.10|-0.10|-0.10|
initial policy:
---------------------------
  L  |  D  |  L  |     |
---------------------------
  D  |     |  U  |     |
---------------------------
  D  |  U  |  D  |  D  |
1 0.5120438981916062
2 0.3099423529341516
3 0.1581921271581198
4 0
values:
---------------------------
 0.62| 0.80| 1.00| 0.00|
---------------------------
 0.46| 0.00| 0.80| 0.00|
---------------------------
 0.31| 0.46| 0.62| 0.46|
policy:
---------------------------
  R  |  R  |  R  |     |
---------------------------
  U  |     |  U  |     |
---------------------------
  U  |  R  |  U  |  L  |


We can see above that the printed rewards match the grid definition: every non-terminal step costs -0.1, and the two terminal states pay 1.0 and -1.0. The values are consistent with these rewards and the pre-defined gamma of 0.9. For example, the state next to the goal has value 1.0, and the state one step further away has value -0.1 + 0.9 × 1.0 = 0.8. The initial random policy performed miserably, but after value iteration we arrive at an optimal policy that correctly values each state. It is also fun to notice that by discounting future values at a more intense rate (0.9) we can see tha