# FROZENLAKE - 2

In [1]:
import numpy as np
from frozen_lake import SlipperyFrozenLake, FrozenLakeState, a_few_tests
from pprint import pprint

In [2]:
frozen_lake_map = [
    ['S', 'F', 'F', 'F'], 
    ['F', 'H', 'F', 'H'],
    ['F', 'F', 'F', 'H'],
    ['H', 'F', 'F', 'G']]

lake_environment = SlipperyFrozenLake(frozen_lake_map)

# Dynamic Programming
> To use dynamic programming algorithms we have to assume that the agent has full knowledge of the MDP. 

- Random Policy Creation
- Policy Evaluation
- Convert action value to state value
- Policy Improvement


##  Policy 
- A `policy` is a strategy: a way to behave in an environment.
- More precisely, a policy gives the probability of taking an `action` given that you are in a particular `state`

```
*
Formally,  

policy(action | state) = probability( action | state) 

*

```

### Types of policies
- A policy can be stochastic or deterministic. 
- An example of a deterministic policy in the frozen lake: whenever you are at the starting state (state id `1`), you always move `left`
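
A deterministic policy can be stored as a plain state-to-action mapping and, when needed, expanded into the probability form `policy(action | state)`. The sketch below assumes the four action names used in this notebook; `deterministic_policy` and `to_probabilities` are hypothetical names introduced only for illustration.

```python
ACTIONS = ['left', 'down', 'right', 'up']

# Hypothetical deterministic policy: one fixed action per state
deterministic_policy = {1: 'left'}

def to_probabilities(state_to_action, actions):
    """Give probability 1.0 to the chosen action and 0.0 to the others."""
    return {
        state_id: {a: (1.0 if a == chosen else 0.0) for a in actions}
        for state_id, chosen in state_to_action.items()
    }

print(to_probabilities(deterministic_policy, ACTIONS))
```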

### Stochastic policy example 
- An example of a stochastic policy in the frozen lake: whenever you are at the starting state (state `1`), you move `down` `70%` of the time and `left` `30%` of the time. 
- In addition, whenever you are at state `9`, you move `left` `10%` of the time, `down` `30%` of the time and `right` `60%` of the time. In all other states you move `up` `100%` of the time.
- Of course, a given policy may not be the best one; the goal is to find the best policy for a given environment. 
- We can represent this policy as a dictionary within a dictionary in Python

```
policy = {

    1: {'left': 0.3, 'right': 0.0, 'up': 0.0, 'down': 0.7},
    2: {'left': 0.0, 'right': 0.0, 'up': 1.0, 'down': 0.0},
    3: {'left': 0.0, 'right': 0.0, 'up': 1.0, 'down': 0.0},
    ...
    ...
    9: {'left': 0.1, 'right': 0.6, 'up': 0.0, 'down': 0.3},
    10: {'left': 0.0, 'right': 0.0, 'up': 1.0, 'down': 0.0},
    11: {'left': 0.0, 'right': 0.0, 'up': 1.0, 'down': 0.0},
    ...
    ...
    14: {'left': 0.0, 'right': 0.0, 'up': 1.0, 'down': 0.0},
    15: {'left': 0.0, 'right': 0.0, 'up': 1.0, 'down': 0.0},
}
```
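
To see how such a nested dictionary could be used, here is a minimal sketch that samples an action from a stochastic policy with Python's standard `random.choices`. The `example_policy` excerpt follows the prose example above, and both it and the `sample_action` helper are hypothetical names introduced for illustration; they are not part of `frozen_lake`.

```python
import random

# Hypothetical two-state excerpt of the policy described above
example_policy = {
    1: {'left': 0.3, 'right': 0.0, 'up': 0.0, 'down': 0.7},
    2: {'left': 0.0, 'right': 0.0, 'up': 1.0, 'down': 0.0},
}

def sample_action(policy, state_id, rng=random):
    """Draw one action for `state_id` according to its probabilities."""
    actions = list(policy[state_id].keys())
    weights = list(policy[state_id].values())
    # random.choices returns a list of k samples; we only want one
    return rng.choices(actions, weights=weights, k=1)[0]

# Each state's probabilities should add up to 1
for state_id, action_probabilities in example_policy.items():
    assert abs(sum(action_probabilities.values()) - 1.0) < 1e-9

print(sample_action(example_policy, 1))  # 'down' or 'left'
print(sample_action(example_policy, 2))  # always 'up'
```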

### Random policy

- First, let's make a simple policy in which, regardless of the state we are in, we choose an action uniformly at random. 

In [3]:
def create_equiprobable_policy(env):

    # probability of taking an action given the state 
    p = 1.0 / len(env.actions)
    
    policy_for_any_state = {}
    policy = {}
    for action in env.actions:
        policy_for_any_state[action] = p
    
    for state_id in range(env.number_of_states):
        # copy the dictionary so that states do not share one mutable object
        policy[state_id] = dict(policy_for_any_state)
    
    return policy

In [4]:
random_policy = create_equiprobable_policy(env=lake_environment)
pprint(random_policy)

{0: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 1: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 2: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 3: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 4: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 5: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 6: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 7: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 8: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 9: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 10: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 11: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 12: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 13: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 14: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 15: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25}}


# Value of a state given a policy (`state_value` or `v`)

- A `value` of a state is a number that represents the goodness of a state given that we follow a policy
- It is an expected accumulation of rewards in the long term following a policy
- We will talk more about this later


# Value of a State-Action Pair 
####  `expected_return` or `state_action_value` or `q(state, action)` or ` Q[state_id][action]`
- Given that we are at a `state` and choose an `action` at a particular time, the value we expect to get from that choice is called the `expected_return`
- Similar to `state_value`, this is a measure of the goodness of doing a particular `action` given that you are at a particular `state` 
- An `expected_return` (also called `state_action_value`, `q(state, action)` or `q[state][action]`) is the sum of the `state_values` of the next states we might end up in, weighted by the probability of ending up in each of them
- As an example, remember what we got from the environment earlier
```
You are in state id: 14, and you do action: right
--
Possibility # 1
--
state id of possible state we end up in:  15
reward:  1.0
probability of transitioning to this state
coming from state number 14 and going right:  0.3333333333333333
--
Possibility # 2
--
state id of possible state we end up in:  10
reward:  0.0
probability of transitioning to this state
coming from state number 14 and going right:  0.3333333333333333
--
Possibility # 3
--
state id of possible state we end up in:  14
reward:  0.0
probability of transitioning to this state
coming from state number 14 and going right:  0.3333333333333333
```

Given this example, our expected return is:

```
# `expected_return` or `state_action_value` aka `q(state, action)`

# (rewards and discounting are left out here; they are added in the next subsection)
expected_return =
  (probability of transitioning to 15 from 14 when we go right)*(value of 15) +
  (probability of transitioning to 10 from 14 when we go right)*(value of 10) +
  (probability of transitioning to 14 from 14 when we go right)*(value of 14) 
  
expected_return(14, 'right') = 
  0.333 * V[15] + 0.333 * V[10] + 0.333 * V[14]

```

#### Value of the current state in terms of the value of the next state and the reward for taking an action from the current state
- We won't go into the details of proving this, but you can check out this link and video! 
- Link: [Josh Greaves: Understanding RL: The bellman equations](https://joshgreaves.com/reinforcement-learning/understanding-rl-the-bellman-equations). 
- Video: [David Silver: Markov Decision Process](https://www.youtube.com/watch?v=lfHX2hHRMVQ)

```
*

(value of current_state) = 
  (reward of getting to a given next state, given current state and action) +
  (discount factor gamma) * (value of given next state)
  
*
```
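
As a worked example of this rule, the sketch below recomputes the expected return for state `14` and action `right` from the three possibilities listed earlier. The state values are copied from the random-policy evaluation output printed later in this notebook, `gamma = 1` matches the default used there, and the variable names are chosen only for this illustration.

```python
gamma = 1.0

# State values for the three possible next states, taken from the
# random-policy evaluation output in this notebook
V = {10: 0.14205315963086687, 14: 0.43929117582009547, 15: 0.0}

# (probability, next_state_id, reward) for state 14 and action 'right'
possibilities = [
    (1 / 3, 15, 1.0),
    (1 / 3, 10, 0.0),
    (1 / 3, 14, 0.0),
]

# Weight (reward + discounted next-state value) by each transition probability
expected_return = sum(
    probability * (reward + gamma * V[next_state_id])
    for probability, next_state_id, reward in possibilities
)
print(round(expected_return, 5))
```

The result is roughly 0.527, and most of it comes from the one-in-three chance of reaching the goal and collecting the reward of `1.0`.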

In [5]:
# --------------------------------------------------
# Get the state_action value Q from the state_value (following a policy) 
# and given possible next states from environment
# `expected_return` or `state_action_value` 
# aka `q(state, action)` aka q[state]['action']
# --------------------------------------------------
def get_state_action_value(state_id, action, env, V, gamma):
    expected_return = 0.0    

    possibilities = env.get_possibilities(state_id, action)

    for state_info in possibilities:
        reward = state_info.reward
        next_state_id = state_info.n  # state id of the next state
        # probability of transitioning to this next state
        # from our current state and action
        p = state_info.probability 
        # The value of being at our current state 
        # given the value of this next state and the reward received
        v = (reward + gamma * V[next_state_id])
        expected_return += (p * v)
    return expected_return

## Value of a state given a policy (`state_value`)

- A `value` of a state is a number that represents the goodness of a state given that we follow a policy
- It is an expected accumulation of rewards in the long term following a policy
- You can think of a reward as a number that represents the intensity of immediate pleasure or pain
- You can think of the `value` as a far-sighted judgement of a state 
- For example, you have money and you were promised that if you don't spend it, you will accumulate more money, which means you can buy new things in the future. But if you spend it now on delicious sushi, you get to be happy today. 
- So spending the money on sushi gives you a positive reward right away, since you reach a state of being happy with sushi, but you end up with no money. 
- However, if you don't spend it now, you will continue to accumulate more money, which means you can buy more things in the future, which will make you happier later. 
- So the state of having money but no sushi has more `value` than the state of being broke but with sushi. 
- Sometimes we discount the rewards we expect in the future, because an immediate reward is more valuable than the rewards we might get later, but we won't go into that now. We use `gamma` to discount the return of the next states. 

# `state_value` given a policy

- Going back to the value of a state given a policy 
- Recall that the policy is the probability of taking an action given a state
- We can say that the value of a state `V[s]` is an expectation: the sum over all actions of the probability of taking each action times the value of doing that action in the state. 
- Going back to the earlier example, suppose our policy is that whenever you are at state 9, you move left 10% of the time, down 30% of the time and right 60% of the time. Then our state_value `v` at the state with state id `9` is 

```
state_value[9] = 0.1 * state_action_value(9, 'left') + 
                 0.3 * state_action_value(9, 'down') + 
                 0.6 * state_action_value(9, 'right') + 
                 0.0 * state_action_value(9, 'up')
```
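
The same weighted sum can be written in a few lines of Python. The state-action values below are made-up numbers for illustration only (they are not the frozen lake's values), and `policy_at_9` and `q_at_9` are hypothetical names.

```python
# Policy at state 9 from the example above
policy_at_9 = {'left': 0.1, 'down': 0.3, 'right': 0.6, 'up': 0.0}

# Made-up state-action values for state 9 (illustration only)
q_at_9 = {'left': 0.10, 'down': 0.20, 'right': 0.30, 'up': 0.00}

# Weight each action's value by the probability of choosing that action
state_value_9 = sum(
    policy_at_9[action] * q_at_9[action] for action in policy_at_9
)
print(state_value_9)  # approximately 0.25
```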


# Policy Evaluation 
- To evaluate the values of the states under a certain policy, we can use the algorithm below, which I implemented in the `evaluate_policy()` function
- Evaluating the policy gives you a value for each state 
- If a particular policy achieves the largest value for every state, then it is the best policy out of all policies 

```
start with each state value having a value of zero 

repeat the following until delta is almost zero:

    delta <- 0 
    
    for each state 
        1.store old state value temporarily (old v)
        2.get updated state value (new v), given policy and environment
        3.set the state value to new v
        4.update delta to whichever is higher, old delta or absolute difference between old v and new v

Now you have the state value for each state

```
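
To check the loop on something small enough to solve by hand, here is a minimal sketch on a toy MDP that is not the frozen lake. State `0` has one action, which ends the episode with reward `1.0` half of the time and stays in state `0` otherwise, so with `gamma = 1` its value must satisfy `v = 0.5 * 1 + 0.5 * v`, which gives `v = 1`. The `transitions` layout and the `evaluate` helper are assumptions made for this illustration.

```python
# Toy MDP (hypothetical): state 0 is non-terminal, state 1 is terminal.
# transitions[state][action] -> list of (probability, next_state, reward)
transitions = {
    0: {'go': [(0.5, 1, 1.0), (0.5, 0, 0.0)]},
    1: {'go': [(1.0, 1, 0.0)]},
}
policy = {0: {'go': 1.0}, 1: {'go': 1.0}}

def evaluate(transitions, policy, gamma=1.0, theta=1e-8):
    """Iterative policy evaluation with in-place updates."""
    V = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s in transitions:
            old_v = V[s]
            # expectation over actions, then over possible next states
            new_v = sum(
                pi * sum(p * (r + gamma * V[s2])
                         for p, s2, r in transitions[s][a])
                for a, pi in policy[s].items()
            )
            V[s] = new_v
            delta = max(delta, abs(old_v - new_v))
        if delta < theta:
            return V

toy_values = evaluate(transitions, policy)
print(toy_values)
```

The value of state `0` approaches `1` from below and the loop stops once the change between sweeps is smaller than `theta`.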

# Updating a state value
- Recall that 

```
*

(value of current_state) = 
  (reward of getting to a given next state, given current state and action) +
  (discount factor gamma) * (value of given next state)

* 
```
- So, for example, given that we can get the transition probabilities from our environment, we can get the state-action value

```
# `expected_return` or `state_action_value` aka `q(state, action)`

expected_return =
  (probability of transitioning to 15 from 14 when we go right)*(value of 15) +
  (probability of transitioning to 10 from 14 when we go right)*(value of 10) +
  (probability of transitioning to 14 from 14 when we go right)*(value of 14) 
```

- Also recall our example: if our policy is that whenever you are at state 9, you move left 10% of the time, down 30% of the time and right 60% of the time, then our state_value `v` at the state with state id `9` is 

```
state_value[9] = 0.1 * state_action_value(9, 'left') + 
                 0.3 * state_action_value(9, 'down') + 
                 0.6 * state_action_value(9, 'right') + 
                 0.0 * state_action_value(9, 'up')
```

- Given all these equations, we can update the value of a state, given the current policy, the environment, and the current values of all other states 

In [6]:
# --------------------------------------------------
# UPDATE THE VALUE OF A STATE GIVEN THE POLICY 
# --------------------------------------------------

def update_state_value(V, state_id, policy, env, gamma):
    new_v = 0.0
    for action, action_probability in policy[state_id].items():
        state_action_value = get_state_action_value(state_id, action, env, V, gamma)
        new_v += (action_probability * state_action_value)
    return new_v
        
# --------------------------------------------------
# EVALUATE POLICY
# --------------------------------------------------

def evaluate_policy(env, policy, gamma=1, theta=1e-8):
    
    print("Evaluating policy...")
    V = np.zeros(env.number_of_states)
    i = 0
    
    while True:
        delta = 0.0
        i+=1
        for state_id in range(env.number_of_states):
            old_v = V[state_id]
            new_v = update_state_value(V, state_id, policy, env, gamma)
            value_difference = np.abs(old_v - new_v)
            V[state_id] = new_v
            delta = max(delta, value_difference)
            
        if delta < theta: break
            
    print("Total iterations: ", i)
    print("...Policy evaluation done.")
    
    return V

In [7]:
random_policy = create_equiprobable_policy(env=lake_environment)
V = evaluate_policy(lake_environment, random_policy)
#State ID: Value Mapping 
#V[state_id]

Evaluating policy...
Total iterations:  57
...Policy evaluation done.


In [8]:
print("State Value List V[state_id] given the random policy")
pprint(list(V))

State Value List V[state_id] given the random policy
[0.013939768663454583,
 0.011630911318747722,
 0.020952973819001047,
 0.010476484293706931,
 0.016248651887812875,
 0.0,
 0.04075153320335127,
 0.0,
 0.03480619300811653,
 0.08816992993355391,
 0.14205315963086687,
 0.0,
 0.0,
 0.17582036826295283,
 0.43929117582009547,
 0.0]


In [9]:
def pretty_print_state_values(V, env):
    print("STATE VALUES GRID ")
    print("+---------+---------+---------+---------+")

    # use the `env` argument rather than the global `lake_environment`
    for r in range(env.rows):
        for c in range(env.columns):
            n = env.location_to_n(r, c)
            print('| {:6.5f} '.format(V[n]), end="")
        print(end="|\n")
        print("+---------+---------+---------+---------+")

In [10]:
pretty_print_state_values(V, lake_environment)

STATE VALUES GRID 
+---------+---------+---------+---------+
| 0.01394 | 0.01163 | 0.02095 | 0.01048 |
+---------+---------+---------+---------+
| 0.01625 | 0.00000 | 0.04075 | 0.00000 |
+---------+---------+---------+---------+
| 0.03481 | 0.08817 | 0.14205 | 0.00000 |
+---------+---------+---------+---------+
| 0.00000 | 0.17582 | 0.43929 | 0.00000 |
+---------+---------+---------+---------+


# `state_action_values` from `state_values` given a `policy`
- Given the `state_values` from evaluating the `policy`, we can compute the `state_action_values` for each state-action pair
- Similar to the `state_values`, the best `policy` among all policies is the one with the largest value for each possible state-action pair. 

In [11]:
def create_state_action_value_dictionary(env, state_values, gamma=1):
    Q = {}
    for state_id in range(env.number_of_states):
        q = {}        
        for action in env.actions: 
            state_action_value = get_state_action_value(
                state_id, action, env, state_values, gamma)
            q[action] = state_action_value
        
        Q[state_id] = q
    return Q

In [12]:
Q = create_state_action_value_dictionary(
        env=lake_environment, 
        state_values=V, 
        gamma=1)

In [13]:
print("STATE ACTION VALUE DICTIONARY FROM STATE VALUES")
# Q[state_id][action]
pprint(Q)

STATE ACTION VALUE DICTIONARY FROM STATE VALUES
{0: {'down': 0.013939777290005059,
     'left': 0.014709396404907345,
     'right': 0.013939777290005059,
     'up': 0.013170149548552295},
 1: {'down': 0.011630914160818542,
     'left': 0.008523559994067434,
     'right': 0.010861295045916257,
     'up': 0.015507884600401117},
 2: {'down': 0.02095297627193531,
     'left': 0.024445139447033346,
     'right': 0.02406033043868642,
     'up': 0.0143534564771519},
 3: {'down': 0.010476486037569326,
     'left': 0.010476486037569326,
     'right': 0.006984322862471287,
     'up': 0.01396864746880497},
 4: {'down': 0.017018281631976467,
     'left': 0.021664871186461328,
     'right': 0.0162486538905237,
     'up': 0.010062806850422485},
 5: {'down': 0.0, 'left': 0.0, 'right': 0.0, 'up': 0.0},
 6: {'down': 0.04735105321028896,
     'left': 0.05433537781662264,
     'right': 0.05433537781662264,
     'up': 0.006984324606333682},
 7: {'down': 0.0, 'left': 0.0, 'right': 0.0, 'up': 0.0},
 8: {'do

In [14]:
def pretty_print_state_action_values(Q, lake_environment):
    print("actions:", lake_environment.actions)
    print()

    print("STATE ACTION VALUE TABLE (Q)")
    print("+-------+---------+---------+---------+---------+")
    print("| STATE | LEFT    | DOWN    | RIGHT   | UP      |")
    print("+-------+---------+---------+---------+---------+")

    for state_id in range(lake_environment.number_of_states):
        print("| {:2d}    ".format(state_id), end="")
        for action in lake_environment.actions:
            print('| {:6.5f} '.format(Q[state_id][action]), end="")
        print(end="|\n")
        print("+-------+---------+---------+---------+---------+")

In [15]:
pretty_print_state_action_values(Q, lake_environment)

actions: ['left', 'down', 'right', 'up']

STATE ACTION VALUE TABLE (Q)
+-------+---------+---------+---------+---------+
| STATE | LEFT    | DOWN    | RIGHT   | UP      |
+-------+---------+---------+---------+---------+
|  0    | 0.01471 | 0.01394 | 0.01394 | 0.01317 |
+-------+---------+---------+---------+---------+
|  1    | 0.00852 | 0.01163 | 0.01086 | 0.01551 |
+-------+---------+---------+---------+---------+
|  2    | 0.02445 | 0.02095 | 0.02406 | 0.01435 |
+-------+---------+---------+---------+---------+
|  3    | 0.01048 | 0.01048 | 0.00698 | 0.01397 |
+-------+---------+---------+---------+---------+
|  4    | 0.02166 | 0.01702 | 0.01625 | 0.01006 |
+-------+---------+---------+---------+---------+
|  5    | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
+-------+---------+---------+---------+---------+
|  6    | 0.05434 | 0.04735 | 0.05434 | 0.00698 |
+-------+---------+---------+---------+---------+
|  7    | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
+-------+---------+---------+

### POLICY IMPROVEMENT
- Given state-action values for each state-action pair, we can improve our initial policy by always selecting the best possible action given a state. The best possible action for each state is the one with the largest state-action value. 
- More formally, `policy(action | state) = 1.0` if `action` has the largest `state_action_value[state][action]` among all possible actions in that state, and `0.0` otherwise
- An intuitive example could be `policy(state = hungry, action = eat) = 1.0`
- If a few actions have equal values, we can split the probability among them equally. 
- For example:

```
policy(state=hungry, action=eat_oatmeal) = 0.5, 
policy(state=hungry, action=eat_tuna) = 0.5, 
policy(state=hungry, action=eat_chicken) = 0.0
```
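
The tie-splitting rule can be sketched for a single state as follows. The action values are made-up numbers, `greedy_policy_for_state` is a hypothetical helper, and comparing values with `math.isclose` instead of exact equality is a design choice of this sketch: it treats values that differ only by floating-point noise as tied.

```python
import math

def greedy_policy_for_state(action_values, rel_tol=1e-9):
    """Split probability equally among the actions with the largest value."""
    best_value = max(action_values.values())
    best_actions = [
        action for action, value in action_values.items()
        if math.isclose(value, best_value, rel_tol=rel_tol, abs_tol=1e-12)
    ]
    share = 1.0 / len(best_actions)
    return {
        action: (share if action in best_actions else 0.0)
        for action in action_values
    }

# Made-up values: oatmeal and tuna are tied for best
q_hungry = {'eat_oatmeal': 2.0, 'eat_tuna': 2.0, 'eat_chicken': 1.0}
print(greedy_policy_for_state(q_hungry))
```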

In [34]:
def improve_policy_debug(policy_pi, V, Q, improved_policy, lake_environment):
    print()
    print("--------------")
    print("Policy")
    print("--------------")
    print()
    print()
    pretty_print_state_action_values(policy_pi, lake_environment)
    print()

    print()
    print("--------------")
    print("State Values under this policy")
    print("--------------")
    print()
    pretty_print_state_values(V, lake_environment)
    print()

    print()
    print("--------------")
    print("State Action Values under this policy")
    print("--------------")
    print()
    
    print()
    pretty_print_state_action_values(Q, lake_environment)
    print()

    print()
    print("--------------")
    print("improved policy")
    print("--------------")
    print()
    
    print()
    pretty_print_state_action_values(improved_policy, lake_environment)
    print()

In [36]:
def get_best_actions(action_values):
    max_val = float('-inf')
    best_actions = []
    for action, value in action_values.items(): 
        if value > max_val:
            best_actions = [action]
            max_val = value
        elif value == max_val:
            best_actions.append(action)
    return best_actions


def improve_policy(policy_pi, env, debug=False):

    # an array of state values given policy
    # to access state value of a state given its state id: V[state_id]
    # (use the `env` argument rather than the global `lake_environment`)
    V = evaluate_policy(env=env, policy=policy_pi)
    
    # A dictionary of state_action values, Q[state_id][action]
    # example Q[1]['left']
    Q = create_state_action_value_dictionary(
        env=env, 
        state_values=V, 
        gamma=1)
    
    improved_policy = {}
    
    for state_id, action_values in Q.items():
        best_actions = get_best_actions(action_values)
        
        # Create policy for this state
        action_probabilities = {} 
        action_probability = 1.0 / len(best_actions)
        
        for action in action_values.keys():
            if action in best_actions: 
                action_probabilities[action] = action_probability
            else: 
                action_probabilities[action] = 0.0
        
        improved_policy[state_id] = action_probabilities 
    
    if debug:
        improve_policy_debug(policy_pi, V, Q, improved_policy, env)
        
    return improved_policy

In [37]:
random_policy = create_equiprobable_policy(env=lake_environment)
policy = improve_policy(random_policy, lake_environment, debug=True)

Evaluating policy...
Total iterations:  57
...Policy evaluation done.

--------------
Policy
--------------


actions: ['left', 'down', 'right', 'up']

STATE ACTION VALUE TABLE (Q)
+-------+---------+---------+---------+---------+
| STATE | LEFT    | DOWN    | RIGHT   | UP      |
+-------+---------+---------+---------+---------+
|  0    | 0.25000 | 0.25000 | 0.25000 | 0.25000 |
+-------+---------+---------+---------+---------+
|  1    | 0.25000 | 0.25000 | 0.25000 | 0.25000 |
+-------+---------+---------+---------+---------+
|  2    | 0.25000 | 0.25000 | 0.25000 | 0.25000 |
+-------+---------+---------+---------+---------+
|  3    | 0.25000 | 0.25000 | 0.25000 | 0.25000 |
+-------+---------+---------+---------+---------+
|  4    | 0.25000 | 0.25000 | 0.25000 | 0.25000 |
+-------+---------+---------+---------+---------+
|  5    | 0.25000 | 0.25000 | 0.25000 | 0.25000 |
+-------+---------+---------+---------+---------+
|  6    | 0.25000 | 0.25000 | 0.25000 | 0.25000 |
+-------+---------+