> We'll implement classical dynamic programming algorithms to figure out the best action to take in a toy environment called slippery frozen lake 👩 

# Slippery Frozen Lake Environment

[Modified version of the description from OpenAI Gym](https://gym.openai.com/envs/FrozenLake-v0/)

> The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others are holes with exploding 💥 bombs 💥: fall into one and the agent dies (I know, right!). Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

> You want to get to the target 🎯. The lake is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water (where there is a BOMB 💥 and you will EXPLODE 💥 and DIE 💥). You navigate across the lake and go to the target 🎯. **However, the ice is slippery, so you won't always move in the direction you intend.**

> The episode ends when you reach the goal or fall into a hole with a bomb and die! You receive a reward of **+1** if you reach the goal 🎯, and **zero** otherwise.

## STATES

```
S - 👩🏼 (S: starting point, safe, REWARD = 0)
F - ▫️ (F: frozen surface, safe, REWARD = 0)
H - 💥 (H: hole with bomb, fall to your doom, terminal state, REWARD = 0)
G - 🎯 (G: goal, dartboard target, safe, terminal state, REWARD = +1) 


+-----------------+-----------+-------+-------------------+--------+-------+
| State Condition | Character | Safe? | will episode end? | Reward | Icon  |
+-----------------+-----------+-------+-------------------+--------+-------+
| Starting point  | 'S'       | Yes   | No                | 0      | 👩    |
| You start here  |           |       |                   |        |       |
+-----------------+-----------+-------+-------------------+--------+-------+
| Frozen surface  | 'F'       | Yes   | No                | 0      | ▫️    |
+-----------------+-----------+-------+-------------------+--------+-------+
| Hole with bomb  | 'H'       | No    | Yes               | 0      | 💥    |
| Fall and die    |           |       |                   |        |       |
+-----------------+-----------+-------+-------------------+--------+-------+
| Goal / Target   | 'G'       | Yes   | Yes               | +1     | 🎯    |
| You should end  |           |       |                   |        |       |
| up here.        |           |       |                   |        |       |
+-----------------+-----------+-------+-------------------+--------+-------+
```

## 4x4 Grid World
```
👩▫️▫️▫️ | SFFF 
▫️💥▫️💥 | FHFH 
▫️▫️▫️💥 | FFFH
💥▫️▫️🎯 | HFFG
```

In [21]:
import numpy as np
from frozen_lake import SlipperyFrozenLake, FrozenLakeState, a_few_tests
from pprint import pprint

In [22]:
a_few_tests()

PASSED! :) 


In [23]:
frozen_lake_map = [
    ['S', 'F', 'F', 'F'], 
    ['F', 'H', 'F', 'H'],
    ['F', 'F', 'F', 'H'],
    ['H', 'F', 'F', 'G']]

lake_environment = SlipperyFrozenLake(frozen_lake_map)

# Explore Frozen Lake Environment

## chars ( S, F, H, G) - state condition

```
  0   1   2   3
  +---+---+---+---+
0 | S | F | F | F |
  +---+---+---+---+
1 | F | H | F | H |
  +---+---+---+---+
2 | F | F | F | H |
  +---+---+---+---+
3 | H | F | F | G |
  +---+---+---+---+
```

## icons ( 👩 ▫️  💥🎯)

```

  0   1   2   3
  +---+---+---+---+
0 |👩 |▫️ |▫️ |▫️|
  +---+---+---+---+
1 |▫️ |💥 |▫️ |💥|
  +---+---+---+---+
2 |▫️ |▫️ |▫️ |💥|
  +---+---+---+---+
3 |💥 |▫️ |▫️ |🎯|
  +---+---+---+---+

```

## terminal ( y / n )

```
  0   1   2   3
  +---+---+---+---+
0 | n | n | n | n |
  +---+---+---+---+
1 | n | y | n | y |
  +---+---+---+---+
2 | n | n | n | y |
  +---+---+---+---+
3 | y | n | n | y |
  +---+---+---+---+
```

## state ID 
```
  0   1   2   3
  +---+---+---+---+
0 | 0 | 1 | 2 | 3 |
  +---+---+---+---+
1 | 4 | 5 | 6 | 7 |
  +---+---+---+---+
2 | 8 | 9 |10 |11 |
  +---+---+---+---+
3 |12 |13 |14 |15 |
  +---+---+---+---+
```

## rewards

```
  0   1   2   3
  +---+---+---+---+
0 | 0 | 0 | 0 | 0 |
  +---+---+---+---+
1 | 0 | 0 | 0 | 0 |
  +---+---+---+---+
2 | 0 | 0 | 0 | 0 |
  +---+---+---+---+
3 | 0 | 0 | 0 |+1 |
  +---+---+---+---+
```

In [24]:
print()
print("-->A 4x4 Grid World")
pprint(lake_environment.map)

print()
print("-->Number of states:", lake_environment.number_of_states)
print("-->And potential actions to take:", lake_environment.actions)

print()
print("-->With respective states numbered as follows:")
print()
for r in lake_environment.n_map:
    for c in r: 
        print('{:4d}'.format(c), end="")
    print()

print()
print("-->Total number of states:", lake_environment.number_of_states)
print()

print("--------------------")
print("Possible Conditions for each state")
print("--------------------")

for c in ['S', 'H', 'F', 'G']:
    print()
    print('state condition (char):', c)
    print("-->Reward:", lake_environment.reward[c])
    print("-->Is terminal?:", lake_environment.is_terminal[c])
    print("-->Icon:", lake_environment.icons[c])
    print()



-->A 4x4 Grid World
[['S', 'F', 'F', 'F'],
 ['F', 'H', 'F', 'H'],
 ['F', 'F', 'F', 'H'],
 ['H', 'F', 'F', 'G']]

-->Number of states: 16
-->And potential actions to take: ['left', 'down', 'right', 'up']

-->With respective states numbered as follows:

   0   1   2   3
   4   5   6   7
   8   9  10  11
  12  13  14  15

-->Total number of states: 16

--------------------
Possible Conditions for each state
--------------------

state condition (char): S
-->Reward: 0.0
-->Is terminal?: False
-->Icon: 👩


state condition (char): H
-->Reward: 0.0
-->Is terminal?: True
-->Icon: 💥


state condition (char): F
-->Reward: 0.0
-->Is terminal?: False
-->Icon: ▫️


state condition (char): G
-->Reward: 1.0
-->Is terminal?: True
-->Icon: 🎯



# Transition probability and one-step dynamics

Recall that the idea of the frozen lake environment is that the surface is slippery, therefore the agent can slide to a location other than the one it wanted.

Dynamic programming assumes that the agent has full knowledge of the Markov Decision Process (MDP). Here that means we know the one-step dynamics of every state: for each state and action, the possible next states, their probabilities and their rewards.

For example, you can run the following: 

```
possibilities = lake_environment.get_possibilities(state_id, action)
```

You get a `list` of the possible next states, given that you take a particular `action` (`left`, `right`, `up`, `down`) while in a particular state identified by `state_id`, which is an `int`.

This is a `list` of `FrozenLakeState` objects, each of which contains:
- the state ID (an `int`, read as `state_info.n` in the code below) - The unique identification number of the possible next state
- `probability` (a `float`) - The probability of transitioning to this next state, given that you took the `action` from the state identified by `state_id`
- `reward` (a `float`) - The reward for landing in this next state from your current state
- `is_terminal` (a boolean: `True` or `False`) - Whether this next state is a terminal state
- Among other useful information

> DEFINITION: the **Transition Probability** for a state `s` (identified by the current `state_id`), an action `a` taken at timestep `t`, and a possible next state `s'` (identified by its own `state_id`) is the probability that the state at timestep `t+1` is `s'`, given that you take action `a` in state `s` at timestep `t`. 


## Formally,  

```
*

transition(s, a, s') = probability[state(t+1) = s'| state(t) = s, action(t) = a]

*
```

In [25]:
_ = lake_environment.get_possibilities(
    state_id=6, action='down', debug=True)

***
From state ID:  6  do action:  down !
***

# 1
--> next state ID:  10
--> reward: 0.0
--> probability:  0.3333333333333333
--> is terminal:  False


# 2
--> next state ID:  5
--> reward: 0.0
--> probability:  0.3333333333333333
--> is terminal:  True


# 3
--> next state ID:  7
--> reward: 0.0
--> probability:  0.3333333333333333
--> is terminal:  True



In [26]:
possibilities = lake_environment.get_possibilities(
    state_id=0, action='up', debug=False)

for i, state_info in enumerate(possibilities):
    
    print("--")
    print("#", i + 1)
    print("--")

    print(state_info)

--
# 1
--

*FrozenLakeState (TYPE) 
--> State ID: 0
--> Reward: 0.0
--> Not a terminal state. 
--> (Transition) Probability (given state ID and action): 0.3333333333333333
--> Icon: 👩
--> Character representation of state condition: 'S'
--> Location: (0,0) 


--
# 2
--

*FrozenLakeState (TYPE) 
--> State ID: 0
--> Reward: 0.0
--> Not a terminal state. 
--> (Transition) Probability (given state ID and action): 0.3333333333333333
--> Icon: 👩
--> Character representation of state condition: 'S'
--> Location: (0,0) 


--
# 3
--

*FrozenLakeState (TYPE) 
--> State ID: 1
--> Reward: 0.0
--> Not a terminal state. 
--> (Transition) Probability (given state ID and action): 0.3333333333333333
--> Icon: ▫️
--> Character representation of state condition: 'F'
--> Location: (0,1) 




In [27]:
possibilities = lake_environment.get_possibilities(
    state_id=14, action='right', debug=False)

print("You are in state id: 14, and you do action: right")
for i, state_info in enumerate(possibilities):
    
    print("--")
    print("#", i + 1)
    print("--")

    print("state id of possible state we end up in: ", state_info.n)
    print("reward: ", state_info.reward)
    print("this is a terminal state?", state_info.is_terminal)
    print("probability of transitioning to this state", end="")
    print("coming from state number 14 and going right: ", state_info.probability)

You are in state id: 14, and you do action: right
--
# 1
--
state id of possible state we end up in:  15
reward:  1.0
this is a terminal state? True
probability of transitioning to this statecoming from state number 14 and going right:  0.3333333333333333
--
# 2
--
state id of possible state we end up in:  10
reward:  0.0
this is a terminal state? False
probability of transitioning to this statecoming from state number 14 and going right:  0.3333333333333333
--
# 3
--
state id of possible state we end up in:  14
reward:  0.0
this is a terminal state? False
probability of transitioning to this statecoming from state number 14 and going right:  0.3333333333333333


In [28]:
_ = lake_environment.get_possibilities(
    state_id=7, action='left', debug=True)

***
From state ID:  7  do action:  left !
***

# 1
--> next state ID:  7
--> reward: 0.0
--> probability:  1.0
--> is terminal:  True
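
The outputs above suggest a simple slip model: the intended direction and the two directions perpendicular to it each happen with probability 1/3, a move off the grid leaves the agent in place, and a terminal state only transitions to itself. The sketch below reproduces those outputs under that assumption. It is not the `SlipperyFrozenLake` implementation; `slippery_transitions`, `MOVES` and `PERPENDICULAR` are hypothetical names introduced for illustration.

```python
# A stand-in for the one-step dynamics, inferred from the printed outputs.
# Assumption: the intended move and the two perpendicular moves are equally likely.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
MOVES = {"left": (0, -1), "down": (1, 0), "right": (0, 1), "up": (-1, 0)}
PERPENDICULAR = {
    "left": ("up", "down"), "right": ("up", "down"),
    "up": ("left", "right"), "down": ("left", "right"),
}


def slippery_transitions(state_id, action, lake_map=LAKE_MAP):
    """Return a list of (next_state_id, probability, reward, is_terminal)."""
    rows, cols = len(lake_map), len(lake_map[0])
    r, c = divmod(state_id, cols)
    if lake_map[r][c] in "HG":
        # Terminal states only transition to themselves, with no reward.
        return [(state_id, 1.0, 0.0, True)]
    outcomes = []
    for direction in (action,) + PERPENDICULAR[action]:
        dr, dc = MOVES[direction]
        # Moving off the grid leaves the agent where it is.
        nr = min(max(r + dr, 0), rows - 1)
        nc = min(max(c + dc, 0), cols - 1)
        tile = lake_map[nr][nc]
        reward = 1.0 if tile == "G" else 0.0
        outcomes.append((nr * cols + nc, 1.0 / 3.0, reward, tile in "HG"))
    return outcomes


for next_id, p, reward, done in slippery_transitions(6, "down"):
    print(next_id, round(p, 3), reward, done)
```

For any non-terminal state and action, the three probabilities sum to 1, which is a quick sanity check worth running on any transition model.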



# Dynamic Programming
> To use dynamic programming algorithms we have to assume that the agent has full knowledge of the MDP. 

- Random Policy Creation
- Policy Evaluation
- Convert action value to state value
- Policy Improvement
- Policy Iteration 
- Truncated policy iteration
- Value Iteration


##  Policy 
- A `policy` is a strategy: a way to behave in an environment.
- Formally, a policy gives the probability of taking an `action` given that you are in a particular `state`.

```
*
Formally,  

policy(action | state) = probability( action | state) 

*

```

### Types of policies
- A policy can be stochastic or deterministic. 
- An example of a deterministic policy in the frozen lake: whenever you are at the starting state (state ID `0`), you always move `left`.

### Stochastic policy example 
- An example of a stochastic policy in the frozen lake: whenever you are at the starting state (state ID `0`), `70%` of the time you move `down` and `30%` of the time you move `left`. 
- To extend this example, whenever you are at state `9`, `10%` of the time you move `left`, `30%` of the time you move `down` and `60%` of the time you move `right`. In all other states, you move `up` `100%` of the time.
- Of course, a given policy may not be the best one; the aim is to find the best policy for a given environment. 
- We can represent this policy as a dictionary within a dictionary in Python:

```
policy = {
    # state IDs start at 0 (the starting state)
    0: {'left': 0.3, 'right': 0.0, 'up': 0.0, 'down': 0.7},
    1: {'left': 0.0, 'right': 0.0, 'up': 1.0, 'down': 0.0},
    2: {'left': 0.0, 'right': 0.0, 'up': 1.0, 'down': 0.0},
    # ...
    # ...
    9: {'left': 0.1, 'right': 0.6, 'up': 0.0, 'down': 0.3},
    10: {'left': 0.0, 'right': 0.0, 'up': 1.0, 'down': 0.0},
    11: {'left': 0.0, 'right': 0.0, 'up': 1.0, 'down': 0.0},
    # ...
    # ...
    14: {'left': 0.0, 'right': 0.0, 'up': 1.0, 'down': 0.0},
    15: {'left': 0.0, 'right': 0.0, 'up': 1.0, 'down': 0.0},
}
```
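
One way to act on such a dictionary is to draw an action at random with the listed probabilities. The sketch below does this with the standard library's `random.choices`. `sample_action` is a hypothetical helper, not part of the `frozen_lake` module, and the example policy lists only the two states spelled out in the text (assuming state IDs start at 0, as in the state ID grid).

```python
import random


def sample_action(policy, state_id, rng=random):
    """Draw one action for `state_id` according to the policy's probabilities."""
    action_probabilities = policy[state_id]
    total = sum(action_probabilities.values())
    # A valid policy assigns probabilities that sum to 1 in every state.
    assert abs(total - 1.0) < 1e-9, "probabilities must sum to 1"
    actions = list(action_probabilities)
    weights = list(action_probabilities.values())
    return rng.choices(actions, weights=weights, k=1)[0]


example_policy = {
    0: {"left": 0.3, "right": 0.0, "up": 0.0, "down": 0.7},
    9: {"left": 0.1, "right": 0.6, "up": 0.0, "down": 0.3},
}

rng = random.Random(0)  # seeded so that repeated runs give the same draws
draws = [sample_action(example_policy, 0, rng) for _ in range(1000)]
print("fraction of 'down' from state 0:", draws.count("down") / len(draws))
```

A deterministic policy is just the special case where one action has probability 1, so the same helper covers both kinds.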

### Random policy

- First, let's make a simple policy in which, regardless of the state we are in, we choose an action uniformly at random. 

In [29]:
def create_equiprobable_policy(env):

    # probability of taking an action given the state 
    p = 1.0 / len(env.actions)
    
    policy_for_any_state = {}
    policy = {}
    for action in env.actions:
        policy_for_any_state[action] = p
    
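    for state_id in range(env.number_of_states):
        # copy, so that each state owns its own dictionary and changing
        # the policy of one state does not silently change the others
        policy[state_id] = dict(policy_for_any_state)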
    
    return policy

In [30]:
random_policy = create_equiprobable_policy(env=lake_environment)
pprint(random_policy)

{0: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 1: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 2: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 3: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 4: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 5: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 6: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 7: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 8: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 9: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 10: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 11: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 12: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 13: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 14: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25},
 15: {'down': 0.25, 'left': 0.25, 'right': 0.25, 'up': 0.25}}


## Value of a state given a policy (`state_value`)

- The `value` of a state is a number that represents how good it is to be in that state, given that we follow a policy
- It is the expected accumulation of rewards in the long term when following the policy
- You can think of a reward as a number that represents the intensity of immediate pleasure or pain
- You can think of the `value` as a far-sighted judgement of a state 
- For example, you have money and you were promised that if you don't spend it, it will grow, which means you can buy new things in the future. But if you spend it now on delicious sushi, you get to be happy today. 
- So spending the money on sushi gives you a positive reward now: you reach a state of being happy with sushi, but with no money. 
- However, if you don't spend it now, you will continue to accumulate more money, which means you can buy more things in the future, which will make you happier then. 
- So the state of having money but no sushi has more `value` than the state of being broke but with sushi. 
- Sometimes we discount the rewards we expect in the future, because an immediate reward is more valuable than rewards we might get later. We won't go into the details now, but we use `gamma` to discount the return of the next states. 
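
As a small illustration of what discounting does, the sketch below computes a discounted return for a fixed sequence of rewards. `discounted_return` is a hypothetical helper introduced here, and the reward sequence is made up for the example.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is scaled by gamma**k."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


# Reaching the goal (+1) on the third step, with nothing before it.
rewards = [0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=1.0))  # no discounting
print(discounted_return(rewards, gamma=0.9))  # the +1 is discounted twice
```

With `gamma = 1` a late reward counts as much as an early one; with `gamma < 1` the same reward is worth less the longer it takes to reach.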

### `expected_return` or `state_action_value` or `q(state, action)` or `Q[state_id][action]`
- Given that we are in a `state` and choose an `action` at a particular time, we can ask what return we expect from that choice; we call this the `expected_return`
- Similar to `state_value`, this is a measure of how good it is to take a particular `action` given that you are in a particular `state` 
- An `expected_return` (also called `state_action_value`, `q(state, action)` or `q[state][action]`) is the probability-weighted sum, over the next states we might end up in, of the immediate reward plus the (discounted) `state_value` of that next state
- As an example, remember what we got from the environment earlier:
```
You are in state id: 14, and you do action: right
--
Possibility # 1
--
state id of possible state we end up in:  15
reward:  1.0
probability of transitioning to this state
coming from state number 14 and going right:  0.3333333333333333
--
Possibility # 2
--
state id of possible state we end up in:  10
reward:  0.0
probability of transitioning to this state
coming from state number 14 and going right:  0.3333333333333333
--
Possibility # 3
--
state id of possible state we end up in:  14
reward:  0.0
probability of transitioning to this state
coming from state number 14 and going right:  0.3333333333333333
```

Given this example, our expected return is:

```
## `expected_return` or `state_action_value` aka `q(state, action)`

expected_return =
  (probability of transitioning to 15 from 14 when we go right)*(reward at 15 + gamma * value of 15) +
  (probability of transitioning to 10 from 14 when we go right)*(reward at 10 + gamma * value of 10) +
  (probability of transitioning to 14 from 14 when we go right)*(reward at 14 + gamma * value of 14)

expected_return[14]['right'] =
  0.333 * (1 + gamma * V[15]) + 0.333 * (0 + gamma * V[10]) + 0.333 * (0 + gamma * V[14])

```
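
Plugging in numbers makes this concrete. The sketch below uses the three outcomes printed above together with the state values that the random-policy evaluation reports further down in this notebook (rounded to five decimals), and it assumes `gamma = 1`. The immediate reward is added to each outcome, which is how the `get_state_action_value()` function in this notebook computes it.

```python
# Outcomes of taking `right` in state 14: (next_state_id, probability, reward)
outcomes = [(15, 1 / 3, 1.0), (10, 1 / 3, 0.0), (14, 1 / 3, 0.0)]

# State values under the random policy, rounded from the evaluation output.
V = {15: 0.0, 10: 0.14205, 14: 0.43929}
gamma = 1.0

expected_return = sum(p * (r + gamma * V[n]) for n, p, r in outcomes)
print("q(14, 'right') is roughly", round(expected_return, 4))
```

Most of this value comes from the one-in-three chance of landing on the goal and collecting the +1 reward.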

#### Value of the current state in terms of the value of the next state and the reward for taking an action
- We won't go into the details of proving this, but you can check out this link and video! 
- Link: [Josh Greaves: Understanding RL: The bellman equations](https://joshgreaves.com/reinforcement-learning/understanding-rl-the-bellman-equations). 
- Video: [David Silver: Markov Decision Process](https://www.youtube.com/watch?v=lfHX2hHRMVQ)

```
*

(value of current_state) = 
  (reward of getting to a given next state, given current state and action) +
  (discount factor gamma) * (value of given next state)
  
*
```
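
Averaged over the possible next states and over the actions the policy might choose, this gives the Bellman expectation equations. The notation below is the standard one rather than something defined in this notebook: $\pi(a \mid s)$ is the policy, $p(s' \mid s, a)$ the transition probability, $r(s, a, s')$ the reward for landing in $s'$, and $\gamma$ the discount factor.

$$
q_\pi(s, a) = \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, v_\pi(s') \,\bigr],
\qquad
v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a)
$$

The first equation is the `expected_return` of a state-action pair; the second is the `state_value` of a state under the policy.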

In [31]:
# `expected_return` or `state_action_value` 
# aka `q(state, action)` aka q[state]['action']
def get_state_action_value(V, state_id, action, env, gamma):
    expected_return = 0.0
    possibilities = env.get_possibilities(state_id, action)
    for state_info in possibilities:
        # Information on the possible next state 
        # when we do particular action
        r = state_info.reward
        n = state_info.n #state ID
        # probability of transitioning to this next state
        # from our current state and action
        p = state_info.probability 
        # The value of being at our current state 
        # given the value of this next state and the reward received
        v = (r + gamma * V[n])
        expected_return += (p * v)
    return expected_return

# `state_value` given a policy

- Going back to the value of a state given a policy 
- Recall that the policy is the probability of taking an action given a state
- We can say that the value of a state `V[s]` is an expectation: the sum, over the actions, of the probability of taking each action times the value of taking that action in the state. 
- Going back to the earlier example, suppose our policy is that whenever you are at state 9, 10% of the time you move left, 30% of the time you move down and 60% of the time you move right.

- Then our state_value `v` at the state with state ID `9` is 

```
state_value[9] = 0.1 * state_action_value(9, 'left') + 
                 0.3 * state_action_value(9, 'down') + 
                 0.6 * state_action_value(9, 'right') + 
                 0.0 * state_action_value(9, 'up')
```
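
In code, this is a probability-weighted sum over the actions. The action values below are made-up numbers chosen only to show the arithmetic; they are not values computed for this lake.

```python
# Policy at state 9 from the example above.
policy_at_9 = {"left": 0.1, "down": 0.3, "right": 0.6, "up": 0.0}

# Hypothetical action values q(9, action), for illustration only.
q_at_9 = {"left": 0.2, "down": 0.4, "right": 0.5, "up": 0.1}

state_value_9 = sum(
    probability * q_at_9[action] for action, probability in policy_at_9.items()
)
print("state_value[9] =", round(state_value_9, 4))
```

Actions the policy never takes (here `up`) contribute nothing, however good or bad their action value is.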


# Policy Evaluation 
- To compute the values of the states under a given policy, we can use the iterative algorithm implemented in the `evaluate_policy()` function below
- It returns a value for each state, which is how we evaluate the policy 
- If a particular policy achieves the largest value in every state, then it is the best of all policies 
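
The stopping rule is easier to trust on a problem small enough to check by hand. The sketch below is not the lake environment: it is a made-up two-state chain in which, from state 0, the single available action reaches a terminal state 1 with probability 0.5 (reward +1) and otherwise stays in state 0 (reward 0). With `gamma = 0.9` the value of state 0 must satisfy `v = 0.5 * 1 + 0.5 * 0.9 * v`, so `v = 0.5 / 0.55`. `evaluate_tiny_chain` is a hypothetical name.

```python
def evaluate_tiny_chain(gamma=0.9, theta=1e-8):
    """Iterative policy evaluation on a made-up two-state chain."""
    # transitions[state] = list of (next_state, probability, reward)
    transitions = {
        0: [(1, 0.5, 1.0), (0, 0.5, 0.0)],
        1: [(1, 1.0, 0.0)],  # terminal: loops on itself with no reward
    }
    V = [0.0, 0.0]
    sweeps = 0
    while True:
        delta = 0.0
        sweeps += 1
        for state, outcomes in transitions.items():
            new_v = sum(p * (r + gamma * V[n]) for n, p, r in outcomes)
            delta = max(delta, abs(new_v - V[state]))
            V[state] = new_v
        # Stop once no state value moved by more than theta in a full sweep.
        if delta < theta:
            return V, sweeps


V, sweeps = evaluate_tiny_chain()
print("estimated:", V[0], "exact:", 0.5 / 0.55, "sweeps:", sweeps)
```

The estimate approaches the exact value from below, and a smaller `theta` buys more accuracy at the cost of more sweeps.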

In [32]:
def update_state_value(V, state_id, policy, env, gamma):
    new_v = 0.0
    for action, action_probability in policy[state_id].items():
        state_action_value = get_state_action_value(V, state_id, action, env, gamma)
        new_v += (action_probability * state_action_value)
    return new_v
        

def evaluate_policy(env, policy, gamma=1, theta=1e-8):
    
    print("Evaluating policy...")
    V = np.zeros(env.number_of_states)
    i = 0
    while True:
        delta = 0.0
        i+=1
        for state_id in range(env.number_of_states):
            old_v = V[state_id]
            new_v = update_state_value(V, state_id, policy, env, gamma)
            value_difference = np.abs(old_v - new_v)
            V[state_id] = new_v
            delta = max(delta, value_difference)
            
        if delta < theta: break
            
    print("Total iterations: ", i)
    print("...Policy evaluation done.")
    
    return V

In [33]:
random_policy = create_equiprobable_policy(env=lake_environment)
V = evaluate_policy(lake_environment, random_policy)
# State ID: Value Mapping 
# V[state_id]

Evaluating policy...
Total iterations:  57
...Policy evaluation done.


In [34]:
print("State Value List V[state_id] given the random policy")
pprint(list(V))

State Value List V[state_id]
[0.013939768663454583,
 0.011630911318747722,
 0.020952973819001047,
 0.010476484293706931,
 0.016248651887812875,
 0.0,
 0.04075153320335127,
 0.0,
 0.03480619300811653,
 0.08816992993355391,
 0.14205315963086687,
 0.0,
 0.0,
 0.17582036826295283,
 0.43929117582009547,
 0.0]


In [35]:
print("STATE VALUES GRID ")
print("+---------+---------+---------+---------+")

for r in range(lake_environment.rows):
    for c in range(lake_environment.columns):
        n = lake_environment.location_to_n(r, c)
        print('| {:6.5f} '.format(V[n]), end="")
    print(end="|\n")
    print("+---------+---------+---------+---------+")

STATE VALUES GRID 
+---------+---------+---------+---------+
| 0.01394 | 0.01163 | 0.02095 | 0.01048 |
+---------+---------+---------+---------+
| 0.01625 | 0.00000 | 0.04075 | 0.00000 |
+---------+---------+---------+---------+
| 0.03481 | 0.08817 | 0.14205 | 0.00000 |
+---------+---------+---------+---------+
| 0.00000 | 0.17582 | 0.43929 | 0.00000 |
+---------+---------+---------+---------+


# `state_action_values` from `state_values` given a `policy`
- Given the `state_values` from evaluating the `policy`, we can compute the `state_action_values` for each state-action pair
- Similar to the `state_values`, the best `policy` among all policies is the one with the largest values for every possible state-action pair. 

In [16]:
# Same argument order as the earlier definition, so that
# `update_state_value()` keeps working after this cell is run
def get_state_action_value(V, state_id, action, env, gamma):
    possibilities = env.get_possibilities(state_id, action)
    state_action_value = 0.0
    for state_info in possibilities:
        next_state = state_info.n
        reward = state_info.reward
        p = state_info.probability
        v = V[next_state]
        state_action_value += p * (reward + gamma * v)
    return state_action_value


def create_state_action_value_dictionary(env, state_values, gamma=1):
    
    Q = {}
    for state_id in range(env.number_of_states):
        q = {}
        
        for action in env.actions: 
            state_action_value = get_state_action_value(
                state_values, state_id, action, env, gamma)
            q[action] = state_action_value
        
        Q[state_id] = q
    return Q

In [17]:
Q = create_state_action_value_dictionary(
        env=lake_environment, 
        state_values=V, 
        gamma=1)

In [18]:
print("STATE ACTION VALUE DICTIONARY FROM STATE VALUES")
# Q[state_id][action]
pprint(Q)

STATE ACTION VALUE DICTIONARY FROM STATE VALUES
{0: {'down': 0.013939777290005059,
     'left': 0.014709396404907345,
     'right': 0.013939777290005059,
     'up': 0.013170149548552295},
 1: {'down': 0.011630914160818542,
     'left': 0.008523559994067434,
     'right': 0.010861295045916257,
     'up': 0.015507884600401117},
 2: {'down': 0.02095297627193531,
     'left': 0.024445139447033346,
     'right': 0.02406033043868642,
     'up': 0.0143534564771519},
 3: {'down': 0.010476486037569326,
     'left': 0.010476486037569326,
     'right': 0.006984322862471287,
     'up': 0.01396864746880497},
 4: {'down': 0.017018281631976467,
     'left': 0.021664871186461328,
     'right': 0.0162486538905237,
     'up': 0.010062806850422485},
 5: {'down': 0.0, 'left': 0.0, 'right': 0.0, 'up': 0.0},
 6: {'down': 0.04735105321028896,
     'left': 0.05433537781662264,
     'right': 0.05433537781662264,
     'up': 0.006984324606333682},
 7: {'down': 0.0, 'left': 0.0, 'right': 0.0, 'up': 0.0},
 8: {'do

In [20]:

print("actions:", lake_environment.actions)

print("STATE ACTION VALUE TABLE (Q)")
print("+-------+---------+---------+---------+---------+")
print("| STATE | LEFT    | DOWN    | RIGHT   | UP      |")
print("+-------+---------+---------+---------+---------+")

for state_id in range(lake_environment.number_of_states):
    print("| {:2d}    ".format(state_id), end="")
    for action in lake_environment.actions:
        print('| {:6.5f} '.format(Q[state_id][action]), end="")
    print(end="|\n")
    print("+-------+---------+---------+---------+---------+")


actions: ['left', 'down', 'right', 'up']
STATE ACTION VALUE TABLE (Q)
+-------+---------+---------+---------+---------+
| STATE | LEFT    | DOWN    | RIGHT   | UP      |
+-------+---------+---------+---------+---------+
|  0    | 0.01471 | 0.01394 | 0.01394 | 0.01317 |
+-------+---------+---------+---------+---------+
|  1    | 0.00852 | 0.01163 | 0.01086 | 0.01551 |
+-------+---------+---------+---------+---------+
|  2    | 0.02445 | 0.02095 | 0.02406 | 0.01435 |
+-------+---------+---------+---------+---------+
|  3    | 0.01048 | 0.01048 | 0.00698 | 0.01397 |
+-------+---------+---------+---------+---------+
|  4    | 0.02166 | 0.01702 | 0.01625 | 0.01006 |
+-------+---------+---------+---------+---------+
|  5    | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
+-------+---------+---------+---------+---------+
|  6    | 0.05434 | 0.04735 | 0.05434 | 0.00698 |
+-------+---------+---------+---------+---------+
|  7    | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
+-------+---------+---------+-

# Policy Improvement and Iteration
- To improve a policy using the values computed for it, we can use the algorithm implemented in the `improve_policy()` function below
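
A common way to improve a policy is to act greedily with respect to the action values: in each state, put all the probability on the action or actions with the largest `Q` value. The sketch below shows that idea on three rows of the `Q` table printed earlier. It is an illustration under that assumption, not the `improve_policy()` function itself, and `greedy_policy_from_q` is a hypothetical name.

```python
import math


def greedy_policy_from_q(Q, tolerance=1e-12):
    """Put all probability on the best action(s) of each state; split ties evenly."""
    policy = {}
    for state_id, action_values in Q.items():
        best = max(action_values.values())
        best_actions = [
            action for action, value in action_values.items()
            if math.isclose(value, best, rel_tol=0.0, abs_tol=tolerance)
        ]
        policy[state_id] = {
            action: (1.0 / len(best_actions) if action in best_actions else 0.0)
            for action in action_values
        }
    return policy


# Three rows copied from the Q table above (values rounded).
Q_sample = {
    0: {"left": 0.01471, "down": 0.01394, "right": 0.01394, "up": 0.01317},
    1: {"left": 0.00852, "down": 0.01163, "right": 0.01086, "up": 0.01551},
    6: {"left": 0.05434, "down": 0.04735, "right": 0.05434, "up": 0.00698},
}

for state_id, action_probabilities in greedy_policy_from_q(Q_sample).items():
    print(state_id, action_probabilities)
```

State 6 shows why ties need a decision: `left` and `right` have the same value there, and this sketch splits the probability evenly between them.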