# **Tabular Reinforcement Learning**

# Value Iteration on FrozenLake

## Non-Evaluable Practical Exercise

This is a non-evaluable practical exercise, but it is recommended that students complete it fully and individually, since it is an important part of the learning process.

The solution will be made available, but students are advised not to consult it until they have completed the exercise.

## The FrozenLake environment

In this activity, we are going to implement the **Value Iteration** algorithm on the [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) environment.

Main characteristics:
- The game starts with the player at location [0,0] of the frozen lake grid world, with the goal located at the far extent of the world, e.g. [3,3] for the 4x4 environment.
- Holes in the ice are distributed in set locations when using a pre-determined map or in random locations when a random map is generated.
- The player makes moves until they reach the goal or fall in a hole.
- The lake is slippery (unless disabled), so the player may sometimes move perpendicular to the intended direction (see the `is_slippery` parameter).

<img src="https://gymnasium.farama.org/_images/frozen_lake.gif" />
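
To make the slippery behaviour concrete, the following is a minimal pure-Python sketch of a single slippery step on the 4x4 grid. It assumes the default Gymnasium setting, in which the intended direction and the two perpendicular directions are each taken with probability 1/3, and the usual state encoding `row * ncol + col`. The helpers `move` and `slippery_outcomes` are hypothetical names used only for illustration; they are not part of Gymnasium, and holes and the goal are not modelled here.

```python
# Action encoding used by FrozenLake: 0 = left, 1 = down, 2 = right, 3 = up.
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def move(state, direction, nrow=4, ncol=4):
    """Apply one deterministic move, staying in place at the grid borders."""
    row, col = divmod(state, ncol)
    d_row, d_col = MOVES[direction]
    row = min(max(row + d_row, 0), nrow - 1)
    col = min(max(col + d_col, 0), ncol - 1)
    return row * ncol + col


def slippery_outcomes(state, action):
    """Return {next_state: probability} for one slippery step (illustrative helper)."""
    outcomes = {}
    # Assumption: the intended direction and its two perpendicular
    # neighbours are equally likely (probability 1/3 each).
    for direction in ((action - 1) % 4, action, (action + 1) % 4):
        next_state = move(state, direction)
        outcomes[next_state] = outcomes.get(next_state, 0.0) + 1.0 / 3.0
    return outcomes


# From the start state 0, trying to go right can end in state 4 (slipped down),
# state 1 (moved right) or state 0 (slipped up into the border).
print(slippery_outcomes(0, RIGHT))
```

Because several directions can lead to the same cell near the borders, the probabilities of a single target state can add up: trying to go left from state 0 keeps the player in state 0 with probability 2/3 under these assumptions.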

In [1]:
import gymnasium as gym

# params
ENV_NAME = "FrozenLake-v1"
GAMMA = 0.9
TEST_EPISODES = 20

# defining the environment
env = gym.make(ENV_NAME, desc=None, map_name="4x4", is_slippery=True)

## Defining the Agent

In order to implement an Agent and train it using the **Value Iteration** algorithm, we have to follow these steps:
1. Populate a table with the transition counts observed from each state-action pair (`transits`), from which the transition probabilities are estimated
2. Populate a table with the immediate rewards (`rewards`)
3. Run the Value Iteration algorithm and populate a table with the expected return of each state (`values`)
4. Define the policy and create a function to select an action from each state
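
For reference, steps 1 to 3 correspond to the standard value iteration updates below, written with count-based estimates. The symbols $C$, $\hat{p}$, $r$, $Q$ and $V$ are notation introduced for this note only: $C(s, a, s')$ is the counter stored in `transits`, $r(s, a, s')$ is the entry in `rewards`, $V(s)$ is the entry in `values`, and $\gamma$ is `GAMMA`.

$$
\hat{p}(s' \mid s, a) = \frac{C(s, a, s')}{\sum_{s''} C(s, a, s'')}
$$

$$
Q(s, a) = \sum_{s'} \hat{p}(s' \mid s, a)\,\bigl[r(s, a, s') + \gamma V(s')\bigr],
\qquad
V(s) \leftarrow \max_{a} Q(s, a)
$$

Step 4 then amounts to choosing, in each state, the action with the largest $Q(s, a)$.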

The central data structures in this example are as follows: 
- **Reward table**: A dictionary with the composite key "source state" + "action" + "target state". The value is the immediate reward.
- **Transitions table**: A dictionary keeping counters of the experienced transitions. The key is the composite "state" + "action" and the value is another dictionary that maps each target state to the number of times we have seen it.

For example, suppose that in state 0 we execute action 1 ten times: three times it leads us to state 4 and seven times to state 5. The entry with the key (0, 1) in this table will then be the `dict` `{4: 3, 5: 7}`. We use this table to estimate the probabilities of our transitions.

- **Value table**: A dictionary that maps a state into the calculated value of this state. 
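
As a small illustration of the transitions table, the sketch below records the example counts for the key (0, 1) and converts them into estimated probabilities. It only uses the standard library; the variable names `target_counts` and `probabilities` are chosen for this example and do not appear in the `Agent` class.

```python
import collections

# Transition counters in the same shape as the table described above.
transits = collections.defaultdict(collections.Counter)

# Record the example: action 1 in state 0 led to state 4 three times
# and to state 5 seven times.
for _ in range(3):
    transits[(0, 1)][4] += 1
for _ in range(7):
    transits[(0, 1)][5] += 1

target_counts = transits[(0, 1)]
total = sum(target_counts.values())

# Estimated probability of each target state = its count / total count.
probabilities = {target: count / total for target, count in target_counts.items()}
print(probabilities)  # {4: 0.3, 5: 0.7}
```

Using `defaultdict(Counter)` means that unseen state-action pairs and unseen target states default to a count of zero, so no explicit initialisation is needed before incrementing.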

In [None]:
import collections


class Agent:
    def __init__(self, env):
        self.env = env
        self.state = self.env.reset()[0]
        self.rewards = collections.defaultdict(float)
        self.transits = collections.defaultdict(collections.Counter)
        self.values = collections.defaultdict(float)

        
    def play_n_random_steps(self, count) -> None:
        '''
        Play n random steps to update info:
        - self.rewards[(state, action, new_state)] = reward
        - self.transits[(self.state, action)][new_state] += 1
        '''
        # TODO


    def calc_action_value(self, state, action) -> float:
        '''
        Calculate the state-action value Q(state, action)
        '''
        # TODO

        
    def select_action(self, state) -> int:
        '''
        Select the best action for a given state
        '''
        # TODO
        
    
    def value_iteration(self) -> None:
        '''
        Calculate the state value
        - self.values[state] = max(state_values)
        '''
        # TODO

            
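    def play_episode(self, env) -> float:
        '''
        Play one full episode following the current policy and return its total reward.
        The reward and transition tables are also updated with the observed data.
        '''
        total_reward = 0.0
        state, _ = env.reset()

        while True:
            action = self.select_action(state)
            new_state, reward, terminated, truncated, _ = env.step(action)
            # The episode ends on a terminal state or on the time limit
            is_done = terminated or truncated
            # Keep learning from the test episodes as well
            self.rewards[(state, action, new_state)] = reward
            self.transits[(state, action)][new_state] += 1
            total_reward += reward
            if is_done:
                break
            state = new_state
        return total_reward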

The overall logic of our code is simple:
1. In each iteration of the loop, we play 100 random steps in the environment, populating the reward and transition tables.
2. After those 100 steps, we perform a value iteration sweep over all states, updating our value table.
3. Then we play several full episodes (`TEST_EPISODES`) to check our improvements using the updated value table.
4. If the average reward over those test episodes exceeds 0.8, we stop training. During the test episodes, we also update our reward and transition tables, so that all the data gathered from the environment is used.
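
To see what the stopping criterion means in practice, the short sketch below computes how many of the test episodes must reach the goal. It assumes the default FrozenLake reward schedule, in which an episode yields a total reward of 1.0 if the goal is reached and 0.0 otherwise; `THRESHOLD` and `min_successes` are names introduced only for this example.

```python
TEST_EPISODES = 20
THRESHOLD = 0.80

# Under the assumed reward schedule, the average reward over the test
# episodes equals the fraction of episodes that reached the goal.
min_successes = next(
    wins for wins in range(TEST_EPISODES + 1)
    if wins / TEST_EPISODES > THRESHOLD
)
print(min_successes)  # 17
```

Since the comparison is strict, 16 successful episodes out of 20 (an average of exactly 0.8) are not enough to stop training.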

In [None]:
agent = Agent(env)

iter_no = 0
best_reward = 0.0

while True:
    iter_no += 1
    agent.play_n_random_steps(100)
    agent.value_iteration()

    reward = 0.0
    for _ in range(TEST_EPISODES):
        reward += agent.play_episode(env)
    
    reward /= TEST_EPISODES
    
    if reward > best_reward:
        print("Best reward updated %.3f -> %.3f" % (best_reward, reward))
        best_reward = reward
    
    if reward > 0.80:
        print("Solved in %d iterations!" % iter_no)
        break

<div class="alert alert-block alert-danger">
<strong>Solution</strong>
</div>