# **Tabular Reinforcement Learning**

# Monte Carlo methods on the FrozenLake environment

## Non-evaluable practical exercise

This is a non-evaluable practical exercise, but it is recommended that students complete it fully and individually, since it is an important part of the learning process.

The solution will be available, although it is not recommended that students consult the solution until they have completed the exercise.

## The FrozenLake environment

In this activity, we are going to implement **Monte Carlo control** on the [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) environment.

Main characteristics:
- The game starts with the player at location [0,0] of the frozen lake grid world, with the goal located at the far extent of the world, e.g. [3,3] for the 4x4 environment.
- Holes in the ice are distributed in set locations when using a pre-determined map or in random locations when a random map is generated.
- The player makes moves until they reach the goal or fall in a hole.
- The lake is slippery (unless disabled), so the player may sometimes move perpendicular to the intended direction (see the _is_slippery_ parameter).
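
The slippery behaviour in the last bullet can be sketched in a few lines. This is an illustration, not Gymnasium's own code: it assumes the default dynamics described in the Gymnasium documentation (the intended direction and each of the two perpendicular directions are taken with probability 1/3) and the action encoding 0 = left, 1 = down, 2 = right, 3 = up. The helper name `slippery_outcomes` is hypothetical.

```python
from fractions import Fraction

# Action encoding used by FrozenLake: 0 = left, 1 = down, 2 = right, 3 = up.
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}


def slippery_outcomes(action):
    """Return {actual_action: probability} for an intended action.

    Assumption: default slippery dynamics, i.e. the intended direction and
    the two perpendicular directions are equally likely (1/3 each).
    """
    # With this encoding, the neighbours of an action modulo 4 are exactly
    # the two directions perpendicular to it.
    perpendicular = [(action - 1) % 4, (action + 1) % 4]
    return {a: Fraction(1, 3) for a in [action, *perpendicular]}


# Intending to move right may actually move the player down, right or up.
for actual, prob in sorted(slippery_outcomes(2).items()):
    print(ACTION_NAMES[actual], prob)
```

Under this assumption the agent never slides backwards, but it follows the intended direction only a third of the time, which is why returns in slippery mode are noisy and many episodes are needed.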

<img src="https://gymnasium.farama.org/_images/frozen_lake.gif" />

## Monte Carlo methods

In this section, we estimate the **optimal policy** using **Monte Carlo methods**. Specifically, we will study the **_On-policy every-visit MC control algorithm (for $\epsilon$-soft policies)_**.

<u>Question 1</u>: **Implement the algorithm** using the following parameters:

- Number of episodes = 100,000
- Discount factor = 1
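
One building block of the every-visit variant is the return $G_t$ computed for every step of an episode, which is then averaged into the action-value estimate. The sketch below covers only that part (not episode generation or policy improvement). It assumes an episode is stored as a list of `(state, action, reward)` tuples; the helper name `every_visit_returns` and the toy episode are hypothetical.

```python
from collections import defaultdict


def every_visit_returns(episode, discount=1.0):
    """Return [(state, action, G_t)] for every step of an episode.

    episode: list of (state, action, reward) tuples in time order.
    """
    returns = []
    g = 0.0
    # Walk backwards so that G_t = r_{t+1} + discount * G_{t+1}.
    for state, action, reward in reversed(episode):
        g = reward + discount * g
        returns.append((state, action, g))
    returns.reverse()
    return returns


# Hypothetical 3-step episode: only the final transition is rewarded.
episode = [(9, 1, 0.0), (13, 2, 0.0), (14, 2, 1.0)]

Q = defaultdict(lambda: [0.0] * 4)      # state -> action-values
counts = defaultdict(lambda: [0] * 4)   # state -> visit counts per action

for state, action, g in every_visit_returns(episode, discount=1.0):
    counts[state][action] += 1
    # Incremental mean over every visit of (state, action).
    Q[state][action] += (g - Q[state][action]) / counts[state][action]

print(dict(Q))
```

Every-visit means that each occurrence of a state-action pair in the episode contributes its return to the average, in contrast to first-visit MC, which uses only the first occurrence.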

<u>Question 2</u>: **Implement the epsilon decay factor** using the following equation and parameters:

- Initial epsilon = 1
- Epsilon decay factor (*epsilon decay*) = 0.999
- Update epsilon according to: $\max(\epsilon \cdot \epsilon_{\textrm{decay}}, 0.01)$
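
The update rule above can be sketched as follows. This is a minimal illustration that assumes epsilon is updated once per episode; the names `decay_epsilon`, `min_epsilon` and `history` are hypothetical.

```python
def decay_epsilon(epsilon, epsilon_decay=0.999, min_epsilon=0.01):
    """One update step: max(epsilon * epsilon_decay, min_epsilon)."""
    return max(epsilon * epsilon_decay, min_epsilon)


epsilon = 1.0   # initial epsilon
history = []    # epsilon used in each episode

for episode in range(100_000):
    history.append(epsilon)
    epsilon = decay_epsilon(epsilon)

# Index of the first episode that runs with the minimum epsilon.
first_floor = next(i for i, e in enumerate(history) if e == 0.01)
print(history[0], history[-1], first_floor)
```

With these parameters epsilon reaches the 0.01 floor after roughly 4,600 updates, since $\ln(0.01) / \ln(0.999) \approx 4603$. Under the once-per-episode assumption, the remaining episodes of a 100,000-episode run use an almost greedy policy.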

<u>Question 3</u>: Once you have coded the algorithm, try different **values for the hyperparameters** and comment on the best ones (providing an empirical comparison):

- Number of episodes
- Initial epsilon
- Epsilon decay factor (*epsilon decay*)
- Discount factor

<u>Question 4</u>: Try to solve the same environment but using an _8 x 8_ grid (also in slippery mode):

> gym.make(ENV_NAME, desc=None, map_name="8x8", is_slippery=True)

In [3]:
import gymnasium as gym

# params
ENV_NAME = "FrozenLake-v1"
GAMMA = 0.9
TEST_EPISODES = 20

# defining the environment
env = gym.make(ENV_NAME, desc=None, map_name="4x4", is_slippery=False)

print("Action space is {} ".format(env.action_space))
print("Observation space is {} ".format(env.observation_space))
print("Reward range is {} ".format(env.unwrapped.reward_range))

Action space is Discrete(4) 
Observation space is Discrete(16) 
Reward range is (0, 1) 


In [4]:
def make_epsilon_greedy_policy(Q, epsilon, num_Actions):
    """
    Creates an epsilon-greedy policy based on an action-value function Q and a value of epsilon.
    
    Args:
         Q: A dictionary whose correspondence is state -> action-values.
            Each value is a numpy array of length num_Actions (see below)
         epsilon: The probability of selecting a random action (float between 0 and 1).
         num_Actions: Number of actions in the environment (in the case of FrozenLake, it is 4)
    
    Returns:
         A function that takes the observation as an argument and returns as a result
         the probabilities of each action as a numpy array of length num_Actions.
    """
    

def mc_control_on_policy_epsilon_greedy(env, num_episodes, discount=1.0, epsilon=0.1, epsilon_decay=0.9):
    """
    Monte Carlo control using epsilon-greedy policies.
    Finds an epsilon-greedy policy.
    
    Args:
         env: Gymnasium environment.
         num_episodes: Number of episodes in the sample.
         discount: discount factor.
         epsilon: Initial probability of selecting a random action (float between 0 and 1).
         epsilon_decay: Multiplicative factor used to decay epsilon (see Question 2).
    
    Returns:
         A tuple (Q, policy).
         Q: A dictionary whose correspondence is state -> action-values.
         policy: A function that takes the observation as an argument and returns as a result
                 the probabilities of each action
    """

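As a hint for the first function, the action probabilities of an epsilon-greedy policy can be sketched as below. This is a simplified illustration, not the reference solution: it uses plain Python lists instead of numpy arrays, breaks ties by taking the first maximal action, and the name `epsilon_greedy_probs` is hypothetical.

```python
import random


def epsilon_greedy_probs(action_values, epsilon):
    """Return the epsilon-greedy probability of each action.

    Every action receives epsilon / num_actions, and the greedy action
    receives the remaining 1 - epsilon on top of that.
    """
    num_actions = len(action_values)
    probs = [epsilon / num_actions] * num_actions
    # Ties are broken in favour of the lowest action index (an assumption).
    best = max(range(num_actions), key=lambda a: action_values[a])
    probs[best] += 1.0 - epsilon
    return probs


# Hypothetical action-values for a single state with 4 actions.
probs = epsilon_greedy_probs([0.0, 0.2, 0.0, 0.1], epsilon=0.1)
print(probs)

# Sample an action according to these probabilities.
action = random.choices(range(len(probs)), weights=probs)[0]
print(action)
```

Because every action keeps a probability of at least epsilon / num_actions, the resulting policy is $\epsilon$-soft, which is what the on-policy MC control algorithm requires.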

<div class="alert alert-block alert-danger">
<strong>Solution</strong>
</div>