# **Tabular Reinforcement Learning**

# SARSA on FrozenLake environment

## Non-Evaluable Practical Exercise

This is a non-evaluable practical exercise, but it is recommended that students complete it fully and individually, since it is an important part of the learning process.

A solution is provided, but students are advised not to consult it until they have completed the exercise.

## The FrozenLake environment

In this activity, we are going to implement the **SARSA** algorithm on the [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) environment.

Main characteristics:
- The game starts with the player at location [0,0] of the frozen lake grid world, with the goal located at the far corner of the world, e.g. [3,3] for the 4x4 environment.
- Holes in the ice are placed at fixed locations when using a predetermined map, or at random locations when a random map is generated.
- The player keeps moving until they reach the goal or fall into a hole.
- The lake is slippery (unless disabled), so the player may sometimes move perpendicular to the intended direction (see the _is_slippery_ parameter).

<img src="https://gymnasium.farama.org/_images/frozen_lake.gif" />
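
To make the slippery behaviour concrete, the following minimal sketch enumerates where one move can end up on a 4x4 grid. It is a simplified stand-in written for illustration, not Gymnasium's own code: the helper name `slippery_outcomes` is hypothetical, it assumes that the intended direction and the two perpendicular directions are each taken with probability 1/3, and it ignores holes and the goal.

```python
from collections import Counter

# Gymnasium's FrozenLake action encoding: 0=Left, 1=Down, 2=Right, 3=Up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # (d_row, d_col)


def slippery_outcomes(state, action, n=4):
    """Return {next_state: probability} for one slippery move on an n x n grid.

    Assumption: the intended direction and the two perpendicular ones are each
    taken with probability 1/3; a move off the grid leaves the agent in place.
    """
    row, col = divmod(state, n)
    outcomes = Counter()
    for a in ((action - 1) % 4, action, (action + 1) % 4):
        d_row, d_col = MOVES[a]
        new_row = min(max(row + d_row, 0), n - 1)
        new_col = min(max(col + d_col, 0), n - 1)
        outcomes[new_row * n + new_col] += 1 / 3
    return dict(outcomes)


# From the start state, trying to go Left keeps the agent in state 0 two
# times out of three and slides it down to state 4 otherwise.
print(slippery_outcomes(state=0, action=0))
```

Under these assumptions, even a well-chosen action only succeeds part of the time, which is why a single episode says little about the quality of a policy.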

## SARSA

<u>Question 1</u>: **Implement the *SARSA* algorithm** explained in the "Temporal Difference Learning" module using the following parameters:

- Number of episodes = 1000000
- *learning rate* = 0.5
- *discount factor* = 1
- *epsilon* = 0.05  
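
As a small worked example of what these parameters do, the sketch below applies the standard SARSA update, Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)], to a few hand-picked values with α = 0.5 and γ = 1. The function name `sarsa_update` and the numbers are illustrative assumptions, not part of the exercise.

```python
def sarsa_update(q_sa, reward, q_next, learning_rate=0.5, discount=1.0):
    """Return the new value of Q(s, a) after one SARSA step."""
    td_target = reward + discount * q_next  # r + gamma * Q(s', a')
    td_error = td_target - q_sa             # how far Q(s, a) is from the target
    return q_sa + learning_rate * td_error


# Stepping into the goal: reward 1, and the terminal state has value 0.
print(sarsa_update(q_sa=0.0, reward=1.0, q_next=0.0))  # 0.5
# The same transition seen again moves Q(s, a) halfway to the target once more.
print(sarsa_update(q_sa=0.5, reward=1.0, q_next=0.0))  # 0.75
# One step earlier: no reward, but the next state-action pair is worth 0.5.
print(sarsa_update(q_sa=0.0, reward=0.0, q_next=0.5))  # 0.25
```

With a learning rate of 0.5, each update closes half of the gap to the TD target, so values propagate backwards from the goal one transition at a time.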

<u>Question 2</u>: Once you have coded the algorithm, try different **values for the hyperparameters** and comment on the best ones (providing an empirical comparison):

- Number of episodes
- *learning rate* 
- *discount factor* 
- *epsilon*
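
One possible way to organise such a comparison is a plain grid over the four hyperparameters. The sketch below is only a scaffold: `compare_settings`, `dummy_score`, and the candidate values in `grid` are hypothetical, and the scoring callable is a placeholder that should be replaced by real training followed by an evaluation of the learned policy.

```python
from itertools import product


def compare_settings(train_and_score, grid):
    """Score every combination in `grid` and return results sorted best-first.

    `train_and_score` is a hypothetical callable: it receives one keyword
    argument per hyperparameter and returns a float score, for example the
    fraction of evaluation episodes that reach the goal.
    """
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((train_and_score(**params), params))
    return sorted(results, key=lambda item: item[0], reverse=True)


# Illustrative candidate values only; they are not recommendations.
grid = {
    "episodes": [10_000, 100_000],
    "learning_rate": [0.1, 0.5],
    "discount": [0.95, 1.0],
    "epsilon": [0.05, 0.2],
}


def dummy_score(episodes, learning_rate, discount, epsilon):
    """Placeholder so the sketch runs on its own; always returns 0.0."""
    return 0.0


for score, params in compare_settings(dummy_score, grid)[:3]:
    print(score, params)
```

Because the environment is stochastic, each score is noisy; repeating every setting a few times and averaging makes the comparison more trustworthy.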

<u>Question 3</u>: Try to solve the same environment but using an _8 x 8_ grid (also in slippery mode):
> gym.make(ENV_NAME, desc=None, map_name="8x8", is_slippery=True)
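
When inspecting results on the larger map, it helps to translate between the integer observation and grid coordinates. The sketch below relies on the encoding `row * ncols + col` described in the Gymnasium documentation; the helper names `to_row_col` and `to_state` are hypothetical.

```python
def to_row_col(state, ncols=8):
    """Convert an integer observation into (row, col) grid coordinates."""
    return divmod(state, ncols)


def to_state(row, col, ncols=8):
    """Convert (row, col) grid coordinates into an integer observation."""
    return row * ncols + col


print(to_row_col(63))  # (7, 7): the far corner of the 8x8 map
print(to_state(0, 0))  # 0: the start state
```

The 8x8 map has 64 states instead of 16, so the Q table has four times as many entries to learn and rewards are reached far less often during early exploration.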

In [None]:
import gymnasium as gym

# params
ENV_NAME = "FrozenLake-v1"
GAMMA = 0.9
TEST_EPISODES = 1000000

# defining the environment
env = gym.make(ENV_NAME, desc=None, map_name="4x4", is_slippery=True)

print("Gymnasium version is {} ".format(gym.__version__))
print("Action space is {} ".format(env.action_space))
print("Observation space is {} ".format(env.observation_space))

Gymnasium version is 1.2.0 
Action space is Discrete(4) 
Observation space is Discrete(16) 


In [56]:
from collections import defaultdict
import sys
import numpy as np


def epsilon_greedy_policy(Q, state, nA, epsilon):
    '''
    Build an epsilon-greedy policy: with probability epsilon a uniformly random action is taken, otherwise the greedy one.

    :param Q: mapping state -> array of action values (dictionary)
    :param state: state in which the agent is (int)
    :param nA: number of actions (int)
    :param epsilon: probability of taking a random action (float)
    :return: probability of each action (NumPy array of length nA)
    '''

    probs = np.ones(nA) * epsilon / nA
    best_action = np.argmax(Q[state])
    probs[best_action] += 1.0 - epsilon

    return probs


def SARSA(episodes, learning_rate, discount, epsilon):
    '''
    Learn to solve the environment using the SARSA algorithm

    :param episodes: Number of episodes (int)
    :param learning_rate: Learning rate (float [0, 1])
    :param discount: Discount factor (float [0, 1])
    :param epsilon: Probability of taking a random action (float [0, 1])
    :return: x, y, Q where x holds the episode indices, y the number of steps
             taken in each episode, and Q the learned action-value function
    '''

    # Action-value table: maps each state to an array with one value per action
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    # x: episode indices; y: number of steps taken in each episode
    x = np.arange(episodes)
    y = np.zeros(episodes)
    
    for episode in range(episodes):
        state = env.reset()[0]
        # Select the first action with the epsilon-greedy policy
        probs = epsilon_greedy_policy(Q, state, env.action_space.n, epsilon)
        action = np.random.choice(np.arange(len(probs)), p=probs)
        
        done = False
        step = 1
                
        while not done:
            # Execute the current action
            next_state, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            # Select the next action with the same epsilon-greedy policy (on-policy)
            probs = epsilon_greedy_policy(Q, next_state, env.action_space.n, epsilon)
            next_action = np.random.choice(np.arange(len(probs)), p=probs)
           
            # SARSA update: move Q(s, a) towards the TD target
            td_target = reward + discount * Q[next_state][next_action]
            td_error = td_target - Q[state][action]
            Q[state][action] += learning_rate * td_error
                        
            if done:
                y[episode] = step
                break

            state = next_state
            action = next_action
            step += 1
                 
    return x, y, Q
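
One detail of the code above is worth keeping in mind: `np.argmax` returns the first index when several entries are tied, so while all values of a state are still zero the greedy action is always action 0. A possible alternative, shown here as a pure-Python sketch with the hypothetical helper `argmax_random_tiebreak`, is to break ties uniformly at random. Whether this helps in practice is something to check empirically.

```python
import random


def argmax_random_tiebreak(values, rng=random):
    """Return the index of a maximal entry, choosing uniformly among ties."""
    best = max(values)
    candidates = [i for i, v in enumerate(values) if v == best]
    return rng.choice(candidates)


# With all action values equal, every action can be selected.
rng = random.Random(0)
picks = {argmax_random_tiebreak([0.0, 0.0, 0.0, 0.0], rng) for _ in range(200)}
print(sorted(picks))

# With a unique maximum the result is deterministic.
print(argmax_random_tiebreak([0.1, 0.7, 0.3]))  # 1
```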


<div class="alert alert-block alert-danger">
<strong>Solution</strong>
</div>

In [57]:
x, y, q = SARSA(episodes=100000, learning_rate=0.5, discount=1, epsilon=0.05)
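
Because the lake is slippery, a single rollout is a noisy indicator of how good the learned values are. A possible way to quantify the result is to run the greedy policy for many episodes and report the fraction that ends with a positive reward. The helper `greedy_success_rate` below is a hypothetical sketch; it assumes an environment that follows the Gymnasium API and a table `q` that maps each observation to a sequence of action values.

```python
def greedy_success_rate(q, env, episodes=1000, max_steps=1000):
    """Fraction of episodes in which the greedy policy ends with a positive reward.

    Assumes `env.reset()` returns (obs, info), `env.step()` returns the
    5-tuple (obs, reward, terminated, truncated, info), and `q[obs]` is a
    sequence of action values.
    """
    successes = 0
    for _ in range(episodes):
        obs, _ = env.reset()
        for _ in range(max_steps):
            values = list(q[obs])
            action = values.index(max(values))  # greedy action, first index on ties
            obs, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                successes += reward > 0  # in FrozenLake only the goal pays a reward
                break
    return successes / episodes
```

With the objects defined in this notebook, the call would be `greedy_success_rate(q, env)`; the resulting number can also serve as the score when comparing hyperparameter settings.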

In [58]:
# Run one episode following the greedy policy with respect to the learned Q
def execute_episode_SARSA(q, env):
    obs = env.reset()[0]
    t, total_reward, done = 0, 0, False

    print("Obs initial: {} ".format(obs))

    # FrozenLake action encoding in Gymnasium: 0 = Left, 1 = Down, 2 = Right, 3 = Up
    switch_action = {
            0: "L",
            1: "D",
            2: "R",
            3: "U",
        }

    for t in range(1000): # We limit the number of time-steps in each episode to 1000
        # Choose the greedy action for the current state
        arr = np.array(q[obs])
        action = arr.argmax()
       
        # Execute the action and wait for the response from the environment
        new_obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        obs = new_obs
        print("Action: {} -> Obs: {} and reward: {}".format(switch_action[action], obs, reward))

        if t==999:
            print("Number of time-steps exceeds 1000. STOP episode.") 
        total_reward += reward
        t += 1
        if done:
            break
   
    print("Episode finished after {} timesteps and reward was {} ".format(t, total_reward))
    env.close()


In [59]:
execute_episode_SARSA(q, env)

Obs initial: 0 
Action: L -> Obs: 0 and reward: 0.0
Action: L -> Obs: 4 and reward: 0.0
Action: L -> Obs: 4 and reward: 0.0
Action: L -> Obs: 0 and reward: 0.0
Action: L -> Obs: 0 and reward: 0.0
Action: L -> Obs: 0 and reward: 0.0
Action: L -> Obs: 0 and reward: 0.0
Action: L -> Obs: 0 and reward: 0.0
Action: L -> Obs: 4 and reward: 0.0
Action: L -> Obs: 4 and reward: 0.0
Action: L -> Obs: 8 and reward: 0.0
Action: U -> Obs: 8 and reward: 0.0
Action: U -> Obs: 8 and reward: 0.0
Action: U -> Obs: 4 and reward: 0.0
Action: L -> Obs: 4 and reward: 0.0
Action: L -> Obs: 0 and reward: 0.0
Action: L -> Obs: 0 and reward: 0.0
Action: L -> Obs: 0 and reward: 0.0
Action: L -> Obs: 0 and reward: 0.0
Action: L -> Obs: 4 and reward: 0.0
Action: L -> Obs: 4 and reward: 0.0
Action: L -> Obs: 8 and reward: 0.0
Action: U -> Obs: 9 and reward: 0.0
Action: D -> Obs: 8 and reward: 0.0
Action: U -> Obs: 9 and reward: 0.0
Action: D -> Obs: 8 and reward: 0.0
Action: U -> Obs: 4 and reward: 0.0
Action: L ->