# M4 - Reinforcement Learning

## Example of the FrozenLake environment

The following simple example illustrates the concepts introduced in this section.

First we will install the [Gymnasium](https://gymnasium.farama.org/) library (if we do not have it installed):

> !pip install gymnasium

We import the library under the alias `gym` for backward compatibility with code written for OpenAI Gym.

In [1]:
import gymnasium as gym

## Frozen Lake

In this example we are going to load the [FrozenLake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) environment and run some tests.

### 1. Data load

The following code loads the packages necessary for the example, creates the environment using the `make` method and prints:
- the **action space** (0 = left, 1 = down, 2 = right and 3 = up), and
- the **observation space** (a number from 0 to 15 that indicates the position of the agent in the environment).

In [2]:
import time

env = gym.make("FrozenLake-v1")
print("Action space is {} ".format(env.action_space))
print("Observation space is {} ".format(env.observation_space))

Action space is Discrete(4) 
Observation space is Discrete(16) 
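
Because each observation is a single integer, it can help to translate it into grid coordinates. The sketch below assumes the default 4x4 map, with cells numbered row by row starting at the top-left corner. The helper names `state_to_cell` and `cell_to_state` are illustrative and are not part of Gymnasium.

```python
N_COLS = 4  # assumption: width of the default 4x4 map

def state_to_cell(state, n_cols=N_COLS):
    """Convert a state index (0..15) into a (row, col) pair."""
    return divmod(state, n_cols)

def cell_to_state(row, col, n_cols=N_COLS):
    """Convert a (row, col) pair back into a state index."""
    return row * n_cols + col

print(state_to_cell(14))    # (3, 2)
print(cell_to_state(3, 3))  # 15
```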


To observe the default map:
 - S = starting cell
 - G = destination cell
 - H = hole
 - F = ice cell

In [3]:
print(env.unwrapped.desc)

[[b'S' b'F' b'F' b'F']
 [b'F' b'H' b'F' b'H']
 [b'F' b'F' b'F' b'H']
 [b'H' b'F' b'F' b'G']]
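
The map is stored as an array of single bytes, which is awkward to read. As a minimal sketch, the snippet below copies the printed layout into a plain nested list, decodes it into strings and looks up the state indices of the holes and the goal. The helper `find_states` is a hypothetical name introduced only for this illustration.

```python
# Same layout as the printed map, written as a plain nested list of bytes
desc = [
    [b"S", b"F", b"F", b"F"],
    [b"F", b"H", b"F", b"H"],
    [b"F", b"F", b"F", b"H"],
    [b"H", b"F", b"F", b"G"],
]

# Decode every cell so that each row becomes an ordinary string
rows = ["".join(cell.decode("utf-8") for cell in row) for row in desc]

def find_states(rows, symbol):
    """Return the state indices (row-major order) of all cells holding `symbol`."""
    n_cols = len(rows[0])
    return [r * n_cols + c
            for r, row in enumerate(rows)
            for c, cell in enumerate(row)
            if cell == symbol]

print(rows)                    # ['SFFF', 'FHFH', 'FFFH', 'HFFG']
print(find_states(rows, "H"))  # [5, 7, 11, 12]
print(find_states(rows, "G"))  # [15]
```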


### 2. Running an episode

Next, we will execute an episode of the FrozenLake environment using an agent that **selects actions randomly**.

In the following code we initialize the environment, define the maximum number of steps per episode (`max_steps`) and execute an episode of the environment:
- The episode ends when the `done` variable takes the value `True` or when the stipulated maximum number of steps is reached.
- We use an agent that implements a completely random policy (`env.action_space.sample()`).
- If a `render_mode` is set when the environment is created, the `env.render()` method lets us follow the agent from the starting cell S until it reaches the destination cell G or falls into a hole H (the code below does not render).

In [4]:
# Environment reset
obs, info = env.reset()
t, total_reward, done = 0, 0, False
max_steps = 100

while t < max_steps:
    # Get random action (this is the implementation of the agent)
    action = env.action_space.sample()
    
    # Execute action and get response
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
        
    t += 1
    if done:
        break
    time.sleep(0.1)

print("Episode finished after {} timesteps and reward was {} ".format(t, reward))
env.close()

Episode finished after 2 timesteps and reward was 0.0 


### 3. Simulating several episodes

The following code fragment repeats the process from the previous section for the number of episodes defined in the `num_episodes` variable. 

- With `render_mode="human"` the episodes are displayed on the screen in an external window (this does not work in **Google Colab**); the code below leaves rendering disabled.

In [5]:
# Optional render_mode values: "human" (may crash the kernel in notebooks) or "ansi"
env = gym.make("FrozenLake-v1")

num_episodes = 10

for episode in range(num_episodes):

    # Environment reset
    obs, info = env.reset()
    t, done = 0, False
    
    print('Running episode {} '.format(episode+1))

    while t < max_steps:
        # Get random action (this is the implementation of the agent)
        action = env.action_space.sample()
    
        # Execute action and get response
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        
        t += 1
        if done:
            break
        time.sleep(0.1)
      
    print("Episode {} finished after {} timesteps and reward was {} ".format(episode+1, t, reward))
    print('')

env.close()

Running episode 1 
Episode 1 finished after 24 timesteps and reward was 0.0 

Running episode 2 
Episode 2 finished after 3 timesteps and reward was 0.0 

Running episode 3 
Episode 3 finished after 9 timesteps and reward was 0.0 

Running episode 4 
Episode 4 finished after 8 timesteps and reward was 0.0 

Running episode 5 
Episode 5 finished after 17 timesteps and reward was 0.0 

Running episode 6 
Episode 6 finished after 2 timesteps and reward was 0.0 

Running episode 7 
Episode 7 finished after 9 timesteps and reward was 0.0 

Running episode 8 
Episode 8 finished after 8 timesteps and reward was 0.0 

Running episode 9 
Episode 9 finished after 13 timesteps and reward was 0.0 

Running episode 10 
Episode 10 finished after 3 timesteps and reward was 0.0 
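
To get a feel for how quickly the random agent ends an episode, we can summarize the episode lengths from the log above. This is only a sketch over those ten logged values; a different run will produce different numbers.

```python
from statistics import mean, median

# Episode lengths copied from the log above
episode_lengths = [24, 3, 9, 8, 17, 2, 9, 8, 13, 3]

print("Mean length:", mean(episode_lengths))
print("Median length:", median(episode_lengths))
print("Shortest / longest:", min(episode_lengths), "/", max(episode_lengths))
```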



### 4. Calculating the total reward of multiple episodes

To measure the efficiency of the agent, we can calculate the total reward of several episodes. 

- Given that in each episode the accumulated reward is 0 if the destination cell is not reached and 1 if the objective is achieved, measuring the **total accumulated reward** for a number of episodes gives us a measure of the success rate of our agent.

The following code fragment repeats the process from the previous section for the number of episodes defined in the `num_episodes` variable and calculates the agent's success rate. 

- Rendering of the environment is omitted in order to speed up execution.

In [6]:
env = gym.make("FrozenLake-v1")

num_episodes = 1000
total_reward = 0

for episode in range(num_episodes):

    # Environment reset
    obs, info = env.reset()
    t, done = 0, False
    
    while t < max_steps:
        # Get random action (this is the implementation of the agent)
        action = env.action_space.sample()
    
        # Execute action and get response
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        
        total_reward += reward
        t += 1
        if done:
            break
    
success_rate = total_reward*100/num_episodes
print("{} successes in {} episodes: {} % of success".format(total_reward, num_episodes, success_rate))

14.0 successes in 1000 episodes: 1.4 % of success
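
A success rate estimated from a finite number of episodes is itself noisy. The sketch below attaches a standard error and an approximate 95 % interval to the result above, treating every episode as an independent success/failure trial. The normal approximation used here is an assumption and is only rough for rates this close to zero; `success_rate_with_error` is an illustrative helper name.

```python
import math

def success_rate_with_error(successes, num_episodes):
    """Return the success rate and its standard error (both as fractions)."""
    rate = successes / num_episodes
    std_error = math.sqrt(rate * (1 - rate) / num_episodes)
    return rate, std_error

# Values taken from the run above: 14 successes in 1000 episodes
rate, std_error = success_rate_with_error(14, 1000)

# Approximate 95 % interval (normal approximation), clipped to [0, 1]
low = max(0.0, rate - 1.96 * std_error)
high = min(1.0, rate + 1.96 * std_error)

print("Success rate: {:.1f} % (approx. 95 % interval: {:.1f} % to {:.1f} %)".format(
    rate * 100, low * 100, high * 100))
```

With 1000 episodes the interval is still fairly wide compared with the estimate itself, which is consistent with the rate varying between roughly 1 % and 2 % across runs.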


### 5. Training an agent in a deterministic environment

As we saw in the previous section, an agent that chooses its actions at random almost never reaches the destination cell G (the success rate is around 1% to 2%).

- First of all, we will start using a **deterministic environment** (instead of the stochastic environment we have used until now).
- Thus, it will be possible to define an **optimal policy** in a very simple way. 
- We will use a one-dimensional `ndarray`, where each position (corresponding to a single map cell) holds the action to take in that cell in order to reach the goal.

Define the environment using the `is_slippery=False` parameter.

In [7]:
env = gym.make("FrozenLake-v1", is_slippery=False, render_mode="ansi")

print(env.unwrapped.desc)

[[b'S' b'F' b'F' b'F']
 [b'F' b'H' b'F' b'H']
 [b'F' b'F' b'F' b'H']
 [b'H' b'F' b'F' b'G']]


Next, we will create an array to indicate the **optimal policy** to be followed by the agent.

The actions are:
- 0: Move left
- 1: Move down
- 2: Move right
- 3: Move up

In [8]:
import numpy as np

optimal_policy = np.array([1, 0, 0, 0, 1, 0, 0, 0, 2, 2, 1, 0, 0, 2, 2, 2])

print("Optimal policy: {}".format(optimal_policy))

Optimal policy: [1 0 0 0 1 0 0 0 2 2 1 0 0 2 2 2]
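
A flat array of action codes is hard to check by eye. As a sketch, the policy can be laid out on the 4x4 grid, using `<`, `v`, `>` and `^` for left, down, right and up. Holes and the goal keep their letter, since no action is taken there. The helper `policy_as_grid` is a hypothetical name used only for this illustration.

```python
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^"}    # same action codes as listed above
MAP_ROWS = ["SFFF", "FHFH", "FFFH", "HFFG"]  # default map printed earlier

policy = [1, 0, 0, 0, 1, 0, 0, 0, 2, 2, 1, 0, 0, 2, 2, 2]

def policy_as_grid(policy, map_rows):
    """Return one string per map row; holes and the goal keep their letter."""
    n_cols = len(map_rows[0])
    lines = []
    for r, row in enumerate(map_rows):
        symbols = []
        for c, cell in enumerate(row):
            if cell in "HG":
                symbols.append(cell)  # terminal cell: the action is irrelevant
            else:
                symbols.append(ARROWS[policy[r * n_cols + c]])
        lines.append(" ".join(symbols))
    return lines

for line in policy_as_grid(policy, MAP_ROWS):
    print(line)
# v < < <
# v H < H
# > > v H
# H > > G
```

Cells that are not on the chosen path were filled with action 0, which is why they appear as `<`.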


In [9]:
# Environment reset
obs, info = env.reset()
t, done = 0, False

while not done:
    # Select the action to be performed
    action = optimal_policy[obs]
    
    # Execute action and get response
    new_obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    print("Obs: {} -> Action: {} and reward: {}".format(np.round(obs, 3), action, reward))
        
    t += 1
    obs = new_obs
    print(env.render())
    time.sleep(0.1)

print("Episode finished after {} timesteps and reward was {} ".format(t, reward))
env.close()

Obs: 0 -> Action: 1 and reward: 0.0
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG

Obs: 4 -> Action: 1 and reward: 0.0
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG

Obs: 8 -> Action: 2 and reward: 0.0
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG

Obs: 9 -> Action: 2 and reward: 0.0
  (Right)
SFFF
FHFH
FF[41mF[0mH
HFFG

Obs: 10 -> Action: 1 and reward: 0.0
  (Down)
SFFF
FHFH
FFFH
HF[41mF[0mG

Obs: 14 -> Action: 2 and reward: 1.0
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m

Episode finished after 6 timesteps and reward was 1.0 
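
In the deterministic case the outcome of a policy can also be checked without the library. The sketch below is a simplified re-implementation of the non-slippery transition rule (it is not Gymnasium's own code) and follows the policy from the starting cell. The names `step_deterministic` and `rollout` are illustrative.

```python
MAP_ROWS = ["SFFF", "FHFH", "FFFH", "HFFG"]
# (row change, column change) for each action code
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def step_deterministic(state, action, map_rows=MAP_ROWS):
    """Simplified non-slippery transition: returns (next_state, reward, terminated)."""
    n_rows, n_cols = len(map_rows), len(map_rows[0])
    row, col = divmod(state, n_cols)
    d_row, d_col = MOVES[action]
    # A move into the border leaves the agent where it is
    row = min(max(row + d_row, 0), n_rows - 1)
    col = min(max(col + d_col, 0), n_cols - 1)
    cell = map_rows[row][col]
    return row * n_cols + col, float(cell == "G"), cell in "HG"

def rollout(policy, max_steps=100):
    """Follow `policy` from the start cell and return (visited states, last reward)."""
    state, path, reward = 0, [0], 0.0
    for _ in range(max_steps):
        state, reward, terminated = step_deterministic(state, policy[state])
        path.append(state)
        if terminated:
            break
    return path, reward

policy = [1, 0, 0, 0, 1, 0, 0, 0, 2, 2, 1, 0, 0, 2, 2, 2]
path, reward = rollout(policy)
print(path, reward)  # [0, 4, 8, 9, 10, 14, 15] 1.0
```

The path contains the same six transitions as the trace above and ends with a reward of 1.0.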


### 6. Training an agent in a stochastic environment

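Finally, we apply the **optimal policy** found in the previous section to a **stochastic version of the environment**. Because the ice is now slippery, the agent does not always move in the chosen direction, so this policy is no longer guaranteed to reach the goal.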

In [10]:
env = gym.make("FrozenLake-v1", is_slippery=True, render_mode="ansi")

# Environment reset
obs, info = env.reset()
t, done = 0, False

while not done:
    # Select the action to be performed
    action = optimal_policy[obs]
    
    # Execute action and get response
    new_obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    print("Obs: {} -> Action: {} and reward: {}".format(np.round(obs, 3), action, reward))
        
    t += 1
    obs = new_obs
    print(env.render())
    time.sleep(0.1)

print("Episode finished after {} timesteps and reward was {} ".format(t, reward))
env.close()

Obs: 0 -> Action: 1 and reward: 0.0
  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG

Obs: 0 -> Action: 1 and reward: 0.0
  (Down)
S[41mF[0mFF
FHFH
FFFH
HFFG

Obs: 1 -> Action: 0 and reward: 0.0
  (Left)
S[41mF[0mFF
FHFH
FFFH
HFFG

Obs: 1 -> Action: 0 and reward: 0.0
  (Left)
S[41mF[0mFF
FHFH
FFFH
HFFG

Obs: 1 -> Action: 0 and reward: 0.0
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG

Obs: 0 -> Action: 1 and reward: 0.0
  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG

Obs: 0 -> Action: 1 and reward: 0.0
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG

Obs: 4 -> Action: 1 and reward: 0.0
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG

Obs: 8 -> Action: 2 and reward: 0.0
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG

Obs: 9 -> Action: 2 and reward: 0.0
  (Right)
SFFF
F[41mH[0mFH
FFFH
HFFG

Episode finished after 10 timesteps and reward was 0.0 
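
The trace above shows why the policy fails: from cell 9 the agent chose Right but slid up into the hole at cell 5. The sketch below lists the possible next cells for a state and an action. It assumes the documented default slippery behavior, in which the intended direction and the two perpendicular directions are each taken with probability 1/3. It is a simplified model, not Gymnasium's own code, and `move` and `slippery_outcomes` are illustrative names.

```python
MAP_ROWS = ["SFFF", "FHFH", "FFFH", "HFFG"]
# (row change, column change) for each action code
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def move(state, action, map_rows=MAP_ROWS):
    """Cell reached when `action` is actually executed (borders block the move)."""
    n_rows, n_cols = len(map_rows), len(map_rows[0])
    row, col = divmod(state, n_cols)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), n_rows - 1)
    col = min(max(col + d_col, 0), n_cols - 1)
    return row * n_cols + col

def slippery_outcomes(state, action):
    """Map each reachable next state to its probability on slippery ice."""
    outcomes = {}
    # Assumption: intended direction and both perpendicular ones, 1/3 each
    for actual in ((action - 1) % 4, action, (action + 1) % 4):
        next_state = move(state, actual)
        outcomes[next_state] = outcomes.get(next_state, 0.0) + 1 / 3
    return outcomes

print(sorted(slippery_outcomes(9, 2)))  # [5, 10, 13]
print(sorted(slippery_outcomes(0, 1)))  # [0, 1, 4]
```

Under this model, choosing Right in cell 9 leads to the hole at cell 5 one time in three, so a policy that is optimal on non-slippery ice can still be risky here.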
