# M4 - Reinforcement Learning

## Example of the CartPole environment

Below we will see a simple example that will allow us to understand the concepts introduced in this section.  

First, we will install the [Gymnasium](https://gymnasium.farama.org/) library (if we do not have it installed):

> !pip install gymnasium

For backward compatibility with code created for OpenAI Gym.

In [8]:
import gymnasium as gym

## CartPole

In this first example we are going to load the [CartPole](https://gymnasium.farama.org/environments/classic_control/cart_pole/) environment and perform some tests.

### 1. Data load

The following code loads the necessary packages for the example, creates the environment using the `make` method and prints on the screen the dimension of the:
- **action space** (two actions: 0 = left and 1 = right), 
- **observations space** (four observations : cart position, cart speed, pole angle, and pole speed at the tip) 

In [9]:
import numpy as np

env = gym.make('CartPole-v1')
print("Action space is      : {} ".format(env.action_space))
print("Observation space is : {} ".format(env.observation_space))

Action space is      : Discrete(2) 
Observation space is : Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32) 


Next, we **reset the environment** (an action that must always be performed after its creation) and initialize the variables that will store:
- the number of steps executed (t), 
- the accumulated reward (`total_reward`) and 
- the variable that will tell us when an episode ends (`done`).

In [10]:
# Environment reset
obs, info = env.reset()
t, total_reward, done = 0, 0, False

### 2. Running an episode

Next, we will run an episode of the `CartPole` environment using an agent that **selects actions randomly**.

- The following code runs an episode of the environment (it ends when the `done` variable takes the value `True`). 
- The agent is implemented using the `env.action_space.sample()` method which selects a random action. 
- For each step (_time step_), the observation generated by the environment (the four values discussed above), the selected action and the reward obtained in that step (+1 in each action until the episode ends) are printed on the screen.

In [11]:
while not done:
    # Get random action (this is the implementation of the agent)
    action = env.action_space.sample()
    
    # Execute action and get response
    new_obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    print("Obs: {} -> Action: {} and reward: {}".format(np.round(obs, 3), action, reward))
    
    obs = new_obs
    total_reward += reward
    t += 1
    
total_reward += reward
t += 1
print("Obs: {} -> Action: {} and reward: {}".format(np.round(obs, 3), action, reward))

Obs: [-0.019 -0.027 -0.024 -0.031] -> Action: 1 and reward: 1.0
Obs: [-0.02   0.168 -0.024 -0.331] -> Action: 1 and reward: 1.0
Obs: [-0.016  0.364 -0.031 -0.631] -> Action: 0 and reward: 1.0
Obs: [-0.009  0.169 -0.044 -0.348] -> Action: 0 and reward: 1.0
Obs: [-0.006 -0.025 -0.051 -0.07 ] -> Action: 0 and reward: 1.0
Obs: [-0.006 -0.22  -0.052  0.207] -> Action: 0 and reward: 1.0
Obs: [-0.01  -0.414 -0.048  0.483] -> Action: 1 and reward: 1.0
Obs: [-0.019 -0.218 -0.038  0.175] -> Action: 0 and reward: 1.0
Obs: [-0.023 -0.413 -0.035  0.456] -> Action: 1 and reward: 1.0
Obs: [-0.031 -0.217 -0.026  0.152] -> Action: 1 and reward: 1.0
Obs: [-0.036 -0.022 -0.023 -0.149] -> Action: 0 and reward: 1.0
Obs: [-0.036 -0.217 -0.026  0.137] -> Action: 1 and reward: 1.0
Obs: [-0.04  -0.021 -0.023 -0.164] -> Action: 1 and reward: 1.0
Obs: [-0.041  0.174 -0.026 -0.464] -> Action: 1 and reward: 1.0
Obs: [-0.037  0.37  -0.035 -0.764] -> Action: 0 and reward: 1.0
Obs: [-0.03   0.175 -0.051 -0.483] -> Ac

Finally, we print the results and **close the environment**.

In [12]:
print("Episode finished after {} timesteps and reward was {} ".format(t, total_reward))
env.close()

Episode finished after 25 timesteps and reward was 25.0 


### 3. Simulating several episodes

The following code fragment repeats the process from the previous section for the number of episodes defined in the `num_episodes` variable. 

- The episodes are rendered to display them on the screen in an external window.

<u>Notes</u>:
- The parameter `render_mode="human"` renders it to an external window, but it does not work in **Google Colab**.

In [None]:
env = gym.make("CartPole-v1", render_mode="human") # rendering
env = gym.make("CartPole-v1")

num_episodes = 10

for episode in range(num_episodes):
    # Environment reset
    obs, info = env.reset()
    t, total_reward, done = 0, 0, False
    
    print('Running episode {} '.format(episode+1))
    
    while not done:
    
        # Get random action (this is the implementation of the agent)
        action = env.action_space.sample()
    
        # Execute action and get response
        new_obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        print("Obs: {} -> Action: {} and reward: {}".format(np.round(obs, 3), action, reward))
    
        obs = new_obs
        total_reward += reward
        t += 1
        
    total_reward += reward
    t += 1
    print("Obs: {} -> Action: {} and reward: {}".format(np.round(obs, 3), action, reward))
    print("Episode {} finished after {} timesteps and reward was {} ".format(episode+1, t, total_reward))
    print('')
    
env.close()

Running episode 1 
Obs: [-0.03  -0.001 -0.03   0.015] -> Action: 1 and reward: 1.0
Obs: [-0.03   0.195 -0.03  -0.287] -> Action: 1 and reward: 1.0
Obs: [-0.026  0.39  -0.036 -0.589] -> Action: 1 and reward: 1.0
Obs: [-0.018  0.586 -0.048 -0.892] -> Action: 0 and reward: 1.0
Obs: [-0.006  0.391 -0.065 -0.615] -> Action: 0 and reward: 1.0
Obs: [ 0.001  0.197 -0.078 -0.344] -> Action: 1 and reward: 1.0
Obs: [ 0.005  0.393 -0.085 -0.66 ] -> Action: 1 and reward: 1.0
Obs: [ 0.013  0.59  -0.098 -0.978] -> Action: 0 and reward: 1.0
Obs: [ 0.025  0.396 -0.117 -0.718] -> Action: 1 and reward: 1.0
Obs: [ 0.033  0.592 -0.132 -1.045] -> Action: 0 and reward: 1.0
Obs: [ 0.045  0.399 -0.153 -0.796] -> Action: 1 and reward: 1.0
Obs: [ 0.053  0.596 -0.169 -1.133] -> Action: 0 and reward: 1.0
Obs: [ 0.065  0.404 -0.191 -0.897] -> Action: 1 and reward: 1.0
Obs: [ 0.073  0.601 -0.209 -1.244] -> Action: 0 and reward: 1.0
Obs: [ 0.085  0.409 -0.234 -1.023] -> Action: 0 and reward: 1.0
Episode 1 finished af