# M4 - Reinforcement Learning

## Example of the CartPole environment

Below is a simple example that illustrates the concepts introduced in this section.

First, we will install the [Gymnasium](https://gymnasium.farama.org/) library (if we do not have it installed):

> !pip install gymnasium

We import it under the alias `gym`, for backward compatibility with code written for OpenAI Gym.

In [1]:
import gymnasium as gym

## CartPole

In this first example we are going to load the [CartPole](https://gymnasium.farama.org/environments/classic_control/cart_pole/) environment and perform some tests.

### 1. Loading the environment

The following code imports the required packages, creates the environment with the `make` function, and prints the:
- **action space** (two discrete actions: 0 = push the cart to the left and 1 = push it to the right),
- **observation space** (four values: cart position, cart velocity, pole angle, and pole angular velocity).

In [2]:
import numpy as np

env = gym.make('CartPole-v1')
print("Action space is      : {} ".format(env.action_space))
print("Observation space is : {} ".format(env.observation_space))

Action space is      : Discrete(2) 
Observation space is : Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32) 
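
The bounds printed above can be read value by value. The following minimal sketch copies the numbers from that output (it does not query the environment) and assumes the pole angle is expressed in radians, as stated in the Gymnasium documentation; the `names` list is only an illustrative label set.

```python
import math

# Bounds copied from the printed observation space
low = [-4.8, -math.inf, -0.41887903, -math.inf]
high = [4.8, math.inf, 0.41887903, math.inf]
names = ["cart position", "cart velocity", "pole angle (rad)", "pole angular velocity"]

for name, lo, hi in zip(names, low, high):
    # A component is bounded only if both of its limits are finite
    kind = "bounded" if math.isfinite(lo) and math.isfinite(hi) else "unbounded"
    print("{:<22} [{}, {}] -> {}".format(name, lo, hi, kind))

# Convert the angle bound from radians to degrees
print("Pole angle bound: {:.1f} degrees".format(math.degrees(high[2])))  # 24.0 degrees
```

Only the position and the angle are bounded; both velocities can take any real value.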


Next, we **reset the environment** (a step that must always be performed before interacting with a newly created environment) and initialize the variables that will store:
- the number of steps executed (`t`),
- the accumulated reward (`total_reward`), and
- the flag that tells us when an episode ends (`done`).

In [3]:
# Environment reset
obs, info = env.reset()
t, total_reward, done = 0, 0, False

### 2. Running an episode

Next, we will run an episode of the `CartPole` environment using an agent that **selects actions randomly**.

- The following code runs one episode of the environment (it ends when the `done` variable takes the value `True`).
- The agent is implemented with the `env.action_space.sample()` method, which selects a random action.
- For each step (_time step_), the code prints the observation from which the action was chosen (the four values discussed above), the selected action, and the reward obtained in that step (+1 per step until the episode ends).

In [4]:
while not done:
    # Get random action (this is the implementation of the agent)
    action = env.action_space.sample()
    
    # Execute action and get response
    new_obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    print("Obs: {} -> Action: {} and reward: {}".format(np.round(obs, 3), action, reward))
    
    obs = new_obs
    total_reward += reward
    t += 1
    
# Print the final observation. The action and reward shown are those of the
# last step, which the loop above has already counted.
print("Obs: {} -> Action: {} and reward: {}".format(np.round(obs, 3), action, reward))

Obs: [-0.049  0.048  0.001  0.028] -> Action: 1 and reward: 1.0
Obs: [-0.048  0.243  0.002 -0.265] -> Action: 0 and reward: 1.0
Obs: [-0.043  0.048 -0.004  0.028] -> Action: 1 and reward: 1.0
Obs: [-0.042  0.243 -0.003 -0.265] -> Action: 1 and reward: 1.0
Obs: [-0.037  0.438 -0.008 -0.559] -> Action: 1 and reward: 1.0
Obs: [-0.029  0.634 -0.02  -0.854] -> Action: 1 and reward: 1.0
Obs: [-0.016  0.829 -0.037 -1.153] -> Action: 1 and reward: 1.0
Obs: [ 1.000e-03  1.025e+00 -6.000e-02 -1.457e+00] -> Action: 0 and reward: 1.0
Obs: [ 0.021  0.83  -0.089 -1.183] -> Action: 0 and reward: 1.0
Obs: [ 0.038  0.636 -0.112 -0.92 ] -> Action: 1 and reward: 1.0
Obs: [ 0.05   0.833 -0.131 -1.246] -> Action: 0 and reward: 1.0
Obs: [ 0.067  0.64  -0.156 -0.997] -> Action: 1 and reward: 1.0
Obs: [ 0.08   0.836 -0.176 -1.334] -> Action: 1 and reward: 1.0
Obs: [ 0.097  1.033 -0.202 -1.676] -> Action: 0 and reward: 1.0
Obs: [ 0.117  0.841 -0.236 -1.453] -> Action: 0 and reward: 1.0


Finally, we print the results and **close the environment**.

In [5]:
print("Episode finished after {} timesteps and reward was {} ".format(t, total_reward))
env.close()

Episode finished after 14 timesteps and reward was 14.0 
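
To see why this episode ended, we can compare the last two observations printed above with the termination limits. The sketch below assumes the limits given in the Gymnasium documentation for CartPole-v1 (cart position beyond ±2.4 or pole angle beyond ±12°); these values are not printed by the notebook itself, and `out_of_bounds` is a hypothetical helper name.

```python
import math

# Assumed termination limits (from the Gymnasium documentation, not from this notebook)
X_LIMIT = 2.4
THETA_LIMIT = math.radians(12)  # about 0.2094 rad

def out_of_bounds(obs):
    """Return True when the cart position or the pole angle exceeds its limit."""
    x, _, theta, _ = obs
    return abs(x) > X_LIMIT or abs(theta) > THETA_LIMIT

# Last two observations printed in the episode above
second_to_last = [0.097, 1.033, -0.202, -1.676]
last = [0.117, 0.841, -0.236, -1.453]

print(out_of_bounds(second_to_last))  # False: the angle is still within the limit
print(out_of_bounds(last))            # True: the pole has tilted too far
```

Under these assumptions, the episode ended because the pole angle left the allowed range, not because the cart reached the edge of the track.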


### 3. Simulating several episodes

The following code fragment repeats the process from the previous section for the number of episodes defined in the `num_episodes` variable. 

- The episodes can be rendered in an external window by creating the environment with `render_mode="human"` (the commented-out line in the code below).

<u>Notes</u>:
- Rendering with `render_mode="human"` opens an external window, so it does not work in **Google Colab**.

In [6]:
#env = gym.make("CartPole-v1", render_mode="human") # rendering
env = gym.make("CartPole-v1")

num_episodes = 10

for episode in range(num_episodes):
    # Environment reset
    obs, info = env.reset()
    t, total_reward, done = 0, 0, False
    
    print('Running episode {} '.format(episode+1))
    
    while not done:
    
        # Get random action (this is the implementation of the agent)
        action = env.action_space.sample()
    
        # Execute action and get response
        new_obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        print("Obs: {} -> Action: {} and reward: {}".format(np.round(obs, 3), action, reward))
    
        obs = new_obs
        total_reward += reward
        t += 1
        
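    # Print the final observation. The action and reward shown are those of the
    # last step, which the loop above has already counted.
    print("Obs: {} -> Action: {} and reward: {}".format(np.round(obs, 3), action, reward))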
    print("Episode {} finished after {} timesteps and reward was {} ".format(episode+1, t, total_reward))
    print('')
    
env.close()

Running episode 1 
Obs: [ 0.031 -0.016  0.042  0.029] -> Action: 1 and reward: 1.0
Obs: [ 0.03   0.179  0.043 -0.25 ] -> Action: 0 and reward: 1.0
Obs: [ 0.034 -0.017  0.038  0.056] -> Action: 0 and reward: 1.0
Obs: [ 0.034 -0.213  0.039  0.361] -> Action: 1 and reward: 1.0
Obs: [ 0.029 -0.018  0.046  0.081] -> Action: 0 and reward: 1.0
Obs: [ 0.029 -0.214  0.048  0.388] -> Action: 0 and reward: 1.0
Obs: [ 0.025 -0.41   0.056  0.695] -> Action: 1 and reward: 1.0
Obs: [ 0.017 -0.215  0.07   0.42 ] -> Action: 0 and reward: 1.0
Obs: [ 0.012 -0.411  0.078  0.734] -> Action: 0 and reward: 1.0
Obs: [ 0.004 -0.608  0.093  1.05 ] -> Action: 0 and reward: 1.0
Obs: [-0.008 -0.804  0.114  1.37 ] -> Action: 1 and reward: 1.0
Obs: [-0.024 -0.61   0.141  1.115] -> Action: 0 and reward: 1.0
Obs: [-0.036 -0.807  0.163  1.449] -> Action: 0 and reward: 1.0
Obs: [-0.053 -1.004  0.192  1.788] -> Action: 0 and reward: 1.0
Obs: [-0.073 -1.2    0.228  2.133] -> Action: 0 and reward: 1.0
Episode 1 finished af