# M4 - Reinforcement Learning

## Example of the information from CartPole environment

Below we will see a simple example that will allow us to understand the concepts introduced in this section.   

First, we import the library and load the environment (cells 1 and 2).

In [1]:
import gymnasium as gym

The following code creates an environment based on CartPole-v1, which we will use to inspect the type and range of the data involved.

In [2]:
e = gym.make('CartPole-v1')

To initialize the environment we call the `reset()` method. In addition to starting the environment, it returns a tuple with an observation and a dictionary of extra information (empty here). The observation is an __array__ of four floating-point values, which contains information about:
- the $x$ coordinate of the cart (the platform),
- the velocity of the cart,
- the angle of the pole (the stick) with respect to the vertical, and
- the angular velocity of the pole.

In [3]:
e.reset()

(array([ 0.00600007,  0.04271926, -0.01585362, -0.01345813], dtype=float32),
 {})

We can see that the **type** of the action space is `Discrete(2)`, which indicates that every action must be the value 0 or 1, where:
- 0 means pushing the platform to the left, and
- 1 means pushing the platform to the right.

In [4]:
e.action_space

Discrete(2)

The `action_space.sample()` method returns a random valid action for this environment.

In [5]:
e.action_space.sample()

np.int64(1)

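The observation space is a vector of four components. The position and the angle are bounded ($\pm 4.8$ and $\pm 0.41887903$ rad, about $\pm 24^\circ$), while the two velocities lie in the range ($-\infty, \infty$).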

In [6]:
e.observation_space

Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32)

Below we see a random example provided by the `observation_space.sample()` method.

In [7]:
e.observation_space.sample()

array([ 2.1233292 ,  0.22612087,  0.02973259, -0.08213806], dtype=float32)

Finally, we indicate the action that the agent will perform by passing it to the `step()` method.

The response of the environment includes five values:
- A four-position vector with the result of the **new observation** of the environment.
- The **reward**, which in this case is 1.0.
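- The **terminated** indicator, which is _True_ when the episode reaches a terminal state (here, the pole falls or the cart leaves the track) and in this case is _False_.
- The **truncated** indicator, which is _True_ when the episode is cut off from outside, for example by a time limit, and in this case is _False_.
- **Extra information** about the environment, which in this case is an empty dictionary.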

In [8]:
e.step(0)

(array([ 0.00685445, -0.15217179, -0.01612279,  0.27418092], dtype=float32),
 1.0,
 False,
 False,
 {})