### Getting Started with CartPole in Gymnasium (formerly OpenAI Gym)

This notebook provides a basic introduction to interacting with the classic 'CartPole-v1' environment using the Gymnasium library (the successor to OpenAI Gym). The goal in the CartPole environment is to balance a pole upright on a movable cart by applying forces (left or right) to the cart. An episode ends if the pole tilts too far, the cart moves off the track, or a maximum number of steps is reached.

#### October 25, 2022 update

- [Announcing The Farama Foundation](https://farama.org/Announcing-The-Farama-Foundation): the future of open source reinforcement learning.

- The Farama Foundation is a nonprofit created to house reinforcement learning libraries in a neutral body.

- The foundation released the [Gymnasium](https://github.com/Farama-Foundation/Gymnasium) library. Future maintenance of OpenAI Gym will take place there.

- Gym documentation is at [https://www.gymlibrary.dev/index.html](https://www.gymlibrary.dev/index.html)

#### Installing gymnasium

If you haven't installed Gymnasium yet, you can do so from the command prompt (or a terminal within your environment) using pip:

``pip install gymnasium``

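You might also need extras for rendering or for other environment families. For example, `pip install "gymnasium[classic_control]"` adds the dependencies used to render CartPole (the quotes keep shells such as zsh from interpreting the square brackets).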

#### Import Gymnasium Library

First, we import the necessary library.

In [5]:
# Import the Gymnasium library
import gymnasium as gym

In [6]:
# Optional: Check the installed version of Gymnasium
print(f"Gymnasium version: {gym.__version__}")

Gymnasium version: 1.1.1


#### Create the Cart Pole environment

The `gym.make()` function is used to instantiate an environment. We specify the environment ID, in this case, `'CartPole-v1'`. 

Once created, we can inspect its `observation_space` and `action_space` properties to understand the format of the states we'll receive and the actions we can take.

#### Create Cart Pole Environment and Inspect Spaces

In [7]:
# Create the CartPole environment
env = gym.make('CartPole-v1')

# Print information about the observation and action spaces
print(f"Observation Space: {env.observation_space}")
print(f"Action Space: {env.action_space}")

Observation Space: Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32)
Action Space: Discrete(2)


The `observation_space` is a `Box(4,)`, meaning it's a 4-dimensional continuous space. The values represent:

1.  **Cart Position:** How far the cart is from the center.
2.  **Cart Velocity:** How fast the cart is moving.
3.  **Pole Angle:** The angle of the pole with respect to the vertical position (0 is upright).
4.  **Pole Angular Velocity:** How fast the pole is rotating.

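The bounds shown (`low` and `high`) describe the range of values the space can represent, not the termination thresholds. An episode terminates as soon as the pole angle exceeds ±12 degrees (about 0.2095 rad) or the cart position exceeds ±2.4 units, which is half of the corresponding bound in each case. The two velocity components are unbounded.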
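To make the relationship between the printed bounds and the termination thresholds concrete, here is a minimal sketch in plain Python. The constant names and the `is_out_of_bounds` helper are hypothetical and not part of Gymnasium; the threshold values are the 12 degrees and 2.4 units quoted above.

```python
import math

# Termination thresholds quoted in the text (names are illustrative only).
ANGLE_LIMIT_RAD = math.radians(12)  # 12 degrees, roughly 0.2095 rad
POSITION_LIMIT = 2.4                # cart position units


def is_out_of_bounds(obs):
    """Return True if an observation violates either termination threshold."""
    cart_position, _, pole_angle, _ = obs
    return abs(cart_position) > POSITION_LIMIT or abs(pole_angle) > ANGLE_LIMIT_RAD


# The observation-space bounds are twice the termination thresholds.
print(f"Angle threshold: {ANGLE_LIMIT_RAD:.4f} rad, space bound: {2 * ANGLE_LIMIT_RAD:.4f} rad")
print(f"Position threshold: {POSITION_LIMIT}, space bound: {2 * POSITION_LIMIT}")
```

The doubled values, 0.4189 rad and 4.8, match the `Box` bounds printed by `env.observation_space`.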

The `action_space` is `Discrete(2)`, meaning there are two possible discrete actions:

- **0:** Push the cart to the left.
- **1:** Push the cart to the right.

The agent needs to learn a policy (a way to choose actions based on observations) to keep the pole balanced.
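As an illustration of what a policy looks like in code, the sketch below hand-writes a simple rule instead of learning one: push the cart toward the side the pole is leaning. The `lean_policy` name is hypothetical, and the rule assumes that a positive pole angle means the pole tips toward the right (action 1).

```python
def lean_policy(obs):
    """Map an observation to an action: push toward the side the pole leans.

    This is a hand-written baseline for illustration, not a learned policy.
    """
    _, _, pole_angle, _ = obs
    return 1 if pole_angle > 0 else 0


print(lean_policy([0.0, 0.0, 0.05, 0.0]))   # positive angle -> action 1 (push right)
print(lean_policy([0.0, 0.0, -0.05, 0.0]))  # negative angle -> action 0 (push left)
```

A learned policy has the same shape (observation in, action out); only the rule inside the function changes.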

#### Reset (start) the environment and get an Initial Observation

Before starting an interaction loop (an episode), we must `reset()` the environment. This function:
- Puts the environment into a valid starting state (e.g., cart near the center, pole nearly upright).
- Returns the initial `observation` corresponding to this starting state.
- Returns an `info` dictionary which might contain auxiliary information (often empty for simple environments at reset).

#### Reset Environment for New Episode

In [8]:
# Reset the environment to start a new episode
obs, info = env.reset()

print(f"Initial Observation: {obs}")
print(f"Initial Env info: {info}")

Initial Observation: [ 0.00447542 -0.02835283  0.04589985 -0.00298108]
Initial Env info: {}


The observation array returned contains the initial values for the cart position, cart velocity, pole angle, and pole angular velocity.

#### Actions, Rewards, and a New Observation

The core interaction loop involves taking an `action` in the environment using the `env.step(action)` method. This function simulates one time step in the environment based on the chosen action.

It returns five values:
- `observation`: The next state of the environment after taking the action.
- `reward`: A scalar value indicating the immediate reward received for the action taken in the previous state. In CartPole, this is typically +1 for every step the pole remains balanced.
- `terminated`: A boolean flag. `True` if the episode ended due to reaching a terminal state (e.g., pole fell over, cart went off-screen).
- `truncated`: A boolean flag. `True` if the episode ended due to an external condition (e.g., reaching a time limit, like 500 steps in CartPole-v1).
- `info`: A dictionary containing auxiliary diagnostic information (often empty).
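One way to keep the five return values straight is to give them names. The sketch below wraps the tuple in a `NamedTuple`; `StepResult` and its `done` property are hypothetical helpers, not part of the Gymnasium API, and the numbers are illustrative.

```python
from typing import Any, NamedTuple


class StepResult(NamedTuple):
    """Named view of the 5-tuple returned by env.step(action)."""
    observation: Any
    reward: float
    terminated: bool
    truncated: bool
    info: dict

    @property
    def done(self) -> bool:
        # The episode is over if either flag is set.
        return self.terminated or self.truncated


# Illustrative values in the same order that env.step() returns them.
result = StepResult([0.0039, -0.2241, 0.0458, 0.3038], 1.0, False, False, {})
print(result.reward, result.done)
```

With a real environment, the same wrapper could be built as `StepResult(*env.step(action))`.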

#### Take a Step in the Environment

In [9]:
# Perform an action (e.g., push left)
action = 0 
obs, reward, terminated, truncated, info = env.step(action)

print(f"Action Taken: {action}")
print(f"New Observation: {obs}")
print(f"Reward received: {reward}")
print(f"Episode terminated: {terminated}")
print(f"Episode truncated: {truncated}")
print(f"Step Info dictionary: {info}")

Action Taken: 0
New Observation: [ 0.00390836 -0.22410198  0.04584023  0.3038232 ]
Reward received: 1.0
Episode terminated: False
Episode truncated: False
Step Info dictionary: {}


#### Running a Simple Loop

Let's run a short loop taking random actions to see how the episode progresses.

In [10]:
# Example: Take a few random steps
print("\n--- Running a short episode with random actions ---")
obs, info = env.reset() # Start a new episode
total_reward = 0
for step_num in range(10): # Run for a maximum of 10 steps
    random_action = env.action_space.sample() # Sample a random action (0 or 1)
    print(f"\nStep: {step_num + 1}")
    print(f"Taking action: {random_action}")
    obs, reward, terminated, truncated, info = env.step(random_action)
    total_reward += reward
    print(f"Observation: {obs}")
    print(f"Reward: {reward}")
    print(f"Terminated: {terminated}, Truncated: {truncated}")
    
    # Check if the episode has ended
    if terminated or truncated:
        print(f"\nEpisode finished after {step_num + 1} steps.")
        break

print(f"\nTotal reward for this short random episode: {total_reward}")


--- Running a short episode with random actions ---

Step: 1
Taking action: 0
Observation: [ 0.04135794 -0.20590533  0.01851772  0.3429478 ]
Reward: 1.0
Terminated: False, Truncated: False

Step: 2
Taking action: 1
Observation: [ 0.03723983 -0.01105165  0.02537668  0.0561613 ]
Reward: 1.0
Terminated: False, Truncated: False

Step: 3
Taking action: 0
Observation: [ 0.0370188  -0.2065281   0.0264999   0.35674152]
Reward: 1.0
Terminated: False, Truncated: False

Step: 4
Taking action: 0
Observation: [ 0.03288824 -0.40201658  0.03363473  0.6576613 ]
Reward: 1.0
Terminated: False, Truncated: False

Step: 5
Taking action: 1
Observation: [ 0.0248479  -0.20737855  0.04678796  0.37575617]
Reward: 1.0
Terminated: False, Truncated: False

Step: 6
Taking action: 0
Observation: [ 0.02070033 -0.40313277  0.05430308  0.6828168 ]
Reward: 1.0
Terminated: False, Truncated: False

Step: 7
Taking action: 0
Observation: [ 0.01263768 -0.59896505  0.06795942  0.9920895 ]
Reward: 1.0
Terminated: False, Truncated: False

[remaining output truncated]
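The loop above can also be packaged as a reusable function. The following is a minimal sketch, assuming only that `env` follows the Gymnasium `reset`/`step` API; `run_episode` and its `policy` argument (a callable mapping an observation to an action) are hypothetical names. Passing `seed` through to `env.reset(seed=...)` makes the starting state reproducible.

```python
def run_episode(env, policy, max_steps=500, seed=None):
    """Run one episode and return (total_reward, steps_taken).

    `env` is any object with Gymnasium-style reset() and step() methods;
    `policy` is a callable that maps an observation to an action.
    """
    obs, info = env.reset(seed=seed)
    total_reward = 0.0
    steps = 0
    for steps in range(1, max_steps + 1):
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total_reward += reward
        # Stop as soon as the environment signals the end of the episode.
        if terminated or truncated:
            break
    return total_reward, steps
```

With the environment created earlier, `run_episode(env, lambda obs: env.action_space.sample(), max_steps=10)` would reproduce the random ten-step loop without the per-step printing.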

#### Closing the Environment

It's good practice to close the environment when you're done with it, especially if it involves graphical rendering or other resources.

In [11]:
# Close the environment
env.close()
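If code between creating and closing the environment might raise an exception, the close call can be guaranteed with `contextlib.closing` from the standard library, which works with any object that has a `close()` method. This is a sketch under that assumption; `sample_initial_observations` and the `make_env` factory are hypothetical names.

```python
from contextlib import closing


def sample_initial_observations(make_env, n_resets=3):
    """Collect initial observations, closing the environment even on error.

    `make_env` is a zero-argument factory returning an object with
    Gymnasium-style reset() and close() methods.
    """
    with closing(make_env()) as env:
        # reset() returns (observation, info); keep only the observation.
        return [env.reset()[0] for _ in range(n_resets)]
```

In this notebook the factory could be `lambda: gym.make('CartPole-v1')`.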

### Footnotes

[^1]: In earlier versions of OpenAI Gym, ``env.reset()`` had a different signature and, by default, returned only the observation. The current Gymnasium API returns `observation, info`.

[^2]: Similarly, the ``env.step(action)`` function in earlier versions returned `observation, reward, done, info`. The current Gymnasium API returns `observation, reward, terminated, truncated, info`, splitting the old `done` flag into `terminated` (environment-specific end) and `truncated` (external condition like time limit).