### Introduction to CartPole Episodes

This markdown cell provides a high-level overview of what constitutes an "episode" in the context of the CartPole environment within OpenAI Gym (now Gymnasium). It explains the key concepts: reset, observation, action, step, reward, policy, and value.

### Episodes of the Cart Pole Game



An episode begins by calling the ``reset`` function. This returns the **observation** values associated with the initial state of the Markov decision process (MDP).



At each state, an **action** must be chosen that is then sent to the ``step`` function. The step function returns a **new observation** and the **reward** that resulted from that action.



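The process of selecting an action that results in a new observation and a reward is repeated until the episode ends. The step function also returns two flags: ``terminated``, which indicates that the MDP has reached a terminal state, and ``truncated``, which indicates that the episode was cut off by a time limit.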



An **episode** of the CartPole MDP starts by calling the reset function and then repeatedly calling the step function until the episode ends. The rule used to choose the actions is called the **policy**, and the sum of the rewards received over the episode is called the **value** of the policy for that episode.

### Importing Libraries and Creating the Environment

This code cell imports the necessary library (`gymnasium`) and creates an instance of the CartPole environment (`CartPole-v1`). This environment object (`env`) will be used to interact with the simulation.

In [None]:
#import Gymnasium (the maintained successor to OpenAI Gym)
import gymnasium as gym
#create CartPole environment
env = gym.make('CartPole-v1')

### Starting an Episode and Observing the Initial State

This markdown cell explains the purpose of the `reset()` function, which initializes the environment to a starting state and returns the first observation. It also notes the inherent randomness in the starting state.

##### Starting Cart Pole

- Start an episode of Cart Pole using the reset() function

- Print the observed values. Note that these values have a random component and will not be the same every time you reset the environment.

### Resetting the Environment and Printing the Initial Observation

This code cell demonstrates how to start (or reset) an episode using `env.reset()`. It captures the initial observation and prints it. The `_` is used to ignore the secondary return value (info dictionary) from `reset()`.

In [None]:
#start an episode 
obs,_ = env.reset()
print(f"Observation: {obs}")

Observation: [-0.04027831  0.01710224 -0.03783857 -0.03566743]


### Understanding Actions in CartPole

This markdown cell describes how actions are selected, specifically mentioning the `sample()` method for choosing a random action from the available action space (0 or 1 for CartPole).

##### Selecting actions

- The sample() function returns a random action from the action space  

- For Cart Pole, the actions are 0 and 1



The code in the next cell prints a small set of actions from the sample function to show the different values.

### Sampling Random Actions

This code cell demonstrates the use of `env.action_space.sample()` to get random actions. It runs a loop to show that the sampled actions are indeed random (either 0 or 1).

In [None]:
# sample() returns a random action
for i in range(8):
    action = env.action_space.sample()
    print(f"Action {i}: {action}")

Action 0: 0
Action 1: 1
Action 2: 0
Action 3: 1
Action 4: 1
Action 5: 0
Action 6: 0
Action 7: 1


### Running a Full Episode with Random Actions

This markdown cell outlines the process of running a complete episode. It involves resetting the environment and then looping through steps (taking random actions) until the episode ends (the `terminated` or `truncated` flag becomes True). It details what information is obtained and printed at each step.

##### An episode of Cart Pole

An episode of Cart Pole starts with the reset function. Then, while the episode has not ended, an action is selected and passed to the step function, which executes the step. The step function returns a new observation, a reward, and two boolean flags, terminated and truncated, that indicate whether the episode has ended.

- Run an episode of Cart Pole by taking a random action at each step of the episode  

- Reset to start an episode

- While not done

    - Get a random action

    - Perform the action and record results

    - Print 

        - step number 

        - action taken 

        - new observation 

        - reward

        - terminated flag

        - truncated flag  

- The episode ends when the terminated flag or the truncated flag is True



The cell below runs an episode of the Cart Pole MDP. Actions are selected at random. Run the cell a number of times to observe that the number of steps in the episode varies. Note that in the final step the **terminated** flag has been set to True.

### Code for Running a Single Episode

This code cell implements the logic described above. It resets the environment, then enters a `while` loop that continues as long as the episode is neither `terminated` nor `truncated`. Inside the loop, it samples a random action, takes a step in the environment using `env.step(action)`, updates the total reward, and prints the details of the step.

In [None]:
obs,_ = env.reset()
print(f"Initial Observation: {obs}")
i = 0 # counts the number of steps in the episode
sum_reward = 0.0 # sums the rewards
terminated = False
truncated = False # Initialize truncated flag

while not terminated and not truncated: # Check for truncated as well
    action = env.action_space.sample()
    obs,reward,terminated,truncated,_ = env.step(action)
    i += 1
    sum_reward += reward
    print(f"Step: {i}, Action: {action}, New Observation: {obs}, Reward: {reward}")
    print(f"   --- Sum of Rewards: {sum_reward}, Terminated: {terminated}, Truncated: {truncated} ")

Initial Observation: [ 0.02208824  0.00818464 -0.02375799  0.03528415]
Step: 1, Action: 0, New Observation: [ 0.02225194 -0.1865887  -0.02305231  0.32037753], Reward: 1.0
   --- Sum of Rewards: 1.0, Terminated: False, Truncated: False 
Step: 2, Action: 0, New Observation: [ 0.01852016 -0.3813749  -0.01664476  0.6057024 ], Reward: 1.0
   --- Sum of Rewards: 2.0, Terminated: False, Truncated: False 
Step: 3, Action: 0, New Observation: [ 0.01089266 -0.5762602  -0.00453071  0.8930965 ], Reward: 1.0
   --- Sum of Rewards: 3.0, Terminated: False, Truncated: False 
Step: 4, Action: 1, New Observation: [-0.00063254 -0.38107708  0.01333122  0.5989928 ], Reward: 1.0
   --- Sum of Rewards: 4.0, Terminated: False, Truncated: False 
Step: 5, Action: 0, New Observation: [-0.00825408 -0.576383    0.02531108  0.89584494], Reward: 1.0
   --- Sum of Rewards: 5.0, Terminated: False, Truncated: False 
Step: 6, Action: 0, New Observation: [-0.01978174 -0.7718388   0.04322797  1.1963754 ], Reward: 1.0
   --- Sum

### Behavior After Episode Termination

This markdown cell explains what happens if you try to take another step *after* an episode has already terminated. It highlights that the `terminated` flag remains `True` and no further reward is accumulated.

If you run another step of the MDP from the terminal state, you will get a warning that the behavior of the step function is undefined. However, note that:

- **terminated** remains True

- the **reward** is zero, and consequently

- the **sum of the rewards** does not increase 

### Demonstrating a Step After Termination

This code cell explicitly takes one more step after the previous episode likely terminated. It shows the output, confirming the points made in the preceding markdown cell (terminated=True, reward=0).

In [None]:
action = env.action_space.sample()
obs,reward,terminated,truncated,_ = env.step(action)
i += 1
sum_reward += reward
print(f"Step: {i}, Action: {action}, New Observation: {obs}, Reward: {reward}")
print(f"   --- Sum of Rewards: {sum_reward}, Terminated: {terminated}, Truncated: {truncated} ")

Step: 18, Action: 1, New Observation: [-0.14431769 -0.79173946  0.27745247  1.6922164 ], Reward: 0.0
   --- Sum of Rewards: 17.0, Terminated: True, Truncated: False 


  logger.warn(


### Extending the Example: Running Multiple Episodes

This markdown cell introduces the next code block, which extends the single-episode example. The goal is to run multiple episodes using the random action policy and calculate the average number of steps (episode length) achieved over these runs. This gives a basic measure of the performance of the random policy.

### Code for Multiple Episodes and Averaging Length

This code cell runs a specified number of episodes (`num_episodes`). For each episode, it follows the same logic as the single-episode example (reset, loop steps with random actions until termination or truncation). It records the length (number of steps) of each episode. After all episodes are run, it calculates and prints the average episode length. Finally, it closes the environment using `env.close()` to free up resources.

In [9]:
# Run multiple episodes and calculate average length
num_episodes = 10
episode_lengths = []

print(f"\nRunning {num_episodes} episodes with random actions...")

for episode in range(num_episodes):
    obs, _ = env.reset()
    terminated = False
    truncated = False
    steps = 0
    # It's good practice to also check for truncation, 
    # which signifies reaching a time limit even if not 'failed'.
    while not terminated and not truncated: 
        action = env.action_space.sample() # Random action
        obs, reward, terminated, truncated, info = env.step(action)
        steps += 1
    episode_lengths.append(steps)
    print(f"Episode {episode + 1} finished after {steps} steps.")

if episode_lengths: # Avoid division by zero if num_episodes was 0
    average_length = sum(episode_lengths) / num_episodes
    print(f"\nAverage episode length over {num_episodes} episodes: {average_length:.2f}")
else:
    print("\nNo episodes were run.")

env.close() # Close the environment when done


Running 10 episodes with random actions...
Episode 1 finished after 28 steps.
Episode 2 finished after 21 steps.
Episode 3 finished after 23 steps.
Episode 4 finished after 26 steps.
Episode 5 finished after 26 steps.
Episode 6 finished after 27 steps.
Episode 7 finished after 15 steps.
Episode 8 finished after 16 steps.
Episode 9 finished after 73 steps.
Episode 10 finished after 12 steps.

Average episode length over 10 episodes: 26.70
