# About the Environment (CartPole-v1)

## Description

This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson in “Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems”. A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pole starts upright on the cart, and the goal is to keep it balanced by pushing the cart to the left or to the right.

## Action Space

The action is an ndarray with shape (1,) that takes a value in {0, 1}, indicating the direction of the fixed force applied to the cart.

- 0: Push cart to the left
- 1: Push cart to the right

Note: The change in velocity produced by the applied force is not fixed; it depends on the angle of the pole. The position of the pole's center of gravity changes the amount of energy needed to move the cart underneath it.
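
To make the note concrete, here is a minimal sketch of the cart acceleration for a single push. The constants and equations are assumed to match Gym's reference CartPole implementation (cart mass 1.0, pole mass 0.1, half pole length 0.5, force 10.0, time step 0.02); the function name `cart_acceleration` is hypothetical and not part of any library.

```python
import math

# Constants assumed to match Gym's reference CartPole implementation.
GRAVITY = 9.8
MASS_CART = 1.0
MASS_POLE = 0.1
TOTAL_MASS = MASS_CART + MASS_POLE
HALF_POLE_LENGTH = 0.5
POLE_MASS_LENGTH = MASS_POLE * HALF_POLE_LENGTH
FORCE_MAG = 10.0
TAU = 0.02  # seconds between state updates


def cart_acceleration(action, theta, theta_dot=0.0):
    """Cart acceleration for action 0 (left) or 1 (right) at pole angle theta (rad)."""
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + POLE_MASS_LENGTH * theta_dot ** 2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_POLE_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / TOTAL_MASS)
    )
    return temp - POLE_MASS_LENGTH * theta_acc * cos_t / TOTAL_MASS


# Same push to the right, three different pole angles.
for theta in (-0.2, 0.0, 0.2):
    dv = TAU * cart_acceleration(1, theta)
    print(f"theta={theta:+.1f} rad -> velocity change per step: {dv:.4f}")
```

The force is identical in all three cases, yet the velocity change per step differs because the pole's angle enters the equations of motion.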

## Observation Space

The observation is an ndarray with shape (4,) whose values correspond to the following positions and velocities:

| Num | Observation           | Min                  | Max                  |
|-----|-----------------------|----------------------|----------------------|
| 0   | Cart Position         | -4.8                 | 4.8                  |
| 1   | Cart Velocity         | -Inf                 | Inf                  |
| 2   | Pole Angle            | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°)   |
| 3   | Pole Angular Velocity | -Inf                 | Inf                  |

**Note:** The ranges above give the possible values of each element of the observation space, but they do not reflect the values the state can take in an unterminated episode. In particular:

- The cart x-position (index 0) can take values between (-4.8, 4.8), but the episode terminates if the cart leaves the (-2.4, 2.4) range.
- The pole angle can be observed between (-0.418, 0.418) radians (or ±24°), but the episode terminates if the pole angle is not in the range (-0.2095, 0.2095) (or ±12°).
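
The relationship between the two ranges can be sketched in a few lines. This assumes, as in Gym's reference implementation, that the observation bounds are simply twice the termination thresholds, so that a failing observation still lies inside the observation space. The names `OBS_HIGH` and `OBS_LOW` are illustrative.

```python
import math

X_THRESHOLD = 2.4                          # cart position limit before termination
THETA_THRESHOLD = 12 * 2 * math.pi / 360   # 12 degrees expressed in radians

# Assumption: observation bounds are twice the termination thresholds.
OBS_HIGH = (2 * X_THRESHOLD, math.inf, 2 * THETA_THRESHOLD, math.inf)
OBS_LOW = tuple(-h for h in OBS_HIGH)

print(f"termination angle: {THETA_THRESHOLD:.4f} rad")
print(f"observable angle:  {OBS_HIGH[2]:.4f} rad ({math.degrees(OBS_HIGH[2]):.0f} deg)")
```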


## Rewards

Since the goal is to keep the pole upright for as long as possible, a reward of +1 is given for every step taken, including the termination step. The reward threshold is 500 for v1 and 200 for v0.

## Starting State
All observations are assigned a uniformly random value in (-0.05, 0.05).

## Episode End
The episode ends if any one of the following occurs:

1. Termination: Pole angle falls outside ±12°
2. Termination: Cart position falls outside ±2.4 (center of the cart reaches the edge of the display)
3. Truncation: Episode length reaches 500 steps (200 for v0)
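
The three conditions can be written as a small check. This is a sketch only: `episode_status` is a hypothetical helper, not a Gym function, and it assumes the time limit triggers once the step count reaches the maximum. Because every step earns +1, the episode return equals the number of steps taken.

```python
import math

X_THRESHOLD = 2.4                    # cart position limit
THETA_THRESHOLD = math.radians(12)   # pole angle limit
MAX_STEPS = 500                      # time limit for v1 (200 for v0)


def episode_status(x, theta, steps_taken, max_steps=MAX_STEPS):
    """Return (terminated, truncated) for the state reached after steps_taken steps."""
    terminated = abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD
    # Assumption: truncation happens as soon as the step count reaches max_steps.
    truncated = steps_taken >= max_steps
    return terminated, truncated


print(episode_status(x=0.1, theta=0.25, steps_taken=37))    # pole has fallen too far
print(episode_status(x=0.1, theta=0.01, steps_taken=500))   # time limit reached
print(episode_status(x=0.1, theta=0.01, steps_taken=120))   # episode still running
```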

# Loading Libraries 

In [2]:
import numpy as np
import os
import gym 
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

# Loading Environment

In [28]:
environment_name = "CartPole-v1"
env = gym.make(environment_name, render_mode="human")

# Testing with random sample

In [18]:
env.reset()

(array([-0.02001622, -0.03263483,  0.03480381, -0.04523245], dtype=float32),
 {})

In [27]:
env.close()

In [5]:
episodes = 5
for episode in range(1, episodes+1):
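    state, info = env.reset()  # Gym >= 0.26: reset() returns (observation, info)
    done = False
    score = 0 

    while not done:
        env.render()
        action = env.action_space.sample()  # random action: 0 (left) or 1 (right)
        # Gym >= 0.26: step() returns five values, with separate termination and truncation flags
        n_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated  # stop on failure or when the time limit is hit
        score += reward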
    print('Episode:{} Score:{}'.format(episode, score))
env.close()

  if not isinstance(terminated, (bool, np.bool8)):


Episode:1 Score:15.0
Episode:2 Score:86.0
Episode:3 Score:29.0
Episode:4 Score:28.0
Episode:5 Score:15.0


# Setup Callback

In [7]:
save_path = os.path.join('Training', 'Saved Models')
log_path = os.path.join('Training', 'Logs')
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=490, verbose=1)
eval_callback = EvalCallback(env, 
                             callback_on_new_best=stop_callback, 
                             eval_freq=10000, 
                             best_model_save_path=save_path, 
                             verbose=1)



# Training RL Model (PPO)

In [8]:
env = DummyVecEnv([lambda: env])
model = PPO('MlpPolicy', env, verbose = 1, tensorboard_log=log_path)

Using cpu device


In [9]:
model.learn(total_timesteps=20000, callback=eval_callback)

Logging to Training\Logs\PPO_2


  if not isinstance(terminated, (bool, np.bool8)):


-----------------------------
| time/              |      |
|    fps             | 46   |
|    iterations      | 1    |
|    time_elapsed    | 43   |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 45          |
|    iterations           | 2           |
|    time_elapsed         | 89          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.007474929 |
|    clip_fraction        | 0.0737      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.687      |
|    explained_variance   | -0.00524    |
|    learning_rate        | 0.0003      |
|    loss                 | 10.1        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0106     |
|    value_loss           | 55          |
-----------------------------------------
----------------------------------



Eval num_timesteps=10000, episode_reward=250.40 +/- 105.07
Episode length: 250.40 +/- 105.07
------------------------------------------
| eval/                   |              |
|    mean_ep_length       | 250          |
|    mean_reward          | 250          |
| time/                   |              |
|    total_timesteps      | 10000        |
| train/                  |              |
|    approx_kl            | 0.0074244626 |
|    clip_fraction        | 0.0787       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.617       |
|    explained_variance   | 0.208        |
|    learning_rate        | 0.0003       |
|    loss                 | 24.7         |
|    n_updates            | 40           |
|    policy_gradient_loss | -0.0198      |
|    value_loss           | 75.7         |
------------------------------------------
New best mean reward!
------------------------------
| time/              |       |
|    fps             | 40    |
|    iterations   

<stable_baselines3.ppo.ppo.PPO at 0x1a97b8d3590>

# Saving the model

In [24]:
ppo_path = os.path.join("Training","Saved_models","PPO_cartpole_model")
model.save(ppo_path)

# Loading the model

In [25]:
ppo_path = os.path.join("Training","Saved_models","PPO_cartpole_model")
model = PPO.load(ppo_path,env=env)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




# Testing and Evaluation

In [26]:
evaluate_policy(model, env, n_eval_episodes=10, render=True)

(496.0, 12.0)
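
The pair returned by `evaluate_policy` is the mean and the standard deviation of the per-episode returns. The sketch below reproduces that calculation on made-up numbers; the `episode_returns` list is hypothetical and is not the data behind the result above.

```python
import statistics

# Hypothetical per-episode returns, invented for illustration only.
episode_returns = [500.0, 500.0, 480.0, 500.0]

mean_reward = statistics.fmean(episode_returns)
std_reward = statistics.pstdev(episode_returns)  # population standard deviation

print(f"mean_reward={mean_reward:.1f}, std_reward={std_reward:.2f}")

# The stop callback defined earlier uses a threshold of 490 on the mean reward.
print("meets threshold:", mean_reward >= 490)
```

A high mean with a small standard deviation indicates that the policy balances the pole for close to the full 500 steps in nearly every episode.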

In [30]:
model.predict((0,0,0,0))

array(1, dtype=int64)

In [23]:
episodes = 5
for episode in range(1, episodes+1):
    obs = env.reset()
    done = False
    score = 0 

    while not done:
        env.render()
        action,_ = model.predict(obs)
        obs, reward, done, info = env.step(action)
        score+=reward
    print('Episode:{} Score:{}'.format(episode, score))
env.close()

ValueError: You have passed a tuple to the predict() function instead of a Numpy array or a Dict. You are probably mixing Gym API with SB3 VecEnv API: `obs, info = env.reset()` (Gym) vs `obs = vec_env.reset()` (SB3 VecEnv). See related issue https://github.com/DLR-RM/stable-baselines3/issues/1694 and documentation for more information: https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecenv-api-vs-gym-api
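
The error above appears when `env` is a plain Gym environment, whose `reset()` returns an `(obs, info)` tuple, while the loop is written for the SB3 VecEnv API. One possible way around it, sketched here, is to step the vectorized environment attached to the model via `model.get_env()`. The helper name `run_episodes` is hypothetical, rendering is left out, and a single-environment VecEnv is assumed.

```python
def run_episodes(model, n_episodes=5, deterministic=True):
    """Play episodes with the VecEnv attached to an SB3 model and return the scores."""
    vec_env = model.get_env()
    if vec_env is None:
        raise ValueError("model has no environment attached; pass env= when loading it")

    scores = []
    for _ in range(n_episodes):
        obs = vec_env.reset()  # VecEnv API: reset() returns observations only
        done, score = False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=deterministic)
            # VecEnv API: step() returns four batched values
            obs, rewards, dones, infos = vec_env.step(action)
            score += float(rewards[0])  # index 0: the only environment in the batch
            done = bool(dones[0])
        scores.append(score)
    return scores
```

With the model loaded earlier, usage would look like `scores = run_episodes(model)`, followed by printing each score as in the original loop.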

# Viewing Logs in Tensorboard

In [None]:
training_log_path = os.path.join(log_path, 'PPO_5')
training_log_path

In [None]:
# run this from the command line
!tensorboard --logdir={training_log_path} 