# Stable-Baselines3 practice lab: CartPole-v1

The code snippets are taken from  
- the official documentation of [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) and  
- [Reinforcement Learning in Python with Stable Baselines 3](https://pythonprogramming.net/introduction-reinforcement-learning-stable-baselines-3-tutorial/).

## Create and train the model A2C

In [1]:
# imports
import gymnasium as gym
from stable_baselines3 import A2C
import os



In [4]:
# prepare output folders of model snapshots and logs for tensorboard
models_dir, logs_dir = "models/cartpole/A2C", "logs/cartpole"

# exist_ok=True makes these calls safe to repeat when the folders already exist
os.makedirs(models_dir, exist_ok=True)
os.makedirs(logs_dir, exist_ok=True)

In [6]:
# create the cart pole environment
env = gym.make("CartPole-v1", render_mode="rgb_array")
env.reset()

(array([-0.01515297,  0.01392985, -0.0105844 ,  0.01107144], dtype=float32),
 {})

In [7]:
# create the model of choice
model = A2C("MlpPolicy", env, verbose=1, tensorboard_log=logs_dir)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




In [None]:
# train the model in rounds, save a snapshot after each round and write the logs along the way
TIMESTEPS = 1e4 # timesteps per training round (a float, so snapshot names end in ".0")
episodes = 20 # number of training rounds (not environment episodes)
for episode in range(episodes):
    # reset_num_timesteps=False keeps one continuous timestep count across rounds
    model.learn(total_timesteps=int(TIMESTEPS), reset_num_timesteps=False, tb_log_name="A2C")
    model.save(f"{models_dir}/{TIMESTEPS*(episode + 1)}")
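
Because `TIMESTEPS` is the float `1e4`, the f-string passed to `model.save` yields snapshot names such as `10000.0`, which is why later cells load `160000.0`. The following sketch only reproduces the naming arithmetic of the loop above without training anything; `snapshot_paths` is a hypothetical helper introduced for illustration and is not part of Stable-Baselines3.

```python
def snapshot_paths(models_dir, timesteps_per_round, rounds):
    """Return the paths the training loop passes to model.save(), one per round."""
    return [f"{models_dir}/{timesteps_per_round * (i + 1)}" for i in range(rounds)]


# same settings as the training cell: 20 rounds of 1e4 timesteps each
paths = snapshot_paths("models/cartpole/A2C", 1e4, 20)
print(paths[0])   # snapshot written after the first round
print(paths[15])  # snapshot written after round 16, i.e. 160k total timesteps
```

The snapshot after round `n` therefore corresponds to `n * 10,000` total timesteps.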

While the model trains, we can watch the results over time by
- opening a new terminal
- invoking `tensorboard --logdir=logs/cartpole`
- opening http://localhost:6006/ in the browser (check the terminal output for the actual port on your system)

To watch progress in near real-time, switch on the TensorBoard setting "Reload data" (look for the gear icon in TensorBoard's main menu bar).

To see to what extent the GPU is used, type `nvidia-smi` into a new terminal.

TensorBoard shows that the training maxed out between training rounds 14 and 16
(`episodes` in the loop above), i.e. between 140k and 160k total timesteps.

## Load and evaluate a previously saved snapshot (A2C)

Some lines of code are repeats of code from previous sections of the notebook. I did this intentionally so that this section can be executed without executing any cell from previous sections.

In [None]:
# imports and path settings
import gymnasium as gym
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

models_dir, logs_dir = "models/cartpole/A2C", "logs/cartpole"

In [None]:
# create the environment, then load and evaluate a previously saved snapshot
env = gym.make("CartPole-v1", render_mode="rgb_array")
model = A2C.load(f"{models_dir}/160000.0", env=env)
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"The mean reward is {mean_reward} with a standard deviation of {std_reward}")

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




The mean reward is 500.0 with a standard deviation of 0.0
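
The two numbers printed above summarize the per-episode returns of the 10 evaluation episodes. CartPole-v1 gives a reward of 1 per step and cuts an episode off after 500 steps, so 500.0 is the best possible mean, and a standard deviation of 0.0 means every episode reached that limit. As a minimal sketch of the summary step only, assuming the population standard deviation (as far as I know, `evaluate_policy` uses NumPy's default `np.std`), with `summarize_returns` being a hypothetical helper:

```python
from statistics import fmean, pstdev


def summarize_returns(episode_returns):
    """Return (mean, population standard deviation) of per-episode returns."""
    return fmean(episode_returns), pstdev(episode_returns)


# ten episodes that all ran to the 500-step limit, as in the output above
perfect_mean, perfect_std = summarize_returns([500.0] * 10)
print(f"The mean reward is {perfect_mean} with a standard deviation of {perfect_std}")

# illustrative, made-up returns of a less consistent agent
uneven_mean, uneven_std = summarize_returns([500.0, 480.0, 500.0, 460.0])
print(f"The mean reward is {uneven_mean} with a standard deviation of {uneven_std:.2f}")
```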


## Enjoy the agent (A2C)

Some lines of code are repeats of code from previous sections of the notebook. I did this intentionally so that this section can be executed without executing any cell from previous sections.

In [None]:
# imports and path settings
import gymnasium as gym
from stable_baselines3 import A2C

models_dir, logs_dir = "models/cartpole/A2C", "logs/cartpole"

In [5]:
# recreate env and load a formerly saved snapshot and render animations from it
env = gym.make("CartPole-v1", render_mode="rgb_array")
env.reset()

model = A2C.load(f"{models_dir}/160000.0", env=env)

vec_env = model.get_env()
obs = vec_env.reset()

episodes = 10

for episode in range(episodes):
    # VecEnv resets automatically but one could optionally reset it here
    # obs = vec_env.reset()
    done = False
    total_reward_episode = 0
    while not done:
        action, _state = model.predict(obs, deterministic=True)
        # the VecEnv returns arrays with one entry per environment (here: one)
        obs, reward, done, info = vec_env.step(action)
        vec_env.render("human")
        # add the actual reward; CartPole-v1 gives +1 per step
        total_reward_episode += int(reward[0])

    print(f"Total reward in episode {episode + 1} was {total_reward_episode}")

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




Total reward in episode 1 was 500
Total reward in episode 2 was 500
Total reward in episode 3 was 500
Total reward in episode 4 was 500
Total reward in episode 5 was 500
Total reward in episode 6 was 500
Total reward in episode 7 was 500
Total reward in episode 8 was 500
Total reward in episode 9 was 500
Total reward in episode 10 was 500


## Now, just for fun, create and train the model PPO

Some lines of code are repeats of code from previous sections of the notebook. I did this intentionally so that this section can be executed without executing any cell from previous sections.

In [5]:
import gymnasium as gym
from stable_baselines3 import PPO
import os

In [6]:
# prepare output folders of model snapshots and logs for tensorboard
models_dir, logs_dir = "models/cartpole/PPO", "logs/cartpole"

# exist_ok=True makes these calls safe to repeat when the folders already exist
os.makedirs(models_dir, exist_ok=True)
os.makedirs(logs_dir, exist_ok=True)

In [7]:
# create the cart pole environment
env = gym.make("CartPole-v1", render_mode="rgb_array")
env.reset()

(array([ 0.00904992,  0.01734696, -0.02579219,  0.03996974], dtype=float32),
 {})

In [8]:
# create the model of choice
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log=logs_dir)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




In [None]:
# train the model in rounds, save a snapshot after each round and write the logs along the way
TIMESTEPS = 1e4 # timesteps per training round (a float, so snapshot names end in ".0")
episodes = 20 # number of training rounds (not environment episodes)
for episode in range(episodes):
    # reset_num_timesteps=False keeps one continuous timestep count across rounds
    model.learn(total_timesteps=int(TIMESTEPS), reset_num_timesteps=False, tb_log_name="PPO")
    model.save(f"{models_dir}/{TIMESTEPS*(episode + 1)}")

## Load and evaluate a previously saved snapshot (PPO)

Some lines of code are repeats of code from previous sections of the notebook. I did this intentionally so that this section can be executed without executing any cell from previous sections.

In [None]:
# imports and path settings
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

models_dir, logs_dir = "models/cartpole/PPO", "logs/cartpole"

In [10]:
# create environment, load and evaluate model @ a saved snapshot
env = gym.make("CartPole-v1", render_mode="rgb_array")
env.reset()

# to evaluate another snapshot explore the folder models_dir for saved snapshots
model = PPO.load(f"{models_dir}/160000.0", env=env)

mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"The mean reward is {mean_reward} with a standard deviation of {std_reward}")

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




The mean reward is 500.0 with a standard deviation of 0.0
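
To pick another snapshot, it helps to list the files in `models_dir` in training order. A plain alphabetical listing would place `160000.0` before `20000.0`, so the sketch below sorts by the numeric value of the file name. `list_snapshots` is a hypothetical helper, and the tolerance for a `.zip` extension is an assumption for snapshots saved under names without a suffix. The demo runs against a temporary folder with empty placeholder files so that it does not depend on a finished training run.

```python
import tempfile
from pathlib import Path


def list_snapshots(models_dir):
    """Return snapshot files in models_dir, sorted by their timestep count."""
    snapshots = []
    for path in Path(models_dir).iterdir():
        if not path.is_file():
            continue
        # tolerate an optional ".zip" extension (assumption, see lead-in)
        name = path.name[:-4] if path.name.endswith(".zip") else path.name
        try:
            snapshots.append((float(name), path))
        except ValueError:
            continue  # skip files that are not named after a timestep count
    return [path for _, path in sorted(snapshots, key=lambda item: item[0])]


# demo with empty placeholder files instead of real snapshots
with tempfile.TemporaryDirectory() as tmp:
    for file_name in ("20000.0", "160000.0", "10000.0", "notes.txt"):
        (Path(tmp) / file_name).touch()
    ordered = [path.name for path in list_snapshots(tmp)]

print(ordered)
```

With real snapshots, one would call `list_snapshots(models_dir)` and pass the chosen path to `PPO.load`.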


## Enjoy the agent (PPO)

Some lines of code are repeats of code from previous sections of the notebook. I did this intentionally so that this section can be executed without executing any cell from previous sections.

In [1]:
# imports and path settings
import gymnasium as gym
from stable_baselines3 import PPO

models_dir, logs_dir = "models/cartpole/PPO", "logs/cartpole"



In [3]:
# recreate env and load a formerly saved snapshot and render animations from it
env = gym.make("CartPole-v1", render_mode="rgb_array")
env.reset()

# to enjoy another snapshot explore the folder models_dir for saved snapshots
model = PPO.load(f"{models_dir}/160000.0", env=env)

vec_env = model.get_env()
obs = vec_env.reset()

episodes = 3

for episode in range(episodes):
    # VecEnv resets automatically but one could optionally reset it here
    # obs = vec_env.reset()
    done = False
    total_reward_episode = 0
    while not done:
        action, _state = model.predict(obs, deterministic=True)
        # the VecEnv returns arrays with one entry per environment (here: one)
        obs, reward, done, info = vec_env.step(action)
        vec_env.render("human")
        # add the actual reward; CartPole-v1 gives +1 per step
        total_reward_episode += int(reward[0])

    print(f"Total reward in episode {episode + 1} was {total_reward_episode}")

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




Total reward in episode 1 was 500
Total reward in episode 2 was 500
Total reward in episode 3 was 500



## Evaluation of the training process using TensorBoard

The TensorBoard comparison of the mean reward of the two models shows that
- PPO reaches the maximum mean reward 25k steps earlier
- PPO approaches the maximum on a steadier path
- PPO holds the maximum, whereas A2C drops to ~470 after ~170k steps
- PPO completes the 200k steps ~50% faster

![Tensorboard](images/Tensorboard_cartpole_reward_mean_A2C_vs_PPO.png)
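
A statement such as "maxes out N steps earlier" can be checked numerically once the reward curves are available as `(step, value)` pairs, for example after exporting the scalar data from TensorBoard. The sketch below shows only the comparison logic. The two curves are made-up placeholder values for illustration, not the measurements behind the chart above, and `first_step_at_or_above` is a hypothetical helper.

```python
def first_step_at_or_above(curve, threshold):
    """Return the first step whose value reaches the threshold, or None."""
    for step, value in curve:
        if value >= threshold:
            return step
    return None


# made-up (step, mean reward) pairs, NOT taken from the runs in this notebook
curve_a = [(10_000, 40.0), (20_000, 120.0), (30_000, 500.0), (40_000, 470.0)]
curve_b = [(10_000, 90.0), (20_000, 500.0), (30_000, 500.0), (40_000, 500.0)]

step_a = first_step_at_or_above(curve_a, 500.0)
step_b = first_step_at_or_above(curve_b, 500.0)
print(f"curve_b reaches 500 {step_a - step_b} steps earlier than curve_a")
```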