# Stable-Baselines3 practice lab: Lunar Lander

Code snippets are  
- from the official documentation of [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) and  
- from [Reinforcement Learning in Python with Stable Baselines 3](https://pythonprogramming.net/introduction-reinforcement-learning-stable-baselines-3-tutorial/).

In [1]:
# imports

import gymnasium as gym
from stable_baselines3 import A2C, PPO
from stable_baselines3.common.evaluation import evaluate_policy
import tqdm
import os

2025-01-05 19:14:50.769772: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1736100890.783596    4365 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1736100890.787783    4365 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-05 19:14:50.802001: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
# prepare output folders of model snapshots and logs for tensorboard
models_dir, logs_dir = "models/lunarlander/PPO", "logs/lunarlander"

# exist_ok=True makes the call a no-op when the folder already exists
os.makedirs(models_dir, exist_ok=True)
os.makedirs(logs_dir, exist_ok=True)

In [3]:
# create the Lunar Lander environment
env = gym.make("LunarLander-v3", render_mode="rgb_array")
env.reset()

(array([ 1.15966795e-04,  1.41220427e+00,  1.17322486e-02,  5.70709035e-02,
        -1.27612017e-04, -2.65754969e-03,  0.00000000e+00,  0.00000000e+00],
       dtype=float32),
 {})
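The eight numbers returned by `env.reset()` are easier to read with labels. The sketch below assumes the layout described in the Gymnasium documentation for Lunar Lander (position, linear velocity, angle, angular velocity, and one contact flag per leg). The field names and the `label_observation` helper are illustrative choices, not part of any library.

```python
# Assumed observation layout for LunarLander (names are illustrative).
OBS_NAMES = (
    "x", "y",                          # lander position
    "vx", "vy",                        # linear velocity
    "angle", "angular_velocity",
    "leg_1_contact", "leg_2_contact",  # 1.0 when a leg touches the ground
)


def label_observation(obs):
    """Pair each entry of an 8-element observation with a readable name."""
    values = [float(v) for v in obs]
    if len(values) != len(OBS_NAMES):
        raise ValueError(f"expected {len(OBS_NAMES)} values, got {len(values)}")
    return dict(zip(OBS_NAMES, values))


# The observation printed by env.reset() above, copied by hand.
first_obs = [1.15966795e-04, 1.41220427e+00, 1.17322486e-02, 5.70709035e-02,
             -1.27612017e-04, -2.65754969e-03, 0.0, 0.0]

labeled = label_observation(first_obs)
for name, value in labeled.items():
    print(f"{name:>17}: {value: .6f}")
```

Read this way, the lander starts near the horizontal center, well above the ground, with both leg contact flags at zero.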

In [4]:
# create the model of choice
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log=logs_dir)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




In [5]:
# train the model and save the snapshots and create the logs along the way
TIMESTEPS = 1e4 # timesteps per training round; a float, so snapshot names end in ".0"
episodes = 50 # number of training rounds (not environment episodes)
for episode in range(episodes):
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="PPO")
    model.save(f"{models_dir}/{TIMESTEPS*(episode + 1)}")

Logging to logs/lunarlander/PPO_0
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 99       |
|    ep_rew_mean     | -179     |
| time/              |          |
|    fps             | 780      |
|    iterations      | 1        |
|    time_elapsed    | 2        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 92.7        |
|    ep_rew_mean          | -188        |
| time/                   |             |
|    fps                  | 377         |
|    iterations           | 2           |
|    time_elapsed         | 10          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.007320149 |
|    clip_fraction        | 0.0382      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.38       |
|    explained_variance   | 0.00795   

While the model trains, you can watch the metrics over time by opening a new terminal and running `tensorboard --logdir=logs/lunarlander`.

To see to what extent the GPU is used, run `nvidia-smi` in a new terminal.
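The snapshot names and the step counts behind them are worth a quick check. The following sketch is a pure-Python simulation, not Stable-Baselines3 code. It assumes that PPO collects whole rollouts of 2048 steps (as the `total_timesteps` column in the log suggests) and that with `reset_num_timesteps=False` each `learn` call keeps collecting until the running counter has grown by at least `TIMESTEPS`. The helper name `simulate_training` is hypothetical.

```python
def simulate_training(rounds=50, steps_per_round=1e4, rollout_steps=2048):
    """Return (snapshot_name, simulated_timesteps) for each training round.

    Assumption: every learn() call runs whole rollouts until the running
    step counter has advanced by at least steps_per_round.
    """
    num_timesteps = 0
    snapshots = []
    for round_idx in range(rounds):
        target = num_timesteps + steps_per_round
        while num_timesteps < target:
            num_timesteps += rollout_steps  # one full rollout at a time
        # Same expression as in the training loop: float times int gives a float.
        name = f"{steps_per_round * (round_idx + 1)}"
        snapshots.append((name, num_timesteps))
    return snapshots


snapshots = simulate_training()
for name, steps in (snapshots[0], snapshots[19], snapshots[-1]):
    print(f"snapshot {name!r} after about {steps} simulated timesteps")
```

Under these assumptions the file named `200000.0` corresponds to roughly 204800 collected timesteps, because each round overshoots its 10000-step budget to finish the last rollout. The `.0` suffix comes from `TIMESTEPS` being the float `1e4`.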

In [8]:
# load and evaluate a formerly saved snapshot
model = PPO.load(f"{models_dir}/200000.0", env=env)
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"The mean reward is {mean_reward} with a standard deviation of {std_reward}")

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
The mean reward is 147.9203316 with a standard deviation of 95.44333679253258
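A standard deviation of about 95 next to a mean of about 148 means the individual evaluation episodes differ a lot. To look at that spread, `evaluate_policy` can be called with `return_episode_rewards=True`, which returns the per-episode rewards and lengths instead of the two summary numbers. The sketch below shows how such a list could be summarized. It assumes a population standard deviation, and the reward values are made up for illustration, not taken from this run.

```python
from statistics import fmean, pstdev


def summarize_rewards(episode_rewards):
    """Return (mean, population standard deviation) of per-episode rewards."""
    if not episode_rewards:
        raise ValueError("need at least one episode reward")
    return fmean(episode_rewards), pstdev(episode_rewards)


# Hypothetical per-episode rewards, for illustration only.
hypothetical_rewards = [250.0, 30.0, 200.0, 120.0]

mean_reward, std_reward = summarize_rewards(hypothetical_rewards)
print(f"mean={mean_reward:.1f} std={std_reward:.1f} "
      f"min={min(hypothetical_rewards):.1f} max={max(hypothetical_rewards):.1f}")
```

Printing the minimum and maximum alongside the mean makes it easier to tell whether a few crashed landings are dragging an otherwise good policy down.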


In [10]:
# recreate env and load a formerly saved snapshot and render animations from it
env = gym.make("LunarLander-v3", render_mode="rgb_array")
env.reset()

model = PPO.load(f"{models_dir}/200000.0", env=env)

vec_env = model.get_env()
obs = vec_env.reset()

episodes = 2
total_reward_episode = 0

for episode in range(episodes):
    # VecEnv resets automatically but one could optionally reset it here
    # obs = vec_env.reset()
    done = False    
    total_reward_episode = 0
    while not done:
        action, _state = model.predict(obs, deterministic=True)
        obs, reward, done, info = vec_env.step(action)
        vec_env.render("human")
        total_reward_episode += reward

    print(f"Total reward in episode {episode + 1} was {total_reward_episode}")

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




Total reward in episode 1 was [181.28253]
Total reward in episode 2 was [-95.54287]
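The totals are printed as `[181.28253]` rather than `181.28253` because a vectorized environment returns one reward and one done flag per wrapped environment, even when it wraps only one. A minimal sketch of a rollout loop that indexes the single environment explicitly is shown below. `StubVecEnv` is a hypothetical stand-in written for this example so that it runs without Stable-Baselines3. It only mimics the shape of the return values of `reset` and `step`.

```python
class StubVecEnv:
    """Hypothetical one-environment stand-in that returns per-env lists."""

    def __init__(self, rewards):
        self._rewards = list(rewards)
        self._t = 0

    def reset(self):
        self._t = 0
        return [[0.0]]  # one observation per wrapped environment

    def step(self, actions):
        reward = self._rewards[self._t]
        self._t += 1
        done = self._t == len(self._rewards)
        if done:
            self._t = 0  # mimic the automatic reset at the end of an episode
        return [[float(self._t)]], [reward], [done], [{}]


def run_episode(vec_env, obs, policy):
    """Run one episode and return (scalar total reward, last observation)."""
    total = 0.0
    done = False
    while not done:
        actions = policy(obs)
        obs, rewards, dones, infos = vec_env.step(actions)
        total += float(rewards[0])  # index 0: the only wrapped environment
        done = bool(dones[0])
    return total, obs


stub_env = StubVecEnv([1.5, -0.5, 100.0])
obs = stub_env.reset()
total, obs = run_episode(stub_env, obs, policy=lambda o: [0])
print(f"Total reward was {total:.2f}")
```

With the real `vec_env`, the same indexing (`reward[0]`, `done[0]`) would give plain numbers in the printed message.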


The TensorBoard comparison of the mean reward of the two models (A2C and PPO) shows that
- PPO reaches its maximum about 25k steps earlier than A2C
- PPO gets there on a steadier path
- PPO holds the maximum, whereas A2C drops to ~470 after ~170k steps
- PPO completes the 200k steps ~50% faster

![Tensorboard](images/Tensorboard_rewart_mean_A2C_vs_PPO.png)
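A comparison like this needs both algorithms to log into the same TensorBoard folder under different run names. The sketch below generalizes the PPO training loop from this lab to either algorithm. It reuses only the calls already shown above (`learn`, `save`, `tensorboard_log`, `tb_log_name`). The helper names, the `models/lunarlander/A2C` folder, and the integer step counts are assumptions made for this example, and the A2C run itself is not part of this notebook.

```python
import os

ALGO_NAMES = ("A2C", "PPO")


def snapshot_dir(algo_name, root="models/lunarlander"):
    """Folder for one algorithm's snapshots (layout is an assumption)."""
    return f"{root}/{algo_name}"


def train_and_log(algo_name, rounds=50, steps_per_round=10_000,
                  logs_dir="logs/lunarlander"):
    """Train one algorithm in rounds, saving a snapshot after each round."""
    # Imported here so the helpers above stay usable without the libraries.
    import gymnasium as gym
    from stable_baselines3 import A2C, PPO

    algo_cls = {"A2C": A2C, "PPO": PPO}[algo_name]
    out_dir = snapshot_dir(algo_name)
    os.makedirs(out_dir, exist_ok=True)
    os.makedirs(logs_dir, exist_ok=True)

    env = gym.make("LunarLander-v3")
    model = algo_cls("MlpPolicy", env, verbose=0, tensorboard_log=logs_dir)
    for round_idx in range(rounds):
        # tb_log_name keeps the runs of the two algorithms apart in TensorBoard.
        model.learn(total_timesteps=steps_per_round,
                    reset_num_timesteps=False, tb_log_name=algo_name)
        # Integer step counts here, so names differ from the "200000.0" style.
        model.save(f"{out_dir}/{steps_per_round * (round_idx + 1)}")
    env.close()
```

Calling `train_and_log(name)` for each entry of `ALGO_NAMES` and then pointing TensorBoard at `logs/lunarlander` should show both reward curves in one chart.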