# Stable-Baselines3 practice lab: Lunar Lander

Code snippets are  
- from the official documentation of [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) and  
- from [Reinforcement Learning in Python with Stable Baselines 3](https://pythonprogramming.net/introduction-reinforcement-learning-stable-baselines-3-tutorial/).

## Create and train the PPO model

In [None]:
# imports
import gymnasium as gym
import os
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy



In [4]:
# prepare output folders of model snapshots and logs for tensorboard
models_dir, logs_dir = "models/lunarlander/PPO", "logs/lunarlander"

# exist_ok=True makes the calls safe to repeat when the folders already exist
os.makedirs(models_dir, exist_ok=True)
os.makedirs(logs_dir, exist_ok=True)

In [3]:
# create the lunar lander environment
env = gym.make("LunarLander-v3", render_mode="rgb_array")
env.reset()

(array([ 1.15966795e-04,  1.41220427e+00,  1.17322486e-02,  5.70709035e-02,
        -1.27612017e-04, -2.65754969e-03,  0.00000000e+00,  0.00000000e+00],
       dtype=float32),
 {})

In [4]:
# create the model of choice
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log=logs_dir)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




In [None]:
# train the model in chunks, save a snapshot after each chunk and write the TensorBoard logs
TIMESTEPS = 1e4 # timesteps per training chunk (a float, so snapshot names end in ".0")
episodes = 50 # number of training chunks (not environment episodes); 50 * 1e4 = 500k timesteps in total
for episode in range(episodes):
    model.learn(total_timesteps=int(TIMESTEPS), reset_num_timesteps=False, tb_log_name="PPO")
    model.save(f"{models_dir}/{TIMESTEPS*(episode + 1)}")
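
Because `TIMESTEPS` is a float (`1e4`), the snapshot names built in the loop above carry a trailing `.0`, which is why later cells load `500000.0`. The following is a minimal sketch that reproduces only the naming arithmetic of that loop, without training anything:

```python
# Reproduce the snapshot names generated by the training loop (no training involved).
TIMESTEPS = 1e4   # same float value as in the training cell
episodes = 50     # same number of training chunks as in the training cell

snapshot_names = [f"{TIMESTEPS * (episode + 1)}" for episode in range(episodes)]
total_timesteps = int(TIMESTEPS * episodes)

print(snapshot_names[0])    # name of the first snapshot
print(snapshot_names[-1])   # name of the last snapshot
print(total_timesteps)      # overall training budget
```

Casting with `int(...)` inside the f-string would give names without the `.0`, but the load paths used later in this notebook would then have to change as well.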

Now, while the model trains, we can view the results over time by 
- opening a new terminal
- invoking `tensorboard --logdir=logs/lunarlander`
- opening http://localhost:6006/ in the browser (check the terminal output for the actual port on your system)

To watch progress in near real-time, switch on the TensorBoard setting "Reload data" (look for the gear icon in TensorBoard's main menu bar).

To see to what extent the GPU is used, type `nvidia-smi` into a new terminal.
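
`nvidia-smi` reports the actual GPU load. As a complementary check from inside Python, the sketch below only asks PyTorch (which Stable-Baselines3 is built on) whether a CUDA device is visible at all. The helper name `describe_training_device` is made up for this example.

```python
def describe_training_device() -> str:
    """Return a short description of the device PyTorch can see."""
    try:
        import torch  # imported lazily so the function degrades gracefully
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        # Name of the first visible CUDA device
        return f"cuda ({torch.cuda.get_device_name(0)})"
    return "cpu"


print(describe_training_device())
```

This says nothing about utilization; it only confirms whether the "Using cuda device" message printed by the model cell can be expected on this machine.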


## Evaluation of the training process using TensorBoard


TensorBoard shows that 
- PPO needed 100k timesteps to start producing positive rewards
- At first, PPO developed a strategy of approaching the landing slowly while its reward increased. This phase ended at around 140k timesteps, with an episode length of ~770 timesteps per landing approach.
- From then on, PPO made the landing more efficient, reducing the timesteps per landing approach to 320 at around 320k timesteps.
- Toward the end of the 500k timestep training budget, the model showed a positive trend in both shorter episode lengths and higher rewards. Further training might therefore have been beneficial.
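
The observations above were read off the TensorBoard charts. They could also be checked programmatically; the following is a rough sketch under a few assumptions: the run folder name (`PPO_0` below) is a guess and should be looked up in `logs/lunarlander`, and the scalar tags `rollout/ep_rew_mean` and `rollout/ep_len_mean` are the ones Stable-Baselines3 normally writes. The helper names are made up for this example.

```python
from typing import Optional, Sequence


def first_step_above(steps: Sequence[int], values: Sequence[float],
                     threshold: float = 0.0) -> Optional[int]:
    """Return the first logged step whose value exceeds the threshold, or None."""
    for step, value in zip(steps, values):
        if value > threshold:
            return step
    return None


def load_scalar(run_dir: str, tag: str):
    """Read one scalar series (steps and values) from a TensorBoard run folder."""
    # Imported lazily so first_step_above() works without TensorBoard installed.
    from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

    accumulator = EventAccumulator(run_dir)
    accumulator.Reload()  # parse the event files on disk
    events = accumulator.Scalars(tag)
    return [event.step for event in events], [event.value for event in events]


# Possible usage (the run folder name is an assumption, adjust it to your logs):
# steps, rewards = load_scalar("logs/lunarlander/PPO_0", "rollout/ep_rew_mean")
# print(first_step_above(steps, rewards, threshold=0.0))
```

The same `load_scalar` call with `rollout/ep_len_mean` would give the episode-length curve discussed in the list.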

![Tensorboard](images/Tensorboard_lunarlander_PPO.png)

## Load and evaluate a previously saved snapshot

Some code repeats earlier sections of the notebook. This is intentional, so that this section can be run without executing any cell from previous sections.

In [3]:
# imports and path settings
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

models_dir, logs_dir = "models/lunarlander/PPO", "logs/lunarlander"


In [None]:
# create environment, load and evaluate model @ a saved snapshot
env = gym.make("LunarLander-v3", render_mode="rgb_array")
env.reset()

# to evaluate another snapshot explore the folder models_dir for saved snapshots
model = PPO.load(f"{models_dir}/500000.0", env=env)

mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"The mean reward is {mean_reward} with a standard deviation of {std_reward}")

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.

The mean reward is 109.28687439999999 with a standard deviation of 134.06204899112822
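
The standard deviation reported for this snapshot is larger than the mean, so the individual episodes are worth a look. `evaluate_policy` accepts `return_episode_rewards=True` and then returns the per-episode rewards and lengths instead of the two summary numbers. The sketch below wraps that call; the function names `summarize` and `evaluate_snapshot` are made up for this example, and the population standard deviation is used on the assumption that it matches what `evaluate_policy` reports.

```python
from statistics import fmean, pstdev
from typing import Sequence, Tuple


def summarize(episode_rewards: Sequence[float]) -> Tuple[float, float]:
    """Return mean and population standard deviation of per-episode rewards."""
    return fmean(episode_rewards), pstdev(episode_rewards)


def evaluate_snapshot(snapshot_path: str, n_eval_episodes: int = 10):
    """Load a PPO snapshot and return its per-episode rewards and lengths."""
    # Imported lazily so summarize() can be used without the RL stack installed.
    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy

    env = gym.make("LunarLander-v3")
    model = PPO.load(snapshot_path, env=env)
    rewards, lengths = evaluate_policy(
        model,
        model.get_env(),
        n_eval_episodes=n_eval_episodes,
        return_episode_rewards=True,  # lists per episode instead of mean/std
    )
    return rewards, lengths


# Possible usage with the snapshot evaluated in this section:
# rewards, lengths = evaluate_snapshot("models/lunarlander/PPO/500000.0")
# print(summarize(rewards))
```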

## Enjoy the agent

Some code repeats earlier sections of the notebook. This is intentional, so that this section can be run without executing any cell from previous sections.

In [1]:
# imports and path settings
import gymnasium as gym
from stable_baselines3 import PPO

models_dir, logs_dir = "models/lunarlander/PPO", "logs/lunarlander"



In [2]:
# recreate the env, load a previously saved snapshot and render animations from it
env = gym.make("LunarLander-v3", render_mode="rgb_array")
env.reset()

# to enjoy another snapshot explore the folder models_dir for saved snapshots
model = PPO.load(f"{models_dir}/500000.0", env=env)

vec_env = model.get_env()
obs = vec_env.reset()

episodes = 5

try:
    for episode in range(episodes):
        # VecEnv resets automatically but one could optionally reset it here
        # obs = vec_env.reset()
        done = False
        total_reward_episode = 0
        while not done:
            action, _state = model.predict(obs, deterministic=True)
            # the VecEnv wraps a single env, so reward and done are arrays of length 1
            obs, reward, done, info = vec_env.step(action)
            vec_env.render('human')
            total_reward_episode += reward

        print(f"Total reward in episode {episode + 1} was {total_reward_episode}")
finally:
    vec_env.close()

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.



Total reward in episode 1 was [-43.39543]
Total reward in episode 2 was [235.18979]
Total reward in episode 3 was [-1.2233429]
Total reward in episode 4 was [232.4602]
Total reward in episode 5 was [-69.20262]
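
The totals are printed in brackets because the vectorized environment returns the reward as a length-1 array. To put the five numbers into perspective, the sketch below counts how many episodes reach 200 points. That threshold is an assumption taken from the Gymnasium description of Lunar Lander, where an episode scoring at least 200 is considered solved; please verify it against the version you use.

```python
# Episode totals copied from the output above (brackets removed).
episode_rewards = [-43.39543, 235.18979, -1.2233429, 232.4602, -69.20262]

# Assumption: score from which Gymnasium considers a Lunar Lander episode solved.
SOLVED_THRESHOLD = 200.0

solved_count = sum(reward >= SOLVED_THRESHOLD for reward in episode_rewards)
success_rate = solved_count / len(episode_rewards)

print(f"{solved_count} of {len(episode_rewards)} episodes reached {SOLVED_THRESHOLD} points")
print(f"Success rate: {success_rate:.0%}")
```

Five episodes are a very small sample, so this only illustrates the spread that the evaluation section already hinted at.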
