<a href="https://colab.research.google.com/github/wengti/Reinforcement-Learning-Tutorial-/blob/main/%5BRL%5D_Unit_1_Note.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install the libraries needed in this notebook

In [1]:
!apt install swig cmake

!pip install stable-baselines3==2.0.0a5
!pip install swig
!pip install gymnasium[box2d]
!pip install huggingface_sb3


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
cmake is already the newest version (3.22.1-1ubuntu1.22.04.2).
Suggested packages:
  swig-doc swig-examples swig4.0-examples swig4.0-doc
The following NEW packages will be installed:
  swig swig4.0
0 upgraded, 2 newly installed, 0 to remove and 35 not upgraded.
Need to get 1,116 kB of archives.
After this operation, 5,542 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 swig4.0 amd64 4.0.2-1ubuntu1 [1,110 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 swig all 4.0.2-1ubuntu1 [5,632 B]
Fetched 1,116 kB in 1s (853 kB/s)
Selecting previously unselected package swig4.0.
(Reading database ... 126111 files and directories currently installed.)
Preparing to unpack .../swig4.0_4.0.2-1ubuntu1_amd64.deb ...
Unpacking swig4.0 (4.0.2-1ubuntu1) ...
Selecting previously unselected package swig.
Preparing to unpack .../swig_4.0.2-1ubunt

# The flow of creating and using an environment

1. Create an environment.
2. Reset the environment to the initial state.
3. Sample an action.
4. Take that action.
5. Check whether the action led to termination (the episode ended) or truncation (a time limit or physical boundary was exceeded).
6. Repeat steps 3 to 5 a number of times.
7. Close the environment.
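
The seven steps above can be sketched without installing anything. The block below is a toy stand-in, not Gymnasium itself: `ToyLanderEnv`, its altitude dynamics, and the `run` helper are all hypothetical, and only the `reset` / `step` return shapes are meant to mirror the real API. It shows the difference between an episode that terminates (the lander touches down) and one that is truncated (the step limit is hit first).

```python
import random


class ToyLanderEnv:
    """Hypothetical stand-in that mimics the reset/step signatures of a Gymnasium env."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.altitude = 0
        self.steps = 0

    def reset(self):
        self.altitude = 3
        self.steps = 0
        return self.altitude, {}  # observation, info

    def step(self, action):
        self.steps += 1
        if action == 0:  # 0 = fall by one unit, 1 = hover in place
            self.altitude -= 1
        terminated = self.altitude <= 0  # the task itself ended (touchdown)
        truncated = (not terminated) and self.steps >= self.max_steps  # time limit hit
        reward = 1.0 if terminated else 0.0
        return self.altitude, reward, terminated, truncated, {}

    def close(self):
        pass


def run(env, policy, n_steps):
    """Steps 2 to 7 of the list above; returns how each finished episode ended."""
    endings = []
    observation, info = env.reset()  # step 2
    for _ in range(n_steps):  # step 6
        action = policy(observation)  # step 3
        observation, reward, terminated, truncated, info = env.step(action)  # step 4
        if terminated or truncated:  # step 5
            endings.append("terminated" if terminated else "truncated")
            observation, info = env.reset()
    env.close()  # step 7
    return endings


# A random policy, as in step 3 of the list
print(run(ToyLanderEnv(), lambda obs: random.choice([0, 1]), n_steps=20))
```

A policy that always falls ends every episode by termination, while one that always hovers only ever ends by truncation, which is why the two flags are reported separately.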

# Useful references
* Learn the environment's parameters from its documentation.
  - For instance, https://gymnasium.farama.org/environments/box2d/lunar_lander/ provides details on the state space, action space and reward.
* Learn how the environment works as an object.
  - https://gymnasium.farama.org/api/env/#gymnasium.Env provides details on the methods generally used with an env object.




In [3]:
import gymnasium as gym

# 1. First create an environment (This example focuses on the Lunar Lander Environment)
# https://gymnasium.farama.org/environments/box2d/lunar_lander/
env = gym.make("LunarLander-v2")

# 2. Reset the environment to the initial state
observation, info = env.reset()

# 3. Randomly sample an action
# 4. Take the action
# 5. Check whether the action led to termination (the episode ended) or truncation (a time limit or physical boundary was exceeded)
# 6. Repeat steps 3 to 5 a number of times

for _ in range(20):

  # 3. Randomly sample an action
  action = env.action_space.sample()
  print(f"Action taken: {action}")

  # 4. And take the action (https://gymnasium.farama.org/api/env/#gymnasium.Env.step)
  observation, reward, terminated, truncated, info = env.step(action)

  # 5. Check if the action leads to termination or truncation
  if terminated or truncated:
    print("Environment is reset!")
    observation, info = env.reset()

# 7. Close the environment
env.close()

Action taken: 3
Action taken: 2
Action taken: 3
Action taken: 0
Action taken: 1
Action taken: 0
Action taken: 2
Action taken: 2
Action taken: 1
Action taken: 1
Action taken: 2
Action taken: 3
Action taken: 0
Action taken: 1
Action taken: 1
Action taken: 0
Action taken: 3
Action taken: 2
Action taken: 1
Action taken: 2


# Visualize Observation Space
* For the meaning of each value in the observation space, refer to: https://gymnasium.farama.org/environments/box2d/lunar_lander/#

In [6]:
import gymnasium as gym

env = gym.make('LunarLander-v2')

print(f"The shape of the observation space: {env.observation_space.shape}")
print(f"A sample of the observation space: {env.observation_space.sample()}")

The shape of the observation space: (8,)
A sample of the observation space: [-75.64905     78.99128     -2.5721257    4.0058837   -2.0155616
   4.15837      0.38517615   0.49400613]
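
To make the eight numbers easier to read, the sketch below attaches a label to each component of the sample printed above. The component order follows the Lunar Lander documentation page (position, linear velocity, angle, angular velocity, two leg-contact flags); the label strings and the `label_observation` helper are this sketch's own naming. Note that `observation_space.sample()` draws from the space's bounds rather than from real gameplay, which is presumably why the two contact flags here are fractional instead of 0 or 1.

```python
# Component order as described in the Lunar Lander documentation;
# the label strings themselves are an assumption of this sketch.
OBSERVATION_LABELS = [
    "x position",
    "y position",
    "x velocity",
    "y velocity",
    "angle",
    "angular velocity",
    "first leg contact",
    "second leg contact",
]


def label_observation(observation):
    """Pair each observation value with a human-readable label."""
    if len(observation) != len(OBSERVATION_LABELS):
        raise ValueError("expected an observation with 8 components")
    return dict(zip(OBSERVATION_LABELS, observation))


# The sample printed by the cell above
sample = [-75.64905, 78.99128, -2.5721257, 4.0058837,
          -2.0155616, 4.15837, 0.38517615, 0.49400613]

labelled = label_observation(sample)
for name, value in labelled.items():
    print(f"{name:>18}: {value: .4f}")
```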


# Visualize Action Space

In [15]:
import gymnasium as gym

env = gym.make("LunarLander-v2")

print(f"The available action in the action space: {env.action_space} / {env.action_space.n}")
print(f"A sample of the action space: {env.action_space.sample()}")

The available action in the action space: Discrete(4) / 4
A sample of the action space: 2
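
The four discrete actions are documented on the Lunar Lander page as: do nothing, fire the left orientation engine, fire the main engine, and fire the right orientation engine. As a small illustration, the sketch below decodes the 20 random actions printed by the first rollout cell in this notebook; the `ACTION_MEANINGS` dictionary is just a convenience defined here, not part of Gymnasium.

```python
from collections import Counter

# Meanings as listed in the Lunar Lander documentation
ACTION_MEANINGS = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}

# The "Action taken" values printed by the random rollout earlier in the notebook
logged_actions = [3, 2, 3, 0, 1, 0, 2, 2, 1, 1, 2, 3, 0, 1, 1, 0, 3, 2, 1, 2]

counts = Counter(ACTION_MEANINGS[action] for action in logged_actions)
for meaning, count in counts.most_common():
    print(f"{meaning:<30} {count}")
```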


# Train an agent using a Vectorised Environment

In [16]:
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env


# 1. Make a vectorised environment (16 independent copies of the environment), so each update draws on more diverse experience
# https://stable-baselines3.readthedocs.io/en/master/common/env_util.html#stable_baselines3.common.env_util.make_vec_env
env = make_vec_env(env_id = "LunarLander-v2",
                   n_envs = 16)

# 2. Create a PPO model
# https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html
model = PPO(policy = 'MlpPolicy',
            env = env,
            n_steps = 1024,
            batch_size = 64,
            n_epochs = 4,
            gamma = 0.999,
            gae_lambda = 0.98,
            ent_coef = 0.01,
            verbose = 1)

# 3. Train the PPO model
model.learn(total_timesteps = int(1e6))

# 4. Save the trained PPO model
model_name = "ppo-LunarLander-v2"
model.save(model_name)

Using cuda device
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 89.1     |
|    ep_rew_mean     | -187     |
| time/              |          |
|    fps             | 3486     |
|    iterations      | 1        |
|    time_elapsed    | 4        |
|    total_timesteps | 16384    |
---------------------------------
-------------------------------------------
| rollout/                |               |
|    ep_len_mean          | 98.8          |
|    ep_rew_mean          | -167          |
| time/                   |               |
|    fps                  | 2318          |
|    iterations           | 2             |
|    time_elapsed         | 14            |
|    total_timesteps      | 32768         |
| train/                  |               |
|    approx_kl            | 0.010340167   |
|    clip_fraction        | 0.0583        |
|    clip_range           | 0.2           |
|    entropy_loss         | -1.38         |
|    explained_variance   
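
The log above reports `total_timesteps | 16384` after the first iteration, which is `n_envs * n_steps`. Below is a rough sketch of the bookkeeping implied by the hyperparameters in the training cell. It assumes PPO keeps collecting full rollouts until the step budget is reached, so the final count can overshoot the request slightly; all variable names other than the hyperparameters are this sketch's own.

```python
import math

# Hyperparameters copied from the training cell
n_envs = 16
n_steps = 1024
batch_size = 64
n_epochs = 4
total_timesteps = 1_000_000

# Transitions gathered before each policy update (one "iteration" in the log)
rollout_size = n_envs * n_steps

# Each epoch splits the rollout into minibatches of batch_size
minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_iteration = minibatches_per_epoch * n_epochs

# Assumption: training stops after the first rollout that reaches the budget
iterations = math.ceil(total_timesteps / rollout_size)
collected_timesteps = iterations * rollout_size

print(f"rollout size:             {rollout_size}")
print(f"minibatches per epoch:    {minibatches_per_epoch}")
print(f"gradient steps/iteration: {gradient_steps_per_iteration}")
print(f"iterations:               {iterations}")
print(f"timesteps collected:      {collected_timesteps}")
```

Because `rollout_size` is divisible by `batch_size`, no minibatch is left partially filled.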

# Evaluate the trained agent

In [18]:
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

# 1. Create a new environment for evaluation
eval_env = Monitor(gym.make("LunarLander-v2", render_mode = 'rgb_array'))

# 2. Evaluate the model (uses the model trained in the earlier training cell)
# https://stable-baselines3.readthedocs.io/en/master/common/evaluation.html#stable_baselines3.common.evaluation.evaluate_policy
mean_reward, std_reward = evaluate_policy(model = model,
                                          env = eval_env,
                                          n_eval_episodes = 10,
                                          deterministic = True) # Take deterministic actions (instead of sampling / stochastic)

print(f"The earned reward is : {mean_reward:.2f} +/- {std_reward:.2f}")

The earned reward is : 240.90 +/- 19.26
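
The two numbers reported above are the mean and the standard deviation of the per-episode returns. A minimal sketch of that summary follows, assuming the population standard deviation (the NumPy default); the episode returns are made up for illustration and `summarize` is a hypothetical helper. The Lunar Lander documentation treats an episode scoring at least 200 points as a solution, so a conservative reading is to compare `mean - std` against that threshold.

```python
import statistics

SOLVED_THRESHOLD = 200  # per the Lunar Lander documentation


def summarize(episode_returns):
    """Return (mean, population standard deviation) of the episode returns."""
    mean = statistics.fmean(episode_returns)
    std = statistics.pstdev(episode_returns)
    return mean, std


# Hypothetical returns, only to show the computation
episode_returns = [250.0, 230.0, 260.0, 220.0, 240.0]
mean, std = summarize(episode_returns)
print(f"{mean:.2f} +/- {std:.2f}")

# Applying the conservative reading to the result reported above
reported_mean, reported_std = 240.90, 19.26
lower_bound = reported_mean - reported_std
print(f"mean - std = {lower_bound:.2f}, above threshold: {lower_bound >= SOLVED_THRESHOLD}")
```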


# Publish the trained model on the Hub

* Create a new token with the **write role** here: https://huggingface.co/settings/tokens

* Once the model is published on the Hub, it can be accessed via the **repo_url** in the output.

In [20]:
from huggingface_hub import notebook_login

notebook_login()
!git config --global credential.helper store


In [22]:
import gymnasium as gym
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv
from huggingface_sb3 import package_to_hub

# https://huggingface.co/docs/hub/en/stable-baselines3
package_to_hub(model = model, # The model trained in an earlier cell
               model_name = model_name, # The model name defined in an earlier cell
               model_architecture = "PPO",
               env_id = "LunarLander-v2",
               eval_env = DummyVecEnv([lambda: Monitor(gym.make("LunarLander-v2", render_mode = "rgb_array"))]), # DummyVecEnv expects a list of functions that each return an environment
               repo_id = "wengti0608/ppo-LunarLander-v2",
               commit_message = "First Commit")

ℹ This function will save, evaluate, generate a video of your agent,
create a model card and push everything to the hub. It might take up to 1min.
This is a work in progress: if you encounter a bug, please open an issue.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Saving video to /tmp/tmp2a59xuqt/-step-0-to-step-1000.mp4




Moviepy - Building video /tmp/tmp2a59xuqt/-step-0-to-step-1000.mp4.
Moviepy - Writing video /tmp/tmp2a59xuqt/-step-0-to-step-1000.mp4





Moviepy - Done !
Moviepy - video ready /tmp/tmp2a59xuqt/-step-0-to-step-1000.mp4
ℹ Pushing repo wengti0608/ppo-LunarLander-v2 to the Hugging Face
Hub


Uploading...:   0%|          | 0.00/452k [00:00<?, ?B/s]

ℹ Your model is pushed to the Hub. You can view your model here:
https://huggingface.co/wengti0608/ppo-LunarLander-v2/tree/main/


CommitInfo(commit_url='https://huggingface.co/wengti0608/ppo-LunarLander-v2/commit/16d0fbe46ece90043f8e2d482d4a221554b0fb97', commit_message='First Commit', commit_description='', oid='16d0fbe46ece90043f8e2d482d4a221554b0fb97', pr_url=None, repo_url=RepoUrl('https://huggingface.co/wengti0608/ppo-LunarLander-v2', endpoint='https://huggingface.co', repo_type='model', repo_id='wengti0608/ppo-LunarLander-v2'), pr_revision=None, pr_num=None)

# Load a saved LunarLander model from the Hub

* The model loaded here was trained with gym (the predecessor of gymnasium), so shimmy is used for version compatibility.
* However, this reinstalls gym, which clashes with gymnasium and can crash the session. The cell below therefore reinstalls all the required libraries.

In [1]:
!pip install shimmy
!apt install swig cmake

!pip install stable-baselines3==2.0.0a5
!pip install swig
!pip install gymnasium[box2d] shimmy
!pip install huggingface_sb3

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
swig is already the newest version (4.0.2-1ubuntu1).
cmake is already the newest version (3.22.1-1ubuntu1.22.04.2).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Collecting gymnasium==0.28.1 (from stable-baselines3==2.0.0a5)
  Using cached gymnasium-0.28.1-py3-none-any.whl.metadata (9.2 kB)
Using cached gymnasium-0.28.1-py3-none-any.whl (925 kB)
Installing collected packages: gymnasium
  Attempting uninstall: gymnasium
    Found existing installation: gymnasium 1.1.1
    Uninstalling gymnasium-1.1.1:
      Successfully uninstalled gymnasium-1.1.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
shimmy 2.0.0 requires gymnasium>=1.0.0a1, but you have gymnasium 0.28.1 which is incompatible.
dopamine-rl 4.1.2 requires gymnasium>=1.0.0, but you have

* A full list of models trained on LunarLander-v2 is available here: https://huggingface.co/models?search=LunarLander-v2
* This tutorial uses the following repository: https://huggingface.co/Classroom-workshop/assignment2-omar

In [2]:
from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO

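# The checkpoint was saved under a different Python version than the one running here
# (compare the two system-info blocks printed below), so some pickled objects,
# such as the learning-rate and clip-range schedules, may fail to deserialize.
# Placeholder custom_objects replace them; they matter for training, not for evaluation.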
custom_objects = {
    "learning_rate" : 0.0,
    "lr_schedule": lambda _: 0.0,
    "clip_range": lambda _: 0.0,
}

# https://huggingface.co/blog/sb3
checkpoint = load_from_hub(repo_id = "Classroom-workshop/assignment2-omar",
                           filename = "ppo-LunarLander-v2.zip")

# https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html
model = PPO.load(path = checkpoint,
                 custom_objects = custom_objects,
                 print_system_info = True)



ppo-LunarLander-v2.zip:   0%|          | 0.00/146k [00:00<?, ?B/s]

== CURRENT SYSTEM INFO ==
- OS: Linux-6.1.123+-x86_64-with-glibc2.35 # 1 SMP PREEMPT_DYNAMIC Sun Mar 30 16:01:29 UTC 2025
- Python: 3.11.13
- Stable-Baselines3: 2.0.0a5
- PyTorch: 2.6.0+cu124
- GPU Enabled: True
- Numpy: 2.0.2
- Cloudpickle: 3.1.1
- Gymnasium: 0.28.1
- OpenAI Gym: 0.25.2

== SAVED MODEL SYSTEM INFO ==
OS: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic #1 SMP Sun Apr 24 10:03:06 PDT 2022
Python: 3.7.13
Stable-Baselines3: 1.5.0
PyTorch: 1.11.0+cu113
GPU Enabled: True
Numpy: 1.21.6
Gym: 0.21.0



  deserialized_object = cloudpickle.loads(base64_object)
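
The idea behind `custom_objects` can be illustrated with a conceptual sketch. This is not SB3's actual loading code: `load_attributes`, `fragile_deserialize` and the `saved` dictionary are hypothetical, and the sketch only assumes that any saved entry whose key appears in `custom_objects` is replaced instead of being deserialized.

```python
def fragile_deserialize(raw):
    """Hypothetical deserializer that fails on objects pickled by another Python version."""
    if raw == "<pickled callable>":
        raise ValueError("cannot unpickle this object in the current environment")
    return raw


def load_attributes(saved, custom_objects, deserialize):
    """Rebuild saved attributes, letting custom_objects stand in for saved entries."""
    loaded = {}
    for key, raw in saved.items():
        if key in custom_objects:
            loaded[key] = custom_objects[key]  # skip deserialization entirely
        else:
            loaded[key] = deserialize(raw)
    return loaded


# Hypothetical contents of a checkpoint
saved = {
    "gamma": 0.999,
    "learning_rate": 0.0003,
    "lr_schedule": "<pickled callable>",
    "clip_range": "<pickled callable>",
}

# Same placeholders as in the loading cell: schedules receive the remaining
# training progress (from 1 down to 0) and here always return 0.0
custom_objects = {
    "learning_rate": 0.0,
    "lr_schedule": lambda _: 0.0,
    "clip_range": lambda _: 0.0,
}

loaded = load_attributes(saved, custom_objects, fragile_deserialize)
print(loaded["gamma"], loaded["learning_rate"], loaded["lr_schedule"](0.5))
```

Without the placeholders, the same call would stop at the first entry that cannot be deserialized.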


# Evaluate the loaded model

In [5]:
import gymnasium as gym
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

eval_env = Monitor(gym.make("LunarLander-v2", render_mode = "rgb_array"))

# Evaluate the model loaded from the Hub in the earlier cell
mean_reward, std_reward = evaluate_policy(model = model,
                                          env = eval_env,
                                          n_eval_episodes = 10,
                                          deterministic = True) # Take deterministic actions (instead of sampling)

print(f"The reward earned by the loaded model: {mean_reward:.2f} +/- {std_reward:.2f}")

The reward earned by the loaded model: 294.90 +/- 14.55
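
`deterministic = True` asks the policy for its most likely action rather than a random draw. Here is a minimal sketch for a discrete action space such as LunarLander's, assuming the policy outputs one probability per action; the probabilities and the `choose_action` helper are hypothetical, not SB3 internals.

```python
import random


def choose_action(action_probs, deterministic, rng=random):
    """Pick the most likely action, or sample one in proportion to its probability."""
    actions = range(len(action_probs))
    if deterministic:
        return max(actions, key=action_probs.__getitem__)
    return rng.choices(actions, weights=action_probs, k=1)[0]


# Hypothetical policy output for the four LunarLander actions
probs = [0.1, 0.2, 0.6, 0.1]

print("deterministic:", choose_action(probs, deterministic=True))

rng = random.Random(0)  # seeded so the sampled sequence is repeatable
draws = [choose_action(probs, deterministic=False, rng=rng) for _ in range(200)]
print("stochastic, distinct actions seen:", sorted(set(draws)))
```

Deterministic evaluation removes the sampling noise from the policy itself, so the remaining spread in the reported reward comes from the environment's random initial conditions.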
