# Install Dependencies

In [None]:
!apt install swig cmake

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
cmake is already the newest version (3.22.1-1ubuntu1.22.04.2).
Suggested packages:
  swig-doc swig-examples swig4.0-examples swig4.0-doc
The following NEW packages will be installed:
  swig swig4.0
0 upgraded, 2 newly installed, 0 to remove and 49 not upgraded.
Need to get 1,116 kB of archives.
After this operation, 5,542 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 swig4.0 amd64 4.0.2-1ubuntu1 [1,110 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 swig all 4.0.2-1ubuntu1 [5,632 B]
Fetched 1,116 kB in 1s (1,705 kB/s)
Selecting previously unselected package swig4.0.
(Reading database ... 123605 files and directories currently installed.)
Preparing to unpack .../swig4.0_4.0.2-1ubuntu1_amd64.deb ...
Unpacking swig4.0 (4.0.2-1ubuntu1) ...
Selecting previously unselected package swig.
Preparing to unpack .../swig_4.0.2-1ubu

Install Gymnasium, Stable-Baselines3, and the Hugging Face integration libraries

In [None]:
!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt

Collecting stable-baselines3==2.0.0a5 (from -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt (line 1))
  Downloading stable_baselines3-2.0.0a5-py3-none-any.whl.metadata (5.3 kB)
Collecting swig (from -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt (line 2))
  Downloading swig-4.2.1-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl.metadata (3.6 kB)
Collecting huggingface_sb3 (from -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt (line 4))
  Downloading huggingface_sb3-3.0-py3-none-any.whl.metadata (6.3 kB)
Collecting gymnasium[box2d] (from -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt (line 3))
  Downloading gymnasium-0.29.1-py3-none-any.whl.metadata (10 kB)
Collecting gymnasium==0.28.1 (from stable-baselines3==2.0.0a5->-r https://raw.githubusercon

Later in the notebook, we'll need to generate a replay video. In Colab, this requires a virtual screen so that the environment can be rendered (and its frames recorded).

The following cell installs the virtual screen libraries; the virtual screen itself is created and started a few cells further down 🖥️

In [None]:
!sudo apt-get update
!apt install python3-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Ign:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Get:7 https://r2u.stat.illinois.edu/ubuntu jammy Release [5,713 B]
Get:8 https://r2u.stat.illinois.edu/ubuntu jammy Release.gpg [793 B

In [None]:
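import os

# Kill the current process (signal 9 = SIGKILL) to force a runtime restart,
# so that the newly installed libraries are picked up
os.kill(os.getpid(), 9)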

In [None]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x7c68cc61cee0>

# Import the packages 📦

In [None]:
import gymnasium as gym

from huggingface_sb3 import load_from_hub, package_to_hub
from huggingface_hub import notebook_login  # log in to a Hugging Face account so that models can be uploaded to the Hub
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

# Create the LunarLander environment 🌛

In [None]:
env = gym.make("LunarLander-v2")
print("_____OBSERVATION SPACE_____ \n")
print("Observation space shape", env.observation_space.shape)
print("Sample Observation", env.observation_space.sample())

_____OBSERVATION SPACE_____ 

Observation space shape (8,)
Sample Observation [-78.75593    -35.06715      4.529825     3.9676156    2.677344
   2.8875573    0.2231788    0.21774516]


The output `Observation space shape (8,)` shows that the observation is a vector of size 8, where each value describes a different aspect of the lander:

* Horizontal coordinate of the lander (x)
* Vertical coordinate of the lander (y)
* Horizontal speed (x)
* Vertical speed (y)
* Angle
* Angular speed
* Whether the left leg is in contact with the ground (boolean)
* Whether the right leg is in contact with the ground (boolean)

In [None]:
print("_____ACTION SPACE_____ \n")
print("Action space shape", env.action_space.shape)
print("Action space sample", env.action_space.sample())

_____ACTION SPACE_____ 

Action space shape ()
Action space sample 0


The action space (the set of possible actions the agent can take) is discrete with 4 actions available 🎮:

* Action 0: Do nothing,
* Action 1: Fire left orientation engine,
* Action 2: Fire the main engine,
* Action 3: Fire right orientation engine.
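
Because the action space is discrete, each action is just an integer from 0 to 3, which is why the sampled action above prints as a bare number. As a readability aid, the ids can be wrapped in an enum. The `LunarLanderAction` class, its member names, and the `describe` helper are hypothetical labels chosen for this sketch; they are not part of Gymnasium:

```python
from enum import IntEnum


class LunarLanderAction(IntEnum):
    """Hypothetical labels for the four discrete action ids listed above."""
    DO_NOTHING = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3


def describe(action_id: int) -> str:
    """Return a human-readable name for an integer action id."""
    return LunarLanderAction(action_id).name.replace("_", " ").lower()


print(describe(0))
print(describe(2))
```

An `IntEnum` member compares equal to its integer value, so `int(LunarLanderAction.FIRE_MAIN_ENGINE)` yields the plain `2` that the environment expects as an action.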

Reward function (the function that will give a reward at each timestep) 💰:

After every step a reward is granted. The total reward of an episode is the sum of the rewards for all the steps within that episode.

For each step, the reward:

* Is increased/decreased the closer/further the lander is to the landing pad.
* Is increased/decreased the slower/faster the lander is moving.
* Is decreased the more the lander is tilted (angle not horizontal).
* Is increased by 10 points for each leg that is in contact with the ground.
* Is decreased by 0.03 points each frame a side engine is firing.
* Is decreased by 0.3 points each frame the main engine is firing.

The episode receives an **additional reward of -100 for crashing or +100 for landing safely**.

An episode is **considered a solution if it scores at least 200 points**.
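
The bookkeeping described above can be sketched in a few lines. This is only an illustration of "episode reward = sum of step rewards, compared against 200": the per-step values are made up, the function names are hypothetical, and the terminal bonus is modelled as a separate term for clarity.

```python
SOLVED_THRESHOLD = 200.0  # from the text: at least 200 points counts as a solution


def episode_return(step_rewards, terminal_bonus=0.0):
    """Sum the per-step rewards and add the terminal bonus (+100 or -100)."""
    return sum(step_rewards) + terminal_bonus


def is_solved(total_reward, threshold=SOLVED_THRESHOLD):
    """An episode counts as a solution if it reaches the threshold."""
    return total_reward >= threshold


# Made-up per-step rewards, for illustration only (not real environment output).
shaping_rewards = [1.5, 2.0, -0.3, 10.0, 10.0]
safe_landing = episode_return(shaping_rewards, terminal_bonus=100.0)
crash = episode_return(shaping_rewards, terminal_bonus=-100.0)

print(safe_landing, is_solved(safe_landing))
print(crash, is_solved(crash))
```

Note that the +100 landing bonus alone is not enough to reach 200: the agent also has to accumulate reward during the descent while spending little on engine firing.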

## Vectorized Environment
A vectorized environment stacks multiple independent environments into a single environment, so that the agent collects more diverse experience during training.

In [None]:
# Create the environment
env = make_vec_env("LunarLander-v2", n_envs=16)
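
To make the idea of "stacking" concrete, here is a toy, pure-Python sketch of a vectorized environment. `ToyEnv` and `ToyVecEnv` are hypothetical classes written for this illustration; they are not how Stable-Baselines3 implements `make_vec_env`, and the automatic reset on episode end is a design choice of this sketch.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment: an episode lasts a fixed number of steps."""

    def __init__(self, episode_length, seed):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.rng.random()  # dummy observation

    def step(self, action):
        self.t += 1
        obs = self.rng.random()
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.episode_length
        return obs, reward, done


class ToyVecEnv:
    """Steps several independent environments in lockstep."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        observations, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done = env.step(action)
            if done:
                obs = env.reset()  # start a new episode so the batch stays full
            observations.append(obs)
            rewards.append(reward)
            dones.append(done)
        return observations, rewards, dones


vec_env = ToyVecEnv([ToyEnv(episode_length=3, seed=i) for i in range(4)])
first_obs = vec_env.reset()
obs, rewards, dones = vec_env.step([1, 0, 1, 0])
print(len(obs), rewards, dones)
```

One call to `step` takes one action per environment and returns one observation, reward, and done flag per environment, which is what lets a single agent gather experience from several episodes at once.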

# Create the Model 🤖


In [None]:
model = PPO(
    policy="MlpPolicy",
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=1,
)

Using cuda device
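
Together with `n_envs=16`, the hyperparameters above determine how much data each PPO update sees. The arithmetic below is a sketch under one assumption: each iteration collects `n_steps` transitions from every environment. That assumption is consistent with this notebook's training log, which reports `total_timesteps` of 16384 after iteration 1. The derived variable names are illustrative.

```python
import math

n_envs = 16                 # from make_vec_env(..., n_envs=16)
n_steps = 1024              # steps collected per environment per rollout
batch_size = 64             # minibatch size
n_epochs = 4                # passes over each rollout
total_timesteps = int(1e6)  # from model.learn(total_timesteps=int(1e6))

rollout_size = n_envs * n_steps                         # transitions per iteration
minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_iteration = minibatches_per_epoch * n_epochs
iterations = math.ceil(total_timesteps / rollout_size)  # rollouts needed to reach the budget

print(rollout_size)                  # 16384
print(minibatches_per_epoch)         # 256
print(gradient_steps_per_iteration)  # 1024
print(iterations)                    # 62
```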


WARNING: The cell below takes about 20 mins to run.

In [None]:
model.learn(total_timesteps=int(1e6))
# Save the model
model_name = "ppo-lunarlander-v2"
model.save(model_name)

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 95.2     |
|    ep_rew_mean     | -192     |
| time/              |          |
|    fps             | 3121     |
|    iterations      | 1        |
|    time_elapsed    | 5        |
|    total_timesteps | 16384    |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 93.8        |
|    ep_rew_mean          | -154        |
| time/                   |             |
|    fps                  | 2201        |
|    iterations           | 2           |
|    time_elapsed         | 14          |
|    total_timesteps      | 32768       |
| train/                  |             |
|    approx_kl            | 0.005164008 |
|    clip_fraction        | 0.0305      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.38       |
|    explained_variance   | -0.00137    |
|    learning_rate        | 0.

# Evaluate the agent
💡 When you evaluate your agent, you should not use your training environment but create an evaluation environment.

In [None]:
eval_env = Monitor(gym.make("LunarLander-v2"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean reward: {mean_reward: .2f}, +/- {std_reward}")

mean reward:  266.98, +/- 23.314944875572113


In my case, I got a mean reward of `266.98 +/- 23.31` after training for 1 million steps. This is above the 200-point threshold for a solved episode, which means that our lunar lander agent is ready to land on the moon 🌛🥳
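
The two numbers returned by `evaluate_policy` are the mean and the standard deviation of the per-episode rewards. A minimal sketch of that summary, assuming a population standard deviation and using made-up episode returns (the `summarize` helper is hypothetical):

```python
import statistics


def summarize(episode_rewards):
    """Return (mean, population standard deviation) of per-episode rewards."""
    mean = statistics.fmean(episode_rewards)
    std = statistics.pstdev(episode_rewards)
    return mean, std


# Made-up episode returns, for illustration only (not the notebook's results).
rewards = [250.0, 270.0, 290.0]
mean_reward, std_reward = summarize(rewards)
lower_bound = mean_reward - std_reward

print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
print(f"conservative score (mean - std): {lower_bound:.2f}")
```

Subtracting the standard deviation from the mean is one conservative way to summarize an agent. Applied to the result reported above, it gives 266.98 - 23.31 = 243.67, which is still above 200.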

# Publish our trained model on the Hugging Face Hub 🔥

In [None]:
notebook_login()
!git config --global credential.helper store

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

💡 A good agent name is `{username}/{model_architecture}-{env_id}`

In [None]:
from stable_baselines3.common.vec_env import DummyVecEnv

env_id = "LunarLander-v2"
model_architecture = "PPO"
# repo_id is the id of the model repository on the Hugging Face Hub (repo_id = {username}/{repo_name})
repo_id = "wowthecoder/ppo-LunarLander-v2"
commit_msg = "Upload PPO LunarLander-v2 trained agent"
# Create the eval env and set the render mode = "rgb_array"
eval_env = DummyVecEnv([lambda: Monitor(gym.make(env_id, render_mode="rgb_array"))])

package_to_hub(
    model=model,
    model_name=model_name,
    model_architecture=model_architecture,
    env_id=env_id,
    eval_env=eval_env,
    repo_id=repo_id,
    commit_message=commit_msg,
)



policy.optimizer.pth:   0%|          | 0.00/88.4k [00:00<?, ?B/s]

policy.pth:   0%|          | 0.00/43.8k [00:00<?, ?B/s]

ppo-lunarlander-v2.zip:   0%|          | 0.00/148k [00:00<?, ?B/s]

Upload 4 LFS files:   0%|          | 0/4 [00:00<?, ?it/s]

pytorch_variables.pth:   0%|          | 0.00/864 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/wowthecoder/ppo-LunarLander-v2/commit/e2d2d89fca742a2ecc2b33019282cb9f482338bc', commit_message='Upload PPO LunarLander-v2 trained agent', commit_description='', oid='e2d2d89fca742a2ecc2b33019282cb9f482338bc', pr_url=None, pr_revision=None, pr_num=None)

# Extra: Load a saved LunarLander model from the Hub 🤗


In [None]:
repo_id = "satcos/ppo-LunarLander-v2"
filename = "ppo-LunarLander-v2.zip"

# The saved lr_schedule and clip_range could not be deserialized, so override them with placeholders
custom_objects = {
    "learning_rate": 0.0,
    "lr_schedule": lambda _: 0.0,
    "clip_range": lambda _: 0.0,
}

checkpoint = load_from_hub(repo_id, filename)
model = PPO.load(checkpoint, custom_objects=custom_objects, print_system_info=True)



In [None]:
eval_env = Monitor(gym.make("LunarLander-v2"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")