# Create a virtual display 🖥️
During the notebook, we'll need to generate a replay video. To do so in Colab, we need a virtual screen to render the environment (and thus record the frames).

The following cell installs the libraries, then creates and starts a virtual screen 🖥️

In [None]:
%%capture
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip install pyvirtualdisplay
!pip install pyglet==1.5.1

In [None]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x78c349d46920>

# Install the dependencies 🔽
Here is what we are installing:
* `gym`
* `gym-games`: Extra gym environments made with PyGame.
* `huggingface_hub`

Why do we install gym and not gymnasium, the maintained successor of gym? Because the gym-games environments we are using have not been updated for gymnasium yet.

The differences are:
* In gym we don't have `terminated` and `truncated`, only `done`.
* In gym, `env.step()` returns `state, reward, done, info`.
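
The difference between the two step signatures can be bridged with a small helper. The sketch below is only illustrative: `unpack_step` is a hypothetical name that belongs to neither gym nor gymnasium, and it assumes that a five-element result uses the order `state, reward, terminated, truncated, info`.

```python
def unpack_step(step_result):
    """Normalize an env.step() result to the old gym form (state, reward, done, info).

    Hypothetical helper for illustration; not part of gym or gymnasium.
    """
    if len(step_result) == 4:
        # Old gym API: already (state, reward, done, info)
        return tuple(step_result)
    if len(step_result) == 5:
        # Newer API: merge the two end-of-episode flags into a single `done`
        state, reward, terminated, truncated, info = step_result
        return state, reward, bool(terminated or truncated), info
    raise ValueError(f"Unexpected step result of length {len(step_result)}")


old_style = unpack_step(([0.1, 0.2], 1.0, False, {}))
new_style = unpack_step(([0.1, 0.2], 1.0, False, True, {}))
print(old_style[2], new_style[2])
```

The rest of this notebook sticks to the four-value form, so a helper like this is only needed if the code is later ported to gymnasium.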

In [None]:
!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit4/requirements-unit4.txt

Collecting git+https://github.com/ntasfi/PyGame-Learning-Environment.git (from -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit4/requirements-unit4.txt (line 1))
  Cloning https://github.com/ntasfi/PyGame-Learning-Environment.git to /tmp/pip-req-build-v8vnunp3
  Running command git clone --filter=blob:none --quiet https://github.com/ntasfi/PyGame-Learning-Environment.git /tmp/pip-req-build-v8vnunp3
  Resolved https://github.com/ntasfi/PyGame-Learning-Environment.git to commit 3dbe79dc0c35559bb441b9359948aabf9bb3d331
  Preparing metadata (setup.py) ... done
Collecting git+https://github.com/simoninithomas/gym-games (from -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit4/requirements-unit4.txt (line 2))
  Cloning https://github.com/simoninithomas/gym-games to /tmp/pip-req-build-wkkjhg5w
  Running command git clone --filter=blob:none --quiet https://github.com/simoninithomas/gym-games /tmp/pip-req-build-wkkjhg

# Import the packages 📦
In addition to importing the installed libraries, we also import:

`imageio`: A library that will help us to generate a replay video

In [None]:
from collections import deque

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler
from torch.distributions import Categorical

# Gym
import gym
import gym_pygame

# HuggingFace Hub
from huggingface_hub import notebook_login, login
import imageio

# First agent: Playing CartPole-v1 🤖


> A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum is placed upright on the cart and the goal is to balance the pole by applying forces in the left and right direction on the cart.

The goal is to push the cart left or right so that the pole stays balanced upright.

The episode ends if:

* The pole angle exceeds ±12°
* The cart position exceeds ±2.4
* The episode length reaches 500 timesteps

We get a reward 💰 of +1 for every timestep that the pole stays upright.
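
The three end-of-episode conditions can be written as a single check. This is a minimal sketch, not the environment's own code: the function and constant names are hypothetical, the pole angle is assumed to be given in radians, and the time limit is treated as hit once 500 steps have been taken (consistent with the maximum score of 500 obtained later in this notebook).

```python
import math

# Thresholds taken from the list above; the angle is converted to radians
# on the assumption that the pole angle is reported in radians.
MAX_POLE_ANGLE_RAD = math.radians(12)
MAX_CART_POSITION = 2.4
MAX_EPISODE_STEPS = 500


def episode_is_over(cart_position, pole_angle_rad, steps_taken):
    """Return True if any of the three end-of-episode conditions holds."""
    pole_fell = abs(pole_angle_rad) > MAX_POLE_ANGLE_RAD
    cart_left_track = abs(cart_position) > MAX_CART_POSITION
    time_limit_hit = steps_taken >= MAX_EPISODE_STEPS
    return pole_fell or cart_left_track or time_limit_hit


print(episode_is_over(0.0, 0.05, 10))   # False: upright, centered, early
print(episode_is_over(0.0, 0.30, 10))   # True: pole tilted about 17 degrees
print(episode_is_over(0.0, 0.0, 500))   # True: time limit reached
```

Since every surviving timestep is worth +1, the episode score is simply the number of steps taken before this check first returns `True`.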



In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


In [None]:
env_id = "CartPole-v1"
# Create the environment
env = gym.make(env_id, render_mode="rgb_array")
eval_env = gym.make(env_id, render_mode="rgb_array")

# Get state space and action space
s_size = env.observation_space.shape[0]
a_size = env.action_space.n



In [None]:
print("_____OBSERVATION SPACE_____ \n")
print("The State Space is: ", s_size)
print("Sample observation", env.observation_space.sample())

_____OBSERVATION SPACE_____ 

The State Space is:  4
Sample observation [4.5378919e+00 2.8901635e+37 2.0756003e-01 2.0504685e+38]


In [None]:
print("\n _____ACTION SPACE_____ \n")
print("The Action Space is: ", a_size)
print("Action Space Sample", env.action_space.sample())


 _____ACTION SPACE_____ 

The Action Space is:  2
Action Space Sample 1


## Build the Reinforce Architecture
* Two fully connected layers (fc1 and fc2).
* ReLU as activation function of fc1
* Softmax to output a probability distribution over actions
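
To make the last bullet concrete, here is a dependency-free sketch of what the softmax output and the sampling step do. It is not the PyTorch code used in this notebook: `softmax` and `sample_action` are hypothetical helpers written with the standard library, and the two input scores are made-up values standing in for the output of the last layer.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest score keeps exp() numerically stable
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Sample an action index and return it with its log-probability."""
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])


probs = softmax([2.0, 0.0])  # made-up scores for the two CartPole actions
action, log_prob = sample_action(probs)
print(probs, action, log_prob)
```

Sampling from the distribution, rather than always taking the most probable action, is what lets the agent keep exploring; the log-probability of the sampled action is the quantity the training loop later multiplies by the return.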

In [None]:
class Policy(nn.Module):
    def __init__(self, s_size, a_size, h_size):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.softmax(x, dim=1)

    def act(self, state):
        """
        Given a state, take an action
        """
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        probs = self.forward(state).cpu()
        m = Categorical(probs)
        action = m.sample()
        return action.item(), m.log_prob(action)

## Build the Reinforce Training algorithm

In [None]:
def reinforce(policy, optimizer, n_training_episodes, max_t, gamma, print_every):
    # Help us to calculate the score during the training
    scores_deque = deque(maxlen=100)
    scores = []

    # Line 3 of pseudocode
    for i_episode in range(1, n_training_episodes + 1):
        saved_log_probs = []
        rewards = []

        # reset the environment
        state = env.reset()

        # Line 4 of pseudocode (Generate an episode)
        for t in range(max_t):
            # get the action
            action, log_prob = policy.act(state)
            saved_log_probs.append(log_prob)
            # Take a step in the environment
            state, reward, done, _ = env.step(action)
            rewards.append(reward)
            if done:
                break

        total_rewards = sum(rewards)
        scores_deque.append(total_rewards)
        scores.append(total_rewards)

        # Line 5 and 6 of pseudocode: calculate the return
        returns = deque(maxlen=max_t)
        n_steps = len(rewards)

        # Calculate the sum of discounted rewards starting at timestep t
        # G_t = r_(t+1) + gamma * r_(t+2) + ... + gamma^(T-t-1) * r_T
        # We can do it backwards from n_steps - 1 to 0 to avoid recomputing redundant values
        # G_t = r_(t+1) + gamma * G_(t+1)
        # The queue "returns" will hold the returns in chronological order from t=0
        for t in range(n_steps-1, -1, -1):
            disc_return_t = returns[0] if len(returns) > 0 else 0
            returns.appendleft(rewards[t] + gamma * disc_return_t)

        # eps is the float32 machine epsilon (gap between 1.0 and the next representable float)
        # It avoids a division by zero when the standard deviation is 0
        eps = np.finfo(np.float32).eps.item()

        # standardization of the returns is employed to make training more stable
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + eps)

        # Line 7 of pseudocode (Calculate gradient delta)
        # Negative because we are performing gradient descent instead of ascent
        policy_loss = []
        for log_prob, disc_return in zip(saved_log_probs, returns):
            policy_loss.append(-log_prob * disc_return)
        policy_loss = torch.cat(policy_loss).sum()

        # Line 8: PyTorch prefers gradient descent
        # Set gradients to zero before the gradient descent step
        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()

        if i_episode % print_every == 0:
            # current_lr = optimizer.param_groups[0]['lr']
            print('Episode {}\tAverage Score: {:.2f}'
                  .format(i_episode, np.mean(scores_deque)))

    return scores
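
The return computation inside `reinforce` can be checked by hand on a tiny episode. The sketch below reimplements only that step with the standard library, under a few assumptions: the function names are hypothetical, rewards are indexed the way the `rewards` list is (position `t` holds the reward received after step `t`), and `eps=1e-8` is an arbitrary small constant rather than the float32 machine epsilon used above.

```python
import statistics


def discounted_returns(rewards, gamma):
    """Compute G_t = rewards[t] + gamma * G_(t+1) for every step, working backwards."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order (t = 0 first)
    return returns


def standardize(values, eps=1e-8):
    """Shift to zero mean and scale by the sample standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return [(v - mean) / (std + eps) for v in values]


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]
print(standardize(returns))
```

Working backwards means each return reuses the one after it, so the whole episode costs one pass instead of one sum per timestep. After standardization, early steps of this episode get a positive weight and late steps a negative one.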

## Train it (Finally!)


In [None]:
cartpole_hyperparams = {
    "h_size": 32,
    "n_training_episodes": 2000,
    "n_evaluation_episodes": 20,
    "max_t": 500,
    "gamma": 0.99,
    "lr": 1e-3,
    "env_id": env_id,
    "state_space": s_size,
    "action_space": a_size,
}

In [None]:
cartpole_policy = Policy(
    cartpole_hyperparams["state_space"],
    cartpole_hyperparams["action_space"],
    cartpole_hyperparams["h_size"],
).to(device)
cartpole_optimizer = optim.Adam(cartpole_policy.parameters(), lr=cartpole_hyperparams["lr"])
# step_scheduler = lr_scheduler.StepLR(cartpole_optimizer, step_size=1000, gamma=0.1)

In [None]:
scores = reinforce(
    policy=cartpole_policy,
    optimizer=cartpole_optimizer,
    n_training_episodes=cartpole_hyperparams["n_training_episodes"],
    max_t=cartpole_hyperparams["max_t"],
    gamma=cartpole_hyperparams["gamma"],
    print_every=100,
)



Episode 100	Average Score: 25.84
Episode 200	Average Score: 36.44
Episode 300	Average Score: 49.38
Episode 400	Average Score: 90.57
Episode 500	Average Score: 149.44
Episode 600	Average Score: 251.31
Episode 700	Average Score: 372.91
Episode 800	Average Score: 422.75
Episode 900	Average Score: 449.18
Episode 1000	Average Score: 471.65
Episode 1100	Average Score: 488.66
Episode 1200	Average Score: 445.34
Episode 1300	Average Score: 444.37
Episode 1400	Average Score: 467.13
Episode 1500	Average Score: 486.93
Episode 1600	Average Score: 461.92
Episode 1700	Average Score: 470.95
Episode 1800	Average Score: 479.91
Episode 1900	Average Score: 491.45
Episode 2000	Average Score: 493.30


## Define evaluation method 📝

In [None]:
def evaluate_agent(env, max_steps, n_eval_episodes, policy):
    """
    Evaluate the agent for ``n_eval_episodes`` episodes and return the mean and standard deviation of the episode rewards.
    :param env: The evaluation environment
    :param max_steps: Maximum number of steps per episode
    :param n_eval_episodes: Number of episodes to evaluate the agent
    :param policy: The Reinforce agent
    """
    episode_rewards = []
    for episode in range(n_eval_episodes):
        state = env.reset()
        total_rewards_ep = 0

        for step in range(max_steps):
            action, _ = policy.act(state)
            state, reward, done, _ = env.step(action)
            total_rewards_ep += reward
            if done:
                break
        episode_rewards.append(total_rewards_ep)

    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)
    return mean_reward, std_reward



## Evaluate our agent 📈

In [None]:
mean_reward, std_reward = evaluate_agent(
    env=eval_env,
    max_steps=cartpole_hyperparams["max_t"],
    n_eval_episodes=cartpole_hyperparams["n_evaluation_episodes"],
    policy=cartpole_policy,
)

print('Mean Reward: {:.2f} +/- {:.2f}'.format(mean_reward, std_reward))

Mean Reward: 500.00 +/- 0.00


## Publish trained model on the Hub 🔥

In [None]:
from huggingface_hub import HfApi, snapshot_download
from huggingface_hub.repocard import metadata_eval_result, metadata_save
from pathlib import Path
import datetime
import json
import imageio
import tempfile
import os

def record_video(env, policy, out_directory, fps=30):
    """
    Generate a replay video of the agent
    :param env: environment to record
    :param policy: policy used to select the actions
    :param out_directory: path of the video file to write
    :param fps: frames per second (with taxi-v3 and frozenlake-v1 we use 1)
    """

    images = []
    done = False
    state = env.reset()
    img = env.render()
    # image[0] to get rid of the batch size dimension
    images.append(np.array(img[0]))

    while not done:
        # Sample an action from the policy given the current state
        action, _ = policy.act(state)
        state, reward, done, info = env.step(action)  # We directly put next_state = state for recording logic
        img = env.render()
        images.append(np.array(img[0]))

    imageio.mimsave(out_directory, images, fps=fps)

def push_to_hub(repo_id,
                model,
                hyperparameters,
                eval_env,
                video_fps=30
                ):
  """
  Evaluate, Generate a video and Upload a model to Hugging Face Hub.
  This method does the complete pipeline:
  - It evaluates the model
  - It generates the model card
  - It generates a replay video of the agent
  - It pushes everything to the Hub

  :param repo_id: id of the model repository from the Hugging Face Hub
  :param model: the pytorch model we want to save
  :param hyperparameters: training hyperparameters
  :param eval_env: evaluation environment
  :param video_fps: frames per second of the replay video
  """

  _, repo_name = repo_id.split("/")
  api = HfApi()

  # Step 1: Create the repo
  repo_url = api.create_repo(
        repo_id=repo_id,
        exist_ok=True,
  )

  with tempfile.TemporaryDirectory() as tmpdirname:
    local_directory = Path(tmpdirname)

    # Step 2: Save the model
    torch.save(model, local_directory / "model.pt")

    # Step 3: Save the hyperparameters to JSON
    with open(local_directory / "hyperparameters.json", "w") as outfile:
      json.dump(hyperparameters, outfile)

    # Step 4: Evaluate the model and build JSON
    mean_reward, std_reward = evaluate_agent(eval_env,
                                            hyperparameters["max_t"],
                                            hyperparameters["n_evaluation_episodes"],
                                            model)

    # Get datetime
    eval_datetime = datetime.datetime.now()
    eval_form_datetime = eval_datetime.isoformat()

    evaluate_data = {
          "env_id": hyperparameters["env_id"],
          "mean_reward": mean_reward,
          "n_evaluation_episodes": hyperparameters["n_evaluation_episodes"],
          "eval_datetime": eval_form_datetime,
    }

    # Write a JSON file
    with open(local_directory / "results.json", "w") as outfile:
        json.dump(evaluate_data, outfile)

    # Step 5: Create the model card
    env_name = hyperparameters["env_id"]

    metadata = {}
    metadata["tags"] = [
          env_name,
          "reinforce",
          "reinforcement-learning",
          "custom-implementation",
          "deep-rl-class"
      ]

    # Add metrics
    eval = metadata_eval_result(
        model_pretty_name=repo_name,
        task_pretty_name="reinforcement-learning",
        task_id="reinforcement-learning",
        metrics_pretty_name="mean_reward",
        metrics_id="mean_reward",
        metrics_value=f"{mean_reward:.2f} +/- {std_reward:.2f}",
        dataset_pretty_name=env_name,
        dataset_id=env_name,
      )

    # Merges both dictionaries
    metadata = {**metadata, **eval}

    model_card = f"""
  # **Reinforce** Agent playing **{env_name}**

  This is a trained model of a **Reinforce** agent playing **{env_name}**.

  To learn to use this model and train yours check Unit 4 of the Deep Reinforcement Learning Course: https://huggingface.co/deep-rl-course/unit4/introduction

  """

    readme_path = local_directory / "README.md"
    readme = ""
    if readme_path.exists():
        with readme_path.open("r", encoding="utf8") as f:
          readme = f.read()
    else:
      readme = model_card


    with readme_path.open("w", encoding="utf-8") as f:
      f.write(readme)

    # Save our metrics to Readme metadata
    metadata_save(readme_path, metadata)

    # Step 6: Record a video
    video_path =  local_directory / "replay.mp4"
    record_video(eval_env, model, video_path, video_fps)

    # Step 7. Push everything to the Hub
    api.upload_folder(
          repo_id=repo_id,
          folder_path=local_directory,
          path_in_repo=".",
    )

    print(f"Your model is pushed to the Hub. You can view your model here: {repo_url}")

In [None]:
model_path = "cartpole_policy.pth"
torch.save(cartpole_policy.state_dict(), model_path)  # Save model weights

By using `push_to_hub`, you evaluate, record a replay, generate a model card of your agent, and push it to the Hub.
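
The evaluation summary is the simplest piece of that pipeline to try in isolation, without logging in or uploading anything. The sketch below mirrors the fields written to `results.json` in Step 4; `write_results` is a hypothetical helper, and the values passed to it are the CartPole results reported above.

```python
import datetime
import json
import tempfile
from pathlib import Path


def write_results(directory, env_id, mean_reward, n_evaluation_episodes):
    """Write an evaluation summary to results.json and return its path."""
    record = {
        "env_id": env_id,
        "mean_reward": float(mean_reward),  # plain float so json can serialize it
        "n_evaluation_episodes": n_evaluation_episodes,
        "eval_datetime": datetime.datetime.now().isoformat(),
    }
    path = Path(directory) / "results.json"
    with path.open("w", encoding="utf-8") as outfile:
        json.dump(record, outfile)
    return path


with tempfile.TemporaryDirectory() as tmpdirname:
    results_path = write_results(tmpdirname, "CartPole-v1", 500.0, 20)
    loaded = json.loads(results_path.read_text(encoding="utf-8"))
    print(loaded["env_id"], loaded["mean_reward"])
```

Writing everything into a temporary directory first, as `push_to_hub` does, means a single folder upload carries the model, the hyperparameters, the results, and the video together.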

In [None]:
notebook_login()


In [None]:
repo_id = "wowthecoder/reinforce-cartpole-v1"
eval_env = gym.make("CartPole-v1", render_mode="rgb_array")

push_to_hub(
    repo_id,
    cartpole_policy,  # The model we want to save
    cartpole_hyperparams,  # Hyperparameters
    eval_env,  # Evaluation environment
    video_fps=30
)


Your model is pushed to the Hub. You can view your model here: https://huggingface.co/wowthecoder/reinforce-cartpole-v1


# Second agent: PixelCopter 🚁
The gameplay is similar to Flappy Bird.

In [None]:
env_id = "Pixelcopter-PLE-v0"
env = gym.make(env_id)
eval_env = gym.make(env_id)
s_size = env.observation_space.shape[0]
a_size = env.action_space.n



In [None]:
print("_____OBSERVATION SPACE_____ \n")
print("The State Space is: ", s_size)
print("Sample observation", env.observation_space.sample())  # Get a random observation

_____OBSERVATION SPACE_____ 

The State Space is:  7
Sample observation [ 1.4721981  -0.05318568  0.26877016 -0.23274052  1.5832546  -0.24545674
  0.56390744]


In [None]:
print("\n _____ACTION SPACE_____ \n")
print("The Action Space is: ", a_size)
print("Action Space Sample", env.action_space.sample())  # Take a random action


 _____ACTION SPACE_____ 

The Action Space is:  2
Action Space Sample 0


The observation space (7) 👀:

* player y position
* player velocity
* player distance to floor
* player distance to ceiling
* next block x distance to player
* next block's top y location
* next block's bottom y location

The action space (2) 🎮:

* Up (press accelerator)
* Do nothing (don’t press accelerator)

The reward function 💰:

For each vertical block it passes, the agent gains a positive reward of +1. Each time a terminal state is reached, it receives a negative reward of -1.
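
A raw 7-element vector is hard to read while debugging, so it can help to give each entry a name. The sketch below is an illustration only: `PixelcopterObservation`, `label_observation`, and `episode_score` are hypothetical names, the field order is assumed to match the list above, and the score formula assumes the episode ends in a terminal state exactly once.

```python
from typing import NamedTuple, Sequence


class PixelcopterObservation(NamedTuple):
    """Named view of the 7 observation values (hypothetical helper)."""
    player_y: float
    player_velocity: float
    distance_to_floor: float
    distance_to_ceiling: float
    next_block_x_distance: float
    next_block_top_y: float
    next_block_bottom_y: float


def label_observation(values: Sequence[float]) -> PixelcopterObservation:
    """Wrap a raw 7-element observation, assuming the order listed above."""
    if len(values) != 7:
        raise ValueError(f"Expected 7 values, got {len(values)}")
    return PixelcopterObservation(*(float(v) for v in values))


def episode_score(blocks_passed: int) -> int:
    """Total reward under the stated rule: +1 per block passed, -1 at the terminal state."""
    return blocks_passed - 1


obs = label_observation([1.47, -0.05, 0.27, -0.23, 1.58, -0.25, 0.56])
print(obs.player_velocity, episode_score(blocks_passed=10))
```

Under these assumptions, an average score of about 20 corresponds to passing roughly 21 blocks per episode.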

### Define the new Policy 🧠
We need a deeper neural network since the environment is more complex.
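
To get a feel for how much larger the new network is, we can count the trainable parameters of both architectures. This is a rough sketch with a hypothetical helper, `mlp_parameter_count`; it uses the layer sizes from this notebook (4 inputs and 32 hidden units for CartPole, 7 inputs and two hidden layers of 64 units for PixelCopter, 2 outputs in both cases).

```python
def mlp_parameter_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first and output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total


cartpole_params = mlp_parameter_count([4, 32, 2])         # fc1, fc2
pixelcopter_params = mlp_parameter_count([7, 64, 64, 2])  # fc1, fc2, fc3
print(cartpole_params, pixelcopter_params)  # 226 4802
```

Most of the extra parameters come from the added 64 x 64 hidden layer, which by itself accounts for 4160 of the 4802.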

In [None]:
class Policy(nn.Module):
    def __init__(self, s_size, a_size, h_size):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, h_size)
        self.fc3 = nn.Linear(h_size, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return F.softmax(x, dim=1)

    def act(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        probs = self.forward(state).cpu()
        m = Categorical(probs)
        action = m.sample()
        return action.item(), m.log_prob(action)



## Define the hyperparameters

In [None]:
pixelcopter_hyperparameters = {
    "h_size": 64,
    "n_training_episodes": 50000,
    "n_evaluation_episodes": 10,
    "max_t": 10000,
    "gamma": 0.99,
    "lr": 1e-4,
    "env_id": env_id,
    "state_space": s_size,
    "action_space": a_size,
}

## Train the Agent

In [None]:
pixelcopter_policy = Policy(
    pixelcopter_hyperparameters["state_space"],
    pixelcopter_hyperparameters["action_space"],
    pixelcopter_hyperparameters["h_size"],
).to(device)

pixelcopter_optimizer = optim.Adam(pixelcopter_policy.parameters(), lr=pixelcopter_hyperparameters["lr"])

In [None]:
scores = reinforce(
    policy=pixelcopter_policy,
    optimizer=pixelcopter_optimizer,
    n_training_episodes=pixelcopter_hyperparameters["n_training_episodes"],
    max_t=pixelcopter_hyperparameters["max_t"],
    gamma=pixelcopter_hyperparameters["gamma"],
    print_every=1000,
)

Episode 1000	Average Score: 3.79
Episode 2000	Average Score: 5.42
Episode 3000	Average Score: 8.30
Episode 4000	Average Score: 10.03
Episode 5000	Average Score: 9.26
Episode 6000	Average Score: 12.23
Episode 7000	Average Score: 10.76
Episode 8000	Average Score: 16.67
Episode 9000	Average Score: 14.10
Episode 10000	Average Score: 14.61
Episode 11000	Average Score: 16.74
Episode 12000	Average Score: 16.33
Episode 13000	Average Score: 19.65
Episode 14000	Average Score: 18.46
Episode 15000	Average Score: 20.93
Episode 16000	Average Score: 21.29
Episode 17000	Average Score: 20.55
Episode 18000	Average Score: 20.33
Episode 19000	Average Score: 24.66
Episode 20000	Average Score: 22.92


## Publish trained model on the Hub 🔥

In [None]:
repo_id = "wowthecoder/reinforce-pixelcopter"
push_to_hub(
    repo_id,
    pixelcopter_policy,  # The model we want to save
    pixelcopter_hyperparameters,  # Hyperparameters
    eval_env,  # Evaluation environment
    video_fps=30
)