# Unit 6: Advantage Actor Critic (A2C) using Robotics Simulations with Panda-Gym 🤖

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/thumbnail.png"  alt="Thumbnail"/>

In this notebook, you'll learn to use A2C with [Panda-Gym](https://github.com/qgallouedec/panda-gym). You're going **to train a robotic arm** (Franka Emika Panda robot) to perform a task:

- `Reach`: the robot must place its end-effector at a target position.

After that, you'll be able **to train on other robotics tasks**.


### 🎮 Environments:

- [Panda-Gym](https://github.com/qgallouedec/panda-gym)

### 📚 RL-Library:

- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/)

We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).

## Objectives of this notebook 🏆

At the end of the notebook, you will:

- Be able to use **Panda-Gym**, the environment library.
- Be able to **train robots using A2C**.
- Understand why **we need to normalize the input**.
- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.




## This notebook is from the Deep Reinforcement Learning Course
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg" alt="Deep RL Course illustration"/>

In this free course, you will:

- 📖 Study Deep Reinforcement Learning in **theory and practice**.
- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.
- 🤖 Train **agents in unique environments**

And more! Check 📚 the syllabus 👉 https://simoninithomas.github.io/deep-rl-course

Don’t forget to **<a href="http://eepurl.com/ic5ZUD">sign up to the course</a>** (we collect your email to **send you the links when each Unit is published and give you information about the challenges and updates**).


The best way to keep in touch is to join our Discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5

## Prerequisites 🏗️
Before diving into the notebook, you need to:

🔲 📚 Study [Actor-Critic methods by reading Unit 6](https://huggingface.co/deep-rl-course/unit6/introduction) 🤗  

# Let's train our first robots 🤖

To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process),  you need to push your trained model to the Hub and get the following results:

- `PandaReachDense-v3`: get a result of >= -3.5.

To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
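
As a quick sanity check, the leaderboard formula can be reproduced by hand. The sketch below plugs in the evaluation numbers printed in the evaluation cell of this notebook (mean reward -0.24, standard deviation 0.09). The helper names `leaderboard_result` and `passes_certification` are hypothetical, chosen only for illustration.

```python
# Threshold required for PandaReachDense-v3 in this unit
REQUIRED_RESULT = -3.5


def leaderboard_result(mean_reward: float, std_reward: float) -> float:
    """Leaderboard result: mean reward minus the standard deviation of the reward."""
    return mean_reward - std_reward


def passes_certification(mean_reward: float, std_reward: float) -> bool:
    """Return True when the result reaches the required threshold."""
    return leaderboard_result(mean_reward, std_reward) >= REQUIRED_RESULT


# Example values taken from the evaluation cell of this notebook
result = leaderboard_result(-0.24, 0.09)
print(f"result = {result:.2f}")
print("certified:", passes_certification(-0.24, 0.09))
```

Subtracting the standard deviation penalizes agents whose performance varies a lot between episodes, so a stable agent scores better than an erratic one with the same mean.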

For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process

## Set the GPU 💪
- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg" alt="GPU Step 1">

- `Hardware Accelerator > GPU`

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg" alt="GPU Step 2">

## Create a virtual display 🔽

During the notebook, we'll need to generate a replay video. To do so in Colab, **we need a virtual screen to render the environment** (and thus record the frames).

The following cell installs the libraries, then creates and runs a virtual screen 🖥

In [None]:
%%capture
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay

In [None]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x7f8173068c50>

### Install dependencies 🔽

The first step is to install the dependencies. We'll install several:
- `gymnasium`
- `panda-gym`: Contains the robotics arm environments.
- `stable-baselines3`: The SB3 deep reinforcement learning library.
- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face 🤗 Hub.
- `huggingface_hub`: Library allowing anyone to work with the Hub repositories.

⏲ The installation can **take 10 minutes**.

In [None]:
!pip install stable-baselines3[extra]
!pip install gymnasium


In [None]:
!pip install huggingface_sb3
!pip install huggingface_hub
!pip install panda_gym

Collecting huggingface_sb3
  Downloading huggingface_sb3-3.0-py3-none-any.whl.metadata (6.3 kB)
Downloading huggingface_sb3-3.0-py3-none-any.whl (9.7 kB)
Installing collected packages: huggingface_sb3
Successfully installed huggingface_sb3-3.0
Collecting panda_gym
  Downloading panda_gym-3.0.7-py3-none-any.whl.metadata (4.3 kB)
Collecting pybullet (from panda_gym)
  Downloading pybullet-3.2.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.8 kB)
Downloading panda_gym-3.0.7-py3-none-any.whl (23 kB)
Downloading pybullet-3.2.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (103.2 MB)
Installing collected packages: pybullet, panda_gym
Successfully installed panda_gym-3.0.7 pybullet-3.2.7


## Import the packages 📦

In [None]:
import os

import gymnasium as gym
import panda_gym

from huggingface_sb3 import load_from_hub, package_to_hub

from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines3.common.env_util import make_vec_env

from huggingface_hub import notebook_login

In [None]:
import os
import inspect
import huggingface_sb3
package_path = os.path.dirname(inspect.getfile(huggingface_sb3))
print(package_path)

/usr/local/lib/python3.11/dist-packages/huggingface_sb3




## PandaReachDense-v3 🦾

The agent we're going to train is a robotic arm that needs to be controlled (moving the arm and using the end-effector).

In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.

In `PandaReach`, the robot must place its end-effector at a target position (green ball).

We're going to use the dense version of this environment. It means we'll get a *dense reward function* that **will provide a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). This contrasts with a *sparse reward function*, where the environment **returns a reward if and only if the task is completed**.
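
To make the difference concrete, here is a minimal sketch of the two reward styles for a reaching task. It is an illustration under stated assumptions, not panda-gym's code: the dense reward is taken to be the negative distance to the target, the sparse reward follows one common convention (-1 until success, 0 on success), and `DISTANCE_THRESHOLD` is a hypothetical value.

```python
import math

# Hypothetical success threshold (in meters), chosen for illustration
DISTANCE_THRESHOLD = 0.05


def dense_reward(achieved, desired):
    """Negative distance to the target: a signal at every timestep."""
    return -math.dist(achieved, desired)


def sparse_reward(achieved, desired):
    """0 when the target is reached, -1 otherwise: no hint about progress."""
    return 0.0 if math.dist(achieved, desired) <= DISTANCE_THRESHOLD else -1.0


target = (0.0, 0.0, 0.0)
far, near = (0.3, 0.0, 0.4), (0.01, 0.0, 0.02)

for position in (far, near):
    print(position, dense_reward(position, target), sparse_reward(position, target))
```

With the dense version, moving from `far` to `near` already increases the reward, which gives the agent a gradient to follow long before it first reaches the target.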

Also, we're going to use *end-effector displacement control*, which means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/robotics.jpg"  alt="Robotics"/>


This way **the training will be easier**.



### Create the environment

#### The environment 🎮

In `PandaReachDense-v3` the robotic arm must place its end-effector at a target position (green ball).

In [None]:
env_id = "PandaReachDense-v3"

# Create the env
env = gym.make(env_id)

# Get the state space and action space
# Note: `.shape` is None for a Dict observation space such as this one
s_size = env.observation_space.shape
a_size = env.action_space

In [None]:
print("_____OBSERVATION SPACE_____ \n")
print("The State Space is: ", s_size)
print("Sample observation", env.observation_space.sample()) # Get a random observation

_____OBSERVATION SPACE_____ 

The State Space is:  None
Sample observation {'achieved_goal': array([5.496774 , 8.1737585, 9.454633 ], dtype=float32), 'desired_goal': array([5.7971125, 5.557104 , 6.5160303], dtype=float32), 'observation': array([-5.3771296,  8.039149 , -2.0854757, -5.0934124,  6.765815 ,
        4.605464 ], dtype=float32)}


The observation space **is a dictionary with 3 different elements**:
- `achieved_goal`: (x,y,z) the current position of the end-effector.
- `desired_goal`: (x,y,z) the target position for the end-effector.
- `observation`: position (x,y,z) and velocity of the end-effector (vx, vy, vz).

Because the observation is a dictionary, **we need to use `MultiInputPolicy` instead of `MlpPolicy`**.
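
To make the structure concrete, here is a minimal sketch in plain Python of how a dictionary observation can be turned into a single feature vector, which is conceptually what a multi-input policy has to do before feeding a network. The values are made up and `flatten_observation` is a hypothetical helper; this is not the Stable-Baselines3 implementation.

```python
from typing import Dict, List

# Made-up observation with the same three keys as PandaReachDense-v3
observation: Dict[str, List[float]] = {
    "achieved_goal": [0.04, 0.00, 0.20],
    "desired_goal": [0.10, -0.05, 0.15],
    "observation": [0.04, 0.00, 0.20, 0.0, 0.0, 0.0],
}


def flatten_observation(obs: Dict[str, List[float]]) -> List[float]:
    """Concatenate every entry, in sorted key order, into one flat vector."""
    flat: List[float] = []
    for key in sorted(obs):
        flat.extend(obs[key])
    return flat


features = flatten_observation(observation)
print(len(features))  # 3 + 3 + 6 = 12 values
```

A policy that expects a single flat vector cannot consume the dictionary directly, which is why the policy type has to match the observation space.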

In [None]:
print("\n _____ACTION SPACE_____ \n")
print("The Action Space is: ", a_size)
print("Action Space Sample", env.action_space.sample()) # Take a random action


 _____ACTION SPACE_____ 

The Action Space is:  Box(-1.0, 1.0, (3,), float32)
Action Space Sample [-0.59066993 -0.2438499   0.21995263]


The action space is a vector of 3 values, each in [-1, 1]:
- The displacement of the end-effector along x, y, and z.
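
As a rough illustration of displacement control, the sketch below applies a 3-value action to an end-effector position. It is not panda-gym's code: `apply_action` is a hypothetical helper and `MAX_STEP` is an assumed scaling factor that limits how far the end-effector moves in one step.

```python
from typing import Sequence, Tuple

# Hypothetical maximum displacement per step (in meters), for illustration only
MAX_STEP = 0.05


def apply_action(position: Sequence[float], action: Sequence[float]) -> Tuple[float, ...]:
    """Move the end-effector by a scaled (dx, dy, dz) displacement."""
    # Keep each component inside the action bounds [-1, 1]
    clipped = [max(-1.0, min(1.0, a)) for a in action]
    return tuple(p + MAX_STEP * a for p, a in zip(position, clipped))


start = (0.0, 0.0, 0.2)
new_position = apply_action(start, (1.0, -0.5, 2.0))  # 2.0 is clipped to 1.0
print(new_position)
```

The agent only decides where the end-effector should go; working out the joint motions that achieve it is left to the simulator, which is part of why this control mode is easier to learn.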

### Normalize observation and rewards

A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html).

For that purpose, the `VecNormalize` wrapper computes a running average and standard deviation of the input features.

We also normalize rewards with this same wrapper by setting `norm_reward=True`.
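
To show what "running average and standard deviation" means in practice, here is a minimal sketch for a single scalar feature. It uses Welford's algorithm and is not the Stable-Baselines3 implementation. `RunningNormalizer` is a hypothetical class, and its `clip` argument only mirrors the idea behind `clip_obs`.

```python
import math


class RunningNormalizer:
    """Track a running mean and variance (Welford's algorithm) for one feature."""

    def __init__(self, clip: float = 10.0, epsilon: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.clip = clip
        self.epsilon = epsilon

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / self.count if self.count > 0 else 1.0

    def normalize(self, value: float) -> float:
        scaled = (value - self.mean) / math.sqrt(self.variance + self.epsilon)
        return max(-self.clip, min(self.clip, scaled))


normalizer = RunningNormalizer(clip=10.0)
for raw in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    normalizer.update(raw)

print(normalizer.mean, normalizer.variance)
print(normalizer.normalize(9.0))
print(normalizer.normalize(1000.0))  # clipped to the upper bound
```

Because the statistics are learned during training, they are part of the agent: the same mean and variance must be saved and reused at evaluation time, otherwise the policy sees inputs on a different scale.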

[You should check the documentation to fill this cell](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)

In [None]:
env = make_vec_env(env_id, n_envs=4)

# Adding this wrapper to normalize the observation and the reward
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)

### Create the A2C Model 🤖

For more information about A2C implementation with StableBaselines3 check: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#notes

To find the best parameters I checked the [official trained agents by Stable-Baselines3 team](https://huggingface.co/sb3).

In [None]:
model = A2C(policy = "MultiInputPolicy",
            env = env,
            verbose=1)

Using cuda device


### Train the A2C agent 🏃
- Let's train our agent for 1,000,000 timesteps. Don't forget to use the GPU on Colab. It will take approximately 25-40 minutes.

In [None]:
model.learn(1_000_000)

Streaming output truncated to the last 5000 lines.
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 2.68     |
|    ep_rew_mean        | -0.205   |
|    success_rate       | 1        |
| time/                 |          |
|    fps                | 396      |
|    iterations         | 23800    |
|    time_elapsed       | 1201     |
|    total_timesteps    | 476000   |
| train/                |          |
|    entropy_loss       | -0.744   |
|    explained_variance | 0.983    |
|    learning_rate      | 0.0007   |
|    n_updates          | 23799    |
|    policy_loss        | 0.00266  |
|    std                | 0.317    |
|    value_loss         | 4.71e-05 |
------------------------------------

<stable_baselines3.a2c.a2c.A2C at 0x7caa1204bad0>

In [None]:
# Save the model and the VecNormalize statistics
model.save("a2c-PandaReachDense-v3")
env.save("vec_normalize.pkl")
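
The training log printed by `model.learn` above can be sanity-checked with a little arithmetic. The sketch below assumes 4 parallel environments (as passed to `make_vec_env`) and 5 steps per environment per update, which is the default `n_steps` for A2C in Stable-Baselines3. The time estimate is only a rough guide, since throughput depends on the hardware.

```python
# Values read from the training log above
total_timesteps = 476_000
iterations = 23_800
fps = 396

# Assumptions: 4 parallel environments (as in make_vec_env) and
# 5 steps per environment per update (the A2C default for n_steps in SB3)
n_envs = 4
n_steps = 5

timesteps_per_iteration = n_envs * n_steps
print(timesteps_per_iteration * iterations == total_timesteps)

# Rough time estimate for the full run at the logged speed
target_timesteps = 1_000_000
estimated_minutes = target_timesteps / fps / 60
print(f"~{estimated_minutes:.0f} minutes")
```

Each iteration therefore consumes 20 timesteps, which explains why the iteration counter grows so much faster than in algorithms that collect long rollouts before each update.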

### Evaluate the agent 📈
- Now that our agent is trained, we need to **check its performance**.
- Stable-Baselines3 provides a method to do that: `evaluate_policy`

In [None]:
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize, VecEnv, VecMonitor, is_vecenv_wrapped
from stable_baselines3.common.monitor import Monitor
# Create the evaluation environment (Monitor records episode rewards and lengths)
eval_env = Monitor(gym.make("PandaReachDense-v3"))
eval_env = DummyVecEnv([lambda: eval_env ])

# Load the saved normalization statistics
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)


# We need to override the render_mode
eval_env.render_mode = "rgb_array"

# Do not update the normalization statistics at test time
eval_env.training = False
# Reward normalization is not needed at test time
eval_env.norm_reward = False

# Load the agent
model = A2C.load("a2c-PandaReachDense-v3")

mean_reward, std_reward = evaluate_policy(model, eval_env)

print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")

Mean reward = -0.24 +/- 0.09


### Publish your trained model on the Hub 🔥
Now that we've seen good results after training, we can publish our trained model on the Hub with one line of code.

📚 The library's documentation 👉 https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20


By using `package_to_hub`, as mentioned in the previous units, **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.

This way:
- You can **showcase your work** 🔥
- You can **visualize your agent playing** 👀
- You can **share with the community an agent that others can use** 💾
- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard


To be able to share your model with the community there are three more steps to follow:

1️⃣ (If it's not already done) create a Hugging Face account ➡ https://huggingface.co/join

2️⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.
- Create a new token (https://huggingface.co/settings/tokens) **with write role**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">

- Copy the token
- Run the cell below and paste the token

In [None]:
notebook_login()
!git config --global credential.helper store


If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`

3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the `package_to_hub()` function.

For this environment, **running this cell can take approximately 10min**

In [None]:
import datetime
import json
import os
import shutil
import tempfile
import zipfile
from pathlib import Path
from typing import Any, Dict, Optional, Tuple, Union

import gymnasium as gym
import numpy as np
import stable_baselines3
from huggingface_hub import HfApi, upload_folder
from huggingface_hub.repocard import metadata_eval_result, metadata_save
from stable_baselines3.common.base_class import BaseAlgorithm
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import (
    DummyVecEnv,
    VecEnv,
    VecVideoRecorder,
    unwrap_vec_normalize,
)
from wasabi import Printer

msg = Printer()


def _generate_config(model: str, local_path: Path) -> None:
    """
    Generate a config.json file containing information
    about the agent and the environment
    :param model: name of the model zip file
    :param local_path: path of the local directory
    """
    unzipped_model_folder = model

    # Check if the user forgot to mention the extension of the file
    if not model.endswith(".zip"):
        model += ".zip"

    # Step 1: Unzip the model
    with zipfile.ZipFile(local_path / model, "r") as zip_ref:
        zip_ref.extractall(local_path / unzipped_model_folder)

    # Step 2: Get data (JSON containing infos) and read it
    with open(Path.joinpath(local_path, unzipped_model_folder, "data")) as json_file:
        data = json.load(json_file)
        # Add system_info elements to our JSON
        data["system_info"] = stable_baselines3.get_system_info(print_info=False)[0]

    # Step 3: Write our config.json file
    with open(local_path / "config.json", "w") as outfile:
        json.dump(data, outfile)


def _evaluate_agent(
    model: BaseAlgorithm,
    eval_env: VecEnv,
    n_eval_episodes: int,
    is_deterministic: bool,
    local_path: Path,
) -> Tuple[float, float]:
    """
    Evaluate the agent using SB3 evaluate_policy method
    and create a results.json

    :param model: name of the model object
    :param eval_env: environment used to evaluate the agent
    :param n_eval_episodes: number of evaluation episodes
    :param is_deterministic: use deterministic or stochastic actions
    :param local_path: path of the local repository
    """
    # Step 1: Evaluate the agent
    mean_reward, std_reward = evaluate_policy(
        model, eval_env, n_eval_episodes, is_deterministic
    )

    # Step 2: Create json evaluation
    # First get datetime
    eval_datetime = datetime.datetime.now()
    eval_form_datetime = eval_datetime.isoformat()

    evaluate_data = {
        "mean_reward": mean_reward,
        "std_reward": std_reward,
        "is_deterministic": is_deterministic,
        "n_eval_episodes": n_eval_episodes,
        "eval_datetime": eval_form_datetime,
    }

    # Step 3: Write a JSON file
    with open(local_path / "results.json", "w") as outfile:
        json.dump(evaluate_data, outfile)

    return mean_reward, std_reward

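def entry_point(env_id: str) -> str:
    """
    Return the entry point of a registered environment
    :param env_id: name of the environment
    """
    try:
        return str(gym.envs.registry[env_id].entry_point)
    except KeyError:
        # Fall back to the legacy gym registry
        import gym as gym26
        return str(gym26.envs.registry[env_id].entry_point)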

def is_atari(env_id: str) -> bool:
    """
    Check if the environment is an Atari one
    (Taken from RL-Baselines3-zoo)
    :param env_id: name of the environment
    """
    return "AtariEnv" in entry_point(env_id)


def generate_replay(
    model: BaseAlgorithm,
    eval_env: VecEnv,
    video_length: int,
    is_deterministic: bool,
    local_path: Path,
):
    """
    Generate a replay video of the agent
    :param model: trained model
    :param eval_env: environment used to evaluate the agent
    :param video_length: length of the video (in timesteps)
    :param is_deterministic: use deterministic or stochastic actions
    :param local_path: path of the local repository
    """
    # This is another temporary directory for video outputs
    # SB3 created a -step-0-to-... meta files as well as other
    # artifacts which we don't want in the repo.
    with tempfile.TemporaryDirectory() as tmpdirname:
        # Step 1: Create the VecVideoRecorder
        env = VecVideoRecorder(
            eval_env,
            tmpdirname,
            record_video_trigger=lambda x: x == 0,
            video_length=video_length,
            name_prefix="",
        )

        obs = env.reset()
        lstm_states = None
        episode_starts = np.ones((env.num_envs,), dtype=bool)

        try:
            for _ in range(video_length):
                action, lstm_states = model.predict(
                    obs,
                    state=lstm_states,
                    episode_start=episode_starts,
                    deterministic=is_deterministic,
                )
                obs, _, episode_starts, _ = env.step(action)

            # Save the video
            env.close()

            # Convert the video with x264 codec
            # inp = env.video_recorder.path
            inp = env.video_path
            out = os.path.join(local_path, "replay.mp4")
            os.system(f"ffmpeg -y -i {inp} -vcodec h264 {out}")

        except KeyboardInterrupt:
            pass
        except Exception as e:
            msg.fail(str(e))
            # Add a message for video
            msg.fail(
                "We are unable to generate a replay of your agent, "
                "the package_to_hub process continues"
            )
            msg.fail(
                "Please open an issue at "
                "https://github.com/huggingface/huggingface_sb3/issues"
            )


def generate_metadata(
    model_name: str, env_id: str, mean_reward: float, std_reward: float
) -> Dict[str, Any]:
    """
    Define the tags for the model card
    :param model_name: name of the model
    :param env_id: name of the environment
    :param mean_reward: mean reward of the agent
    :param std_reward: standard deviation of the mean reward of the agent
    """
    metadata = {}
    metadata["library_name"] = "stable-baselines3"
    metadata["tags"] = [
        env_id,
        "deep-reinforcement-learning",
        "reinforcement-learning",
        "stable-baselines3",
    ]

    # Add metrics
    eval = metadata_eval_result(
        model_pretty_name=model_name,
        task_pretty_name="reinforcement-learning",
        task_id="reinforcement-learning",
        metrics_pretty_name="mean_reward",
        metrics_id="mean_reward",
        metrics_value=f"{mean_reward:.2f} +/- {std_reward:.2f}",
        dataset_pretty_name=env_id,
        dataset_id=env_id,
    )

    # Merges both dictionaries
    metadata = {**metadata, **eval}

    return metadata


def _generate_model_card(
    model_name: str, env_id: str, mean_reward: float, std_reward: float
) -> Tuple[str, Dict[str, Any]]:
    """
    Generate the model card for the Hub
    :param model_name: name of the model
    :param env_id: name of the environment
    :param mean_reward: mean reward of the agent
    :param std_reward: standard deviation of the mean reward of the agent
    """
    # Step 1: Select the tags
    metadata = generate_metadata(model_name, env_id, mean_reward, std_reward)

    # Step 2: Generate the model card
    model_card = f"""
# **{model_name}** Agent playing **{env_id}**
This is a trained model of a **{model_name}** agent playing **{env_id}**
using the [stable-baselines3 library](https://github.com/DLR-RM/stable-baselines3).
"""

    model_card += """
## Usage (with Stable-baselines3)
TODO: Add your code


```python
from stable_baselines3 import ...
from huggingface_sb3 import load_from_hub

...
```
"""

    return model_card, metadata


def _save_model_card(
    local_path: Path, generated_model_card: str, metadata: Dict[str, Any]
):
    """Saves a model card for the repository.
    :param local_path: repository directory
    :param generated_model_card: model card generated by _generate_model_card()
    :param metadata: metadata
    """
    readme_path = local_path / "README.md"
    readme = ""
    if readme_path.exists():
        with readme_path.open("r", encoding="utf8") as f:
            readme = f.read()
    else:
        readme = generated_model_card

    with readme_path.open("w", encoding="utf-8") as f:
        f.write(readme)

    # Save our metrics to Readme metadata
    metadata_save(readme_path, metadata)


def _add_logdir(local_path: Path, logdir: Path):
    """Adds a logdir to the repository.
    :param local_path: repository directory
    :param logdir: logdir directory
    """
    if logdir.exists() and logdir.is_dir():
        # Add the logdir to the repository under new dir called logs
        repo_logdir = local_path / "logs"

        # Delete current logs if they exist
        if repo_logdir.exists():
            shutil.rmtree(repo_logdir)

        # Copy logdir into repo logdir
        shutil.copytree(logdir, repo_logdir)


def package_to_hub(
    model: BaseAlgorithm,
    model_name: str,
    model_architecture: str,
    env_id: str,
    eval_env: Union[VecEnv, gym.Env],
    repo_id: str,
    commit_message: str,
    is_deterministic: bool = True,
    n_eval_episodes=10,
    token: Optional[str] = None,
    video_length=1000,
    logs=None,
):
    """
    Evaluate, Generate a video and Upload a model to Hugging Face Hub.
    This method does the complete pipeline:
    - It evaluates the model
    - It generates the model card
    - It generates a replay video of the agent
    - It pushes everything to the hub

    :param model: trained model
    :param model_name: name of the model zip file
    :param model_architecture: name of the architecture of your model
        (DQN, PPO, A2C, SAC...)
    :param env_id: name of the environment
    :param eval_env: environment used to evaluate the agent
    :param repo_id: id of the model repository from the Hugging Face Hub
    :param commit_message: commit message
    :param is_deterministic: use deterministic or stochastic actions (by default: True)
    :param n_eval_episodes: number of evaluation episodes (by default: 10)
    :param token: authentication token (See https://huggingface.co/settings/token)
        Caution: your token must remain secret. (See https://huggingface.co/docs/hub/security-tokens)
    :param video_length: length of the video (in timesteps)
    :param logs: directory on local machine of tensorboard logs you'd like to upload
    """

    # Autowrap, so we only have VecEnv afterward
    if not isinstance(eval_env, VecEnv):
        eval_env = DummyVecEnv([lambda: eval_env])

    msg.info(
        "This function will save, evaluate, generate a video of your agent, "
        "create a model card and push everything to the hub. "
        "It might take up to 1min. \n "
        "This is a work in progress: if you encounter a bug, please open an issue."
    )

    repo_url = HfApi().create_repo(
        repo_id=repo_id,
        token=token,
        private=False,
        exist_ok=True,
    )

    with tempfile.TemporaryDirectory() as tmpdirname:
        tmpdirname = Path(tmpdirname)

        # Step 1: Save the model
        model.save(tmpdirname / model_name)

        # Retrieve VecNormalize wrapper if it exists
        # we need to save the statistics
        maybe_vec_normalize = unwrap_vec_normalize(eval_env)

        # Save the normalization
        if maybe_vec_normalize is not None:
            maybe_vec_normalize.save(tmpdirname / "vec_normalize.pkl")
            # Do not update the stats at test time
            maybe_vec_normalize.training = False
            # Reward normalization is not needed at test time
            maybe_vec_normalize.norm_reward = False

        # We create two versions of the environment:
        # one for video generation and one for evaluation
        replay_env = eval_env

        # Deterministic by default (except for Atari)
        if is_deterministic:
            is_deterministic = not is_atari(env_id)

        # Step 2: Create a config file
        _generate_config(model_name, tmpdirname)

        # Step 3: Evaluate the agent
        mean_reward, std_reward = _evaluate_agent(
            model, eval_env, n_eval_episodes, is_deterministic, tmpdirname
        )

        # Step 4: Generate a video
        generate_replay(model, replay_env, video_length, is_deterministic, tmpdirname)

        # Step 5: Generate the model card
        generated_model_card, metadata = _generate_model_card(
            model_architecture, env_id, mean_reward, std_reward
        )
        _save_model_card(tmpdirname, generated_model_card, metadata)

        # Step 6: Add logs if needed
        if logs:
            _add_logdir(tmpdirname, Path(logs))

        msg.info(f"Pushing repo {repo_id} to the Hugging Face Hub")

        repo_url = upload_folder(
            repo_id=repo_id,
            folder_path=tmpdirname,
            path_in_repo="",
            commit_message=commit_message,
            token=token,
        )

        msg.info(
            f"Your model is pushed to the Hub. You can view your model here: {repo_url}"
        )
    return repo_url


def _copy_file(filepath: Path, dst_directory: Path):
    """
    Copy the file to the correct directory
    :param filepath: path of the file
    :param dst_directory: destination directory
    """
    dst = dst_directory / filepath.name
    # Copy from the full source path so this works from any working directory
    shutil.copy(str(filepath), str(dst))


def push_to_hub(
    repo_id: str,
    filename: str,
    commit_message: str,
    token: Optional[str] = None,
):
    """
    Upload a model to Hugging Face Hub.
    :param repo_id: repo_id: id of the model repository from the Hugging Face Hub
    :param filename: name of the model zip or mp4 file from the repository
    :param commit_message: commit message
    :param token: authentication token (See https://huggingface.co/settings/token)
        Caution: your token must remain secret. (See https://huggingface.co/docs/hub/security-tokens)
    """

    repo_url = HfApi().create_repo(
        repo_id=repo_id,
        token=token,
        private=False,
        exist_ok=True,
    )

    # Add the model
    with tempfile.TemporaryDirectory() as tmpdirname:
        tmpdirname = Path(tmpdirname)
        filename_path = os.path.abspath(filename)
        _copy_file(Path(filename_path), tmpdirname)
        _save_model_card(tmpdirname, "", {})

        msg.info(f"Pushing repo {repo_id} to the Hugging Face Hub")
        repo_url = upload_folder(
            repo_id=repo_id,
            folder_path=tmpdirname,
            path_in_repo="",
            commit_message=commit_message,
            token=token,
        )

    msg.good(
        f"Your model has been uploaded to the Hub, you can find it here: {repo_url}"
    )
    return repo_url


In [None]:
eval_env = gym.make("PandaReachDense-v3")
eval_env = DummyVecEnv([lambda: eval_env ])

# Load the saved normalization statistics
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)


# We need to override the render_mode
eval_env.render_mode = "rgb_array"

# Do not update the normalization statistics at test time
eval_env.training = False
# Reward normalization is not needed at test time
eval_env.norm_reward = False

generate_replay(model, eval_env, 1000, True, "/content")

Saving video to /tmp/tmp2w03dmla/-step-0-to-step-1000.mp4
Moviepy - Building video /tmp/tmp2w03dmla/-step-0-to-step-1000.mp4.
Moviepy - Writing video /tmp/tmp2w03dmla/-step-0-to-step-1000.mp4





Moviepy - Done !
Moviepy - video ready /tmp/tmp2w03dmla/-step-0-to-step-1000.mp4


In [None]:
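# Caution: this import replaces the `package_to_hub` defined in the cell above
# with the library version. Skip it to keep the locally defined version.
from huggingface_sb3 import package_to_hub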

package_to_hub(
    model=model,
    model_name=f"a2c-{env_id}",
    model_architecture="A2C",
    env_id=env_id,
    eval_env=eval_env,
    repo_id=f"sanjay-906/a2c-{env_id}", # Change the username
    commit_message="sec commit",
)

ℹ This function will save, evaluate, generate a video of your agent,
create a model card and push everything to the hub. It might take up to 1min.
This is a work in progress: if you encounter a bug, please open an issue.

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.

Saving video to /tmp/tmpzqvhk9ax/-step-0-to-step-1000.mp4
Moviepy - Building video /tmp/tmpzqvhk9ax/-step-0-to-step-1000.mp4.
Moviepy - Writing video /tmp/tmpzqvhk9ax/-step-0-to-step-1000.mp4
Moviepy - Done !
Moviepy - video ready /tmp/tmpzqvhk9ax/-step-0-to-step-1000.mp4
✘ 'DummyVecEnv' object has no attribute 'video_recorder'
✘ We are unable to generate a replay of your agent, the package_to_hub process continues
✘ Please open an issue at https://github.com/huggingface/huggingface_sb3/issues
ℹ Pushing repo sanjay-906/a2c-PandaReachDense-v3 to the Hugging Face Hub
ℹ Your model is pushed to the Hub. You can view your model here:
https://huggingface.co/sanjay-906/a2c-PandaReachDense-v3/tree/main/


CommitInfo(commit_url='https://huggingface.co/sanjay-906/a2c-PandaReachDense-v3/commit/5a2ce45cb3b3871faafa386e8af16a9eff1dc3e3', commit_message='sec commit', commit_description='', oid='5a2ce45cb3b3871faafa386e8af16a9eff1dc3e3', pr_url=None, repo_url=RepoUrl('https://huggingface.co/sanjay-906/a2c-PandaReachDense-v3', endpoint='https://huggingface.co', repo_type='model', repo_id='sanjay-906/a2c-PandaReachDense-v3'), pr_revision=None, pr_num=None)

## Some additional challenges 🏆
The best way to learn **is to try things on your own**! Why not try `PandaPickAndPlace-v3`?

If you want to try more advanced tasks for panda-gym, check what was done using **TQC or SAC** (more sample-efficient algorithms suited to robotics tasks). In real robotics, you'll use a more sample-efficient algorithm for a simple reason: unlike in a simulation, **if you move your robotic arm too much, you risk breaking it**.

PandaPickAndPlace-v1 (this model uses the v1 version of the environment): https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1

And don't hesitate to check the panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html

Here are the steps to train another agent (optional):

1. Define the environment called "PandaPickAndPlace-v3"
2. Make a vectorized environment
3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
4. Create the A2C Model (don't forget verbose=1 to print the training logs).
5. Train it for 1M timesteps
6. Save the model and the VecNormalize statistics
7. Evaluate your agent
8. Publish your trained model on the Hub 🔥 with `package_to_hub`


### Solution (optional)

In [None]:
# 1 - 2
env_id = "PandaPickAndPlace-v3"
env = make_vec_env(env_id, n_envs=4)

# 3
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)

# 4
model = A2C(policy = "MultiInputPolicy",
            env = env,
            verbose=1)
# 5
model.learn(1_000_000)

In [None]:
# 6
model_name = "a2c-PandaPickAndPlace-v3"
model.save(model_name)
env.save("vec_normalize.pkl")

# 7
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Load the saved statistics
eval_env = DummyVecEnv([lambda: gym.make("PandaPickAndPlace-v3")])
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)

# Do not update the normalization statistics at test time
eval_env.training = False
# Reward normalization is not needed at test time
eval_env.norm_reward = False

# Load the agent
model = A2C.load(model_name)

mean_reward, std_reward = evaluate_policy(model, eval_env)

print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")

# 8
package_to_hub(
    model=model,
    model_name=f"a2c-{env_id}",
    model_architecture="A2C",
    env_id=env_id,
    eval_env=eval_env,
    repo_id=f"ThomasSimonini/a2c-{env_id}", # TODO: Change the username
    commit_message="Initial commit",
)

See you in Unit 7! 🔥
## Keep learning, stay awesome 🤗