# Unit 8: Proximal Policy Optimization (PPO) with PyTorch 🤖️

In this unit, you'll learn to **code your PPO agent from scratch with PyTorch.**

To test its robustness, we'll train it in 2 different classic environments:

* [CartPole-v1](https://www.gymlibrary.ml/environments/classic_control/cart_pole/?highlight=cartpole)
* [LunarLander-v2 🚀](https://www.gymlibrary.ml/environments/box2d/lunar_lander/)

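We complete the foundations part of the course by understanding in depth how PPO works. In Unit 1, you learned to train a PPO agent on LunarLander-v2. Now, in Unit 8, you can code it from scratch. That's pretty incredible 🤩.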

![cover.jpg](./assets/cover.jpg)

⬇️ Here is an example of what you will achieve in a few minutes ([download link for original video 1](https://huggingface.co/sb3/ppo-CartPole-v1/resolve/main/replay.mp4), [download link for original video 2](https://huggingface.co/sb3/ppo-LunarLander-v2/resolve/main/replay.mp4)). ⬇️

In [None]:
%%html
<video autoplay controls><source src='./assets/replay1.mp4' type='video/mp4'></video>

In [None]:
%%html
<video autoplay controls><source src='./assets/replay2.mp4' type='video/mp4'></video>

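💡 We advise you to use Google Colab, since some environments only work on Ubuntu. The free version of Google Colab is enough for this tutorial. Let's get started! 🚀

## This notebook is from the Deep Reinforcement Learning Course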
![Deep Reinforcement Learning Course.jpg](./assets/DeepReinforcementLearningCourse.jpg)

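In this free course, you will:

* 📖 Study deep reinforcement learning in **theory and practice.**
* 🧑‍💻 Learn to **use popular deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, and RLlib.
* 🤖️ **Train agents in unique environments.**

More course 📚 content 👉 https://github.com/huggingface/deep-rl-class

The best way to keep up is to join our Discord server and talk with the community and with us. 👉🏻 https://discord.gg/aYka4Yhff9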

## Prerequisites 🏗

Before diving into the notebook, you need to:

🔲 📚 [Read the Unit 8 README.](https://github.com/huggingface/deep-rl-class/blob/main/unit8/README.md)

🔲 📚 **Study Proximal Policy Optimization (PPO)** by reading the chapter 👉 https://huggingface.co/blog/deep-rl-ppo

### Step 0: Set up the GPU 💪

* To **train the agent faster, we'll use a GPU.** Go to `Edit > Notebook settings`
![image.png](./assets/image.png)

* `Hardware accelerator > GPU`

![image.png](./assets/image1.png)

### Step 1: Install dependencies 🔽 and a virtual screen 💻

In [None]:
!apt install ffmpeg
# These steps are not needed if you work in an IDE (e.g. PyCharm or VS Code).
!apt install python-opengl xvfb
!pip install pyvirtualdisplay

In [None]:
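# Note: the code in this notebook uses the pre-0.26 Gym API (`env.seed()`, 4-value `env.step()`).
!pip install gym box2d-py  # On Apple M1, use `conda install box2d-py` instead.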
!pip install huggingface_hub
!pip install imageio imageio-ffmpeg
!pip install pyglet

In [None]:
# Create a virtual display.
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

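### Step 2: Let's code PPO from scratch with Costa Huang's tutorial
* For the core implementation of PPO, we'll use the excellent [Costa Huang's tutorial](https://costa.sh/).
* In addition, for a deeper understanding, you can read the 13 core implementation details: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/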
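
As a rough illustration of the central idea these resources explain, here is a minimal sketch of PPO's clipped policy loss for a single sample. It is written in plain Python rather than PyTorch, and the function name `clipped_policy_loss` and the sample numbers are assumptions made for this example. It mirrors the `pg_loss1` / `pg_loss2` terms of the `ppo.py` script shown later in this notebook.

```python
import math


def clipped_policy_loss(new_logprob, old_logprob, advantage, clip_coef=0.2):
    """Per-sample PPO policy loss (a quantity to minimize)."""
    # Probability ratio between the new and the old policy.
    ratio = math.exp(new_logprob - old_logprob)
    # Keep the ratio inside [1 - clip_coef, 1 + clip_coef].
    clipped_ratio = min(max(ratio, 1.0 - clip_coef), 1.0 + clip_coef)
    loss_unclipped = -advantage * ratio
    loss_clipped = -advantage * clipped_ratio
    # Taking the max of the two losses is the pessimistic (clipped) objective.
    return max(loss_unclipped, loss_clipped)


# Ratio 1.5 with a positive advantage: the gain is capped at ratio 1.2.
loss_up = clipped_policy_loss(math.log(1.5), 0.0, advantage=1.0)
# Ratio 0.5 with a positive advantage: the unclipped term is kept.
loss_down = clipped_policy_loss(math.log(0.5), 0.0, advantage=1.0)
print(loss_up, loss_down)
```

The clipping removes the incentive to move the policy far from the one that collected the data, which is what keeps each update "proximal".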

👉 Video tutorial: https://youtu.be/MEt6rrxH8W4

In [None]:
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/MEt6rrxH8W4" ' 
     + 'title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; '
     + 'clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

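* The best approach is to write your code in the cell below first, so that you don't lose it if your process gets killed.

In [None]:
### Your code here:

### Step 3: Add the Hugging Face integration 🤗
* To push our model to the Hugging Face Hub, we need to define a `package_to_hub` function.
* First, add the dependencies we need to push the model to the Hugging Face Hub.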

In [None]:
import datetime
import imageio
import json
import shutil
import tempfile
from pathlib import Path

from huggingface_hub import HfApi, upload_folder
from huggingface_hub.repocard import metadata_eval_result, metadata_save
from wasabi import Printer

msg = Printer()

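* In `parse_args()`, add a new argument that defines the `repo-id` where we want to push the model.

In [None]:
# Add the Hugging Face Hub argument.
parser.add_argument('--repo-id',
                    type=str,
                    default='ThomasSimonini/ppo-CartPole-v1',
                    help='ID of the model repository on the Hugging Face Hub, in the form {username/repo_name}')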
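
To see how this argument becomes available in the script, here is a small self-contained sketch. The parser below is a reduced stand-in for the real `parse_args()`, and the repository IDs are placeholders. Note that `argparse` exposes `--repo-id` as `args.repo_id`, replacing the dash with an underscore.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--env-id', type=str, default='CartPole-v1')
# Placeholder default: replace it with your own {username/repo_name}.
parser.add_argument('--repo-id', type=str, default='your-username/ppo-CartPole-v1')

# Passing an explicit list avoids reading sys.argv, which is handy inside a notebook.
args = parser.parse_args(['--repo-id', 'your-username/ppo-LunarLander-v2'])

print(args.repo_id)  # value given on the command line
print(args.env_id)   # falls back to the default
```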

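* Next, we add the methods needed to push the model to the Hugging Face Hub.
* These methods are:
    * `_evaluate_agent()`: evaluates the agent.
    * `_generate_model_card()`: generates the model card of your agent.
    * `_record_video()`: records a replay video of your agent.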
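
Before reading the real helpers, it may help to see the evaluation loop in isolation. The sketch below uses a hypothetical `ToyEnv` class that imitates the older Gym interface (`reset()` returns a state, `step()` returns four values) and a random policy, so it runs without Gym or PyTorch. Only the loop structure is meant to carry over.

```python
import random
import statistics


class ToyEnv:
    """Hypothetical stand-in for a Gym environment (old 4-value `step` API)."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        reward = 1.0  # constant reward per step, similar in spirit to CartPole
        done = self.t >= self.episode_length
        return float(self.t), reward, done, {}


def evaluate(env, n_eval_episodes, act):
    """Run `n_eval_episodes` episodes and return the mean and std of the episode rewards."""
    episode_rewards = []
    for _ in range(n_eval_episodes):
        state = env.reset()
        done = False
        total = 0.0
        while not done:
            state, reward, done, _ = env.step(act(state))
            total += reward
        episode_rewards.append(total)
    # Population standard deviation, which is also what `np.std` computes by default.
    return statistics.mean(episode_rewards), statistics.pstdev(episode_rewards)


mean_reward, std_reward = evaluate(ToyEnv(), 10, act=lambda state: random.choice([0, 1]))
print(mean_reward, std_reward)
```

Because every toy episode lasts exactly five steps, the mean is 5.0 and the standard deviation is 0.0 regardless of the actions chosen.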

In [None]:
def package_to_hub(repo_id,
                   model,
                   hyperparameters,
                   eval_env,
                   video_fps=30,
                   commit_message='Push the reinforcement learning model to the Hugging Face Hub.',
                   token=None,
                   logs=None):
    """Evaluate the model, generate a replay video, and push everything to the Hugging Face Hub.

    This function runs the complete pipeline:
        * Evaluate the model
        * Generate the model card
        * Generate a replay video of the agent
        * Push everything to the Hugging Face Hub

    Args:
        repo_id: ID of the model repository on the Hugging Face Hub.
        model: The trained model.
        hyperparameters: The hyperparameters used to train the model.
        eval_env: The environment used to evaluate the agent.
        video_fps: Frames per second of the replay video.
        commit_message: The commit message.
        token: The Hugging Face token used to push the model.
        logs: Local directory of the TensorBoard logs you want to upload.
    """
    msg.info('This function will save, evaluate, generate a replay video of your agent, '
             'create a model card, and push your model to the Hugging Face Hub. '
             'It can take up to 1 minute. This is a work in progress: '
             'if you encounter a bug, please open an issue.')

    # Step 1: Create the repo (or reuse it if it already exists).
    repo_url = HfApi().create_repo(repo_id=repo_id,
                                   token=token,
                                   private=False,
                                   exist_ok=True)

    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_dir = Path(tmp_dir)

        # Step 2: Save the model.
        torch.save(model.state_dict(), tmp_dir / 'model.pt')

        # Step 3: Evaluate the model and build the JSON.
        mean_reward, std_reward = _evaluate_agent(eval_env, 10, model)

        # First, get the current time.
        eval_datetime = datetime.datetime.now()
        eval_form_datetime = eval_datetime.isoformat()

        evaluate_data = {
            'env_id': hyperparameters.env_id,
            'mean_reward': mean_reward,
            'std_reward': std_reward,
            'n_evaluation_episodes': 10,
            'eval_datetime': eval_form_datetime
        }

        # Write the evaluation data to a JSON file.
        with open(tmp_dir / 'hyperparameters.json', 'w') as outfile:
            json.dump(evaluate_data, outfile)

        # Step 4: Record the replay video.
        video_path = tmp_dir / 'replay.mp4'
        _record_video(eval_env, model, video_path, video_fps)

        # Step 5: Create the model card.
        generated_model_card, metadata = _generate_model_card('PPO',
                                                              hyperparameters.env_id,
                                                              mean_reward,
                                                              std_reward,
                                                              hyperparameters)
        _save_model_card(tmp_dir, generated_model_card, metadata)

        # Step 6: Add the logs if needed.
        if logs:
            _add_logdir(tmp_dir, Path(logs))

        msg.info(f'Pushing repo {repo_id} to the Hugging Face Hub...')

        repo_url = upload_folder(repo_id=repo_id,
                                 folder_path=tmp_dir,
                                 path_in_repo='',
                                 commit_message=commit_message,
                                 token=token)

        msg.info(f'Your model has been pushed to the Hugging Face Hub. You can view it here: {repo_url}')

    return repo_url


def _evaluate_agent(env, n_eval_episodes, policy):
    """Evaluate the agent for `n_eval_episodes` episodes and return the mean and std of the reward.

    Args:
        env: The evaluation environment.
        n_eval_episodes: Number of evaluation episodes.
        policy: The reinforcement learning agent.

    Returns:
        The mean and standard deviation of the episode rewards.
    """
    # Run the policy on whichever device its parameters live on (CPU or GPU).
    device = next(policy.parameters()).device
    episode_rewards = []

    for episode in range(n_eval_episodes):
        state = env.reset()
        done = False
        total_rewards_ep = 0

        while not done:
            state = torch.Tensor(state).to(device)
            action, _, _, _ = policy.get_action_and_value(state)
            state, reward, done, info = env.step(action.cpu().numpy())
            total_rewards_ep += reward

        episode_rewards.append(total_rewards_ep)

    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)

    return mean_reward, std_reward


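def _record_video(env, policy, out_directory, fps=30):
    """Play one episode with `policy` and save the rendered frames as a video."""
    # Run the policy on whichever device its parameters live on (CPU or GPU).
    device = next(policy.parameters()).device
    images = []
    done = False
    state = env.reset()
    img = env.render(mode='rgb_array')
    images.append(img)

    while not done:
        state = torch.Tensor(state).to(device)
        # Sample an action from the policy for the current state.
        action, _, _, _ = policy.get_action_and_value(state)
        # The next state is written straight back into `state` to keep the recording loop simple.
        state, reward, done, info = env.step(action.cpu().numpy())
        img = env.render(mode='rgb_array')
        images.append(img)

    imageio.mimsave(out_directory, [np.array(img) for img in images], fps=fps)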


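def _generate_model_card(model_name, env_id, mean_reward, std_reward, hyperparameters):
    """Generate the model card for the Hugging Face Hub.

    Args:
        model_name: Name of the model.
        env_id: Name of the environment.
        mean_reward: Mean reward of the agent.
        std_reward: Standard deviation of the reward.
        hyperparameters: The hyperparameters used to train the model.
    """
    # Step 1: Select the metadata.
    metadata = _generate_metadata(model_name, env_id, mean_reward, std_reward)

    # Convert the hyperparameter namespace to a string.
    converted_dict = vars(hyperparameters)
    converted_str = str(converted_dict)
    converted_str = converted_str.split(', ')
    converted_str = '\n'.join(converted_str)

    # Step 2: Generate the model card.
    # The card is assembled line by line: inside an indented triple-quoted string, every line
    # would start with four spaces, which Markdown renders as a code block.
    model_card = '\n'.join([
        '',
        f'# PPO agent playing {env_id}',
        '',
        f'This is a trained PPO agent playing {env_id}.',
        'To learn to code your own PPO agent and train it,',
        'check Unit 8 of the Deep Reinforcement Learning Course: '
        'https://github.com/huggingface/deep-rl-class/tree/main/unit8',
        '',
        '# Hyperparameters',
        '```python',
        converted_str,
        '```',
        ''
    ])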

    return model_card, metadata


def _generate_metadata(model_name, env_id, mean_reward, std_reward):
    """Define the metadata of the model card.

    Args:
        model_name: Name of the model.
        env_id: Name of the environment.
        mean_reward: Mean reward of the agent.
        std_reward: Standard deviation of the reward.
    """
    # The Hub reads the `tags` key (plural) from the model card metadata.
    metadata = {
        'tags': [
            env_id,
            'ppo',
            'deep-reinforcement-learning',
            'reinforcement-learning',
            'custom-implementation',
            'deep-rl-class'
        ]
    }

    # Add the evaluation results (named `eval_metadata` to avoid shadowing the built-in `eval`).
    eval_metadata = metadata_eval_result(model_pretty_name=model_name,
                                         task_pretty_name='reinforcement-learning',
                                         task_id='reinforcement-learning',
                                         metrics_pretty_name='mean_reward',
                                         metrics_id='mean_reward',
                                         metrics_value=f'{mean_reward:.2f} +/- {std_reward:.2f}',
                                         dataset_pretty_name=env_id,
                                         dataset_id=env_id)

    # Merge both dictionaries.
    metadata = {**metadata, **eval_metadata}

    return metadata


def _save_model_card(local_path, generated_model_card, metadata):
    """Save the model card to the repository.

    Args:
        local_path: Path of the repository.
        generated_model_card: The model card generated by `_generate_model_card()`.
        metadata: The metadata.
    """
    readme_path = local_path / 'README.md'

    if readme_path.exists():
        with readme_path.open('r', encoding='utf-8') as f:
            readme = f.read()
    else:
        readme = generated_model_card

    with readme_path.open('w', encoding='utf-8') as f:
        f.write(readme)

    # Save our evaluation information to the README metadata.
    metadata_save(readme_path, metadata)


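def _add_logdir(local_path: Path,
                logdir: Path):
    """Add the logs to the repository.

    Args:
        local_path: Path of the repository.
        logdir: Path of the log directory.
    """
    if logdir.exists() and logdir.is_dir():
        # Add the logs to the repository under a new directory called `logs`.
        repo_logdir = local_path / 'logs'

        # Delete the current log directory if it exists.
        if repo_logdir.exists():
            shutil.rmtree(repo_logdir)

        # Copy the logs into the repository's log directory.
        shutil.copytree(logdir, repo_logdir)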
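
To make the hyperparameter conversion inside `_generate_model_card()` more concrete, here is a small sketch of what that step produces. The three-field `Namespace` is a made-up example; the real `args` object holds every argument defined in `parse_args()`.

```python
import argparse

# Hypothetical, shortened hyperparameter namespace.
hyperparameters = argparse.Namespace(env_id='CartPole-v1', learning_rate=2.5e-4, num_envs=4)

# Same steps as in `_generate_model_card()`: dict -> string -> one entry per line.
converted_str = '\n'.join(str(vars(hyperparameters)).split(', '))
print(converted_str)
```

Each hyperparameter ends up on its own line of the model card. One caveat of this simple approach: a value that itself contains `', '` (for example a list) would also be split across lines.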

* Finally, we call this function at the end of the PPO training.

In [None]:
# Create the evaluation environment.
eval_env = gym.make(args.env_id)

package_to_hub(repo_id=args.repo_id,
               model=agent,  # The model we want to save.
               hyperparameters=args,
               eval_env=eval_env,
               logs=f'runs/{run_name}')

* Here's what the final `ppo.py` file looks like.

In [None]:
"""Docs and experiment results can be found at https://docs.cleanrl.dev/rl-algorithms/ppo/#ppopy."""
import argparse
import datetime

import json
import os
import random
import shutil
import tempfile
import time
from distutils.util import strtobool
from pathlib import Path

import gym
import imageio
import numpy as np
import torch

from huggingface_hub.hf_api import HfApi
from huggingface_hub.hf_api import upload_folder
from huggingface_hub.repocard import metadata_eval_result, metadata_save
from torch import nn
from torch.distributions import Categorical
from torch.optim import Adam
from torch.utils.tensorboard import SummaryWriter
from wasabi import Printer

msg = Printer()


def parse_args():
    parser = argparse.ArgumentParser()

    # `str.rstrip('.py')` strips characters, not a suffix, so `splitext` is used to drop the extension.
    parser.add_argument('--exp-name',
                        type=str,
                        default=os.path.splitext(os.path.basename(__file__))[0],
                        help='the name of this experiment')
    parser.add_argument('--seed', type=int, default=1, help='seed of the experiment')
    parser.add_argument('--torch-deterministic',
                        type=lambda x: bool(strtobool(x)),
                        default=True,
                        nargs='?',
                        const=True,
                        help='if toggled, `torch.backends.cudnn.deterministic=False`')
    parser.add_argument('--cuda',
                        type=lambda x: bool(strtobool(x)),
                        default=True,
                        nargs='?',
                        const=True,
                        help='if toggled, CUDA will be enabled by default')
    parser.add_argument('--track',
                        type=lambda x: bool(strtobool(x)),
                        default=False,
                        nargs='?',
                        const=True,
                        help='if toggled, this experiment will be tracked with Weights and Biases')
    parser.add_argument('--wandb-project-name', type=str, default='cleanRL', help="the wandb's project name")
    parser.add_argument('--wandb-entity', type=str, default=None, help="the entity (team) of wandb's project")
    parser.add_argument('--capture-video',
                        type=lambda x: bool(strtobool(x)),
                        default=False,
                        nargs='?',
                        const=True,
                        help='whether to capture videos of the agent (check out the `videos` folder)')

    # Algorithm-specific arguments.
    parser.add_argument('--env-id', type=str, default='CartPole-v1', help='the id of the environment')
    parser.add_argument('--total-timesteps', type=int, default=50000, help='total timesteps of the experiment')
    parser.add_argument('--learning-rate', type=float, default=2.5e-4, help='the learning rate of the optimizer')
    parser.add_argument('--num-envs', type=int, default=4, help='the number of parallel environments')
    parser.add_argument('--num-steps',
                        type=int,
                        default=128,
                        help='the number of steps to run in each environment per policy rollout')
    parser.add_argument('--anneal-lr',
                        type=lambda x: bool(strtobool(x)),
                        default=True,
                        nargs='?',
                        const=True,
                        help='toggle learning rate annealing for the policy and value networks')
    parser.add_argument('--gae',
                        type=lambda x: bool(strtobool(x)),
                        default=True,
                        nargs='?',
                        const=True,
                        help='use Generalized Advantage Estimation (GAE) for advantage computation')
    parser.add_argument('--gamma', type=float, default=0.99, help='the discount factor gamma')
    parser.add_argument('--gae-lambda',
                        type=float,
                        default=0.95,
                        help='the lambda (bias-variance trade-off) for Generalized Advantage Estimation')
    parser.add_argument('--num-minibatches', type=int, default=4, help='the number of mini-batches')
    parser.add_argument('--update-epochs', type=int, default=4, help='the K epochs to update the policy')
    parser.add_argument('--norm-adv',
                        type=lambda x: bool(strtobool(x)),
                        default=True,
                        nargs='?',
                        const=True,
                        help='toggle advantage normalization')
    parser.add_argument('--clip-coef', type=float, default=0.2, help='the surrogate clipping coefficient')
    parser.add_argument('--clip-vloss',
                        type=lambda x: bool(strtobool(x)),
                        default=True,
                        nargs='?',
                        const=True,
                        help='toggle whether to use a clipped loss for the value function, as per the paper')
    parser.add_argument('--ent-coef', type=float, default=0.01, help='coefficient of the entropy')
    parser.add_argument('--vf-coef', type=float, default=0.5, help='coefficient of the value function')
    parser.add_argument('--max-grad-norm', type=float, default=0.5, help='the maximum norm for gradient clipping')
    parser.add_argument('--target-kl', type=float, default=None, help='the target KL divergence threshold')

    # Add the Hugging Face Hub argument.
    parser.add_argument('--repo-id',
                        type=str,
                        default='ThomasSimonini/ppo-CartPole-v1',
                        help='ID of the model repository on the Hugging Face Hub, in the form {username/repo_name}')

    args = parser.parse_args()
    args.batch_size = int(args.num_envs * args.num_steps)
    args.minibatch_size = int(args.batch_size // args.num_minibatches)

    return args


def package_to_hub(repo_id,
                   model,
                   hyperparameters,
                   eval_env,
                   video_fps=30,
                   commit_message='Push the reinforcement learning model to the Hugging Face Hub.',
                   token=None,
                   logs=None):
    """Evaluate the model, generate a replay video, and push everything to the Hugging Face Hub.

    This function runs the complete pipeline:
        * Evaluate the model
        * Generate the model card
        * Generate a replay video of the agent
        * Push everything to the Hugging Face Hub

    Args:
        repo_id: ID of the model repository on the Hugging Face Hub.
        model: The trained model.
        hyperparameters: The hyperparameters used to train the model.
        eval_env: The environment used to evaluate the agent.
        video_fps: Frames per second of the replay video.
        commit_message: The commit message.
        token: The Hugging Face token used to push the model.
        logs: Local directory of the TensorBoard logs you want to upload.
    """
    msg.info('This function will save, evaluate, generate a replay video of your agent, '
             'create a model card, and push your model to the Hugging Face Hub. '
             'It can take up to 1 minute. This is a work in progress: '
             'if you encounter a bug, please open an issue.')

    # Step 1: Create the repo (or reuse it if it already exists).
    repo_url = HfApi().create_repo(repo_id=repo_id,
                                   token=token,
                                   private=False,
                                   exist_ok=True)

    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_dir = Path(tmp_dir)

        # Step 2: Save the model.
        torch.save(model.state_dict(), tmp_dir / 'model.pt')

        # Step 3: Evaluate the model and build the JSON.
        mean_reward, std_reward = _evaluate_agent(eval_env, 10, model)

        # First, get the current time.
        eval_datetime = datetime.datetime.now()
        eval_form_datetime = eval_datetime.isoformat()

        evaluate_data = {
            'env_id': hyperparameters.env_id,
            'mean_reward': mean_reward,
            'std_reward': std_reward,
            'n_evaluation_episodes': 10,
            'eval_datetime': eval_form_datetime
        }

        # Write the evaluation data to a JSON file.
        with open(tmp_dir / 'hyperparameters.json', 'w') as outfile:
            json.dump(evaluate_data, outfile)

        # Step 4: Record the replay video.
        video_path = tmp_dir / 'replay.mp4'
        _record_video(eval_env, model, video_path, video_fps)

        # Step 5: Create the model card.
        generated_model_card, metadata = _generate_model_card('PPO',
                                                              hyperparameters.env_id,
                                                              mean_reward,
                                                              std_reward,
                                                              hyperparameters)
        _save_model_card(tmp_dir, generated_model_card, metadata)

        # Step 6: Add the logs if needed.
        if logs:
            _add_logdir(tmp_dir, Path(logs))

        msg.info(f'Pushing repo {repo_id} to the Hugging Face Hub...')

        repo_url = upload_folder(repo_id=repo_id,
                                 folder_path=tmp_dir,
                                 path_in_repo='',
                                 commit_message=commit_message,
                                 token=token)

        msg.info(f'Your model has been pushed to the Hugging Face Hub. You can view it here: {repo_url}')

    return repo_url


def _evaluate_agent(env, n_eval_episodes, policy):
    """Evaluate the agent for `n_eval_episodes` episodes and return the mean and std of the reward.

    Args:
        env: The evaluation environment.
        n_eval_episodes: Number of evaluation episodes.
        policy: The reinforcement learning agent.

    Returns:
        The mean and standard deviation of the episode rewards.
    """
    # Run the policy on whichever device its parameters live on (CPU or GPU).
    device = next(policy.parameters()).device
    episode_rewards = []

    for episode in range(n_eval_episodes):
        state = env.reset()
        done = False
        total_rewards_ep = 0

        while not done:
            state = torch.Tensor(state).to(device)
            action, _, _, _ = policy.get_action_and_value(state)
            state, reward, done, info = env.step(action.cpu().numpy())
            total_rewards_ep += reward

        episode_rewards.append(total_rewards_ep)

    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)

    return mean_reward, std_reward


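def _record_video(env, policy, out_directory, fps=30):
    """Play one episode with `policy` and save the rendered frames as a video."""
    # Run the policy on whichever device its parameters live on (CPU or GPU).
    device = next(policy.parameters()).device
    images = []
    done = False
    state = env.reset()
    img = env.render(mode='rgb_array')
    images.append(img)

    while not done:
        state = torch.Tensor(state).to(device)
        # Sample an action from the policy for the current state.
        action, _, _, _ = policy.get_action_and_value(state)
        # The next state is written straight back into `state` to keep the recording loop simple.
        state, reward, done, info = env.step(action.cpu().numpy())
        img = env.render(mode='rgb_array')
        images.append(img)

    imageio.mimsave(out_directory, [np.array(img) for img in images], fps=fps)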


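def _generate_model_card(model_name, env_id, mean_reward, std_reward, hyperparameters):
    """Generate the model card for the Hugging Face Hub.

    Args:
        model_name: Name of the model.
        env_id: Name of the environment.
        mean_reward: Mean reward of the agent.
        std_reward: Standard deviation of the reward.
        hyperparameters: The hyperparameters used to train the model.
    """
    # Step 1: Select the metadata.
    metadata = _generate_metadata(model_name, env_id, mean_reward, std_reward)

    # Convert the hyperparameter namespace to a string.
    converted_dict = vars(hyperparameters)
    converted_str = str(converted_dict)
    converted_str = converted_str.split(', ')
    converted_str = '\n'.join(converted_str)

    # Step 2: Generate the model card.
    # The card is assembled line by line: inside an indented triple-quoted string, every line
    # would start with four spaces, which Markdown renders as a code block.
    model_card = '\n'.join([
        '',
        f'# PPO agent playing {env_id}',
        '',
        f'This is a trained PPO agent playing {env_id}.',
        'To learn to code your own PPO agent and train it,',
        'check Unit 8 of the Deep Reinforcement Learning Course: '
        'https://github.com/huggingface/deep-rl-class/tree/main/unit8',
        '',
        '# Hyperparameters',
        '```python',
        converted_str,
        '```',
        ''
    ])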

    return model_card, metadata


def _generate_metadata(model_name, env_id, mean_reward, std_reward):
    """Define the metadata of the model card.

    Args:
        model_name: Name of the model.
        env_id: Name of the environment.
        mean_reward: Mean reward of the agent.
        std_reward: Standard deviation of the reward.
    """
    # The Hub reads the `tags` key (plural) from the model card metadata.
    metadata = {
        'tags': [
            env_id,
            'ppo',
            'deep-reinforcement-learning',
            'reinforcement-learning',
            'custom-implementation',
            'deep-rl-class'
        ]
    }

    # Add the evaluation results (named `eval_metadata` to avoid shadowing the built-in `eval`).
    eval_metadata = metadata_eval_result(model_pretty_name=model_name,
                                         task_pretty_name='reinforcement-learning',
                                         task_id='reinforcement-learning',
                                         metrics_pretty_name='mean_reward',
                                         metrics_id='mean_reward',
                                         metrics_value=f'{mean_reward:.2f} +/- {std_reward:.2f}',
                                         dataset_pretty_name=env_id,
                                         dataset_id=env_id)

    # Merge both dictionaries.
    metadata = {**metadata, **eval_metadata}

    return metadata


def _save_model_card(local_path, generated_model_card, metadata):
    """Save the model card to the repository.

    Args:
        local_path: Path of the repository.
        generated_model_card: The model card generated by `_generate_model_card()`.
        metadata: The metadata.
    """
    readme_path = local_path / 'README.md'

    if readme_path.exists():
        with readme_path.open('r', encoding='utf-8') as f:
            readme = f.read()
    else:
        readme = generated_model_card

    with readme_path.open('w', encoding='utf-8') as f:
        f.write(readme)

    # Save our evaluation information to the README metadata.
    metadata_save(readme_path, metadata)


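def _add_logdir(local_path: Path,
                logdir: Path):
    """Add the logs to the repository.

    Args:
        local_path: Path of the repository.
        logdir: Path of the log directory.
    """
    if logdir.exists() and logdir.is_dir():
        # Add the logs to the repository under a new directory called `logs`.
        repo_logdir = local_path / 'logs'

        # Delete the current log directory if it exists.
        if repo_logdir.exists():
            shutil.rmtree(repo_logdir)

        # Copy the logs into the repository's log directory.
        shutil.copytree(logdir, repo_logdir)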


def make_env(env_id, seed, idx, capture_video, run_name):
    def thunk():
        env = gym.make(env_id)
        env = gym.wrappers.RecordEpisodeStatistics(env)
        if capture_video:
            if idx == 0:
                env = gym.wrappers.RecordVideo(env, f'videos/{run_name}')
        env.seed(seed)
        env.action_space.seed(seed)
        env.observation_space.seed(seed)

        return env

    return thunk


def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)

    return layer


class Agent(nn.Module):
    def __init__(self, envs):
        super(Agent, self).__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(np.asarray(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0)
        )

        self.actor = nn.Sequential(
            layer_init(nn.Linear(np.asarray(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, envs.single_action_space.n), std=0.01)
        )

    def get_value(self, x):
        return self.critic(x)

    def get_action_and_value(self, x, action=None):
        logits = self.actor(x)
        probs = Categorical(logits=logits)

        if action is None:
            action = probs.sample()

        return action, probs.log_prob(action), probs.entropy(), self.critic(x)


if __name__ == '__main__':
    args = parse_args()
    run_name = f'{args.env_id}__{args.exp_name}__{args.seed}__{int(time.time())}'
    if args.track:
        import wandb

        wandb.init(config=vars(args),
                   project=args.wandb_project_name,
                   entity=args.wandb_entity,
                   name=run_name,
                   sync_tensorboard=True,
                   monitor_gym=True,
                   save_code=True)
    writer = SummaryWriter(f'runs/{run_name}')
    writer.add_text('hyperparameters',
                    '|params|value|\n|-|-|\n%s' % ('\n'.join([f'|{key}|{value}' for key, value in vars(args).items()])))

    # TRY NOT TO MODIFY: seeding.
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    torch.backends.cudnn.deterministic = args.torch_deterministic

    device = torch.device('cuda' if torch.cuda.is_available() and args.cuda else 'cpu')

    # Environment setup.
    envs = gym.vector.SyncVectorEnv([make_env(env_id=args.env_id,
                                              seed=args.seed + i,
                                              idx=i,
                                              capture_video=args.capture_video,
                                              run_name=run_name) for i in range(args.num_envs)])
    assert isinstance(envs.single_action_space, gym.spaces.Discrete), 'only discrete action spaces are supported'

    agent = Agent(envs).to(device)
    optimizer = Adam(agent.parameters(), lr=args.learning_rate, eps=1e-5)

    # ALGO LOGIC: storage setup.
    obs = torch.zeros((args.num_steps, args.num_envs) + envs.single_observation_space.shape).to(device)
    actions = torch.zeros((args.num_steps, args.num_envs) + envs.single_action_space.shape).to(device)
    logprobs = torch.zeros((args.num_steps, args.num_envs)).to(device)
    rewards = torch.zeros((args.num_steps, args.num_envs)).to(device)
    dones = torch.zeros((args.num_steps, args.num_envs)).to(device)
    values = torch.zeros((args.num_steps, args.num_envs)).to(device)

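    # TRY NOT TO MODIFY: start the game.
    global_step = 0
    start_time = time.time()
    next_obs = torch.Tensor(envs.reset()).to(device)
    # `torch.Tensor(n)` would allocate n uninitialized values; the done flags must start at zero.
    next_done = torch.zeros(args.num_envs).to(device)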
    num_updates = args.total_timesteps // args.batch_size

    for update in range(1, num_updates + 1):
        # Anneal the learning rate if requested.
        if args.anneal_lr:
            frac = 1.0 - (update - 1.0) / num_updates
            lrnow = frac * args.learning_rate
            optimizer.param_groups[0]['lr'] = lrnow

        for step in range(0, args.num_steps):
            global_step += 1 * args.num_envs
            obs[step] = next_obs
            dones[step] = next_done

            # ALGO LOGIC: action logic.
            with torch.no_grad():
                action, logprob, _, value = agent.get_action_and_value(next_obs)
                values[step] = value.flatten()

            actions[step] = action
            logprobs[step] = logprob

            # TRY NOT TO MODIFY: execute the game and log data.
            next_obs, reward, done, info = envs.step(action.cpu().numpy())
            rewards[step] = torch.tensor(reward).to(device).view(-1)
            next_obs, next_done = torch.Tensor(next_obs).to(device), torch.Tensor(done).to(device)

            for item in info:
                if 'episode' in item.keys():
                    print(f"global_step={global_step}, episodic_return={item['episode']['r']}")
                    writer.add_scalar('charts/episodic_return', item['episode']['r'], global_step)
                    writer.add_scalar('charts/episodic_length', item['episode']['l'], global_step)
                    break

        # Bootstrap the value if the episode is not done.
        with torch.no_grad():
            next_value = agent.get_value(next_obs).reshape(1, -1)
            if args.gae:
                advantages = torch.zeros_like(rewards).to(device)
                lastgaelam = 0
                for t in reversed(range(args.num_steps)):
                    if t == args.num_steps - 1:
                        nextnonterminal = 1.0 - next_done
                        nextvalues = next_value
                    else:
                        nextnonterminal = 1.0 - dones[t + 1]
                        nextvalues = values[t + 1]
                    delta = rewards[t] + args.gamma * nextvalues * nextnonterminal - values[t]
                    advantages[t] = lastgaelam = delta + args.gamma * args.gae_lambda * nextnonterminal * lastgaelam
                returns = advantages + values
            else:
                returns = torch.zeros_like(rewards).to(device)
                for t in reversed(range(args.num_steps)):
                    if t == args.num_steps - 1:
                        nextnonterminal = 1.0 - next_done
                        next_return = next_value
                    else:
                        nextnonterminal = 1.0 - dones[t + 1]
                        # Discounted returns accumulate the next return, not the next value estimate.
                        next_return = returns[t + 1]
                    returns[t] = rewards[t] + args.gamma * nextnonterminal * next_return
                advantages = returns - values

        # Flatten the batch.
        b_obs = obs.reshape((-1,) + envs.single_observation_space.shape)
        b_logprobs = logprobs.reshape(-1)
        b_actions = actions.reshape((-1,) + envs.single_action_space.shape)
        b_advantages = advantages.reshape(-1)
        b_returns = returns.reshape(-1)
        b_values = values.reshape(-1)

        # Optimize the policy and value networks.
        b_inds = np.arange(args.batch_size)
        clipfracs = []
        for epoch in range(args.update_epochs):
            np.random.shuffle(b_inds)
            for start in range(0, args.batch_size, args.minibatch_size):
                end = start + args.minibatch_size
                mb_inds = b_inds[start:end]

                _, newlogprob, entropy, newvalue = agent.get_action_and_value(b_obs[mb_inds], b_actions.long()[mb_inds])
                logratio = newlogprob - b_logprobs[mb_inds]
                ratio = logratio.exp()

                with torch.no_grad():
                    # Compute the approximate KL divergence: http://joschu.net/blog/kl-approx.html
                    old_approx_kl = (-logratio).mean()
                    approx_kl = ((ratio - 1) - logratio).mean()
                    clipfracs += [((ratio - 1.0).abs() > args.clip_coef).float().mean().item()]

                mb_advantages = b_advantages[mb_inds]
                if args.norm_adv:
                    mb_advantages = (mb_advantages - mb_advantages.mean()) / (mb_advantages.std() + 1e-8)

                # Policy loss.
                pg_loss1 = -mb_advantages * ratio
                pg_loss2 = -mb_advantages * torch.clamp(ratio, 1 - args.clip_coef, 1 + args.clip_coef)
                pg_loss = torch.max(pg_loss1, pg_loss2).mean()

                # Value loss.
                newvalue = newvalue.view(-1)
                if args.clip_vloss:
                    v_loss_unclipped = (newvalue - b_returns[mb_inds]) ** 2
                    v_clipped = b_values[mb_inds] + torch.clamp(
                        newvalue - b_values[mb_inds],
                        -args.clip_coef,
                        args.clip_coef
                    )
                    v_loss_clipped = (v_clipped - b_returns[mb_inds]) ** 2
                    v_loss_max = torch.max(v_loss_unclipped, v_loss_clipped)
                    v_loss = 0.5 * v_loss_max.mean()
                else:
                    v_loss = 0.5 * ((newvalue - b_returns[mb_inds]) ** 2).mean()

                entropy_loss = entropy.mean()
                loss = pg_loss - args.ent_coef * entropy_loss + v_loss * args.vf_coef

                optimizer.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(agent.parameters(), args.max_grad_norm)
                optimizer.step()

            if args.target_kl is not None:
                if approx_kl > args.target_kl:
                    break

        y_pred, y_true = b_values.cpu().numpy(), b_returns.cpu().numpy()
        var_y = np.var(y_true)
        explained_var = np.nan if var_y == 0 else 1 - np.var(y_true - y_pred) / var_y

        # TRY NOT TO MODIFY: record rewards for plotting purposes.
        writer.add_scalar('charts/learning_rate', optimizer.param_groups[0]['lr'], global_step)
        writer.add_scalar('losses/value_loss', v_loss.item(), global_step)
        writer.add_scalar('losses/policy_loss', pg_loss.item(), global_step)
        writer.add_scalar('losses/entropy', entropy_loss.item(), global_step)
        writer.add_scalar('losses/old_approx_kl', old_approx_kl.item(), global_step)
        writer.add_scalar('losses/approx_kl', approx_kl.item(), global_step)
        writer.add_scalar('losses/clipfrac', np.mean(clipfracs), global_step)
        writer.add_scalar('losses/explained_variance', explained_var, global_step)
        print('SPS:', int(global_step / (time.time() - start_time)))
        writer.add_scalar('charts/SPS', int(global_step / (time.time() - start_time)), global_step)

    envs.close()
    writer.close()

    # Create the evaluation environment.
    eval_env = gym.make(args.env_id)

    package_to_hub(repo_id=args.repo_id,
                   model=agent,  # The model we want to save.
                   hyperparameters=args,
                   eval_env=eval_env,
                   logs=f'runs/{run_name}')
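
The advantage computation is the part of the script that is hardest to follow by eye, so here is a hedged, pure-Python trace of the same backward recursion for a single environment. The helper name `compute_gae` and the three-step rollout are assumptions made for illustration; the script itself works on PyTorch tensors of shape `(num_steps, num_envs)`.

```python
def compute_gae(rewards, values, dones, next_value, next_done, gamma=0.99, gae_lambda=0.95):
    """Backward GAE recursion for one environment, mirroring the loop in `ppo.py`."""
    num_steps = len(rewards)
    advantages = [0.0] * num_steps
    lastgaelam = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            # After the last stored step, bootstrap from the value of the next observation.
            nextnonterminal = 1.0 - next_done
            nextvalue = next_value
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalue = values[t + 1]
        # One-step TD error, then the exponentially weighted sum of TD errors.
        delta = rewards[t] + gamma * nextvalue * nextnonterminal - values[t]
        lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
        advantages[t] = lastgaelam
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns


# Toy rollout: three steps with reward 1, the episode ends right after the last step.
advantages, returns = compute_gae(rewards=[1.0, 1.0, 1.0],
                                  values=[0.5, 0.5, 0.5],
                                  dones=[0.0, 0.0, 0.0],
                                  next_value=0.0,
                                  next_done=1.0,
                                  gamma=1.0,
                                  gae_lambda=1.0)
print(advantages)
print(returns)
```

With `gamma=1.0` and `gae_lambda=1.0`, the returns reduce to the plain reward-to-go (3, 2, 1), and each advantage is that return minus the value estimate of 0.5. Lower values of `gae_lambda` lean more on the value estimates, trading variance for bias.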

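To share your model with the community, there are two more steps to follow:

1⃣️ (If you haven't already) create a Hugging Face account ➡ https://huggingface.co/join

2⃣️ Sign in, then store your Hugging Face authentication token.

* Create a new token with the **write role** (https://huggingface.co/settings/tokens)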

![image.png](./assets/image2.png)

In [None]:
from huggingface_hub import notebook_login
notebook_login()

If you use an IDE, you can also run the following command in a terminal (without the leading `!`):

In [None]:
!huggingface-cli login

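### Step 4: Let's start the training 🔥
* Now that you've coded PPO from scratch and added the Hugging Face Hub integration, we're ready to start the training 🔥.

* First, copy your code into a file you create named `ppo.py`.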

![image.png](./assets/image3.png)

![image.png](./assets/image4.png)

* Now we just need to run this Python script using `python <name-of-python-script>.py` with the arguments we defined with `argparse`.

In [None]:
!python ppo.py \
--env-id='LunarLander-v2' \
--repo-id='YOUR_REPO_ID' \
--total-timesteps=50000
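
To get a feel for what this command does with the default arguments, the sketch below redoes the batch arithmetic from `parse_args()` and the learning-rate annealing from the training loop. The helper name `annealed_lr` is an assumption for this example; the values are the defaults from the script.

```python
num_envs = 4
num_steps = 128
num_minibatches = 4
total_timesteps = 50_000
learning_rate = 2.5e-4

# Same formulas as in `parse_args()` and the `__main__` block of `ppo.py`.
batch_size = num_envs * num_steps               # transitions collected per update
minibatch_size = batch_size // num_minibatches  # transitions per gradient step
num_updates = total_timesteps // batch_size     # number of policy updates


def annealed_lr(update):
    """Learning rate used at a given update (1-based), decaying linearly."""
    frac = 1.0 - (update - 1.0) / num_updates
    return frac * learning_rate


print(batch_size, minibatch_size, num_updates)
print(annealed_lr(1), annealed_lr(num_updates))
```

Because of the integer division, the run performs `num_updates * batch_size` environment steps, which is slightly fewer than `--total-timesteps` whenever the total is not a multiple of the batch size.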

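## Some additional challenges (optional) 🏆
In the [Leaderboard](https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard) you will find where your agent ranks. Can you get to the top?

Here are some ideas to get there:
* Train for more timesteps
* Try different hyperparameters by looking at what your classmates have done
* **Push your newly trained model** to the Hub 🔥

See you in the next unit! 🔥
## Keep learning, stay awesome 🤗!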