# Reinforcement learning training on Amazon SageMaker with Stable Baselines3

## Overview


<img src="https://stable-baselines3.readthedocs.io/en/master/_static/logo.png" width="300">

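[OpenAI Gym](https://gym.openai.com) is an open-source reinforcement learning toolkit. It provides a standard interface and a set of environments that make it quick to run reinforcement learning experiments.

[Stable Baselines3](https://stable-baselines.readthedocs.io/en/master/) is an open-source collection of reinforcement learning algorithms that improves on the OpenAI Baselines implementations.

In this lab we use an algorithm that ships with Stable Baselines3 to train an agent on the Atari game Ms. Pac-Man, [**MsPacman-v0**](https://gym.openai.com/envs/MsPacman-v0/), one of the environments included with OpenAI Gym.

The Dockerfile already installs every game environment from http://www.atarimania.com/roms/Roms.rar. Select one with **env_id**; the default is MsPacman-v0.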




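## How reinforcement learning works

Reinforcement learning (RL) is an area of machine learning concerned with how an agent should act in an environment to maximize its expected reward. It is one of the three basic machine learning paradigms, alongside supervised and unsupervised learning. Unlike supervised learning, it needs no labeled input/output pairs and no explicit correction of sub-optimal actions. Its focus is on balancing exploration (of uncharted territory) against exploitation (of current knowledge), the so-called exploration-exploitation trade-off.
[Wikipedia]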
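The exploration-exploitation trade-off described above can be illustrated with a minimal epsilon-greedy sketch on a two-armed bandit. It is not part of the lab code: the payout probabilities, `epsilon`, and the function name `run_bandit` are assumptions chosen purely for illustration.

```python
import random


def run_bandit(payout_probs, epsilon=0.1, steps=5000, seed=0):
    """Epsilon-greedy on a Bernoulli bandit; returns per-arm value estimates and pull counts."""
    rng = random.Random(seed)
    counts = [0] * len(payout_probs)
    values = [0.0] * len(payout_probs)  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick a random arm.
            arm = rng.randrange(len(payout_probs))
        else:
            # Exploit: pick the arm with the best estimate so far.
            arm = max(range(len(payout_probs)), key=lambda a: values[a])
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return values, counts


values, counts = run_bandit([0.2, 0.8])
print(values, counts)
```

With a small `epsilon` the agent mostly pulls the arm it currently believes is best, while the occasional random pull keeps its estimate of the other arm from going stale.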

<img src="src/rl.png">

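## Amazon SageMaker

SageMaker is an end-to-end machine learning platform from Amazon Web Services. It gives data scientists and algorithm engineers ready-to-use compute resources and the machine learning and deep learning runtime environments they need. You can use its built-in algorithms directly, or submit your own code, to spin up compute quickly for iterating on models and deploying them for inference.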

<img src="src/sm-arch.png">

## 1 Prerequisites

### Imports

Import the Python libraries we need, along with the helper functions `get_execution_role` and `wait_for_s3_object`.

In [1]:
import sagemaker
import boto3
import sys
import os
import subprocess
from IPython.display import HTML
import time
from time import gmtime, strftime
sys.path.append("common")
from misc import wait_for_s3_object
from docker_utils import build_and_push_docker_image
from sagemaker.rl import RLEstimator

### Set up the S3 bucket

Get the default S3 bucket through the SageMaker SDK. This bucket stores the model, checkpoints, and other metadata.

In [2]:
sage_session = sagemaker.session.Session()
s3_bucket = sage_session.default_bucket()  
s3_output_path = 's3://{}/'.format(s3_bucket)
print("S3 bucket path: {}".format(s3_output_path))

S3 bucket path: s3://sagemaker-us-east-1-596030579944/


### Define the job and image name variables

We define `job_name_prefix`, the prefix used for the training job and image names.

In [3]:
# create a descriptive job name 
rl_problem = 'pacman'
job_name_prefix = 'rl-stabebaselines-'+rl_problem
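The SDK appends a timestamp to the prefix to form the final job name, so it can help to check the prefix before submitting. The sketch below assumes the naming rule documented for the `CreateTrainingJob` API (at most 63 characters; letters, digits, and hyphens only, with no leading or trailing hyphen) and reuses the timestamp layout visible in this notebook's job names. The helper name `is_valid_prefix` is hypothetical.

```python
import re

# Assumed constraint for SageMaker training job names (CreateTrainingJob API).
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
# Example of the suffix layout appended to base_job_name in this notebook's output.
EXAMPLE_SUFFIX = "-2022-07-22-02-48-05-541"


def is_valid_prefix(prefix):
    """Return True if prefix plus a timestamp suffix would still be a valid job name."""
    candidate = prefix + EXAMPLE_SUFFIX
    return len(candidate) <= 63 and JOB_NAME_PATTERN.match(candidate) is not None


print(is_valid_prefix("rl-stabebaselines-pacman"))   # True
print(is_valid_prefix("rl_stablebaselines_pacman"))  # False: underscores are not allowed
```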

### Get the IAM role

Use `get_execution_role()` from the SageMaker SDK to get the role of the SageMaker notebook: `role = sagemaker.get_execution_role()`.

In [4]:
role = sagemaker.get_execution_role()
print("Using IAM role arn: {}".format(role))

Using IAM role arn: arn:aws:iam::596030579944:role/service-role/AmazonSageMaker-ExecutionRole-20191130T110013


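## Build the Docker image

We need to build our own Docker image. The build step takes care of everything:

1. Pull the base image
2. Install build tools such as g++ and cmake
3. Install stable-baselines and the libraries it depends on, such as OpenMPI
4. Push the image to Amazon ECR

This step usually takes 3 to 10 minutes, depending on your network speed and the notebook instance type.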



In [5]:
%%time
instance_type = 'ml.c5.xlarge'
cpu_or_gpu = 'gpu' if instance_type.startswith('ml.p') or instance_type.startswith('ml.g') else 'cpu'
repository_short_name = "sagemaker-roboschool-stablebaselines3-pytorch-1.10-py38-%s" % cpu_or_gpu
docker_build_args = { 
    'AWS_REGION': boto3.Session().region_name,
}
custom_image_name = build_and_push_docker_image(repository_short_name,dockerfile='Dockerfile', build_args=docker_build_args)
print("Using ECR image %s" % custom_image_name)



Building docker image sagemaker-roboschool-stablebaselines3-pytorch-1.10-py38-cpu from Dockerfile
$ docker build -t sagemaker-roboschool-stablebaselines3-pytorch-1.10-py38-cpu -f Dockerfile . --build-arg AWS_REGION=us-east-1
Sending build context to Docker daemon  4.923MB
Step 1/34 : ARG AWS_REGION
Step 2/34 : FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.2-cpu-py38-ubuntu20.04-sagemaker
 ---> 6cceafb61a34
Step 3/34 : RUN apt update
 ---> Using cache
 ---> 7a297aa759a4
Step 4/34 : RUN pip install --upgrade pip
 ---> Using cache
 ---> 5916efffba7e
Step 5/34 : RUN pip install sagemaker_containers
 ---> Using cache
 ---> 6fbc8b6a844c
Step 6/34 : ENV LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH
 ---> Using cache
 ---> d3846c409a72
Step 7/34 : ENV PATH /usr/local/openmpi/bin/:$PATH
 ---> Using cache
 ---> ed85d8f62ebb
Step 8/34 : ENV PATH=/usr/local/nvidia/bin:$PATH
 ---> Using cache
 ---> d93f83b1174f
Step 9/34 : ENV PYTHONDONTWRITEBYTECODE=1 PYTHONUNBU

139f6a96c053: Layer already exists
7e8de50b33bd: Layer already exists
cb0d043fe054: Layer already exists
dddc86285242: Layer already exists
867d7ce70f01: Layer already exists
a398efc78f70: Layer already exists
8bcc46e6fd19: Layer already exists
f06e7260a717: Layer already exists
00566075f82d: Layer already exists
ab29f857f3ba: Layer already exists
a8225123a462: Layer already exists
35e4d09b8d0f: Layer already exists
43c7230bab92: Layer already exists
3269ec2e55bf: Layer already exists
c390a0c00675: Layer already exists
a4d696825f37: Layer already exists
7a8a993481e0: Layer already exists
3c26c971124a: Layer already exists
2828aca077df: Layer already exists
759f75afd715: Layer already exists
af54ab71897d: Layer already exists
88bb40d10e9c: Layer already exists
114bea371b88: Layer already exists
c0a022366922: Layer already exists
50bbc942f459: Layer already exists
641f40d37982: Layer already exists
7cafbe790471: Layer already exists
2aca4e7afe8b: Layer already exists
98fffebf2583: Layer 

In [6]:
# Set the image name
aws_account = boto3.Session().client("sts").get_caller_identity()['Account']
aws_region =  boto3.Session().region_name
custom_image_name = f'{aws_account}.dkr.ecr.{aws_region}.amazonaws.com/sagemaker-roboschool-stablebaselines3-pytorch-1.10-py38-cpu'
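The cell above hard-codes the `-cpu` suffix. If you switch to a GPU instance, the repository name has to change with it. The following sketch derives the URI from the instance type using the same rule as the build cell. The function names and the placeholder account ID are assumptions, and the `amazonaws.com` domain does not apply to every AWS partition.

```python
def device_for(instance_type):
    """Mirror the notebook's rule: ml.p* and ml.g* instance families are treated as GPU."""
    return "gpu" if instance_type.startswith(("ml.p", "ml.g")) else "cpu"


def ecr_image_uri(account, region, instance_type):
    """Build the ECR image URI for the repository name used in this notebook."""
    repo = "sagemaker-roboschool-stablebaselines3-pytorch-1.10-py38-%s" % device_for(instance_type)
    return "{}.dkr.ecr.{}.amazonaws.com/{}".format(account, region, repo)


# 123456789012 is a placeholder account ID.
print(ecr_image_uri("123456789012", "us-east-1", "ml.c5.xlarge"))
```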


## 2 Create the RL training job with the SageMaker SDK

You can run the SageMaker training job on either GPU or CPU. The SageMaker SDK provides the `RLEstimator` class for creating RL training jobs.

1. Specify the source directory where the environment, presets and training code is uploaded.
2. Specify the entry point as the training code 
3. Specify the choice of RL toolkit and framework. This automatically resolves to the ECR path for the RL Container. 
4. Define the training parameters such as the instance count, job name, and S3 path for output. 
5. Specify the hyperparameters for the RL agent algorithm. The `RLSTABLEBASELINES_PRESET` can be used to specify the RL agent algorithm you want to use. 
6. Define the metrics definitions that you are interested in capturing in your logs. These can also be visualized in CloudWatch and SageMaker Notebooks. 
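As a sketch of step 6: SageMaker estimators accept metric definitions as a list of `Name`/`Regex` dictionaries, and each regex is applied to the training log. The two definitions below are hypothetical (this notebook does not pass any), and the sample lines follow the table format that Stable Baselines3 prints during training. The regexes can be tried out locally before using them in a job.

```python
import re

# Hypothetical metric definitions in the Name/Regex form accepted by SageMaker estimators.
metric_definitions = [
    {"Name": "ep_rew_mean", "Regex": r"ep_rew_mean\s*\|\s*([-+]?[0-9]*\.?[0-9]+)"},
    {"Name": "ep_len_mean", "Regex": r"ep_len_mean\s*\|\s*([-+]?[0-9]*\.?[0-9]+)"},
]

# Sample lines in the format of the Stable Baselines3 training table.
log_lines = [
    "|    ep_len_mean     | 656      |",
    "|    ep_rew_mean     | 223      |",
]


def extract_metrics(lines, definitions):
    """Apply each regex to each line and keep the first captured group as a float."""
    found = {}
    for line in lines:
        for definition in definitions:
            match = re.search(definition["Regex"], line)
            if match:
                found[definition["Name"]] = float(match.group(1))
    return found


print(extract_metrics(log_lines, metric_definitions))
```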

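Note that every preset hyperparameter in `preset-pacman.py` can be overridden through `hyperparameters`.

**Note**: The PPO1 algorithm relies on MPI. For this lab, set the instance count, `instance_count`, to `1`.

The trained model is written to `/opt/ml/output/intermediate/rl_model.zip`.

## 2.1 Write the training code

#### Configure the RL algorithm hyperparameters

The preset file that configures the RL training job is `preset-pacman.py` in the `./src` directory. In the preset file you can define agent parameters to select a specific agent algorithm. You can also set environment parameters, define the schedule and visualization parameters, and define the graph manager. The preset contains the following hyperparameters needed for PPO1 training: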

* `num_timesteps`: (int) number of training steps - Preset: 1e4
* `n_steps`: (int) timesteps per actor per update - Preset: 2048
* `clip_range`: (float) clipping parameter epsilon - Preset: 0.2
* `ent_coef`: (float) the entropy loss weight - Preset: 0.0
* `n_epochs`: (int) the optimizer's number of epochs - Preset: 10
* `learning_rate`: (float) the optimizer's step size - Preset: 3e-4
* `batchsize`: (int) the optimizer's batch size - Preset: 64
* `gamma`: (float) discount factor - Preset: 0.99
* `gae_lambda`: (float) trade-off factor for generalized advantage estimation - Preset: 0.95
* `verbose`: (int) the verbosity level: 0 none, 1 training information, 2 debug - Preset: 1

The full list of PPO1 hyperparameters and detailed documentation is available here: https://stable-baselines.readthedocs.io/en/master/modules/ppo1.html
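To show how job-level `hyperparameters` could override the preset values, here is a minimal sketch. The defaults are copied from the list above, but the dictionary layout and the `resolve_hyperparameters` helper are assumptions; the actual contents of `preset-pacman.py` are not shown in this notebook.

```python
# Defaults copied from the hyperparameter list above; the dict layout is hypothetical.
PRESET_DEFAULTS = {
    "num_timesteps": 1e4,
    "n_steps": 2048,
    "clip_range": 0.2,
    "ent_coef": 0.0,
    "n_epochs": 10,
    "learning_rate": 3e-4,
    "batchsize": 64,
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "verbose": 1,
}


def resolve_hyperparameters(overrides):
    """Return the preset defaults with job-level overrides applied; reject unknown keys."""
    unknown = set(overrides) - set(PRESET_DEFAULTS)
    if unknown:
        raise KeyError("unknown hyperparameters: %s" % sorted(unknown))
    merged = dict(PRESET_DEFAULTS)
    merged.update(overrides)
    return merged


params = resolve_hyperparameters({"num_timesteps": 1e3})
print(params["num_timesteps"], params["n_steps"])
```

Rejecting unknown keys is a design choice in this sketch: it catches a misspelled hyperparameter name before a long training run starts.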


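The `RLSTABLEBASELINES_PRESET` hyperparameter specifies which preset file defines the hyperparameters. Here we use `"RLSTABLEBASELINES_PRESET":"preset-{}.py".format(rl_problem)`.

#### Write the training code

The training code is in `train_stable_baselines3.py` in the `./src` directory.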

In [40]:
!pygmentize src/train_stable_baselines3.py

import argparse

from sagemaker_rl.stable_baselines3_launcher import SagemakerStableBaselines3PPOLauncher, create_env


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--RLSTABLEBASELINES_PRESET', required=True, type=str)
    parser.add_argument('--output_path', default="/opt/ml/output/intermediate/", type=str)
    parser.add_argument('--instance_type', type=str)
    parser.add_argument('--num_timesteps', default=1e4)
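In script mode, SageMaker passes each entry of `hyperparameters` to the entry point as a `--name value` command-line argument, so values arrive as strings. An argument declared without `type`, such as `--num_timesteps` in the excerpt, therefore yields a string when supplied and the float default only when omitted. The truncated excerpt does not show whether the script converts the value later. The sketch below is one way to make the type explicit; it is an illustration, not the script's actual code.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # type=float converts the command-line string, so both the default and overrides are floats.
    parser.add_argument("--num_timesteps", default=1e4, type=float)
    parser.add_argument("--instance_type", type=str)
    # parse_known_args tolerates extra arguments that this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--num_timesteps", "1000.0", "--instance_type", "ml.c5.xlarge"])
print(int(args.num_timesteps), args.instance_type)
```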

**Launch the training job**

In [27]:
custom_image_name

'596030579944.dkr.ecr.us-east-1.amazonaws.com/sagemaker-roboschool-stablebaselines3-pytorch-1.10-py38-cpu'

In [10]:
%%time
instance_type = 'ml.c5.xlarge'
estimator = RLEstimator(entry_point="train_stable_baselines3.py",
                        source_dir='src',
                        dependencies=["common/sagemaker_rl"],
                        image_uri=custom_image_name,
                        role=role,
                        instance_type=instance_type,
                        use_spot_instances=True,
                        max_wait = (72 * 60 * 60),
                        instance_count=1,
                        output_path=s3_output_path,
                        base_job_name=job_name_prefix,
                        hyperparameters={
                            #"num_timesteps":1e7, # longer training run
                            "num_timesteps":1e3,
                            "instance_type":instance_type,
                            #"env_id":"SpaceInvaders-v0" # the default env is MsPacman-v0
                        }
                    )

estimator.fit(wait=True)

2022-07-22 02:48:05 Starting - Starting the training job...
2022-07-22 02:48:33 Starting - Preparing the instances for trainingProfilerReport-1658458085: InProgress
......
2022-07-22 02:49:33 Downloading - Downloading input data...
2022-07-22 02:49:53 Training - Downloading the training image............
2022-07-22 02:52:02 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
sed: can't read changehostname.c: No such file or directory
gcc: error: changehostname.c: No such file or directory
gcc: fatal error: no input files
compilation terminated.
gcc: error: changehostname.o: No such file or directory
ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
2022-07-22 02:52:05,375 sagemaker-containers INFO     Imported f

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 656      |
|    ep_rew_mean     | 223      |
| time/              |          |
|    fps             | 171      |
|    iterations      | 1        |
|    time_elapsed    | 11       |
|    total_timesteps | 2048     |
---------------------------------
ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
Predict and video record was completed, /opt/ml/output/intermediate//rl_out.mp4
model saved: /opt/ml/output/intermediate//rl_model
2022-07-22 02:53:11,372 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2022-07-22 02:53:11,372 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 0 from exiting process.
2022-07-22 02:53:11,37

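## Visualization

RL training usually takes a long time, so we need several ways to track the progress of a running training job. During training, the job can write intermediate output to S3, which we can use for monitoring and analysis.

### Fetch the video produced by training
During training, videos of the environment can be written to S3. Next, we fetch all available videos and render the last one in the notebook.
We can first interrupt the execution of the previous cell, because the whole training job has already been submitted to SageMaker.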
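Waiting for intermediate output amounts to polling S3 until objects appear under the job's prefix. The sketch below shows that pattern with an injectable listing function so it runs without AWS access. It is not the implementation of `wait_for_s3_object` in `common/misc.py`, which is not shown here; `wait_for_keys` and its parameters are hypothetical.

```python
import time


def wait_for_keys(list_keys, timeout=60.0, poll_interval=1.0, sleep=time.sleep):
    """Call list_keys() until it returns a non-empty list or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        keys = list_keys()
        if keys:
            return keys
        if time.monotonic() >= deadline:
            raise TimeoutError("no objects appeared within %.0f seconds" % timeout)
        sleep(poll_interval)


# Stand-in for an S3 listing call: returns nothing twice, then one key.
responses = iter([[], [], ["output/intermediate/rl_out.mp4"]])
keys = wait_for_keys(lambda: next(responses), poll_interval=0.01)
print(keys)
```

In a real notebook the `list_keys` callable would wrap an S3 listing request for the intermediate folder prefix.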

In [11]:

job_name = estimator.latest_training_job.job_name
print("Training job: %s" % job_name)

s3_url = "s3://{}/{}".format(s3_bucket,job_name)

output_tar_key = "{}/output/output.tar.gz".format(job_name)

intermediate_folder_key = "{}/output/intermediate".format(job_name)
output_url = "s3://{}/{}".format(s3_bucket, output_tar_key)
intermediate_url = "s3://{}/{}".format(s3_bucket, intermediate_folder_key)

print("S3 job path: {}".format(s3_url))
print("Output.tar.gz location: {}".format(output_url))
print("Intermediate folder path: {}".format(intermediate_url))
    
tmp_dir = "/tmp/{}".format(job_name)
os.makedirs(tmp_dir, exist_ok=True)  # no error if the folder already exists
print("Create local folder {}".format(tmp_dir))
wait_for_s3_object(s3_bucket, intermediate_folder_key, tmp_dir) 

Training job: rl-stabebaselines-pacman-2022-07-22-02-48-05-541
S3 job path: s3://sagemaker-us-east-1-596030579944/rl-stabebaselines-pacman-2022-07-22-02-48-05-541
Output.tar.gz location: s3://sagemaker-us-east-1-596030579944/rl-stabebaselines-pacman-2022-07-22-02-48-05-541/output/output.tar.gz
Intermediate folder path: s3://sagemaker-us-east-1-596030579944/rl-stabebaselines-pacman-2022-07-22-02-48-05-541/output/intermediate
Create local folder /tmp/rl-stabebaselines-pacman-2022-07-22-02-48-05-541
Waiting for s3://sagemaker-us-east-1-596030579944/rl-stabebaselines-pacman-2022-07-22-02-48-05-541/output/intermediate...
Downloading rl-stabebaselines-pacman-2022-07-22-02-48-05-541/output/intermediate/1.monitor.csv
Downloading rl-stabebaselines-pacman-2022-07-22-02-48-05-541/output/intermediate/rl_model.zip
Downloading rl-stabebaselines-pacman-2022-07-22-02-48-05-541/output/intermediate/rl_out.meta.json
Downloading rl-stabebaselines-pacman-2022-07-22-02-48-05-541/output/intermediate/rl_out.m

['/tmp/rl-stabebaselines-pacman-2022-07-22-02-48-05-541/1.monitor.csv',
 '/tmp/rl-stabebaselines-pacman-2022-07-22-02-48-05-541/rl_model.zip',
 '/tmp/rl-stabebaselines-pacman-2022-07-22-02-48-05-541/rl_out.meta.json',
 '/tmp/rl-stabebaselines-pacman-2022-07-22-02-48-05-541/rl_out.mp4']
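The downloaded `1.monitor.csv` can be used to check training progress. The sketch below assumes the layout written by the Stable Baselines3 `Monitor` wrapper: a JSON header line starting with `#`, followed by CSV columns `r` (episode reward), `l` (episode length), and `t` (elapsed time). The sample rows are made up for illustration and are not results from this job.

```python
import csv
import io
import json

# Made-up sample in the assumed Monitor layout; not real training results.
SAMPLE = """#{"t_start": 0.0, "env_id": "MsPacman-v0"}
r,l,t
210.0,640,3.9
250.0,700,8.1
"""


def summarize_monitor(text):
    """Return (header dict, mean episode reward, mean episode length)."""
    stream = io.StringIO(text)
    header = json.loads(stream.readline().lstrip("#"))
    rows = list(csv.DictReader(stream))
    mean_reward = sum(float(row["r"]) for row in rows) / len(rows)
    mean_length = sum(int(row["l"]) for row in rows) / len(rows)
    return header, mean_reward, mean_length


header, mean_reward, mean_length = summarize_monitor(SAMPLE)
print(header["env_id"], mean_reward, mean_length)
```

To apply it to the real file, read the contents of `1.monitor.csv` from `tmp_dir` and pass the text to `summarize_monitor`.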

### RL video output

In [12]:
import io
import base64
# Read the video as bytes and embed it in the notebook as a base64 data URI
with io.open("{}/rl_out.mp4".format(tmp_dir), 'rb') as f:
    video = f.read()
encoded = base64.b64encode(video)
HTML(data='''<video alt="test" controls>
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii')))

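### Tuning Stable Baselines3 parameters (optional)

You can adjust the parameters to use more machines and more steps for better results:
* `instance_count`: 10
* `instance_type`: ml.c5.xlarge
* `num_timesteps`: 1e7

Training the model with the settings above took 40 minutes. You can get similar results with fewer instances and a longer training time.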

In [None]:
import io
import base64
# tmp_dir is the local download folder defined earlier in the notebook
with io.open("{}/rl_out.mp4".format(tmp_dir), 'rb') as f:
    video = f.read()
encoded = base64.b64encode(video)
HTML(data='''<video alt="test" controls>
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii')))