# Reinforcement Learning Training on Amazon SageMaker with Stable Baselines

## Overview


<img src="https://stable-baselines.readthedocs.io/en/master/_static/logo.png" width="300">

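[OpenAI Gym](https://gym.openai.com) is an open-source reinforcement learning toolkit. It provides a standard interface and a set of environments that let us run reinforcement learning experiments quickly.

[Stable baselines](https://stable-baselines.readthedocs.io/en/master/) is an open-source project of reinforcement learning algorithms that improves on the OpenAI Baselines implementations.

In this lab we use an algorithm that ships with Stable Baselines to train an agent on the Atari game Ms. Pac-Man, [**MsPacman-v0**](https://gym.openai.com/envs/MsPacman-v0/), which is one of the environments bundled with OpenAI Gym.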





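## How Reinforcement Learning Works

Reinforcement learning (RL) is an area of machine learning concerned with how an agent should act in an environment to maximize its expected cumulative reward. It is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning. Unlike supervised learning, reinforcement learning does not need labeled input/output pairs, nor does it need sub-optimal actions to be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge), known as the exploration-exploitation trade-off.
[Wikipedia]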
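
As a rough illustration of the exploration-exploitation trade-off, the sketch below plays a multi-armed bandit with an epsilon-greedy policy. It is not part of this lab's training code: the function name `run_epsilon_greedy`, the arm means, and the unit-variance Gaussian rewards are assumptions chosen only for the example.

```python
import random


def run_epsilon_greedy(arm_means, epsilon, steps, seed=0):
    """Play a Gaussian multi-armed bandit with an epsilon-greedy policy."""
    rng = random.Random(seed)
    counts = [0] * len(arm_means)        # number of pulls per arm
    estimates = [0.0] * len(arm_means)   # running mean reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm.
            arm = rng.randrange(len(arm_means))
        else:
            # Exploit: pick the arm with the best estimate so far.
            arm = max(range(len(arm_means)), key=lambda a: estimates[a])
        reward = rng.gauss(arm_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the running mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward


estimates, counts, total = run_epsilon_greedy([0.1, 0.5, 0.9], epsilon=0.1, steps=5000)
print("pull counts per arm:", counts)
```

With `epsilon=0` the policy never explores and can lock onto a poor arm. With `epsilon=1` it never uses what it has learned.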

<img src="src/rl.png">

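## Amazon SageMaker

SageMaker is an end-to-end machine learning platform from Amazon Web Services. It gives data scientists and algorithm engineers ready-to-use compute resources together with the machine learning and deep learning runtime environments they need. You can use its built-in algorithms directly, or submit your own code, to quickly provision compute resources for iterating on a model and deploying it for inference.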

<img src="src/sm-arch.png">

In [8]:
rl_problem = 'pacman'

## Prerequisites

### Imports

Import the Python libraries we need, along with the helper functions `get_execution_role` and `wait_for_s3_object`.

In [3]:
import sagemaker
import boto3
import sys
import os
import subprocess
from IPython.display import HTML
import time
from time import gmtime, strftime
sys.path.append("common")
from misc import wait_for_s3_object
from docker_utils import build_and_push_docker_image
from sagemaker.rl import RLEstimator

### Set up the S3 bucket

Use the SageMaker SDK to get the default S3 bucket. This bucket stores the model, checkpoints, and other metadata.

In [4]:
sage_session = sagemaker.session.Session()
s3_bucket = sage_session.default_bucket()  
s3_output_path = 's3://{}/'.format(s3_bucket)
print("S3 bucket path: {}".format(s3_output_path))

S3 bucket path: s3://sagemaker-us-west-2-907488872981/


### Define the job/image name variable

We define a prefix variable for the training job and image names: `job_name_prefix`.

In [5]:
# create a descriptive job name 
job_name_prefix = 'rl-stabebaselines-'+rl_problem

### Get the IAM role

Use `get_execution_role()` from the SageMaker SDK to get the role of the SageMaker notebook: `role = sagemaker.get_execution_role()`.

In [6]:
role = sagemaker.get_execution_role()
print("Using IAM role arn: {}".format(role))

Using IAM role arn: arn:aws:iam::907488872981:role/service-role/AmazonSageMaker-ExecutionRole-20190926T171845


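## Build the Docker image

We need to build our own Docker image. The build step takes care of everything:

1. Pull the base image
2. Install build tools such as g++ and cmake
3. Install stable-baselines and the dependencies it needs, such as OpenMPI
4. Push the image to Amazon ECR

This step usually takes 3 to 10 minutes, depending on your network speed and the notebook instance type.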



In [7]:
%%time
instance_type = 'ml.c5.xlarge'
cpu_or_gpu = 'gpu' if instance_type.startswith('ml.p') or instance_type.startswith('ml.g') else 'cpu'
repository_short_name = "sagemaker-roboschool-stablebaselines-%s" % cpu_or_gpu
docker_build_args = { 
    'AWS_REGION': boto3.Session().region_name,
}
custom_image_name = build_and_push_docker_image(repository_short_name, build_args=docker_build_args)
print("Using ECR image %s" % custom_image_name)

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Logged into ECR
Building docker image sagemaker-roboschool-stablebaselines-cpu from Dockerfile
$ docker build -t sagemaker-roboschool-stablebaselines-cpu -f Dockerfile . --build-arg AWS_REGION=us-west-2
Sending build context to Docker daemon  1.147MB
Step 1/42 : ARG AWS_REGION
Step 2/42 : FROM 520713654638.dkr.ecr.${AWS_REGION}.amazonaws.com/sagemaker-rl-tensorflow:coach0.11.0-cpu-py3
coach0.11.0-cpu-py3: Pulling from sagemaker-rl-tensorflow
7b8b6451c85f: Pulling fs layer
ab4d1096d9ba: Pulling fs layer
e6797d1788ac: Pulling fs layer
e25c5c290bde: Pulling fs layer
cd105da4078d: Pulling fs layer
f2c8922a63b9: Pulling fs layer
58eea52eed06: Pulling fs layer
adcd0d06f606: Pulling fs layer
92301bd3f2ee: Pulling fs layer
e8a5f78e0876: Pulling fs layer
1da3b4231414: Pulling fs layer
f641cf6bbcd6: Pulling fs layer
1e7cf85187ea: Pulling fs layer
cd105da4078d: Waiting
8e3ddd1d8078: Pulling fs layer
168

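## Write the training code

#### Configure the RL algorithm hyperparameters

The preset file that configures the RL training job is `preset-pacman.py` in the `./src` directory. With a preset file you can define agent parameters to select a specific agent algorithm. You can also set environment parameters, define the schedule and visualization parameters, and define the graph manager. The preset contains the following hyperparameters required for PPO1 training: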

* `num_timesteps` – (int) number of training steps - Preset: 1e4
* `timesteps_per_actorbatch` – (int) timesteps per actor per update - Preset: 2048
* `clip_param` – (float) clipping parameter epsilon - Preset: 0.2
* `entcoeff` – (float) the entropy loss weight - Preset: 0.0
* `optim_epochs` – (float) the optimizer’s number of epochs - Preset: 10
* `optim_stepsize` – (float) the optimizer’s stepsize - Preset: 3e-4
* `optim_batchsize` – (int) the optimizer’s batch size - Preset: 64
* `gamma` – (float) discount factor - Preset: 0.99
* `lam` – (float) advantage estimation - Preset: 0.95
* `schedule` – (str) The type of scheduler for the learning rate update (‘linear’, ‘constant’, ‘double_linear_con’, ‘middle_drop’ or ‘double_middle_drop’) - Preset: linear
* `verbose` – (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug - Preset: 1
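
To make `clip_param`, `gamma`, and `lam` more concrete, here is a minimal plain-Python sketch of the two computations they usually control in PPO: the clipped surrogate objective and generalized advantage estimation (GAE). It follows the textbook formulas and is not the Stable Baselines implementation. The function names are hypothetical, and the rollout is assumed to contain no episode boundaries.

```python
def clipped_surrogate(ratios, advantages, clip_param=0.2):
    """Mean PPO clipped surrogate objective (a quantity to maximize).

    `ratios` are new-policy / old-policy action probabilities.
    """
    terms = []
    for ratio, adv in zip(ratios, advantages):
        # Keep the ratio inside [1 - clip_param, 1 + clip_param].
        clipped_ratio = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
        # Take the more pessimistic of the clipped and unclipped terms.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one rollout.

    `values` has one more entry than `rewards`: the last entry is the value
    estimate of the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Discounted sum of TD errors, weighted by gamma * lam.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


print(gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=1.0))
print(clipped_surrogate([1.5, 0.5], [1.0, -1.0], clip_param=0.2))
```

A larger `clip_param` lets each update move the policy further from the one that collected the data. `lam` closer to 1 lowers bias in the advantage estimate at the cost of higher variance.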

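The full list of PPO1 hyperparameters and their detailed documentation is available here: https://stable-baselines.readthedocs.io/en/master/modules/ppo1.html


The `RLSTABLEBASELINES_PRESET` hyperparameter specifies which preset hyperparameter definition file to use. Here we use `"RLSTABLEBASELINES_PRESET":"preset-{}.py".format(rl_problem)`.

#### View the preset-pacman.py hyperparameter definition file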

In [9]:
!pygmentize src/preset-{rl_problem}.py

import argparse

from sagemaker_rl.stable_baselines_launcher import SagemakerStableBaselinesPPO1Launcher, create_env


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--output_path', default="/opt/ml/output/intermediate/", type=str)
    parser.add_argument('--num_timesteps', default=1e4) #default 1e4
    parser.add_argument('--timesteps_per_actorbatch', default=2048, type=int)
    parser.add_argument('--clip_param', default=0.2, 

#### Write the training script

The training code is in the file `train_stable_baselines.py` in the `./src` directory.

In [10]:
!pygmentize src/train_stable_baselines.py

import argparse

from sagemaker_rl.mpi_launcher import MPILauncher


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--RLSTABLEBASELINES_PRESET', required=True, type=str)
    parser.add_argument('--output_path', default="/opt/ml/output/intermediate/", type=str)
    parser.add_argument('--instance_type', type=str)

    return parser.parse_known_args()


if __name__ == "__main__":
    args, unknown_ar

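## Create the RL training job with the SageMaker SDK

You can create the SageMaker training job on either GPU or CPU instances. The SageMaker SDK provides the `RLEstimator` class for creating RL training jobs.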

1. Specify the source directory where the environment, presets and training code are uploaded.
2. Specify the entry point as the training code 
3. Specify the choice of RL toolkit and framework. This automatically resolves to the ECR path for the RL Container. 
4. Define the training parameters such as the instance count, the S3 path for output and the job name. 
5. Specify the hyperparameters for the RL agent algorithm. The `RLSTABLEBASELINES_PRESET` can be used to specify the RL agent algorithm you want to use. 
6. Define the metrics definitions that you are interested in capturing in your logs. These can also be visualized in CloudWatch and SageMaker Notebooks. 
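
To see what a metric definition captures, the sketch below applies the `EpisodesRewardMean` regex from this notebook to one log line using Python's `re` module. The sample line is hypothetical: it assumes an MPI rank prefix followed by one row of the training table. The regex engine SageMaker uses may differ in details from `re`, so treat this as an approximation for checking patterns locally.

```python
import re

# Same pattern as the EpisodesRewardMean metric definition, written as a raw string.
PATTERN = r"\[.*,.*\]\<stdout\>\:\| *EpRewMean *\| *([-+]?[0-9]*\.?[0-9]*) *\|"

# Hypothetical log line; the real format depends on the training container output.
sample_line = "[1,0]<stdout>:| EpRewMean       | 210.5    |"

match = re.search(PATTERN, sample_line)
# The single capture group holds the numeric value that becomes the metric.
value = float(match.group(1)) if match else None
print(value)
```

Only the capture group is recorded, so each pattern should contain exactly one group around the number.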

Note that every preset hyperparameter in `preset-pacman.py` can be overridden through `hyperparameters`.
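
The override works because the preset declares its defaults with `argparse`. The sketch below shows the idea, assuming the hyperparameters reach the script as `--key value` command-line arguments (the usual convention for SageMaker script mode). `parse_preset_args` and `hyperparameters_to_argv` are hypothetical helpers, only a subset of the preset's arguments is reproduced, and `type=float` on `num_timesteps` is added here for the example.

```python
import argparse


def parse_preset_args(argv):
    """Parse a subset of the preset's arguments; unknown flags are ignored."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_timesteps", default=1e4, type=float)
    parser.add_argument("--timesteps_per_actorbatch", default=2048, type=int)
    parser.add_argument("--clip_param", default=0.2, type=float)
    # parse_known_args tolerates flags meant for other parts of the launcher.
    args, _unknown = parser.parse_known_args(argv)
    return args


def hyperparameters_to_argv(hyperparameters):
    """Hypothetical helper: turn a dict into `--key value` arguments."""
    argv = []
    for key, value in hyperparameters.items():
        argv += ["--{}".format(key), str(value)]
    return argv


defaults = parse_preset_args([])
overridden = parse_preset_args(hyperparameters_to_argv({
    "RLSTABLEBASELINES_PRESET": "preset-pacman.py",
    "num_timesteps": 1e7,
    "clip_param": 0.1,
}))
print(defaults.num_timesteps, overridden.num_timesteps)
```

Arguments that are not passed, such as `timesteps_per_actorbatch` here, keep the default declared in the preset.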

**Note**: The PPO1 algorithm relies on MPI. For this lab, set the instance count `instance_count` to `1`.

In [11]:
custom_image_name

'907488872981.dkr.ecr.us-west-2.amazonaws.com/sagemaker-roboschool-stablebaselines-cpu'

In [12]:
%%time

estimator = RLEstimator(entry_point="train_stable_baselines.py",
                        source_dir='src',
                        dependencies=["common/sagemaker_rl"],
                        image_uri=custom_image_name,
                        role=role,
                        instance_type=instance_type,
                        use_spot_instances=True,
                        max_wait = (72 * 60 * 60),
                        instance_count=1,
                        output_path=s3_output_path,
                        base_job_name=job_name_prefix,
                        hyperparameters={
                            "RLSTABLEBASELINES_PRESET":"preset-{}.py".format(rl_problem),
                            "num_timesteps":1e4,
                            "instance_type":instance_type
                        },
                        metric_definitions= [
                            {
                                "Name":"EpisodesLengthMean",
                                "Regex":r"\[.*,.*\]\<stdout\>\:\| *EpLenMean *\| *([-+]?[0-9]*\.?[0-9]*) *\|"
                            },
                            {
                                "Name":"EpisodesRewardMean",
                                "Regex":r"\[.*,.*\]\<stdout\>\:\| *EpRewMean *\| *([-+]?[0-9]*\.?[0-9]*) *\|"
                            },
                            {
                                "Name":"EpisodesSoFar",
                                "Regex":r"\[.*,.*\]\<stdout\>\:\| *EpisodesSoFar *\| *([-+]?[0-9]*\.?[0-9]*) *\|"
                            }
                        ]
                    )

estimator.fit(wait=True)

2021-05-14 06:51:00 Starting - Starting the training job...
2021-05-14 06:51:23 Starting - Launching requested ML instancesProfilerReport-1620975060: InProgress
......
2021-05-14 06:52:29 Starting - Preparing the instances for training......
2021-05-14 06:53:25 Downloading - Downloading input data
2021-05-14 06:53:25 Training - Downloading the training image...........
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-05-14 06:55:10,983 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2021-05-14 06:55:10,987 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2021-05-14 06:55:11,180 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2021-05-14 06:55:11,190 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {


KeyboardInterrupt: 

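## Visualization

Reinforcement learning training usually takes a long time, so we need several ways to track the progress of a running training job. During training, the job can write intermediate output to S3, and we can use that output for monitoring or analysis.

### Fetch the videos produced during training
During training, videos of the environment can be written to S3. Next, we fetch all available videos and render the last one in the notebook.
You can interrupt the previous cell first, because the whole training job has already been submitted to SageMaker.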
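
The next cell relies on the `wait_for_s3_object` helper from `common/misc.py`, whose implementation is not shown in this notebook. As a rough idea of what such a helper has to do, the sketch below polls until objects appear under a prefix. The names `wait_for_keys` and `make_s3_lister` are hypothetical; the only boto3 call used is `list_objects_v2`, and a single call returns at most one page of keys.

```python
import time


def wait_for_keys(list_keys, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll `list_keys()` until it returns a non-empty list or time runs out."""
    waited = 0.0
    while True:
        keys = list_keys()
        if keys:
            return keys
        if waited >= timeout_seconds:
            raise TimeoutError("no objects appeared within the timeout")
        sleep(poll_seconds)
        waited += poll_seconds


def make_s3_lister(bucket, prefix):
    """Build a lister backed by boto3 (imported lazily so the sketch loads without it)."""
    import boto3

    client = boto3.client("s3")

    def list_keys():
        # Returns the keys on the first result page under the prefix.
        response = client.list_objects_v2(Bucket=bucket, Prefix=prefix)
        return [item["Key"] for item in response.get("Contents", [])]

    return list_keys
```

In this notebook it could be called as `wait_for_keys(make_s3_lister(s3_bucket, intermediate_folder_key))`. Unlike `wait_for_s3_object`, this sketch does not download anything.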

In [22]:

job_name = estimator.latest_training_job.job_name
print("Training job: %s" % job_name)

s3_url = "s3://{}/{}".format(s3_bucket,job_name)

output_tar_key = "{}/output/output.tar.gz".format(job_name)

intermediate_folder_key = "{}/output/intermediate".format(job_name)
output_url = "s3://{}/{}".format(s3_bucket, output_tar_key)
intermediate_url = "s3://{}/{}".format(s3_bucket, intermediate_folder_key)

print("S3 job path: {}".format(s3_url))
print("Output.tar.gz location: {}".format(output_url))
print("Intermediate folder path: {}".format(intermediate_url))
    
tmp_dir = "/tmp/{}".format(job_name)
os.makedirs(tmp_dir, exist_ok=True)  # create the folder without shelling out; no error if it exists
print("Create local folder {}".format(tmp_dir))
wait_for_s3_object(s3_bucket, intermediate_folder_key, tmp_dir) 

Training job: rl-stabebaselines-pacman-2021-05-14-06-51-00-187
S3 job path: s3://sagemaker-us-west-2-907488872981/rl-stabebaselines-pacman-2021-05-14-06-51-00-187
Output.tar.gz location: s3://sagemaker-us-west-2-907488872981/rl-stabebaselines-pacman-2021-05-14-06-51-00-187/output/output.tar.gz
Intermediate folder path: s3://sagemaker-us-west-2-907488872981/rl-stabebaselines-pacman-2021-05-14-06-51-00-187/output/intermediate
Create local folder /tmp/rl-stabebaselines-pacman-2021-05-14-06-51-00-187
Waiting for s3://sagemaker-us-west-2-907488872981/rl-stabebaselines-pacman-2021-05-14-06-51-00-187/output/intermediate...
Downloading rl-stabebaselines-pacman-2021-05-14-06-51-00-187/output/intermediate/0.monitor.csv
Downloading rl-stabebaselines-pacman-2021-05-14-06-51-00-187/output/intermediate/rl_out.meta.json
Downloading rl-stabebaselines-pacman-2021-05-14-06-51-00-187/output/intermediate/rl_out.mp4


['/tmp/rl-stabebaselines-pacman-2021-05-14-06-51-00-187/0.monitor.csv',
 '/tmp/rl-stabebaselines-pacman-2021-05-14-06-51-00-187/rl_out.meta.json',
 '/tmp/rl-stabebaselines-pacman-2021-05-14-06-51-00-187/rl_out.mp4']
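
One of the downloaded files is `0.monitor.csv`. Assuming it follows the layout written by the Stable Baselines `Monitor` wrapper (a first line holding JSON metadata after a `#`, then CSV columns `r`, `l`, `t` for episode reward, length, and elapsed time), a small summary could be computed as sketched below. The function name `summarize_monitor` and the sample values are made up for illustration.

```python
import csv
import io
import json


def summarize_monitor(text):
    """Return metadata and mean episode reward/length from Monitor-style CSV text."""
    lines = text.splitlines()
    metadata = {}
    if lines and lines[0].startswith("#"):
        # The first line is a JSON comment with run metadata.
        metadata = json.loads(lines[0][1:])
        lines = lines[1:]
    rows = list(csv.DictReader(io.StringIO("\n".join(lines))))
    rewards = [float(row["r"]) for row in rows]
    lengths = [int(row["l"]) for row in rows]
    return {
        "metadata": metadata,
        "episodes": len(rows),
        "mean_reward": sum(rewards) / len(rewards) if rewards else None,
        "mean_length": sum(lengths) / len(lengths) if lengths else None,
    }


# Made-up sample in the assumed format; real values come from 0.monitor.csv.
sample = '#{"t_start": 0.0, "env_id": "MsPacman-v0"}\nr,l,t\n210.0,500,12.5\n390.0,700,30.1\n'
summary = summarize_monitor(sample)
print(summary["episodes"], summary["mean_reward"], summary["mean_length"])
```

For the real file, pass in the contents of `"{}/0.monitor.csv".format(tmp_dir)`.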

### RL video output

In [23]:
import io
import base64
video = io.open("{}/rl_out.mp4".format(tmp_dir), 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''<video alt="test" controls>
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii')))
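
Since the same embedding logic is needed again later, it can be wrapped in a small function. This is one possible refactoring, not code from the original notebook; `video_html` is a hypothetical name. It opens the file read-only and closes it once the bytes are read.

```python
import base64


def video_html(path):
    """Return an HTML <video> snippet with the MP4 file at `path` embedded as base64."""
    with open(path, "rb") as handle:  # read-only binary mode, closed on exit
        encoded = base64.b64encode(handle.read()).decode("ascii")
    return (
        '<video alt="test" controls>'
        '<source src="data:video/mp4;base64,{0}" type="video/mp4" />'
        "</video>"
    ).format(encoded)
```

In a notebook cell, `HTML(data=video_html("{}/rl_out.mp4".format(tmp_dir)))` renders the player. Embedding as base64 inflates the notebook size, so it suits short clips best.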

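### Tuning Stable Baselines parameters (optional)

You can adjust the Stable Baselines parameters to use more machines and more steps for better results:
* `instance_count`: 10
* `instance_type`: ml.c5.xlarge
* `num_timesteps`: 1e7

Training the model with the settings above took 40 minutes. You can get similar output with fewer instances and a longer training time.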

In [None]:
import io
import base64
video = io.open("{}/rl_out.mp4".format(tmp_dir), 'r+b').read()  # fill in the download folder, as in the earlier cell
encoded = base64.b64encode(video)
HTML(data='''<video alt="test" controls>
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii')))