<font size=6><center><big><b>Adapting Your Own Training Image</b></big></center></font>

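Customers usually already have their own training scripts or training containers. To integrate them with SageMaker, the scripts and containers must be packaged and modified to meet SageMaker's requirements.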

This example shows how to adapt an existing training script so that it runs on SageMaker.

### Create a Hello World algorithm

Run the magic command below to create a training script named train.py. We will package this script into the Docker image later.

In [1]:
%%writefile train.py

import tensorflow as tf

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=1)

model.evaluate(x_test, y_test)

Overwriting train.py


### Create the Dockerfile

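We start from the standard TensorFlow image on the public Docker Hub. On its own, SageMaker cannot use this image for training. The Dockerfile needs to do three things:
- Install the sagemaker-training library (formerly sagemaker-containers). SageMaker relies on this library for many of its extended features.
- Copy the training script to /opt/ml/code/ inside the container. This path is a convention the library expects.
- Set the SAGEMAKER_PROGRAM environment variable, which tells SageMaker where the training entry point is. Reading this variable and launching the training script are both handled by sagemaker-training.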

In [2]:
%%writefile Dockerfile
FROM tensorflow/tensorflow:2.0.0a0

RUN pip install sagemaker-training

# Copies the training code inside the container
COPY train.py /opt/ml/code/train.py

# Defines train.py as script entry point
ENV SAGEMAKER_PROGRAM train.py

Overwriting Dockerfile


### Build Image

The `-with-lib` suffix in the tag indicates that this image has sagemaker-training installed.

In [3]:
! docker build -t tf-2.0-with-lib .

Sending build context to Docker daemon  80.38kB
Step 1/4 : FROM tensorflow/tensorflow:2.0.0a0
2.0.0a0: Pulling from tensorflow/tensorflow

2c1070cd: Pulling fs layer
74db61f1: Pulling fs layer
cb72e5c9: Pulling fs layer
7a67709e: Pulling fs layer
7a67709e: Waiting
5d1c3937: Pulling fs layer
c3f56b0a: Pulling fs layer
0fc18b45: Pulling fs layer
97d79d36: Pulling fs layer
Digest: sha256:c51e5432db0faaca6a25025f8fbc29ee14b0b6bbb46ad4fd48e24a0901b9dde4

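### Run the training job

SageMaker can use the locally built image or pull an image directly from Docker Hub, so the image is specified through the image_name parameter. (In version 2 of the SageMaker Python SDK, image_name, train_instance_count, and train_instance_type were renamed to image_uri, instance_count, and instance_type.)

The running container has many additional environment variables. These are set by the sagemaker-training library (formerly sagemaker-containers), which passes information about the job into the container as environment variables so that the training code can use them directly. The log below also shows how the train.py entry point is actually executed: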

/usr/bin/python train.py --batch_size 128 --epochs 5 --learning_rate 0.01 --other_para 0.1

Here sagemaker-training passes the hyperparameters to the training script as command-line arguments, so the script can parse them with argparse.
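The train.py in this notebook ignores these arguments. A minimal sketch of how a script could pick them up is shown here. The flag names mirror the hyperparameters dictionary used in this notebook, the default values are placeholders, and the fallback for `SM_MODEL_DIR` is an assumption for running outside a SageMaker container.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters that sagemaker-training passes as CLI flags."""
    parser = argparse.ArgumentParser()
    # Flag names mirror the keys of the hyperparameters dict in this notebook;
    # the defaults are illustrative placeholders.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--other_para", type=float, default=0.0)
    # SM_MODEL_DIR is set by the toolkit inside the container; the fallback
    # path is an assumption for running the script elsewhere.
    parser.add_argument(
        "--model_dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    # parse_known_args tolerates extra flags instead of exiting with an error.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Inside train.py, `args = parse_args()` could then feed `model.fit(x_train, y_train, epochs=args.epochs, batch_size=args.batch_size)`.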

In [4]:
import sagemaker
from sagemaker.estimator import Estimator

hyperparameters = {'epochs': 5, 'batch_size': 128, 'learning_rate': 0.01, 'other_para':0.1}

estimator = Estimator(image_name='tf-2.0-with-lib',
                      role=sagemaker.get_execution_role(),
                      hyperparameters=hyperparameters,
                      train_instance_count=1,
                      train_instance_type='local')

estimator.fit()

Creating tmpnjz5duxl_algo-1-90j4h_1 ... done
Attaching to tmpnjz5duxl_algo-1-90j4h_1
algo-1-90j4h_1  |   from cryptography.hazmat.backends import default_backend
algo-1-90j4h_1  | 2020-10-16 15:38:45,716 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
algo-1-90j4h_1  | 2020-10-16 15:38:45,728 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
algo-1-90j4h_1  | 2020-10-16 15:38:45,740 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
algo-1-90j4h_1  | 2020-10-16 15:38:45,749 sagemaker-training-toolkit INFO     Invoking user script
algo-1-90j4h_1  | 
algo-1-90j4h_1  | Training Env:
algo-1-90j4h_1  | 
algo-1-90j4h_1  | {
algo-1-90j4h_1  |     "module_dir": "/opt/ml/code", 
algo-1-90j4h_1  |     "channel_input_dirs": {}, 
algo-1-90j4h_1  |     "resource_config": {
algo-1

### Follow-up question: what happens if sagemaker-training is not installed?

In [5]:
%%writefile Dockerfile
FROM tensorflow/tensorflow:2.0.0a0

# RUN pip install sagemaker-training

# Copies the training code inside the container
COPY train.py /opt/ml/code/train.py

# Defines train.py as script entry point
ENV SAGEMAKER_PROGRAM train.py

Overwriting Dockerfile


In [6]:
! docker build -t tf-2.0-without-lib .

Sending build context to Docker daemon  96.26kB
Step 1/3 : FROM tensorflow/tensorflow:2.0.0a0
 ---> 2ebc856b5e27
Step 2/3 : COPY train.py /opt/ml/code/train.py
 ---> 50103b4804b6
Step 3/3 : ENV SAGEMAKER_PROGRAM train.py
 ---> Running in 1d4f7d39db9f
Removing intermediate container 1d4f7d39db9f
 ---> 9931a95f9000
Successfully built 9931a95f9000
Successfully tagged tf-2.0-without-lib:latest


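### Answer: the job fails

Without sagemaker-training, Docker cannot find the training entry point. Even if you set the environment variable, nothing in the container reads it.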

In [40]:
import sagemaker
from sagemaker.estimator import Estimator

hyperparameters = {'epochs': 5, 'batch_size': 128, 'learning_rate': 0.01, 'other_para':0.1}

estimator = Estimator(image_name='tf-2.0-without-lib',
                      role=sagemaker.get_execution_role(),
                      hyperparameters=hyperparameters,
                      train_instance_count=1,
                      train_instance_type='local')

estimator.fit()

Creating tmpsc_hp6h9_algo-1-2cplg_1 ... error
ERROR: for tmpsc_hp6h9_algo-1-2cplg_1  Cannot start service algo-1-2cplg: OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"train\": executable file not found in $PATH": unknown

ERROR: for algo-1-2cplg  Cannot start service algo-1-2cplg: OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"train\": executable file not found in $PATH": unknown
Encountered errors while bringing up the project.


RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmpsc_hp6h9/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

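As this example shows, when sagemaker-training is not installed in the Dockerfile, the training job fails. The error also contains a useful piece of information:
SageMaker local mode orchestrates the container through docker-compose: ['docker-compose', '-f', '/tmp/tmpsc_hp6h9/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit']

Let's look at the compose file generated for one such run. It runs the `train` command inside the container, and without sagemaker-training installed that entry-point script does not exist.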

In [13]:
! cat /tmp/tmpdn05x743/docker-compose.yaml

networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-dw4a3:
    command: train
    environment:
    - AWS_REGION=us-east-2
    - TRAINING_JOB_NAME=tf-2.0-without-lib-2020-02-19-08-13-15-912
    image: tf-2.0-without-lib
    networks:
      sagemaker-local:
        aliases:
        - algo-1-dw4a3
    stdin_open: true
    tty: true
    volumes:
    - /tmp/tmpdn05x743/algo-1-dw4a3/output:/opt/ml/output
    - /tmp/tmpdn05x743/algo-1-dw4a3/output/data:/opt/ml/output/data
    - /tmp/tmpdn05x743/algo-1-dw4a3/input:/opt/ml/input
    - /tmp/tmpdn05x743/model:/opt/ml/model
version: '2.3'


### Follow-up question: can I still train on SageMaker without installing sagemaker-training?

In [24]:
%%writefile train1.py
#!/usr/bin/env python

import tensorflow as tf

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=1)

model.evaluate(x_test, y_test)

Writing train1.py


Since we know that SageMaker invokes the container's `train` command, it is enough to name the training script `train`, make it executable, and add its directory to PATH.
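The lookup that the container runtime performs for `train` can be illustrated outside Docker. The sketch below uses a temporary directory as a stand-in for /opt/ml/code (an assumption so that it runs anywhere on Linux or macOS) and `shutil.which` as an approximation of how an executable name is resolved against PATH. It is only an illustration, not what SageMaker itself runs.

```python
import os
import shutil
import stat
import tempfile


def resolve(command, directories):
    """Return the path `command` resolves to when searching only `directories`."""
    return shutil.which(command, path=os.pathsep.join(directories))


# Temporary directory standing in for /opt/ml/code (assumption for this sketch).
code_dir = tempfile.mkdtemp()
script = os.path.join(code_dir, "train")
with open(script, "w") as f:
    # The shebang line lets the file be executed directly as a command.
    f.write("#!/usr/bin/env python\nprint('training')\n")

# A freshly written file has no execute bit, so the lookup finds nothing.
before_chmod = resolve("train", [code_dir])

# Equivalent of `chmod +x`: add execute permission for user, group, and others.
mode = os.stat(script).st_mode
os.chmod(script, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
after_chmod = resolve("train", [code_dir])
```

This is why the Dockerfile needs both the `chmod` step and the `PATH` entry, and why train1.py starts with a shebang line.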

In [28]:
%%writefile Dockerfile
FROM tensorflow/tensorflow:2.0.0a0

# RUN pip install sagemaker-containers

# Copies the training code inside the container
# The script is copied as "train" so that it can be executed as a command
COPY train1.py /opt/ml/code/train

RUN chmod 777 /opt/ml/code/train

# Add this directory to PATH: SageMaker runs `docker run CONTAINER_ID train`, so the train command must be resolvable on PATH
ENV PATH="/opt/ml/code:${PATH}"

# Set the working directory
WORKDIR /opt/ml/code

# Defines train.py as script entry point
# Without sagemaker-training this environment variable has no effect: ENV SAGEMAKER_PROGRAM train.py

Overwriting Dockerfile


In [29]:
! docker build -t tf-2.0-without-lib-fixed .

Sending build context to Docker daemon  40.96kB
Step 1/5 : FROM tensorflow/tensorflow:2.0.0a0
 ---> 2ebc856b5e27
Step 2/5 : COPY train1.py /opt/ml/code/train
 ---> Using cache
 ---> 77a7d58ad757
Step 3/5 : RUN chmod 777 /opt/ml/code/train
 ---> Using cache
 ---> c4c6a9316d0b
Step 4/5 : ENV PATH="/opt/ml/code:${PATH}"
 ---> Using cache
 ---> 5afe63d43b22
Step 5/5 : WORKDIR /opt/ml/code
 ---> Using cache
 ---> 1f9e7dfef833
Successfully built 1f9e7dfef833
Successfully tagged tf-2.0-without-lib-fixed:latest


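### Answer: yes

However, many features are missing. For example, you no longer see the extra information that SageMaker logs, and much of the job configuration is not exposed as environment variables inside the container.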
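Without the toolkit, the hyperparameters are also not passed as command-line flags. SageMaker still writes them to a JSON file under /opt/ml/input/config/ inside the container, so the script can read them itself. The following is a minimal sketch. The helper names `load_hyperparameters` and `get_param` are hypothetical, and it assumes values may arrive as strings and need explicit casting.

```python
import json
import os

# Standard SageMaker location of the hyperparameter file inside the container.
HYPERPARAMETERS_PATH = "/opt/ml/input/config/hyperparameters.json"


def load_hyperparameters(path=HYPERPARAMETERS_PATH):
    """Read the hyperparameter file; return an empty dict if it is missing."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def get_param(params, name, cast, default):
    """Cast a hyperparameter explicitly, since values may arrive as strings."""
    if name not in params:
        return default
    return cast(params[name])
```

In the `train` script this could be used as `params = load_hyperparameters()` followed by `epochs = get_param(params, "epochs", int, 1)`.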

In [30]:
import sagemaker
from sagemaker.estimator import Estimator

hyperparameters = {'epochs': 5, 'batch_size': 128, 'learning_rate': 0.01, 'other_para':0.1}

estimator = Estimator(image_name='tf-2.0-without-lib-fixed',
                      role=sagemaker.get_execution_role(),
                      hyperparameters=hyperparameters,
                      train_instance_count=1,
                      train_instance_type='local')

estimator.fit()

Creating tmp20oye3fi_algo-1-3vg62_1 ... done
Attaching to tmp20oye3fi_algo-1-3vg62_1
algo-1-3vg62_1  | Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
algo-1-3vg62_1  | 2020-02-19 10:17:19.418049: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
algo-1-3vg62_1  | 2020-02-19 10:17:19.439793: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
algo-1-3vg62_1  | 2020-02-19 10:17:19.440077: I tensorflow/compiler/xla/service/service.cc:162] XLA service 0x4c1ca90 executing computations on platform Host. Devices:
algo-1-3vg62_1  | 2020-02-19 10:17:19.440105: I tensorflow/core/platform/service.cc:169]   StreamExecutor device (0): <undefined>, <undefined>
tmp20oye3fi_algo-1-3vg62_1 exited with code 0
Aborting on container exit...
===== Job Comp