# Learn how to use LightGBM in a custom container and understand how SageMaker works

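Estimated time: 2 hours

Contents
* Custom container (local training, local inference, training job, inference job)
* Introducing the SageMaker Training Toolkit (moving the code out of the container): local training, local inference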


## Environment
This notebook was verified on a SageMaker notebook instance.
* Instance type: ml.t3.medium
* Kernel: conda_python3

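## Contents

* Build a custom container with LightGBM (pattern 3: https://aws.amazon.com/jp/blogs/news/sagemaker-custom-containers-pattern-training/)
    * Build it on a SageMaker notebook instance
    * Go inside the container and inspect it
    * Register it in ECR

* Run a SageMaker training job
    * Explanation of the SageMaker conventions used in train
    * Run in local mode
        * Local inference
    * Run as a training job (without wait)
        * Build an inference endpoint and run inference (skipped)
    * [Exercise] Specify train.py from outside the container and run a training job (local mode)
        * Pattern 2: build a custom container with the Training Toolkit (do not COPY train; pip install the toolkit)
        * Specify parameters. [Error] train gets executed instead (under investigation)
        * Try running train.sh
        * Train with train.py in local mode
        * Local inference
* Build a custom container with LightGBM + SageMaker Toolkit (advanced)
    * Run in local mode
    * Run as a training job
    * Observe the differences in output
* (Bonus) requirements.txt with the pattern 1 built-in container (this is not a custom container, so it was really pattern 0)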

In [1]:
import sys
# Python version information
sys.version

'3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51) \n[GCC 9.4.0]'


In [2]:
# Check the Python version (using a system command)
!python -V

Python 3.8.12



In [3]:
import sagemaker

print('Current SageMaker Python SDK Version ={0}'.format(sagemaker.__version__))

Current SageMaker Python SDK Version =2.109.0



## Library imports

In [4]:
# This is a sample Python program that trains a simple LightGBM Regression model, and then performs inference.
# This implementation will work on your local computer.
#
# Prerequisites:
#   1. Install required Python packages:
#       pip install boto3 sagemaker pandas scikit-learn
#       pip install 'sagemaker[local]'
#   2. Docker Desktop has to be installed on your computer, and running.
#   3. Open terminal and run the following commands:
#       docker build  -t sagemaker-lightgbm-regression-local container/.
########################################################################################################################


In [6]:
import pandas as pd
from sagemaker.estimator import Estimator
from sagemaker.local import LocalSession
from sagemaker.predictor import csv_serializer
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

In [8]:
sagemaker_session = LocalSession()
sagemaker_session.config = {'local': {'local_code': True}}

# For local training a dummy role will be sufficient
role = 'arn:aws:iam::111111111111:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001'

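# 1. Data preparation

We use the Boston housing prices dataset. Note that `load_boston` is deprecated in scikit-learn 1.0 and was removed in 1.2; this notebook runs on 1.0.1.

Reference sample:
https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/lightgbm_bring_your_own_container_local_training_and_serving/lightgbm_bring_your_own_container_local_training_and_serving.py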

In [12]:
import sklearn
sklearn.__version__

'1.0.1'

In [9]:
data = load_boston()

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=45)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=45)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

valX = pd.DataFrame(X_val, columns=data.feature_names)
valX['target'] = y_val

testX = pd.DataFrame(X_test, columns=data.feature_names)


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

In [13]:
from pathlib import Path

Path('./data/train').mkdir(parents=True, exist_ok=True)
Path('./data/valid').mkdir(parents=True, exist_ok=True)
Path('./data/test').mkdir(parents=True, exist_ok=True)

In [14]:
local_train = './data/train/boston_train.csv'
local_valid = './data/valid/boston_valid.csv'
local_test = './data/test/boston_test.csv'

trainX.to_csv(local_train, header=None, index=False)
valX.to_csv(local_valid, header=None, index=False)
testX.to_csv(local_test, header=None, index=False)

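# 2. Building the custom container

https://aws.amazon.com/jp/blogs/news/sagemaker-custom-containers-pattern-training/

This follows pattern 3 of the SageMaker custom container patterns described in the post above.

The build files are stored in the container directory.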

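## What the files do

* Dockerfile: builds the container image
* Files for training
    * train: the script executed when training starts
* Files for inference
    * serve: the script executed at deployment
    * nginx.conf: configuration file for the nginx web server
    * wsgi.py: likely the thin WSGI wrapper through which gunicorn loads the Flask app
    * predictor.py: defines the functions used for inference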
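
The conventions that a `train` script follows can be sketched as below. This is a minimal sketch, not the actual `train` file in `container/`. It assumes the standard `/opt/ml` paths that appear in the training logs of this notebook, and the function names `cast_value`, `load_hyperparameters` and `list_channel_files` are hypothetical. SageMaker passes every hyperparameter as a string, so the sketch casts numeric-looking values back to numbers.

```python
import json
from pathlib import Path

# Standard SageMaker training prefix. It is a parameter of the functions
# below so that they can also be tried outside a container.
PREFIX = Path("/opt/ml")


def cast_value(value):
    """Turn a string such as '31' or '0.05' into int/float; leave others as-is."""
    for caster in (int, float):
        try:
            return caster(value)
        except (TypeError, ValueError):
            continue
    return value


def load_hyperparameters(prefix=PREFIX):
    """Read input/config/hyperparameters.json and cast numeric strings."""
    path = Path(prefix) / "input" / "config" / "hyperparameters.json"
    with open(path) as f:
        raw = json.load(f)
    return {key: cast_value(value) for key, value in raw.items()}


def list_channel_files(channel, prefix=PREFIX):
    """Return the sorted files SageMaker copied into input/data/<channel>."""
    channel_dir = Path(prefix) / "input" / "data" / channel
    return sorted(p for p in channel_dir.iterdir() if p.is_file())
```

Inside the container, `load_hyperparameters()` and `list_channel_files("train")` would be called with the default prefix; the trained model is then expected under `/opt/ml/model`.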

In [15]:
%%sh

# The name of our algorithm
algorithm_name=sagemaker-lightgbm-regression

cd container

chmod +x lightgbm_regression/train
chmod +x lightgbm_regression/serve

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}


Login Succeeded
Sending build context to Docker daemon   25.6kB
Step 1/10 : FROM ubuntu:16.04
 ---> b6f507652425
Step 2/10 : MAINTAINER Amazon AI <sage-learner@amazon.com>
 ---> Using cache
 ---> c5602b2d98e4
Step 3/10 : ARG CONDA_DIR=/opt/conda
 ---> Using cache
 ---> 618227bc5218
Step 4/10 : ENV PATH $CONDA_DIR/bin:$PATH
 ---> Using cache
 ---> f0165591799f
Step 5/10 : RUN apt-get update &&     apt-get install -y --no-install-recommends         ca-certificates         cmake         build-essential         gcc         g++         git         nginx         wget &&     wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh &&     /bin/bash Miniconda3-latest-Linux-x86_64.sh -f -b -p $CONDA_DIR &&     export PATH="$CONDA_DIR/bin:$PATH" &&     conda config --set always_yes yes --set changeps1 no &&     conda install -q -y numpy scipy scikit-learn pandas flask gevent gunicorn &&     git clone --recursive --branch stable --depth 1 https://github.com/Microsoft/LightGBM && 

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



# Check the URI of the image pushed to ECR

Open ECR in the AWS console and confirm that the image you built is there.

Copy the image URI and paste it below.
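
As an alternative to copying the URI from the console, it can be composed from its parts, because the build cell above already uses the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<name>:<tag>`. The helper name `build_ecr_image_uri` is hypothetical, and the account ID below is a placeholder.

```python
def build_ecr_image_uri(account_id, region, repository, tag="latest"):
    """Compose an ECR image URI in the same shape as `fullname` in the build cell."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Placeholder account ID; replace it with your own.
print(build_ecr_image_uri("111111111111", "ap-northeast-1", "sagemaker-lightgbm-regression"))
```

With AWS credentials available, the account ID could be looked up with `boto3.client("sts").get_caller_identity()["Account"]` instead of being typed in.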

In [16]:
#image = 'sagemaker-lightgbm-regression-local'
#image = '805433377179.dkr.ecr.us-west-2.amazonaws.com/sagemaker-lightgbm-regression:latest' # URI of the image you built
image = '021345128571.dkr.ecr.ap-northeast-1.amazonaws.com/sagemaker-lightgbm-regression'

In [17]:
train_location = 'file://'+local_train
valid_location = 'file://'+local_valid

In [19]:
print(train_location)
print(valid_location)

file://./data/train/boston_train.csv
file://./data/valid/boston_valid.csv


# Local training
In local mode, the image built above is pulled from ECR and run with the local Docker daemon.

In [20]:
local_lightgbm = Estimator(
    image,
    role,
    instance_count=1,
    instance_type="local",
    hyperparameters={'boosting_type': 'gbdt',
            'objective': 'regression',
            'num_leaves': 31,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 0})

In [21]:
local_lightgbm.fit({'train':train_location, 'validation': valid_location}, logs=True)

Creating 0c9r493w5l-algo-1-0z4o6 ... 
Creating 0c9r493w5l-algo-1-0z4o6 ... done
Attaching to 0c9r493w5l-algo-1-0z4o6
[36m0c9r493w5l-algo-1-0z4o6 |[0m Starting the training.
[36m0c9r493w5l-algo-1-0z4o6 |[0m Reading hyperparameters data: /opt/ml/input/config/hyperparameters.json
[36m0c9r493w5l-algo-1-0z4o6 |[0m hyperparameters_data: {'boosting_type': 'gbdt', 'objective': 'regression', 'num_leaves': '31', 'learning_rate': '0.05', 'feature_fraction': '0.9', 'bagging_fraction': '0.8', 'bagging_freq': '5', 'verbose': '0'}
[36m0c9r493w5l-algo-1-0z4o6 |[0m Found train files: ['/opt/ml/input/data/train/boston_train.csv']
[36m0c9r493w5l-algo-1-0z4o6 |[0m Found validation files: ['/opt/ml/input/data/validation/boston_valid.csv']
[36m0c9r493w5l-algo-1-0z4o6 |[0m building training and validation datasets
[36m0c9r493w5l-algo-1-0z4o6 |[0m Starting training...
[36m0c9r493w5l-algo-1-0z4o6 |[0m You can set `force_col_wise=true` to remove the overhead.
[36m0c9r493w5l-algo-1-0z4o6 |[0m [

In local mode, the training result is written to S3 at:

Amazon S3 > Buckets > sagemaker-us-west-2-805433377179 > sagemaker-lightgbm-regression-2022-10-03-06-17-32-054/


### Local serving

serializer: specifies the format of the input data sent to the endpoint (CSV here).
https://sagemaker.readthedocs.io/en/stable/v2.html
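
To make the serializer's role concrete, here is a rough stand-in for the kind of payload a CSV serializer produces. It is not the SDK's `CSVSerializer` implementation, only a sketch using the standard library, and `rows_to_csv_payload` is a hypothetical name.

```python
import csv
import io


def rows_to_csv_payload(rows):
    """Serialize a list of feature rows into a header-less CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    # Drop the trailing newline so the payload ends with the last record.
    return buffer.getvalue().strip()


payload = rows_to_csv_payload([[0.1, 2, 3.5], [4, 5, 6]])
print(payload)
```

The inference cells in this notebook skip this step because they read an already CSV-formatted file and send its text as the payload.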

In [33]:
local_predictor = local_lightgbm.deploy(1, 'local', serializer=sagemaker.serializers.CSVSerializer()) 

Attaching to jioo7o4vps-algo-1-9ktq9
[36mjioo7o4vps-algo-1-9ktq9 |[0m Starting the inference server with 16 workers.
[36mjioo7o4vps-algo-1-9ktq9 |[0m [2022-10-13 03:23:19 +0000] [10] [INFO] Starting gunicorn 20.1.0
[36mjioo7o4vps-algo-1-9ktq9 |[0m [2022-10-13 03:23:19 +0000] [10] [INFO] Listening at: unix:/tmp/gunicorn.sock (10)
[36mjioo7o4vps-algo-1-9ktq9 |[0m [2022-10-13 03:23:19 +0000] [10] [INFO] Using worker: gevent
[36mjioo7o4vps-algo-1-9ktq9 |[0m [2022-10-13 03:23:19 +0000] [12] [INFO] Booting worker with pid: 12
[36mjioo7o4vps-algo-1-9ktq9 |[0m [2022-10-13 03:23:19 +0000] [13] [INFO] Booting worker with pid: 13
[36mjioo7o4vps-algo-1-9ktq9 |[0m [2022-10-13 03:23:19 +0000] [14] [INFO] Booting worker with pid: 14
[36mjioo7o4vps-algo-1-9ktq9 |[0m [2022-10-13 03:23:19 +0000] [15] [INFO] Booting worker with pid: 15
[36mjioo7o4vps-algo-1-9ktq9 |[0m [2022-10-13 03:23:19 +0000] [16] [INFO] Booting worker with pid: 16
[36mjioo7o4vps-algo-1-9ktq9 |[0m [2022-10-13 03:23

In [30]:
!docker ps

CONTAINER ID   IMAGE                                                                             COMMAND   CREATED         STATUS         PORTS                                       NAMES
69350c1f4475   021345128571.dkr.ecr.ap-northeast-1.amazonaws.com/sagemaker-lightgbm-regression   "serve"   3 minutes ago   Up 3 minutes   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp   98mgw9t4v0-algo-1-lepex


In [31]:
!docker stop 69350c1f4475

[36m98mgw9t4v0-algo-1-lepex |[0m [2022-10-13 03:23:12 +0000] [10] [INFO] Handling signal: term
[36m98mgw9t4v0-algo-1-lepex exited with code 0
69350c1f4475
[0mAborting on container exit...


In [32]:
!docker ps

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES


In [36]:
### Run inference
with open(local_test, 'r') as f:
    payload = f.read().strip()

predicted = local_predictor.predict(payload).decode('utf-8')
print('=' * 20)
print(predicted)

[36mjioo7o4vps-algo-1-9ktq9 |[0m Invoked with 102 records
19.95642073217597
27.844891841022335
23.747437427003455
21.961517177305176
33.70952263893306
16.546899933876215
20.7577247308279
21.58941351302627
28.44096446328559
21.573610198594977
16.520022349295115
18.56239893242527
33.70952263893306
21.66404760045202
18.839854556333133
20.524517944865078
23.512192914502315
19.720552829648888
14.831841119971708
25.48273874904075
24.232639474441545
21.624005932843115
24.961489794296718
31.737194191676068
21.634052928440624
28.40721160777621
21.408363849719503
14.831841119971708
22.218594550645975
21.174456098551236
21.78791955089051
14.831841119971708
29.996695633096042
22.44097524661187
33.83316205414468
26.41403196992683
33.70952263893306
17.366188662166092
27.56686070285819
30.785697489113854
19.36938873496206
20.70626548555591
17.759853567831996
27.888269821752413
20.521395163186774
14.831841119971708
24.776417537973362
24.965857100129327
19.649289821764185
21.026797620813866
33.709522

# Submitting a training job
Next, we run a SageMaker training job with the same custom container instead of using local mode.

In [43]:
from sagemaker import get_execution_role

role = get_execution_role()

## Create an S3 bucket and upload the data

In [39]:
import sagemaker
bucket_name = '<bucket_name>' # input your bucket name
bucket_name = 'demo-lgbm-container'

train_s3 = sagemaker.s3.S3Uploader.upload('./data/train/boston_train.csv', f's3://{bucket_name}/demo_lightgbm/train')
valid_s3 = sagemaker.s3.S3Uploader.upload('./data/valid/boston_valid.csv', f's3://{bucket_name}/demo_lightgbm/valid')

In [44]:
est_lightgbm = Estimator(
    image,
    role,
    instance_count=1,
    instance_type="ml.m4.2xlarge",
    hyperparameters={'boosting_type': 'gbdt',
            'objective': 'regression',
            'num_leaves': 31,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 0})

In [45]:
est_lightgbm.fit({'train':train_s3, 'validation': valid_s3}, logs=True)

2022-10-13 03:26:10 Starting - Starting the training job...
2022-10-13 03:26:34 Starting - Preparing the instances for trainingProfilerReport-1665631570: InProgress
.........
2022-10-13 03:28:05 Downloading - Downloading input data...
2022-10-13 03:28:35 Training - Downloading the training image...
2022-10-13 03:29:11 Uploading - Uploading generated training model.[34mStarting the training.[0m
[34mReading hyperparameters data: /opt/ml/input/config/hyperparameters.json[0m
[34mhyperparameters_data: {'bagging_fraction': '0.8', 'bagging_freq': '5', 'boosting_type': 'gbdt', 'feature_fraction': '0.9', 'learning_rate': '0.05', 'num_leaves': '31', 'objective': 'regression', 'verbose': '0'}[0m
[34mFound train files: ['/opt/ml/input/data/train/boston_train.csv'][0m
[34mFound validation files: ['/opt/ml/input/data/validation/boston_valid.csv'][0m
[34mbuilding training and validation datasets[0m
[34mStarting training...[0m
[34mYou can set `force_row_wise=true` to remove the overhead

Training takes about 3 minutes.

Only about 75 seconds of that is billed.
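
The difference between elapsed and billed time can be read from the training job description. The sketch below assumes a `DescribeTrainingJob` response dictionary, which contains the `TrainingTimeInSeconds` and `BillableTimeInSeconds` fields. The helper name `summarize_training_time` is hypothetical and the sample values are made up for illustration.

```python
def summarize_training_time(description):
    """Extract training and billable seconds from a DescribeTrainingJob response."""
    return {
        "training_seconds": description.get("TrainingTimeInSeconds"),
        "billable_seconds": description.get("BillableTimeInSeconds"),
    }


# With AWS credentials, the response could be fetched like this (not run here):
#   import boto3
#   description = boto3.client("sagemaker").describe_training_job(
#       TrainingJobName=est_lightgbm.latest_training_job.name)
sample = {"TrainingTimeInSeconds": 75, "BillableTimeInSeconds": 75}  # made-up values
print(summarize_training_time(sample))
```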

## Deploy to an endpoint
Deploy with wait=True (previously wait=False).

While waiting, here is what happens.

On deployment, SageMaker runs docker run <image> serve.
The serve script starts
    web server: nginx
    app server: gunicorn
and loads predictor.py, the Flask application.
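
SageMaker hosting sends health checks to `GET /ping` and inference requests to `POST /invocations`. The sketch below shows that contract with a bare WSGI callable from the standard library. This swaps in plain WSGI for the Flask app in `predictor.py`, which is not shown in this notebook, and `predict_rows` is a hypothetical placeholder that returns row means rather than LightGBM predictions.

```python
def predict_rows(rows):
    """Placeholder model: returns the mean of each row instead of a real prediction."""
    return [sum(row) / len(row) for row in rows]


def app(environ, start_response):
    """Minimal WSGI callable answering GET /ping and POST /invocations."""
    method = environ.get("REQUEST_METHOD", "GET")
    path = environ.get("PATH_INFO", "/")

    if method == "GET" and path == "/ping":
        # Health check: an empty 200 response means the container is ready.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"\n"]

    if method == "POST" and path == "/invocations":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(length).decode("utf-8")
        # Parse header-less CSV: one record per line, comma-separated floats.
        rows = [
            [float(v) for v in line.split(",")]
            for line in body.splitlines()
            if line.strip()
        ]
        result = "\n".join(str(p) for p in predict_rows(rows)) + "\n"
        start_response("200 OK", [("Content-Type", "text/csv")])
        return [result.encode("utf-8")]

    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found\n"]
```

A callable like this is what an application server such as gunicorn loads, while nginx sits in front of it and forwards the requests.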

In [46]:
# wait=True blocks until the endpoint is in service; pass wait=False to return immediately.
# CSVSerializer() replaces the deprecated csv_serializer alias from SDK v1.
predictor = est_lightgbm.deploy(1, 'ml.m4.xlarge', serializer=sagemaker.serializers.CSVSerializer(), wait=True)

-----!

In [None]:
### Run inference
with open(local_test, 'r') as f:
    payload = f.read().strip()

predicted = predictor.predict(payload).decode('utf-8')
print(predicted)

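# 2. Specifying the training script from outside the container

In Part 1, the training entry script train was placed inside the custom container,
so the container has to be rebuilt every time the source code changes.

For maintainability, it is sometimes better to separate the container (the environment) from the source code.
The following shows how to specify the script file from outside the container.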

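## SageMaker Training Toolkit
To specify a script from outside the container, install the SageMaker Training Toolkit in the image.

https://github.com/aws/sagemaker-training-toolkit


The toolkit installs a train command at
/opt/conda/bin/train


Add the toolkit installation to the Dockerfile used earlier.
Leave train out of the build files. If train is still included, running
docker run <image> train
executes the train script in the current directory, and the train command installed by the Training Toolkit never runs.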
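
When the toolkit runs an external entry point, it passes hyperparameters to the script as command-line arguments and exposes locations through `SM_*` environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN`. The sketch below shows how an entry-point script might read them. It is an illustration only: the real `src/train_practice.py` is not shown in this notebook, and the set of arguments here is an assumption.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided locations for a training script."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments such as --num_leaves 31.
    parser.add_argument("--num_leaves", type=int, default=31)
    parser.add_argument("--learning_rate", type=float, default=0.05)
    parser.add_argument("--objective", type=str, default="regression")
    # Locations come from environment variables set by the toolkit; the
    # fallbacks are the usual /opt/ml paths.
    parser.add_argument(
        "--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    )
    parser.add_argument(
        "--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
    )
    parser.add_argument(
        "--validation",
        default=os.environ.get("SM_CHANNEL_VALIDATION", "/opt/ml/input/data/validation"),
    )
    # Ignore hyperparameters that this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Using `parse_known_args` keeps the script from failing when the estimator passes a hyperparameter the script does not declare, such as `bagging_freq` here.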

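### Saving the training script locally and running it
You may also want to run a script fetched from GitHub.

Either way, the SageMaker Training Toolkit is required.

### Building the custom container
Take care not to include train.

With the original Dockerfile:
* train is placed at /opt/program/train
* the working directory is set to /opt/program
* the SageMaker Training Toolkit is installed at /opt/conda/bin/train
* running train executes /opt/program/train in the working directory instead of the toolkit's command

To fix this, either:
* do not set the working directory to the location of the bundled train, or
* do not put train in the container at all (the reliable option)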
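
One way to check which `train` would win is to ask Python which executable a given search path resolves to. This sketch assumes that command lookup in the container goes through `PATH`, which is the usual behaviour but was not verified against this image. The temporary directories stand in for `/opt/program` and `/opt/conda/bin`, and the helper names are hypothetical.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def make_fake_train(directory):
    """Create an executable file named `train` in the given directory."""
    path = Path(directory) / "train"
    path.write_text("#!/bin/sh\necho train\n")
    path.chmod(path.stat().st_mode | stat.S_IXUSR)
    return path


def resolve_train(path_dirs):
    """Return the `train` executable found first on the given search path."""
    return shutil.which("train", path=os.pathsep.join(str(d) for d in path_dirs))


with tempfile.TemporaryDirectory() as tmp:
    program_dir = Path(tmp) / "program"   # stands in for /opt/program
    conda_bin = Path(tmp) / "conda_bin"   # stands in for /opt/conda/bin
    program_dir.mkdir()
    conda_bin.mkdir()
    make_fake_train(program_dir)
    make_fake_train(conda_bin)
    print(resolve_train([program_dir, conda_bin]))  # the copy in program/ is found
    print(resolve_train([conda_bin, program_dir]))  # the copy in conda_bin/ is found
```

Inside a running container, `shutil.which("train")` with no `path` argument reports which file the container's own `PATH` picks.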

In [47]:
%%sh

# The name of our algorithm
algorithm_name=sagemaker-toolkit

#cd container
cd container_smtrtoolkit ### changed

#chmod +x lightgbm_regression/train
chmod +x lightgbm_regression/serve

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}


Login Succeeded
Sending build context to Docker daemon  19.97kB
Step 1/14 : FROM ubuntu:16.04
 ---> b6f507652425
Step 2/14 : MAINTAINER Amazon AI <sage-learner@amazon.com>
 ---> Using cache
 ---> c5602b2d98e4
Step 3/14 : ARG CONDA_DIR=/opt/conda
 ---> Using cache
 ---> 618227bc5218
Step 4/14 : ENV PATH $CONDA_DIR/bin:$PATH
 ---> Using cache
 ---> f0165591799f
Step 5/14 : RUN apt-get update &&     apt-get install -y --no-install-recommends         ca-certificates         cmake         build-essential         gcc         g++         git         nginx         wget &&     wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh &&     /bin/bash Miniconda3-latest-Linux-x86_64.sh -f -b -p $CONDA_DIR &&     export PATH="$CONDA_DIR/bin:$PATH" &&     conda config --set always_yes yes --set changeps1 no &&     conda install -q -y numpy scipy scikit-learn pandas flask gevent gunicorn &&     git clone --recursive --branch stable --depth 1 https://github.com/Microsoft/LightGBM && 

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



## Training (local)

In [51]:
#image = '805433377179.dkr.ecr.us-west-2.amazonaws.com/sagemaker-lightgbm-toolkit:latest'
image = '021345128571.dkr.ecr.ap-northeast-1.amazonaws.com/sagemaker-toolkit'
#image = <input your own image URI>

In [52]:
est_lightgbm3 = Estimator(
    image,
    role,
    instance_count=1,
    #instance_type="ml.m4.2xlarge",
    instance_type="local",
    hyperparameters={'boosting_type': 'gbdt',
            'objective': 'regression',
            'num_leaves': 31,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 0},
    #source_dir='./practice_src',
    entry_point='./src/train_practice.py'
    #entry_point='./practice_src/train_practice.sh'
    )
est_lightgbm3.fit({'train':train_s3, 'validation': valid_s3}, logs=True)

Creating n9nwn2rft8-algo-1-l0xvv ... 
Creating n9nwn2rft8-algo-1-l0xvv ... done
Attaching to n9nwn2rft8-algo-1-l0xvv
[36mn9nwn2rft8-algo-1-l0xvv |[0m 2022-10-13 03:45:55,344 botocore.credentials INFO     Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36mn9nwn2rft8-algo-1-l0xvv |[0m 2022-10-13 03:45:55,545 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
[36mn9nwn2rft8-algo-1-l0xvv |[0m 2022-10-13 03:45:55,546 sagemaker-training-toolkit INFO     Failed to parse hyperparameter boosting_type value gbdt to Json.
[36mn9nwn2rft8-algo-1-l0xvv |[0m Returning the value itself
[36mn9nwn2rft8-algo-1-l0xvv |[0m 2022-10-13 03:45:55,546 sagemaker-training-toolkit INFO     Failed to parse hyperparameter objective value regression to Json.
[36mn9nwn2rft8-algo-1-l0xvv |[0m Returning the value itself
[36mn9nwn2rft8-algo-1-l0xvv |[0m 2022-10-13 03:45:55,555 sagemaker-training-toolkit INFO     instance_groups entry not present in resou

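## Local deployment

Deployment to a hosted endpoint is skipped here; the model is deployed locally only. The serving container left over from the earlier local deployment still occupies port 8080, so stop it first.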

In [55]:
!docker ps

CONTAINER ID   IMAGE                                                                             COMMAND   CREATED          STATUS          PORTS                                       NAMES
020337809d76   021345128571.dkr.ecr.ap-northeast-1.amazonaws.com/sagemaker-lightgbm-regression   "serve"   23 minutes ago   Up 23 minutes   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp   jioo7o4vps-algo-1-9ktq9


In [56]:
!docker stop 020337809d76

[36mjioo7o4vps-algo-1-9ktq9 |[0m [2022-10-13 03:46:55 +0000] [10] [INFO] Handling signal: term
[36mjioo7o4vps-algo-1-9ktq9 exited with code 0
020337809d76
[0mAborting on container exit...


In [57]:
!docker ps

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES


In [58]:
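# csv_serializer is a deprecated alias in SDK v2; sagemaker.serializers.CSVSerializer() is the current form.
predictor3 = est_lightgbm3.deploy(1, 'local', serializer=csv_serializer)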

Attaching to b791gxyeqr-algo-1-ndhry
[36mb791gxyeqr-algo-1-ndhry |[0m Starting the inference server with 16 workers.
[36mb791gxyeqr-algo-1-ndhry |[0m [2022-10-13 03:47:02 +0000] [10] [INFO] Starting gunicorn 20.1.0
[36mb791gxyeqr-algo-1-ndhry |[0m [2022-10-13 03:47:02 +0000] [10] [INFO] Listening at: unix:/tmp/gunicorn.sock (10)
[36mb791gxyeqr-algo-1-ndhry |[0m [2022-10-13 03:47:02 +0000] [10] [INFO] Using worker: gevent
[36mb791gxyeqr-algo-1-ndhry |[0m [2022-10-13 03:47:02 +0000] [12] [INFO] Booting worker with pid: 12
[36mb791gxyeqr-algo-1-ndhry |[0m [2022-10-13 03:47:02 +0000] [13] [INFO] Booting worker with pid: 13
[36mb791gxyeqr-algo-1-ndhry |[0m [2022-10-13 03:47:02 +0000] [14] [INFO] Booting worker with pid: 14
[36mb791gxyeqr-algo-1-ndhry |[0m [2022-10-13 03:47:02 +0000] [15] [INFO] Booting worker with pid: 15
[36mb791gxyeqr-algo-1-ndhry |[0m [2022-10-13 03:47:02 +0000] [16] [INFO] Booting worker with pid: 16
[36mb791gxyeqr-algo-1-ndhry |[0m [2022-10-13 03:47

In [59]:
### Run inference
with open(local_test, 'r') as f:
    payload = f.read().strip()

predicted = predictor3.predict(payload).decode('utf-8')
print(predicted)

The csv_serializer has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


[36mb791gxyeqr-algo-1-ndhry |[0m Invoked with 102 records
19.95642073217597
27.844891841022335
23.747437427003455
21.961517177305176
33.70952263893306
16.546899933876215
20.7577247308279
21.58941351302627
28.44096446328559
21.573610198594977
16.520022349295115
18.56239893242527
33.70952263893306
21.66404760045202
18.839854556333133
20.524517944865078
23.512192914502315
19.720552829648888
14.831841119971708
25.48273874904075
24.232639474441545
21.624005932843115
24.961489794296718
31.737194191676068
21.634052928440624
28.40721160777621
21.408363849719503
14.831841119971708
22.218594550645975
21.174456098551236
21.78791955089051
14.831841119971708
29.996695633096042
22.44097524661187
33.83316205414468
26.41403196992683
33.70952263893306
17.366188662166092
27.56686070285819
30.785697489113854
19.36938873496206
20.70626548555591
17.759853567831996
27.888269821752413
20.521395163186774
14.831841119971708
24.776417537973362
24.965857100129327
19.649289821764185
21.026797620813866
33.709522

## Running a shell script

In [65]:
est_lightgbm3 = Estimator(
    image,
    role,
    instance_count=1,
    #instance_type="ml.m4.2xlarge",
    instance_type="local",
    hyperparameters={'boosting_type': 'gbdt',
            'objective': 'regression',
            'num_leaves': 31,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 0},
    #source_dir='./practice_src',
    entry_point='./src/train_practice.sh'
    #entry_point='./practice_src/train_practice.sh'
    )
est_lightgbm3.fit({'train':train_s3, 'validation': valid_s3}, logs=True)

Creating sijzljjfxk-algo-1-ff6u8 ... 
Creating sijzljjfxk-algo-1-ff6u8 ... done
Attaching to sijzljjfxk-algo-1-ff6u8
[36msijzljjfxk-algo-1-ff6u8 |[0m 2022-10-13 03:55:07,951 botocore.credentials INFO     Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
[36msijzljjfxk-algo-1-ff6u8 |[0m 2022-10-13 03:55:08,131 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
[36msijzljjfxk-algo-1-ff6u8 |[0m 2022-10-13 03:55:08,131 sagemaker-training-toolkit INFO     Failed to parse hyperparameter boosting_type value gbdt to Json.
[36msijzljjfxk-algo-1-ff6u8 |[0m Returning the value itself
[36msijzljjfxk-algo-1-ff6u8 |[0m 2022-10-13 03:55:08,131 sagemaker-training-toolkit INFO     Failed to parse hyperparameter objective value regression to Json.
[36msijzljjfxk-algo-1-ff6u8 |[0m Returning the value itself
[36msijzljjfxk-algo-1-ff6u8 |[0m 2022-10-13 03:55:08,141 sagemaker-training-toolkit INFO     instance_groups entry not present in resou

# References


SageMaker PyTorch Training Toolkit
https://github.com/aws/sagemaker-pytorch-training-toolkit/


SageMaker PyTorch Inference Toolkit

https://github.com/aws/sagemaker-pytorch-inference-toolkit



https://stackoverflow.com/questions/73694705/what-is-the-difference-between-sagemaker-pytorch-training-toolkit-and-sagemaker


## References
Understanding SageMaker training jobs

https://github.com/aws-samples/aws-ml-jp/tree/main/sagemaker/sagemaker-traning/tutorial

# Running train.sh from train without the Toolkit

The source code has to be placed in S3.


About fit():

https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase.fit

A dataset must be given as an S3 path, or as a file:// path in local mode. GitHub URLs are therefore not accepted.
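
The rule above can be written as a small pre-check to run before calling `fit()`. This is a sketch of the rule as stated in this notebook, not a function from the SageMaker SDK, and `is_supported_input` is a hypothetical name.

```python
from urllib.parse import urlparse


def is_supported_input(uri, local_mode=False):
    """Return True for s3:// URIs, and for file:// URIs only in local mode."""
    scheme = urlparse(uri).scheme
    if scheme == "s3":
        return True
    return local_mode and scheme == "file"


print(is_supported_input("s3://demo-lgbm-container/demo_lightgbm/train"))
print(is_supported_input("file://./data/train/boston_train.csv", local_mode=True))
print(is_supported_input("https://github.com/aws/sagemaker-training-toolkit", local_mode=True))
```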

### About the SageMaker Training Toolkit

https://github.com/aws/sagemaker-training-toolkit/blob/master/README.md

There is also an inference toolkit.

https://docs.aws.amazon.com/sagemaker/latest/dg/amazon-sagemaker-toolkits.html


https://github.com/aws/sagemaker-inference-toolkit


## (Bonus) Running without a custom container by listing lightgbm in the built-in container's requirements.txt



Past versions of the container (1.3-3, 1.2-2, 1.2-1, 1.0-1) are listed here:

https://github.com/aws/sagemaker-xgboost-container/releases
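
The idea is that `source_dir` carries a `requirements.txt` next to the entry point, and the built-in container installs the listed packages before running the script. The sketch below writes such a file. The actual contents of `./src_builtin_container` are not shown in this notebook, so the helper `write_requirements` and the single `lightgbm` entry are assumptions.

```python
from pathlib import Path


def write_requirements(source_dir, packages):
    """Create source_dir if needed and write one package per line to requirements.txt."""
    source_path = Path(source_dir)
    source_path.mkdir(parents=True, exist_ok=True)
    requirements = source_path / "requirements.txt"
    requirements.write_text("\n".join(packages) + "\n")
    return requirements


# Example call (not run here, to avoid touching the working directory):
#   write_requirements("./src_builtin_container", ["lightgbm"])
```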


In [60]:
import boto3
container = sagemaker.image_uris.retrieve("xgboost", boto3.Session().region_name, "1.5-1")
#container = sagemaker.image_uris.retrieve("xgboost", boto3.Session().region_name, "latest")

In [61]:
container

'354813040037.dkr.ecr.ap-northeast-1.amazonaws.com/sagemaker-xgboost:1.5-1'

In [62]:
est_lightgbm5 = Estimator(
    #image,
    container, # XGBoost built-in container
    role,
    instance_count=1,
    #instance_type="ml.m4.2xlarge",
    instance_type="local",
    hyperparameters={'boosting_type': 'gbdt',
            'objective': 'regression',
            'num_leaves': 31,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 0},
    source_dir='./src_builtin_container',
    entry_point='train_practice.py'
    #entry_point='./practice_src/train_practice.sh'
    )

In [63]:
est_lightgbm5.fit({'train':train_s3, 'validation': valid_s3}, logs=True)

Creating 1cbefdrksy-algo-1-3cmae ... 
Creating 1cbefdrksy-algo-1-3cmae ... done
Attaching to 1cbefdrksy-algo-1-3cmae
[36m1cbefdrksy-algo-1-3cmae |[0m [2022-10-13 03:47:30.133 bb2b5d82f97f:1 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[36m1cbefdrksy-algo-1-3cmae |[0m [2022-10-13:03:47:30:INFO] Imported framework sagemaker_xgboost_container.training
[36m1cbefdrksy-algo-1-3cmae |[0m [2022-10-13:03:47:30:INFO] Failed to parse hyperparameter boosting_type value gbdt to Json.
[36m1cbefdrksy-algo-1-3cmae |[0m Returning the value itself
[36m1cbefdrksy-algo-1-3cmae |[0m [2022-10-13:03:47:30:INFO] Failed to parse hyperparameter objective value regression to Json.
[36m1cbefdrksy-algo-1-3cmae |[0m Returning the value itself
[36m1cbefdrksy-algo-1-3cmae |[0m [2022-10-13:03:47:30:INFO] No GPUs detected (normal if no gpus installed)
[36m1cbefdrksy-algo-1-3cmae |[0m [2022-10-13:03:47:30:INFO] Invoking user training script.
[36m1cbefdrksy-algo-1-3cmae |[0m [2022-10-13:03:47:3

In [64]:
est_lightgbm6 = Estimator(
    #image,
    container, # XGBoost built-in container
    role,
    instance_count=1,
    #instance_type="ml.m4.2xlarge",
    instance_type="local",
    hyperparameters={'boosting_type': 'gbdt',
            'objective': 'regression',
            'num_leaves': 31,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 0},
    source_dir='./src_builtin_container_no_lgbm',
    entry_point='train_practice.py'
    #entry_point='./practice_src/train_practice.sh'
    )

est_lightgbm6.fit({'train':train_s3, 'validation': valid_s3}, logs=True)

Creating 968s2db7vl-algo-1-pu2lq ... 
Creating 968s2db7vl-algo-1-pu2lq ... done
Attaching to 968s2db7vl-algo-1-pu2lq
[36m968s2db7vl-algo-1-pu2lq |[0m [2022-10-13 03:48:39.027 146ca5e7942d:1 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[36m968s2db7vl-algo-1-pu2lq |[0m [2022-10-13:03:48:39:INFO] Imported framework sagemaker_xgboost_container.training
[36m968s2db7vl-algo-1-pu2lq |[0m [2022-10-13:03:48:39:INFO] Failed to parse hyperparameter boosting_type value gbdt to Json.
[36m968s2db7vl-algo-1-pu2lq |[0m Returning the value itself
[36m968s2db7vl-algo-1-pu2lq |[0m [2022-10-13:03:48:39:INFO] Failed to parse hyperparameter objective value regression to Json.
[36m968s2db7vl-algo-1-pu2lq |[0m Returning the value itself
[36m968s2db7vl-algo-1-pu2lq |[0m [2022-10-13:03:48:39:INFO] No GPUs detected (normal if no gpus installed)
[36m968s2db7vl-algo-1-pu2lq |[0m [2022-10-13:03:48:39:INFO] Invoking user training script.
[36m968s2db7vl-algo-1-pu2lq |[0m [2022-10-13:03:48:3

RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmpyjfbk2z0/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

This fails because the lightgbm module is not installed in the container:

File "/opt/ml/code/train_practice.py", line 13, in <module>  
import lightgbm as lgb  
ModuleNotFoundError: No module named 'lightgbm'  


# Cleanup

Delete or stop the following resources:

* ECR
* S3
* SageMaker notebook instance
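
The ECR and S3 parts of the cleanup could be scripted roughly as follows. This is a sketch under assumptions: the clients are passed in so the function can be tried without AWS access, the function name `cleanup` is hypothetical, and the repository, bucket and prefix names in the usage comment are the ones used earlier in this notebook.

```python
def cleanup(ecr_client, s3_client, repository_names, bucket_name, prefix):
    """Delete the ECR repositories and the uploaded S3 objects under a prefix."""
    for name in repository_names:
        # force=True also deletes the images stored in the repository.
        ecr_client.delete_repository(repositoryName=name, force=True)

    # list_objects_v2 returns at most 1000 keys per call, which is assumed
    # to be enough for the small dataset in this tutorial.
    listing = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
    keys = [{"Key": obj["Key"]} for obj in listing.get("Contents", [])]
    if keys:
        s3_client.delete_objects(Bucket=bucket_name, Delete={"Objects": keys})
    return len(keys)


# Usage with real clients (requires credentials; not run here):
#   import boto3
#   cleanup(boto3.client("ecr"), boto3.client("s3"),
#           ["sagemaker-lightgbm-regression", "sagemaker-toolkit"],
#           "demo-lgbm-container", "demo_lightgbm/")
```

A hosted endpoint created with `deploy()` is not in the list above, but it keeps running until removed, for example with `predictor.delete_endpoint()`.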

The XGBoost built-in container cannot run LightGBM inference as-is, so you need to supply your own serve.

SageMaker Python SDK  
https://github.com/aws/sagemaker-python-sdk

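## Experiments
Can code be specified through sagemaker.model.Model()?

https://sagemaker.readthedocs.io/en/stable/api/inference/model.html


Can an endpoint then be created from that model?

Would that allow LightGBM inference while still using the built-in container?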

# Running LightGBM inference on the XGBoost built-in container

To have the library installed, a requirements file has to be placed at
/opt/ml/model/code/requirements.txt
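
SageMaker extracts `model.tar.gz` into `/opt/ml/model`, so one way to get a file to the path above is to add it to the archive under `code/`. The sketch below packages a model file that way. Whether the XGBoost container then installs the listed packages was not verified in this notebook, and the helper `package_model` and the file names in the usage comment are hypothetical.

```python
import tarfile
from pathlib import Path


def package_model(model_file, requirements_file, output_path):
    """Bundle a model file and code/requirements.txt into a gzip-compressed tar archive."""
    with tarfile.open(output_path, "w:gz") as archive:
        # The model file sits at the archive root, i.e. /opt/ml/model/<name>.
        archive.add(model_file, arcname=Path(model_file).name)
        # The requirements file lands at /opt/ml/model/code/requirements.txt.
        archive.add(requirements_file, arcname="code/requirements.txt")
    return output_path


# Example call with hypothetical file names (not run here):
#   package_model("model.txt", "requirements.txt", "model.tar.gz")
```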

https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py

Deploying an existing model to a SageMaker endpoint:

https://dev.classmethod.jp/articles/amazon-sagemaker-deploy_existing_model/

Deploying the created Model instance:

https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy

## References
The endpoint instance (Predictor)

https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html#sagemaker.predictor.Predictor