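# Learn how to use LightGBM in a custom container and understand how SageMaker works

Estimated time: 2 hours

Contents
* Custom container (local training, local inference, training job, inference job)
* Introducing the SageMaker Training Toolkit (moving the code out of the container): local training, local inference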


In [1]:
# This is a sample Python program that trains a simple LightGBM Regression model, and then performs inference.
# This implementation will work on your local computer.
#
# Prerequisites:
#   1. Install required Python packages:
#       pip install boto3 sagemaker pandas scikit-learn
#       pip install 'sagemaker[local]'
#   2. Docker Desktop has to be installed on your computer, and running.
#   3. Open terminal and run the following commands:
#       docker build  -t sagemaker-lightgbm-regression-local container/.
########################################################################################################################


In [2]:
import pandas as pd
from sagemaker.estimator import Estimator
from sagemaker.local import LocalSession
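from sagemaker.predictor import csv_serializer  # deprecated alias in sagemaker>=2 (see the warning in the outputs below)
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; this notebook needs scikit-learn<1.2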
from sklearn.model_selection import train_test_split

In [3]:
sagemaker_session = LocalSession()
sagemaker_session.config = {'local': {'local_code': True}}

# For local training a dummy role will be sufficient
role = 'arn:aws:iam::111111111111:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001'

# 1. Data preparation

https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/lightgbm_bring_your_own_container_local_training_and_serving/lightgbm_bring_your_own_container_local_training_and_serving.py

In [11]:
data = load_boston()

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=45)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=45)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

valX = pd.DataFrame(X_val, columns=data.feature_names)
valX['target'] = y_val

testX = pd.DataFrame(X_test, columns=data.feature_names)

In [12]:
from pathlib import Path

Path('./data/train').mkdir(parents=True, exist_ok=True)
Path('./data/valid').mkdir(parents=True, exist_ok=True)
Path('./data/test').mkdir(parents=True, exist_ok=True)

In [18]:
local_train = './data/train/boston_train.csv'
local_valid = './data/valid/boston_valid.csv'
local_test = './data/test/boston_test.csv'

trainX.to_csv(local_train, header=None, index=False)
valX.to_csv(local_valid, header=None, index=False)
testX.to_csv(local_test, header=None, index=False)
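
The cell above writes the CSV files without a header row. A minimal sketch of a sanity check that can be run before training is shown below; the helper name `check_headerless_csv` is hypothetical and not part of the original sample. It only verifies that every row has the same width and that every cell parses as a number.

```python
import csv


def check_headerless_csv(path):
    """Return (n_rows, n_cols) if all rows are numeric and equally wide.

    Hypothetical helper: raises ValueError if a row has a different width
    or if a cell is not numeric (for example, a header row slipped in).
    """
    n_rows, n_cols = 0, None
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if n_cols is None:
                n_cols = len(row)
            elif len(row) != n_cols:
                raise ValueError(
                    f"row {n_rows + 1} has {len(row)} columns, expected {n_cols}"
                )
            for value in row:
                float(value)  # raises ValueError on non-numeric cells
            n_rows += 1
    return n_rows, n_cols


# Example (paths as defined in the cell above):
# print(check_headerless_csv('./data/train/boston_train.csv'))
```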

# 2. Build the custom container

In [17]:
%%sh

# The name of our algorithm
algorithm_name=sagemaker-lightgbm-regression

cd container

chmod +x lightgbm_regression/train
chmod +x lightgbm_regression/serve

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}


Login Succeeded
Sending build context to Docker daemon  32.77kB
Step 1/10 : FROM ubuntu:16.04
 ---> b6f507652425
Step 2/10 : MAINTAINER Amazon AI <sage-learner@amazon.com>
 ---> Using cache
 ---> dffb77cb5ac9
Step 3/10 : ARG CONDA_DIR=/opt/conda
 ---> Using cache
 ---> 5fa24152f9a6
Step 4/10 : ENV PATH $CONDA_DIR/bin:$PATH
 ---> Using cache
 ---> 61f2949df5be
Step 5/10 : RUN apt-get update &&     apt-get install -y --no-install-recommends         ca-certificates         cmake         build-essential         gcc         g++         git         nginx         wget &&     wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh &&     /bin/bash Miniconda3-latest-Linux-x86_64.sh -f -b -p $CONDA_DIR &&     export PATH="$CONDA_DIR/bin:$PATH" &&     conda config --set always_yes yes --set changeps1 no &&     conda install -q -y numpy scipy scikit-learn pandas flask gevent gunicorn &&     git clone --recursive --branch stable --depth 1 https://github.com/Microsoft/LightGBM && 

https://docs.docker.com/engine/reference/commandline/login/#credentials-store
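
The `fullname` variable in the build script is composed from the account ID, the region, and the algorithm name. The same composition can be sketched in Python so that the URI does not have to be typed by hand; the function name `ecr_image_uri` is an illustrative assumption, not an SDK call.

```python
def ecr_image_uri(account, region, name, tag="latest"):
    """Compose an ECR image URI the same way the shell script builds `fullname`."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{name}:{tag}"


# Example with the values used in this notebook:
print(ecr_image_uri("805433377179", "us-west-2", "sagemaker-lightgbm-regression"))
```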



# Check the URI of the container image pushed to ECR

In [19]:
#image = 'sagemaker-lightgbm-regression-local'
image = '805433377179.dkr.ecr.us-west-2.amazonaws.com/sagemaker-lightgbm-regression:latest' # URI of the built image


local_lightgbm = Estimator(
    image,
    role,
    instance_count=1,
    instance_type="local",
    hyperparameters={'boosting_type': 'gbdt',
            'objective': 'regression',
            'num_leaves': 31,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 0})
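
The hyperparameters are given here as numbers, but the training logs in this notebook show them arriving inside the container as strings (for example `'num_leaves': '31'`). A minimal sketch of how a `train` script might cast them back is shown below. The function names are hypothetical; the JSON path is the one printed in the training log.

```python
import json


def cast_hyperparameters(raw):
    """Parse each string value as JSON; keep the raw value if parsing fails."""
    casted = {}
    for key, value in raw.items():
        try:
            casted[key] = json.loads(value)  # '31' -> 31, '0.05' -> 0.05
        except (TypeError, ValueError):
            casted[key] = value  # e.g. 'gbdt' is not valid JSON, keep as-is
    return casted


def load_hyperparameters(path="/opt/ml/input/config/hyperparameters.json"):
    """Read the hyperparameter file that SageMaker mounts into the container."""
    with open(path) as f:
        return cast_hyperparameters(json.load(f))


# Example with values taken from the log output:
print(cast_hyperparameters({"boosting_type": "gbdt", "num_leaves": "31", "learning_rate": "0.05"}))
```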


In [20]:
train_location = 'file://'+local_train
valid_location = 'file://'+local_valid

# Local training
Local mode pulls the built image from ECR and runs it with the local Docker (via docker-compose).

In [39]:
local_lightgbm.fit({'train':train_location, 'validation': valid_location}, logs=True)

Creating r6pvns1cfr-algo-1-4eyqe ... 
Creating r6pvns1cfr-algo-1-4eyqe ... done
Attaching to r6pvns1cfr-algo-1-4eyqe
r6pvns1cfr-algo-1-4eyqe | Starting the training.
r6pvns1cfr-algo-1-4eyqe | Reading hyperparameters data: /opt/ml/input/config/hyperparameters.json
r6pvns1cfr-algo-1-4eyqe | hyperparameters_data: {'boosting_type': 'gbdt', 'objective': 'regression', 'num_leaves': '31', 'learning_rate': '0.05', 'feature_fraction': '0.9', 'bagging_fraction': '0.8', 'bagging_freq': '5', 'verbose': '0'}
r6pvns1cfr-algo-1-4eyqe | Found train files: ['/opt/ml/input/data/train/boston_train.csv']
r6pvns1cfr-algo-1-4eyqe | Found validation files: ['/opt/ml/input/data/validation/.ipynb_checkpoints', '/opt/ml/input/data/validation/boston_valid.csv']
r6pvns1cfr-algo-1-4eyqe | Exception during training: [Errno 21] Is a directory: '/opt/ml/input/data/validation/.ipynb_checkpoints'
r6pvns1cfr-algo-1-4eyqe | Traceback (most recent call last):


RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmpawwutseo/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 255
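
The log above shows the run failing because a `.ipynb_checkpoints` directory was picked up as a validation file. One possible workaround, sketched here with a hypothetical helper name, is to delete such directories from the data folder before calling `fit()`.

```python
import shutil
from pathlib import Path


def remove_checkpoint_dirs(root):
    """Delete every `.ipynb_checkpoints` directory under `root`.

    Returns the list of removed paths so the caller can log them.
    """
    removed = []
    # sorted() materializes the matches before anything is deleted.
    for path in sorted(Path(root).rglob(".ipynb_checkpoints")):
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(str(path))
    return removed


# Example (run before local_lightgbm.fit):
# print(remove_checkpoint_dirs('./data'))
```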

In local mode, the training results are written to the following S3 location:

s3://sagemaker-us-west-2-805433377179/sagemaker-lightgbm-regression-2022-10-03-06-17-32-054/


### Local serving

In [22]:
local_predictor = local_lightgbm.deploy(1, 'local', serializer=csv_serializer) 

!

Exception in thread Thread-8:
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.8/site-packages/sagemaker/local/image.py", line 854, in run
    _stream_output(self.process)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.8/site-packages/sagemaker/local/image.py", line 916, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.8/site-packages/sagemaker/local/image.py", line 859, in run
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmp4o8x8s9j/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1


In [24]:
### Run inference
with open(local_test, 'r') as f:
    payload = f.read().strip()

predicted = local_predictor.predict(payload).decode('utf-8')
print(predicted)

The csv_serializer has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


18.777674892839066
27.25713360024323
23.610414339091246
21.97347013412657
34.430502515390316
16.498234706290795
20.840398680562625
21.145769100245438
27.879145192043296
21.58681134764175
15.043675807546869
17.77015058574251
34.430502515390316
21.31411867264166
18.69840430147045
20.744076911548735
22.741347284000348
18.86324767003252
14.886787117323767
26.163148096824077
24.291976313017795
21.418573054582424
25.16661402130081
31.63264493407418
21.31751434520187
27.464668816033427
21.16667249503399
14.344510956372458
21.72794462867519
21.03705804401459
21.81567930570058
14.344510956372458
30.724418417536683
23.31427238016595
34.510253957229565
25.786240254401637
34.430502515390316
16.459547989460095
25.707439418271306
31.526529614323298
18.84014372096384
20.76365092804235
18.30057590464775
27.09058276275205
19.91942852413607
14.886787117323767
24.071240114857428
24.365744945784318
19.12043389834621
20.86607225572482
34.430502515390316
22.742033783803905
25.114897176817582
31.948114288280
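
The endpoint returns the predictions as plain text with one value per line. A minimal sketch for turning that text into a list of floats follows; `parse_predictions` is a hypothetical helper, not part of the SDK.

```python
def parse_predictions(text):
    """Convert newline-separated prediction text into a list of floats."""
    return [float(line) for line in text.splitlines() if line.strip()]


# Example with the first two values from the output above:
print(parse_predictions("18.777674892839066\n27.25713360024323\n"))
```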

# Launch a training job

In [25]:
from sagemaker import get_execution_role

role = get_execution_role()

In [26]:
est_lightgbm = Estimator(
    image,
    role,
    instance_count=1,
    instance_type="ml.m4.2xlarge",
    hyperparameters={'boosting_type': 'gbdt',
            'objective': 'regression',
            'num_leaves': 31,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 0})

In [27]:
import sagemaker

train_s3 = sagemaker.s3.S3Uploader.upload('./data/train/boston_train.csv','s3://work-aws-oregon/demo_lightgbm/train')
valid_s3 = sagemaker.s3.S3Uploader.upload('./data/valid/boston_valid.csv','s3://work-aws-oregon/demo_lightgbm/valid')

In [28]:
est_lightgbm.fit({'train':train_s3, 'validation': valid_s3}, logs=True)

2022-10-03 06:32:10 Starting - Starting the training job...
2022-10-03 06:32:39 Starting - Preparing the instances for training
ProfilerReport-1664778729: InProgress
......
2022-10-03 06:33:40 Downloading - Downloading input data...
2022-10-03 06:34:08 Training - Downloading the training image...
2022-10-03 06:34:41 Training - Training image download completed. Training in progress...
Starting the training.
Reading hyperparameters data: /opt/ml/input/config/hyperparameters.json
hyperparameters_data: {'bagging_fraction': '0.8', 'bagging_freq': '5', 'boosting_type': 'gbdt', 'feature_fraction': '0.9', 'learning_rate': '0.05', 'num_leaves': '31', 'objective': 'regression', 'verbose': '0'}
Found train files: ['/opt/ml/input/data/train/boston_train.csv']
Found validation files: ['/opt/ml/input/data/validation/boston_valid.csv']
building training and validation datasets
Starting training...
You can set `force_col_wise=true` to 

## Deploy to an endpoint
Option: do not wait for the deployment to complete.

In [29]:
predictor = est_lightgbm.deploy(1, 'ml.m4.xlarge', serializer=csv_serializer) 

-----!

In [31]:
### Run inference
with open(local_test, 'r') as f:
    payload = f.read().strip()

predicted = predictor.predict(payload).decode('utf-8')
print(predicted)

The csv_serializer has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


19.95642073217597
27.844891841022335
23.747437427003455
21.961517177305176
33.70952263893306
16.546899933876215
20.7577247308279
21.58941351302627
28.44096446328559
21.573610198594977
16.520022349295115
18.56239893242527
33.70952263893306
21.66404760045202
18.839854556333133
20.524517944865078
23.512192914502315
19.720552829648888
14.831841119971708
25.48273874904075
24.232639474441545
21.624005932843115
24.961489794296718
31.737194191676068
21.634052928440624
28.40721160777621
21.408363849719503
14.831841119971708
22.218594550645975
21.174456098551236
21.78791955089051
14.831841119971708
29.996695633096042
22.44097524661187
33.83316205414468
26.41403196992683
33.70952263893306
17.366188662166092
27.56686070285819
30.785697489113854
19.36938873496206
20.70626548555591
17.759853567831996
27.888269821752413
20.521395163186774
14.831841119971708
24.776417537973362
24.965857100129327
19.649289821764185
21.026797620813866
33.70952263893306
22.770867837558004
25.12436361101226
32.04499227317

### Save the training script locally and run it
The same approach applies when you want to run a script from GitHub.

This requires the SageMaker Training Toolkit.
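
With the Training Toolkit installed, the entry point script receives the hyperparameters as command-line arguments and the data locations through `SM_*` environment variables. The following is a minimal sketch of the argument handling such a script might use. It is not the actual `train_practice.py` from this notebook; the argument set is an assumption based on the hyperparameters and the channel names (`train`, `validation`) used here.

```python
import argparse
import os


def parse_args(argv=None, environ=None):
    """Parse hyperparameters and SageMaker paths for a training entry point."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as `--name value` pairs.
    parser.add_argument("--boosting_type", default="gbdt")
    parser.add_argument("--objective", default="regression")
    parser.add_argument("--num_leaves", type=int, default=31)
    parser.add_argument("--learning_rate", type=float, default=0.05)

    # Paths come from environment variables set by the toolkit; the
    # fallbacks are the standard /opt/ml locations seen in the logs.
    parser.add_argument("--model-dir", default=environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train", default=environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--validation", default=environ.get("SM_CHANNEL_VALIDATION", "/opt/ml/input/data/validation"))

    # Ignore hyperparameters that this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Example: simulate what the toolkit would pass to the script.
args = parse_args(["--num_leaves", "31", "--learning_rate", "0.05"], environ={})
print(args.num_leaves, args.learning_rate, args.train)
```

Using `parse_known_args` keeps the script from failing when additional hyperparameters are passed that it does not declare.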

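### Build the custom container
Be careful not to include your own `train` file in the image.

Suppose the Dockerfile does the following:
* places `train` at /opt/program/train
* sets the working directory to /opt/program
* installs the SageMaker Training Toolkit, which puts its own entry point at /opt/conda/bin/train

In that case, running `train` executes /opt/program/train in the working directory instead of the toolkit's entry point.

To avoid this, either:
* do not set the working directory to the location of your own `train`, or
* do not put `train` in the container at all (the reliable option)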

In [35]:
%%sh

# The name of our algorithm
algorithm_name=sagemaker-toolkit

#cd container
cd container_smtrtoolkit ### changed

#chmod +x lightgbm_regression/train
chmod +x lightgbm_regression/serve

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}


Login Succeeded
Sending build context to Docker daemon  23.55kB
Step 1/14 : FROM ubuntu:16.04
 ---> b6f507652425
Step 2/14 : MAINTAINER Amazon AI <sage-learner@amazon.com>
 ---> Using cache
 ---> dffb77cb5ac9
Step 3/14 : ARG CONDA_DIR=/opt/conda
 ---> Using cache
 ---> 5fa24152f9a6
Step 4/14 : ENV PATH $CONDA_DIR/bin:$PATH
 ---> Using cache
 ---> 61f2949df5be
Step 5/14 : RUN apt-get update &&     apt-get install -y --no-install-recommends         ca-certificates         cmake         build-essential         gcc         g++         git         nginx         wget &&     wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh &&     /bin/bash Miniconda3-latest-Linux-x86_64.sh -f -b -p $CONDA_DIR &&     export PATH="$CONDA_DIR/bin:$PATH" &&     conda config --set always_yes yes --set changeps1 no &&     conda install -q -y numpy scipy scikit-learn pandas flask gevent gunicorn &&     git clone --recursive --branch stable --depth 1 https://github.com/Microsoft/LightGBM && 

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



## Training (local)

In [36]:
#image = '805433377179.dkr.ecr.us-west-2.amazonaws.com/sagemaker-lightgbm-toolkit:latest'
image = '805433377179.dkr.ecr.us-west-2.amazonaws.com/sagemaker-toolkit:latest'
# image = '<input your own image URI>'  # replace the line above with your own image URI

est_lightgbm3 = Estimator(
    image,
    role,
    instance_count=1,
    #instance_type="ml.m4.2xlarge",
    instance_type="local",
    hyperparameters={'boosting_type': 'gbdt',
            'objective': 'regression',
            'num_leaves': 31,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 0},
    #source_dir='./practice_src',
    entry_point='./src/train_practice.py'
    #entry_point='./practice_src/train_practice.sh'
    )
est_lightgbm3.fit({'train':train_s3, 'validation': valid_s3}, logs=True)

Creating 95kmua3hgb-algo-1-xcq2r ... 
Creating 95kmua3hgb-algo-1-xcq2r ... done
Attaching to 95kmua3hgb-algo-1-xcq2r
95kmua3hgb-algo-1-xcq2r | 2022-10-03 06:56:31,092 botocore.credentials INFO     Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
95kmua3hgb-algo-1-xcq2r | 2022-10-03 06:56:31,286 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
95kmua3hgb-algo-1-xcq2r | 2022-10-03 06:56:31,287 sagemaker-training-toolkit INFO     Failed to parse hyperparameter boosting_type value gbdt to Json.
95kmua3hgb-algo-1-xcq2r | Returning the value itself
95kmua3hgb-algo-1-xcq2r | 2022-10-03 06:56:31,287 sagemaker-training-toolkit INFO     Failed to parse hyperparameter objective value regression to Json.
95kmua3hgb-algo-1-xcq2r | Returning the value itself
95kmua3hgb-algo-1-xcq2r | 2022-10-03 06:56:31,298 sagemaker-training-toolkit INFO     instance_groups entry not present in resou

## Deployment (local)

In [37]:
predictor3 = est_lightgbm3.deploy(1, 'local', serializer=csv_serializer) 

!

In [38]:
### Run inference
with open(local_test, 'r') as f:
    payload = f.read().strip()

predicted = predictor3.predict(payload).decode('utf-8')
print(predicted)

The csv_serializer has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


18.777674892839066
27.25713360024323
23.610414339091246
21.97347013412657
34.430502515390316
16.498234706290795
20.840398680562625
21.145769100245438
27.879145192043296
21.58681134764175
15.043675807546869
17.77015058574251
34.430502515390316
21.31411867264166
18.69840430147045
20.744076911548735
22.741347284000348
18.86324767003252
14.886787117323767
26.163148096824077
24.291976313017795
21.418573054582424
25.16661402130081
31.63264493407418
21.31751434520187
27.464668816033427
21.16667249503399
14.344510956372458
21.72794462867519
21.03705804401459
21.81567930570058
14.344510956372458
30.724418417536683
23.31427238016595
34.510253957229565
25.786240254401637
34.430502515390316
16.459547989460095
25.707439418271306
31.526529614323298
18.84014372096384
20.76365092804235
18.30057590464775
27.09058276275205
19.91942852413607
14.886787117323767
24.071240114857428
24.365744945784318
19.12043389834621
20.86607225572482
34.430502515390316
22.742033783803905
25.114897176817582
31.948114288280

Exception in thread Thread-9:
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.8/site-packages/sagemaker/local/image.py", line 854, in run
    _stream_output(self.process)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.8/site-packages/sagemaker/local/image.py", line 916, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.8/site-packages/sagemaker/local/image.py", line 859, in run
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmp3mv6akn6/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1


# References


SageMaker PyTorch Training Toolkit
https://github.com/aws/sagemaker-pytorch-training-toolkit/


SageMaker PyTorch Inference Toolkit

https://github.com/aws/sagemaker-pytorch-inference-toolkit



https://stackoverflow.com/questions/73694705/what-is-the-difference-between-sagemaker-pytorch-training-toolkit-and-sagemaker


## References
Understanding SageMaker training jobs

https://github.com/aws-samples/aws-ml-jp/tree/main/sagemaker/sagemaker-traning/tutorial

# Without the Toolkit: run train.sh from train

The source code must be placed in S3.


About fit()

https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase.fit

Datasets must be specified as an S3 path or, in local mode, as a file:// path. In other words, GitHub cannot be used as a data source.
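
The rule above can be expressed as a small pre-flight check on the input locations before calling `fit()`. This is only a sketch under the assumptions stated in this notebook (S3 always allowed, `file://` allowed in local mode only); `check_input_location` is a hypothetical helper and not part of the SageMaker SDK.

```python
from urllib.parse import urlparse


def check_input_location(uri, local_mode=False):
    """Return the URI scheme if it is acceptable as a fit() input, else raise."""
    scheme = urlparse(uri).scheme
    allowed = {"s3", "file"} if local_mode else {"s3"}
    if scheme not in allowed:
        raise ValueError(
            f"unsupported input location {uri!r}: expected one of {sorted(allowed)}"
        )
    return scheme


# Examples:
print(check_input_location("s3://work-aws-oregon/demo_lightgbm/train"))
print(check_input_location("file://./data/train/boston_train.csv", local_mode=True))
```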

### About the SageMaker Training Toolkit

https://github.com/aws/sagemaker-training-toolkit/blob/master/README.md

There is also an inference toolkit.

https://docs.aws.amazon.com/sagemaker/latest/dg/amazon-sagemaker-toolkits.html


https://github.com/aws/sagemaker-inference-toolkit


# End of Contents ================