# Running LightGBM in a SageMaker custom container

https://dev.classmethod.jp/articles/sagemaker-container-image-lightgbm/


https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb


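* Create the custom container
* SageMaker training job - local mode
* SageMaker training job
* Deploy the endpoint
* Run inference

* The dataset is iris (the goal is only to verify that everything works)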

# 1. Build & push the custom container

In [12]:
!cat container/Dockerfile

# Build an image that can do training and inference in SageMaker
# This is a Python 3 image that uses the nginx, gunicorn, flask stack
# for serving inferences in a stable way.

FROM ubuntu:18.04

MAINTAINER Amazon AI <sage-learner@amazon.com>


RUN apt -y update && apt install -y --no-install-recommends \
    wget \
    python3-distutils \
    nginx \
    ca-certificates \
    libgomp1 \
    && apt clean

# Here we get all python packages.
# There's substantial overlap between scipy and numpy that we eliminate by
# linking them together. Likewise, pip leaves the install caches populated which uses
# a significant amount of space. These optimizations save a fair amount of space in the
# image, which reduces start up time.
RUN wget https://bootstrap.pypa.io/get-pip.py && python3 get-pip.py && \
    pip install wheel numpy scipy scikit-learn pandas lightgbm flask gevent gunicorn && \
    rm -rf /root/.cache

# Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering o

In [13]:
%run ./container/build_and_push.sh

SyntaxError: invalid syntax (build_and_push.sh, line 7)
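
The `SyntaxError` above is consistent with how `%run` works: it executes the named file as Python, so a shell script fails to parse. Below is a minimal sketch of running a shell script from Python with the standard library instead. The helper name `run_shell_script` is hypothetical, and it assumes `sh` is on the `PATH`.

```python
import subprocess


def run_shell_script(script_path, *args):
    """Run a shell script with sh and return its captured stdout (hypothetical helper)."""
    completed = subprocess.run(
        ["sh", script_path, *args],
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output as str
    )
    return completed.stdout
```

In this notebook the call would look like `run_shell_script('./container/build_and_push.sh')`. Inside a notebook, a `%%sh` cell is another way to get shell semantics.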

In [9]:
!which sh

/usr/bin/sh


In [34]:
%%sh

pwd

python --version
python3 --version

# Algorithm name
algorithm_name=sagemaker-lightgbm

# Make the files executable
chmod +x lightgbm_container/train
chmod +x lightgbm_container/serve

# Get the account ID
account=$(aws sts get-caller-identity --query Account --output text)

# Region name
#region='ap-northeast-1'
region='us-west-2'

# Full image URI (ECR repository URI plus tag)
fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# Create the ECR repository if it does not exist
aws --region ${region} ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws --region ${region} ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

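# Get the ECR login command and log in
# (`aws ecr get-login` exists only in AWS CLI v1; v2 provides `aws ecr get-login-password` instead)
$(aws ecr get-login --region ${region} --no-include-email)


# Build the container image
docker build  -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

# Push to the ECR repository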
docker push ${fullname}


/home/ec2-user/SageMaker/aws_distributed_training/tabular_data/lightgbm_sm_trainingjob
Python 3.8.12
Python 3.8.12
Login Succeeded
Sending build context to Docker daemon  89.09kB
Step 1/10 : FROM ubuntu:18.04
 ---> 35b3f4f76a24
Step 2/10 : RUN apt-get -y update && apt-get install -y --no-install-recommends          wget          python3-distutils          python3-pip          python3-setuptools          nginx          ca-certificates          libgomp1     && rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> 70124194f97e
Step 3/10 : RUN ln -s /usr/bin/python3 /usr/bin/python
 ---> Using cache
 ---> 0fdb6b1b1732
Step 4/10 : RUN ln -s /usr/bin/pip3 /usr/bin/pip
 ---> Using cache
 ---> 73a2685aa1cd
Step 5/10 : RUN pip --no-cache-dir install wheel numpy scipy scikit-learn pandas lightgbm flask gunicorn &&     rm -rf /root/.cache
 ---> Running in a23744afb947
Collecting wheel
  Downloading https://files.pythonhosted.org/packages/27/d6/003e593296a85fd6ed616ed962795b2f87709c3eee2bca4f6d0fe55

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



In [22]:
import sys
print(sys.executable)

/home/ec2-user/anaconda3/envs/python3/bin/python


In [23]:
!python --version

Python 3.8.12


## Does it fail without gevent? Trying it out.

In [35]:
!conda install -c conda-forge lightgbm -y

Collecting package metadata (current_repodata.json): done
Solving environment: / 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - conda-forge/noarch::tqdm==4.62.3=pyhd8ed1ab_0
  - conda-forge/noarch::black==21.11b1=pyhd8ed1ab_0
  - conda-forge/linux-64::conda-package-handling==1.7.3=py38h497a2fe_1
  - conda-forge/noarch::dask-core==2021.11.2=pyhd8ed1ab_0
  - conda-forge/noarch::flake8==4.0.1=pyhd8ed1ab_0
  - conda-forge/noarch::imageio==2.9.0=py_0
  - conda-forge/noarch::importlib_metadata==4.8.2=hd8ed1ab_0
  - conda-forge/linux-64::pytest==6.2.5=py38h578d9bd_1
  - conda-forge/linux-64::watchdog==2.1.6=py38h578d9bd_1
  - conda-forge/linux-64::aiohttp==3.8.1=py38h497a2fe_0
  - conda-forge/linux-64::astropy==5.0=py38h6c62de6_0
  - conda-forge/linux-64::bokeh==2.4.2=py38h578d9bd_0
  - conda-forge/linux-64::distributed==2021.11.2=py38h578d9bd_0
  - conda-forge/noarch::flask==2.0.2=pyhd8ed1ab_0
  - conda-for

In [36]:
1

1

In [45]:
import boto3
import re
import os
from os import path
import numpy as np
import pandas as pd
from sagemaker import get_execution_role
import sagemaker as sage
from sagemaker.predictor import csv_serializer
import lightgbm as lgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import metrics
import json


# S3 location where each piece of data is stored
prefix = 'sagemaker_custom_container/byom-lightgbm/'
bucket_name = 'work-aws-oregon'

# IAM role used for training, endpoint creation, etc.
role = get_execution_role()

# Create the SageMaker session
sess = sage.Session()

In [46]:
# Load the iris data
iris = datasets.load_iris()

# Split the data into training and validation sets
train_x, validation_x, train_y, validation_y = train_test_split(iris.data, iris.target, test_size=0.2, stratify=iris.target)

In [47]:
# Create the lgb datasets
train = lgb.Dataset(train_x, label=train_y)

# Tie the validation data to the training data
validation = train.create_valid(validation_x, label=validation_y)

# Local save locations
train_data_local = './data/train.bin'
val_data_local = './data/validation.bin'

# Save in binary format
train.save_binary(train_data_local)
validation.save_binary(val_data_local)





<lightgbm.basic.Dataset at 0x7f94e0a7b790>

In [84]:
# Check
np.savetxt('./data/train.csv', train_x)

In [48]:
### Check
print(bucket_name)
print(prefix)

work-aws-oregon
sagemaker_custom_container/byom-lightgbm/


In [49]:
train_data_s3 = sess.upload_data(train_data_local, key_prefix=path.join(prefix, 'input/train'), bucket=bucket_name)
val_data_s3 = sess.upload_data(val_data_local, key_prefix=path.join(prefix, 'input/validation'), bucket=bucket_name)


## Training

In [105]:
# Hyperparameters
params = dict(
    #num_round = 10, ### SageMaker passes this to the container as a string, which causes an error
    objective = 'multiclass',
    num_class = len(iris.target_names)
)
# Metrics
metric_definitions = [dict(
    Name = 'multilogloss',
    Regex = '.*\\[[0-9]+\\].*valid_[0-9]+\'s\\smulti_logloss: (\\S+)'
)]
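
As a quick sanity check, the `Regex` above can be tried against a log line before launching a job. The sketch below assumes search-style matching and uses a sample line in the format LightGBM prints during training; the helper name `extract_metric` is hypothetical.

```python
import re

# Same pattern as the Regex entry in metric_definitions, written as a raw string.
METRIC_REGEX = r".*\[[0-9]+\].*valid_[0-9]+'s\smulti_logloss: (\S+)"


def extract_metric(log_line):
    """Return the captured metric as a float, or None if the line does not match."""
    match = re.search(METRIC_REGEX, log_line)
    return float(match.group(1)) if match else None


# A validation line (tab after the iteration number) and an unrelated info line.
print(extract_metric("[1]\tvalid_0's multi_logloss: 0.956813"))
print(extract_metric("[LightGBM] [Info] Total Bins 87"))
```

The first call yields the metric value and the second yields `None`, since `[LightGBM]` has no digits inside the brackets.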

In [106]:
### Check
print(sess)
print(role)

<sagemaker.session.Session object at 0x7f94c60a0310>
arn:aws:iam::805433377179:role/service-role/AmazonSageMaker-ExecutionRole-20220807T102095


# Verify in local mode

In [107]:
from sagemaker.local import LocalSession ### needed when using local mode

account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name

modelartifact_path = "s3://"+path.join(bucket_name, prefix, 'output')
model = sage.estimator.Estimator(
    '{}.dkr.ecr.{}.amazonaws.com/sagemaker-lightgbm:latest'.format(account, region), # container image URI
    role, # IAM role to use
    1, # number of instances
    #'ml.c4.2xlarge', # instance type
    'local', # instance type
    output_path=modelartifact_path, # where the model is saved
    #sagemaker_session=sess, # SageMaker session
    sagemaker_session=LocalSession(), # SageMaker session

    metric_definitions=metric_definitions # metric definitions
)

# Set the hyperparameters
model.set_hyperparameters(**params)
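
The `modelartifact_path` above is built with `os.path.join`, which uses the host OS separator. That separator happens to be `/` on a Linux notebook instance. A small sketch of a platform-independent alternative using `posixpath` follows; the helper name `s3_uri` is hypothetical.

```python
import posixpath


def s3_uri(bucket, *parts):
    """Join a bucket and key parts into an s3:// URI, always using '/' as the separator."""
    return "s3://" + posixpath.join(bucket, *parts)


# Same bucket and prefix as in this notebook.
print(s3_uri("work-aws-oregon", "sagemaker_custom_container/byom-lightgbm/", "output"))
```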

In [108]:
# Set the input data and run the training job
model.fit(dict(
    train = train_data_s3,
    validation = val_data_s3
))

Creating 1b3awei1v7-algo-1-tq6ai ... 
Creating 1b3awei1v7-algo-1-tq6ai ... done
Attaching to 1b3awei1v7-algo-1-tq6ai
1b3awei1v7-algo-1-tq6ai | Starting the training.
1b3awei1v7-algo-1-tq6ai | [LightGBM] [Info] Load from binary file /opt/ml/input/data/train/train.bin
1b3awei1v7-algo-1-tq6ai | You can set `force_col_wise=true` to remove the overhead.
1b3awei1v7-algo-1-tq6ai | [LightGBM] [Info] Total Bins 87
1b3awei1v7-algo-1-tq6ai | [LightGBM] [Info] Number of data points in the train set: 120, number of used features: 4
1b3awei1v7-algo-1-tq6ai | [LightGBM] [Info] Start training from score -1.098612
1b3awei1v7-algo-1-tq6ai | [LightGBM] [Info] Start training from score -1.098612
1b3awei1v7-algo-1-tq6ai | [LightGBM] [Info] Start training from score -1.098612
1b3awei1v7-algo-1-tq6ai | [1]	valid_0's multi_logloss: 0.956813
1b3awei1v7-algo-1-tq6ai | [2]	valid_0's multi_logloss: 0.839018
1b3awei1v7-a

# Run as a training job (now that local mode has been verified)

In [109]:
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name

modelartifact_path = "s3://"+path.join(bucket_name, prefix, 'output')
model = sage.estimator.Estimator(
    '{}.dkr.ecr.{}.amazonaws.com/sagemaker-lightgbm:latest'.format(account, region), # container image URI
    role, # IAM role to use
    1, # number of instances
    'ml.c4.2xlarge', # instance type
    #'local', # instance type
    output_path=modelartifact_path, # where the model is saved
    sagemaker_session=sess, # SageMaker session
    #sagemaker_session=LocalSession(), # SageMaker session

    metric_definitions=metric_definitions # metric definitions
)

# Set the hyperparameters
model.set_hyperparameters(**params)

In [110]:
# Set the input data and run the training job
model.fit(dict(
    train = train_data_s3,
    validation = val_data_s3
))

2022-09-14 09:08:40 Starting - Starting the training job...
2022-09-14 09:09:08 Starting - Preparing the instances for trainingProfilerReport-1663146520: InProgress
......
2022-09-14 09:10:10 Downloading - Downloading input data...
2022-09-14 09:10:36 Training - Training image download completed. Training in progress.
2022-09-14 09:10:36 Uploading - Uploading generated training model
Starting the training.
[LightGBM] [Info] Load from binary file /opt/ml/input/data/train/train.bin
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 87
[LightGBM] [Info] Number of data points in the train set: 120, number of used features: 4
[LightGBM] [Info] Start training from score -1.098612
[LightGBM] [Info] Start training from score -1.098612
[LightGBM] [Info] Start training from score -1.098612
[1]#011valid_0's multi_logloss: 0.956813
[2]#011valid_0's multi_logloss: 0.839018
[

## Deploy

* How long does it take to finish?
* Is there a timeout? 20 minutes. With no response to the ping, it fails with a timeout error.


In [111]:
predictor = model.deploy(1, 'ml.m4.xlarge', serializer=csv_serializer)

------------------------------------------*

UnexpectedStatusException: Error hosting endpoint sagemaker-lightgbm-2022-09-14-09-16-15-467: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..
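
The failure reason above refers to the hosting contract: SageMaker starts the container with the `serve` command and expects it to answer `GET /ping` and `POST /invocations` on port 8080. The actual cause is not visible in this output and would need the CloudWatch logs. One thing that may be worth checking, given that the image built above installs gunicorn without gevent, is which worker class the `serve` script requests. The sketch below only illustrates the contract. It uses the standard-library `http.server` rather than the nginx/gunicorn/flask stack from the Dockerfile, the handler and function names are hypothetical, and `/invocations` echoes its payload instead of running a model.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    """Hypothetical handler for the two routes SageMaker hosting calls."""

    def do_GET(self):
        # Health check: a 200 response tells SageMaker the container is up.
        if self.path == "/ping":
            self._reply(200, b"")
        else:
            self._reply(404, b"")

    def do_POST(self):
        if self.path == "/invocations":
            length = int(self.headers.get("Content-Length", 0))
            payload = self.rfile.read(length)
            # Placeholder: echo the payload back instead of calling a model.
            self._reply(200, payload)
        else:
            self._reply(404, b"")

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port=8080):
    """Bind to all interfaces; SageMaker sends requests to port 8080."""
    return HTTPServer(("0.0.0.0", port), InferenceHandler)


# Inside the container this would be started with:
# make_server().serve_forever()
```

Running something this small locally and requesting `/ping` is a quick way to separate "the web server never started" from "the model failed to load".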

## Inference

In [None]:
result = predictor.predict(validation_x)
result = json.loads(result)
result

In [None]:
cm = metrics.confusion_matrix(validation_y, np.argmax(result['results'], axis=1))
cm

## Delete the endpoint

In [None]:
'''
sess.delete_endpoint(predictor.endpoint)
'''

# 1. Does train.py work correctly?

See the following for a description of the files.

https://dev.classmethod.jp/articles/sagemaker-custom-container-deoloy/


In [67]:
#!/usr/bin/env python3

import os
import json
import sys
import traceback
import lightgbm as lgb


# Paths that SageMaker mounts into the container to pass data
#prefix = '/opt/ml/' ### for a training job
prefix = './data/' ### when running on the notebook instance
input_path = prefix + 'input/data'
output_path = os.path.join(prefix, 'output')
model_path = os.path.join(prefix, 'model')
param_path = os.path.join(prefix, 'input/config/hyperparameters.json')
inputdataconfig_path = os.path.join(prefix, 'input/config/inputdataconfig.json')


# Valid data channels (File mode only)
valid_channel_names = ['train', 'validation']


def train():
    print('Starting the training.')
    try:
        # Load the hyperparameters
        #with open(param_path, 'r') as f:
        #    hyperparams = json.load(f)

        # When running on the notebook instance
        hyperparams = {}


        # Load the input data config
        #with open(inputdataconfig_path, 'r') as f:
        #    inputdataconfig = json.load(f)

        # When running on the notebook instance
        inputdataconfig = {}

        # Load the input data
        inputdata_dic = {}
        for channel_name in inputdataconfig.keys():
            assert channel_name in valid_channel_names, 'input data channel must be included in '+str(valid_channel_names)
            inputdata_path = os.path.join(input_path, channel_name, channel_name+'.bin')
            inputdata_dic[channel_name] = lgb.Dataset(inputdata_path) ### load as an lgb.Dataset


        # Train with LightGBM
        model = lgb.train(
            hyperparams,
            inputdata_dic['train'],
            valid_sets= [inputdata_dic['validation']] if 'validation' in inputdata_dic else None
        )

        # Save the model
        model.save_model(os.path.join(model_path, 'lightgbm_model.txt'))
        print('Training complete.')

    except Exception as e:
        # If an error occurs, write its details to the failure file to report why the job failed
        trc = traceback.format_exc()
        with open(os.path.join(output_path, 'failure'), 'w') as s:
            s.write('Exception during training: ' + str(e) + '\n' + trc)
        # Also print to stderr so the error reaches the logs
        print('Exception during training: ' + str(e) + '\n' + trc, file=sys.stderr)
        # Return a non-zero exit code to signal failure
        sys.exit(255)

if __name__ == '__main__':
    train()

    # Return 0 to signal success
    sys.exit(0)

Exception during training: [Errno 2] No such file or directory: './data/input/config/inputdataconfig.json'
Traceback (most recent call last):
  File "/tmp/ipykernel_16814/4081494452.py", line 36, in train
    with open(inputdataconfig_path, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/input/config/inputdataconfig.json'

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Starting the training.
Traceback (most recent call last):
  File "/tmp/ipykernel_16814/4081494452.py", line 36, in train
    with open(inputdataconfig_path, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/input/config/inputdataconfig.json'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3524, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_16814/4081494452.py", line 69, in <cell line: 68>
    train()
  File "/tmp/ipykernel_16814/4081494452.py", line 66, in train
    sys.exit(255)
SystemExit: 255

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.8/site-packages/IPython/core/ultratb.py", line 1101, in get_records
    return _fixed_getinnerfra

TypeError: object of type 'NoneType' has no len()

## To better understand the behavior, create the files and run with the same layout as a training job

Create hyperparameters.json
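
A minimal sketch of creating that layout under `./data/` follows. The directory names come from the path variables in the training script. The JSON contents mirror the values this notebook prints further down, and the function name `create_local_layout` is hypothetical.

```python
import json
import os


def create_local_layout(root="./data/"):
    """Create the directories and config files that train() expects under root."""
    subdirs = (
        "input/config",
        "input/data/train",
        "input/data/validation",
        "model",
        "output",
    )
    for sub in subdirs:
        os.makedirs(os.path.join(root, sub), exist_ok=True)

    # Every hyperparameter value arrives as a string, as in the printed output.
    hyperparams = {"num_round": "10", "objective": "multiclass", "num_class": "3"}
    with open(os.path.join(root, "input/config/hyperparameters.json"), "w") as f:
        json.dump(hyperparams, f)

    # One entry per input channel; train() only uses the keys.
    inputdataconfig = {
        "train": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None",
        }
    }
    with open(os.path.join(root, "input/config/inputdataconfig.json"), "w") as f:
        json.dump(inputdataconfig, f)
```

After calling `create_local_layout()`, `train.bin` still has to be copied to `./data/input/data/train/train.bin`, because the script builds the path as `<input_path>/<channel>/<channel>.bin`.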

In [76]:
#!/usr/bin/env python3

import os
import json
import sys
import traceback
import lightgbm as lgb


# Paths that SageMaker mounts into the container to pass data
#prefix = '/opt/ml/' ### for a training job
prefix = './data/' ### when running on the notebook instance
input_path = prefix + 'input/data'
output_path = os.path.join(prefix, 'output')
model_path = os.path.join(prefix, 'model')
param_path = os.path.join(prefix, 'input/config/hyperparameters.json')
inputdataconfig_path = os.path.join(prefix, 'input/config/inputdataconfig.json')


# Valid data channels (File mode only)
valid_channel_names = ['train', 'validation']


def train():
    print('Starting the training.')
    try:
        # Load the hyperparameters
        with open(param_path, 'r') as f:
            hyperparams = json.load(f)

        # When running on the notebook instance
        print(hyperparams)


        # Load the input data config
        with open(inputdataconfig_path, 'r') as f:
            inputdataconfig = json.load(f)

        # When running on the notebook instance
        print(inputdataconfig)
        #inputdataconfig = {}

        # Load the input data
        inputdata_dic = {}
        for channel_name in inputdataconfig.keys():
            assert channel_name in valid_channel_names, 'input data channel must be included in '+str(valid_channel_names)
            inputdata_path = os.path.join(input_path, channel_name, channel_name+'.bin')
            inputdata_dic[channel_name] = lgb.Dataset(inputdata_path) ### load as an lgb.Dataset


        # Train with LightGBM
        print('===== start training =====')
        print(hyperparams)
        print(inputdata_dic)
        print(inputdata_dic['train'])
        print([inputdata_dic['validation']] if 'validation' in inputdata_dic else None)
        model = lgb.train(
            hyperparams,
            inputdata_dic['train'],
            #valid_sets= [inputdata_dic['validation']] if 'validation' in inputdata_dic else None
        )

        # Save the model
        print('===== saving model =====')
        model.save_model(os.path.join(model_path, 'lightgbm_model.txt'))
        print('Training complete.')

    except Exception as e:
        # If an error occurs, write its details to the failure file to report why the job failed
        trc = traceback.format_exc()
        with open(os.path.join(output_path, 'failure'), 'w') as s:
            s.write('Exception during training: ' + str(e) + '\n' + trc)
        # Also print to stderr so the error reaches the logs
        print('Exception during training: ' + str(e) + '\n' + trc, file=sys.stderr)
        # Return a non-zero exit code to signal failure
        sys.exit(255)

if __name__ == '__main__':
    train()

    # Return 0 to signal success
    sys.exit(0)

Exception during training: '<=' not supported between instances of 'str' and 'int'
Traceback (most recent call last):
  File "/tmp/ipykernel_16814/1192742344.py", line 57, in train
    model = lgb.train(
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.8/site-packages/lightgbm/engine.py", line 189, in train
    if num_boost_round <= 0:
TypeError: '<=' not supported between instances of 'str' and 'int'

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Starting the training.
{'num_round': '10', 'objective': 'multiclass', 'num_class': '3'}
{'train': {'TrainingInputMode': 'File', 'S3DistributionType': 'FullyReplicated', 'RecordWrapperType': 'None'}}
===== start training =====
{'num_round': '10', 'objective': 'multiclass', 'num_class': '3'}
{'train': <lightgbm.basic.Dataset object at 0x7f955125a190>}
<lightgbm.basic.Dataset object at 0x7f955125a190>
None
Traceback (most recent call last):
  File "/tmp/ipykernel_16814/1192742344.py", line 57, in train
    model = lgb.train(
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.8/site-packages/lightgbm/engine.py", line 189, in train
    if num_boost_round <= 0:
TypeError: '<=' not supported between instances of 'str' and 'int'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3524, in run_code
    exec(code_obj, self.u

TypeError: object of type 'NoneType' has no len()
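
The `TypeError` above comes from `num_round` arriving as the string `'10'`: every value in `hyperparameters.json` is a string, and the traceback shows it reaching the `num_boost_round <= 0` comparison inside `lgb.train`. One possible workaround, sketched under the assumption that numeric-looking strings should become numbers, is to cast the values before training and pass the round count explicitly. Both helper names are hypothetical, and the default of 100 rounds is an assumption matching the `lgb.train` default.

```python
def cast_hyperparameters(raw):
    """Convert string values to int or float where possible; leave other values alone."""
    casted = {}
    for key, value in raw.items():
        casted[key] = value
        if not isinstance(value, str):
            continue  # already typed, nothing to convert
        for convert in (int, float):
            try:
                casted[key] = convert(value)
                break
            except ValueError:
                continue
    return casted


def split_num_round(hyperparams, default=100):
    """Separate the boosting-round count from the remaining LightGBM params."""
    params = dict(hyperparams)
    num_round = params.pop("num_round", default)
    return params, num_round


# The hyperparameters exactly as printed by the cell above.
raw = {"num_round": "10", "objective": "multiclass", "num_class": "3"}
params, num_round = split_num_round(cast_hyperparameters(raw))
print(params, num_round)
```

The results could then be passed as `lgb.train(params, inputdata_dic['train'], num_boost_round=num_round)`.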