# Training YOLOv5 on SageMaker

**References**
> YOLOv5: https://github.com/ultralytics/yolov5  
> Time-series training and MLOps with Amazon SageMaker: https://github.com/Napkin-DL/sm-informer-mlops-quicksight  
> How to Train YOLOv5 On a Custom Dataset: https://blog.roboflow.com/how-to-train-yolov5-on-a-custom-dataset/

**Kernel:** `conda_pytorch_latest_p36`

## 1. Install and update required packages

In [1]:
install_needed = True  # should only be True once
# install_needed = False

In [2]:
import sys
import IPython

if install_needed:
    print("installing deps and restarting kernel")
    !{sys.executable} -m pip install -U 'sagemaker[local]'
    !{sys.executable} -m pip install -U sagemaker-experiments # SageMaker Experiments SDK 
    !{sys.executable} -m pip install -U sagemaker             # SageMaker Python SDK
    !/bin/bash ./local/local_mode_setup.sh
    IPython.Application.instance().kernel.do_shutdown(True)

installing deps and restarting kernel
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_latest_p36/bin/python -m pip install --upgrade pip' command.
nvidia-docker2 already installed. We are good to go!
Stopping docker: [  OK  ]
Starting docker:	.[  OK  ]
SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!


## 2. Environment setup

In [1]:
import matplotlib.pyplot as plt
import sagemaker
# import splitfolders

import os
import time
import warnings

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

import boto3
import numpy as np

# from tqdm import tqdm
from time import strftime

from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch

warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina'

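## 3. Set up SageMaker Experiments
[SageMaker Experiments](https://aws.amazon.com/blogs/machine-learning/streamline-modeling-with-amazon-sagemaker-studio-and-amazon-experiments-sdk/) is an Amazon SageMaker capability for organizing, tracking, comparing, and evaluating machine learning experiments. Machine learning is an iterative process: you experiment with many combinations of data, algorithms, and parameters while observing how incremental changes affect model accuracy. Over time, this iteration can produce thousands of training runs and model versions, which makes it hard to keep track of the best-performing models and their input configurations. It is equally hard to compare an experiment in progress with earlier ones to find further incremental improvements.

SageMaker Experiments automatically tracks the inputs, parameters, configurations, and results of each iteration as a trial. You can assign, group, and organize these trials into experiments. SageMaker Experiments is integrated with Amazon SageMaker Studio, which provides a visual interface for browsing current and past experiments, comparing trials on key performance metrics, and identifying the best-performing models.

SageMaker Experiments consists of Experiment, Trial, Trial Component, and Tracker. The figure below shows how these components relate.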

<p align="center">
<center><img src="./image/sm-experiments.jpeg" height="400" width="600" alt=""></center>
<br><br>
<b>Figure 1. SageMaker Experiments components</b> 
</p>

In [2]:
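def create_experiment(experiment_name):
    """Load the experiment if it already exists; otherwise create it."""
    try:
        sm_experiment = Experiment.load(experiment_name)
    except Exception:  # the experiment does not exist yet
        sm_experiment = Experiment.create(experiment_name=experiment_name,
                                          tags=[
                                              {
                                                  'Key': 'modelname',
                                                  'Value': 'yolov5_sm'
                                              },
                                          ])
    return sm_experiment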

In [3]:
def create_trial(experiment_name, set_param, i_type, i_cnt, spot):
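    # NOTE: lowercase '%s' is a platform-specific directive that expands to epoch seconds (it is not '%S')
    create_date = strftime("%m%d-%H%M%s")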
    
    algo = 'dp'
    
    spot = 's' if spot else 'd'
    i_tag = 'test'
    
    if i_type == 'ml.p3.16xlarge':
        i_tag = 'p3'
    elif i_type == 'ml.p2.8xlarge':
        i_tag = 'p2'
    elif i_type == 'ml.p3dn.24xlarge':
        i_tag = 'p3dn'
    elif i_type == 'ml.p4d.24xlarge':
        i_tag = 'p4d'    
        
    trial = "-".join([i_tag,str(i_cnt),algo, spot])
       
    sm_trial = Trial.create(trial_name=f'{experiment_name}-{trial}-{create_date}',
                            experiment_name=experiment_name)

    job_name = f'{sm_trial.trial_name}'
    return job_name

## 4. Set the data store and training script locations  
>[Using the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html)  
>[Session](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html)

In [4]:
prefix = 'sinjoonk/yolov5'

sess = boto3.Session() 
sagemaker_session = sagemaker.Session()
sm = sess.client('sagemaker')
default_bucket = sagemaker_session.default_bucket()

role = sagemaker.get_execution_role()

s3_data_path = f's3://{default_bucket}/{prefix}'
source_dir = 'yolov5' # Folder containing the training code

## 5. Prepare the data in YOLOv5 format

This walkthrough uses the [BCCD Dataset published by Roboflow](https://public.roboflow.com/object-detection/bccd), a set of blood images showing WBCs (white blood cells), RBCs (red blood cells), and platelets.

To train a YOLOv5 object detection model, the train/val/test datasets must follow the folder structure below. The `images` folder holds the images, and the `labels` folder holds one annotation file per image.
```
├── test
│   ├── images
│   └── labels
├── train
│   ├── images
│   └── labels
└── valid
    ├── images
    └── labels
```

YOLOv5 declares the dataset paths, the number of classes, and the class names in a separate YAML file.

- `data_local.yaml`: configuration used when training runs locally on the notebook instance.
- `data_sm.yaml`: configuration used when training runs in SageMaker Local mode or as SageMaker managed training. SageMaker places the dataset (read from S3 for managed training, or from local storage in Local mode) under `/opt/ml/input/data/[channel_name]/` inside the SageMaker container, so `train` and `val` point to paths inside the container rather than to local paths on the Jupyter notebook instance.
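
As a rough illustration of that path mapping, the sketch below derives the container-side directory for a channel. The helper name `channel_dir` is hypothetical; the `/opt/ml/input/data/` prefix and the `yolov5_input` channel name are the ones used in this notebook.

```python
from pathlib import PurePosixPath

# Root under which SageMaker places each input channel inside the container.
SM_INPUT_ROOT = PurePosixPath("/opt/ml/input/data")


def channel_dir(channel_name: str, *parts: str) -> str:
    """Return the container path for a channel (hypothetical helper)."""
    return str(SM_INPUT_ROOT.joinpath(channel_name, *parts))


# The channel name is the key of the `inputs` dict passed to `fit()`.
train_images = channel_dir("yolov5_input", "train", "images")
val_images = channel_dir("yolov5_input", "valid", "images")
print(train_images)
print(val_images)
```

These are the same two paths that `data_sm.yaml` declares for `train` and `val`.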

In [5]:
%%writefile yolov5/data/data_sm.yaml
train: /opt/ml/input/data/yolov5_input/train/images
val: /opt/ml/input/data/yolov5_input/valid/images

nc: 3
names: ['Platelets', 'RBC', 'WBC']

Overwriting yolov5/data/data_sm.yaml


In [6]:
%%writefile yolov5/data/data_local.yaml
train: BCCD/train/images
val: BCCD/valid/images

nc: 3
names: ['Platelets', 'RBC', 'WBC']

Overwriting yolov5/data/data_local.yaml


Upload the dataset from the Jupyter notebook instance to S3.

In [7]:
s3_data_path

's3://sagemaker-us-east-1-889750940888/sinjoonk/yolov5'

In [9]:
!aws s3 sync ./BCCD {s3_data_path}

## 6. Experiment settings

Specify the S3 locations for the output files and checkpoints produced during SageMaker managed training. Outputs include the training results: **model artifacts, SageMaker Debugger output, SageMaker Debugger profiling output, and SageMaker Debugger rules output**.

In [None]:
# code_location = f's3://{default_bucket}/{prefix}/sm_codes'
output_path = f's3://{default_bucket}/{prefix}/output' 
checkpoint_s3_bucket = f's3://{default_bucket}/{prefix}/checkpoints'
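
For orientation, a managed training job writes its model archive under the job name inside `output_path`. The sketch below assembles that location by hand. `model_artifact_uri` is a hypothetical helper, the bucket name is a placeholder, and the layout is inferred from the S3 listing in section 10 rather than from an API guarantee.

```python
def model_artifact_uri(output_path: str, job_name: str) -> str:
    """Build the expected S3 URI of model.tar.gz (hypothetical helper)."""
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


# Placeholder values for illustration only.
example_output_path = "s3://my-bucket/sinjoonk/yolov5/output"
example_job_name = "yolov5-on-sagemaker-2021-11-08-12-31-03-296"
uri = model_artifact_uri(example_output_path, example_job_name)
print(uri)
```

In practice the estimator's `model_data` attribute reports this URI once the job has completed, which is what section 10 relies on.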

SageMaker can scan the standard output of the training code for values that match a pattern and store them as custom CloudWatch metrics. Pass `metric_definitions` as the value of the `metric_definitions` parameter when declaring the SageMaker `Estimator`.

In [None]:
# TODO
metric_definitions = [
    {'Name': 'Precision', 'Regex': r'all\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)'},
    {'Name': 'Recall', 'Regex': r'all\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)'},
    {'Name': 'mAP@.5', 'Regex': r'all\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)'},
    {'Name': 'mAP@.5:.95', 'Regex': r'all\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)'}
]
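
Before launching a job, it can help to check the patterns against a sample log line. The sketch below does that with Python's `re` module. The sample line is made up and assumes the validation summary prints its columns in the order Class, Images, Labels, P, R, mAP@.5, mAP@.5:.95; adjust it to match the output of your YOLOv5 version.

```python
import re

# Same patterns as `metric_definitions` above, repeated so this block is self-contained.
metric_definitions = [
    {'Name': 'Precision', 'Regex': r'all\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)'},
    {'Name': 'Recall', 'Regex': r'all\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)'},
    {'Name': 'mAP@.5', 'Regex': r'all\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)'},
    {'Name': 'mAP@.5:.95', 'Regex': r'all\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)'},
]

# Hypothetical summary line (values are invented for this check).
sample_line = "                 all         73        967      0.856      0.890      0.920      0.620"

parsed = {}
for definition in metric_definitions:
    match = re.search(definition['Regex'], sample_line)
    # Each pattern captures exactly one group: the metric value.
    parsed[definition['Name']] = float(match.group(1)) if match else None

print(parsed)
```

A `None` value for any metric would indicate that its pattern does not match the log format and would produce no data points in CloudWatch.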

## 7. Run the training code locally

### WandB setup (optional)
https://wandb.ai/cayush/yoloV5/reports/Track-and-debug-your-YOLOv5-models--VmlldzozMDQ1OTg

In [None]:
# !pip install -r yolov5/requirements.txt

In [16]:
!python yolov5/train_sm.py \
--batch-size 64 \
--cfg yolov5s.yaml \
--data data_local.yaml \
--epochs 1 \
--freeze 24 \
--weights weights/yolov5s.pt \
--workers 0

Not found!!!
wandb: Currently logged in as: annakie (use `wandb login --relogin` to force relogin)
THIS IS train_sm.py!!!
torchvision version: 0.10.1+cu111
torch version: 1.9.1+cu111
train_sm: weights=weights/yolov5s.pt, cfg=yolov5s.yaml, data=data_local.yaml, hyp=yolov5/data/hyps/hyp.scratch.yaml, epochs=1, batch_size=64, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=0, project=yolov5/runs/train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=24, save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest, model_dir=None
github: skipping check (not a git repository), for updates see https://github.com/ultralytics/yolov5
Parse error at "'--find-l'": Expected W:(abcd...)
YOL

Running `wandb.init()` stores the W&B API key in `/root/.netrc`. To make the key available inside the container that SageMaker Local/Managed training launches, add the code below to `utils/loggers/__init__.py`; it copies `.netrc` to `/root/.netrc` in the container. Save the `.netrc` file as `source_dir/.netrc` beforehand.

```
# __init__.py
...
################## For SageMaker ##################
from pathlib import Path
import subprocess

### Thanks to Youngjoon Choi :)
def wandb_setting():
    set_path = '/opt/ml/code/.netrc' #WANDB API Key
    file = Path(set_path)
    if file.exists():
        subprocess.run(['cp', '-r', set_path, '/root/.netrc'])
    else:
        print('=' * 100)
        print('Not found!!!')
        print('=' * 100)    

wandb_setting()
################## For SageMaker ##################
...
```

## 8. Local mode

Define the hyperparameters to pass as arguments to the training script (`yolov5/train_sm.py`). SageMaker saves the hyperparameters specified when creating the estimator to `/opt/ml/input/config/hyperparameters.json` inside the SageMaker container, then reads that file and feeds the values to the training script as command-line arguments.

Local mode is used here only to confirm that the training script runs without errors in the SageMaker environment, so `epochs` is set to `1`.

Because the training set is relatively small, we use transfer learning.  
- Transfer Learning with Frozen Layers: https://github.com/ultralytics/yolov5/issues/1314
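
To make the effect of the `freeze` value concrete, here is a minimal sketch in plain Python. It assumes that layers are frozen by matching parameter-name prefixes of the form `model.<index>.`, as the linked issue describes; the helper names and the parameter names are made up for illustration.

```python
def frozen_prefixes(freeze: int) -> list:
    """Name prefixes of the first `freeze` layers, e.g. 'model.0.' (assumed naming scheme)."""
    return [f"model.{index}." for index in range(freeze)]


def is_frozen(param_name: str, freeze: int) -> bool:
    """True if the parameter belongs to one of the first `freeze` layers."""
    return any(param_name.startswith(prefix) for prefix in frozen_prefixes(freeze))


# Made-up parameter names for illustration.
names = [
    "model.0.conv.weight",
    "model.9.cv1.conv.weight",
    "model.10.conv.weight",
    "model.24.m.0.weight",
]
trainable = [name for name in names if not is_frozen(name, freeze=10)]
print(trainable)
```

In a real PyTorch model the same test would decide which parameters get `requires_grad = False`. Under this scheme `freeze=10` leaves layers 10 and above trainable, while `freeze=24` leaves only layer 24 trainable.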

In [17]:
hyperparameters_local = {
    'data': 'data_sm.yaml',
    'cfg': 'yolov5s.yaml',
    'weights': 'weights/yolov5s.pt', # Transfer learning
    'batch-size': 64,
    'epochs': 1,
    'project': '/opt/ml/model',
    'workers': 0, # To avoid shm OOM issue
    'freeze': 10, # Transfer learning: freeze the first 10 layers (the YOLOv5 backbone)
}
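
The sketch below illustrates the idea described at the start of this section: a dictionary of hyperparameters is flattened into `--name value` pairs that an `argparse`-based script can read. `to_cli_args` is a hypothetical helper written for this illustration; it does not reproduce what the SageMaker training toolkit does internally, and the parser declares only a few of the training script's options.

```python
import argparse


def to_cli_args(hyperparameters: dict) -> list:
    """Flatten {'epochs': 1} into ['--epochs', '1'] (hypothetical helper)."""
    args = []
    for name, value in hyperparameters.items():
        args.extend([f"--{name}", str(value)])
    return args


example = {"data": "data_sm.yaml", "batch-size": 64, "epochs": 1, "freeze": 10}
cli_args = to_cli_args(example)
print(" ".join(cli_args))

# A reduced parser standing in for the one in the training script.
parser = argparse.ArgumentParser()
parser.add_argument("--data", type=str)
parser.add_argument("--batch-size", type=int)
parser.add_argument("--epochs", type=int)
parser.add_argument("--freeze", type=int)
opt = parser.parse_args(cli_args)
print(opt.batch_size, opt.epochs)
```

Note that `argparse` exposes `--batch-size` as `opt.batch_size`, which is why the dictionary key uses a hyphen.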

To reinstall torch and torchvision in the SageMaker prebuilt PyTorch container image at versions 1.9.1+cu111 and 0.10.1+cu111 respectively, add the following entries to `requirements.txt`.

```
# requirements.txt
...
### For SageMaker
--find-links https://download.pytorch.org/whl/torch_stable.html
torch==1.9.1+cu111
torchvision==0.10.1+cu111
### For SageMaker
...
```

In [18]:
from sagemaker.local import LocalSession
sagemaker_session = LocalSession()

In [19]:
# all input configurations, parameters, and metrics specified in estimator 
# definition are automatically tracked

estimator_local = PyTorch(
    entry_point='train_sm.py',
    source_dir=source_dir,
    base_job_name='yolov5-on-sagemaker',
    role=role,
    sagemaker_session=sagemaker_session,
    framework_version='1.8.1',
    py_version='py36',
    instance_count=1,
    instance_type='local_gpu',
    volume_size=256,
    output_path=output_path,
    hyperparameters=hyperparameters_local,
#     metric_definitions=metric_definitions,
    max_run=3*60*60,
)

In [20]:
train_dir = os.path.join(os.getcwd(), 'BCCD')
!ls {train_dir}

README.dataset.txt  README.roboflow.txt  test  train  valid


In [21]:
inputs = {'yolov5_input': 'file://{}'.format(train_dir)}

In [22]:
estimator_local.fit(inputs)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: yolov5-on-sagemaker-2021-11-08-12-23-24-937
INFO:sagemaker.local.local_session:Starting training job
INFO:sagemaker.local.image:No AWS credentials found in session but credentials from EC2 Metadata Service are available.
INFO:sagemaker.local.image:docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-m9v5w:
    command: train
    container_name: n076r1dscw-algo-1-m9v5w
    environment:
    - '[Masked]'
    - '[Masked]'
    image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-gpu-py36
    networks:
      sagemaker-local:
        aliases:
        - algo-1-m9v5w
    runtime: nvidia
    stdin_open: true
    tty: true
    volumes:
    - /tmp/tmpvjgav8jt/algo-1-m9v5w/input:/opt/ml/input
    - /tmp/tmpvjgav8jt/algo-1-m9v5

Creating n076r1dscw-algo-1-m9v5w ... 
Creating n076r1dscw-algo-1-m9v5w ... done
Attaching to n076r1dscw-algo-1-m9v5w
n076r1dscw-algo-1-m9v5w | 2021-11-08 12:23:32,025 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
n076r1dscw-algo-1-m9v5w | 2021-11-08 12:23:32,049 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
n076r1dscw-algo-1-m9v5w | 2021-11-08 12:23:32,052 sagemaker_pytorch_container.training INFO     Invoking user training script.
n076r1dscw-algo-1-m9v5w | 2021-11-08 12:23:33,076 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
n076r1dscw-algo-1-m9v5w | /opt/conda/bin/python3.6 -m pip install -r requirements.txt
n076r1dscw-algo-1-m9v5w | Looking in links: https://download.pytorch.org/whl/torch_stable.html
n076r1dscw-algo-1-m9v5w | Collecting torch==1.9.1+cu111
n076r1dscw-algo-1-m9v5w |   Downlo

n076r1dscw-algo-1-m9v5w | Collecting torchvision==0.10.1+cu111
n076r1dscw-algo-1-m9v5w |   Downloading https://download.pytorch.org/whl/cu111/torchvision-0.10.1%2Bcu111-cp36-cp36m-linux_x86_64.whl (20.6 MB)
     |████████████████████████████████| 20.6 MB 18.5 MB/s            
n076r1dscw-algo-1-m9v5w | Collecting tensorboard>=2.4.1
n076r1dscw-algo-1-m9v5w |   Downloading tensorboard-2.7.0-py3-none-any.whl (5.8 MB)
     |████████████████████████████████| 5.8 MB 14.7 MB/s            
n076r1dscw-algo-1-m9v5w | Collecting wandb
n076r1dscw-algo-1-m9v5w |   Downloading wandb-0.12.6-py2.py3-none-any.whl (1.7 MB)
     |████████████████████████████████| 1.7 MB 37.1 MB/s            
n076r1dscw-algo-1-m9v5w | Collecting thop
n076r1dscw-algo-1-m9v5w |   Downloading thop-0.0.31.post2005241907-py3-none-any.whl (8.7 kB)
n076r1dscw-algo-1-m9v5w | Collecting google-auth-oauthlib<0.5,>=0.4.1
n076r1dscw

## 9-1. SageMaker managed training
Congratulations! Now let's run more epochs using the larger compute resources available in the SageMaker environment. This time we train from scratch instead of using transfer learning.

In [23]:
# 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:<tag>

from sagemaker import image_uris
image_uri = image_uris.retrieve(framework='pytorch',
                                region='us-east-1',
                                version='1.8.1',
                                py_version='py3',
                                image_scope='training', 
                                instance_type='ml.p3.2xlarge')
image_uri

'763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-gpu-py3'

In [24]:
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()

In [25]:
hyperparameters_managed = {
    'data': 'data_sm.yaml',
    'cfg': 'yolov5s.yaml',
    'weights': 'weights/yolov5s.pt',
    'batch-size': 128,
    'epochs': 100,
#     'epochs': 1,
    'project': '/opt/ml/model',
    'workers': 8,
    'freeze': 10
}

In [26]:
experiment_name = 'yolov5-BCCD'
instance_count = 1

# instance_type = 'ml.p3.16xlarge'
instance_type = 'ml.p2.8xlarge'
# instance_type = 'ml.p3dn.24xlarge' 
# instance_type = 'ml.p4d.24xlarge'
# instance_type = 'ml.m5.2xlarge'

do_spot_training = True
max_wait = 3*60*60
max_run = 3*60*60

In [27]:
# all input configurations, parameters, and metrics specified in estimator 
# definition are automatically tracked
estimator_managed = PyTorch(
    entry_point='train_sm.py',
    source_dir=source_dir,
    base_job_name='yolov5-on-sagemaker',
    role=role,
    sagemaker_session=sagemaker_session,
    framework_version='1.8.1',
    py_version='py36',
    instance_count=instance_count,
    instance_type=instance_type,
    volume_size=256,
#     code_location = code_location,
    output_path=output_path,
    hyperparameters=hyperparameters_managed,
#     distribution=distribution,
    metric_definitions=metric_definitions,
    max_run=max_run,
    checkpoint_s3_uri=checkpoint_s3_bucket,
#     use_spot_instances=do_spot_training,  # use spot instances
#     max_wait=max_wait # use spot instances
)

In [28]:
inputs = {'yolov5_input': s3_data_path}
inputs

{'yolov5_input': 's3://sagemaker-us-east-1-889750940888/sinjoonk/yolov5'}

In [29]:
create_experiment(experiment_name)
job_name = create_trial(experiment_name, hyperparameters_managed, instance_type, instance_count, do_spot_training)
job_name

'yolov5-BCCD-p2-1-dp-s-1108-12311636374662'

In [30]:
estimator_managed.fit(inputs=inputs,
                      experiment_config={
                          'TrialName': job_name,
                          'TrialComponentDisplayName': job_name,
                        },
                      wait=False)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: yolov5-on-sagemaker-2021-11-08-12-31-03-296


In [31]:
job_name=estimator_managed.latest_training_job.name

In [32]:
sagemaker_session.logs_for_job(job_name=job_name, wait=True)

2021-11-08 12:31:06 Starting - Starting the training job...
2021-11-08 12:31:30 Starting - Launching requested ML instancesProfilerReport-1636374663: InProgress
.........
2021-11-08 12:33:01 Starting - Preparing the instances for training............
2021-11-08 12:34:53 Downloading - Downloading input data......
2021-11-08 12:35:57 Training - Downloading the training image..................
2021-11-08 12:39:05 Training - Training image download completed. Training in progress..bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-11-08 12:39:06,766 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-11-08 12:39:06,844 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-11-08 12:39:09,867 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-11-08 12:39:11,422 sag

## 10. Check the training results

In [34]:
artifacts_dir = estimator_managed.model_data.replace('model.tar.gz', '')
print(artifacts_dir)
!aws s3 ls --human-readable {artifacts_dir}

s3://sagemaker-us-east-1-889750940888/sinjoonk/yolov5/output/yolov5-on-sagemaker-2021-11-08-12-31-03-296/output/
2021-11-08 12:55:47   29.5 MiB model.tar.gz


In [36]:
model_dir = './model'

!rm -rf $model_dir

import json, os

if not os.path.exists(model_dir):
    os.makedirs(model_dir)

!aws s3 cp {artifacts_dir}model.tar.gz {model_dir}/model.tar.gz
!tar -xvzf {model_dir}/model.tar.gz -C {model_dir}

download: s3://sagemaker-us-east-1-889750940888/sinjoonk/yolov5/output/yolov5-on-sagemaker-2021-11-08-12-31-03-296/output/model.tar.gz to model/model.tar.gz
exp/
exp/results.csv
exp/results.png
exp/R_curve.png
exp/train_batch0.jpg
exp/PR_curve.png
exp/labels_correlogram.jpg
exp/train_batch1.jpg
exp/confusion_matrix.png
exp/val_batch0_pred.jpg
exp/labels.jpg
exp/events.out.tfevents.1636375290.algo-1.61.0
exp/F1_curve.png
exp/val_batch0_labels.jpg
exp/weights/
exp/weights/best.pt
exp/weights/last.pt
exp/P_curve.png
exp/train_batch2.jpg
exp/hyp.yaml
exp/opt.yaml
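
The shell commands above can also be expressed with Python's standard `tarfile` module. The sketch below is a minimal illustration and not part of the original notebook: `extract_and_find_weights` is a hypothetical helper, and the archive is a tiny stand-in built on the fly so the code runs without S3 access.

```python
import tarfile
import tempfile
from pathlib import Path


def extract_and_find_weights(archive_path, dest_dir):
    """Extract a model archive and return the paths of any best.pt files found."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    return sorted(dest.rglob("best.pt"))


# Build a stand-in archive that mimics the exp/weights/ layout listed above.
work = Path(tempfile.mkdtemp())
weights_dir = work / "src" / "exp" / "weights"
weights_dir.mkdir(parents=True)
(weights_dir / "best.pt").write_bytes(b"placeholder")
archive_path = work / "model.tar.gz"
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(work / "src" / "exp", arcname="exp")

found = extract_and_find_weights(archive_path, work / "model")
print([p.relative_to(work / "model").as_posix() for p in found])
```

With the real archive, `archive_path` would point at the downloaded `model.tar.gz` and `dest_dir` at `model_dir`.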


# Optional: BYOC
If the SageMaker prebuilt Docker container images do not fit your use case, you can build your own container image and use it for training and inference in SageMaker.

> Sagemaker training toolkit:
https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html  
> Custom SDK framework estimator: https://github.com/giuseppeporcelli/sagemaker-custom-training-containers/blob/master/script-mode-container-2/notebook/script-mode-container-2.ipynb

## Container image build and push
Add the SageMaker training toolkit to YOLOv5's official Dockerfile and create `/opt/ml/code`, where the training code will be stored. 
```
#Dockerfile
...
# Install sagemaker-training toolkit that contains the common functionality necessary to create a container compatible with SageMaker and the Python SDK.
RUN pip3 install sagemaker-training

RUN mkdir -p /opt/ml/code
WORKDIR /opt/ml/code
...
```

Run `build_and_push.sh [YOUR_ECR_REPOSITORY_NAME]` to build the container image and push it to ECR.

In [None]:
%cd yolov5-sm

In [None]:
%cd yolov5
!sh + build_and_push.sh sinjoonk-yolov5

%cd ..

## Local mode training
Define a `CustomFramework` class that inherits from the `Framework` class. `Framework` is the parent class of `sagemaker.tensorflow.estimator.TensorFlow`, `sagemaker.pytorch.estimator.PyTorch`, and `sagemaker.sklearn.estimator.SKLearn`.

**References**
> How to Use Amazon SageMaker SDK 2.x (Five Core Objects) - 강성문, AWS Innovate 2021: https://www.youtube.com/watch?v=n2Ky1nZXyWo&ab_channel=AmazonWebServicesKorea  
> sagemaker.estimator.Framework: https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Framework

In [None]:
from sagemaker.estimator import Framework

class CustomFramework(Framework):
    def __init__(
        self,
        entry_point,
        framework_version=None,
        py_version=None,
        source_dir=None,
        hyperparameters=None,
        image_uri=None,
        distribution=None,
        **kwargs
    ):
        super(CustomFramework, self).__init__(
            entry_point, source_dir, hyperparameters, image_uri=image_uri, **kwargs
        )
    
    def _configure_distribution(self, distributions):
        return None
    
    def create_model(
        self,
        model_server_workers=None,
        role=None,
        vpc_config_override=None,
        entry_point=None,
        source_dir=None,
        dependencies=None,
        image_uri=None,
        **kwargs
    ):
        return None

In [None]:
role = sagemaker.get_execution_role()

In [None]:
hyperparameters_local_custom = {
    'data': 'data_sm.yaml',
    'cfg': 'yolov5s.yaml',
    #'weights': 'weights/yolov5s.pt', # Transfer learning
    'batch-size': 64,
    'epochs': 1,
    'project': '/opt/ml/model',
    'workers': 0, # To avoid shm OOM issue
    #'freeze': freeze, # For transfer learning, freeze all Layers except for the final output convolution layers.
}

In [None]:
hyperparameters_local_custom

In [None]:
!ls {source_dir}

In [None]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
from sagemaker.local import LocalSession
sagemaker_session = LocalSession()

byoc_image_uri = '889750940888.dkr.ecr.us-east-1.amazonaws.com/sinjoonk-yolov5'
instance_type = 'local_gpu'

estimator_local_custom = CustomFramework(
    image_uri=byoc_image_uri,
    entry_point='train_sm.py',
    source_dir=source_dir,
    base_job_name='yolov5-on-sagemaker',
    role=role,
    sagemaker_session=sagemaker_session,
    instance_count=1,
    instance_type='local_gpu',
    volume_size=256,
    output_path=output_path,
    hyperparameters=hyperparameters_local_custom,
#     metric_definitions=metric_definitions,
    max_run=3*60*60,
)

In [None]:
import os
train_dir = os.path.join(os.getcwd(), 'BCCD')

inputs = {'yolov5_input': 'file://{}'.format(train_dir)}
inputs

In [None]:
# start training
estimator_local_custom.fit(inputs=inputs)

## Managed training

In [None]:
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()

In [None]:
experiment_name = 'yolov5-BCCD'
instance_count = 1

# instance_type = 'ml.g4dn.xlarge'
# instance_type = 'ml.p3.2xlarge'
instance_type = 'ml.p2.8xlarge'
# instance_type = 'ml.m5.2xlarge' # Completed, pytorch-training-2021-11-03-09-12-19-947

do_spot_training = True
max_wait = 3*60*60
max_run = 3*60*60

In [None]:
hyperparameters_managed = {
    'data': 'data_sm.yaml',
    'cfg': 'yolov5s.yaml',
#     'weights': 'weights/yolov5s.pt',
    'batch-size': 128,
    'epochs': 300,
    'project': '/opt/ml/model',
    'weights': 'weights/yolov5s.pt',
    'workers': 8,
#     'freeze': 24
}

In [None]:
byoc_image_uri = '889750940888.dkr.ecr.us-east-1.amazonaws.com/sinjoonk-yolov5'


estimator_custom_managed = CustomFramework(image_uri=byoc_image_uri,
                                           role=role,
                                           entry_point='train_sm.py',
                                           source_dir='yolov5',
                                           instance_count=1, 
                                           instance_type=instance_type,
                                           base_job_name='yolov5-on-sagemaker',
                                           volume_size=256,
                                           output_path=output_path,
                                           checkpoint_s3_uri=checkpoint_s3_bucket,
                                           hyperparameters=hyperparameters_managed)

In [None]:
inputs = {'yolov5_input': s3_data_path}
inputs

In [None]:
create_experiment(experiment_name)
job_name = create_trial(experiment_name, hyperparameters_managed, instance_type, instance_count, do_spot_training)
job_name

In [None]:
estimator_custom_managed.fit(inputs=inputs,
                      experiment_config={
                          'TrialName': job_name,
                          'TrialComponentDisplayName': job_name,
                        },
                      wait=False)

In [None]:
job_name=estimator_custom_managed.latest_training_job.name
sagemaker_session.logs_for_job(job_name=job_name, wait=True)
