We start by installing all the necessary libraries and importing them into our notebook.

In [116]:
!pip install boto3 docker



In [212]:
import boto3
import docker
import pathlib
import base64
import time
from sagemaker import get_execution_role
from sagemaker.session import Session
from sagemaker import utils

In [213]:
MODEL_BASE_NAME = 'yolov5'
YOLOV5_IMAGE_NAME = f'{MODEL_BASE_NAME}-sagemaker'
SAGEMAKER_IMAGES_REGISTRY_ID = '763104351884'

session = boto3.session.Session()
aws_region = session.region_name

sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity().get('Account')

ecr_client = boto3.client('ecr')
docker_client = docker.from_env()

sg_client = boto3.client('sagemaker')
s3_client = boto3.client('s3')

# Creating the SageMaker image

The first step is to create an Amazon SageMaker compatible Docker image that will run the training.
The `container` folder contains both the Dockerfile and the scripts needed to train and validate the generated model.

#### During the Docker build we simply clone the yolov5 repository and add the necessary SageMaker scripts

In [288]:
!pygmentize container/Dockerfile

ARG BASE_IMG=${BASE_IMG}
FROM ${BASE_IMG}

ENV PATH="/opt/code:${PATH}"

RUN cd opt && git clone https://github.com/ultralytics/yolov5
RUN pip install -r /opt/yolov5/requirements.txt

ENV PATH="/opt/yolov5:${PATH}"

WORKDIR /opt/code
COPY train /opt/code
COPY predict /opt/code


#### The train script will load the training parameters (more below) and will invoke the yolov5 training script

In [289]:
!pygmentize container/train -l python

#!/usr/bin/env python3

import sys
import json

sys.path.append('/opt/yolov5')

import train

with open('/opt/ml/input/data/config/params.json') as params_file:
    params = json.load(params_file)
    train_params = params['train']

opt = train.parse_opt(True)
for p in train_params:
    value = train_params[p]
    if value:
        setattr(opt, p, value)

train.main(opt)
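
The override loop above skips any parameter whose value is falsy, so `false`, `""`, `0` and `[]` in `params.json` leave the yolov5 default untouched. Here is a minimal sketch of that behavior, using a plain `argparse.Namespace` as a stand-in for the object returned by `train.parse_opt`. The default values and the `apply_overrides` helper are made up for illustration.

```python
import argparse
import json


def apply_overrides(opt, overrides):
    """Copy truthy values from overrides onto opt, mirroring the loop in container/train."""
    for name, value in overrides.items():
        if value:  # falsy values (False, "", 0, [], None) are skipped
            setattr(opt, name, value)
    return opt


# Stand-in for train.parse_opt(True); these defaults are illustrative only.
opt = argparse.Namespace(epochs=100, batch_size=8, rect=True, evolve=None)

train_params = json.loads('{"epochs": 300, "batch_size": 16, "rect": false, "evolve": ""}')
apply_overrides(opt, train_params)

# rect and evolve keep their defaults because their values in the JSON are falsy.
print(opt.epochs, opt.batch_size, opt.rect, opt.evolve)
```

One consequence is that an option whose default is `True` cannot be switched off through `params.json` with this loop. Comparing against `None` instead of relying on truthiness would be one way around that.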


#### The predict script will also load the parameters and will invoke the yolov5 detect script

In [290]:
!pygmentize container/predict -l python

#!/usr/bin/env python3

import sys
import json

sys.path.append('/opt/yolov5')

import detect

with open('/opt/ml/input/data/config/params.json') as params_file:
    params = json.load(params_file)
    predict_params = params['predict']

opt = detect.parse_opt()
for p in predict_params:
    value = predict_params[p]
    if value:
        setattr(opt, p, value)

detect.main(opt)


Also, let's initialize all the necessary clients and get the necessary information.

## Building the image

To build the training image we will use the [PyTorch 1.7.1 SageMaker Deep Learning image](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). 

To pull this image we need to log in to the SageMaker ECR registry. More information about this process is available [here](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-ecs.html).

We will also need to create our own private ECR repository and push our image to it after the build. More information is available [here](https://docs.aws.amazon.com/AmazonECR/latest/userguide/repository-create.html).

The entire process is executed below, with comments to help you understand each step.

First we log in to the SageMaker ECR registry, so we can pull the base image, and to our own ECR registry, so we can later push the newly built image.

In [293]:
response = ecr_client.get_authorization_token(
    registryIds=[
        SAGEMAKER_IMAGES_REGISTRY_ID,
        account_id
    ]
)

for auth in response['authorizationData']:
    registry_address = auth['proxyEndpoint']
    encoded_token = auth['authorizationToken']
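    credentials = base64.b64decode(encoded_token).decode('utf-8')
    # Split on the first colon only, in case the password itself contains one
    username, password = credentials.split(':', 1)
    # Python does not expand '$HOME' inside strings, so build the config path explicitly
    dockercfg_path = str(pathlib.Path.home() / '.docker' / 'config.json')
    login = docker_client.login(username, password, registry=registry_address, dockercfg_path=dockercfg_path)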
    print(f'Logged in at {registry_address}')

Logged in at https://763104351884.dkr.ecr.eu-central-1.amazonaws.com
Logged in at https://354767016111.dkr.ecr.eu-central-1.amazonaws.com
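
Each `authorizationToken` returned by ECR is a base64-encoded `username:password` pair, which is what the loop above decodes before calling `docker_client.login`. Here is a small sketch of that decoding step in isolation. The token below is fabricated for illustration, and `decode_ecr_token` is a hypothetical helper name.

```python
import base64


def decode_ecr_token(encoded_token):
    """Split a base64-encoded 'username:password' token into its two parts."""
    credentials = base64.b64decode(encoded_token).decode('utf-8')
    # Split only on the first colon so a password containing ':' stays intact.
    username, password = credentials.split(':', 1)
    return username, password


# Fabricated token; a real one comes from ecr_client.get_authorization_token().
fake_token = base64.b64encode(b'AWS:not-a-real-password').decode('utf-8')
username, password = decode_ecr_token(fake_token)
print(username)
```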


#### Now we pull the base image from ECR. This can take some time.

In [294]:
base_image = f'{SAGEMAKER_IMAGES_REGISTRY_ID}.dkr.ecr.{aws_region}.amazonaws.com/pytorch-training:1.7.1-gpu-py36-cu110-ubuntu18.04'
docker_client.images.pull(base_image)

<Image: '763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.7.1-gpu-py36-cu110-ubuntu18.04'>

#### Before building the image, we will create our own ECR repository and get the URL to push the image later.

In [295]:
try:
    repository = ecr_client.describe_repositories(
        repositoryNames=[
            YOLOV5_IMAGE_NAME,
        ]
    )['repositories'][0]
except ecr_client.exceptions.RepositoryNotFoundException:
    print('Repository does not exist, creating it')
    repository = ecr_client.create_repository(
        repositoryName=YOLOV5_IMAGE_NAME
    )['repository']
    
target_image = repository['repositoryUri']
print(f'New image will be tagged: {target_image}')

New image will be tagged: 354767016111.dkr.ecr.eu-central-1.amazonaws.com/yolov5-sagemaker


#### And finally, we build the new image...

In [346]:
container_path = f'{pathlib.Path().resolve()}/container'
image, build_log = docker_client.images.build(path=container_path, buildargs={'BASE_IMG': base_image}, tag=YOLOV5_IMAGE_NAME)
image.tag(target_image, tag='latest')
image.reload()
print(f'New image created: {image.tags}')

New image created: ['354767016111.dkr.ecr.eu-central-1.amazonaws.com/yolov5-sagemaker:latest', 'yolov5-sagemaker:latest']


#### Now we push the recently created image to our ECR registry

In [347]:
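for line in docker_client.images.push(target_image, tag='latest', stream=True, decode=True):
    # Push failures are reported inside the stream rather than raised
    if 'error' in line:
        raise RuntimeError(line['error'])
    status = line.get('status')
    progress = line.get('progressDetail')
    if progress is None and status:
        print('')
        print(status)
    elif progress and progress.get('current'):
        print('.', end='')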


The push refers to repository [354767016111.dkr.ecr.eu-central-1.amazonaws.com/yolov5-sagemaker]
....
latest: digest: sha256:d542b6a1cc4a92ff94553d59a104b8c06ecfab362f1595fbeba6e84124cecb32 size: 8304


We have our SageMaker image ready to train our model!
But first...

# Let's prepare the data

#### First we download the data and place it in the expected directories

The yolov5 weights will be downloaded to the `input/data/weights` directory.

In [308]:
# weights
!wget -P input/data/weights https://github.com/ultralytics/yolov5/releases/download/v5.0/yolov5s.pt 

--2021-07-16 14:42:59--  https://github.com/ultralytics/yolov5/releases/download/v5.0/yolov5s.pt
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-releases.githubusercontent.com/264818686/56dd3480-9af3-11eb-9c92-3ecd167961dc?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210716%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210716T144103Z&X-Amz-Expires=300&X-Amz-Signature=0d9cf988e7e57637b272180b078fd4d63199e45f5ae2e226c1e1eba1b642a850&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=264818686&response-content-disposition=attachment%3B%20filename%3Dyolov5s.pt&response-content-type=application%2Foctet-stream [following]
--2021-07-16 14:42:59--  https://github-releases.githubusercontent.com/264818686/56dd3480-9af3-11eb-9c92-3ecd167961dc?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F2021071

After that we download the [coco128 dataset](https://www.kaggle.com/ultralytics/coco128), which contains the training images and labels.
If you would like to train on your own dataset, you can upload your own images and labels instead. We will configure the data sources in the params file in the upcoming steps.

In [334]:
!rm -rf input/data/images
!rm -rf input/data/labels
!wget -P input/data https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip
!unzip -q input/data/coco128.zip 'coco128/labels/*' 'coco128/images/*' -d input/data
!mv input/data/coco128/* input/data
!rm -rf input/data/coco128*

--2021-07-16 15:29:01--  https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-releases.githubusercontent.com/264818686/7a208a00-e19d-11eb-94cf-5222600cc665?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210716%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210716T152902Z&X-Amz-Expires=300&X-Amz-Signature=e139bed06249f3d7cf5e406256f5e9a9872927d3fe58190567062eaa1bbfc73d&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=264818686&response-content-disposition=attachment%3B%20filename%3Dcoco128.zip&response-content-type=application%2Foctet-stream [following]
--2021-07-16 15:29:02--  https://github-releases.githubusercontent.com/264818686/7a208a00-e19d-11eb-94cf-5222600cc665?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210

With the necessary data downloaded (either from the coco128 dataset or from your own dataset), the `input` directory will later be synced with S3. The folder looks like this:
```
input
    |-data
    |---config
    |-----params.json    
    |-----coco128.yml
    |-----hyp.finetune.yaml
    |-----yolo5s.yaml    
    |---images
    |-----train2017
    |-------images.jpg
    |---labels
    |-----train2017
    |-------labels.txt
    |---weights
    |-----yolov5s.pt
    
```

The `input/data/config` folder contains the [training dataset configuration](https://github.com/ultralytics/yolov5/blob/master/data/coco128.yaml) (coco128.yml), the [hyperparameters for model training](https://github.com/ultralytics/yolov5/issues/607) (hyp.finetune.yaml) and the [model configuration itself](https://github.com/ultralytics/yolov5/blob/master/models/yolov5s.yaml) (yolo5s.yaml).

You can adjust those files as you wish. The links above provide more information about each file.

You can also add new files with different names. Those inputs are configured in `input/data/config/params.json`.

In [335]:
!pygmentize input/data/config/params.json

{
  "train": {
    "weights": "/opt/ml/input/data/weights/yolov5s.pt",
    "cfg": "/opt/ml/input/data/config/yolo5s.yaml",
    "data": "/opt/ml/input/data/config/coco128.yml",
    "hyp": "/opt/ml/input/data/config/hyp.finetune.yaml",
    "epochs": 300,
    "batch_size": 16,
    "img_size": [640, 640],
    "rect": false,
    "resume": false,
    "nosave": false,
    "noval": false,
    "noautoanchor": false,
    "evolve": "",
    "bucket": "",
    "cache_images": false,
    "imag

This file contains all the input configuration used both to train the model and to run predictions inside the SageMaker container. You can edit this file on S3 at any time and then simply run your training job again.
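
Because the paths in `params.json` refer to locations inside the container, a typo there only surfaces once the training job has started. The sketch below assumes each input is mounted under `/opt/ml/input/data/`, as the paths in the file suggest. It translates those container paths back to the local `input/data` folder and lists the ones that do not exist. `find_missing_inputs` is a hypothetical helper, not part of the notebook.

```python
import json
import pathlib

CONTAINER_PREFIX = '/opt/ml/input/data/'


def find_missing_inputs(params, local_root):
    """Return container paths from params that have no counterpart under local_root."""
    missing = []
    for section in params.values():  # e.g. the 'train' and 'predict' sections
        if not isinstance(section, dict):
            continue
        for value in section.values():
            if isinstance(value, str) and value.startswith(CONTAINER_PREFIX):
                relative = value[len(CONTAINER_PREFIX):]
                if not (pathlib.Path(local_root) / relative).exists():
                    missing.append(value)
    return missing


# Only runs the check when the notebook's local params file is present.
params_path = pathlib.Path('input/data/config/params.json')
if params_path.exists():
    print(find_missing_inputs(json.loads(params_path.read_text()), 'input/data'))
```

An empty list means every referenced input file was found locally before syncing to S3.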

With the local folder structure ready, let's sync this with the S3 bucket we will use during the training.

In [336]:
s3_path = utils.name_from_base(MODEL_BASE_NAME)

sg_exec_role = get_execution_role()
sg_session = Session()
s3_region = sg_session.boto_region_name
sg_bucket = sg_session.default_bucket()
s3_input_destination = f's3://{sg_bucket}/{s3_path}/input/'
s3_output_destination = f's3://{sg_bucket}/{s3_path}/output/'
print(f'Data will be placed at {s3_input_destination}')

Data will be placed at s3://sagemaker-eu-central-1-354767016111/yolov5-2021-07-16-15-29-43-562/input/


In [349]:
%%bash -s "$s3_input_destination"
aws s3 sync ./input $1 --quiet
echo "$1"
aws s3 ls "$1"

s3://sagemaker-eu-central-1-354767016111/yolov5-2021-07-16-15-29-43-562/input/
                           PRE data/


With the data in place and the Docker image saved in our ECR repository...

# Let's Start the training job

To run the SageMaker training job we will need to pass some parameters. The first is the [InputDataConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html#API_CreateTrainingJob_RequestParameters).

We will use a series of [SageMaker Channels](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Channel.html) to map our S3 folders to the SageMaker training image.
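
With `File` input mode, SageMaker downloads each channel into `/opt/ml/input/data/<ChannelName>` inside the container before training starts, which is why the paths in `params.json` begin with that prefix. Since every channel in this notebook uses the same settings, the channel definitions could also be generated by a small helper. This is only a sketch: `make_channel` is a hypothetical function and the bucket name is a placeholder.

```python
def make_channel(name, s3_uri):
    """Build one InputDataConfig entry for a fully replicated S3 prefix in File mode."""
    return {
        'ChannelName': name,
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': s3_uri,
                'S3DataDistributionType': 'FullyReplicated',
            },
        },
        'InputMode': 'File',
    }


# Placeholder prefix; the notebook derives the real one from the default SageMaker bucket.
example_input = 's3://example-bucket/yolov5/input/'
input_data_config = [
    make_channel(name, f'{example_input}data/{name}/')
    for name in ('config', 'images', 'labels', 'weights')
]

for channel in input_data_config:
    print(channel['ChannelName'], '->', f"/opt/ml/input/data/{channel['ChannelName']}")
```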

In [350]:
config_channel_src = f'{s3_input_destination}data/config/'
images_channel_src = f'{s3_input_destination}data/images/'
labels_channels_src= f'{s3_input_destination}data/labels/'
weights_channel_src = f'{s3_input_destination}data/weights/'

print('Configuration will be fetched from:', config_channel_src)
print('Images will be fetched from:', images_channel_src)
print('Labels will be fetched from:', labels_channels_src)
print('Weights will be fetched from:', weights_channel_src)
print('Output will be placed at:', s3_output_destination)

Configuration will be fetched from: s3://sagemaker-eu-central-1-354767016111/yolov5-2021-07-16-15-29-43-562/input/data/config/
Images will be fetched from: s3://sagemaker-eu-central-1-354767016111/yolov5-2021-07-16-15-29-43-562/input/data/images/
Labels will be fetched from: s3://sagemaker-eu-central-1-354767016111/yolov5-2021-07-16-15-29-43-562/input/data/labels/
Weights will be fetched from: s3://sagemaker-eu-central-1-354767016111/yolov5-2021-07-16-15-29-43-562/input/data/weights/
Output will be placed at: s3://sagemaker-eu-central-1-354767016111/yolov5-2021-07-16-15-29-43-562/output/


Let's see what our input channels look like on S3:

In [351]:
%%bash -s "$config_channel_src" "$images_channel_src" "$labels_channels_src" "$weights_channel_src"
for b in "$@"; do
    echo "$b"
    aws s3 ls "$b"
    echo ''
done

s3://sagemaker-eu-central-1-354767016111/yolov5-2021-07-16-15-29-43-562/input/data/config/
                           PRE .ipynb_checkpoints/
2021-07-16 15:30:20       1061 coco128.yml
2021-07-16 15:29:48        861 hyp.finetune.yaml
2021-07-16 15:41:02       1556 params.json
2021-07-16 15:29:48       1454 yolo5s.yaml

s3://sagemaker-eu-central-1-354767016111/yolov5-2021-07-16-15-29-43-562/input/data/images/
                           PRE train2017/
2021-07-16 15:29:48       6148 .DS_Store

s3://sagemaker-eu-central-1-354767016111/yolov5-2021-07-16-15-29-43-562/input/data/labels/
                           PRE train2017/
2021-07-16 15:29:49       6148 .DS_Store

s3://sagemaker-eu-central-1-354767016111/yolov5-2021-07-16-15-29-43-562/input/data/weights/
2021-07-16 15:29:49   14795158 yolov5s.pt



With everything in place, let's submit the training job to SageMaker

In [352]:
job_name = utils.name_from_base(MODEL_BASE_NAME)
submited_job = sg_client.create_training_job(
      TrainingJobName=job_name,
      AlgorithmSpecification={
          'TrainingImage': target_image,
          'TrainingInputMode': 'File',
      },
      RoleArn=sg_exec_role,
      InputDataConfig=[
          {
              'ChannelName': 'config',
              'DataSource': {
                  'S3DataSource': {
                      'S3DataType': 'S3Prefix',
                      'S3Uri': config_channel_src,
                      'S3DataDistributionType': 'FullyReplicated',
                  },
              },
              'InputMode': 'File'
          },
          {
              'ChannelName': 'images',
              'DataSource': {
                  'S3DataSource': {
                      'S3DataType': 'S3Prefix',                      
                      'S3Uri': images_channel_src,
                      'S3DataDistributionType': 'FullyReplicated',
                  },
              },
              'InputMode': 'File'
          },
          {
              'ChannelName': 'labels',
              'DataSource': {
                  'S3DataSource': {
                      'S3DataType': 'S3Prefix',                      
                      'S3Uri': labels_channels_src,
                      'S3DataDistributionType': 'FullyReplicated',
                  },
              },
              'InputMode': 'File'
          },
          {
              'ChannelName': 'weights',
              'DataSource': {
                  'S3DataSource': {
                      'S3DataType': 'S3Prefix',                      
                      'S3Uri': weights_channel_src,
                      'S3DataDistributionType': 'FullyReplicated',
                  },
              },
              'InputMode': 'File'
          }
      ],
      OutputDataConfig={
          'S3OutputPath': s3_output_destination
      },
      ResourceConfig={
          'InstanceType': 'ml.p3.2xlarge',
          'InstanceCount': 1,
          'VolumeSizeInGB': 10,
      },
      StoppingCondition={
        'MaxRuntimeInSeconds': 60*60*5,
      }
  )
submited_job

{'TrainingJobArn': 'arn:aws:sagemaker:eu-central-1:354767016111:training-job/yolov5-2021-07-16-15-41-21-527',
 'ResponseMetadata': {'RequestId': '409b0200-bdac-419b-b545-50d7fc9a5b07',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '409b0200-bdac-419b-b545-50d7fc9a5b07',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '108',
   'date': 'Fri, 16 Jul 2021 15:41:21 GMT'},
  'RetryAttempts': 0}}

In [None]:
print('Waiting job to finish:', end='')
running = True
previous_status = ''
while running:
    job_status = sg_client.describe_training_job(
        TrainingJobName=job_name
    )
    running = job_status['TrainingJobStatus'] in ('InProgress', 'Stopping')
    current_status = job_status['SecondaryStatus']
    if previous_status != current_status:
        print('')
        print(f'{current_status} ', end='')
        previous_status = current_status
    print('. ', end='')
    time.sleep(1)

print('')
final_status = job_status['TrainingJobStatus']
if final_status == 'Failed':
    print(f'Job Failed!: {job_status["FailureReason"]}')
else:
    print(f'Job Finished! {final_status}')

Waiting job to finish:
Starting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
Downloading . . . . . . 
Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
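
Polling `describe_training_job` once per second works for a short job, but a longer interval is gentler on the API for jobs that run for hours. A hedged sketch of a reusable wait loop follows. `wait_for_job` and its parameters are hypothetical, and the `describe` argument stands in for a call such as `lambda: sg_client.describe_training_job(TrainingJobName=job_name)`.

```python
import time


def wait_for_job(describe, poll_seconds=30, sleep=time.sleep):
    """Poll describe() until the job is no longer InProgress or Stopping."""
    previous_status = None
    while True:
        job = describe()
        secondary = job.get('SecondaryStatus')
        if secondary != previous_status:
            print(secondary)  # report each phase once, e.g. Starting, Training
            previous_status = secondary
        if job['TrainingJobStatus'] not in ('InProgress', 'Stopping'):
            return job
        sleep(poll_seconds)


# Simulated responses so the sketch runs without an AWS account.
responses = iter([
    {'TrainingJobStatus': 'InProgress', 'SecondaryStatus': 'Starting'},
    {'TrainingJobStatus': 'InProgress', 'SecondaryStatus': 'Training'},
    {'TrainingJobStatus': 'Completed', 'SecondaryStatus': 'Completed'},
])
final = wait_for_job(lambda: next(responses), poll_seconds=0)
print(final['TrainingJobStatus'])
```

boto3 also ships a `training_job_completed_or_stopped` waiter for the SageMaker client, which may be a simpler option when per-phase progress output is not needed.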