# Amazon SageMaker Object Detection using the augmented manifest file format

1. [Introduction](#Introduction)
2. [Setup](#Setup)
3. [Specifying input Dataset](#Specifying-input-Dataset)
4. [Training](#Training)

## Introduction

Object detection is the process of identifying and localizing objects in an image. A typical object detection solution takes an image as input and returns a bounding box around each object of interest, along with the class of the object inside the box. Before we can build such a solution, we need to process a training dataset and create and set up a training job so that the algorithm can learn from the dataset. The trained model can then be hosted as an endpoint to which we supply query images.

This notebook focuses on using the built-in SageMaker Single Shot multibox Detector ([SSD](https://arxiv.org/abs/1512.02325)) object detection algorithm to train a model on your custom dataset. For dataset preparation or using the model for inference, please see the other scripts in [this folder](./).

## Setup

To train the Object Detection algorithm on Amazon SageMaker, we need to set up and authenticate the use of AWS services. To begin with, we need an AWS account role with SageMaker access. This role is used to give SageMaker access to your data in S3. In this example, we will use the same role that was used to start this SageMaker notebook.

In [None]:
BUCKET = '<S3 Bucket Name>' # Valid name for S3 bucket.
IMG_FOLDER = 'images' # Any valid S3 prefix.
MANIFEST_FOLDER = 'manifest' # Any valid S3 prefix.
CLASS_NAME = '<Target object label name>' # The single label that will be annotated in the Ground Truth job.
OUTPUT_PREFIX = 'output'

In [None]:
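# Testing only: this cell overrides the placeholders above with the author's test values.
# Skip it if you have set your own values in the previous cell.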

BUCKET = 'robcost-potato' # Valid name for S3 bucket.
IMG_FOLDER = 'images' # Any valid S3 prefix.
MANIFEST_FOLDER = 'manifest' # Any valid S3 prefix.
CLASS_NAME = 'potatohead' # The single label that will be annotated in the Ground Truth job.
OUTPUT_PREFIX = 'output'

In [None]:
%%time
import sagemaker
import boto3
from sagemaker import get_execution_role

role = get_execution_role()
print(role)

We also need the S3 bucket that holds the training manifests and will be used to store the trained model artifacts.

## Specifying input Dataset

This notebook assumes you already have prepared two [Augmented Manifest Files](https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html) as training and validation input data for the object detection model.  

There are many advantages to using **augmented manifest files** for your training input:

* No format conversion is required if you are using SageMaker Ground Truth to generate the data labels
* Unlike the traditional approach of providing paths to the input images separately from their labels, an augmented manifest file combines both into one entry per input image, so the algorithm code no longer has to match each image with its labels. (Read this [blog post](https://aws.amazon.com/blogs/machine-learning/easily-train-models-using-datasets-labeled-by-amazon-sagemaker-ground-truth/) for more explanation.)
* When splitting your dataset into train/validation/test sets, you don't need to rearrange and re-upload image files to different S3 prefixes. Once you upload your image files to S3, you never need to move them again. You just place pointers to these images in the augmented manifest files for training and validation.
* When using an augmented manifest file, the training input images are loaded onto the training instance in *Pipe mode*, which means the input data is streamed directly to the training algorithm while it is running (vs. File mode, where all input files need to be downloaded to disk before training starts). This results in faster training and lower disk usage. Read more in this [blog post](https://aws.amazon.com/blogs/machine-learning/accelerate-model-training-using-faster-pipe-mode-on-amazon-sagemaker/) on the benefits of Pipe mode.
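To make the format concrete, here is a minimal sketch of what one line of an augmented manifest might look like for a bounding-box labeling job, and how the two attributes used for training can be read back. The bucket name, image key, label attribute name `potatohead`, and all numeric values are illustrative assumptions; the exact fields in your file depend on how the Ground Truth job was configured.

```python
import json

# Hypothetical label attribute name; in this notebook it would be CLASS_NAME.
label_attribute = "potatohead"

# One illustrative manifest line (JSON Lines: one JSON object per line).
# All values here are made up for the example.
example_line = json.dumps({
    "source-ref": "s3://example-bucket/images/img_0001.jpg",
    label_attribute: {
        "image_size": [{"width": 640, "height": 480, "depth": 3}],
        "annotations": [
            {"class_id": 0, "left": 120, "top": 80, "width": 200, "height": 150}
        ],
    },
})

# Parse the line back and pull out the image pointer and its boxes.
entry = json.loads(example_line)
image_uri = entry["source-ref"]
boxes = entry[label_attribute]["annotations"]

print(image_uri)
for box in boxes:
    print(box["class_id"], box["left"], box["top"], box["width"], box["height"])
```

The image itself never appears in the manifest; only its S3 location does, which is why the same uploaded images can be referenced from several manifests.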


In [None]:
s3_train_data= "s3://{}/{}/all_augmented.manifest".format(BUCKET, CLASS_NAME)
# s3_train_data= "s3://{}/{}/train.manifest".format(BUCKET, MANIFEST_FOLDER)
s3_validation_data = "s3://{}/{}/validation.manifest".format(BUCKET, MANIFEST_FOLDER)
print("Train data: {}".format(s3_train_data) )
print("Validation data: {}".format(s3_validation_data) )
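If you only have a single manifest, a train/validation split can be produced by partitioning its lines, since the images themselves stay where they are in S3. The following is a rough sketch under assumptions: the helper name `split_manifest_lines`, the 80/20 ratio, and the fixed seed are illustrative choices, not part of the original workflow.

```python
import random

def split_manifest_lines(lines, validation_fraction=0.2, seed=42):
    """Shuffle manifest lines reproducibly and split them into train/validation lists."""
    entries = [line for line in lines if line.strip()]  # drop blank lines
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    rng.shuffle(entries)
    n_validation = int(len(entries) * validation_fraction)
    return entries[n_validation:], entries[:n_validation]

# Illustrative input: ten fake manifest lines.
all_lines = [
    '{{"source-ref": "s3://example-bucket/images/{}.jpg"}}'.format(i)
    for i in range(10)
]
train_lines, validation_lines = split_manifest_lines(all_lines)
print(len(train_lines), len(validation_lines))
```

Each resulting list can be written out one entry per line and uploaded as `train.manifest` and `validation.manifest`.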

In [None]:
train_input = {
    "ChannelName": "train",
    "InputMode": "Pipe",
    "DataSource": {
        "S3DataSource": {
            "S3DataType": "AugmentedManifestFile",  
            "S3Uri": s3_train_data,
            "S3DataDistributionType": "FullyReplicated",
            # This must correspond to the JSON field names in your augmented manifest.
            "AttributeNames": ['source-ref', CLASS_NAME]
        }
    },
    "ContentType": "application/x-recordio",
    "RecordWrapperType": "RecordIO",
    "CompressionType": "None"
}


In [None]:
validation_input = {
    "ChannelName": "validation",
    "InputMode": "Pipe",
    "DataSource": {
        "S3DataSource": {
            "S3DataType": "AugmentedManifestFile",  
            "S3Uri": s3_validation_data,
            "S3DataDistributionType": "FullyReplicated",
            #  This must correspond to the JSON field names in your augmented manifest.
            "AttributeNames": ['source-ref', CLASS_NAME]
        }
    },
    "ContentType": "application/x-recordio",
    "RecordWrapperType": "RecordIO",
    "CompressionType": "None"
}
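The two channel definitions above differ only in the channel name and the S3 URI. One way to avoid the duplication is a small builder function; this is a sketch, and the name `make_manifest_channel` and the example URI are hypothetical.

```python
def make_manifest_channel(channel_name, s3_uri, attribute_names):
    """Build an InputDataConfig entry for an augmented manifest read in Pipe mode."""
    return {
        "ChannelName": channel_name,
        "InputMode": "Pipe",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "AugmentedManifestFile",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
                # Must match the JSON field names in the augmented manifest.
                "AttributeNames": list(attribute_names),
            }
        },
        "ContentType": "application/x-recordio",
        "RecordWrapperType": "RecordIO",
        "CompressionType": "None",
    }

# Illustrative values only.
example_channel = make_manifest_channel(
    "train",
    "s3://example-bucket/manifest/train.manifest",
    ["source-ref", "potatohead"],
)
print(example_channel["DataSource"]["S3DataSource"]["AttributeNames"])
```

In this notebook the equivalent calls would pass `s3_train_data` or `s3_validation_data` together with `['source-ref', CLASS_NAME]`.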


In [None]:
print(train_input)

The code below computes the number of training samples, which is required in the training job request.

In [None]:
import json
import os 

def read_manifest_file(file_path):
    # Each non-empty line of the manifest is one JSON object (JSON Lines format).
    with open(file_path, 'r') as f:
        return [json.loads(line) for line in f if line.strip()]
    
!aws s3 cp $s3_train_data .    
train_data = read_manifest_file(os.path.split(s3_train_data)[1])
num_training_samples =  len(train_data)
num_training_samples
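Because the `AttributeNames` in the channel definitions must correspond to fields present in every manifest entry, it can be worth checking the manifest while counting it. The following sketch is an optional sanity check, not part of the original notebook; `check_manifest_lines` and the sample lines are hypothetical.

```python
import json

def check_manifest_lines(lines, attribute_names):
    """Count non-empty manifest lines and list the line numbers missing a required attribute."""
    count = 0
    bad_lines = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines such as a trailing newline
        count += 1
        entry = json.loads(line)
        if any(name not in entry for name in attribute_names):
            bad_lines.append(line_number)
    return count, bad_lines

# Illustrative input: the third line lacks the label attribute.
sample_lines = [
    '{"source-ref": "s3://example-bucket/images/a.jpg", "potatohead": {"annotations": []}}',
    '',
    '{"source-ref": "s3://example-bucket/images/b.jpg"}',
]
count, bad_lines = check_manifest_lines(sample_lines, ["source-ref", "potatohead"])
print(count, bad_lines)
```

With a real manifest, you could pass the open file object as `lines` and `['source-ref', CLASS_NAME]` as the attribute names.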

In [None]:
s3_output_path = 's3://{}/{}'.format(BUCKET, OUTPUT_PREFIX)
s3_output_path

## Training
Now that we are done with all the setup that is needed, we are ready to train our object detector. 

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri

# This retrieves the Docker container image for the built-in object detection (SSD) algorithm.
training_image = get_image_uri(boto3.Session().region_name, 'object-detection', repo_version='latest')
print(training_image)

Create a unique job name

In [None]:
import time 

job_name_prefix = CLASS_NAME.lower() + '-object-detection'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
model_job_name = job_name_prefix + timestamp
model_job_name
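The job name above is derived directly from `CLASS_NAME`. SageMaker training job names are limited to letters, digits, and hyphens, with a maximum length of 63 characters (see the `CreateTrainingJob` API reference for the exact rule), so a label containing spaces or underscores would be rejected. The sketch below shows one way to sanitize the label first; `make_job_name` is a hypothetical helper and the replacement strategy is an assumption.

```python
import re
import time

def make_job_name(label, suffix="-object-detection", max_length=63):
    """Build a timestamped job name containing only letters, digits, and hyphens."""
    timestamp = time.strftime("-%Y-%m-%d-%H-%M-%S", time.gmtime())
    # Replace each run of disallowed characters with a single hyphen.
    base = re.sub(r"[^a-zA-Z0-9]+", "-", label.lower()).strip("-")
    # Trim the label part so that the full name fits within max_length.
    room = max_length - len(suffix) - len(timestamp)
    base = base[:room].strip("-") or "job"
    return base + suffix + timestamp

print(make_job_name("Potato Head_v2"))
```

For a label that is already lowercase alphanumeric, such as `potatohead`, this produces a name of the same form as the cell above.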

The object detection algorithm at its core is the [Single-Shot Multi-Box detection algorithm (SSD)](https://arxiv.org/abs/1512.02325). This algorithm uses a `base_network`, which is typically a [VGG](https://arxiv.org/abs/1409.1556) or a [ResNet](https://arxiv.org/abs/1512.03385). ResNet is typically faster, so it is the recommended base network for inference at the edge. The Amazon SageMaker object detection algorithm currently supports VGG-16 and ResNet-50. It also has many hyperparameters that help configure the training job. The next step in our training is to set up these hyperparameters and data channels for training the model. See the SageMaker Object Detection [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/object-detection.html) for more details on the hyperparameters.

To figure out which values work best for your data, run a hyperparameter tuning job. There are some example notebooks at [https://github.com/awslabs/amazon-sagemaker-examples](https://github.com/awslabs/amazon-sagemaker-examples) that you can use for reference.

In [None]:
# This is where transfer learning happens. We use the pre-trained model and nuke the output layer by specifying
# the num_classes value. You can also run a hyperparameter tuning job to figure out which values work the best. 
hyperparams = { 
            "base_network": 'resnet-50',
            "use_pretrained_model": "1",
            "num_classes": "1",   
            "mini_batch_size": "1",
            "epochs": "30",
            "learning_rate": "0.001",
            "lr_scheduler_step": "10,20",
            "lr_scheduler_factor": "0.25",
            "optimizer": "sgd",
            "momentum": "0.9",
            "weight_decay": "0.0005",
            "overlap_threshold": "0.5",
            "nms_threshold": "0.45",
            "image_shape": "512",
            "label_width": "150",
            "num_training_samples": str(num_training_samples)
        }
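To see what `learning_rate`, `lr_scheduler_step`, and `lr_scheduler_factor` imply together, the sketch below computes a step schedule for the values used above. It assumes the rate is multiplied by the factor once for each listed epoch that has been reached; the exact boundary behavior (whether the reduction applies at the start or the end of the listed epoch) is an assumption here, so check the algorithm documentation if it matters.

```python
def learning_rate_at(epoch, base_lr, steps, factor):
    """Return the learning rate for a 0-based epoch under a step schedule."""
    # Count how many scheduled reductions have already happened.
    reductions = sum(1 for step in steps if epoch >= step)
    return base_lr * factor ** reductions

base_lr = 0.001
steps = [int(s) for s in "10,20".split(",")]  # same format as lr_scheduler_step
factor = 0.25

for epoch in (0, 9, 10, 19, 20, 29):
    print(epoch, learning_rate_at(epoch, base_lr, steps, factor))
```

Under this reading, the 30 epochs are split into three phases of ten epochs each, with the rate reduced by a factor of four at each boundary.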

Now that the hyperparameters are set up, we configure the rest of the training job parameters

In [None]:
training_params = \
    {
        "AlgorithmSpecification": {
            "TrainingImage": training_image,
            "TrainingInputMode": "Pipe"
        },
        "RoleArn": role,
        "OutputDataConfig": {
            "S3OutputPath": s3_output_path
        },
        "ResourceConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.p2.xlarge",
            "VolumeSizeInGB": 200
        },
        "TrainingJobName": model_job_name,
        "HyperParameters": hyperparams,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": 86400
        },
        "InputDataConfig": [
            train_input,
            validation_input
        ]
    }


Now we create the SageMaker training job.

In [None]:
client = boto3.client(service_name='sagemaker')
client.create_training_job(**training_params)

# Confirm that the training job has started
status = client.describe_training_job(TrainingJobName=model_job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))

To check the progress of the training job, you can repeatedly evaluate the following cell. When the training job status reads 'Completed', move on to the next part of the tutorial.


In [None]:
client = boto3.client(service_name='sagemaker')
print("Training job status: ", client.describe_training_job(TrainingJobName=model_job_name)['TrainingJobStatus'])
print("Secondary status: ", client.describe_training_job(TrainingJobName=model_job_name)['SecondaryStatus'])

**Do not continue until the job has completed**
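Instead of re-running the status cell by hand, the check can be wrapped in a polling loop. The sketch below takes the describe call as a parameter so that it runs here without AWS access; `wait_for_training_job` is a hypothetical helper and the polling interval is an arbitrary choice. boto3 also provides a built-in waiter, `client.get_waiter('training_job_completed_or_stopped')`, which may be simpler in practice.

```python
import time

def wait_for_training_job(describe_job, job_name, poll_seconds=60, max_polls=1440):
    """Poll until the job reaches a terminal status and return that status.

    describe_job is any callable that takes a job name and returns a dict
    containing 'TrainingJobStatus', for example:
        lambda name: client.describe_training_job(TrainingJobName=name)
    """
    terminal_statuses = {"Completed", "Failed", "Stopped"}
    for _ in range(max_polls):
        status = describe_job(job_name)["TrainingJobStatus"]
        if status in terminal_statuses:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("Training job {} did not finish in time".format(job_name))

# Stand-in for the real API call so the sketch runs without AWS access.
fake_responses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_training_job(
    lambda name: {"TrainingJobStatus": next(fake_responses)},
    "example-job",
    poll_seconds=0,
)
print(final_status)
```

A returned status of `Failed` or `Stopped` means there are no model artifacts to download, so check the result before continuing.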

--------------

## Once complete, get the path to the generated model

In [None]:
client = boto3.client(service_name='sagemaker')
# Pass the job name variable, not the literal string 'model_job_name'.
s3_model_artifacts = client.describe_training_job(TrainingJobName=model_job_name)['ModelArtifacts']['S3ModelArtifacts']
print('Model Artifacts: ' + s3_model_artifacts)

# Create Deployable Model

In [None]:
import os 

def make_tmp_folder(folder_name):
    try:
        os.makedirs(folder_name)
    except OSError as e:
        print("{} folder already exists".format(folder_name))

In [None]:
TMP_FOLDER = 'trained-model'
make_tmp_folder(TMP_FOLDER)

In [None]:
!aws s3 cp $s3_model_artifacts $TMP_FOLDER/.

In [None]:
!tar -xvzf $TMP_FOLDER/model.tar.gz -C $TMP_FOLDER/

The model output produced by the built-in object detection model leaves the loss layer in place and does not include a non-max suppression (NMS) layer. To make it ready for inference on our machine, we need to remove the loss layer and add the NMS layer. We will be using a script from this GitHub repo: https://github.com/zhreshold/mxnet-ssd

Clone this Git repo into your `~/SageMaker` folder by running the following cell:

In [None]:
%%sh
cd ~/SageMaker
git clone https://github.com/zhreshold/mxnet-ssd.git
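For intuition about the NMS layer that the conversion script adds, here is a minimal pure-Python sketch of greedy non-max suppression. It is a simplified, class-agnostic illustration and not the implementation used by MXNet or the `mxnet-ssd` repo; the function names, box format `(xmin, ymin, xmax, ymax)`, and example boxes are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, threshold=0.45):
    """Greedy NMS: visit boxes by descending score, keep one only if it does not
    overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in kept):
            kept.append(i)
    return kept

# Two heavily overlapping boxes and one separate box.
boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))
```

The threshold of 0.45 matches the `nms_threshold` hyperparameter used in training; the second box is suppressed because its overlap with the first exceeds that value.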

In [None]:
!pip install opencv-python
!pip install gluoncv
!pip install mxnet

In [None]:
from matplotlib import pyplot as plt
from gluoncv.utils import download, viz
import numpy as np
import mxnet as mx
import json
import boto3
import cv2

Check that the conversion script runs. If this command fails, check that the notebook is running with the conda_mxnet_p27 kernel.

In [None]:
!python ~/SageMaker/mxnet-ssd/deploy.py -h

Check the current hyperparameter settings for the model.

In [31]:
!cat $TMP_FOLDER/hyperparams.json

{"label_width": "150", "early_stopping_min_epochs": "10", "epochs": "30", "overlap_threshold": "0.5", "lr_scheduler_factor": "0.25", "_num_kv_servers": "auto", "weight_decay": "0.0005", "mini_batch_size": "1", "use_pretrained_model": "1", "freeze_layer_pattern": "", "lr_scheduler_step": "10,20", "early_stopping": "False", "early_stopping_patience": "5", "momentum": "0.9", "num_training_samples": "160", "optimizer": "sgd", "_tuning_objective_metric": "", "early_stopping_tolerance": "0.0", "learning_rate": "0.001", "kv_store": "device", "nms_threshold": "0.45", "num_classes": "1", "base_network": "resnet-50", "nms_topk": "400", "_kvstore": "device", "image_shape": "512"}

Execute script to remove loss layer and add NMS layer.

In [None]:
!python /home/ec2-user/SageMaker/mxnet-ssd/deploy.py --network resnet50 --num-class 1 --nms .45 --data-shape 512 --prefix $TMP_FOLDER/model_algo_1
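The arguments in the command above mirror values from `hyperparams.json`. The sketch below derives them from that dictionary rather than typing them by hand. It uses only the flags shown in the command above; `build_deploy_args` is a hypothetical helper, and the hyphen-stripping rule for the network name is an assumption that is only confirmed here for `resnet-50`.

```python
def build_deploy_args(hyperparams, prefix):
    """Translate training hyperparameters into the deploy.py arguments used in this notebook."""
    return [
        # 'resnet-50' -> 'resnet50'; other base networks may need a different name.
        "--network", hyperparams["base_network"].replace("-", ""),
        "--num-class", hyperparams["num_classes"],
        "--nms", hyperparams["nms_threshold"],
        "--data-shape", hyperparams["image_shape"],
        "--prefix", prefix,
    ]

# Values copied from the hyperparams.json output shown earlier.
example_hyperparams = {
    "base_network": "resnet-50",
    "num_classes": "1",
    "nms_threshold": "0.45",
    "image_shape": "512",
}
args = build_deploy_args(example_hyperparams, "trained-model/model_algo_1")
print(" ".join(args))
```

In the notebook, the dictionary could be loaded with `json.load` from `hyperparams.json` in `TMP_FOLDER`, which keeps the conversion in sync with the settings the model was trained with.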

Check to confirm the new "deployable" params and symbol files are available.

In [32]:
!ls -alh $TMP_FOLDER

total 306M
drwxrwxr-x 2 ec2-user ec2-user 4.0K Apr 29 00:59 .
drwxrwxr-x 4 ec2-user ec2-user 4.0K Apr 29 02:44 ..
-rw-rw-r-- 1 ec2-user ec2-user 105M Apr 29 00:59 deploy_model_algo_1-0000.params
-rw-rw-r-- 1 ec2-user ec2-user 129K Apr 29 00:59 deploy_model_algo_1-symbol.json
-rw-r--r-- 1 ec2-user ec2-user  677 Apr 28 14:52 hyperparams.json
-rw-r--r-- 1 ec2-user ec2-user 105M Apr 28 14:52 model_algo_1-0000.params
-rw-r--r-- 1 ec2-user ec2-user 130K Apr 28 14:52 model_algo_1-symbol.json
-rw-rw-r-- 1 ec2-user ec2-user  97M Apr 28 14:52 model.tar.gz


Upload the deployable model files to S3.

In [None]:
params_file = '{}/deploy_model_algo_1-0000.params'.format(TMP_FOLDER)
symbols_file = '{}/deploy_model_algo_1-symbol.json'.format(TMP_FOLDER)

s3_client = boto3.client('s3')
# The .params file is binary, so it must not be read in text mode.
# upload_file reads the file from disk itself and handles large files with multipart uploads.
s3_client.upload_file(params_file, BUCKET, OUTPUT_PREFIX + "/deploy_model_algo_1-0000.params")
s3_client.upload_file(symbols_file, BUCKET, OUTPUT_PREFIX + "/deploy_model_algo_1-symbol.json")

# Outputs required for next stage

#  Next step

Once the training job completes, move on to the [next notebook](./03_local_inference_post_training.ipynb) to convert the trained model to a deployable format and run local inference