# Training an MMDetection Mask R-CNN Model on a SageMaker Distributed Cluster

## Motivation
[MMDetection](https://github.com/open-mmlab/mmdetection) is a popular open-source deep learning framework focused on computer vision models and use cases. MMDetection provides high-level APIs for model training and inference. It reports [state-of-the-art benchmarks](https://github.com/open-mmlab/mmdetection#benchmark-and-model-zoo) for a variety of model architectures and ships an extensive Model Zoo.

In this notebook, we will build a custom training container with the MMDetection library and then train a Mask R-CNN model from scratch on the [COCO 2017 dataset](https://cocodataset.org/#home), using the SageMaker distributed [training feature](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html) to reduce training time.

### Preconditions
- To execute this notebook, you need the COCO 2017 training and validation datasets uploaded to an S3 bucket that the Amazon SageMaker service can access.


## Building Training Container

Amazon SageMaker lets you bring your own (BYO) containers for training, data processing, and inference. In our case, we need to build a custom training container, which will be pushed to the [ECR service](https://aws.amazon.com/ecr/) in your AWS account.

For this, we need to log in to the ECR registry that hosts the SageMaker base images and to your private ECR repository.

In [1]:
import sagemaker, boto3

session = sagemaker.Session()
region = session.boto_region_name
account = boto3.client('sts').get_caller_identity().get('Account')
bucket = session.default_bucket()

In [2]:
# login to Sagemaker ECR with Deep Learning Containers
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.{region}.amazonaws.com
# login to your private ECR
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


Now, let's review the training container. The Dockerfile does the following:
- uses the SageMaker PyTorch 1.5.0 container as the base image;
- installs the latest versions of the PyTorch libraries and the MMDetection dependencies;
- builds MMDetection from source;
- configures the SageMaker environment variables, specifically which script to run at training time.

In [None]:
! pygmentize -l docker Dockerfile.training

<br>
<br>
Next, we build the custom training container and push it to the private ECR repository.
<br>
<br>

In [26]:
! ./build_and_push.sh mmdetection-training latest Dockerfile.training

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Sending build context to Docker daemon  5.338MB
Step 1/13 : FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04
 ---> 47cd15520b75
Step 2/13 : LABEL author="vadimd@amazon.com"
 ---> Using cache
 ---> 78da0851d3c4
Step 3/13 : WORKDIR /opt/ml/code
 ---> Using cache
 ---> 07bd9a0de06d
Step 4/13 : RUN pip install --upgrade --force-reinstall  torch torchvision cython
 ---> Using cache
 ---> b13c99508ae7
Step 5/13 : RUN pip install mmcv-full==latest+torch1.7.0+cu101 -f https://download.openmmlab.com/mmcv/dist/index.html
 ---> Running in 02ad64588c07
Looking in links: https://download.openmmlab.com/mmcv/dist/index.html
Collecting mmcv-full==latest+torch1.7.0+cu101
  Downloading https://download.openmmlab.com/mmcv/dist/latest/torch1.7.0/cu101/mmcv_full-latest%2Btorch1.7.0%2Bcu101-cp36-cp36m-manylinux1_x86_64.whl (24.1 MB)
Collecting yapf
  Downloading y

### Training script

At training time, SageMaker executes the training script defined in the `SAGEMAKER_PROGRAM` variable. In our case, this script does the following:
- parses user parameters passed via the SageMaker hyperparameter dictionary;
- constructs the launch command based on these parameters;
- uses the `torch.distributed.launch` utility to launch distributed training;
- uses MMDetection `tools/train.py` to configure the training process.
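
The steps listed above could be sketched roughly as follows. This is a minimal illustration, not the actual contents of `mmdetection_train.py`: the function names `parse_options` and `build_launch_command`, the master port `29500`, and the `/opt/ml/model` work directory are assumptions made for the example. It also assumes that the installed MMDetection version accepts `--cfg-options` (older releases used `--options`) and that the `SM_HOSTS`, `SM_CURRENT_HOST`, and `SM_NUM_GPUS` environment variables are provided by the SageMaker training toolkit.

```python
import json
import os


def parse_options(options):
    """Split 'a=1; b=2' into ['a=1', 'b=2'], dropping empty fragments."""
    return [item.strip() for item in options.split(";") if item.strip()]


def build_launch_command(hyperparameters, hosts, current_host, num_gpus,
                         master_port=29500, work_dir="/opt/ml/model"):
    """Assemble a torch.distributed.launch command for one node (hypothetical helper)."""
    hosts = sorted(hosts)  # every node must derive the same rank order
    command = [
        "python", "-m", "torch.distributed.launch",
        f"--nnodes={len(hosts)}",
        f"--node_rank={hosts.index(current_host)}",
        f"--nproc_per_node={num_gpus}",
        f"--master_addr={hosts[0]}",
        f"--master_port={master_port}",
        "tools/train.py",
        hyperparameters["config-file"],
        "--launcher", "pytorch",
        "--work-dir", work_dir,
    ]
    # Assumption: config overrides are forwarded through --cfg-options
    options = parse_options(hyperparameters.get("options", ""))
    if options:
        command += ["--cfg-options"] + options
    return command


if __name__ == "__main__":
    # Fall back to a single-node, single-GPU setup when run outside SageMaker
    hosts = json.loads(os.environ.get("SM_HOSTS", '["algo-1"]'))
    current_host = os.environ.get("SM_CURRENT_HOST", hosts[0])
    num_gpus = int(os.environ.get("SM_NUM_GPUS", "1"))
    hps = {
        "config-file": "configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py",
        "options": "total_epochs=1; optimizer.lr=0.08",
    }
    print(" ".join(build_launch_command(hps, hosts, current_host, num_gpus)))
```

Sorting the host list before deriving `node_rank` and `master_addr` is one simple way to make every node agree on the cluster layout without extra coordination.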


In [None]:
! pygmentize container_training/mmdetection_train.py

## Start SageMaker Training

In [4]:
# Define IAM role
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()

In [5]:
from time import gmtime, strftime

prefix_input = 'mmdetection-input'
prefix_output = 'mmdetection-output'

In [6]:
container = "mmdetection-training" # your container name
tag = "latest"
image = '{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account, region, container, tag)

In [7]:
# algorithm parameters

hyperparameters = {
    "config-file" : "configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py", # config path is relative to MMDetection root directory
    "dataset" : "coco",
    "auto-scale" : "false", # whether to scale LR and Warm Up time
    "validate" : "true", # whether to run validation after training is done
    
    # 'options' overrides individual config values
    "options" : "total_epochs=1; optimizer.lr=0.08; evaluation.gpu_collect=True",
}
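
For context on the `auto-scale` flag and the `optimizer.lr` override: a common heuristic for distributed training is the linear scaling rule, where the learning rate grows in proportion to the total batch size. The sketch below illustrates that rule only. It is not necessarily what the training script does, and the function name `scale_learning_rate` is hypothetical. The reference point of `lr=0.02` for a total batch of 16 images (8 GPUs with 2 images each) matches the default MMDetection detection schedules.

```python
def scale_learning_rate(base_lr, base_batch_size, num_nodes, gpus_per_node, samples_per_gpu):
    """Linear scaling rule: LR is proportional to the total batch size (hypothetical helper)."""
    total_batch_size = num_nodes * gpus_per_node * samples_per_gpu
    return base_lr * total_batch_size / base_batch_size


# Two nodes with 4 GPUs each and 2 images per GPU give a total batch of 16,
# which equals the reference batch size, so the learning rate is unchanged.
print(scale_learning_rate(0.02, 16, num_nodes=2, gpus_per_node=4, samples_per_gpu=2))
```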

In [8]:
# SageMaker parses these metrics from STDOUT and stores/visualizes them as part of the training job.
# Raw strings keep regex escapes such as \s and \d intact.
metrics = [
    {
        "Name": "loss",
        "Regex": r".*loss:\s([0-9.]+)\s*"
    },
    {
        "Name": "loss_rpn_cls",
        "Regex": r".*loss_rpn_cls:\s([0-9.]+)\s*"
    },
    {
        "Name": "loss_rpn_bbox",
        "Regex": r".*loss_rpn_bbox:\s([0-9.]+)\s*"
    },
    {
        "Name": "loss_cls",
        "Regex": r".*loss_cls:\s([0-9.]+)\s*"
    },
    {
        "Name": "acc",
        "Regex": r".*acc:\s([0-9.]+)\s*"
    },
    {
        "Name": "loss_bbox",
        "Regex": r".*loss_bbox:\s([0-9.]+)\s*"
    },
    {
        "Name": "loss_mask",
        "Regex": r".*loss_mask:\s([0-9.]+)\s*"
    },
    {
        "Name": "lr",
        # \.? matches an optional literal decimal point
        "Regex": r"lr: (-?\d+\.?\d*(?:[Ee]-\d+)?)"
    }
]
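
Metric regexes are easy to get subtly wrong, so it can be worth checking them locally before launching a job. The sketch below applies a few patterns that mirror the metric definitions above to a made-up log line. The `sample_line` content and the `extract_metrics` helper are hypothetical and only serve the illustration; real MMDetection log lines may be formatted differently.

```python
import re

# Patterns mirroring a few of the metric definitions, written as raw strings
patterns = {
    "loss": r".*loss:\s([0-9.]+)\s*",
    "loss_mask": r".*loss_mask:\s([0-9.]+)\s*",
    "acc": r".*acc:\s([0-9.]+)\s*",
    "lr": r"lr: (-?\d+\.?\d*(?:[Ee]-\d+)?)",
}

# Hypothetical log line, invented for this example
sample_line = (
    "Epoch [1][50/7330]\tlr: 1.978e-03, loss_rpn_cls: 0.1234, "
    "loss_cls: 0.4567, acc: 91.2345, loss_mask: 0.6789, loss: 1.5502"
)


def extract_metrics(line, patterns):
    """Return {metric name: captured string} for every pattern that matches the line."""
    found = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, line)
        if match:
            found[name] = match.group(1)
    return found


print(extract_metrics(sample_line, patterns))
```

Note that the `loss` pattern requires `loss` to be followed directly by a colon, so it does not pick up values from `loss_cls` or `loss_mask`.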

<br>
<br>

Execute the cells below to stage the COCO 2017 dataset in S3 and start training on SageMaker.
<br>
<br>

In [17]:
# !aws s3 cp s3://fast-ai-coco/coco_tiny.tgz .
!aws s3 cp s3://fast-ai-coco/train2017.zip .
!aws s3 cp s3://fast-ai-coco/val2017.zip .
!aws s3 cp s3://fast-ai-coco/test2017.zip .
!aws s3 cp s3://fast-ai-coco/annotations_trainval2017.zip .

download: s3://fast-ai-coco/train2017.zip to ./train2017.zip        
download: s3://fast-ai-coco/val2017.zip to ./val2017.zip            
download: s3://fast-ai-coco/test2017.zip to ./test2017.zip         


In [None]:
# !tar -xvf coco_tiny.tgz
!unzip train2017.zip
!unzip val2017.zip
!unzip test2017.zip
!unzip annotations_trainval2017.zip
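
After extraction, a quick sanity check of the local layout can catch a failed download before the lengthy S3 upload. The sketch below assumes the archives unpack into `train2017/`, `val2017/`, and `annotations/` in the current directory, and that the Mask R-CNN config reads the standard `instances_*2017.json` annotation files. The helper `missing_coco_paths` is hypothetical.

```python
from pathlib import Path

# Sub-paths assumed to be required for training and validation (test2017 is not needed)
REQUIRED_COCO_PATHS = (
    "train2017",
    "val2017",
    "annotations/instances_train2017.json",
    "annotations/instances_val2017.json",
)


def missing_coco_paths(root):
    """Return the required COCO sub-paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in REQUIRED_COCO_PATHS if not (root / p).exists()]


missing = missing_coco_paths(".")
if missing:
    print("Missing COCO paths:", missing)
else:
    print("COCO layout looks complete.")
```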

In [None]:
# !aws s3 cp --recursive coco_tiny s3://$bucket/coco_tiny
!aws s3 cp --recursive train2017 s3://$bucket/coco/train2017
!aws s3 cp --recursive val2017 s3://$bucket/coco/val2017
!aws s3 cp --recursive test2017 s3://$bucket/coco/test2017
!aws s3 cp --recursive annotations s3://$bucket/coco/annotations

In [None]:
est = sagemaker.estimator.Estimator(
    image,
    role=role,
    instance_count=2,               # renamed from train_instance_count in sagemaker>=2
    instance_type='ml.p3.8xlarge',  # renamed from train_instance_type in sagemaker>=2
    volume_size=100,                # renamed from train_volume_size in sagemaker>=2
    output_path="s3://{}/{}".format(bucket, prefix_output),
    metric_definitions=metrics,
    hyperparameters=hyperparameters,
    sagemaker_session=session,
)

# est.fit({"training" : "s3://"+bucket+"/coco_tiny/"})
est.fit({"training" : "s3://"+bucket+"/coco/"})


2020-12-10 10:39:49 Starting - Starting the training job...
2020-12-10 10:40:15 Starting - Launching requested ML instancesProfilerReport-1607596789: InProgress
.........
2020-12-10 10:41:41 Starting - Preparing the instances for training.........
2020-12-10 10:43:17 Downloading - Downloading input data....