# Training MMAction2 Mask-RCNN Model on Sagemaker Distributed Cluster

## Motivation
[MMDetection](https://github.com/open-mmlab/mmdetection) is a popular open-source Deep Learning framework focused on Computer Vision models and use cases. MMDetection provides higher-level APIs for model training and inference. It reports [state-of-the-art benchmarks](https://github.com/open-mmlab/mmdetection#benchmark-and-model-zoo) for a variety of model architectures and ships with an extensive Model Zoo.

In this notebook, we will build a custom training container with the MMDetection library and then train a Mask-RCNN model from scratch on the [COCO2017 dataset](https://cocodataset.org/#home), using the Sagemaker distributed [training feature](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html) to reduce training time.

### Preconditions
- To execute this notebook, you need the COCO 2017 training and validation datasets uploaded to an S3 bucket that the Amazon Sagemaker service can access.


## Building Training Container

Amazon Sagemaker lets you bring your own (BYO) containers for training, data processing, and inference. In our case, we need to build a custom training container and push it to the [ECR service](https://aws.amazon.com/ecr/) in your AWS account.

To do this, we need to log in to two registries: the public ECR registry that hosts the Sagemaker base images, and your private ECR repository.

In [8]:
import sagemaker, boto3

session = sagemaker.Session()
region = session.boto_region_name
account = boto3.client('sts').get_caller_identity().get('Account')
bucket = session.default_bucket()

In [9]:
# login to Sagemaker ECR with Deep Learning Containers
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.{region}.amazonaws.com
# login to your private ECR
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


Now, let's review the training container. The Dockerfile does the following:
- uses the Sagemaker PyTorch 1.5.0 container as the base image;
- installs the latest versions of the PyTorch libraries and the MMAction2 dependencies;
- builds MMAction2 from source;
- configures the Sagemaker environment variables, specifically which script to run at training time.

In [None]:
! pygmentize -l docker Dockerfile.training

Next, we build the custom training container and push it to the private ECR repository.

In [16]:
! ./build_and_push.sh mmaction2-training latest Dockerfile.training

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Sending build context to Docker daemon  37.11MB
Step 1/14 : FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04
 ---> 47cd15520b75
Step 2/14 : LABEL author="vadimd@amazon.com"
 ---> Using cache
 ---> 4f170cf38f05
Step 3/14 : WORKDIR /opt/ml/code
 ---> Using cache
 ---> 8e584c805601
Step 4/14 : RUN pip install --upgrade --force-reinstall torch torchvision cython
 ---> Using cache
 ---> ec8831a42789
Step 5/14 : RUN pip install mmcv-full==latest+torch1.7.0+cu101 -f https://download.openmmlab.com/mmcv/dist/index.html
 ---> Using cache
 ---> 56f1a1893ae8
Step 6/14 : RUN git clone https://github.com/open-mmlab/mmaction2.git
 ---> Using cache
 ---> 2057ac623727
Step 7/14 : RUN cd mmaction2/ &&     pip install -r requirements/build.txt &&     pip install -e .
 ---> Using cache
 ---> e546e0935879
Step 8/14 : RUN pip install decord
 ---> Running in cc754a

### Training script

At training time, Sagemaker executes the training script defined in the `SAGEMAKER_PROGRAM` variable. In our case, this script does the following:
- parses the user parameters passed via the Sagemaker Hyperparameter dictionary;
- constructs the launch command based on these parameters;
- uses the `torch.distributed.launch` utility to launch distributed training;
- uses the MMAction2 `tools/train.py` script to configure the training process.
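
To make the launch step more concrete, here is a minimal sketch of how such a command could be assembled on each node. It is not the content of `mmaction2_train.py`: the helper name `build_launch_command` and the default port are hypothetical. It assumes the host list, the current host name, and the GPU count are read from the `SM_HOSTS`, `SM_CURRENT_HOST`, and `SM_NUM_GPUS` environment variables that Sagemaker training containers expose, and that `tools/train.py` accepts a config path followed by `--launcher pytorch`.

```python
import sys


def build_launch_command(hosts, current_host, num_gpus, train_args, master_port=55555):
    """Build a torch.distributed.launch command for one node of the cluster.

    hosts, current_host and num_gpus are assumed to come from SM_HOSTS,
    SM_CURRENT_HOST and SM_NUM_GPUS; master_port is an arbitrary free port.
    """
    hosts = sorted(hosts)  # identical ordering on every node, so ranks agree
    command = [
        sys.executable, "-m", "torch.distributed.launch",
        "--nnodes", str(len(hosts)),
        "--node_rank", str(hosts.index(current_host)),
        "--nproc_per_node", str(num_gpus),
        "--master_addr", hosts[0],  # the first host acts as rendezvous point
        "--master_port", str(master_port),
        "tools/train.py",
    ]
    return command + list(train_args)


# Example: second node of a two-node cluster with 4 GPUs per node.
example = build_launch_command(
    hosts=["algo-1", "algo-2"],
    current_host="algo-2",
    num_gpus=4,
    train_args=[
        "configs/recognition/tsn/tsn_r50_video_1x1x8_100e_kinetics400_rgb.py",
        "--launcher", "pytorch",
    ],
)
print(" ".join(example))
```

Every node runs the same script, so sorting the host list is one simple way to give each node a stable rank without extra coordination.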


In [None]:
! pygmentize container_training/mmaction2_train.py

## Start Sagemaker Training 

In [10]:
# Define IAM role
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()

In [11]:
from time import gmtime, strftime

prefix_input = 'mmaction2-input'
prefix_output = 'mmaction2-output'

In [12]:
container = "mmaction2-training" # your container name
tag = "latest"
image = '{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account, region, container, tag)

In [13]:
# algorithm parameters

hyperparameters = {
    "config-file" : "configs/recognition/tsn/tsn_r50_video_1x1x8_100e_kinetics400_rgb.py", # config path is relative to the MMAction2 root directory
    "dataset" : "kinetics400_tiny",
    "auto-scale" : "false", # whether to scale LR and Warm Up time
    "validate" : "true", # whether to run validation after training is done
    
    # 'options' overrides individual config values
    "options" : "total_epochs=1; optimizer.lr=0.08; evaluation.gpu_collect=True",
}
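
The `options` hyperparameter packs several config overrides into one string. As an illustration only, the sketch below shows one way a training script could turn that string into a dictionary with dotted keys, the form that mmcv's `Config.merge_from_dict` is designed to accept. The helper name `parse_options` is hypothetical, and the real script may parse the string differently.

```python
import ast


def parse_options(options):
    """Split 'key=value; key=value' into a dict, converting literal values."""
    overrides = {}
    for item in options.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing ';'
        key, _, raw = item.partition("=")
        raw = raw.strip()
        try:
            # Converts numbers, booleans, lists, etc. to Python objects.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            value = raw  # keep anything else as a plain string
        overrides[key.strip()] = value
    return overrides


overrides = parse_options("total_epochs=1; optimizer.lr=0.08; evaluation.gpu_collect=True")
print(overrides)
# {'total_epochs': 1, 'optimizer.lr': 0.08, 'evaluation.gpu_collect': True}
```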

In [14]:
# Sagemaker will parse metrics from STDOUT and store/visualize them as part of the training job.
# Raw strings keep the regex escapes (\s, \.) intact instead of relying on Python passing unknown escapes through.
metrics = [
    {
        "Name": "top_k_accuracy",
        "Regex": r".*top_k_accuracy:\s([0-9\.]+)\s*"
    },
    {
        "Name": "mean_class_accuracy",
        "Regex": r".*mean_class_accuracy:\s([0-9\.]+)\s*"
    },
]
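
Before launching a job, it can help to check the metric regexes locally against a log line. The sketch below does that with Python's `re` module. The sample line is made up for illustration and is not real training output, and the helper `extract_metrics` is hypothetical; we only assume that each regex is searched within a log line and that the first capture group holds the metric value.

```python
import re

metric_definitions = [
    {"Name": "top_k_accuracy", "Regex": r".*top_k_accuracy:\s([0-9\.]+)\s*"},
    {"Name": "mean_class_accuracy", "Regex": r".*mean_class_accuracy:\s([0-9\.]+)\s*"},
]

# Made-up log line, used only to exercise the patterns.
sample_line = "Epoch(val) [1][2] top_k_accuracy: 0.7000 mean_class_accuracy: 0.6500"


def extract_metrics(line, definitions):
    """Return {metric name: value} for every definition whose regex matches."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


print(extract_metrics(sample_line, metric_definitions))
```

If a metric never shows up in the Sagemaker console, running the actual log line through a check like this is a quick way to see whether the pattern or the log format is the problem.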

Execute the cells below to download the sample data, upload it to S3, and start training on Sagemaker.

In [9]:
# download, decompress the data
!rm kinetics400_tiny.zip*
!rm -rf kinetics400_tiny
!wget https://download.openmmlab.com/mmaction/kinetics400_tiny.zip
!unzip kinetics400_tiny.zip > /dev/null

rm: cannot remove ‘kinetics400_tiny.zip*’: No such file or directory
--2020-12-10 12:43:21--  https://download.openmmlab.com/mmaction/kinetics400_tiny.zip
Resolving download.openmmlab.com (download.openmmlab.com)... 47.252.96.35
Connecting to download.openmmlab.com (download.openmmlab.com)|47.252.96.35|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18308682 (17M) [application/zip]
Saving to: ‘kinetics400_tiny.zip’


2020-12-10 12:43:24 (8.76 MB/s) - ‘kinetics400_tiny.zip’ saved [18308682/18308682]



In [9]:
# Check the directory structure of the tiny data

# Install tree first
!sudo yum update -y && sudo yum install -y tree
!tree kinetics400_tiny

Loaded plugins: dkms-build-requires, priorities, update-motd, upgrade-helper,
              : versionlock
amzn-main                                                | 2.1 kB     00:00     
amzn-updates                                             | 3.8 kB     00:00     
Resolving Dependencies
[plugin/dkms-build-requires]: Found kernels in transaction, adding corresponding devel and gcc package(s)
--> Running transaction check
---> Package curl.x86_64 0:7.61.1-12.94.amzn1 will be updated
---> Package curl.x86_64 0:7.61.1-12.95.amzn1 will be an update
---> Package docker.x86_64 0:19.03.6ce-4.58.amzn1 will be updated
---> Package docker.x86_64 0:19.03.13ce-1.62.amzn1 will be an update
---> Package kernel.x86_64 0:4.14.203-116.332.amzn1 will be installed
---> Package libcurl.x86_64 0:7.61.1-12.94.amzn1 will be updated
---> Package libcurl.x86_64 0:7.61.1-12.95.amzn1 will be an update
---> Package libcurl-devel.x86_64 0:7.61.1-12.94.amzn1 will be updated
---> Package libcurl-devel.x86_64 0:7.6

In [2]:
# After downloading the data, we need to check the annotation format
!cat kinetics400_tiny/kinetics_tiny_train_video.txt

D32_1gwq35E.mp4 0
iRuyZSKhHRg.mp4 1
oXy-e_P_cAI.mp4 0
34XczvTaRiI.mp4 1
h2YqqUhnR34.mp4 0
O46YA8tI530.mp4 0
kFC3KY2bOP8.mp4 1
WWP5HZJsg-o.mp4 1
phDqGd0NKoo.mp4 1
yLC9CtWU5ws.mp4 0
27_CSXByd3s.mp4 1
IyfILH9lBRo.mp4 1
T_TMNGzVrDk.mp4 1
TkkZPZHbAKA.mp4 0
PnOe3GZRVX8.mp4 1
soEcZZsBmDs.mp4 1
FMlSTTpN3VY.mp4 1
WaS0qwP46Us.mp4 0
A-wiliK50Zw.mp4 1
oMrZaozOvdQ.mp4 1
ZQV4U2KQ370.mp4 0
DbX8mPslRXg.mp4 1
h10B9SVE-nk.mp4 1
P5M-hAts7MQ.mp4 0
R8HXQkdgKWA.mp4 0
D92m0HsHjcQ.mp4 0
RqnKtCEoEcA.mp4 0
LvcFDgCAXQs.mp4 0
xGY2dP0YUjA.mp4 0
Wh_YPQdH1Zg.mp4 0
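
Each annotation line is a video file name followed by an integer class label, separated by whitespace. As a small sanity check, the sketch below parses lines in this format and counts the samples per class. The helper `parse_annotations` is hypothetical, and the three sample lines are copied from the output above.

```python
from collections import Counter


def parse_annotations(text):
    """Parse '<video file> <label>' lines into (filename, label) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        filename, label = line.rsplit(maxsplit=1)
        samples.append((filename, int(label)))
    return samples


sample_text = """D32_1gwq35E.mp4 0
iRuyZSKhHRg.mp4 1
oXy-e_P_cAI.mp4 0
"""

samples = parse_annotations(sample_text)
class_counts = Counter(label for _, label in samples)
print(samples)
print(class_counts)
```

To check the full file, read `kinetics400_tiny/kinetics_tiny_train_video.txt` and pass its content to the same function.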


In [6]:
!aws s3 cp --recursive kinetics400_tiny s3://$bucket/kinetics400_tiny

upload: kinetics400_tiny/kinetics_tiny_train_video.txt to s3://sagemaker-us-east-1-579019700964/kinetics400_tiny/kinetics_tiny_train_video.txt
upload: kinetics400_tiny/kinetics_tiny_val_video.txt to s3://sagemaker-us-east-1-579019700964/kinetics400_tiny/kinetics_tiny_val_video.txt
upload: kinetics400_tiny/train/LvcFDgCAXQs.mp4 to s3://sagemaker-us-east-1-579019700964/kinetics400_tiny/train/LvcFDgCAXQs.mp4
upload: kinetics400_tiny/train/A-wiliK50Zw.mp4 to s3://sagemaker-us-east-1-579019700964/kinetics400_tiny/train/A-wiliK50Zw.mp4
upload: kinetics400_tiny/train/DbX8mPslRXg.mp4 to s3://sagemaker-us-east-1-579019700964/kinetics400_tiny/train/DbX8mPslRXg.mp4
upload: kinetics400_tiny/train/IyfILH9lBRo.mp4 to s3://sagemaker-us-east-1-579019700964/kinetics400_tiny/train/IyfILH9lBRo.mp4
upload: kinetics400_tiny/train/D92m0HsHjcQ.mp4 to s3://sagemaker-us-east-1-579019700964/kinetics400_tiny/train/D92m0HsHjcQ.mp4
upload: kinetics400_tiny/train/27_CSXByd3s.mp4 to s3://sagemaker-us-east-1-57901970

In [None]:
est = sagemaker.estimator.Estimator(
    image,
    role=role,
    train_instance_count=2,
    train_instance_type='ml.p3.8xlarge',
    train_volume_size=100,
    output_path="s3://{}/{}".format(bucket, prefix_output),
    metric_definitions=metrics,
    hyperparameters=hyperparameters,
    sagemaker_session=session,
)

est.fit({"training" : "s3://"+bucket+"/kinetics400_tiny/"})

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_volume_size has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
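
The warnings above come from version 2 of the Sagemaker Python SDK, which renamed several `Estimator` arguments. The sketch below lists the three renames that the warnings refer to and applies them to a dictionary of keyword arguments. The helper `to_v2_kwargs` is hypothetical and only illustrates the mapping; in practice you would simply use the new names in the `Estimator(...)` call.

```python
# Argument renames introduced in sagemaker>=2 (the three flagged in the warnings).
V2_RENAMES = {
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
    "train_volume_size": "volume_size",
}


def to_v2_kwargs(kwargs):
    """Return a copy of kwargs with v1 argument names replaced by v2 names."""
    return {V2_RENAMES.get(name, name): value for name, value in kwargs.items()}


legacy_kwargs = {
    "train_instance_count": 2,
    "train_instance_type": "ml.p3.8xlarge",
    "train_volume_size": 100,
}
v2_kwargs = to_v2_kwargs(legacy_kwargs)
print(v2_kwargs)
```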


2020-12-15 06:50:19 Starting - Starting the training job...
2020-12-15 06:50:42 Starting - Launching requested ML instancesProfilerReport-1608015019: InProgress
......
2020-12-15 06:51:43 Starting - Preparing the instances for training.........
2020-12-15 06:53:21 Downloading - Downloading input data
2020-12-15 06:53:21 Training - Training in-progress...
2020-12-15 06:53:45 Training - Downloading the training image.................