distributed_tensorflow_mask_rcnn

Distributed TensorFlow training using Amazon SageMaker

Prerequisites

  1. Create and activate an AWS Account

  2. Manage your SageMaker service limits. You will need a minimum limit of 2 ml.p3.16xlarge and 2 ml.p3dn.24xlarge instances, but a service limit of 4 for each instance type is recommended. Service limits are specific to each AWS region. We recommend using the us-west-2 region for this tutorial.

  3. Create or use an existing Amazon S3 bucket in the AWS region where you would like to execute this tutorial. Save the S3 bucket name. You will need it later.
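
The limit requirements in step 2 can be checked with a small helper. This is a minimal sketch, not part of this repository: the function name `classify_limits` and the example values are hypothetical, and we assume you have already read your current limits for the chosen region from your AWS account.

```python
# Minimum and recommended limits stated in the prerequisites above.
MINIMUM_LIMIT = 2
RECOMMENDED_LIMIT = 4
REQUIRED_INSTANCE_TYPES = ("ml.p3.16xlarge", "ml.p3dn.24xlarge")


def classify_limits(current_limits):
    """Map each required instance type to 'insufficient', 'minimum', or 'recommended'.

    current_limits is a dict of instance type -> service limit in one AWS region.
    Instance types missing from the dict are treated as having a limit of 0.
    """
    report = {}
    for instance_type in REQUIRED_INSTANCE_TYPES:
        limit = current_limits.get(instance_type, 0)
        if limit < MINIMUM_LIMIT:
            report[instance_type] = "insufficient"
        elif limit < RECOMMENDED_LIMIT:
            report[instance_type] = "minimum"
        else:
            report[instance_type] = "recommended"
    return report


if __name__ == "__main__":
    # Hypothetical values; replace with the limits shown for your account and region.
    example_limits = {"ml.p3.16xlarge": 4, "ml.p3dn.24xlarge": 1}
    for instance_type, status in classify_limits(example_limits).items():
        print(f"{instance_type}: {status}")
```

Any instance type reported as `insufficient` needs a limit increase before you launch the training jobs.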

Mask-RCNN training

In this tutorial, our focus is distributed TensorFlow training using Amazon SageMaker.

Concretely, we will discuss distributed TensorFlow training for the TensorPack Mask/Faster-RCNN and AWS Samples Mask R-CNN models using the COCO 2017 dataset.

This tutorial has two key steps:

  1. We use AWS CloudFormation to create a new SageMaker notebook instance in an Amazon Virtual Private Cloud (VPC). The stack also creates an Amazon EFS file-system and an Amazon FSx for Lustre file-system, and mounts both file-systems on the notebook instance.

  2. We use the SageMaker notebook instance to launch distributed training jobs in the VPC using Amazon S3, Amazon EFS, or Amazon FSx for Lustre as the data source for input training data.

Create SageMaker notebook instance in a VPC

Our objective in this step is to create a SageMaker notebook instance in a VPC. We have two options. We can create a SageMaker notebook instance in a new VPC, or we can create the notebook instance in an existing VPC. We cover both options below.

Create SageMaker notebook instance in a new VPC

The AWS IAM User or AWS IAM Role executing this step requires AWS IAM permissions consistent with the Network Administrator job function.

The CloudFormation template cfn-sm.yaml can be used to create a CloudFormation stack that creates a SageMaker notebook instance in a new VPC.

You can create the CloudFormation stack using cfn-sm.yaml directly in the CloudFormation service console.

Alternatively, you can customize variables in stack-sm.sh script and execute the script anywhere you have AWS Command Line Interface (CLI) installed. The CLI option is detailed below:

  • Install AWS CLI
  • In stack-sm.sh, set AWS_REGION to your AWS region and S3_BUCKET to your S3 bucket. These two variables are required.
  • Optionally, you can set the EFS_ID variable if you want to use an existing EFS file-system. If you leave EFS_ID blank, a new EFS file-system is created. If you choose to use an existing EFS file-system, make sure it does not have any existing mount targets.
  • Optionally, you can specify GIT_URL to add a GitHub repository to the SageMaker notebook instance. If the GitHub repository is private, you can specify the GIT_USER and GIT_TOKEN variables.
  • Execute the customized stack-sm.sh script to create a CloudFormation stack using AWS CLI.
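
Before running the script, the variable rules above can be checked with a short helper. This is a minimal sketch under stated assumptions: `check_stack_settings` is a hypothetical function that is not part of stack-sm.sh, and the warning about GIT_USER and GIT_TOKEN without GIT_URL reflects our reading of the list above, not documented script behavior.

```python
# Variables the list above marks as required.
REQUIRED_VARIABLES = ("AWS_REGION", "S3_BUCKET")


def check_stack_settings(settings):
    """Return a list of problems found in a dict of stack-sm.sh variable values."""
    problems = []

    # AWS_REGION and S3_BUCKET must be non-empty.
    for name in REQUIRED_VARIABLES:
        if not settings.get(name, "").strip():
            problems.append(f"{name} is required")

    # Assumption: Git credentials are only meaningful when a repository URL is set.
    git_url = settings.get("GIT_URL", "").strip()
    has_credentials = any(
        settings.get(name, "").strip() for name in ("GIT_USER", "GIT_TOKEN")
    )
    if has_credentials and not git_url:
        problems.append("GIT_USER and GIT_TOKEN are set but GIT_URL is empty")

    # EFS_ID is optional: a blank value means a new EFS file-system is created.
    return problems


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    example = {"AWS_REGION": "us-west-2", "S3_BUCKET": "", "EFS_ID": ""}
    for problem in check_stack_settings(example):
        print(problem)
```

An empty list means the required variables are present. It does not verify that the region, bucket, or EFS file-system actually exist.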

The estimated time for creating this CloudFormation stack is 30 minutes. The stack creates the following AWS resources:

  1. A SageMaker execution role
  2. A Virtual Private Cloud (VPC) with an Internet Gateway (IGW), 1 public subnet, 3 private subnets, a NAT gateway, a Security Group, and a VPC Gateway Endpoint to S3
  3. An Amazon EFS file-system with mount targets in each private subnet in the VPC
  4. An Amazon FSx for Lustre file-system in the VPC
  5. A SageMaker Notebook instance in the VPC:
    • The EFS file-system is mounted on the SageMaker notebook instance
    • The FSx for Lustre file-system is mounted on the SageMaker notebook instance
    • The SageMaker execution role attached to the notebook instance provides appropriate IAM access to AWS resources

Create SageMaker notebook instance in an existing VPC

This option is only recommended for advanced AWS users. Make sure your existing VPC has the following:

  • One or more security groups
  • One or more private subnets with NAT Gateway access and existing EFS file-system mount targets
  • A VPC Gateway Endpoint to S3

Create a SageMaker notebook instance in the VPC using the Amazon SageMaker console. When creating the notebook instance, add at least 200 GB of local EBS volume under the advanced configuration options. You will also need to mount your EFS file-system and your FSx for Lustre file-system on the notebook instance.
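
The two mount steps can be sketched as commands built in Python. This is a hedged illustration, not the method used by cfn-sm.yaml: the function names, file-system IDs, mount name, and mount points are hypothetical, and the mount options follow the general-purpose NFS and Lustre mount commands from the AWS documentation as we understand them. We also assume the Lustre client is installed on the notebook instance. Verify the options against the current Amazon EFS and Amazon FSx for Lustre documentation before use.

```python
import shlex


def efs_mount_command(file_system_id, region, mount_point):
    """Build an NFS v4.1 mount command for an EFS file-system (not executed here)."""
    dns_name = f"{file_system_id}.efs.{region}.amazonaws.com"
    options = "nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport"
    return ["sudo", "mount", "-t", "nfs4", "-o", options, f"{dns_name}:/", mount_point]


def fsx_lustre_mount_command(dns_name, mount_name, mount_point):
    """Build a Lustre mount command for an FSx for Lustre file-system (not executed here)."""
    # Assumption: noatime,flock is an acceptable option set for your client version.
    return ["sudo", "mount", "-t", "lustre", "-o", "noatime,flock",
            f"{dns_name}@tcp:/{mount_name}", mount_point]


if __name__ == "__main__":
    # Hypothetical identifiers and paths; substitute your own values.
    print(shlex.join(efs_mount_command("fs-12345678", "us-west-2", "/mnt/efs")))
    print(shlex.join(fsx_lustre_mount_command(
        "fs-0123456789abcdef0.fsx.us-west-2.amazonaws.com", "fsx", "/mnt/fsx")))
```

The functions only print the commands, so you can review them before running them in a terminal on the notebook instance. The mount points must exist before mounting.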

Launch SageMaker training jobs

Jupyter notebooks for training Mask R-CNN are listed below: