- Manage your SageMaker service limits. You will need a service limit of at least 2 each for the `ml.p3.16xlarge` and `ml.p3dn.24xlarge` instance types, though a limit of 4 for each instance type is recommended. Keep in mind that service limits are specific to each AWS region. We recommend using the `us-west-2` region for this tutorial.
- Create or use an existing Amazon S3 bucket in the AWS region where you would like to run this tutorial. Note the S3 bucket name; you will need it later.
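The two prerequisites above can be sketched with the AWS CLI. The region and bucket name below are placeholders (assumptions, not values from this tutorial), and the exact quota names returned for your account may differ:

```shell
# Placeholders — substitute your own values.
AWS_REGION="us-west-2"
S3_BUCKET="my-mask-rcnn-bucket-12345"   # S3 bucket names must be globally unique

# List your current SageMaker service quotas; look for the entries
# covering ml.p3.16xlarge and ml.p3dn.24xlarge training usage.
aws service-quotas list-service-quotas \
    --service-code sagemaker --region "${AWS_REGION}" \
    --query "Quotas[?contains(QuotaName, 'p3')].[QuotaName,Value]" --output table

# Create the S3 bucket in the same region.
aws s3 mb "s3://${S3_BUCKET}" --region "${AWS_REGION}"
```

If a quota is below the recommended value, you can request an increase through the Service Quotas console for the same region.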
In this tutorial, our focus is distributed TensorFlow training using Amazon SageMaker.
Concretely, we will discuss distributed TensorFlow training for the TensorPack Mask/Faster-RCNN and AWS Samples Mask R-CNN models using the COCO 2017 dataset.
This tutorial has two key steps:
- We use Amazon CloudFormation to create a new SageMaker notebook instance in an Amazon Virtual Private Cloud (VPC). We also automatically create an Amazon EFS file system and an Amazon FSx for Lustre file system, and mount both file systems on the notebook instance.
- We use the SageMaker notebook instance to launch distributed training jobs in the VPC, using Amazon S3, Amazon EFS, or Amazon FSx for Lustre as the data source for input training data.
Our objective in this step is to create a SageMaker notebook instance in a VPC. We have two options: create the notebook instance in a new VPC, or create it in an existing VPC. We cover both options below.
The AWS IAM user or IAM role executing this step requires AWS IAM permissions consistent with the Network Administrator job function.
The CloudFormation template cfn-sm.yaml can be used to create a CloudFormation stack that creates a SageMaker notebook instance in a new VPC.
You can create the CloudFormation stack using cfn-sm.yaml directly in the CloudFormation service console.
Alternatively, you can customize variables in stack-sm.sh script and execute the script anywhere you have AWS Command Line Interface (CLI) installed. The CLI option is detailed below:
- Install the AWS CLI.
- In `stack-sm.sh`, set `AWS_REGION` to your AWS region and `S3_BUCKET` to your S3 bucket name. These two variables are required.
- Optionally, set the `EFS_ID` variable if you want to use an existing EFS file system. If you leave `EFS_ID` blank, a new EFS file system is created. If you choose to use an existing EFS file system, make sure it does not have any existing mount targets.
- Optionally, specify `GIT_URL` to add a GitHub repository to the SageMaker notebook instance. If the GitHub repository is private, you can specify the `GIT_USER` and `GIT_TOKEN` variables.
- Execute the customized `stack-sm.sh` script to create a CloudFormation stack using the AWS CLI.
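Concretely, the customization might look like the sketch below. The variable names come from `stack-sm.sh`; the values are placeholders you must replace with your own:

```shell
# Variables edited at the top of stack-sm.sh (values below are placeholders).
AWS_REGION="us-west-2"            # required: your AWS region
S3_BUCKET="my-mask-rcnn-bucket"   # required: your S3 bucket name
EFS_ID=""                         # optional: leave blank to create a new EFS file system
GIT_URL=""                        # optional: GitHub repository to attach to the notebook
```

After editing, run `./stack-sm.sh` from a shell where the AWS CLI is installed and configured with sufficient permissions.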
The estimated time for creating this CloudFormation stack is 30 minutes. The stack will create the following AWS resources:
- A SageMaker execution role
- A Virtual Private Cloud (VPC) with an Internet Gateway (IGW), 1 public subnet, 3 private subnets, a NAT gateway, a security group, and a VPC gateway endpoint to S3
- An Amazon EFS file system with mount targets in each private subnet in the VPC
- An Amazon FSx for Lustre file system in the VPC
- A SageMaker notebook instance in the VPC:
  - The EFS file system is mounted on the notebook instance
  - The FSx for Lustre file system is mounted on the notebook instance
  - The SageMaker execution role attached to the notebook instance provides the required IAM access to AWS resources
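Once stack creation finishes, you can check its status and inspect its outputs with the AWS CLI. The stack name below is a placeholder; use the name you chose when creating the stack:

```shell
# Replace with the stack name you chose (placeholder, not from the tutorial).
STACK_NAME="sm-stack"

# Show the stack status and its outputs (for example, the notebook
# instance and file-system identifiers created by the stack).
aws cloudformation describe-stacks \
    --stack-name "${STACK_NAME}" \
    --query "Stacks[0].[StackStatus,Outputs]" --output json
```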
This option is only recommended for advanced AWS users. Make sure your existing VPC has the following:
- One or more security groups
- One or more private subnets with NAT Gateway access and existing EFS file-system mount targets
- A VPC gateway endpoint to S3
Create a SageMaker notebook instance in a VPC using the Amazon SageMaker console. When you create the notebook instance, add at least 200 GB of local EBS volume under the advanced configuration options. You will also need to mount your EFS file system and your FSx for Lustre file system on the notebook instance.
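On the notebook instance (for example, from a terminal or a lifecycle-configuration script), mounting typically looks like the sketch below. The DNS names and the FSx mount name are placeholders (assumptions) that you must replace with the values for your own file systems, and the notebook's security groups must allow NFS (port 2049) and Lustre (port 988) traffic:

```shell
# Placeholders — replace with your own file-system DNS names and region.
EFS_DNS="fs-0123456789abcdef0.efs.us-west-2.amazonaws.com"
FSX_DNS="fs-0123456789abcdef0.fsx.us-west-2.amazonaws.com"
FSX_MOUNT_NAME="fsx"   # per-file-system mount name, shown in the FSx console

sudo mkdir -p /efs /fsx

# Mount EFS over NFS v4.1 with the options AWS recommends for EFS.
sudo mount -t nfs4 \
    -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport \
    "${EFS_DNS}:/" /efs

# Mount FSx for Lustre; this requires the Lustre client to be installed.
sudo mount -t lustre "${FSX_DNS}@tcp:/${FSX_MOUNT_NAME}" /fsx
```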
Jupyter notebooks for training Mask R-CNN are listed below:
- Mask R-CNN notebook that uses an S3 bucket, or EFS, as the data source: `mask-rcnn-scriptmode-experiment-trials.ipynb`
- Mask R-CNN notebook that uses an S3 bucket, or FSx for Lustre file system, as the data source: `mask-rcnn-scriptmode-fsx.ipynb`