Guide to GitHub and AWS SageMaker for the Open Problems Jamboree

This document is a guide to using the Open Problems in Single-Cell Analysis GitHub repository to edit Docker containers and prototype code in AWS SageMaker. In this guide, you will perform the following steps:

Steps

Fork the SingleCellOpenProblems GitHub repository
Create a custom Dockerfile
Push this image to the Elastic Container Registry using GitHub Actions workflows
Attach this image to an AWS SageMaker Studio domain
Launch a notebook using the custom image

In addition to this guide, we've produced a video tutorial, Using GitHub and AWS for Open Problems in Single-Cell Analysis, which walks through every step of the process. In the video, we provide details that are helpful for troubleshooting if you're having issues. The video also performs all of the steps in the process from start to finish so you can see what needs to happen at each step. However, because of the troubleshooting tips and included wait times, the video is 40 minutes long. To make it quicker to get started, we've written out the steps in the process here and included time-stamped links to the video in case you'd like to see a specific step performed live.

You might also find it helpful to consult the AWS documentation for bringing your own image to AWS SageMaker. There are two options: the command-line interface (CLI) or the web interface (Console):

Custom Image with SageMaker Studio (CLI)
Custom Image with SageMaker Studio (Console)

There is a 1:1 correspondence between the steps to set up SageMaker using the CLI and the Console UI. We use the Console in our video and this guide, but feel free to use the CLI if you are more comfortable with it.

Introduction to the GitHub repository
- Why Docker containers?
- Prototyping within Docker containers
Getting started and forking the GitHub repository
Editing a Dockerfile
Find your Docker container on the Elastic Container Registry
Attach your Image to SageMaker Studio
Add user to SageMaker Studio
Open SageMaker Studio and Launch a Notebook using a Custom Image
- Selecting an instance type
- Kernel not found error

Introduction to the GitHub repository

Watch this section of the tutorial starting at [0:00]

Open Problems in Single-Cell Analysis is a project to aggregate and benchmark solutions to formalized problems in single-cell analysis.

To facilitate comparing methods designed for a particular task, we've created a platform that runs benchmarks on AWS servers. This allows individuals to upload code to a central repository and have their method benchmarked using a set of standardized datasets without needing access to compute resources to run the full pipeline.

Why Docker containers?

Because method developers use different packages, each with its own set of dependencies and language requirements, we run each method, metric, and dataset loader in a Docker container. You can configure these containers to use a specific operating system, programming language, and any dependencies needed for a method. Each Docker container is specified by a Dockerfile within the docker/ directory of the Open Problems GitHub repository. Consult the Docker documentation for guidance on creating a Dockerfile.

You can find instructions for adding a new Docker image in our Docker README.md.

Prototyping within Docker containers

To facilitate prototyping within Docker containers, we've set up the Open Problems workflows to automatically compile containers and upload them to the Amazon Web Services (AWS) Elastic Container Registry.

You can attach these Images to an AWS SageMaker Domain. From there, you can launch Jupyter notebooks using your custom image. You can then test out code contributions within the image before committing to the GitHub repository.

Getting started and forking the GitHub repository

Watch this section of the tutorial starting at [1:20]

The goal for this section is to create a fork, activate GitHub Actions, configure your AWS secrets, edit the GitHub repository locally, and push to your fork.

Before you get started, please read our Contributor Guide. It contains important up-to-date information about how to best contribute to Open Problems.

Detailed instructions for the forking process can be found in the Submitting new features section of our Contributor Guide.

Editing a Dockerfile

Watch this section of the tutorial starting at [10:08]

All the Open Problems Docker containers are specified by Dockerfiles in the docker/ folder.

To edit your custom Docker image on your fork, follow the instructions for editing Docker images in our Docker Guide.

Note, if you're interested in creating a container for R, you need to inherit from the openproblems-r-base image.

Additional resources

SageMaker Studio Custom Image Samples - This GitHub repository contains example images that already work with SageMaker Studio
Bringing your own R environment to Amazon SageMaker Studio - This AWS ML Blog details the start to finish process of creating an R environment for SageMaker

Find your Docker container on the Elastic Container Registry

Watch this section starting at [15:08]

Once the Run Benchmarks workflow has successfully completed the Upload Docker job, your Docker images are uploaded to the AWS Elastic Container Registry (ECR).

You can navigate to the ECR from the AWS Console. You should have received an email with your AWS User Credentials that includes a Console Login URL. Enter your username and password to log in. You can then search for the Elastic Container Registry service in the search bar at the top of the screen.

Next, follow the steps to locate your image on the ECR in the Building Docker Images using Github Actions workflows section of our Docker guide.

Attach your Image to SageMaker Studio

Watch this section of the tutorial starting at [18:50]

To use your image in a SageMaker Studio notebook, you need to attach the image to the domain running the SageMaker app.

First, navigate to the SageMaker Studio control panel from the AWS Console. You should have received an email with your AWS User Credentials that includes a Console Login URL. You can then search for the Amazon SageMaker service in the search bar at the top of the screen.

Next, follow the steps in the AWS Documentation to Attach a Custom Image to an Existing SageMaker Studio Domain.

If you have already launched SageMaker Studio, you will need to restart your SageMaker Studio app so it can see the new image:

Follow the steps in the Shut Down SageMaker Resources tutorial.
Follow the steps below to Open SageMaker Studio.

A few caveats to watch out for during this step:

The Image name should be the Image Tag from the ECR. Use only lowercase letters and dashes, so that the name matches the regular expression ^[a-z-]+$. Underscores and other special characters are not allowed.
When editing the Image name and Image display name, you must click outside the box to validate the text.
The IAM role should be the AmazonSageMaker-ExecutionRole available within the dropdown.
The EFS Mount Path should be /home/sagemaker-user. As long as your image inherits from the base openproblems image, this folder should exist.
The Kernel name is set within the Docker Image. If you’re using one of the Python images, your kernel name should be python3. If you’re using an R image and want an R kernel, the kernel name should be ir. These Kernel names must be available to Jupyter KernelGateway. More information about Jupyter Kernels can be found in the Making kernels for Jupyter documentation.
If you see an error after you click Submit stating that the domain is already being updated, wait a few seconds and try again. Someone else tried to attach an image to the same domain at the same time as you, and the domain can only process one request at a time.
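
The caveats above can be checked before you open the Console form. The following is a minimal sketch, not part of the Open Problems tooling: the function name `check_image_settings` and the example image tag are hypothetical, and the rules it encodes are only the ones stated in this guide.

```python
import re

# Rule stated in this guide: lowercase letters and dashes only.
IMAGE_NAME_PATTERN = re.compile(r"^[a-z-]+$")

# Values taken from the caveats listed above.
EXPECTED_MOUNT_PATH = "/home/sagemaker-user"
KNOWN_KERNEL_NAMES = {"python3", "ir"}


def check_image_settings(image_name, kernel_name, mount_path):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    if not IMAGE_NAME_PATTERN.fullmatch(image_name):
        problems.append(f"image name {image_name!r} must use only a-z and '-'")
    if kernel_name not in KNOWN_KERNEL_NAMES:
        problems.append(f"kernel name {kernel_name!r} is not python3 or ir")
    if mount_path != EXPECTED_MOUNT_PATH:
        problems.append(f"EFS mount path should be {EXPECTED_MOUNT_PATH}")
    return problems


if __name__ == "__main__":
    # "openproblems-python-extras" is a hypothetical image tag used for illustration.
    print(check_image_settings("openproblems-python-extras", "python3", "/home/sagemaker-user"))
    # An underscore in the name is reported as a problem.
    print(check_image_settings("my_image", "python3", "/home/sagemaker-user"))
```

An empty list means the three values are consistent with this guide; it does not guarantee that SageMaker will accept the image.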

Additional resources

Custom SageMaker image specifications - specifications for SageMaker Studio Images
Field Notes: Accelerate Research with Managed Jupyter on Amazon SageMaker - a start to finish blog on creating a SageMaker environment with custom images
SageMaker Custom Images - Issue #9 - an issue @dburkhardt encountered while creating images for this project. SageMaker does some wonky modifications of the images before launching a notebook. This may be a hangup for you if you try to install a custom Python environment.

Add user to SageMaker Studio

Watch this section of the tutorial starting at [27:16]

You only need to do this step if you haven't already created a user on the SageMaker Studio console for the domain you're using.

To add a new user, go to the SageMaker Studio Control Panel and click Add User in the upper right-hand corner.

Your user name should be the first letter of your first name followed by your last name, in lower case. For example, Wes Lewis becomes wlewis.
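
As an illustration of the naming convention, here is a small sketch. The helper name `studio_username` is hypothetical, and treating the last whitespace-separated word as the last name is an assumption for names with more than two parts.

```python
def studio_username(full_name):
    """Build a user name from the first initial and the last name, in lower case."""
    parts = full_name.split()
    if len(parts) < 2:
        raise ValueError("expected at least a first name and a last name")
    # Assumption: the final word is the last name.
    return (parts[0][0] + parts[-1]).lower()


if __name__ == "__main__":
    print(studio_username("Wes Lewis"))  # wlewis
```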

Execution role should be AmazonSageMaker-ExecutionRole. This is the same as used when Attaching the image.

The user configuration page should look like this:

To create the user, hit Submit. If you see an error about the domain status, wait a few seconds and try again. Only one user can be added to a domain at a given time.

Open SageMaker Studio and Launch a Notebook using a Custom Image

Watch this section of the tutorial starting at [28:06]

To launch SageMaker, click on the Open Studio button next to your user name.

Note, you can access anyone's Studio to facilitate collaboration. Please be aware that this means you can also overwrite their notebooks. Be careful.

If you haven’t launched the Studio before, you should see a loading screen for SageMaker Studio. It may take 2-3 minutes for the server to load.

Once it’s working, you should see a Jupyter Lab interface. For more information on the UI, please consult the AWS Guide to the SageMaker Studio Interface.

To launch a Notebook with your custom image:

Scroll down to the "Notebooks and Compute Resources" section
Click on the Select a SageMaker Image dropdown
Select your custom image

Select either a Notebook or a Console depending on which you'd like to use. You can also launch an Image terminal, which will start a Bash environment from the Docker image.
If you click Notebook, SageMaker will launch a Jupyter Notebook using your custom image. You can see the Kernel being used in the upper right corner of the notebook.

While the image is loading, you will see "Unknown" next to the Kernel name. This will change once SageMaker has requisitioned a virtual instance for your notebook to run in. After the notebook has spun up, it will display information about the vCPU and RAM available in your instance. You may click on this information to select a new instance type.

You can now start programming in this Notebook using the installed packages!

Note, you can also start R notebooks by selecting an image that provides the R kernel (ir).

When you're done, be sure to Shut Down SageMaker Resources to avoid incurring excess costs.

Selecting an instance type

When you open a new notebook for the first time, you are assigned a default Amazon Elastic Compute Cloud (Amazon EC2) instance type to run the notebook. When you open additional notebooks on the same instance type, the notebooks run on the same instance as the first notebook, even if the notebooks use different kernels.

We've selected three instances to use during the Jamboree. Note, it is possible to select any kind of instance, but please only select from the following instances. If you need access to a different instance type, please contact an @organizer on Discord.

| Instance Type | vCPU | Memory | GPU | $/hr | Intended Use |
| --- | --- | --- | --- | --- | --- |
| `ml.t3.medium` | `2` | `4 GiB` | `0` | `$0.0582` | General prototyping with test datasets |
| `ml.t3.2xlarge` | `8` | `32 GiB` | `0` | `$0.4659` | Production prototyping with full datasets |
| `ml.g4dn.2xlarge` | `8` | `32 GiB` | `1` | `$1.0528` | Prototyping requiring a GPU |
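
To get a feel for what a session costs, the hourly prices in the table can be multiplied by the time an instance stays running. This is a rough sketch using only the prices listed above; the function name `estimate_cost` is hypothetical, and actual billing may differ (for example, storage or other charges are not included).

```python
# Hourly prices in USD, copied from the table in this guide.
HOURLY_PRICE_USD = {
    "ml.t3.medium": 0.0582,
    "ml.t3.2xlarge": 0.4659,
    "ml.g4dn.2xlarge": 1.0528,
}


def estimate_cost(instance_type, hours):
    """Return the approximate cost in USD of running one instance for `hours`."""
    if instance_type not in HOURLY_PRICE_USD:
        raise KeyError(f"{instance_type} is not one of the Jamboree instance types")
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return round(HOURLY_PRICE_USD[instance_type] * hours, 4)


if __name__ == "__main__":
    # Eight hours on the default prototyping instance.
    print(estimate_cost("ml.t3.medium", 8))  # 0.4656
    # The same eight hours on the GPU instance cost about 18 times as much.
    print(estimate_cost("ml.g4dn.2xlarge", 8))
```

The difference between instance types is one reason to shut down resources when you are done and to prototype on the smallest instance that fits your data.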

To change your instance, follow the Change Instance Type tutorial from AWS.

Kernel not found error

If you see the following error:

Don't fret! This means that the image you were using has been deleted from ECR. We do this when new images are uploaded with the same tag to save space.

To add the new image version:

Go back to the ECR, and find the URI for the newest version of the image that you want to work on.
From the SageMaker Studio Control Panel, find the attached image that you'd like to revise at the bottom of the window.
Click on "Attach version" next to the image.
Select "New Image Version"
Copy the URI into the textbox
Click "Next"
Don't change anything on the Image properties or Studio configuration pages (unless the new image has new configurations you need to update)
Submit the new version
You should now see the "Latest version attached" number increase by 1 on the "Custom images attached to domain" box.
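
The first step above, finding the URI of the newest image for a given tag, can also be sketched in code. The example below works on plain dictionaries shaped like the `imageDetails` entries that the ECR `describe_images` call returns (`imageTags`, `imagePushedAt`); it makes no AWS calls. The function name, the account ID `123456789012`, the region, the repository name, and the tags are all placeholders, not values from this project.

```python
from datetime import datetime, timezone


def newest_image_for_tag(repository_uri, image_details, tag):
    """Return (uri, pushed_at) for the most recently pushed image carrying `tag`."""
    matches = [d for d in image_details if tag in d.get("imageTags", [])]
    if not matches:
        raise LookupError(f"no image with tag {tag!r} found")
    newest = max(matches, key=lambda d: d["imagePushedAt"])
    return f"{repository_uri}:{tag}", newest["imagePushedAt"]


if __name__ == "__main__":
    # Placeholder repository URI in the standard ECR format.
    repo = "123456789012.dkr.ecr.us-west-2.amazonaws.com/openproblems"
    # Hypothetical records for illustration only.
    details = [
        {
            "imageTags": ["openproblems-r-base"],
            "imagePushedAt": datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc),
        },
        {
            "imageTags": ["openproblems-python-extras"],
            "imagePushedAt": datetime(2021, 3, 2, 9, 30, tzinfo=timezone.utc),
        },
    ]
    uri, pushed_at = newest_image_for_tag(repo, details, "openproblems-r-base")
    print(uri)
    print(pushed_at.isoformat())
```

Checking the push time is a quick way to confirm that the URI you paste into the "New Image Version" form points at the image built by your latest workflow run.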
