# MNIST training with PyTorch

MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of handwritten digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). This tutorial shows how to train and test an MNIST model on SageMaker using PyTorch.



In [1]:
import os
import json

import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role


sess = sagemaker.Session()

role = get_execution_role()

output_path = "s3://" + sess.default_bucket() + "/mnist"

## PyTorch Estimator

The `PyTorch` class allows you to run your training script on SageMaker
infrastructure in a containerized environment. In this notebook, we
refer to this container as the *training container*.

You need to configure
it with the following parameters to set up the environment:

- `entry_point`: A user-defined Python file that the training container uses as the
instructions for training. We discuss this file further in the next subsection.

- `role`: An IAM role used to make AWS service requests.

- `instance_type`: The type of SageMaker instance that runs your training script.
Set it to `local` if you want to run the training job on
the SageMaker instance you are using to run this notebook.

- `instance_count`: The number of instances you need to run your training job.
Multiple instances are needed for distributed training.

- `output_path`:
The S3 URI where training output (model artifacts and output files) is saved.

- `framework_version`: The version of PyTorch you need to use.

- `py_version`: The Python version you need to use.

For more information, see [the API reference](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase)



## Implement the entry point for training

The entry point for training is a Python script that provides all
the code for training a PyTorch model. It is used by the SageMaker
PyTorch Estimator (`PyTorch` class above) as the entry point for running the training job.

Under the hood, the SageMaker PyTorch Estimator creates a docker image
with the runtime environment
specified by the parameters you used to initiate the
estimator class, and it injects the training script into the
docker image to be used as the entry point to run the container.

In the rest of the notebook, we use *training image* to refer to the 
docker image specified by the PyTorch Estimator and *training container*
to refer to the container that runs the training image. 

This means your training script is very similar to a training script
you might run outside Amazon SageMaker, but it can access the useful environment
variables provided by the training image. Check out [the short list of environment variables provided by the SageMaker service](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html?highlight=entry%20point) to see some common environment
variables you might use. Check out [the complete list of environment variables](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md) for a complete
description of all environment variables your training script
can access.
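
As a minimal sketch of how a training script might read a few of these variables (this is not the content of `code/train.py`): the helper name `read_sagemaker_env` is hypothetical, the channel names `training` and `testing` are assumed, and the fallback paths are assumptions that mirror the usual container layout so the function also works outside SageMaker.

```python
import os


def read_sagemaker_env(environ=None):
    """Collect a few SageMaker-provided settings, with fallbacks for local runs."""
    env = os.environ if environ is None else environ
    return {
        # Directory where the script should write the final model artifact.
        "model_dir": env.get("SM_MODEL_DIR", "/opt/ml/model"),
        # One SM_CHANNEL_<NAME> variable exists per input channel (names assumed here).
        "train_dir": env.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
        "test_dir": env.get("SM_CHANNEL_TESTING", "/opt/ml/input/data/testing"),
        # Environment values are strings, so convert the GPU count explicitly.
        "num_gpus": int(env.get("SM_NUM_GPUS", "0")),
    }


if __name__ == "__main__":
    print(read_sagemaker_env())
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to exercise in a local shell.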

In this example, we use the training script `code/train.py`
as the entry point for our PyTorch Estimator.


In [2]:
!pygmentize 'code/train.py'

import argparse
import gzip
import json
import logging
import os
import sys

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils

### Set hyperparameters

In addition, the PyTorch estimator allows you to pass command line arguments
to your training script via `hyperparameters`.

<span style="color:red"> Note: local mode is not supported in SageMaker Studio </span>

In [3]:
# set local_mode to be True if you want to run the training script
# on the machine that runs this notebook

local_mode = False

if local_mode:
    instance_type = "local"
else:
    instance_type = "ml.c4.xlarge"

est = PyTorch(
    entry_point="train.py",
    source_dir="code",  # directory of your training script
    role=role,
    framework_version="1.5.0",
    py_version="py3",
    instance_type=instance_type,
    instance_count=1,
    output_path=output_path,
    hyperparameters={"batch-size": 128, "epochs": 1, "learning-rate": 1e-3, "log-interval": 100},
)

The training container executes your training script like this:

```
python train.py --batch-size 128 --epochs 1 --learning-rate 1e-3 \
    --log-interval 100
```
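
On the script side, flags like these are typically consumed with `argparse`. The following is only a sketch, assuming the four flag names used in this notebook; the function name `parse_hyperparameters` and the default values are illustrative and are not taken from `code/train.py`.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the command line flags built from the `hyperparameters` dict."""
    parser = argparse.ArgumentParser()
    # argparse exposes "--batch-size" as the attribute "batch_size".
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning-rate", type=float, default=1e-3)
    parser.add_argument("--log-interval", type=int, default=100)
    # parse_known_args ignores any extra flags instead of exiting with an error.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_hyperparameters(["--batch-size", "128", "--epochs", "1"])
    print(args.batch_size, args.epochs, args.learning_rate, args.log_interval)
```

Accepting an explicit `argv` list makes the parser testable without touching `sys.argv`.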

## Set up channels for training and testing data

You need to tell the `PyTorch` estimator where to find your training and 
testing data. It can be a link to an S3 bucket or it can be a path
in your local file system if you use local mode. In this example,
we download the MNIST data from a public S3 bucket and upload it 
to your default bucket. 

In [4]:
import logging
import boto3
from botocore.exceptions import ClientError


# Download training and testing data from a public S3 bucket


def download_from_s3(data_dir="/tmp/data", train=True):
    """Download the gzip-compressed MNIST files from a public S3 bucket

    Args:
        data_dir (str): directory to save the data
        train (bool): download training set

    Returns:
        None
    """

    # Get global config
    with open("code/config.json", "r") as f:
        CONFIG = json.load(f)

    if not os.path.exists(data_dir):
        os.makedirs(data_dir)

    if train:
        images_file = "train-images-idx3-ubyte.gz"
        labels_file = "train-labels-idx1-ubyte.gz"
    else:
        images_file = "t10k-images-idx3-ubyte.gz"
        labels_file = "t10k-labels-idx1-ubyte.gz"

    # download objects
    s3 = boto3.client("s3")
    bucket = CONFIG["public_bucket"]
    for obj in [images_file, labels_file]:
        key = os.path.join("datasets/image/MNIST", obj)
        dest = os.path.join(data_dir, obj)
        if not os.path.exists(dest):
            s3.download_file(bucket, key, dest)
    return


download_from_s3("/tmp/data", True)
download_from_s3("/tmp/data", False)
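
The downloaded files are gzip-compressed IDX files. As an optional sanity check, the sketch below reads only the header of an image file using the standard library. It assumes the standard IDX layout for image files (a big-endian magic number of 2051, followed by the image count, rows, and columns); the helper name `read_idx3_header` is hypothetical and is not part of this notebook's code.

```python
import gzip
import struct


def read_idx3_header(path):
    """Return (num_images, rows, cols) from a gzip-compressed IDX3 image file."""
    with gzip.open(path, "rb") as f:
        # The header is four big-endian unsigned 32-bit integers.
        magic, count, rows, cols = struct.unpack(">IIII", f.read(16))
    if magic != 2051:
        raise ValueError(f"unexpected magic number {magic} in {path}")
    return count, rows, cols
```

For example, calling `read_idx3_header("/tmp/data/train-images-idx3-ubyte.gz")` after the download should report the number of training images and their 28x28 dimensions, provided the file follows the standard MNIST layout.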

In [5]:
# upload to the default bucket

prefix = "mnist"
bucket = sess.default_bucket()
loc = sess.upload_data(path="/tmp/data", bucket=bucket, key_prefix=prefix)

channels = {"training": loc, "testing": loc}

The keys of the dictionary `channels` are passed to the training image,
which creates an environment variable `SM_CHANNEL_<key name>` for each of them.

In this example, `SM_CHANNEL_TRAINING` and `SM_CHANNEL_TESTING` are created in the training image (check out
how `code/train.py` accesses these variables). For more information,
see: [SM_CHANNEL_{channel_name}](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md#sm_channel_channel_name)

If you want, you can create a channel for validation:
```
channels = {
    'training': train_data_loc,
    'validation': val_data_loc,
    'test': test_data_loc
    }
```
You can then access this channel within your training script via
`SM_CHANNEL_VALIDATION`.
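
The naming rule can be sketched in a few lines. This is an illustration of the convention only, not SageMaker's own implementation; the helper name `channel_env_vars` and the bucket `example-bucket` are placeholders.

```python
def channel_env_vars(channels):
    """Map each channel name to the SM_CHANNEL_* variable the script would read."""
    return {name: f"SM_CHANNEL_{name.upper()}" for name in channels}


if __name__ == "__main__":
    # Placeholder S3 locations, for illustration only.
    example_channels = {
        "training": "s3://example-bucket/mnist",
        "validation": "s3://example-bucket/mnist",
    }
    for name, var in channel_env_vars(example_channels).items():
        print(f"{name} -> {var}")
```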


## Run the training script on SageMaker
Now the training container has everything it needs to execute your training
script. You can start the container by calling the `fit` method.

In [6]:
est.fit(inputs=channels)

2021-08-09 23:59:23 Starting - Starting the training job...
2021-08-09 23:59:24 Starting - Launching requested ML instancesProfilerReport-1628553563: InProgress
...
2021-08-10 00:00:19 Starting - Preparing the instances for training.........
2021-08-10 00:01:50 Downloading - Downloading input data...
2021-08-10 00:02:17 Training - Downloading the training image...
2021-08-10 00:02:54 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-08-10 00:02:55,541 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-08-10 00:02:55,554 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-08-10 00:02:55,564 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-08-10 00:02:56,985 sagemaker_pytorch_contai

## Inspect and store model data

Now that training is finished, the model artifact has been saved in
the `output_path`. We can read its S3 location from `est.model_data`.

In [7]:
pt_mnist_model_data = est.model_data
print("Model artifact saved at:\n", pt_mnist_model_data)

Model artifact saved at:
 s3://sagemaker-us-east-1-804604702169/mnist/pytorch-training-2021-08-09-23-59-23-082/output/model.tar.gz


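We store the variable `pt_mnist_model_data` with the IPython `%store` magic so that it can be restored in another notebook (for example with `%store -r pt_mnist_model_data`).
In the [next notebook](get_started_with_mnist_deploy.ipynb), you will learn how to retrieve the model artifact and deploy it to a SageMaker
endpoint.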

In [8]:
%store pt_mnist_model_data

Stored 'pt_mnist_model_data' (str)


## Test and debug the entry point before executing the training container

The entry point `code/train.py` provided here has been tested and can be executed in the training container.
When you develop your own training script, it is good practice to simulate the container environment
in a local shell and test the script before sending it to SageMaker, because debugging in a containerized environment
is rather cumbersome. The following script shows how you can test your training script:

In [9]:
!pygmentize code/test_train.py

import json
import os
import sys

import boto3
from train import parse_args, train

dirname = os.path.dirname(os.path.abspath(__file__))

with open(os.path.join(dirname, "config.json"), "r") as f:
    CONFIG = json.load(f)


def download_from_s3(data_dir="/tmp/data", train=True):
    """Download MNIST dataset and convert it to numpy array

    Args:
        data_dir (str): directory to save the data
        train (bool): download training set