# Convolutional Neural Network Training using PyTorch in AWS SageMaker

## Overview

Amazon Web Services (AWS) offers a wide range of tools and services for enterprise and individual developers. Among them, SageMaker is a fully managed machine learning service that lets data scientists and developers build and train machine learning models, and S3 is a data storage service that holds the dataset powering our model training process. In this assignment, we are going to use PyTorch in AWS SageMaker to implement and train a Convolutional Neural Network (CNN) on the MNIST dataset.

MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of handwritten digits, split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits).

## Setup

1. Let's start by creating a SageMaker session as shown in the screenshot below. If any of these steps is confusing, we recommend watching this video we created: https://youtu.be/s1M8P9X6j_8 or this tutorial on Amazon SageMaker: https://www.youtube.com/watch?v=pfjhNe1M2t4

> ![SageMaker Studio Session](https://raw.githubusercontent.com/zhoujc999/CS189-Final-Project-T/master/Assets/Screenshot_1.png)

---

2. Configure the IAM role to access the S3 bucket. This is necessary to store and retrieve the dataset used for training and evaluation. The S3 bucket should be in the same region as the notebook instance, training, and hosting.

> ![IAM role](https://raw.githubusercontent.com/zhoujc999/CS189-Final-Project-T/master/Assets/Screenshot_2.png)

---

3. Finally, create the Jupyter Notebook instance.

> ![SageMaker Studio Session](https://raw.githubusercontent.com/zhoujc999/CS189-Final-Project-T/master/Assets/Screenshot_3.png)

## Data

### Getting the data



In [None]:
import sagemaker
from torchvision import datasets, transforms


sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()

!pip install --upgrade jupyter_client

datasets.MNIST('data', download=True, transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
]))
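
The `transforms.Compose` pipeline in the cell above applies two steps to every image. As a rough illustration of the arithmetic only, the dependency-free sketch below mirrors what `ToTensor` and `Normalize` do to a single 8-bit pixel value. The helper name `normalize_pixel` is hypothetical; 0.1307 and 0.3081 are the mean and standard deviation passed to `Normalize` above.

```python
MNIST_MEAN = 0.1307  # mean passed to transforms.Normalize in the cell above
MNIST_STD = 0.3081   # standard deviation passed to transforms.Normalize above


def normalize_pixel(raw_value, mean=MNIST_MEAN, std=MNIST_STD):
    """Mimic ToTensor followed by Normalize for one 8-bit grayscale pixel.

    Hypothetical helper for illustration; torchvision does this per tensor.
    """
    scaled = raw_value / 255.0    # ToTensor maps [0, 255] to [0.0, 1.0]
    return (scaled - mean) / std  # Normalize subtracts the mean, divides by std


if __name__ == "__main__":
    for raw in (0, 128, 255):
        print(raw, round(normalize_pixel(raw), 4))
```

After this transform, background pixels (raw value 0) become negative and bright strokes become positive, so the inputs are roughly centered around zero.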

### Uploading the data to S3
We are going to use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value, `inputs`, identifies that location; we will use it later when we start the training job.


In [None]:
bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/pytorch-mnist'
inputs = sagemaker_session.upload_data(path='data', bucket=bucket, key_prefix=prefix)
print('S3 path: {}'.format(inputs))


## Convolutional Neural Network

You will be writing some code to complete the implementation of the CNN model to recognise handwritten digits. In the provided `mnist.py` script, there are a few parts that need to be filled in before we can start training the model.


### Part (a) & (b)

First, we need to define the CNN model. Since the input data are $28 \times 28$ pixel grayscale images, the input to the CNN has a single channel. We suggest the following architecture:

1. A convolutional layer with 10 output channels and kernel size of 5
2. A max pool layer with kernel of size 2 and stride of 2
3. A ReLU activation layer
4. A convolutional layer with 20 output channels and kernel size of 5
5. A dropout layer with probability 0.5
6. A max pool layer with a stride of 2
7. A ReLU activation layer
8. Two fully connected layers with a dropout layer in between
9. A log-softmax output layer (the negative log likelihood loss used in part (d) expects log-probabilities)
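
When sizing the first fully connected layer, it helps to trace how the spatial size of the feature maps shrinks through the layers listed above. The sketch below does that arithmetic in plain Python. It assumes stride 1 and no padding for both convolutions (the `nn.Conv2d` defaults) and a kernel size of 2 for the second max pool, which the list does not state; the function names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size along one spatial dimension of a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(input_size=28):
    """Trace the suggested architecture and return the flattened feature count."""
    size = conv_out(input_size, kernel=5)      # conv 1:     28 -> 24
    size = conv_out(size, kernel=2, stride=2)  # max pool 1: 24 -> 12
    size = conv_out(size, kernel=5)            # conv 2:     12 -> 8
    size = conv_out(size, kernel=2, stride=2)  # max pool 2:  8 -> 4 (kernel 2 assumed)
    channels = 20                              # output channels of conv 2
    return channels * size * size


if __name__ == "__main__":
    print(flattened_features())  # 20 * 4 * 4 = 320
```

Under these assumptions the first fully connected layer would take 320 input features; a different padding or pooling choice changes that number.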

### Part (c)

Next, we focus on the `train()` function. After setting up the function, we define an optimizer to start training the model. In part (c), we use mini-batch stochastic gradient descent (SGD) to optimize our CNN model. Here are the steps we need to accomplish:

1. For each batch, reset the optimizer gradients to 0.
2. Feed the data into the model and generate the output.
3. Compute the loss of the model.
4. Backpropagate the loss to compute the gradients of the model's weights, so that the optimizer can update them.
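
To see what these steps do mechanically, here is a dependency-free sketch of mini-batch SGD on a one-weight model `y = weight * x`. It is not the `mnist.py` solution: it uses a squared-error loss and a hand-written gradient instead of autograd, and the names `make_batches` and `train_epoch` are hypothetical. The comments name the PyTorch call that plays the same role.

```python
import random


def make_batches(n_samples=64, batch_size=8, true_weight=3.0, seed=0):
    """Build synthetic (x, y) pairs with y = true_weight * x, split into batches."""
    rng = random.Random(seed)
    data = [(x, true_weight * x)
            for x in (rng.uniform(-1.0, 1.0) for _ in range(n_samples))]
    return [data[i:i + batch_size] for i in range(0, n_samples, batch_size)]


def train_epoch(weight, batches, lr=0.5):
    """Run one epoch of mini-batch SGD; return the new weight and last batch loss."""
    last_loss = None
    for batch in batches:
        grad = 0.0                                   # 1. reset gradient (optimizer.zero_grad())
        preds = [weight * x for x, _ in batch]       # 2. forward pass (model(data))
        last_loss = sum((p - y) ** 2                 # 3. loss (squared error here,
                        for p, (_, y) in zip(preds, batch)) / len(batch)  # nll_loss in mnist.py)
        for p, (x, y) in zip(preds, batch):          # 4. backward pass (loss.backward()):
            grad += 2.0 * (p - y) * x / len(batch)   #    accumulate d(loss)/d(weight)
        weight -= lr * grad                          # weight update (optimizer.step())
    return weight, last_loss


if __name__ == "__main__":
    weight = 0.0
    batches = make_batches()
    for epoch in range(5):
        weight, loss = train_epoch(weight, batches)
        print(epoch, weight, loss)
```

The gradient is reset at the start of every batch because PyTorch accumulates gradients across `backward()` calls; skipping that reset would mix gradients from earlier batches into the update.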

### Part (d)

Finally, we can implement the `test()` function in part (d). To complete the function:

1. Feed the data into the model and generate the output.
2. Compute the negative log likelihood loss.
3. Increment the correct counter if the prediction was correct.
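
The evaluation steps above can be sketched without PyTorch to show what is being computed. The example assumes the model's output is converted to log-probabilities, which is what a negative log likelihood loss expects, and the names `log_softmax` and `evaluate` are hypothetical helpers written for this illustration. In PyTorch the same roles are played by `F.nll_loss` and an argmax over the class dimension.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw class scores."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_total for v in logits]


def evaluate(batch_logits, targets):
    """Return (summed NLL loss, number of correct predictions) for one batch."""
    total_loss = 0.0
    correct = 0
    for logits, target in zip(batch_logits, targets):
        log_probs = log_softmax(logits)       # 1. model output as log-probabilities
        total_loss += -log_probs[target]      # 2. NLL of the true class
        prediction = max(range(len(log_probs)), key=log_probs.__getitem__)
        correct += int(prediction == target)  # 3. count correct predictions
    return total_loss, correct


if __name__ == "__main__":
    logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
    loss, correct = evaluate(logits, targets=[0, 1])
    print(loss, correct)
```

Summing the loss over batches and dividing by the dataset size at the end gives the average test loss, and `correct` divided by the dataset size gives the accuracy.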


## Run training in SageMaker

Now, we are ready to train our model on SageMaker. The `PyTorch` class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script, an IAM role, the number of training instances, the training instance type, and hyperparameters. This example can be run on one or multiple instances, CPU or GPU. The `hyperparameters` parameter is a dict of values that will be passed to your training script; you can see how these values are accessed in the `mnist.py` script.


In [None]:
from sagemaker.pytorch import PyTorch

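# Note: these argument names follow version 1 of the SageMaker Python SDK.
# Version 2 renames train_instance_count/train_instance_type to
# instance_count/instance_type and also requires a py_version argument.
estimator = PyTorch(entry_point='mnist.py',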
                    role=role,
                    framework_version='1.4.0',
                    train_instance_count=2,
                    train_instance_type='ml.c4.xlarge',
                    hyperparameters={
                        'epochs': 6,
                        'backend': 'gloo'
                    })
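
On the script side, SageMaker passes each entry of the `hyperparameters` dict to the entry point as a command-line argument, and exposes input channels and the model directory through environment variables such as `SM_CHANNEL_TRAINING` and `SM_MODEL_DIR`. The sketch below shows one way a script could read them with `argparse`. It is not the provided `mnist.py`; the default values and the local fallback paths are assumptions made so the example runs outside SageMaker.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters the way a SageMaker entry-point script might."""
    parser = argparse.ArgumentParser()
    # Keys of the estimator's `hyperparameters` dict arrive as --key value pairs.
    parser.add_argument("--epochs", type=int, default=10)      # default is an assumption
    parser.add_argument("--backend", type=str, default=None)   # default is an assumption
    # SageMaker sets these variables inside the training container;
    # the fallbacks ("data", "model") are hypothetical paths for local runs.
    parser.add_argument("--data-dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAINING", "data"))
    parser.add_argument("--model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Simulate the arguments produced by {'epochs': 6, 'backend': 'gloo'}.
    args = parse_args(["--epochs", "6", "--backend", "gloo"])
    print(args.epochs, args.backend, args.data_dir, args.model_dir)
```

The `SM_CHANNEL_TRAINING` variable corresponds to a channel named `training`, which is the key used when calling `fit`.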

After we've constructed our `PyTorch` object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.


In [None]:
estimator.fit({'training': inputs})

## Host
### Create endpoint
After training, we use the `PyTorch` estimator object to build and deploy a `PyTorchPredictor`. This creates a SageMaker Endpoint, a hosted prediction service that we can use to perform inference.

The arguments to the `deploy` function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances and then deploy the Endpoint to a fleet of CPU-based instances. In that case, make sure that you return or save your model as a CPU model, similar to what we did in `mnist.py`.

In [None]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')