## Move your code from local Jupyter to Amazon SageMaker Studio - Part 2

This notebook is a combination of the [Official PyTorch QuickStart Tutorial Guide](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html) and code snippets required to run it in Amazon SageMaker.

**NOTE**: If you have already run the Part 1 notebook, you can skip directly to [Changes required to run it in Amazon SageMaker](#sagemaker-changes).

### Outline

1. Notebook taken as-is from the [Official PyTorch QuickStart Tutorial Guide](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html)
2. [Changes required to run it in Amazon SageMaker](#sagemaker-changes)

In [None]:
%matplotlib inline

<a id='outline-1'></a>
# 1. Notebook from the official PyTorch QuickStart tutorials



# Quickstart
This section runs through the API for common tasks in machine learning. Refer to the links in each section to dive deeper.

## Working with data
PyTorch has two [primitives to work with data](https://pytorch.org/docs/stable/data.html):
``torch.utils.data.DataLoader`` and ``torch.utils.data.Dataset``.
``Dataset`` stores the samples and their corresponding labels, and ``DataLoader`` wraps an iterable around
the ``Dataset``.


In [None]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

PyTorch offers domain-specific libraries such as [TorchText](https://pytorch.org/text/stable/index.html),
[TorchVision](https://pytorch.org/vision/stable/index.html), and [TorchAudio](https://pytorch.org/audio/stable/index.html),
all of which include datasets. For this tutorial, we will be using a TorchVision dataset.

The ``torchvision.datasets`` module contains ``Dataset`` objects for many real-world vision datasets such as
CIFAR and COCO ([full list here](https://pytorch.org/vision/stable/datasets.html)). In this tutorial, we
use the FashionMNIST dataset. Every TorchVision ``Dataset`` accepts two arguments, ``transform`` and
``target_transform``, to modify the samples and labels respectively.



In [None]:
# Download training data from open datasets.
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
)

# Download test data from open datasets.
test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor(),
)

We pass the ``Dataset`` as an argument to ``DataLoader``. This wraps an iterable over our dataset, and supports
automatic batching, sampling, shuffling and multiprocess data loading. Here we define a batch size of 64, i.e. each element
in the dataloader iterable will return a batch of 64 features and labels.



In [None]:
batch_size = 64

# Create data loaders.
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

for X, y in test_dataloader:
    print(f"Shape of X [N, C, H, W]: {X.shape}")
    print(f"Shape of y: {y.shape} {y.dtype}")
    break
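
As a quick sanity check on the batch size of 64, the number of batches per epoch can be worked out by hand. The sketch below uses only the standard library and assumes the published FashionMNIST split sizes (60,000 training and 10,000 test images) and the `DataLoader` default of keeping the final partial batch. `batches_per_epoch` is a hypothetical helper, not part of PyTorch.

```python
import math


def batches_per_epoch(num_samples, batch_size):
    """Return (number of batches, size of the last batch), keeping the final partial batch."""
    num_batches = math.ceil(num_samples / batch_size)
    last_batch = num_samples - (num_batches - 1) * batch_size
    return num_batches, last_batch


train_batches = batches_per_epoch(60_000, 64)
test_batches = batches_per_epoch(10_000, 64)
print(train_batches)  # (938, 32)
print(test_batches)   # (157, 16)
```

With 938 training batches, a progress line printed every 100 batches appears ten times per epoch.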

Read more about [loading data in PyTorch](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html).




--------------




## Creating Models
To define a neural network in PyTorch, we create a class that inherits
from [nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). We define the layers of the network
in the ``__init__`` function and specify how data will pass through the network in the ``forward`` function. To accelerate
operations in the neural network, we move it to the GPU if available.



In [None]:
# Get cpu or gpu device for training.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)
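
To get a feel for the size of this network, the number of trainable parameters can be computed by hand: each `nn.Linear(in, out)` layer holds `in * out` weights plus `out` biases. The sketch below is plain Python arithmetic rather than a PyTorch call, and `linear_params` is a hypothetical helper.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


# Layer shapes copied from the NeuralNetwork definition above.
layer_shapes = [(28 * 28, 512), (512, 512), (512, 10)]
per_layer = [linear_params(i, o) for i, o in layer_shapes]
total = sum(per_layer)

print(per_layer)  # [401920, 262656, 5130]
print(total)      # 669706
```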

Read more about [building neural networks in PyTorch](https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html).




--------------




## Optimizing the Model Parameters
To train a model, we need a [loss function](https://pytorch.org/docs/stable/nn.html#loss-functions)
and an [optimizer](https://pytorch.org/docs/stable/optim.html).



In [None]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

In a single training loop, the model makes predictions on the training dataset (fed to it in batches), and
backpropagates the prediction error to adjust the model's parameters.



In [None]:
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
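
The backpropagation calls map onto a simple update rule. For plain SGD without momentum or weight decay (the configuration used above), `optimizer.step()` applies `p = p - lr * grad` to every parameter. The sketch below reproduces that rule in plain Python on a hypothetical one-parameter problem, minimizing `(w - 3)**2`, with the gradient written out by hand instead of computed by autograd. The learning rate of 0.1 is chosen for a quick illustration and differs from the `1e-3` used above.

```python
def grad(w):
    """Derivative of the toy loss (w - 3)**2 with respect to w."""
    return 2.0 * (w - 3.0)


w = 0.0   # initial parameter value
lr = 0.1  # illustrative learning rate
for step in range(50):
    g = grad(w)     # plays the role of loss.backward()
    w = w - lr * g  # plays the role of optimizer.step()

print(w)  # approaches the minimizer w = 3
```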

We also check the model's performance against the test dataset to ensure it is learning.



In [None]:
def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

The training process is conducted over several iterations (*epochs*). During each epoch, the model learns
parameters to make better predictions. We print the model's accuracy and loss at each epoch; we'd like to see the
accuracy increase and the loss decrease with every epoch.



In [None]:
epochs = 5
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)
print("Done!")

Read more about [training your model](https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html).




--------------




## Saving Models
A common way to save a model is to serialize the internal state dictionary (containing the model parameters).



In [None]:
torch.save(model.state_dict(), "model.pth")
print("Saved PyTorch Model State to model.pth")

## Loading Models

The process for loading a model includes re-creating the model structure and loading
the state dictionary into it.



In [None]:
model = NeuralNetwork()
model.load_state_dict(torch.load("model.pth"))

This model can now be used to make predictions.



In [None]:
classes = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]

model.eval()
x, y = test_data[0][0], test_data[0][1]
with torch.no_grad():
    pred = model(x)
    predicted, actual = classes[pred[0].argmax(0)], classes[y]
    print(f'Predicted: "{predicted}", Actual: "{actual}"')

## Official PyTorch QuickStart Tutorial Ends here

# 2. <a id = 'sagemaker-changes'>Changes required to run it in Amazon SageMaker with BYOC</a>

In this section, we add the code that is specific to Amazon SageMaker.

First, we import the SageMaker Python SDK and the SageMaker managed PyTorch framework estimator. We also specify the IAM execution role to be used for the training job.

In [None]:
# Make sure you have at least the following version of the SageMaker SDK
#!pip install "sagemaker>=2.127.0"

In [None]:
import sagemaker  # SageMaker Python SDK
from sagemaker.pytorch.estimator import PyTorch  # PyTorch Estimator class
from sagemaker import get_execution_role  # function to fetch the execution role

# Store the execution role.
# Here we reuse the role that was used to create the SageMaker Studio user profile.
execution_role = get_execution_role()

In this section, we supply the data to the estimator and the training script from an S3 location. First, create the SageMaker session object and look up the default bucket that SageMaker creates for it. Next, define separate prefixes for storing the training and test data.

In [None]:
session = sagemaker.Session()
bucket = session.default_bucket()

train_prefix = "/pytorch/fashion-mnist/train"
test_prefix = "/pytorch/fashion-mnist/test"
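
The prefixes above start with a leading `/` so that plain string concatenation (`"s3://" + bucket + train_prefix`) yields a valid URI. A small helper can make this less fragile. The sketch below is a hypothetical `s3_uri` function written for illustration, not part of the SageMaker SDK, and the bucket name is a placeholder.

```python
def s3_uri(bucket, *parts):
    """Join a bucket name and key parts into an s3:// URI, ignoring stray slashes."""
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "s3://" + "/".join([bucket.strip("/")] + cleaned)


print(s3_uri("example-bucket", "/pytorch/fashion-mnist/train"))
# s3://example-bucket/pytorch/fashion-mnist/train
```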

Upload the training and test data to S3.

In [None]:
from sagemaker.s3 import S3Uploader

#Upload training data
S3Uploader.upload(local_path = "data/FashionMNIST/raw/train-images-idx3-ubyte.gz", 
                  desired_s3_uri = "s3://"+bucket+train_prefix, 
                  kms_key=None, 
                  sagemaker_session=session)

S3Uploader.upload(local_path = "data/FashionMNIST/raw/train-labels-idx1-ubyte.gz", 
                  desired_s3_uri = "s3://"+bucket+train_prefix, 
                  kms_key=None, 
                  sagemaker_session=session)

#Upload test data
S3Uploader.upload(local_path = "data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz", 
                  desired_s3_uri = "s3://"+bucket+test_prefix, 
                  kms_key=None, 
                  sagemaker_session=session)

S3Uploader.upload(local_path = "data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz", 
                  desired_s3_uri = "s3://"+bucket+test_prefix, 
                  kms_key=None, 
                  sagemaker_session=session)
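
Because the raw `.gz` files are uploaded as-is, the training script has to decode them itself. The files use the IDX format: a big-endian header with a magic number (2051 for images, 2049 for labels) and the dimension sizes, followed by unsigned bytes. The sketch below is one way to read them with only the standard library. `read_idx_labels` and `read_idx_images` are hypothetical helpers, and how the actual `train.py` loads the data is not shown in this notebook.

```python
import gzip
import os
import struct
import tempfile


def read_idx_labels(path):
    """Read a gzip-compressed IDX label file into a list of ints."""
    with gzip.open(path, "rb") as f:
        magic, count = struct.unpack(">II", f.read(8))
        if magic != 2049:
            raise ValueError(f"not an IDX label file (magic={magic})")
        return list(f.read(count))


def read_idx_images(path):
    """Read a gzip-compressed IDX image file into (list of bytes, rows, cols)."""
    with gzip.open(path, "rb") as f:
        magic, count, rows, cols = struct.unpack(">IIII", f.read(16))
        if magic != 2051:
            raise ValueError(f"not an IDX image file (magic={magic})")
        pixels = f.read(count * rows * cols)
    size = rows * cols
    images = [pixels[i * size:(i + 1) * size] for i in range(count)]
    return images, rows, cols


# Round-trip check on a tiny synthetic file (2 images of 2x2 pixels),
# so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    image_path = os.path.join(tmp, "images.gz")
    label_path = os.path.join(tmp, "labels.gz")
    with gzip.open(image_path, "wb") as f:
        f.write(struct.pack(">IIII", 2051, 2, 2, 2) + bytes(range(8)))
    with gzip.open(label_path, "wb") as f:
        f.write(struct.pack(">II", 2049, 2) + bytes([9, 0]))
    images, rows, cols = read_idx_images(image_path)
    labels = read_idx_labels(label_path)

print(len(images), rows, cols, labels)  # 2 2 2 [9, 0]
```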

Create SageMaker training input channels that point to the training and test data locations in S3.

In [None]:
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(s3_data="s3://"+bucket+train_prefix)
test_input = TrainingInput(s3_data="s3://"+bucket+test_prefix)
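
Inside the training container, SageMaker makes each channel available as a local directory under `/opt/ml/input/data/<channel_name>`, where the channel name is the dictionary key passed to `fit` (`train` and `test` in this notebook). When the image includes the SageMaker training toolkit, the same paths are also exposed through `SM_CHANNEL_<NAME>` environment variables. The sketch below shows how a training script might resolve them. `channel_dir` is a hypothetical helper, and whether the custom image sets these variables is an assumption, which is why the function falls back to the fixed path.

```python
import os


def channel_dir(name, environ=None):
    """Return the local directory for an input channel inside the training container."""
    env = os.environ if environ is None else environ
    # Prefer the toolkit-provided variable; fall back to the standard mount point.
    return env.get(f"SM_CHANNEL_{name.upper()}", f"/opt/ml/input/data/{name}")


print(channel_dir("train", environ={}))
# /opt/ml/input/data/train
```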

Build the custom container image URI. In this case, the image is stored in Amazon ECR, but it can also live in your own private repository, provided that the credentials are set up correctly.

1. If you are using ECR, make sure the execution role has permission to call the `get_caller_identity` API and to pull from the ECR repository.
2. If you are using your own private repository, make sure the execution role has permission to call the `get_caller_identity` API and to pull from that repository.

In [None]:
import boto3

training_image_name = "custom-pytorch-1-12"
training_image_version = "latest"

custom_image_uri = "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(
    boto3.client("sts").get_caller_identity().get("Account"),
    boto3.session.Session().region_name,
    training_image_name,
    training_image_version,
)

print(custom_image_uri)
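
For the ECR case, the pull permissions mentioned above can be expressed as an IAM policy document. The sketch below builds one as a Python dictionary. The four `ecr:` actions are the standard ones used to pull an image, while the region, account ID, and the `ecr_pull_policy` helper are placeholders for illustration. Treat it as a starting point to adapt, not as the exact policy this notebook's role uses.

```python
import json


def ecr_pull_policy(region, account_id, repository):
    """Build a minimal IAM policy document that allows pulling from one ECR repository."""
    repo_arn = f"arn:aws:ecr:{region}:{account_id}:repository/{repository}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # The authorization token call cannot be scoped to a repository.
                "Effect": "Allow",
                "Action": "ecr:GetAuthorizationToken",
                "Resource": "*",
            },
            {
                # Layer and manifest reads, limited to the training image repository.
                "Effect": "Allow",
                "Action": [
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                ],
                "Resource": repo_arn,
            },
        ],
    }


# Placeholder region and account ID; the repository name matches training_image_name.
policy = ecr_pull_policy("us-east-1", "123456789012", "custom-pytorch-1-12")
print(json.dumps(policy, indent=2))
```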

Create the SageMaker PyTorch estimator, pass it the channels created earlier, and start the training job. Training runs on a SageMaker managed training cluster, not on the notebook itself. Note the `hyperparameters` argument: this is how you pass arguments and hyperparameters to the training script.

In [None]:
# Create the estimator object for PyTorch

from sagemaker.pytorch.estimator import PyTorch  # PyTorch Estimator class

estimator = PyTorch(
    image_uri=custom_image_uri,  # our custom PyTorch image URI
    entry_point="train.py",  # training script
    instance_count=1,  # number of EC2 instances used for training
    instance_type="ml.c5.xlarge",  # type of EC2 instance used for training
    disable_profiler=True,  # disable the profiler, as it is not needed here
    role=execution_role,  # execution role used by the training job
    hyperparameters={"batch_size": 64},  # passed to the training script
)



inputs = {"train":train_input, "test": test_input}

# Start the training job on ephemeral remote compute
estimator.fit(inputs, wait=True)
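
On the script side, hyperparameters arrive as command-line arguments when the container runs the entry point through the SageMaker training toolkit, so `{'batch_size': 64}` becomes `--batch_size 64`. The contents of `train.py` are not shown in this notebook, so the sketch below is only one plausible way to parse them. The `--epochs` flag and its default are hypothetical additions, and `SM_MODEL_DIR` is read with a fallback to `/opt/ml/model`.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters the way a SageMaker entry point might receive them."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=5)  # hypothetical extra hyperparameter
    parser.add_argument(
        "--model-dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    # Ignore arguments this sketch does not know about instead of failing.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--batch_size", "64"])
print(args.batch_size, args.epochs)  # 64 5
```

Inside a real `train.py`, calling `parse_args()` with no argument reads from `sys.argv`.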