# Local PyTorch models using SageMaker


This example takes the model from the PyTorch quickstart tutorial (https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html) and trains it on the Fashion MNIST dataset with the Amazon SageMaker Python SDK (https://sagemaker.readthedocs.io/en/stable/index.html), following the Amazon SageMaker Examples (https://github.com/aws/amazon-sagemaker-examples/tree/main/frameworks/pytorch). The model can be trained in two ways: in the prebuilt SageMaker PyTorch container running in AWS, or in the same container running locally. Inference is tested locally.

The development and test environment used was:
- Ubuntu with Docker Compose v2
- VS Code with conda and TensorBoard

#### References
- https://github.com/aws/amazon-sagemaker-examples/blob/main/frameworks/pytorch/get_started_mnist_train.ipynb
- https://github.com/aws/amazon-sagemaker-examples/tree/main/frameworks/pytorch 

In [1]:
# This example runs from a notebook and does not have a UI.
# Data is uploaded to S3 and fetched by the container during training.
# After training, the container saves the model to S3. The model is then downloaded manually to test inference.

# TensorBoard is used to monitor training progress. It can be viewed in the browser or in VS Code using the TensorBoard extension.
 

In [2]:
## Directories and Files are:
# ./code_local/: local directory containing python scripts
# ./code_local/dataset.py: dataset definition using PyTorch Fashion MNIST dataset
# ./code_local/model_def.py: model definition using PyTorch nn.Module
# ./code_local/train.py: training script used by SM container  

# ./data/: Fashion MNIST dataset is downloaded here
# ./out_model/: trained model is saved here after manual download
# ./runs/: tensorboard logs are saved here

# ./env_sm_pytorch.yml: conda environment file

## Set Environment as Local or SageMaker

In [3]:
## Set training mode
# Set local_training to True to run the SageMaker container for training on the machine that runs this notebook
# Set local_training to False to run the SageMaker container for training in AWS
local_training = False

## Set inference mode
# Set local_inference to True to run the SageMaker container for inference endpoint on the machine that runs this notebook. 
# Set local_inference to False to run the inference endpoint in AWS
# local_inference = True

# For local training or local inference, docker is needed to run SageMaker containers

## Set Up SageMaker Access

In [4]:
import os
import json
import uuid

import boto3
import sagemaker
from dotenv import load_dotenv
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch


# uses ~/.aws/credentials and ~/.aws/config
identity = boto3.client('sts').get_caller_identity()
user_name = identity['Arn'].split(':')[5]
sm_session = sagemaker.Session()
sm_region = sm_session.boto_region_name
bucket = sm_session.default_bucket()

# SageMaker (SM) role is in format like "arn:aws:iam::111222333444:role/service-role/AmazonSageMaker-ExecutionRole-999900001111222"
# Get role using role = get_execution_role() or copy it from the console.
# Since this notebook runs on a local laptop, the role is set in a file with custom environment variables
dir_home = os.environ['HOME']
env_file = dir_home + '/.aws/env_custom'
load_dotenv(env_file)
role = os.getenv('role')

prefix = "DEMO-fashion-mnist-pytorch"
output_path = f"s3://{bucket}/{prefix}"

checkpoint_suffix = str(uuid.uuid4())[:8]
checkpoint_s3_path = f"s3://{bucket}/checkpoint-{checkpoint_suffix}"

# uncomment to print the values
# print(f'User name: {user_name}')
print(f'SageMaker version: {sagemaker.__version__}')
print(f'SageMaker session:{sm_session}')
print(f'SageMaker region: {sm_region}')
# print(f'SageMaker S3 bucket: {bucket}')
# print(f'SageMaker S3 bucket output path: {output_path}')
# print('Checkpointing Path: {}'.format(checkpoint_s3_path))


SageMaker version: 2.75.1
SageMaker session:<sagemaker.session.Session object at 0x7f9982436ef0>
SageMaker region: us-east-1
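
The `user_name` line in the cell above takes the sixth colon-separated field of the caller ARN, which is the whole resource part (for example `user/alice`) rather than a bare name. The sketch below shows one way to split that field further. The helper name `parse_caller_arn` and the example ARNs are hypothetical and only illustrate the layout.

```python
def parse_caller_arn(arn: str) -> dict:
    """Split a caller ARN into its main fields."""
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    resource = parts[5]
    # The resource looks like "user/alice" or "assumed-role/RoleName/session"
    resource_type, _, resource_name = resource.partition("/")
    return {
        "account_id": parts[4],
        "resource": resource,
        "resource_type": resource_type,
        "name": resource_name.split("/")[-1],
    }


# Hypothetical ARN with the same shape as identity["Arn"] for an IAM user
example = parse_caller_arn("arn:aws:iam::111222333444:user/alice")
print(example["resource"])  # the value that split(":")[5] yields
print(example["name"])      # the last path segment only
```

For an assumed role, the last path segment is the session name, so whether `name` is the right thing to display depends on how the credentials were obtained.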


In [5]:
# define channels
loc = sm_session.upload_data(path="./data", bucket=bucket, key_prefix=prefix)
channels = {"training": loc, "testing": loc}

In [6]:
# set device  
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f'Device is: {device}')

Device is: cpu


## Train

In [7]:
# define the estimator
if local_training:
    instance_type = "local"
else:
    instance_type = "ml.c4.xlarge"

estimator = PyTorch(
    entry_point="train.py",
    source_dir="code_local",   
    role=role,
    framework_version="1.5.0",
    py_version="py3",
    instance_type=instance_type,
    instance_count=1,
    volume_size=250,
    output_path=output_path,
    hyperparameters={"batch-size": 128, "epochs": 5, "learning-rate": 1e-3, "log-interval": 100},
)
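
The keys in `hyperparameters` reach the entry-point script as command-line flags (`--batch-size 128` and so on), and the container exposes the model directory and the data channels through environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING`. The notebook does not show `train.py`, so the following is only a minimal sketch of how such a script could read these values. The function name `parse_args`, the `--model-dir`, `--train-dir` and `--test-dir` flags, and all default values are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters and SageMaker-provided paths (illustrative sketch)."""
    parser = argparse.ArgumentParser()
    # Flag names mirror the keys of the estimator's hyperparameters dict
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning-rate", type=float, default=1e-3)
    parser.add_argument("--log-interval", type=int, default=100)
    # Paths set by the container; the fallbacks are assumed local folders
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./out_model"))
    parser.add_argument("--train-dir", default=os.environ.get("SM_CHANNEL_TRAINING", "./data"))
    parser.add_argument("--test-dir", default=os.environ.get("SM_CHANNEL_TESTING", "./data"))
    # parse_known_args ignores any extra flags the container may add
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the flags produced by the hyperparameters dict above
args = parse_args(
    ["--batch-size", "128", "--epochs", "5", "--learning-rate", "0.001", "--log-interval", "100"]
)
print(args.batch_size, args.epochs, args.learning_rate, args.log_interval)
```

The channel names `training` and `testing` used in `channels` correspond to the `SM_CHANNEL_TRAINING` and `SM_CHANNEL_TESTING` variables, which is why the sketch reads both.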

In [8]:
%%capture
# Comment out %%capture above before running to see the training log.
# Train: this pulls the SageMaker container onto instance_type and runs the training script given in the estimator's entry_point.
estimator.fit(inputs=channels)
# Check the S3 bucket for the model.tar.gz file under output_path.

In [9]:
%%capture
# The model artifact is saved in S3.
# Comment out %%capture above and run the cell to see the S3 location where the model is saved.
pt_fmnist_model_data = estimator.model_data
print("Model artifact saved at:\n", pt_fmnist_model_data)
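
`estimator.model_data` is an `s3://` URI. To fetch the artifact with a client that expects a bucket and a key separately, the URI first has to be split. A minimal sketch using only the standard library follows. The helper name `split_s3_uri` and the example URI (bucket and job name) are hypothetical.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Return (bucket, key) for an s3://bucket/key URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI in the same shape as estimator.model_data
bucket_name, key = split_s3_uri(
    "s3://example-bucket/DEMO-fashion-mnist-pytorch/example-job/output/model.tar.gz"
)
print(bucket_name)
print(key)

# With boto3 available, the pair can then be passed to the S3 client, e.g.:
# boto3.client("s3").download_file(bucket_name, key, "./out_model/model.tar.gz")
```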

## Test

In [10]:
%%capture
# get data for inference. In this case, it is the same as test data
from torchvision import datasets
from torchvision.transforms import ToTensor
infer_data = datasets.FashionMNIST(
            root="data",
            train=False,
            download=True,
            transform=ToTensor(),)

In [11]:
# Classes for Fashion MNIST dataset
classes={
    0: 'T-shirt',
    1: 'Trouser',
    2: 'Pullover',
    3: 'Dress',
    4: 'Coat',
    5: 'Sandal',
    6: 'Shirt',
    7: 'Sneaker',
    8: 'Bag',
    9: 'Ankle Boot',
}

In [12]:
# Manually download the model (model.tar.gz) from S3 into local folder out_model and extract model.pth file from the tar.gz file
local_model_folder_file = './out_model/' + "model.pth"
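
The extraction step described above can also be scripted. The sketch below assumes that `model.tar.gz` has already been downloaded into `./out_model/` and that `model.pth` sits at the root of the archive. The helper name `extract_member` is hypothetical.

```python
import tarfile
from pathlib import Path


def extract_member(archive_path, member_name, dest_dir):
    """Copy one file out of a .tar.gz archive into dest_dir and return its path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        member = tar.getmember(member_name)  # raises KeyError if the file is absent
        source = tar.extractfile(member)
        if source is None:
            raise ValueError(f"{member_name!r} is not a regular file in the archive")
        # Write under the bare file name so archive paths cannot escape dest_dir
        target = dest / Path(member_name).name
        target.write_bytes(source.read())
    return target


# Example call, assuming the archive is already in ./out_model/:
# extract_member("./out_model/model.tar.gz", "model.pth", "./out_model")
```

Reading the member through `extractfile` and writing the bytes explicitly avoids unpacking the whole archive when only the weights file is needed.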


In [16]:
# make a single prediction
import numpy as np
from code_local import model_def

model_inf = model_def.NeuralNetwork()
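# map_location lets weights saved on one device load on another
model_inf.load_state_dict(torch.load(local_model_folder_file, map_location=device))
model_inf.to(device).eval()

# move the input to the same device as the model
x = infer_data[0][0].to(device)
y = infer_data[0][1]
with torch.no_grad():
    pred = model_inf(x)
    pred_index = np.argmax(pred[0].cpu().numpy())
    value = pred_index.item()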
    predicted = classes[value]
    actual = classes[y]
    print(f'Predicted Class: "{predicted}", Actual Class: "{actual}"')

Predicted Class: "Ankle Boot", Actual Class: "Ankle Boot"


In [17]:
# make predictions on randomly selected examples
import random

length = len(infer_data)
num_samples = 10
random_rows = random.sample(range(length), num_samples)

# get the class names for these random rows
for i in random_rows:
    x = infer_data[i][0].to(device)
    y = infer_data[i][1]
    with torch.no_grad():
        pred = model_inf(x)
        pred_index = np.argmax(pred[0].cpu().numpy())
        value = pred_index.item()
        predicted = classes[value]
        actual = classes[y]
        print(f'Predicted Class: "{predicted}", Actual Class: "{actual}"')


Predicted Class: "Dress", Actual Class: "Dress"
Predicted Class: "Ankle Boot", Actual Class: "Ankle Boot"
Predicted Class: "Trouser", Actual Class: "Trouser"
Predicted Class: "Coat", Actual Class: "Pullover"
Predicted Class: "Bag", Actual Class: "Bag"
Predicted Class: "Bag", Actual Class: "Sandal"
Predicted Class: "Bag", Actual Class: "Bag"
Predicted Class: "Trouser", Actual Class: "Trouser"
Predicted Class: "Pullover", Actual Class: "T-shirt"
Predicted Class: "Trouser", Actual Class: "Trouser"
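
The printed pairs above can be summarised as a single accuracy figure. The sketch below tallies the ten sample predictions copied from that output. The helper name `accuracy` is hypothetical, and ten random samples are far too few to estimate the model's accuracy on the full test set.

```python
def accuracy(pairs):
    """Fraction of (predicted, actual) pairs whose labels match."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(1 for predicted, actual in pairs if predicted == actual)
    return correct / len(pairs)


# (predicted, actual) pairs copied from the sample output above
sample_pairs = [
    ("Dress", "Dress"),
    ("Ankle Boot", "Ankle Boot"),
    ("Trouser", "Trouser"),
    ("Coat", "Pullover"),
    ("Bag", "Bag"),
    ("Bag", "Sandal"),
    ("Bag", "Bag"),
    ("Trouser", "Trouser"),
    ("Pullover", "T-shirt"),
    ("Trouser", "Trouser"),
]
print(f"Sample accuracy: {accuracy(sample_pairs):.0%}")  # 7 of 10 pairs match
```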


In [15]:
# end of notebook