## Step 1: Create custom container using SageMaker PyTorch Deep Learning Framework

Update `role` with the ARN of your SageMaker execution role.

In [1]:
!pip --version

pip 20.2.1 from /Users/yihyap/anaconda3/envs/smv2/lib/python3.8/site-packages/pip (python 3.8)


In [2]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch
import warnings
warnings.filterwarnings('ignore')

ecr_namespace = 'sagemaker-training-containers/'
prefix = 'pytorch-training'
ecr_repository_name = ecr_namespace + prefix


role = "arn:aws:iam::342474125894:role/service-role/AmazonSageMaker-ExecutionRole-20190405T234154"
account_id = role.split(':')[4]
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print('Account: {}'.format(account_id))
print('Region: {}'.format(region))
print('Role: {}'.format(role))
print('S3 Bucket: {}'.format(bucket))
print('Repo: {}'.format(ecr_repository_name))

Account: 342474125894
Region: ap-southeast-1
Role: arn:aws:iam::342474125894:role/service-role/AmazonSageMaker-ExecutionRole-20190405T234154
S3 Bucket: sagemaker-ap-southeast-1-342474125894
Repo: sagemaker-training-containers/pytorch-training
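
The account ID above is taken from the fifth colon-separated field of the role ARN. As a minimal sketch, the same lookup can be wrapped in a helper that fails loudly when the string is not an IAM ARN. The function name `account_id_from_role_arn` and the variable `example_role` are illustrative and not part of the notebook.

```python
def account_id_from_role_arn(role_arn: str) -> str:
    """Return the 12-digit account ID embedded in an IAM role ARN."""
    parts = role_arn.split(":")
    # Expected layout: arn:<partition>:iam::<account-id>:role/<path>
    if len(parts) < 6 or parts[0] != "arn" or parts[2] != "iam":
        raise ValueError(f"Not an IAM ARN: {role_arn!r}")
    account_id = parts[4]
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"Unexpected account ID field: {account_id!r}")
    return account_id


example_role = (
    "arn:aws:iam::342474125894:role/service-role/"
    "AmazonSageMaker-ExecutionRole-20190405T234154"
)
print(account_id_from_role_arn(example_role))
```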


### Build training container

Next, we run a script that builds the custom container image and pushes it to Amazon ECR. The image must live in the same region as the training job.
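
Because the image and the job must share a region, it can help to check the region encoded in the image URI before launching anything. The sketch below assumes the standard `<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>` layout used in this notebook; the helper names `image_region` and `same_region` are hypothetical, and URIs in other partitions (for example `amazonaws.com.cn`) are not handled.

```python
import re

# Matches <account>.dkr.ecr.<region>.amazonaws.com/<repo>[:<tag>]
_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repo>[^:]+)(?::(?P<tag>.+))?$"
)


def image_region(image_uri: str) -> str:
    """Extract the AWS region from an ECR image URI."""
    match = _ECR_URI.match(image_uri)
    if match is None:
        raise ValueError(f"Not an ECR image URI: {image_uri!r}")
    return match.group("region")


def same_region(image_uri: str, job_region: str) -> bool:
    """Return True when the image lives in the region the job will run in."""
    return image_region(image_uri) == job_region


example_uri = (
    "342474125894.dkr.ecr.ap-southeast-1.amazonaws.com/"
    "sagemaker-training-containers/pytorch-training:latest"
)
print(image_region(example_uri))
print(same_region(example_uri, "ap-southeast-1"))
```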

In [2]:
# ./build_and_push.sh 342474125894 ap-southeast-1 sagemaker-training-containers/pytorch-training
! ../scripts/build_and_push.sh $account_id $region $ecr_repository_name

invalid argument "../docker" for "-t, --tag" flag: invalid reference format
See 'docker build --help'.
"docker tag" requires exactly 2 arguments.
See 'docker tag --help'.

Usage:  docker tag SOURCE_IMAGE[:TAG] TARGET_IMAGE[:TAG]

Create a tag TARGET_IMAGE that refers to SOURCE_IMAGE
usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:

  aws help
  aws <command> help
  aws <command> <subcommand> help
aws: error: argument --region: expected one argument
Error: Cannot perform an interactive login from a non TTY device

Parameter validation failed:
Invalid length for parameter repositoryNames, value: 0, valid range: 1-inf
usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:

  aws help
  aws <command> help
  aws <command> <subcommand> help
aws: error: argument --repository-name: expected one argument
invalid reference format


In [10]:
train_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print('ECR training container URI: {}'.format(train_image_uri))

ECR training container URI: 342474125894.dkr.ecr.ap-southeast-1.amazonaws.com/sagemaker-training-containers/pytorch-training:latest


The Docker image is now pushed to ECR. In the next section, we train an acoustic classification model using the custom container.

## Step 2: Training on SageMaker PyTorch custom container
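
The estimator in this step passes `seed` and `epochs` to the container as hyperparameters. The training script inside the container is not shown in this notebook, so the following is only a sketch of how such a script might read them. It assumes the container uses the SageMaker training toolkit, which hands hyperparameters to the entry script as command-line arguments and exposes the model output directory through the `SM_MODEL_DIR` environment variable. The defaults are placeholders.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters passed to the training script as CLI arguments."""
    parser = argparse.ArgumentParser()
    # Hyperparameter values arrive as strings, so convert them explicitly.
    parser.add_argument("--seed", type=int, default=1)
    parser.add_argument("--epochs", type=int, default=10)
    # /opt/ml/model is the conventional model directory inside the container.
    parser.add_argument(
        "--model-dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    return parser.parse_args(argv)


# Simulate the arguments the container would receive for this notebook's job.
args = parse_args(["--seed", "1", "--epochs", "50"])
print(args.seed, args.epochs)
```

Declaring `type=int` means it does not matter that the notebook passes `"seed"` as a string and `"epochs"` as an integer; both end up as integers in the script.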

In [14]:
import sagemaker
import json

hyperparameters = {
    "seed": "1",
    "epochs": 50,
}

est = sagemaker.estimator.Estimator(train_image_uri,
                                    role,
                                    instance_count=1, 
                                    # instance_type='local',  # uncomment to run in local mode instead
                                    instance_type='ml.m5.xlarge',
                                    base_job_name=prefix,
                                    hyperparameters=hyperparameters)


est.fit()

#train_config = sagemaker.inputs.TrainingInput('s3://{0}/{1}/train/'.format(bucket, prefix), content_type='text/csv')
#val_config = sagemaker.inputs.TrainingInput('s3://{0}/{1}/val/'.format(bucket, prefix), content_type='text/csv')
#est.fit({'train': train_config, 'validation': val_config })

2020-08-11 12:33:43 Starting - Starting the training job...
2020-08-11 12:33:45 Starting - Launching requested ML instances......
2020-08-11 12:35:13 Starting - Preparing the instances for training......
2020-08-11 12:35:58 Downloading - Downloading input data
2020-08-11 12:35:58 Training - Downloading the training image......
2020-08-11 12:37:08 Uploading - Uploading generated training model
2020-08-11 12:37:08 Completed - Training job completed
2020-08-11 12:36:57,534 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-08-11 12:37:03,777 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-08-11 12:37:03,790 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-08-11 12:37:03,801 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {},
    "current_host": "algo-1",


### Retrieve model location

In [16]:
model_location = est.model_data
print(model_location)

s3://sagemaker-ap-southeast-1-342474125894/pytorch-training-2020-08-11-12-33-56-086/output/model.tar.gz
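
If you want to inspect or download the archive, the S3 URI first has to be split into a bucket and an object key. A minimal sketch using only the standard library follows; the helper name `split_s3_uri` is illustrative.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split an s3://bucket/key URI into a (bucket, key) tuple."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The path starts with a slash that is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


example_bucket, example_key = split_s3_uri(
    "s3://sagemaker-ap-southeast-1-342474125894/"
    "pytorch-training-2020-08-11-12-33-56-086/output/model.tar.gz"
)
print(example_bucket)
print(example_key)
```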


## Step 3: Inference

For inference, we use the default PyTorch inference image. The mandatory `model_fn` function is implemented in `inference.py`. `PyTorchModel` is used to deploy the custom model that we trained previously.
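
The contents of `inference.py` are not reproduced here. As a rough sketch of what a `model_fn` can look like, the example below assumes the trained model is the small `Net` classifier tested at the end of this notebook (5 input features, 3 classes) and that training saved its `state_dict` under the hypothetical file name `model.pth`. Adjust both to match your own training script.

```python
import os

import torch
import torch.nn.functional as F
from torch import nn


class Net(nn.Module):
    """Single linear layer followed by log-softmax over 3 classes."""

    def __init__(self, input_features):
        super().__init__()
        self.fc1 = nn.Linear(input_features, 3)

    def forward(self, x):
        return F.log_softmax(self.fc1(x), dim=1)


def model_fn(model_dir):
    """Load the model from the directory where model.tar.gz was extracted."""
    model = Net(input_features=5)  # assumption: 5 features, as in the test payload
    state_path = os.path.join(model_dir, "model.pth")  # hypothetical file name
    model.load_state_dict(torch.load(state_path, map_location="cpu"))
    model.eval()  # switch to inference mode
    return model
```

The serving container calls `model_fn` once at startup and reuses the returned model for every request.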

### Deploy model

In [3]:
from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(model_data="s3://sagemaker-ap-southeast-1-342474125894/pytorch-training-2020-08-11-15-05-07-606/output/model.tar.gz", 
                             role=role, 
                             entry_point='inference.py',
                             source_dir='../docker/code',
                             py_version='py3',
                             framework_version='1.5',
                            )
predictor = pytorch_model.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge', wait=True)


---------------!

In [None]:
pytorch_model.endpoint_name

### Get Predictor

In [38]:
from sagemaker.pytorch.model import PyTorchPredictor

endpoint_name = "pytorch-inference-2020-08-12-08-52-57-488"

# By default, PyTorchPredictor serializes and deserializes payloads as NumPy (.npy) arrays
predictor = PyTorchPredictor(endpoint_name)


In [41]:
import torch
import numpy as np

payload = torch.tensor(np.array([[1,2,3,4,5],[2,3,4,5,6]]), dtype=torch.float)
response = predictor.predict(payload)
print(response)

[[-9.78976059e+00 -8.42716694e+00 -2.74858845e-04]
 [-1.21343966e+01 -1.20159941e+01 -1.14440263e-05]]
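
The values in the response are negative and close to zero for one class per row, which is consistent with log-probabilities from a `log_softmax` output layer. Assuming that is what the endpoint returns, a short sketch of turning the response into probabilities and predicted class indices looks like this (the array below is copied from the output above):

```python
import numpy as np

# Response copied from the endpoint output above: one row per input sample.
log_probs = np.array([
    [-9.78976059e+00, -8.42716694e+00, -2.74858845e-04],
    [-1.21343966e+01, -1.20159941e+01, -1.14440263e-05],
])

# exp() undoes the log, giving per-class probabilities that sum to ~1 per row.
probs = np.exp(log_probs)

# The predicted class is the index with the highest probability.
predicted = probs.argmax(axis=1)

print(probs.sum(axis=1))
print(predicted)
```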


## Step 4: Optional Cleanup

When you're done with the endpoint, delete it so that it stops incurring charges.

All of the training jobs, models and endpoints we created can be viewed through the SageMaker console of your AWS account.

In [18]:
predictor.delete_endpoint()

### PyTorch model test

In [34]:
import torch 
from torch import nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, input_features):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_features, 3)

    def forward(self, x):
        x = self.fc1(x)
        #x = x.reshape(-1,3)
        x = F.log_softmax(x, dim=1)
        return x

In [35]:
model = Net(5)

In [36]:
payload = torch.tensor(np.array([[1,2,3,4,5],[2,3,4,5,6]]), dtype=torch.float)
model(payload)

tensor([[-6.4697e+00, -6.5448e+00, -2.9917e-03],
        [-7.9158e+00, -8.4261e+00, -5.8419e-04]], grad_fn=<LogSoftmaxBackward>)

In [87]:
from sklearn.datasets import make_classification


X, Y = make_classification(
    n_samples=100,
    n_features=5,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    n_classes=3,
)

features = torch.FloatTensor(X[0])
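# torch.LongTensor(n) treats an integer n as a size, so build the label with torch.tensor
labels = torch.tensor(Y[0], dtype=torch.long)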

In [95]:
X[0]

array([-0.60699847, -0.25228405, -0.76545418,  1.23142814,  0.68585389])

In [96]:
torch.FloatTensor(X[0])

tensor([-0.6070, -0.2523, -0.7655,  1.2314,  0.6859])

In [97]:
Y[0]

1

In [99]:
torch.LongTensor(Y[0])

tensor([3977854284320629293])

In [100]:
torch.LongTensor(Y[2])

tensor([4503928797958963200,                   8])

In [98]:
torch.tensor(Y[0], dtype=torch.long)

tensor(1)

tensor([6, 0, 0])
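
The cells above show why `torch.LongTensor(Y[0])` returns arbitrary numbers: the legacy constructor interprets a single integer as a size and allocates an uninitialized tensor of that length. The sketch below contrasts it with `torch.tensor`, which stores the value itself. The small label array `example_labels` is made up for illustration.

```python
import numpy as np
import torch

# Legacy constructor: the integer 3 is a size, so this allocates
# 3 uninitialized int64 values rather than storing the number 3.
sized = torch.LongTensor(3)
print(sized.shape)

# torch.tensor stores the value itself as a 0-dimensional int64 tensor.
scalar = torch.tensor(1, dtype=torch.long)
print(scalar)

# Converting a whole label array at once avoids the ambiguity entirely.
example_labels = np.array([1, 0, 2])
labels_tensor = torch.tensor(example_labels, dtype=torch.long)
print(labels_tensor)
```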