# Model Training with Sagemaker Training

## Overview

This notebook is derived from the original, larger notebook, [Sagemaker bring your own model with script mode](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-script-mode).

In the previous notebook [pytorch-nn-multiclass-classifier](https://github.com/sjaffry/multiclass-classifier-pytorch-nn/blob/main/pytorch-nn-multiclass-classifier.ipynb), I built a simple neural net with PyTorch and trained it within the notebook, which was running on a GPU instance. What I wanted to do next was to use Sagemaker managed infrastructure for training my model so I can scale the process better and lay the foundations for consistent production deployment processes. This notebook demonstrates how you can do that by bringing your model into Sagemaker using custom training and inference scripts with SageMaker's prebuilt container for PyTorch.

This notebook shows my step-by-step journey of taking something that ran completely in a notebook, [pytorch-nn-multiclass-classifier](https://github.com/sjaffry/multiclass-classifier-pytorch-nn/blob/main/pytorch-nn-multiclass-classifier.ipynb), and refactoring it to move training onto Sagemaker's managed training infrastructure.

This example does not cover model deployment & inference. I will cover that in a subsequent example notebook.

## Pre-requisites  

1. An S3 bucket for storing your model training code & data

## Step 1: Understanding the architecture of Sagemaker training infrastructure

Before I started refactoring my code using the example notebooks in Sagemaker examples, I wanted to understand how the "magic" works behind the Sagemaker PyTorch SDK. This is what I discovered:  

![Sagemaker PyTorch SDK](images/img_01.png)
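
As a rough companion to the diagram, the sketch below lists what a training script sees once the container is running. The `/opt/ml/...` locations and `SM_*` variable names follow Sagemaker's documented container conventions. The helper name `container_view` and the channel names `train` and `test` are my own assumptions for illustration and are not part of the SDK.

```python
def container_view(channels):
    """Map channel names to what the training script sees inside the container.

    `container_view` is a hypothetical helper for illustration only; Sagemaker
    itself sets these paths and environment variables when the job starts.
    """
    env = {
        "SM_MODEL_DIR": "/opt/ml/model",              # the trained model is written here
        "SM_OUTPUT_DATA_DIR": "/opt/ml/output/data",  # optional extra outputs
    }
    for name in channels:
        # Each channel's S3 data is made available in its own folder
        env[f"SM_CHANNEL_{name.upper()}"] = f"/opt/ml/input/data/{name}"
    return env


for variable, path in container_view(["train", "test"]).items():
    print(f"{variable} -> {path}")
```

Anything the script writes to the model directory is what Sagemaker packages and copies back to S3 when the job finishes.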

## Step 2: Review required and redundant libraries & imports

Do I need to download and install into this notebook any of the libraries that I did originally?  
  
  Contents of my original "requirements.txt"

-f https://download.pytorch.org/whl/torch_stable.html   
celluloid==0.2.0   
d2l==v0.16.0  
IPython==7.16.1  
numpy==1.19.5  
matplotlib==3.3.4  
torch==1.8.1+cu101  
torchvision==0.9.1+cu101  
scikit-learn==0.24.1  
seaborn==0.11.1

If I inspect the [Dockerfile](https://github.com/aws/sagemaker-pytorch-training-toolkit/blob/master/docker/1.5.0/py3/Dockerfile.gpu) for the Sagemaker PyTorch container, I notice that it installs many of the libraries above, such as:    
IPython, PyTorch with CUDA, torchvision, scikit-learn, numpy 
  
So for training in Sagemaker, I only need to inject into the Sagemaker PyTorch container the missing libraries that my training code depends on. The handy thing is that the Sagemaker SDK can take a "requirements.txt" as input at container startup and install the specified libraries. All I need to do is to have the **"requirements.txt" file in the "pytorch_script" folder.**  The only additional library I need for training is d2l, which I'll include in my new "requirements.txt" file for training.
    
For this notebook, I will trim the list down to only those libraries that are required for data prep. So below is my new "requirements.txt" file contents:  
  
    
Contents of my new "requirements.txt" for this notebook instance  
IPython==7.16.1  
numpy==1.19.5   
scikit-learn==0.24.1  
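
To see which of the original libraries an environment already provides (this notebook kernel, or the training container if the same check is run from the training script), a minimal sketch using the standard library's `importlib.metadata` could look like the following. The helper name `installed_versions` is hypothetical, and the package list simply repeats the names from my original "requirements.txt" without the version pins.

```python
from importlib import metadata


def installed_versions(package_names):
    """Return {package: installed version, or None if the package is missing}."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names from the original requirements.txt (version pins omitted)
original = ["celluloid", "d2l", "ipython", "numpy", "matplotlib",
            "torch", "torchvision", "scikit-learn", "seaborn"]

for name, version in installed_versions(original).items():
    print(f"{name}: {version or 'not installed'}")
```

Anything reported as not installed, and that the code in that environment actually imports, is a candidate for that environment's requirements file.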

In [None]:
!pip install -r ../requirements-notebook.txt

Next I want to validate which of the original imports are still required in this notebook and which ones are now redundant because training has been externalized to Sagemaker.

In [None]:
# From Original imports

### [Delete from original] from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

# Import plotting libraries [Don't need since not doing training in Notebook]
## import seaborn as sns 
## import matplotlib.pyplot as plt

# Import Pytorch [Don't need since not doing training in Notebook]
## import torch
## from torch import nn
## from d2l import torch as d2l
## import torchvision
## from torch.utils import data
## from torchvision import transforms
## import torch.nn.functional as F

# Import helper libraries
import random
import numpy as np
import pandas as pd
import boto3
import pickle
import time

In [None]:
# Add new imports
import sagemaker
import os
import subprocess
import sys
from sagemaker.pytorch import PyTorch

Update Sagemaker to the latest version

In [None]:
!pip install -U sagemaker

## Step 3: Prepare Training & Validation data for Sagemaker Training

The following 3 cells don't change from the original. I still need to process the PKL training file and split it into training and validation sets. Afterwards, however, I'll need to upload the data to S3.

In [None]:
# S3 bucket that holds the datasets and the training code
bucketname = '[YOUR S3 BUCKET]' # replace with your S3 bucket name

In [None]:
# local data paths for temp storage of train and validation files
train_dir = os.path.join(os.getcwd(), "data/train")
test_dir = os.path.join(os.getcwd(), "data/test")
os.makedirs(train_dir, exist_ok=True)
os.makedirs(test_dir, exist_ok=True)

# S3 paths
s3_prefix = "pytorch-multiclass"
numpy_train_s3_prefix = f"{s3_prefix}/data/train"
numpy_train_s3_uri = f"s3://{bucketname}/{numpy_train_s3_prefix}"
numpy_test_s3_prefix = f"{s3_prefix}/data/test"
numpy_test_s3_uri = f"s3://{bucketname}/{numpy_test_s3_prefix}"

In [None]:
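# Assuming you have the original Training and Test data files in S3 (see my repo's main page for download links).
s3 = boto3.resource('s3')
os.makedirs('../data', exist_ok=True)  # make sure the local download folder exists
s3.Bucket(bucketname).download_file('pytorch-multiclass/data/train/DL1_Train.pkl', '../data/DL1_Train.pkl')

# Load the pickled training data; the context manager closes the file handle afterwards
with open("../data/DL1_Train.pkl", "rb") as f:
    final_train = pickle.load(f, encoding='latin1')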

td = {'DRUM & BASS':0, 'R&B':1, 'BLUES':2, 'VOCAL JAZZ':3, 'NATURE SOUNDS':4, 'BAROQUE':5, 'DISNEY':6, 'HARD ROCK':7}

X = np.array([final_train[key]['PACH'] for key in final_train.keys() if len(final_train[key]['text_genre']) == 1])
y = np.array([td[final_train[key]['text_genre'][0]] for key in final_train.keys() if len(final_train[key]['text_genre']) == 1])

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state = 8675309)

Now, however, I need an additional step: serializing the train and validation datasets and storing them in S3, so that the Sagemaker training container, which runs outside of this notebook, can access the training and validation data.

In [None]:
# Save as Numpy
np.save(os.path.join(train_dir, "x_train.npy"), X_train)
np.save(os.path.join(test_dir, "x_val.npy"), X_val)
np.save(os.path.join(train_dir, "y_train.npy"), y_train)
np.save(os.path.join(test_dir, "y_val.npy"), y_val)

In [None]:
# Upload the Training and Validation data to S3 ready for use by Sagemaker training

s3_resource_bucket = boto3.Session().resource("s3").Bucket(bucketname)

# S3 keys always use "/" as the separator, so build them with f-strings rather than os.path.join
s3_resource_bucket.Object(f"{numpy_train_s3_prefix}/x_train.npy").upload_file(
    os.path.join(train_dir, "x_train.npy")
)
s3_resource_bucket.Object(f"{numpy_train_s3_prefix}/y_train.npy").upload_file(
    os.path.join(train_dir, "y_train.npy")
)
s3_resource_bucket.Object(f"{numpy_test_s3_prefix}/x_val.npy").upload_file(
    os.path.join(test_dir, "x_val.npy")
)
s3_resource_bucket.Object(f"{numpy_test_s3_prefix}/y_val.npy").upload_file(
    os.path.join(test_dir, "y_val.npy")
)
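
The four upload calls above all follow the same pattern, so they could also be driven from a list of (local file, S3 key) pairs. The sketch below only builds and prints that list. The helper name `upload_plan` is hypothetical, and the prefixes and file names are copied from the cells above.

```python
import posixpath


def upload_plan(files_by_prefix):
    """Build (local_path, s3_key) pairs from {s3_prefix: [local file paths]}.

    Hypothetical helper for illustration. S3 keys always use "/" as the
    separator, so posixpath (not os.path) is used to build the key.
    """
    plan = []
    for prefix, local_paths in files_by_prefix.items():
        for local_path in local_paths:
            # Keep only the file name, whichever separator the local path uses
            file_name = local_path.replace("\\", "/").rsplit("/", 1)[-1]
            plan.append((local_path, posixpath.join(prefix, file_name)))
    return plan


plan = upload_plan({
    "pytorch-multiclass/data/train": ["data/train/x_train.npy", "data/train/y_train.npy"],
    "pytorch-multiclass/data/test": ["data/test/x_val.npy", "data/test/y_val.npy"],
})
for local_path, key in plan:
    print(f"{local_path} -> {key}")
    # s3_resource_bucket.upload_file(local_path, key)  # uncomment to upload
```

Printing the plan before uploading is a cheap way to confirm that every file lands under the prefix that the training job will later read from.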

## Step 4: Separating training code and neural net definition from the notebook

Since the Sagemaker training container will run outside of this notebook, the model definition and training code need to be accessible via separate files:
- [pytorch_model_def.py](https://github.com/sjaffry/multiclass-classifier-pytorch-nn/blob/main/pytorch_script/pytorch_model_def.py)
- [train_deploy_pytorch_without_dependencies.py](https://github.com/sjaffry/multiclass-classifier-pytorch-nn/blob/main/pytorch_script/train_deploy_pytorch_without_dependencies.py)  
  
Sagemaker training expects the above files, along with any requirements.txt file, in a single tar.gz archive. Having this tar.gz file in S3 means that I can now create a standardized CI/CD pipeline (MLOps) that creates a new tar.gz file each time I push a commit to any of the 3 files in my repo.  


Download the above two files and the requirements.txt file, create a tar.gz archive, and upload it to S3.

In [None]:
!wget -q https://raw.githubusercontent.com/sjaffry/multiclass-classifier-pytorch-nn/main/pytorch_script/pytorch_model_def.py
!wget -q https://raw.githubusercontent.com/sjaffry/multiclass-classifier-pytorch-nn/main/pytorch_script/train_deploy_pytorch_without_dependencies.py
!wget -q https://raw.githubusercontent.com/sjaffry/multiclass-classifier-pytorch-nn/main/pytorch_script/requirements.txt

In [None]:
!tar -czf pytorch_script.tar.gz pytorch_model_def.py train_deploy_pytorch_without_dependencies.py requirements.txt
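
For a CI/CD pipeline where shelling out to `tar` is not convenient, the same archive could be built with Python's standard `tarfile` module. This is a minimal sketch, not the method used in this notebook. The helper name `build_code_archive` is hypothetical, and it assumes the three files should sit at the top level of the archive, as they do with the `tar` command above.

```python
import tarfile
from pathlib import Path


def build_code_archive(archive_path, file_paths):
    """Create a gzip-compressed tar archive with each file at its top level.

    Returns the member names so the layout can be checked before uploading.
    """
    with tarfile.open(archive_path, "w:gz") as archive:
        for file_path in file_paths:
            # arcname drops any directory part so the file sits at the archive root
            archive.add(file_path, arcname=Path(file_path).name)
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


script_files = [
    "pytorch_model_def.py",
    "train_deploy_pytorch_without_dependencies.py",
    "requirements.txt",
]
# Only build the archive once all three files have been downloaded
if all(Path(name).exists() for name in script_files):
    print(build_code_archive("pytorch_script.tar.gz", script_files))
```

Checking the returned member names is a quick guard against accidentally nesting the entry point inside a subfolder of the archive.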

In [None]:
s3_resource_bucket.Object(f"{s3_prefix}/pytorch_script.tar.gz").upload_file(
    "pytorch_script.tar.gz"
)

code_file_uri = f"s3://{bucketname}/{s3_prefix}/pytorch_script.tar.gz"

### Neural net
I will put the model definition in a separate file named [**pytorch_model_def.py**](https://github.com/sjaffry/multiclass-classifier-pytorch-nn/blob/main/pytorch_script/pytorch_model_def.py), so I no longer need the model definition in the notebook.  
*(The code is shown commented out to make the comparison with the previous, non-Sagemaker notebook easier.)*

In [None]:
# net = nn.Sequential(nn.Flatten(), 
#                     nn.Dropout(.2), 
#                     nn.Linear(4096, 1024), 
#                     nn.ReLU(),
#                     nn.BatchNorm1d(1024), 
#                     nn.Dropout(.5), 
#                     nn.Linear(1024, 8))

### Model accuracy function
I will merge this function into the separate file I will create for training, named [**train_deploy_pytorch_without_dependencies.py**](https://github.com/sjaffry/multiclass-classifier-pytorch-nn/blob/main/pytorch_script/train_deploy_pytorch_without_dependencies.py).  
*(The code is shown commented out to make the comparison with the previous, non-Sagemaker notebook easier.)*

In [None]:
# def evaluate_accuracy_gpu(net, data_iter, device=None): #@save
#    """Compute the accuracy for a model on a dataset using a GPU."""
#    if isinstance(net, torch.nn.Module):
#        net.eval()  # Set the model to evaluation mode
#        if not device:
#            device = next(iter(net.parameters())).device
#    # No. of correct predictions, no. of predictions
#    metric = d2l.Accumulator(2)
#    for X, y in data_iter:
#        if isinstance(X, list):
#            # Required for BERT Fine-tuning (to be covered later)
#            X = [x.to(device) for x in X]
#        else:
#            X = X.to(device)
#        y = y.to(device)
#        metric.add(d2l.accuracy(net(X), y), d2l.size(y))
#    return metric[0] / metric[1]

### Training
I will create a separate file named [**train_deploy_pytorch_without_dependencies.py**](https://github.com/sjaffry/multiclass-classifier-pytorch-nn/blob/main/pytorch_script/train_deploy_pytorch_without_dependencies.py) that will contain the training code below.  
*(The code is shown commented out to make the comparison with the previous, non-Sagemaker notebook easier.)*

In [None]:
# def train_model(net, train_iter, test_iter, num_epochs = 20, device=d2l.try_gpu(), lrate=0.005):
#     """Train a model with a GPU (defined in Chapter 6)."""
#     def init_weights(m):
#         if type(m) == nn.Linear:
#             nn.init.kaiming_normal_(m.weight, mode='fan_out')
#     net.apply(init_weights)
#     print('training on', device)
#     net.to(device)
#     optimizer = torch.optim.Adam(net.parameters(), lr=lrate)
#     loss = nn.CrossEntropyLoss()
#     test_losses = []
#     animator = d2l.Animator(xlabel='epoch', xlim=[0, num_epochs],
#                             legend=['val acc'])
#     timer = d2l.Timer()
#     for epoch in range(num_epochs):
#         metric = d2l.Accumulator(2)
#         net.train()
#         for i, (X, y) in enumerate(final_train_loader):
#             timer.start()
#             optimizer.zero_grad()
#             X, y = X.to(device), y.to(device)
#             y_hat = net(X)
#             l = loss(y_hat, y)
#             l.backward()
#             optimizer.step()
#             metric.add(l.sum(),  X.shape[0])
#             timer.stop()
#             train_loss = metric[0]/metric[1]
#         test_acc = evaluate_accuracy_gpu(net, final_val_loader)
#         animator.add(epoch+1, (test_acc))
#     print('validation acc %.3f' % (test_acc))
#     print('%.1f examples/sec on %s' % (metric[1]*num_epochs/timer.sum(), device))

In [None]:
# torch.manual_seed(8675309)

In [None]:
# %%time
# train_model(net, final_train_loader, final_val_loader, num_epochs = 70)

## Step 5: Set up parameters and start Sagemaker training job

### Your AWS env parameters

In [None]:
random.seed(42)

# Useful SageMaker variables
try:
    # You're using a SageMaker notebook
    sess = sagemaker.Session()
    bucket = bucketname
    role = sagemaker.get_execution_role()
except ValueError:
    # You're using a notebook somewhere else
    print("Setting role and SageMaker session manually...")
    bucket = bucketname
    region = "us-east-1"

    # Set up the default session first so the clients below pick up the region and profile
    boto3.setup_default_session(region_name=region, profile_name="default")
    iam = boto3.client("iam")
    sagemaker_client = boto3.client("sagemaker")

    sagemaker_execution_role_name = (
        "AmazonSageMaker-ExecutionRole-20191005T132574"  # Change this to your role name
    )
    role = iam.get_role(RoleName=sagemaker_execution_role_name)["Role"]["Arn"]
    sess = sagemaker.Session(sagemaker_client=sagemaker_client, default_bucket=bucket)

# Endpoint names
pytorch_endpoint_name = "pytorch-endpoint"

### Start model training as a Sagemaker training job
I'm using the Sagemaker PyTorch SDK to create a training job within Sagemaker Training. The PyTorch SDK will do the following:
- create a new instance as per the parameter specification
- download your training and validation data from S3 (numpy_train_s3_uri, numpy_test_s3_uri)
- download your custom training code and model definition from S3 (pytorch_script.tar.gz)
- download and start a PyTorch container
- inject your custom training code, model definition, and training and validation data to run training
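
On the receiving end, the entry-point script has to pick up what the SDK injects. The sketch below shows one common way to do that with `argparse`. It is not a copy of my actual `train_deploy_pytorch_without_dependencies.py`. It assumes that hyperparameters arrive as command-line flags named after the keys of the hyperparameters dictionary and that channel and model locations are read from the `SM_CHANNEL_*` and `SM_MODEL_DIR` environment variables. The default values and local fallback paths are made up for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and data locations the way a script-mode entry point might."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line flags, e.g. --epochs 50
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    # Data and model locations come from environment variables set in the container;
    # the fallbacks (assumed paths) allow running the script outside a container
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "data/train"))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TEST", "data/test"))
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    # parse_known_args ignores any extra flags that may be passed to the script
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "50", "--batch_size", "100", "--learning_rate", "0.01"])
print(args.epochs, args.batch_size, args.learning_rate)
```

Reading the locations from the environment with local fallbacks means the same script can be smoke-tested on a laptop before it is submitted as a training job.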

In [None]:
hyperparameters = {"epochs": 50, "batch_size": 100, "learning_rate": 0.01}

train_instance_type = "ml.g4dn.xlarge"
inputs = {"train": numpy_train_s3_uri, "test": numpy_test_s3_uri}

estimator_parameters = {
    "entry_point": "train_deploy_pytorch_without_dependencies.py",
    "source_dir": code_file_uri,
    "instance_type": train_instance_type,
    "instance_count": 1,
    "hyperparameters": hyperparameters,
    "role": role,
    "base_job_name": "pytorch-model",
    "framework_version": "1.10",
    "py_version": "py38",
}

estimator = PyTorch(**estimator_parameters)
estimator.fit(inputs)
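
Once `fit` returns, the estimator knows the name of the training job and where the model artifact was written in S3. The sketch below reads both through the estimator's `latest_training_job.name` and `model_data` attributes. The helper name `summarize_training` is hypothetical, and the stand-in object with its example values exists only so that the snippet runs without AWS access.

```python
from types import SimpleNamespace


def summarize_training(estimator):
    """Collect the training job name and the S3 location of the model artifact."""
    return {
        "job_name": estimator.latest_training_job.name,
        "model_artifact": estimator.model_data,
    }


# Stand-in object with made-up values so the sketch runs without AWS access;
# with a real job, pass the fitted `estimator` from the cell above instead.
fake_estimator = SimpleNamespace(
    latest_training_job=SimpleNamespace(name="pytorch-model-example"),
    model_data="s3://example-bucket/pytorch-model-example/output/model.tar.gz",
)
print(summarize_training(fake_estimator))
```

Recording these two values alongside the hyperparameters makes it easier to trace a trained model back to the job that produced it.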