# Train and Deploy Custom Model in AWS

## Project: Train, Evaluate and Deploy Dog Identification App in SageMaker
---
### Why We're Here 
In this notebook, we train and deploy a **custom model** in SageMaker. Specifically, we use the pretrained PyTorch model from the [Dog Breed Classifier](https://github.com/reedemus/dog_breed_classifier) project as the example for this exercise.
### The Road Ahead

We break the notebook into separate steps. Feel free to use the links below to navigate the notebook.

* [Step 0](#step0): Install required packages
* [Step 1](#step1): Upload the dataset into an S3 bucket
* [Step 2](#step2): Create the custom model
* [Step 3](#step3): Create the training script
* [Step 4](#step4): Training and deploying the custom model
* [Step 5](#step5): Evaluating the performance
---
<a id='step0'></a>
## Step 0: Install required packages
Install missing packages and dependencies in the instance.

In [1]:
!pip install -r code/requirements.txt

Collecting torch>=1.10.0
  Downloading torch-1.10.2-cp36-cp36m-manylinux1_x86_64.whl (881.9 MB)
Collecting torchvision>=0.11.0
  Downloading torchvision-0.11.2-cp36-cp36m-manylinux1_x86_64.whl (23.3 MB)
Collecting torch>=1.10.0
  Downloading torch-1.10.1-cp36-cp36m-manylinux1_x86_64.whl (881.9 MB)
Installing collected packages: torch, torchvision
  Attempting uninstall: torch
    Found existing installation: torch 1.4.0
    Uninstalling torch-1.4.0:
      Successfully uninstalled torch-1.4.0
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.5.0
    Uninstalling torchvision-0.5.0:
     

---
<a id='step1'></a>
## Step 1: Upload the dataset to S3

We import the AWS SageMaker libraries and define helper functions for handling the dataset. We then download the dog dataset from [this URL](https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip) and extract it before uploading it to the bucket.

In [1]:
# import the required libraries
import os
import requests
import boto3
import sagemaker
from zipfile import ZipFile
from tqdm.notebook import tqdm_notebook as tqdm

Define `downloadFile` and `extractFile` helper functions to download the dataset.

In [3]:
def downloadFile(file_url, file_name, dir=None, chunk_size=1024):
    '''Helper function to download file to specified directory

    :param file_url: file download URL
    :param file_name: name under which the file is saved
    :param dir: directory to save the file in (Default = current working directory)
    :param chunk_size: size of file chunk to download (Default = 1024 bytes)
    :returns: None
    '''
    saved_file_path = file_name
    if dir is not None:
        # create the target directory if needed, then save the file inside it
        os.makedirs(dir, exist_ok=True)
        saved_file_path = os.path.join(dir, file_name)

    r = requests.get(file_url, stream=True)
    r.raise_for_status()
    # read the size from the headers; r.content would load the whole file into memory
    total_size_in_bytes = int(r.headers.get('content-length', 0))
    progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True, desc=file_name)

    with open(saved_file_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size):
            # skip keep-alive chunks and write the rest one chunk at a time
            if chunk:
                f.write(chunk)
                progress_bar.update(len(chunk))
    progress_bar.close()
    if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
        print("ERROR: downloaded size does not match the expected file size")

def extractFile(file_name):
    '''Extracts compressed file in zip format into current directory
    
    :param file_name: file name
    :returns: None
    '''
    # create a zipfile object and extract it to current directory
    print("Extracting file...")
    with ZipFile(file_name, 'r') as z:
        z.extractall()


Download the dataset into the current directory. The default folder after extraction is `dogImages/`.

In [4]:
from glob import glob
import numpy as np

dog_url = 'https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip'

downloadFile(dog_url, 'dogImages.zip')
extractFile('dogImages.zip')

# load filenames for dog images
dog_files = np.array(glob("dogImages/*/*/*"))

# print the total number of images in the dataset
print('There are %d total dog images.' % len(dog_files))

dogImages.zip:   0%|          | 0.00/1.13G [00:00<?, ?iB/s]

Extracting file...
There are 8351 total dog images.
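
The glob pattern `dogImages/*/*/*` implies a `<split>/<breed>/<image>` layout. As a minimal sketch under that assumption, the helper below tallies images and breed folders per split. The name `count_images_per_split` is illustrative and not part of the project.

```python
from collections import Counter
from pathlib import Path


def count_images_per_split(root):
    """Return ({split: image_count}, {split: breed_folder_count}) for a
    directory laid out as <root>/<split>/<breed>/<image>."""
    image_counts = Counter()
    breed_counts = Counter()
    for split_dir in sorted(Path(root).iterdir()):
        if not split_dir.is_dir():
            continue
        for breed_dir in split_dir.iterdir():
            if not breed_dir.is_dir():
                continue
            # one folder per breed, one file per image
            breed_counts[split_dir.name] += 1
            image_counts[split_dir.name] += sum(
                1 for f in breed_dir.iterdir() if f.is_file())
    return dict(image_counts), dict(breed_counts)
```

Calling `count_images_per_split('dogImages')` after extraction gives a quick check that every split was unpacked and contains the same number of breed folders.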


Let's start by creating a SageMaker session and specifying:

- The S3 bucket and prefix to use for training and model data. The bucket should be in the same region as the notebook instance, training, and hosting.
- The IAM role ARN that gives training and hosting access to your data. See the AWS documentation for how to create one. Note: if notebook instances, training, and/or hosting require different roles, replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s).

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# use the session's default S3 bucket (created if it does not exist yet)
bucket = sagemaker_session.default_bucket()

# Name of the dataset directory
data_dir = 'dogImages'

# set prefix, a descriptive name for a directory  
prefix = 'dog-breed-classifier'

# upload all data to S3
input_dataset = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)
print(input_dataset)

s3://sagemaker-ap-south-1-461678052840/dog-breed-classifier
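
The printed URI is the bucket name plus `prefix`. The folder structure under `dogImages/` is assumed to be preserved beneath that prefix, which is what the `train`, `valid` and `test` paths used for training rely on. The sketch below only mimics that assumed mapping locally; `expected_s3_keys` is a hypothetical helper and makes no AWS calls.

```python
import os


def expected_s3_keys(local_dir, key_prefix):
    """List the S3 keys a recursive upload of local_dir is assumed to produce:
    each file maps to <key_prefix>/<path relative to local_dir>."""
    keys = []
    for dirpath, _dirnames, filenames in os.walk(local_dir):
        rel_dir = os.path.relpath(dirpath, start=local_dir)
        for name in filenames:
            rel_path = name if rel_dir == '.' else os.path.join(rel_dir, name)
            # S3 keys always use forward slashes, whatever the local OS uses
            keys.append(key_prefix + '/' + rel_path.replace(os.sep, '/'))
    return sorted(keys)
```

For example, `expected_s3_keys('dogImages', 'dog-breed-classifier')[:5]` shows the first few keys to look for in the bucket.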


## Test cell
Test that our data has been successfully uploaded. The cell below prints the items in the S3 bucket and raises an error if the bucket is empty. We should see the contents of `data_dir` and perhaps some checkpoints. Any other files listed are probably old model files, which can be deleted via the S3 console (though additional files shouldn't affect the performance of the model developed in this notebook).


In [None]:
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) != 0, 'S3 bucket is empty.'
print('Test passed!')
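
The check above lists every object in the bucket. As a hedged alternative, the sketch below restricts the listing to the dataset prefix and counts objects per split. `count_keys_by_split` is a hypothetical helper; it expects an object that behaves like `boto3.resource('s3').Bucket(name)` and assumes keys of the form `<prefix>/<split>/<breed>/<file>`.

```python
from collections import Counter


def count_keys_by_split(s3_bucket, prefix):
    """Count objects stored under <prefix>/<split>/... in an S3 bucket.

    s3_bucket is expected to behave like boto3.resource('s3').Bucket(name).
    """
    counts = Counter()
    for obj in s3_bucket.objects.filter(Prefix=prefix + '/'):
        # the key looks like '<prefix>/<split>/<breed>/<file>'
        remainder = obj.key[len(prefix) + 1:]
        split = remainder.split('/', 1)[0]
        counts[split] += 1
    return dict(counts)
```

In the notebook this would be called as `count_keys_by_split(boto3.resource('s3').Bucket(bucket), prefix)`. Anything else stored under the same prefix, such as model artifacts from earlier training jobs, shows up as an extra entry in the result.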

---
<a id='step2'></a>
## Step 2: Create the custom model
Create a CNN model that classifies dog breeds using transfer learning. The model is defined in `model.py`.

In [7]:
# Print the implementation using a Python syntax highlighter package
!pygmentize 'code/model.py'

import torchvision.models as models
import torch.nn as nn

# Using feature extraction approach
# Freeze the weights for all of the network except the final fully connected(FC) layer.
# This last FC layer is replaced with a new one with random weights and only this layer is trained.
# https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html#initialize-and-reshape-the-networks

class DogBreedClassifier:
    '''
    Pretrained Resnet model with the output features in the last layer set to 133 nodes, which is the number of dog breed classes
    '''
    def __init__

---
<a id='step3'></a>
## Step 3: Create the training script
Once the model is developed, we implement the training script `train.py`. The script performs the following steps:

1. Loads training data from a specified directory
2. Parses any training and model hyperparameters (e.g. nodes in a neural network, training epochs, etc.)
3. Instantiates a model of your design, with any specified hyperparameters
4. Trains that model
5. Finally, saves the model so that it can be hosted/deployed later

From the code below, notice a few things:

- Model loading (`model_fn`) and saving code
- Getting SageMaker's default hyperparameters
- Loading the training data

If you'd like to read more about saving models with [torch.save](https://pytorch.org/tutorials/beginner/saving_loading_models.html), follow the link.
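
To illustrate the hyperparameter and data-loading points, here is a minimal sketch of how a SageMaker training script typically reads its inputs. It is not the project's `train.py`: the function name, the default values and the local fallback paths are assumptions. The argument names mirror the hyperparameters used in this notebook (`epochs`, `batch-size`, `lr`), and `SM_MODEL_DIR` and `SM_CHANNEL_<NAME>` are environment variables that SageMaker sets inside the training container.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided locations."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as command-line arguments;
    # the default values here are placeholders, not the project's
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch-size', type=int, default=64)
    parser.add_argument('--lr', type=float, default=0.001)

    # Locations come from environment variables set in the training
    # container; the fallbacks allow running the script locally
    parser.add_argument('--model-dir', default=os.environ.get('SM_MODEL_DIR', 'model'))
    parser.add_argument('--train-dir', default=os.environ.get('SM_CHANNEL_TRAIN', 'dogImages/train'))
    parser.add_argument('--valid-dir', default=os.environ.get('SM_CHANNEL_VALID', 'dogImages/valid'))
    parser.add_argument('--test-dir', default=os.environ.get('SM_CHANNEL_TEST', 'dogImages/test'))

    # Ignore any extra arguments the platform may append
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Note that `argparse` converts `--batch-size` into the attribute `args.batch_size`. Reading the paths from the environment keeps the same script usable both locally and inside a training job.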

In [8]:
# Print the implementation using a Python syntax highlighter package
!pygmentize 'code/train.py'

import argparse
import json
import os
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# the following import is required for training to be robust to truncated images
from PIL im

---
<a id='step4'></a>
## Step 4: Create an Estimator
When a custom model is constructed in SageMaker, an entry point must be specified. We need to provide a training script `train.py` which will be executed when the model is trained. To run the script, create a PyTorch `Estimator` and fill in the appropriate constructor arguments:

- *entry_point*: The path to the Python script SageMaker runs for training and prediction.
- *source_dir*: The path to the directory containing the training script (here, `code`).
- *role*: The role ARN, which was specified above.
- *py_version*: The version of Python.
- *framework_version*: The version of PyTorch.
- *instance_count*: The number of training instances (should be left at 1).
- *instance_type*: The type of SageMaker instance to use for training.
>Note: we could use the same instance that is running this notebook if desired
- *sagemaker_session*: The session used to train on SageMaker.
- *hyperparameters (optional)*: A dictionary `{'name': value, ...}` passed to the training script as hyperparameters.

### Define the estimator

In [4]:
# Define a PyTorch estimator
from sagemaker.pytorch import PyTorch

# specify an output path
# prefix is specified above
output_path = 's3://{}/{}'.format(bucket, prefix)

# instantiate the estimator
estimator = PyTorch(entry_point='train.py',
                    source_dir='code', # directory containing train.py
                    role=role,
                    py_version='py38',
                    framework_version='1.10.0', # PyTorch version
                    instance_count=1,
                    instance_type='ml.g4dn.xlarge',
                    output_path=output_path,
                    sagemaker_session=sagemaker_session,
                    hyperparameters={
                        'epochs': 50,
                        'batch-size': 64,
                        'lr': 0.001
                    })
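
The `hyperparameters` dictionary reaches the training script as command-line arguments. As a rough sketch of that hand-off (the helper `to_cli_args` is hypothetical; the real conversion is done by the SageMaker training toolkit and may differ in ordering and formatting details):

```python
def to_cli_args(hyperparameters):
    """Turn {'name': value, ...} into ['--name', 'value', ...]."""
    args = []
    for name, value in hyperparameters.items():
        # every value is passed as a string; the script converts it back
        args.extend(['--' + name, str(value)])
    return args


cli_args = to_cli_args({'epochs': 50, 'batch-size': 64, 'lr': 0.001})
print(' '.join(cli_args))
```

This is why the keys in the dictionary must match the argument names that `train.py` expects to parse.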

### Train the estimator
Train your estimator on the training data stored in S3. This should create a training job that you can monitor in your SageMaker console.

In [5]:
%%time
import os

# S3 URIs always use '/' as the separator, so build them with string formatting
train_path = '{}/train'.format(input_dataset)
valid_path = '{}/valid'.format(input_dataset)
test_path = '{}/test'.format(input_dataset)
print(train_path)
print(valid_path)
print(test_path)

# Train your estimator on S3 training data
estimator.fit({ 'train': train_path,
                'valid': valid_path,
                'test': test_path
              })

s3://sagemaker-ap-south-1-461678052840/dog-breed-classifier/train
s3://sagemaker-ap-south-1-461678052840/dog-breed-classifier/valid
s3://sagemaker-ap-south-1-461678052840/dog-breed-classifier/test
2022-05-20 07:12:53 Starting - Starting the training job...
2022-05-20 07:13:20 Starting - Preparing the instances for trainingProfilerReport-1653030772: InProgress
......
2022-05-20 07:14:20 Downloading - Downloading input data.........
2022-05-20 07:15:41 Training - Downloading the training image..................
2022-05-20 07:18:55 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-05-20 07:18:57,591 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-05-20 07:18:57,614 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-05-20 07:18

### Deploy the estimator

After training, deploy your model to create a predictor. This notebook deploys directly from the trained estimator with `estimator.deploy`. Alternatively, for a PyTorch model you can create a `PyTorchModel` that accepts the trained `<model>.model_data` as an input parameter and points to a separate inference script as its entry point.

Here, `estimator.deploy` takes two arguments:

- initial_instance_count: The number of deployed instances (1).
- instance_type: The type of SageMaker instance for deployment.
>Note: If you run into an instance error, it may be because you chose the wrong training or deployment instance_type. It may help to refer to your previous exercise code to see which types of instances we used.

In [None]:
%%time

# from sagemaker.pytorch import PyTorchModel


# deploy your model to create a predictor
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.t2.medium')