# Skin Cancer Training using MONAI

## Overview

HAM10000 ("Human Against Machine with 10000 training images") is a popular data set of dermatoscopic images from different populations, hosted by [Harvard Dataverse](https://dataverse.harvard.edu/).  It contains 10015 images spanning several diagnostic categories: actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc).

In this example we will demonstrate how to integrate the [MONAI](http://monai.io) framework into Amazon SageMaker using PyTorch, and give example code for MONAI pre-processing transforms that can assist with imbalanced datasets and image transformations.  We will also show the code to invoke MONAI neural network architectures such as DenseNet for image classification, and explore the structure of the PyTorch code used to train and serve the model within SageMaker.  Additionally, we will cover the SageMaker API calls to launch and manage the compute infrastructure for both model training and hosting for inference using the HAM10000 data set.

For more information about PyTorch in SageMaker, please visit the [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers) and [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) GitHub repositories.

---

## Setup

This notebook was created and tested on an ml.t2.medium notebook instance with 100 GB of EBS and the conda_pytorch_p36 kernel.

Let's get started by creating an S3 bucket and uploading the HAM10000 dataset to the bucket.

<ol>
<li>Create an S3 bucket in the same account as the SageMaker notebook instance.
<li>Download the skin cancer dataset at <a href="https://www.kaggle.com/datasets/kmader/skin-cancer-mnist-ham10000/">HAM10000</a>.
<li>Select "Access Dataset" in top right, and select "Original Format Zip".
<li>Upload the dataset to the S3 bucket created in step 1.
<li>Update the set.env file located in the current directory with the S3 location of the dataverse_files.zip.
</ol>

The code below will install the MONAI framework and its dependent packages and set up environment variables.

In [None]:
# Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

!pip install -r prerequisite/dependency.txt

In [None]:
import os
from pathlib import Path
from dotenv import load_dotenv
# Load the settings file described in the setup steps above.
env_path = Path('.') / 'set.env'
load_dotenv(dotenv_path=env_path)

skin_cancer_bucket=os.environ.get('SKIN_CANCER_BUCKET')
skin_cancer_bucket_path=os.environ.get('SKIN_CANCER_BUCKET_PATH')
skin_cancer_files=os.environ.get('SKIN_CANCER_FILES')
skin_cancer_files_ext=os.environ.get('SKIN_CANCER_FILES_EXT')
base_dir = os.environ.get('BASE_DIR')

print('Skin Cancer Bucket: '+skin_cancer_bucket)
print('Skin Cancer Bucket Prefix: '+skin_cancer_bucket_path)
print('Skin Cancer Files: '+skin_cancer_files)
print('Skin Cancer Files Ext: '+skin_cancer_files_ext)
print('Base Dir: '+base_dir)
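
The cell above concatenates each value into a string, which raises a `TypeError` if a variable is missing from the env file. The following is a minimal sketch of an up-front check, assuming the five variable names used above. The helper name `require_env` and the sample values are hypothetical and are not part of the notebook.

```python
import os

# Variable names taken from the cell above.
REQUIRED_VARS = [
    "SKIN_CANCER_BUCKET",
    "SKIN_CANCER_BUCKET_PATH",
    "SKIN_CANCER_FILES",
    "SKIN_CANCER_FILES_EXT",
    "BASE_DIR",
]


def require_env(names, environ=None):
    """Return {name: value} for every name, or raise listing the missing ones."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Hypothetical values, for illustration only.
sample_env = {
    "SKIN_CANCER_BUCKET": "my-example-bucket",
    "SKIN_CANCER_BUCKET_PATH": "skin-cancer",
    "SKIN_CANCER_FILES": "dataverse_files",
    "SKIN_CANCER_FILES_EXT": "dataverse_files.zip",
    "BASE_DIR": "/tmp/data/",
}
settings = require_env(REQUIRED_VARS, sample_env)
```

In the notebook, calling `require_env(REQUIRED_VARS)` right after `load_dotenv` would fail early with a message naming the missing settings.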

## HAM10000 Data Transformation

The datatransfer.ipynb notebook downloads dataverse_files.zip and uses the metadata to build per-class directories for the training and validation sets.  It also augments the data to create a more balanced data set across the classes for training.  Finally, it uploads the transformed dataset, HAM10000.tar.gz, to the same S3 bucket identified in set.env for model training.

In [None]:
%run prerequisite/datatransfer.ipynb
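
One way to picture the balancing step is to compute, for each class, how many augmented images are needed to bring it up to the size of the largest class. The sketch below is an illustration only: the function name `augmentation_targets` and the label counts are hypothetical, and it is not the code used by the transformation notebook.

```python
from collections import Counter


def augmentation_targets(labels):
    """Return {class: extra_images_needed} so every class matches the largest one."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: largest - n for cls, n in counts.items()}


# Hypothetical, tiny label list (not the real HAM10000 class counts).
labels = ["nv"] * 6 + ["mel"] * 2 + ["df"] * 1
extra = augmentation_targets(labels)
print(extra)
```

Classes with a large deficit would receive more augmented copies (for example flips or rotations), while the majority class is left unchanged.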

## Data

### Create SageMaker session and S3 location for transformed HAM10000 dataset

In [None]:
import sagemaker

smSession = sagemaker.Session()
role = sagemaker.get_execution_role()

mydata = smSession.upload_data(path=base_dir+'HAM10000.tar.gz', bucket=skin_cancer_bucket, key_prefix=skin_cancer_bucket_path)
print('Training data uploaded to: {}'.format(mydata))

## Train Model
### Training

The ```python_sc.py``` script provides all the code we need for training and hosting a SageMaker model (model_fn function to load a model). The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:

* SM_MODEL_DIR: A string representing the path to the directory to write model artifacts to. These artifacts are uploaded to S3 for model hosting.
* SM_NUM_GPUS: The number of GPUs available in the current container.
* SM_CURRENT_HOST: The name of the current container on the container network.
* SM_HOSTS: JSON encoded list containing all the hosts.

Supposing one input channel, 'training', was used in the call to the PyTorch estimator's fit() method, the following will be set, following the format SM_CHANNEL_[channel_name]:

* SM_CHANNEL_TRAINING: A string representing the path to the directory containing data in the 'training' channel.

For more information about training environment variables, please visit [SageMaker Containers](https://github.com/aws/sagemaker-containers).

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to model_dir so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an argparse.ArgumentParser instance.

Because SageMaker imports the training script, you should put your training code in a main guard (`if __name__ == '__main__':`) if you are using the same script to host your model, as we do in this example, so that SageMaker does not inadvertently run your training code at the wrong point in execution.
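
The pieces described above (environment variables, `argparse` hyperparameters, and the main guard) fit together roughly as in the following skeleton. This is a minimal sketch and not the contents of `python_sc.py`: the function names, argument names, and default values are assumptions, `train` is a placeholder, and the `SM_CHANNEL_TRAINING` name assumes a channel called 'training'.

```python
import argparse
import os


def parse_args(argv=None, environ=None):
    """Read hyperparameters from the command line and paths from the environment."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments.
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--backend", type=str, default=None)
    # Paths and hardware details default to the SageMaker environment variables.
    parser.add_argument("--model-dir", type=str, default=environ.get("SM_MODEL_DIR"))
    parser.add_argument("--data-dir", type=str, default=environ.get("SM_CHANNEL_TRAINING"))
    parser.add_argument("--num-gpus", type=int, default=int(environ.get("SM_NUM_GPUS", 0)))
    # parse_known_args ignores any arguments this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


def train(args):
    # Placeholder: a real script would load data, train, and save to args.model_dir.
    return "would train for {} epochs on data in {}".format(args.epochs, args.data_dir)


if __name__ == "__main__":
    # Runs only when executed as a script, not when imported for hosting.
    print(train(parse_args()))
```

Passing `argv` and `environ` explicitly, as in `parse_args(["--epochs", "30"], {"SM_MODEL_DIR": "/opt/ml/model"})`, makes the parsing logic easy to exercise outside a training container.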

MONAI includes deep neural networks such as UNet, DenseNet, GAN and others, and provides sliding window inference for large medical image volumes.  In the skin cancer image classification model, we train the MONAI DenseNet model on the skin cancer images for thirty epochs while measuring loss.

In [None]:
!pygmentize prerequisite/python_sc.py

## Run training in SageMaker

The `PyTorch` class allows us to run our training function as a training job on SageMaker infrastructure.  We need to configure it with our training script, an IAM role, the number of training instances, the training instance type, and hyperparameters.  In this case we are going to run our training job on an ```ml.p3.8xlarge``` instance, but this example can be run on one or multiple CPU or GPU instances ([full list of available instances](https://aws.amazon.com/sagemaker/pricing/instance-types/)).  The hyperparameters parameter is a dict of values that will be passed to your training script -- you can see how to access these values in the ```python_sc.py``` script above.

In [None]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='python_sc.py',
                    source_dir='prerequisite',
                    role=role,
                    framework_version='1.5.0',
                    py_version='py3',
                    instance_count=1,
                    instance_type='ml.p3.8xlarge',
                    hyperparameters={
                        'backend': 'gloo',
                        'epochs': 30
                    })

After we've constructed our PyTorch object, we can fit it using the HAM10000 dataset we uploaded to S3. SageMaker will download the data to the local filesystem, so our training script can simply read the data from disk.

In [None]:
estimator.fit({'train': mydata})

## Host Model
### Create real-time endpoint

After training, we use the ``PyTorch`` estimator object to build and deploy a PyTorchPredictor. This creates a SageMaker Endpoint -- a hosted prediction service that we can use to perform inference.

As mentioned above, the python_sc.py script contains the required implementation of `model_fn`. We are going to use the default implementations of `input_fn`, `predict_fn`, `output_fn` and `transform_fn` defined in [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers).

The arguments to the deploy function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances and then deploy the Endpoint to a fleet of CPU-based instances, but you need to make sure that you return or save your model as a CPU model, similar to what we did in python_sc.py. Here we will deploy the model to a single ```ml.m5.xlarge``` instance.

In [None]:
modeldetector = estimator.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')

### Load Validation Images for Inference 

In [None]:
from PIL import Image

# Validation images are stored in one sub-directory per class.
folderdata = os.path.join(base_dir, 'HAM10000/folderdata')
namedata = sorted([x for x in os.listdir(folderdata) if os.path.isdir(os.path.join(folderdata, x))])
numberdata = len(namedata)

# Take the first image listed in each class directory.
imagedata = [[os.path.join(folderdata, class_name, x)
              for x in os.listdir(os.path.join(folderdata, class_name))[:1]]
             for class_name in namedata]

# Flatten into parallel lists of file paths and integer class labels.
myimagefile = []
myimagefilelabel = []
for i, class_name in enumerate(namedata):
    myimagefile.extend(imagedata[i])
    myimagefilelabel.extend([i] * len(imagedata[i]))

total = len(myimagefilelabel)
image_width, image_height = Image.open(myimagefile[0]).size

### MONAI Transform Image using Compose and Skin Cancer Dataset

MONAI has transforms that support both Dictionary and Array format and are specialized for the high-dimensionality of medical images.  The transforms include several categories such as Crop & Pad, Intensity, IO, Post-processing, Spatial, and Utilities.  In the following excerpt, the Compose class chains a series of image transforms together and returns a single tensor of the image.

In [None]:
import torch
from torch.utils.data import DataLoader
from prerequisite.dataset_sc import SkinCancerDataset
from monai.transforms import Compose, LoadPNG, Resize, AsChannelFirst, ScaleIntensity, ToTensor

transforms = Compose([
        LoadPNG(image_only=True),
        AsChannelFirst(channel_dim=2),
        ScaleIntensity(),
        Resize(spatial_size=(64,64)),
        ToTensor()
])
    
myds = SkinCancerDataset(myimagefile, myimagefilelabel, transforms)
myloader = DataLoader(myds, batch_size=1, num_workers=1)
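
Conceptually, a composed transform applies each step in order and feeds the output of one step into the next. The sketch below illustrates that chaining idea in plain Python. `SimpleCompose` and the two toy transforms are hypothetical stand-ins written for this illustration; they are not MONAI's `Compose` implementation and operate on lists rather than image tensors.

```python
class SimpleCompose:
    """Apply a list of callables in order, feeding each output to the next."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, data):
        for transform in self.transforms:
            data = transform(data)
        return data


def scale_to_unit(values):
    """Toy intensity scaling: map the values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def take_first_two(values):
    """Toy 'resize': keep only the first two elements."""
    return values[:2]


pipeline = SimpleCompose([scale_to_unit, take_first_two])
result = pipeline([0, 5, 10])
print(result)
```

Because each step is just a callable, the order matters: swapping the two toy transforms here would scale only the two values that remain after truncation.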

### Evaluate
We can now use the modeldetector to perform a real-time inference to classify skin cancer images.

In [None]:
print('Sample inference results by class:')
for i, val_data in enumerate(myloader):
    # Send one image tensor to the endpoint; the response holds the raw class scores.
    response = modeldetector.predict(val_data[0])
    actual_label = val_data[1]
    # Convert the scores to probabilities and take the most likely class.
    pred = torch.nn.functional.softmax(torch.tensor(response), dim=1)
    top_p, top_class = torch.topk(pred, 1)
    print('actual class: '+namedata[actual_label.numpy()[0]])
    print('predicted class: '+namedata[top_class.item()])
    print('predicted class probability: '+str(round(top_p.item(),2)))
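
The post-processing in the loop above converts the endpoint's raw scores into probabilities with softmax and then picks the highest one. The following plain-Python sketch shows the same arithmetic on a single example. The logits are made up for illustration, and the class order assumes the validation directories are named after the seven category codes and sorted alphabetically.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1(probabilities):
    """Return (index, probability) of the most likely class."""
    index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return index, probabilities[index]


# Assumed class order: the seven category codes, sorted alphabetically.
class_names = ["akiec", "bcc", "bkl", "df", "mel", "nv", "vasc"]
# Hypothetical logits, not real model output.
logits = [0.1, 0.3, 0.2, 0.0, 1.5, 2.5, 0.4]

probs = softmax(logits)
index, p = top1(probs)
print("predicted class:", class_names[index])
print("predicted class probability:", round(p, 2))
```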

### Remove endpoint (Optional)
Delete the prediction endpoint to release the instance(s) hosting the model once you have finished with the example.

In [None]:
modeldetector.delete_endpoint()