# Deep Learning with Keras on Amazon SageMaker

Amazon SageMaker is a modular, fully managed Machine Learning service that lets you easily build, train and deploy models at any scale.

This demo uses SageMaker's script mode, managed spot training, Debugger, Automatic Model Tuning, Experiments, and Model Monitor features.

We'll use Keras with the TensorFlow backend to build a simple Convolutional Neural Network (CNN) on Amazon SageMaker and train it to classify the Fashion-MNIST image data set.

Fashion-MNIST is a Zalando dataset consisting of a training set of 60,000 examples and a validation set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes: it's a drop-in replacement for MNIST.

This demo is adapted from the AIM410R/R1 session at AWS re:Invent 2019: https://gitlab.com/juliensimon/aim410

## Resources
  * Amazon SageMaker documentation [ https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html ]
  * SageMaker SDK 
    * Code [ https://github.com/aws/sagemaker-python-sdk ] 
    * Documentation [ https://sagemaker.readthedocs.io/ ]
  * Fashion-MNIST [ https://github.com/zalandoresearch/fashion-mnist ] 
  * Keras documentation [ https://keras.io/ ]
  * NumPy documentation [ https://docs.scipy.org/doc/numpy/index.html ]

## Install and import packages

In [None]:
%%sh
pip install --upgrade pip
pip install smdebug smdebug-rulesconfig # install SageMaker Debugger

In [None]:
# Restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
import sagemaker

bucket = None # Specify bucket or leave 'None' to use default

sess = sagemaker.Session(default_bucket=bucket)
role = sagemaker.get_execution_role()

print(f"Session bucket: {sess.default_bucket()}")

## Download the Fashion-MNIST dataset

In [None]:
from IPython.display import Image
Image("fashion-mnist-sprite.png")

First, we need to download the data set from the Internet. Fortunately, Keras provides a simple way to do this. The data set is already split into training and validation sets, with separate NumPy arrays for samples and labels.

We create a local directory, and save the training and validation data sets separately.

In [None]:
import os
import keras
import numpy as np
from keras.datasets import fashion_mnist

(x_train, y_train), (x_val, y_val) = fashion_mnist.load_data()

os.makedirs("./data", exist_ok = True)

np.savez('./data/training', image=x_train, label=y_train)
np.savez('./data/validation', image=x_val, label=y_val)

In [None]:
%%sh
ls data

## Take a look at our Keras code

What are these environment variables and why are they important? Well, they will be automatically passed to our script by SageMaker, so that we know where the data sets are, where to save the model, and how many GPUs we have. So, if you write your code this way, **there won't be anything to change** to run it on SageMaker.

This feature is called '**script mode**', and it's the recommended way to work with built-in frameworks on SageMaker.
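As a rough illustration of the idea, here is a minimal sketch of how a training script might read its settings. The hyperparameter names and the fallback values are assumptions made for this example, not taken from `mnist_keras_tf.py`; `SM_NUM_GPUS`, `SM_MODEL_DIR`, `SM_CHANNEL_TRAINING` and `SM_CHANNEL_VALIDATION` are the environment variables SageMaker sets inside the training container.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided locations."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as command-line arguments (names are illustrative).
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--batch-size", type=int, default=128)

    # Locations and hardware info default to environment variables set by SageMaker.
    # The fallbacks are hypothetical and only matter when running outside SageMaker.
    parser.add_argument("--gpu-count", type=int,
                        default=int(os.environ.get("SM_NUM_GPUS", "0")))
    parser.add_argument("--model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--training", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAINING", "./data"))
    parser.add_argument("--validation", type=str,
                        default=os.environ.get("SM_CHANNEL_VALIDATION", "./data"))

    # parse_known_args tolerates extra arguments the platform may pass.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args([])  # empty list: ignore the notebook's own argv
print(args.epochs, args.gpu_count, args.model_dir, args.training, args.validation)
```

Because every location has a default, the same script can run unchanged on a laptop, in local mode, or on a managed instance.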

In [None]:
%%sh
pygmentize mnist_keras_tf.py

The main steps are:
  * Receive and parse command line arguments: five hyperparameters and four environment variables
  * Load the data sets
  * Make sure data sets have the right shape for TensorFlow (channels last)
  * Normalize data sets, i.e. transform [0-255] pixel values to [0-1] values
  * One-hot encode category labels (not familiar with this? More info: [ https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/ ])
  * Build a Sequential model in Keras: two convolution blocks with max pooling, followed by a fully connected layer with dropout, and a final classification layer. Don't worry if this sounds like gibberish; it's not our focus today
  * Train the model, leveraging multiple GPUs if they're available
  * Print statistics
  * Save the model in TensorFlow serving format
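
The reshaping, normalization and one-hot encoding steps listed above can be sketched with plain NumPy. This is a minimal illustration on a small random batch that stands in for the real data; the variable names are assumptions, and the actual script may use Keras utilities instead.

```python
import numpy as np

# Hypothetical stand-in for a batch of Fashion-MNIST images and labels.
x = np.random.randint(0, 256, size=(4, 28, 28), dtype=np.uint8)
y = np.array([0, 3, 9, 3])

# Channels last: add a trailing channel axis -> (samples, 28, 28, 1).
x = x.reshape(x.shape[0], 28, 28, 1)

# Normalize pixel values from [0, 255] to [0, 1].
x = x.astype("float32") / 255.0

# One-hot encode the labels: row i has a single 1 at column y[i].
num_classes = 10
y_one_hot = np.eye(num_classes, dtype="float32")[y]

print(x.shape, x.dtype)
print(y_one_hot.shape)
```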
  

## Train with TensorFlow on the notebook instance (aka 'local mode')

Let's test our code inside the built-in TensorFlow environment provided by SageMaker. For fast experimentation, let's use local mode to train on the local notebook instance.

In [None]:
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(entry_point='mnist_keras_tf.py',
                          output_path=f"s3://{sess.default_bucket()}",
                          role=role,
                          instance_count=1, 
                          instance_type='local',
                          framework_version='1.15', 
                          py_version='py3',
                          script_mode=True,
                          hyperparameters={'epochs': 1}
                         )

Now, let's define the local location of the training and validation data sets.

In [None]:
local_training_input_path   = 'file://data/training.npz'
local_validation_input_path = 'file://data/validation.npz'

Let's train!

In [None]:
tf_estimator.fit({'training': local_training_input_path, 'validation': local_validation_input_path})

OK, our job runs fine locally. Let's now run the same job on a managed instance.

## Upload the data set to S3

SageMaker training instances expect data sets to be stored in Amazon S3, so let's upload them there. We could use boto3 to do this, but the SageMaker SDK includes a simple function: [Session.upload_data()](https://sagemaker.readthedocs.io/en/stable/session.html).



*Note: for high-performance workloads, Amazon EFS and Amazon FSx for Lustre are now also supported. More info [here](https://aws.amazon.com/blogs/machine-learning/speed-up-training-on-amazon-sagemaker-using-amazon-efs-or-amazon-fsx-for-lustre-file-systems/).*

In [None]:
prefix = 'keras-fashion-mnist'

# Upload the training data set to 'keras-fashion-mnist/training'
training_input_path   = sess.upload_data('data/training.npz', key_prefix=prefix+'/training')

# Upload the validation data set to 'keras-fashion-mnist/validation'
validation_input_path = sess.upload_data('data/validation.npz', key_prefix=prefix+'/validation')

print(training_input_path)
print(validation_input_path)
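
To make the printed locations easier to read, here is a small sketch of how the returned URI is laid out for a single uploaded file: bucket, then key prefix, then file name. The helper `expected_s3_uri` and the bucket name `my-bucket` are hypothetical; compare the result with the values printed above rather than treating it as a specification.

```python
import os


def expected_s3_uri(bucket, key_prefix, local_path):
    """Return the S3 URI we expect for one uploaded file (illustrative helper)."""
    file_name = os.path.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{file_name}"


# 'my-bucket' is a placeholder; the real name comes from sess.default_bucket().
print(expected_s3_uri("my-bucket", "keras-fashion-mnist/training", "data/training.npz"))
```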

We're done with our data set. Of course, in real life, much more work would be needed for data cleaning and preparation!

## Train with Managed Spot Training, and enable debugging with Amazon SageMaker Debugger

EC2 Spot Instances have long been a great cost optimization feature, and spot training is now available on SageMaker.
This blog [post](https://aws.amazon.com/blogs/aws/managed-spot-training-save-up-to-90-on-your-amazon-sagemaker-training-jobs/) has more info.

We're also using Amazon SageMaker Debugger to check for unwanted training conditions. **ZERO KERAS CODE NEEDED!**

In [None]:
# Configure a managed training job for 'mnist_keras_tf.py', 
# using a single p3.2xlarge instance running TensorFlow 1.15 in script mode

from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, rule_configs

tf_estimator = TensorFlow(entry_point='mnist_keras_tf.py', 
                          output_path=f"s3://{sess.default_bucket()}",
                          role=role,
                          instance_count=1, 
                          instance_type='ml.p3.2xlarge',
                          framework_version='1.15', 
                          py_version='py3',
                          script_mode=True,
                          hyperparameters={'epochs': 5},
                          use_spot_instances=True,        # Use spot instances
                          max_run=480,                    # Max training time, in seconds
                          max_wait=720,                   # Max training time + spot waiting time, in seconds
                          rules = [Rule.sagemaker(rule_configs.loss_not_decreasing()),
                                   Rule.sagemaker(rule_configs.overfit())]
                         )

Let's train!

In [None]:
# Train on the training and validation data sets stored in S3

tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path})

This will take about 10 minutes. Please take a look at the training log. The first few lines show SageMaker preparing the managed instance. While the job is training, you can also look at metrics in the AWS console for SageMaker, and at the training log in the AWS console for CloudWatch Logs.
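
The end of the training log also reports the savings from Managed Spot Training. The sketch below shows how that percentage relates to the `TrainingTimeInSeconds` and `BillableTimeInSeconds` fields of a `describe_training_job` response. The helper name and the sample numbers are hypothetical.

```python
def spot_savings_percent(description):
    """Compute the spot discount from a DescribeTrainingJob-style dictionary."""
    training = description["TrainingTimeInSeconds"]
    billable = description["BillableTimeInSeconds"]
    if training <= 0:
        return 0.0
    return (1 - billable / training) * 100


# Hypothetical numbers, for illustration only.
sample = {"TrainingTimeInSeconds": 200, "BillableTimeInSeconds": 60}
print(f"Managed Spot Training savings: {spot_savings_percent(sample):.1f}%")
```

With a real job, the dictionary returned by `describe_training_job` for that job name could be passed in directly.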

Let's check the status of the debug rules we configured.

In [None]:
job_name = tf_estimator.latest_training_job.name
client = tf_estimator.sagemaker_session.sagemaker_client

description = client.describe_training_job(TrainingJobName=job_name)

In [None]:
import pprint 
for status in description['DebugRuleEvaluationStatuses']:
    status.pop('LastModifiedTime')
    status.pop('RuleEvaluationJobArn')
    pprint.pprint(status)
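
If you only care about rules that flagged a problem, a small filter over the same structure is enough. This is a sketch: the helper name and the sample statuses are made up, while the `RuleConfigurationName` and `RuleEvaluationStatus` keys and the `IssuesFound` value follow the `describe_training_job` response format.

```python
def rules_with_issues(statuses):
    """Return the names of Debugger rules whose evaluation found issues."""
    return [
        status["RuleConfigurationName"]
        for status in statuses
        if status.get("RuleEvaluationStatus") == "IssuesFound"
    ]


# Hypothetical statuses shaped like description['DebugRuleEvaluationStatuses'].
sample_statuses = [
    {"RuleConfigurationName": "LossNotDecreasing", "RuleEvaluationStatus": "NoIssuesFound"},
    {"RuleConfigurationName": "Overfit", "RuleEvaluationStatus": "IssuesFound"},
]
print(rules_with_issues(sample_statuses))
```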

Let's also look at tensor information saved in S3.

In [None]:
s3_output_path = f"s3://{sess.default_bucket()}/{job_name}/debug-output"

print(s3_output_path)

In [None]:
import smdebug
from smdebug.trials import create_trial

trial = create_trial(s3_output_path)
trial

In [None]:
trial.tensor_names()

In [None]:
loss_values = trial.tensor('loss').values()

In [None]:
loss_values
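
To get a feel for what a rule such as "loss not decreasing" looks for, here is a deliberately simplified sketch. It assumes that `loss_values` maps step numbers to scalar loss values, and it is not the implementation of the built-in Debugger rule. The sample dictionary is hypothetical.

```python
def non_decreasing_steps(loss_by_step):
    """Return steps whose loss is not lower than at the previous recorded step."""
    steps = sorted(loss_by_step)
    return [
        current
        for previous, current in zip(steps, steps[1:])
        if float(loss_by_step[current]) >= float(loss_by_step[previous])
    ]


# Hypothetical values standing in for the recorded loss tensor.
sample_losses = {0: 2.30, 100: 0.85, 200: 0.61, 300: 0.64, 400: 0.52}
print(non_decreasing_steps(sample_losses))
```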

## Automatic Model Tuning

Automatic model tuning helps you automatically find the best hyperparameters for your training job.

This blog [post](https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-automatic-model-tuning-now-supports-random-search-and-hyperparameter-scaling/) has more info.

First, let's define parameter ranges.

In [None]:
# Define parameter ranges

from sagemaker.tuner import IntegerParameter, ContinuousParameter

hyperparameter_ranges = {
    'learning-rate': ContinuousParameter(0.001, 0.1, scaling_type='ReverseLogarithmic'), 
    'batch-size':    IntegerParameter(32, 1024),
    'filters':       IntegerParameter(4, 64),
    'dense-layer':   IntegerParameter(32, 1024),
    'dropout':       ContinuousParameter(0.2, 0.8)
}

The next step is to define the metric we're optimizing for. In this case, we want to maximize the validation accuracy. We also grab other metrics from the training log.

In [None]:
objective_metric_name = 'validation_accuracy'

objective_type = 'Maximize'

metric_definitions = [
    {'Name': 'training_loss',        'Regex': 'loss: ([0-9\\.]+)'},
    {'Name': 'training_accuracy',    'Regex': 'acc: ([0-9\\.]+)'},
    {'Name': 'validation_loss',      'Regex': 'val_loss: ([0-9\\.]+)'},
    {'Name': 'validation_accuracy',  'Regex': 'val_acc: ([0-9\\.]+)'},
    {'Name': 'training_precision',   'Regex': 'precision: ([0-9\\.]+)'},
    {'Name': 'training_recall',      'Regex': 'recall: ([0-9\\.]+)'},
    {'Name': 'training_f1_score',    'Regex': 'f1_score: ([0-9\\.]+)'},
    {'Name': 'validation_precision', 'Regex': 'val_precision: ([0-9\\.]+)'},
    {'Name': 'validation_recall',    'Regex': 'val_recall: ([0-9\\.]+)'},
    {'Name': 'validation_f1_score',  'Regex': 'val_f1_score: ([0-9\\.]+)'}
]
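
To see how these regular expressions pull numbers out of the training log, the sketch below applies a few of them to one line. The log line is a hypothetical example written in the style Keras prints at the end of an epoch, and `extract_metrics` is an illustrative helper, not part of the SageMaker SDK.

```python
import re

# A subset of the definitions above, copied so the example is self-contained.
sample_definitions = [
    {"Name": "training_loss", "Regex": "loss: ([0-9\\.]+)"},
    {"Name": "training_accuracy", "Regex": "acc: ([0-9\\.]+)"},
    {"Name": "validation_loss", "Regex": "val_loss: ([0-9\\.]+)"},
    {"Name": "validation_accuracy", "Regex": "val_acc: ([0-9\\.]+)"},
]

# Hypothetical log line, for illustration only.
log_line = "60000/60000 - 9s - loss: 0.4321 - acc: 0.8456 - val_loss: 0.3210 - val_acc: 0.8890"


def extract_metrics(line, definitions):
    """Apply each regex to the line and keep the first captured number."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


print(extract_metrics(log_line, sample_definitions))
```

Note that the short patterns `loss: ` and `acc: ` also occur inside `val_loss: ` and `val_acc: `. In this sketch the training values are picked up only because they appear first on the line.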

Then, it's time to put everything together and configure the tuning job. We use the same estimator as above, without the Debugger rules.

In [None]:
tf_estimator = TensorFlow(entry_point='mnist_keras_tf.py', 
                          output_path=f"s3://{sess.default_bucket()}",
                          role=role,
                          instance_count=1, 
                          instance_type='ml.p3.2xlarge',
                          framework_version='1.15', 
                          py_version='py3',
                          script_mode=True,
                          hyperparameters={'epochs': 5}
#                           use_spot_instances=True,        # Use spot instance
#                           max_run=600,                    # Max training time
#                           max_wait=720                   # Max training time + spot waiting time
                         )

In [None]:
from sagemaker.tuner import HyperparameterTuner

# Configure a tuning job using the TensorFlow estimator, the parameter ranges and the metric defined above.
# Let's run four individual training jobs, two at a time.

tuner = HyperparameterTuner(tf_estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            max_jobs=4,
                            max_parallel_jobs=2,
                            objective_type=objective_type)

Finally, let's launch the tuning job, just like a normal estimator.

In [None]:
# Launch the tuning job, passing the location of the data sets in S3.

tuner.fit({'training': training_input_path, 'validation': validation_input_path})

While the job is running, you can view it in the AWS console for SageMaker: individual jobs (and their logs), best training job so far, etc.

Of course, you can also inspect the job programmatically using [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html): *describe_hyper_parameter_tuning_job()*, etc.
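
As a minimal sketch of that programmatic route, the helper below wraps `describe_hyper_parameter_tuning_job()` and keeps only the overall status and the per-status job counters. The function name `summarize_tuning_job` is an assumption for this example; the client is passed in so the function can also be exercised without AWS access.

```python
def summarize_tuning_job(client, tuning_job_name):
    """Return the overall status and per-status training job counts of a tuning job."""
    response = client.describe_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=tuning_job_name
    )
    return {
        "status": response["HyperParameterTuningJobStatus"],
        "counters": response["TrainingJobStatusCounters"],
    }


# Usage with a real client (requires AWS credentials and an existing tuning job):
#   import boto3
#   sm_client = boto3.client("sagemaker")
#   print(summarize_tuning_job(sm_client, tuner.latest_tuning_job.name))
```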

## Inspect jobs with Amazon SageMaker Experiments

Model tuning automatically creates a new experiment, and pushes information for each job. 

**ZERO KERAS CODE NEEDED!**

Run the following cell to see the status of the automatic model tuning job. It may take a minute for the initial job to appear. Try rerunning the cell to see updated job statuses. 

Note that TrainingJobStatus may be 'InProgress' for one or more jobs. The status is 'Completed' when a job is done.

In [None]:
from sagemaker.analytics import HyperparameterTuningJobAnalytics

exp = HyperparameterTuningJobAnalytics(
    sagemaker_session=sess, 
    hyperparameter_tuning_job_name=tuner.latest_tuning_job.name
)

df = exp.dataframe()
df

Pandas is the Swiss army knife for columnar data. Let's just look at the top job.

In [None]:
best_job = df.sort_values('FinalObjectiveValue', ascending=False)[:1]
best_job

In [None]:
best_job_name = best_job['TrainingJobName'].iloc[0]  # name of the single top-ranked job
best_job_name

In [None]:
import boto3
sm = boto3.client('sagemaker')

In [None]:
best_job = sm.describe_training_job(TrainingJobName=best_job_name)

best_model_artefact = best_job['ModelArtifacts']['S3ModelArtifacts']
best_model_container = best_job['AlgorithmSpecification']['TrainingImage']

print(best_job_name)
print(best_model_artefact)
print(best_model_container)

## Deploy the best model, enabling data capture with Amazon SageMaker Model Monitor

This is where we want to save captured data.

In [None]:
prefix = '/ModelMonitorDEMO/'
s3_capture_path = 's3://' + sess.default_bucket() + prefix + best_job_name + '/'

print(s3_capture_path)

Here, we capture 100% of model inputs and outputs by setting the sampling percentage explicitly. Of course, this is configurable.

And you guessed it... **ZERO KERAS CODE NEEDED!**

In [None]:
from sagemaker.model_monitor import DataCaptureConfig

cap = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,  # capture every request and response
    destination_s3_uri=s3_capture_path
)

In [None]:
endpoint_name = best_job_name + '-ep'

best_model_predictor = tuner.deploy(
    initial_instance_count=1, 
    instance_type='ml.m5.xlarge', 
    endpoint_name=endpoint_name,
    data_capture_config=cap)

## Predict with best model

In [None]:
%matplotlib inline
import random
import matplotlib.pyplot as plt

num_samples = 10
indices = random.sample(range(x_val.shape[0]), num_samples)  # sample from all validation indices
images = x_val[indices]/255
labels = y_val[indices]

for i in range(num_samples):
    plt.subplot(1,num_samples,i+1)
    plt.imshow(images[i].reshape(28, 28), cmap='gray')
    plt.title(labels[i])
    plt.axis('off')
    
prediction = best_model_predictor.predict(images.reshape(num_samples, 28, 28, 1))['predictions']
prediction = np.array(prediction)
predicted_labels = prediction.argmax(axis=1)
print('Predicted labels are: {}'.format(predicted_labels))

Now let's predict the validation dataset 250 samples at a time, storing labels and predicted labels as we go.

In [None]:
%%time
num_samples = 250
all_labels=[]
all_predicted_labels=[]

import sys

for i in range(0, x_val.shape[0], num_samples):
    sys.stdout.write(str(i)+' ')
    indices = range(i,i+num_samples)
    images = x_val[indices]/255
    labels = y_val[indices]
    prediction = best_model_predictor.predict(images.reshape(num_samples, 28, 28, 1))['predictions']
    prediction = np.array(prediction)
    predicted_labels = prediction.argmax(axis=1)
    all_labels.extend(labels)
    all_predicted_labels.extend(predicted_labels)
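
The loop above relies on the validation set size (10,000) being a multiple of the batch size (250). For other sizes, a small batching helper can clamp the last batch. This is a sketch with a hypothetical helper name, separate from the notebook's own loop.

```python
def batch_slices(total, batch_size):
    """Yield slices that cover range(total) in chunks of at most batch_size."""
    for start in range(0, total, batch_size):
        yield slice(start, min(start + batch_size, total))


# 10,000 validation samples in batches of 250.
slices = list(batch_slices(10000, 250))
print(len(slices), slices[0], slices[-1])

# A total that is not a multiple of the batch size ends with a shorter batch.
print(list(batch_slices(10, 4)))
```

Each slice can index both the image and label arrays, and the actual length of the slice, rather than a fixed batch size, would then be used when reshaping.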

Let's build the confusion matrix, to compare predicted labels with real labels for each class.

In [None]:
import sklearn
import itertools
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(all_labels, all_predicted_labels)
plt.matshow(cm)
plt.title('Confusion matrix')
fmt = 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], fmt),
            horizontalalignment="center",
            color="white" if cm[i, j] < thresh else "black")
plt.ylabel('True label')
plt.xlabel('Predicted label')
classes = range(10) # Labels are sorted 
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes)
plt.yticks(tick_marks, classes)
plt.grid(False)
plt.show()
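
For readers who want to see what the matrix contains, here is a dependency-free sketch of the same computation on a tiny, made-up three-class example. Rows are true labels and columns are predicted labels, matching the axis labels of the plot above; the helper names are assumptions for this example.

```python
def confusion_counts(true_labels, predicted_labels, num_classes):
    """Build a square count matrix: rows are true labels, columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


def accuracy(matrix):
    """Overall accuracy is the diagonal sum divided by the total count."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Hypothetical labels for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm_small = confusion_counts(y_true, y_pred, 3)
print(cm_small)
print(accuracy(cm_small))
```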

Let's check that we captured data (you may have to wait a minute or two for files to show up).

In [None]:
%%sh -s "$s3_capture_path"

aws s3 ls --recursive $1

In [None]:
%%sh -s "$s3_capture_path"

aws s3 cp --recursive $1 .

In [None]:
# Copy local file name from the cell output above ("tensorflow-training...jsonl") and paste below to preview.
# Your code should look like:
# !head tensorflow-training-200922-0403-001-a3ab0f09-ep/AllTraffic/2020/09/22/04/23-33-408-af4b5c9d-540a-4fcc-9105-dc0eae6e417b.jsonl

!head #REPLACE ME WITH YOUR FILE NAME#
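
Each line of a capture file is one JSON object. The sketch below decodes such a line into the request and response payloads. The sample record is hypothetical and shortened, and the code assumes the payloads were captured with JSON encoding; other content types may be stored in encoded form and would need an extra decoding step.

```python
import json

# Hypothetical, shortened capture record; real files hold one JSON object per line.
sample_line = json.dumps({
    "captureData": {
        "endpointInput": {
            "observedContentType": "application/json",
            "mode": "INPUT",
            "data": json.dumps({"instances": [[0.0, 0.5]]}),
            "encoding": "JSON",
        },
        "endpointOutput": {
            "observedContentType": "application/json",
            "mode": "OUTPUT",
            "data": json.dumps({"predictions": [[0.1, 0.9]]}),
            "encoding": "JSON",
        },
    },
    "eventMetadata": {"eventId": "example-id", "inferenceTime": "2020-09-22T04:23:33Z"},
    "eventVersion": "0",
})


def parse_capture_line(line):
    """Decode one JSON Lines record into (request payload, response payload)."""
    record = json.loads(line)
    capture = record["captureData"]
    request = json.loads(capture["endpointInput"]["data"])
    response = json.loads(capture["endpointOutput"]["data"])
    return request, response


request, response = parse_capture_line(sample_line)
print(request["instances"], response["predictions"])
```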

## Delete model endpoint

In [None]:
import boto3
sm = boto3.client('sagemaker')
sm.delete_endpoint(EndpointName=endpoint_name)