### Deep Learning with Keras on Amazon SageMaker

Last update: December 3rd, 2019

Amazon SageMaker is a modular, fully managed Machine Learning service that lets you easily build, train and deploy models at any scale.

In this notebook, we'll use Keras (with the TensorFlow backend) to build a simple Convolutional Neural Network (CNN). We'll then train it to classify the Fashion-MNIST image data set. Fashion-MNIST is a Zalando data set consisting of a training set of 60,000 examples and a validation set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. It has the same format as MNIST, so it works as a drop-in replacement.

Resources
  * Amazon SageMaker documentation [ https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html ]
  * SageMaker SDK 
    * Code [ https://github.com/aws/sagemaker-python-sdk ] 
    * Documentation [ https://sagemaker.readthedocs.io/ ]
  * Fashion-MNIST [ https://github.com/zalandoresearch/fashion-mnist ] 
  * Keras documentation [ https://keras.io/ ]
  * Numpy documentation [ https://docs.scipy.org/doc/numpy/index.html ]
  
### https://gitlab.com/juliensimon/amazon-studio-demos
### Twitter: @julsimon

## Import the latest SageMaker SDK

In [None]:
%%sh
pip install -q --upgrade pip
pip install -q sagemaker smdebug-rulesconfig==0.1.2  keras pandas --upgrade

In [None]:
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
import sagemaker

print(sagemaker.__version__)
sess = sagemaker.Session()
role = sagemaker.get_execution_role()

## Download the Fashion-MNIST dataset

First, we need to download the data set from the Internet. Fortunately, Keras provides a simple way to do this. The data set is already split (training and validation), with separate Numpy arrays for samples and labels. 

We create a local directory, and save the training and validation data sets separately.

In [None]:
import os
import keras
import numpy as np
from keras.datasets import fashion_mnist

(x_train, y_train), (x_val, y_val) = fashion_mnist.load_data()

os.makedirs("./data", exist_ok=True)

np.savez('./data/training', image=x_train, label=y_train)
np.savez('./data/validation', image=x_val, label=y_val)

In [None]:
%%sh
ls -l data
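
Before uploading anything, it can be worth checking that the archives load back correctly. The sketch below is not part of the original notebook: it uses small synthetic arrays and a temporary directory (both assumptions made so that it runs without downloading the data set) and reuses the `image` and `label` keys from the cell above. It also shows that `np.savez` adds the `.npz` extension on its own.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-ins for the real arrays, so the sketch runs without a download.
images = np.zeros((8, 28, 28), dtype=np.uint8)
labels = np.arange(8, dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp_dir:
    # np.savez appends the '.npz' extension when the file name lacks it.
    np.savez(os.path.join(tmp_dir, "training"), image=images, label=labels)
    saved_files = sorted(os.listdir(tmp_dir))

    # Load the archive back and read the arrays by the keyword names used when saving.
    with np.load(os.path.join(tmp_dir, "training.npz")) as archive:
        restored_images = archive["image"]
        restored_labels = archive["label"]

print(saved_files)
print(restored_images.shape, restored_labels.shape)
```

The same `np.load` call, pointed at `./data/training.npz`, should work on the real files.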

## Take a look at our Keras code

In [None]:
keras_script_path = '/root/aim410/mnist_keras_tf.py'

In [None]:
%%sh -s $keras_script_path
pygmentize $1

The main steps are:
  * Receive and parse command line arguments: five hyperparameters, and four environment variables (we'll get back to these in a moment)
  * Load the data sets
  * Make sure data sets have the right shape for TensorFlow (channels last)
  * Normalize data sets, i.e. transform [0-255] pixel values to [0-1] values
  * One-hot encode category labels (not familiar with this? More info: [ https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/ ])
  * Build a Sequential model in Keras: two convolution blocks with max pooling, followed by a fully connected layer with dropout, and a final classification layer. Don't worry if this sounds like gibberish, it's not our focus today
  * Train the model, leveraging multiple GPUs if they're available.
  * Print statistics
  * Save the model in TensorFlow serving format
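
The three data preparation steps above (channels last, normalization, one-hot encoding) can be sketched with NumPy alone. This is a minimal illustration, not the code in `mnist_keras_tf.py`: the `preprocess` function and the tiny 2x2 "image" are hypothetical, and the actual script may well rely on Keras utilities instead.

```python
import numpy as np


def preprocess(images, labels, num_classes=10):
    """Reshape to channels-last, scale pixels to [0, 1], one-hot encode labels."""
    # (N, H, W) -> (N, H, W, 1): a single grayscale channel in the last position.
    x = images.reshape(images.shape[0], images.shape[1], images.shape[2], 1)
    # Convert to float before dividing, so [0-255] becomes [0-1].
    x = x.astype("float32") / 255.0
    # Row i of the identity matrix is the one-hot vector for class i.
    y = np.eye(num_classes, dtype="float32")[labels]
    return x, y


# One hypothetical 2x2 "image" with label 3, just to exercise the function.
images = np.array([[[0, 255], [51, 102]]], dtype=np.uint8)
labels = np.array([3])

x, y = preprocess(images, labels)
print(x.shape, y.shape)
print(y[0])
```

With the real data, `images` would have shape `(60000, 28, 28)` and the result would have shape `(60000, 28, 28, 1)`.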
  

## Upload the data set to S3

SageMaker training instances expect data sets to be stored in Amazon S3, so let's upload them there. We could use boto3 to do this, but the SageMaker SDK includes a simple function: [Session.upload_data()](https://sagemaker.readthedocs.io/en/stable/session.html).



Note: for high-performance workloads, Amazon EFS and Amazon FSx for Lustre are now also supported. More info [here](https://aws.amazon.com/blogs/machine-learning/speed-up-training-on-amazon-sagemaker-using-amazon-efs-or-amazon-fsx-for-lustre-file-systems/).

In [None]:
prefix = 'keras-fashion-mnist'

# Upload the training data set to 'keras-fashion-mnist/training'
training_input_path   = sess.upload_data('data/training.npz', key_prefix=prefix+'/training')

# Upload the validation data set to 'keras-fashion-mnist/validation'
validation_input_path = sess.upload_data('data/validation.npz', key_prefix=prefix+'/validation')

print(training_input_path)
print(validation_input_path)
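
The two printed values are S3 URIs. As far as we can tell, when a single file is uploaded, the SDK returns the full object URI, built from the bucket, the key prefix and the file name. The helper below only illustrates that layout. It is not the SDK's implementation: `expected_s3_uri` and the bucket name are hypothetical.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Build the URI where a single uploaded file is expected to land."""
    # Only the file name is kept: the local directory is not part of the key.
    file_name = posixpath.basename(local_path)
    return "s3://{}/{}/{}".format(bucket, key_prefix, file_name)


bucket = "sagemaker-eu-west-1-123456789012"  # hypothetical bucket name
prefix = "keras-fashion-mnist"

training_uri = expected_s3_uri(bucket, prefix + "/training", "data/training.npz")
validation_uri = expected_s3_uri(bucket, prefix + "/validation", "data/validation.npz")
print(training_uri)
print(validation_uri)
```

If the URIs printed by the cell above look different, trust the SDK output, not this sketch.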

We're done with our data set. Of course, in real life, much more work would be needed for data cleaning and preparation!

## Train with On-Demand instances

In [None]:
# Configure a managed training job for 'mnist_keras_tf.py',
# using a single ml.p3.2xlarge instance running TensorFlow 1.15 in script mode

from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, rule_configs

tf_estimator = TensorFlow(entry_point=keras_script_path, 
                          role=role,
                          train_instance_count=1, 
                          train_instance_type='ml.p3.2xlarge',
                          framework_version='1.15', 
                          script_mode=True,
                          py_version='py3'
                         )

## Train with Managed Spot Training

EC2 Spot Instances have long been a great cost optimization feature, and spot training is now available on SageMaker.
This blog [post](https://aws.amazon.com/blogs/aws/managed-spot-training-save-up-to-90-on-your-amazon-sagemaker-training-jobs/) has more info.

In [None]:
# Configure a managed training job for 'mnist_keras_tf.py',
# using a single ml.p3.2xlarge spot instance running TensorFlow 1.15 in script mode

from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, rule_configs

tf_estimator = TensorFlow(entry_point=keras_script_path, 
                          role=role,
                          train_instance_count=1, 
                          train_instance_type='ml.p3.2xlarge',
                          framework_version='1.15', 
                          script_mode=True,
                          py_version='py3',
                          train_use_spot_instances=True,        # Use spot instance
                          train_max_run=600,                    # Max training time
                          train_max_wait=3600,                  # Max training time + spot waiting time
                         )
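
Two things are worth keeping in mind with the spot settings above. First, the maximum wait time includes the training time, so it must be at least as large as the maximum run time. Second, at the end of a spot job, the training log reports training seconds and billable seconds, and the savings follow from those two numbers. Here is a small sketch of both. The function names and the sample durations are hypothetical and only serve to illustrate the arithmetic.

```python
def spot_waiting_budget(max_run, max_wait):
    """Return how many seconds the job may spend waiting for spot capacity."""
    # The wait limit covers training time plus waiting time,
    # so it cannot be smaller than the run limit.
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return max_wait - max_run


def spot_savings_percent(training_seconds, billable_seconds):
    """Return the savings of a spot job as a percentage of the on-demand cost."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return 100.0 * (training_seconds - billable_seconds) / training_seconds


# Settings from the estimator above: 600 seconds to train, 3600 seconds overall.
waiting_budget = spot_waiting_budget(max_run=600, max_wait=3600)
print(waiting_budget)

# Hypothetical durations, for illustration only.
savings = spot_savings_percent(training_seconds=400, billable_seconds=120)
print(savings)
```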

Let's train!

In [None]:
# Train on the training and validation data sets stored in S3

tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path})

This will take about 10 minutes. Please take a look at the training log. The first few lines show SageMaker preparing the managed instance. While the job is training, you can also look at metrics in the AWS console for SageMaker, and at the training log in the AWS console for CloudWatch Logs.

Once the job is complete, the trained model is saved in S3, and is now ready to be deployed.
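
How does the script know where to read the data and where to save the model? Inside the training container, SageMaker exposes this information through `SM_*` environment variables: each channel name passed to `fit()` becomes an `SM_CHANNEL_<NAME>` variable, and `SM_MODEL_DIR` points to the directory whose content is saved to S3. The sketch below shows one plausible way for a script to pick them up with `argparse`. It is an assumption about how `mnist_keras_tf.py` could be written, not a copy of it: the hyperparameter names are hypothetical, and the environment variables are set by hand so that the sketch also runs outside of SageMaker.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command line arguments (hypothetical names).
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=128)
    # Data locations, model location and GPU count default to the SM_* variables.
    parser.add_argument("--training", default=os.environ.get("SM_CHANNEL_TRAINING"))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--gpu-count", type=int,
                        default=int(os.environ.get("SM_NUM_GPUS", 0)))
    # parse_known_args tolerates extra arguments that the sketch does not declare.
    return parser.parse_known_args(argv)[0]


# Outside of SageMaker these variables are not set, so we simulate them here.
os.environ.setdefault("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training")
os.environ.setdefault("SM_CHANNEL_VALIDATION", "/opt/ml/input/data/validation")
os.environ.setdefault("SM_MODEL_DIR", "/opt/ml/model")
os.environ.setdefault("SM_NUM_GPUS", "1")

args = parse_args(["--epochs", "5"])
print(args.epochs, args.batch_size, args.gpu_count)
print(args.training, args.validation, args.model_dir)
```

Using environment variables as defaults keeps the script runnable on a local machine too, where the same values can simply be passed on the command line.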