# Train a Keras Sequential Model
## Using SageMaker Experiments
This notebook shows how to train a Keras Sequential model on SageMaker and how to use the SageMaker Experiments Python SDK to organize, track, compare, and evaluate your machine learning (ML) model training experiments.

You can track artifacts for experiments, including datasets, algorithms, hyperparameters, and metrics. Experiments executed on SageMaker, such as SageMaker Autopilot jobs and training jobs, are tracked automatically. You can also track artifacts for additional steps in an ML workflow that come before or after model training, such as data pre-processing or post-training model evaluation.

The APIs also let you search and browse your current and past experiments, compare experiments, and identify the best-performing models.

The model used in this notebook is a simple deep CNN taken from [the Keras examples](https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py). The experiment is organized as follows:

1. Download and prepare the CIFAR-10 dataset.
2. Train a Convolutional Neural Network (CNN) model. Tune the hyperparameter that selects the optimization method used by the model. Track the parameter configurations and the resulting model accuracy using the SageMaker Experiments Python SDK.
3. Use the search and analytics capabilities of the Python SDK to search, compare, and evaluate the performance of all model versions generated by the tuning in Step 2.
4. Trace the complete lineage of a model version, i.e. the collection of all the data pre-processing and training configurations and inputs that went into creating that model version.

## The dataset
The [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) is one of the most popular machine learning datasets. It consists of 60,000 32x32 images belonging to 10 different classes (6,000 images per class). Here are the classes in the dataset, as well as 10 random images from each:

![cifar10](https://maet3608.github.io/nuts-ml/_images/cifar10.png)

In this tutorial, we will train a deep CNN to recognize these images.

We'll compare training with file mode, pipe mode datasets, and distributed training with Horovod.

## Set up the environment

In [1]:
import time
import os
import sagemaker
import boto3
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow
from sagemaker.analytics import ExperimentAnalytics

boto3_session = boto3.Session()
sm_client = boto3_session.client('sagemaker')
sm_session = sagemaker.Session(boto_session=boto3_session, sagemaker_client=sm_client)
role = get_execution_role()

In [2]:
import sys
!{sys.executable} -m pip install sagemaker-experiments

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker



## Download the CIFAR-10 dataset
Downloading the test and training data takes around 5 minutes.

In [4]:
#!pip install wget
# import wget # for TF2
!python generate_cifar10_tfrecords_v1.x.py --data-dir data/


Download from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz and extract.
Successfully downloaded cifar-10-python.tar.gz 170498071 bytes.
Generating data//train/train.tfrecords


Generating data//validation/validation.tfrecords
Generating data//eval/eval.tfrecords
Done!


## Run on SageMaker cloud

### Uploading the data to S3

In [5]:
dataset_location = sm_session.upload_data(path='data', key_prefix='data/DEMO-cifar10-tf')
display(dataset_location)

's3://sagemaker-us-east-1-079329190341/data/DEMO-cifar10-tf'
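
The returned URI can be split into a bucket and a key prefix, and the per-channel inputs used later in this notebook are simple suffixes of it. The following is a minimal sketch using only the standard library; `split_s3_uri` and `channel_uris` are hypothetical helper names and are not part of the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/prefix' into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def channel_uris(dataset_uri, channels=("train", "validation", "eval")):
    """Build one S3 URI per input channel under the dataset prefix."""
    base = dataset_uri.rstrip("/")
    return {name: f"{base}/{name}" for name in channels}


# The URI printed by the upload cell above.
example_uri = "s3://sagemaker-us-east-1-079329190341/data/DEMO-cifar10-tf"
bucket, prefix = split_s3_uri(example_uri)
inputs = channel_uris(example_uri)

print(bucket)
print(prefix)
print(inputs["train"])
```

The channel names match the `train`, `validation`, and `eval` directories written by the data generation script.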

Now let's track the parameters from the data pre-processing step.

In [23]:
with Tracker.create(display_name="Preprocessing", sagemaker_boto_client=sm_client) as tracker:
    tracker.log_parameters({
        "datatype": 'tfrecords',
        "image_size": 32,
    })
    # we can log the s3 uri to the dataset we just uploaded
    tracker.log_input(name="cifar10-dataset", media_type="s3/uri", value=dataset_location)

TrialComponent(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f80883ff160>,trial_component_name='TrialComponent-2020-05-21-172253-gorc',display_name='Preprocessing',trial_component_arn='arn:aws:sagemaker:us-east-1:079329190341:experiment-trial-component/trialcomponent-2020-05-21-172253-gorc',response_metadata={'RequestId': '3c28f92d-a5be-4023-bfba-05ffcff4eb11', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '3c28f92d-a5be-4023-bfba-05ffcff4eb11', 'content-type': 'application/x-amz-json-1.1', 'content-length': '129', 'date': 'Thu, 21 May 2020 17:22:53 GMT'}, 'RetryAttempts': 0},parameters={'datatype': 'tfrecords', 'image_size': 32},input_artifacts={'cifar10-dataset': TrialComponentArtifact(value='s3://sagemaker-us-east-1-079329190341/data/DEMO-cifar10-tf',media_type='s3/uri')},output_artifacts={},start_time=datetime.datetime(2020, 5, 21, 17, 22, 54, 91903, tzinfo=tzlocal()),status=TrialComponentStatus(primary_status='Completed',message=None),end_time=datetime

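SageMaker can extract training metrics directly from the training logs and send them to CloudWatch Metrics. Each metric definition below pairs a metric name with a regular expression whose capture group picks the value out of a Keras log line.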

In [7]:
keras_metric_definition = [
    {'Name': 'train:loss', 'Regex': '.*loss: ([0-9\\.]+) - acc: [0-9\\.]+.*'},
    {'Name': 'train:accuracy', 'Regex': '.*loss: [0-9\\.]+ - acc: ([0-9\\.]+).*'},
    {'Name': 'validation:accuracy', 'Regex': '.*step - loss: [0-9\\.]+ - acc: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_acc: ([0-9\\.]+).*'},
    {'Name': 'validation:loss', 'Regex': '.*step - loss: [0-9\\.]+ - acc: [0-9\\.]+ - val_loss: ([0-9\\.]+) - val_acc: [0-9\\.]+.*'},
    {'Name': 'sec/steps', 'Regex': '.* - \\d+s (\\d+)[mu]s/step - loss: [0-9\\.]+ - acc: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_acc: [0-9\\.]+'}
]
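
Before launching a training job, it can help to check the patterns locally against a log line. The sketch below applies the same regular expressions with Python's `re` module. The sample line is hypothetical: it follows the Keras progress-bar format with made-up values. SageMaker's own log scraping may differ in its details, so treat this as a sanity check only.

```python
import re

# Patterns copied from keras_metric_definition (written here as raw strings).
patterns = {
    "train:loss": r".*loss: ([0-9\.]+) - acc: [0-9\.]+.*",
    "train:accuracy": r".*loss: [0-9\.]+ - acc: ([0-9\.]+).*",
    "validation:accuracy": r".*step - loss: [0-9\.]+ - acc: [0-9\.]+ - val_loss: [0-9\.]+ - val_acc: ([0-9\.]+).*",
    "validation:loss": r".*step - loss: [0-9\.]+ - acc: [0-9\.]+ - val_loss: ([0-9\.]+) - val_acc: [0-9\.]+.*",
    "sec/steps": r".* - \d+s (\d+)[mu]s/step - loss: [0-9\.]+ - acc: [0-9\.]+ - val_loss: [0-9\.]+ - val_acc: [0-9\.]+",
}

# Hypothetical end-of-epoch line in the Keras progress-bar format (made-up values).
sample_line = (
    "195/195 [==============================] - 25s 128ms/step"
    " - loss: 1.9231 - acc: 0.2974 - val_loss: 1.7542 - val_acc: 0.3625"
)

# Apply each pattern and keep the first capture group as the metric value.
extracted = {}
for name, pattern in patterns.items():
    match = re.search(pattern, sample_line)
    if match:
        extracted[name] = float(match.group(1))

for name, value in extracted.items():
    print(f"{name}: {value}")
```

If the training script logs different metric names (for example, `accuracy` instead of `acc`), the patterns will not match and must be adjusted.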

### Step 1 - Set up the Experiment

Create an experiment to track all the model training iterations. Experiments are a good way to organize your data science work. You can create an experiment to organize all the model development work for (1) a business use case you are addressing (e.g. an experiment named “customer churn prediction”), (2) a data science team that owns the experiment (e.g. an experiment named “marketing analytics experiment”), or (3) a specific data science and ML project. Think of it as a “folder” for organizing your “files”.

In [8]:
cifar10_experiment = Experiment.create(
    experiment_name=f"cifar10-image-classification-{int(time.time())}", 
    description="Classification of images", 
    sagemaker_boto_client=sm_client)
print(cifar10_experiment)

Experiment(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f80883ff160>,experiment_name='cifar10-image-classification-1590080767',description='Classification of images',experiment_arn='arn:aws:sagemaker:us-east-1:079329190341:experiment/cifar10-image-classification-1590080767',response_metadata={'RequestId': '9d5e880c-28bc-4f60-bf2c-959ef5c511e4', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '9d5e880c-28bc-4f60-bf2c-959ef5c511e4', 'content-type': 'application/x-amz-json-1.1', 'content-length': '111', 'date': 'Thu, 21 May 2020 17:06:07 GMT'}, 'RetryAttempts': 0})


### Step 2 - Track Experiment
### Create a Trial for each training run to track its inputs, parameters, and metrics
While training the CNN model on SageMaker, we will experiment with several values for the optimization method used by the model. We will create a Trial to track each training job run. We will also add the TrialComponent from the tracker we created earlier to each Trial. This enriches the Trial with the parameters we captured during the data pre-processing stage.
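
The relationship described above can be pictured with plain Python dataclasses: an experiment groups trials, and each trial groups trial components, which can be shared between trials. This is only a mental model. The `*Sketch` classes are hypothetical and are not the `smexperiments` classes; the names and parameter values mirror the ones used in this notebook.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ComponentSketch:
    """Stand-in for a trial component: one step such as preprocessing or training."""
    display_name: str
    parameters: Dict[str, object] = field(default_factory=dict)


@dataclass
class TrialSketch:
    """Stand-in for a trial: an ordered collection of components."""
    trial_name: str
    components: List[ComponentSketch] = field(default_factory=list)


@dataclass
class ExperimentSketch:
    """Stand-in for an experiment: a named group of trials."""
    experiment_name: str
    trials: List[TrialSketch] = field(default_factory=list)


# One shared preprocessing component, reused by every trial.
preprocessing = ComponentSketch("Preprocessing", {"datatype": "tfrecords", "image_size": 32})

experiment = ExperimentSketch("cifar10-image-classification")
for opt_method in ["adam", "sgd", "rmsprop"]:
    training = ComponentSketch("Training", {"optimizer": opt_method})
    experiment.trials.append(
        TrialSketch(f"cifar10-training-job-with-{opt_method}-optimization", [preprocessing, training])
    )

for trial in experiment.trials:
    print(trial.trial_name, [c.display_name for c in trial.components])
```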

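Note that the following code takes a while to run. The jobs are submitted with `wait=False`, so they run concurrently; if your account's resource limit for the instance type is too low to run them at the same time, you may need to request an increase. Otherwise, you can run them sequentially (for example, by passing `wait=True`).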

In [22]:
opt_method_trial_name_map = {}
for i, opt_method in enumerate(['adam','sgd','rmsprop']):
    # create trial
    trial_name = f"cifar10-training-job-with-{opt_method}-optimization-{int(time.time())}"
    cifar10_trial = Trial.create(
        trial_name=trial_name, 
        experiment_name=cifar10_experiment.experiment_name,
        sagemaker_boto_client=sm_client,
    )
    opt_method_trial_name_map[opt_method] = trial_name
    
    # associate the preprocessing trial component with the current trial
    cifar10_trial.add_trial_component(tracker.trial_component)


    estimator = TensorFlow(base_job_name='cifar10-tf',
                           entry_point='cifar10_keras_main.py',
                           source_dir=os.path.join(os.getcwd(), 'source_dir'),
                           role=role,
                           framework_version='1.12.0',
                           py_version='py3',
                           hyperparameters={'epochs': 1, 'batch-size' : 256, 'optimizer' : opt_method},
                           train_instance_count=1, train_instance_type='ml.p3.2xlarge',
                           metric_definitions=keras_metric_definition)
    
    cifar10_training_job_name = "cifar-training-job-{}".format(int(time.time()))
    remote_inputs = {'train' : dataset_location+'/train', 'validation' : dataset_location+'/validation', 'eval' : dataset_location+'/eval'}
    estimator.fit(remote_inputs, job_name=cifar10_training_job_name,
        experiment_config={
            "TrialName": cifar10_trial.trial_name,
            "TrialComponentDisplayName": "Training",
        },
        wait=False,)
    # give it a while before dispatching the next training job
    time.sleep(2)

INFO:sagemaker:Creating training-job with name: cifar-training-job-1590081614
INFO:sagemaker:Creating training-job with name: cifar-training-job-1590081616
INFO:sagemaker:Creating training-job with name: cifar-training-job-1590081619
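
Because the jobs are started with `wait=False`, the analytics queries later in this notebook may run before every job has reported its metrics. The following is a minimal polling sketch. It relies only on the boto3 `describe_training_job` call and its `TrainingJobStatus` field; the function name, polling interval, and timeout are illustrative choices, and it assumes the job names have been collected in a list, which the loop above does not do.

```python
import time

# Statuses after which a training job will not change any more.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_jobs(sm_client, job_names, poll_seconds=30, timeout_seconds=3600):
    """Poll until every job reaches a terminal status; return {job_name: status}."""
    deadline = time.monotonic() + timeout_seconds
    statuses = {}
    pending = list(job_names)
    while pending:
        still_running = []
        for name in pending:
            response = sm_client.describe_training_job(TrainingJobName=name)
            status = response["TrainingJobStatus"]
            statuses[name] = status
            if status not in TERMINAL_STATUSES:
                still_running.append(name)
        pending = still_running
        if pending:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"jobs still running: {pending}")
            time.sleep(poll_seconds)
    return statuses
```

To use it, append each `cifar10_training_job_name` to a list inside the loop (a hypothetical `job_names`) and call `wait_for_training_jobs(sm_client, job_names)` before querying the results.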


### Compare the model training runs for an experiment

Now we will use the analytics capabilities of the Python SDK to query and compare the training runs and identify the best model produced by our experiment. You can retrieve trial components by using a search expression.

### Some Simple Analyses

In [36]:
search_expression = {
    "Filters":[
        {
            "Name": "DisplayName",
            "Operator": "Equals",
            "Value": "Training",
        }
    ],
}
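
To make the effect of this filter concrete, the sketch below evaluates a search expression locally against plain dictionaries. It is a local approximation, not the SageMaker Search implementation: the `matches` helper is hypothetical, it supports only the `Equals` operator, and it assumes that all filters must hold at once. The two records use component names and display names that appear in this notebook's output.

```python
def matches(record, search_expression):
    """Return True if the record satisfies every 'Equals' filter (local approximation)."""
    for flt in search_expression.get("Filters", []):
        if flt["Operator"] != "Equals":
            raise NotImplementedError(f"operator not sketched: {flt['Operator']}")
        if record.get(flt["Name"]) != flt["Value"]:
            return False
    return True


# Simplified trial component records (only the fields needed for the filter).
records = [
    {"TrialComponentName": "TrialComponent-2020-05-21-170603-avhn", "DisplayName": "Preprocessing"},
    {"TrialComponentName": "cifar-training-job-1590081614-aws-training-job", "DisplayName": "Training"},
]

search_expression = {
    "Filters": [
        {"Name": "DisplayName", "Operator": "Equals", "Value": "Training"},
    ],
}

selected = [r["TrialComponentName"] for r in records if matches(r, search_expression)]
print(selected)
```

Only the component whose display name is `Training` is selected, which is why the preprocessing component does not show up in the comparison table.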

In [37]:
trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sm_session, 
    experiment_name=cifar10_experiment.experiment_name,
    search_expression=search_expression,
    sort_by="metrics.validation:accuracy.max",
    sort_order="Descending",
    metric_names=['train:accuracy', 'validation:accuracy'],
    parameter_names=['optimizer']
)

In [38]:
trial_component_analytics.dataframe()

| | TrialComponentName | DisplayName | SourceArn | optimizer | validation:accuracy - Min | validation:accuracy - Max | validation:accuracy - Avg | validation:accuracy - StdDev | validation:accuracy - Last | validation:accuracy - Count | train:accuracy - Min | train:accuracy - Max | train:accuracy - Avg | train:accuracy - StdDev | train:accuracy - Last | train:accuracy - Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cifar-training-job-1590081614-aws-training-job | Training | arn:aws:sagemaker:us-east-1:079329190341:train... | "adam" | 0.3625 | 0.3625 | 0.3625 | 0.0 | 0.3625 | 1 | 0.0703 | 0.2979 | 0.219347 | 0.054378 | 0.2974 | 59 |
| 1 | cifar-training-job-1590081619-aws-training-job | Training | arn:aws:sagemaker:us-east-1:079329190341:train... | "rmsprop" | 0.3247 | 0.3247 | 0.3247 | 0.0 | 0.3247 | 1 | 0.1159 | 0.3041 | 0.221886 | 0.054574 | 0.3033 | 59 |
| 2 | cifar-training-job-1590080826-aws-training-job | Training | arn:aws:sagemaker:us-east-1:079329190341:train... | "adam" | 0.3226 | 0.3226 | 0.3226 | 0.0 | 0.3226 | 1 | 0.0938 | 0.287 | 0.215607 | 0.048394 | 0.2864 | 59 |
| 3 | cifar-training-job-1590081253-aws-training-job | Training | arn:aws:sagemaker:us-east-1:079329190341:train... | "adam" | 0.2943 | 0.2943 | 0.2943 | 0.0 | 0.2943 | 1 | 0.0885 | 0.2914 | 0.215005 | 0.054103 | 0.2906 | 59 |
| 4 | cifar-training-job-1590081203-aws-training-job | Training | arn:aws:sagemaker:us-east-1:079329190341:train... | "sgd" | 0.2417 | 0.2417 | 0.2417 | 0.0 | 0.2417 | 1 | 0.1041 | 0.1795 | 0.138878 | 0.023298 | 0.1778 | 59 |
| 5 | cifar-training-job-1590081306-aws-training-job | Training | arn:aws:sagemaker:us-east-1:079329190341:train... | "adam" | 0.0 | 0.0 | 0.0 | 0.0 | 0.2365 | 0 | 0.0977 | 0.2533 | 0.193624 | 0.042442 | 0.2506 | 45 |
| 6 | cifar-training-job-1590081616-aws-training-job | Training | arn:aws:sagemaker:us-east-1:079329190341:train... | "sgd" | 0.0 | 0.0 | 0.0 | 0.0 | 0.2391 | 0 | 0.0938 | 0.1625 | 0.128262 | 0.019144 | 0.1612 | 45 |
| 7 | cifar-training-job-1590081308-aws-training-job | Training | arn:aws:sagemaker:us-east-1:079329190341:train... | "sgd" | 0.0 | 0.0 | 0.0 | 0.0 | 0.252 | 0 | 0.1029 | 0.1577 | 0.127723 | 0.01696 | 0.156 | 47 |
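
One way to read this table is to pick the best validation accuracy per optimizer. The sketch below does this with plain Python on values copied from the output above. Two details are handled explicitly: the optimizer values are shown as JSON-quoted strings (`"adam"` including the quotes), and runs whose `validation:accuracy - Count` is 0 report 0.0 as their maximum, so they are treated here as having no validation result. The variable names are illustrative.

```python
import json

# (optimizer as shown in the table, "validation:accuracy - Max", "validation:accuracy - Count")
rows = [
    ('"adam"', 0.3625, 1),
    ('"rmsprop"', 0.3247, 1),
    ('"adam"', 0.3226, 1),
    ('"adam"', 0.2943, 1),
    ('"sgd"', 0.2417, 1),
    ('"adam"', 0.0, 0),
    ('"sgd"', 0.0, 0),
    ('"sgd"', 0.0, 0),
]

best = {}
for raw_optimizer, val_acc_max, count in rows:
    if count == 0:
        # No validation metric was counted for this run, so skip it.
        continue
    optimizer = json.loads(raw_optimizer)  # strips the JSON quoting
    if val_acc_max > best.get(optimizer, float("-inf")):
        best[optimizer] = val_acc_max

# Print optimizers from best to worst validation accuracy.
for optimizer, accuracy in sorted(best.items(), key=lambda item: item[1], reverse=True):
    print(optimizer, accuracy)
```

With a single epoch per job, the differences between runs of the same optimizer are large, so this ranking is only indicative.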


Next, let's look at an example of tracing the lineage of a model by accessing the data tracked by SageMaker Experiments for a `cifar-training-job` trial.

In [42]:
lineage_table = ExperimentAnalytics(
    sagemaker_session=sm_session, 
    search_expression={
        "Filters":[{
            "Name": "Parents.TrialName",
            "Operator": "Equals",
            "Value": opt_method_trial_name_map['adam']
        }]
    },
    sort_by="CreationTime",
    sort_order="Ascending",
)
lineage_table.dataframe()

| | TrialComponentName | DisplayName | datatype | image_size | SourceArn | SageMaker.ImageUri | SageMaker.InstanceCount | SageMaker.InstanceType | SageMaker.VolumeSizeInGB | batch-size | ... | train:accuracy - Avg | train:accuracy - StdDev | train:accuracy - Last | train:accuracy - Count | train:loss - Min | train:loss - Max | train:loss - Avg | train:loss - StdDev | train:loss - Last | train:loss - Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TrialComponent-2020-05-21-170603-avhn | Preprocessing | tfrecords | 32.0 | | | | | | | ... | | | | | | | | | | |
| 1 | cifar-training-job-1590081614-aws-training-job | Training | | | arn:aws:sagemaker:us-east-1:079329190341:train... | 520713654638.dkr.ecr.us-east-1.amazonaws.com/s... | 1.0 | ml.p3.2xlarge | 30.0 | 256.0 | ... | 0.219347 | 0.054378 | 0.2974 | 59.0 | 1.9211 | 4.2819 | 2.343629 | 0.462367 | 1.9231 | 59.0 |
