# Train a Model with SageMaker Autopilot

We will use Autopilot to predict the star rating of customer reviews. Autopilot implements a transparent approach to AutoML. 

For more details on Autopilot, see this [Amazon Science publication](https://assets.amazon.science/e8/8b/2366b1ab407990dec96e55ee5664/amazon-sagemaker-autopilot-a-white-box-automl-solution-at-scale.pdf).

<img src="img/autopilot-transparent.png" width="80%" align="left">

# Introduction

Amazon SageMaker Autopilot is a service to perform automated machine learning (AutoML) on your datasets.  Autopilot is available through the UI or AWS SDK.  In this notebook, we will use the AWS SDK to create and deploy a text processing and star rating classification machine learning pipeline.

# Setup

Let's start by specifying:

* The S3 bucket and prefix to use for training our model.  _Note: The bucket should be in the same region as this notebook._
* The IAM role that gives this notebook access to your data.

# Notes
* This notebook will take some time to finish. 

* You can start this notebook and continue to the next notebooks whenever you are waiting for the current notebook to finish.

# Checking Pre-Requisites From The Previous `01_Prepare_Dataset_Autopilot` Notebook

In [1]:
%store -r autopilot_train_s3_uri

In [2]:
try:
    autopilot_train_s3_uri
    print('[OK]')
except NameError:
    print('+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
    print('[ERROR] PLEASE RUN THE PREVIOUS 01_PREPARE_DATASET_AUTOPILOT NOTEBOOK.')
    print('+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')

[OK]


In [3]:
print(autopilot_train_s3_uri)

s3://sagemaker-us-west-2-085964654406/data/amazon_reviews_us_Digital_Software_v1_00_autopilot.csv


In [4]:
if not autopilot_train_s3_uri:
    print('+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
    print('[ERROR] PLEASE RUN THE PREVIOUS 01_PREPARE_DATASET_AUTOPILOT NOTEBOOK.')
    print('+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
else:
    print('[OK]')

[OK]


In [5]:
import boto3
import sagemaker
import pandas as pd
import json

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

# Training Data

In [6]:
print(autopilot_train_s3_uri)

s3://sagemaker-us-west-2-085964654406/data/amazon_reviews_us_Digital_Software_v1_00_autopilot.csv


In [7]:
!aws s3 ls $autopilot_train_s3_uri

2020-09-26 15:56:31   13636489 amazon_reviews_us_Digital_Software_v1_00_autopilot.csv


## See our prepared training data which we use as input for Autopilot

In [8]:
!aws s3 cp $autopilot_train_s3_uri ./tmp/

download: s3://sagemaker-us-west-2-085964654406/data/amazon_reviews_us_Digital_Software_v1_00_autopilot.csv to tmp/amazon_reviews_us_Digital_Software_v1_00_autopilot.csv


In [9]:
df = pd.read_csv('./tmp/amazon_reviews_us_Digital_Software_v1_00_autopilot.csv')
df.head()

Unnamed: 0,star_rating,review_body
0,5,We are a new church and we were looking for so...
1,5,I've been using TurboTax for several years. I...
2,2,"Worked as expected, however, I have only given..."
3,4,I have been using Quicken for a long time. I ...
4,2,its nothin but a few days sample


# Set Up the S3 Location for the Autopilot-Generated Assets
This includes Jupyter Notebooks (Analysis), Python Scripts (Feature Engineering), and Trained Models.

In [10]:
prefix_model_output = 'models/autopilot'

model_output_s3_uri = 's3://{}/{}'.format(bucket, prefix_model_output)

print(model_output_s3_uri)

s3://sagemaker-us-west-2-085964654406/models/autopilot


In [11]:
max_candidates = 3

job_config = {
    'CompletionCriteria': {
      'MaxRuntimePerTrainingJobInSeconds': 600,
      'MaxCandidates': max_candidates,
      'MaxAutoMLJobRuntimeInSeconds': 3600
    },
}

input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': '{}'.format(autopilot_train_s3_uri)
        }
      },
      'TargetAttributeName': 'star_rating'
    }
]

output_data_config = {
    'S3OutputPath': '{}'.format(model_output_s3_uri)
}
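
The three request dictionaries above can also be assembled by a small helper, which makes it easier to reuse the same settings with another dataset. This is a minimal sketch: the function name `build_autopilot_request`, its defaults, and the example bucket are hypothetical, while the dictionary keys mirror the ones used in the cell above. The two sanity checks are our own assumptions, not documented API rules.

```python
def build_autopilot_request(train_s3_uri, output_s3_uri, target_column,
                            max_candidates=3,
                            max_runtime_per_job=600,
                            max_total_runtime=3600):
    """Return (input_data_config, output_data_config, job_config) dictionaries."""
    # Assumption: both locations are expected to be S3 URIs.
    if not train_s3_uri.startswith('s3://') or not output_s3_uri.startswith('s3://'):
        raise ValueError('Both locations must be S3 URIs (s3://...).')
    # Assumption: a single training job should not outlast the whole AutoML job.
    if max_runtime_per_job > max_total_runtime:
        raise ValueError('Per-job runtime exceeds the total AutoML job runtime.')

    input_data_config = [{
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': train_s3_uri,
            }
        },
        'TargetAttributeName': target_column,
    }]
    output_data_config = {'S3OutputPath': output_s3_uri}
    job_config = {
        'CompletionCriteria': {
            'MaxRuntimePerTrainingJobInSeconds': max_runtime_per_job,
            'MaxCandidates': max_candidates,
            'MaxAutoMLJobRuntimeInSeconds': max_total_runtime,
        }
    }
    return input_data_config, output_data_config, job_config


# Hypothetical bucket and paths, used only for illustration.
example_input, example_output, example_job_config = build_autopilot_request(
    's3://example-bucket/data/train.csv',
    's3://example-bucket/models/autopilot',
    'star_rating',
)
print(example_job_config)
```
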

# Check For Existing Autopilot Jobs

In [12]:
existing_jobs_response = sm.list_auto_ml_jobs()

In [13]:
existing_jobs_response

{'AutoMLJobSummaries': [],
 'ResponseMetadata': {'RequestId': 'cf06984f-b452-435d-91f0-4eeec50a37e8',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'cf06984f-b452-435d-91f0-4eeec50a37e8',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '25',
   'date': 'Sat, 26 Sep 2020 16:00:25 GMT'},
  'RetryAttempts': 0}}

In [14]:
num_existing_jobs = 0 
running_jobs = 0

if 'AutoMLJobSummaries' in existing_jobs_response.keys():
    job_list = existing_jobs_response['AutoMLJobSummaries']
    num_existing_jobs = len(job_list)
    print('[INFO] You already created {} Autopilot job(s) in this account.'.format(num_existing_jobs))
    for j in job_list:
        if 'AutoMLJobStatus' in j.keys():                
            if j['AutoMLJobStatus'] == 'InProgress':
                running_jobs = running_jobs + 1
    print('[INFO] There are currently {} Autopilot job(s) actively running.'.format(running_jobs))
else:
    print('[OK] Please continue.')

[INFO] You already created 0 Autopilot job(s) in this account.
[INFO] There are currently 0 Autopilot job(s) actively running.
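
`list_auto_ml_jobs` returns its results in pages, so an account with many jobs may need more than one call before every job has been counted. The sketch below follows the `NextToken` field until it is absent. The helper name `count_autopilot_jobs` and the `FakeSageMakerClient` class are hypothetical; the fake client only exists so the example runs without AWS access.

```python
def count_autopilot_jobs(sm_client):
    """Return (total_jobs, running_jobs) across all result pages."""
    total, running = 0, 0
    kwargs = {}
    while True:
        response = sm_client.list_auto_ml_jobs(**kwargs)
        for job in response.get('AutoMLJobSummaries', []):
            total += 1
            if job.get('AutoMLJobStatus') == 'InProgress':
                running += 1
        next_token = response.get('NextToken')
        if not next_token:
            return total, running
        # Ask for the next page on the following iteration.
        kwargs = {'NextToken': next_token}


class FakeSageMakerClient:
    """Hypothetical stand-in for the boto3 client so the sketch runs offline."""

    def __init__(self, pages):
        self._pages = pages

    def list_auto_ml_jobs(self, NextToken=None):
        index = int(NextToken) if NextToken else 0
        page = {'AutoMLJobSummaries': self._pages[index]}
        if index + 1 < len(self._pages):
            page['NextToken'] = str(index + 1)
        return page


fake_client = FakeSageMakerClient([
    [{'AutoMLJobStatus': 'Completed'}, {'AutoMLJobStatus': 'InProgress'}],
    [{'AutoMLJobStatus': 'Failed'}],
])
print(count_autopilot_jobs(fake_client))  # prints (3, 1)
```

In the notebook, the same helper could be called as `count_autopilot_jobs(sm)` with the real client.
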


# Launch the SageMaker Autopilot Job

## _Note: Please Only Run This Once._

In [15]:
from time import gmtime, strftime, sleep

In [16]:
%store -r auto_ml_job_name

try:
    auto_ml_job_name
except NameError:    
    timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
    auto_ml_job_name = 'automl-dm-' + timestamp_suffix
    print('Created AutoMLJobName: ' + auto_ml_job_name)

no stored variable or alias auto_ml_job_name
Created AutoMLJobName: automl-dm-26-16-00-25
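
The generated name only encodes day and time, so it helps to validate it before submitting the job. The sketch below wraps the same `strftime` pattern in a function and checks the result. The length limit (32 characters) and the character pattern are our assumption about the `AutoMLJobName` constraints and should be confirmed against the API reference; `make_job_name` is a hypothetical helper.

```python
import re
from time import gmtime, strftime

# Assumed constraint: 1-32 characters, alphanumeric or hyphen,
# starting and ending with an alphanumeric character.
JOB_NAME_PATTERN = re.compile(r'^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,31}$')


def make_job_name(prefix='automl-dm', now=None):
    """Build a job name from a prefix and a day-hour-minute-second suffix."""
    now = gmtime() if now is None else now
    name = '{}-{}'.format(prefix, strftime('%d-%H-%M-%S', now))
    if len(name) > 32 or not JOB_NAME_PATTERN.match(name):
        raise ValueError('Invalid AutoML job name: {}'.format(name))
    return name


print(make_job_name())
```
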


In [17]:
print(auto_ml_job_name)

automl-dm-26-16-00-25


In [18]:
%store auto_ml_job_name

Stored 'auto_ml_job_name' (str)


In [19]:
print('Currently Running Jobs (Should be 0): {}'.format(running_jobs))

Currently Running Jobs (Should be 0): 0


In [20]:
max_running_jobs = 1

if running_jobs < max_running_jobs:  # Limit to at most 1 running job
    try:
        sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                              InputDataConfig=input_data_config,
                              OutputDataConfig=output_data_config,
                              AutoMLJobConfig=job_config,
                              RoleArn=role)
        print('[OK] Autopilot Job {} created.'.format(auto_ml_job_name))
        running_jobs = running_jobs + 1
    except Exception as e:
        # Most likely the job name already exists; print the error so other causes are visible.
        print('[WARN] You have already launched an Autopilot job.  Please continue to see the output of this job. ({})'.format(e))
else:
    print('[WARN] You have already launched {} running Autopilot job(s).  Please continue to see the output of the running job.'.format(running_jobs))

[OK] Autopilot Job automl-dm-26-16-00-25 created.


# Track the Progress of the Autopilot Job

A SageMaker Autopilot job consists of the following high-level steps:
* _Data Analysis_ where the data is summarized and analyzed to determine which feature engineering techniques, hyper-parameters, and models to explore.
* _Feature Engineering_ where the data is scrubbed, balanced, combined, and split into train and validation.
* _Model Training and Tuning_ where the top performing features, hyper-parameters, and models are selected and trained.

<img src="img/autopilot-steps.png" width="90%" align="left">

**Autopilot Research Paper: https://assets.amazon.science/e8/8b/2366b1ab407990dec96e55ee5664/amazon-sagemaker-autopilot-a-white-box-automl-solution-at-scale.pdf**

# Analyze Data and Generate Notebooks

In [21]:
job_description_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)

while 'AutoMLJobStatus' not in job_description_response.keys() or 'AutoMLJobSecondaryStatus' not in job_description_response.keys():
    job_description_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    print('[INFO] Autopilot Job has not yet started. Please wait. ')
    print(json.dumps(job_description_response, indent=4, sort_keys=True, default=str))
    print('[INFO] Waiting for Autopilot Job to start...')
    sleep(15)

print('[OK] AutoMLJob started.')

[OK] AutoMLJob started.
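
The polling cells in this notebook all follow one pattern: describe the job, print its status, and sleep while the job is `InProgress` in a given phase. That pattern can be factored into a single helper. This is a minimal sketch; `wait_while` and `fake_describe_factory` are hypothetical names, and the fake describe function only replays canned responses so the example runs without AWS access.

```python
from time import sleep


def wait_while(describe_fn, job_name, active_secondary_statuses, poll_seconds=15):
    """Poll until the job leaves 'InProgress' or leaves the given secondary statuses."""
    # A set gives true membership tests. Note that ('InProgress') without a
    # trailing comma is a plain string, so `in` would test for a substring.
    active = set(active_secondary_statuses)
    while True:
        response = describe_fn(AutoMLJobName=job_name)
        status = response['AutoMLJobStatus']
        secondary = response['AutoMLJobSecondaryStatus']
        print(status, secondary)
        if status != 'InProgress' or secondary not in active:
            return response
        sleep(poll_seconds)


def fake_describe_factory(responses):
    """Hypothetical stand-in for sm.describe_auto_ml_job that replays canned responses."""
    remaining = iter(responses)

    def describe(AutoMLJobName):
        return next(remaining)

    return describe


canned = [
    {'AutoMLJobStatus': 'InProgress', 'AutoMLJobSecondaryStatus': 'Starting'},
    {'AutoMLJobStatus': 'InProgress', 'AutoMLJobSecondaryStatus': 'AnalyzingData'},
    {'AutoMLJobStatus': 'InProgress', 'AutoMLJobSecondaryStatus': 'FeatureEngineering'},
]
final = wait_while(fake_describe_factory(canned), 'example-job',
                   {'Starting', 'AnalyzingData'}, poll_seconds=0)
print(final['AutoMLJobSecondaryStatus'])  # prints FeatureEngineering
```

With the real client, the data analysis phase could be awaited with `wait_while(sm.describe_auto_ml_job, auto_ml_job_name, {'Starting', 'AnalyzingData'})`.
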


# Review the SageMaker `Processing Jobs`
* The first Processing Job (Data Splitter) checks the data for sanity, performs stratified shuffling, and splits the data into training and validation sets.
* The second Processing Job (Candidate Generator) first streams through the data to compute statistics for the dataset. It then uses these statistics to identify the problem type and the possible type of every predictor column: numeric, categorical, natural language, etc.

In [22]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/">Processing Jobs</a></b>'.format(region)))


# The Next Cell Will Show `InProgress` For A Few Minutes.

## _Please be patient._

In [23]:
%%time

job_status = job_description_response['AutoMLJobStatus']
job_sec_status = job_description_response['AutoMLJobSecondaryStatus']

if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status in ('Starting', 'AnalyzingData'):
        job_description_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job_description_response['AutoMLJobStatus']
        job_sec_status = job_description_response['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(15)
    print('[OK] Data analysis phase completed.\n')
    
print(json.dumps(job_description_response, indent=4, sort_keys=True, default=str))

InProgress Starting
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress FeatureEngineering
[OK] Data analysis phase completed.

{
    "AutoMLJobArn": "arn:aws:sagemaker:us-west-2:085964654406:automl-job/automl-dm-26-16-00-25",
    "AutoMLJobArtifacts": {
        "CandidateDefinitionNotebook

# View Generated Notebook Samples
Once data analysis is complete, SageMaker Autopilot generates two notebooks:
* Data Exploration
* Candidate Definition

# Waiting For Generated Notebooks

In [24]:
job_description_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)

while 'AutoMLJobArtifacts' not in job_description_response.keys():
    job_description_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    print('[INFO] Autopilot Job has not yet generated the artifacts. Please wait. ')
    print(json.dumps(job_description_response, indent=4, sort_keys=True, default=str))
    print('[INFO] Waiting for AutoMLJobArtifacts...')
    sleep(15)

print('[OK] AutoMLJobArtifacts generated.')

[OK] AutoMLJobArtifacts generated.


In [25]:
job_description_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)

while 'DataExplorationNotebookLocation' not in job_description_response['AutoMLJobArtifacts'].keys():
    job_description_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    print('[INFO] Autopilot Job has not yet generated the notebooks. Please wait. ')
    print(json.dumps(job_description_response, indent=4, sort_keys=True, default=str))
    print('[INFO] Waiting for DataExplorationNotebookLocation...')
    sleep(15)

print('[OK] DataExplorationNotebookLocation found.')   

[OK] DataExplorationNotebookLocation found.


In [26]:
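# Cut the URI at '/notebooks/' to get the parent folder.
# (str.rstrip() strips a set of characters, not a suffix, so it could also eat into the job id.)
generated_resources = job_description_response['AutoMLJobArtifacts']['DataExplorationNotebookLocation'].rsplit('/notebooks/', 1)[0]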

pr_job_id = generated_resources.rsplit('/', 1)[-1]
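
Extracting the processing job id relies on trimming a known suffix from the notebook's S3 URI. One detail worth knowing is that `str.rstrip` treats its argument as a set of characters rather than a suffix, so it can remove more than intended when the id ends in characters that also occur in the suffix. The sketch below compares both approaches; the bucket, job name, and id in `example_uri` are made up for illustration, and `candidates_prefix` is a hypothetical helper.

```python
NOTEBOOK_SUFFIX = '/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb'


def candidates_prefix(notebook_s3_uri):
    """Return the S3 folder that contains the generated notebooks/ directory."""
    if not notebook_s3_uri.endswith(NOTEBOOK_SUFFIX):
        raise ValueError('Unexpected notebook location: {}'.format(notebook_s3_uri))
    # Slice off exactly the suffix, nothing more.
    return notebook_s3_uri[:-len(NOTEBOOK_SUFFIX)]


# Hypothetical URI whose job id ends in characters that also appear in the suffix.
example_uri = ('s3://example-bucket/models/autopilot/example-job/'
               'sagemaker-automl-candidates/pr-1-9be' + NOTEBOOK_SUFFIX)

stripped = example_uri.rstrip('notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb')
print(stripped.rsplit('/', 1)[-1])                         # prints pr-1-9
print(candidates_prefix(example_uri).rsplit('/', 1)[-1])   # prints pr-1-9be
```
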

In [27]:
from IPython.display import display, HTML

if not pr_job_id: 
    print('No AutoMLJobArtifacts found.')
else: 
    display(HTML('<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/{}/sagemaker-automl-candidates/{}/">S3 Generated Resources</a></b>'.format(bucket, prefix_model_output, auto_ml_job_name, pr_job_id)))

# In the Jupyter File Browser, Open the Following Folders to See Samples of the Generated Assets:
```
notebooks/
generated_module/
```

These folders contain a lot of useful information.

#### _(Optional) You can download the actual files generated for your specific Autopilot run using the following:_
```
generated_resources = job_description_response['AutoMLJobArtifacts']['DataExplorationNotebookLocation'].rsplit('/notebooks/', 1)[0]

!aws s3 cp --recursive $generated_resources .
```

# Feature Engineering

### Watch for SageMaker `Training Jobs` and `Batch Transform Jobs` to start.

* This is the candidate exploration phase. 
* Each Python data-processing script is executed inside a SageMaker framework container as a training job, followed by a transform job.

Note that the feature preprocessing part of each pipeline has all hyper-parameters fixed, i.e. it does not require tuning, so the feature preprocessing step can run before the hyper-parameter optimization job.

It outputs up to 10 variants of the transformed data, and the algorithms for each pipeline are set to use the respective transformed data.

<img src="img/autopilot-steps.png" width="90%" align="left">


In [28]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/">Training Jobs</a></b>'.format(region)))


In [29]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/transform-jobs/">Batch Transform Jobs</a></b>'.format(region)))


# The Next Cell Will Show `InProgress` For A Few Minutes.

## _Please be patient._ ##

In [30]:
%%time

job_description_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_status = job_description_response['AutoMLJobStatus']
job_sec_status = job_description_response['AutoMLJobSecondaryStatus']
print(job_status)
print(job_sec_status)
if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status == 'FeatureEngineering':
        job_description_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job_description_response['AutoMLJobStatus']
        job_sec_status = job_description_response['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(15)
    print('[OK] Feature engineering phase completed.\n')
    
print(json.dumps(job_description_response, indent=4, sort_keys=True, default=str))

InProgress
FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress

# [INFO] _Feel free to continue to the next workshop section while this notebook is running._

# Model Training and Tuning

### Watch for a SageMaker `Hyperparameter Tuning Job` and various `Training Jobs` to start.

* All algorithms are optimized using a SageMaker Hyperparameter Tuning job. 
* Up to 250 training jobs (based on number of candidates specified) are selectively executed to find the best candidate model.

<img src="img/autopilot-steps.png" width="90%" align="left">


In [31]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/hyper-tuning-jobs/">Hyperparameter Tuning Jobs</a></b>'.format(region)))


In [32]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/">Training Jobs</a></b>'.format(region)))


# The Next Cell Will Show `InProgress` For A Few Minutes.

## _Please be patient._

In [33]:
%%time

job_description_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_status = job_description_response['AutoMLJobStatus']
job_sec_status = job_description_response['AutoMLJobSecondaryStatus']
print(job_status)
print(job_sec_status)
if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status == 'ModelTuning':
        job_description_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job_description_response['AutoMLJobStatus']
        job_sec_status = job_description_response['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(15)
    print('[OK] Model tuning phase completed.\n')
    
print(json.dumps(job_description_response, indent=4, sort_keys=True, default=str))

InProgress
ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
Completed MaxCandidatesReached
[OK] Model tuning phase completed.

{
    "AutoMLJobArn": "arn:aws:sagemaker:us-west-2:085964654406:automl-job/automl-dm-26-16-00-25",
    "AutoMLJobArtifacts": {
        "CandidateDefinitionNotebookLocation": "s3://sagemaker-us-west-2-085964654406/models/autopilot/automl-dm-26-16-00-25/sagemaker-automl-candidates/pr-1-59ccd780cff2465fb4222cdff6e916091d72abf795724dd7990e026f5c/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb",
        "DataExplorationNotebookLocation": "s3://sagemaker-us-west-2-085964654406/models/autopilot/automl-dm-26-16-00-25/sagemaker-automl-candidates/pr-1-59c

# _Please Wait Until ^^ Autopilot ^^ Completes Above_

# [INFO] _Feel free to continue to the next workshop section while this notebook is running._

Make sure the status below indicates `Completed`.

In [34]:
if job_status != 'Completed':
    print('************************************************************')
    print('[ERROR] THIS JOB DID NOT COMPLETE PROPERLY. ****************')
    print('[ERROR] LOOK IN PREVIOUS CELLS TO FIND THE ISSUE. **********')    
    print('************************************************************')

# Viewing All Candidates
Once model tuning is complete, you can view all the candidates (pipeline evaluations with different hyperparameter combinations) that were explored by AutoML and sort them by their final performance metric.

In [35]:
candidates_response = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, 
                                                         SortBy='FinalObjectiveMetricValue')

### Check that candidates exist

In [36]:
if not candidates_response:
    print('[ERROR] THE JOB DID NOT COMPLETE PROPERLY. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.')
else:
    candidates = candidates_response['Candidates']
    print('[OK]')

[OK]


In [37]:
if not candidates:
    print('[ERROR] THE JOB DID NOT COMPLETE PROPERLY. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.')
elif 'CandidateName' not in candidates[0]:
    print('[ERROR] THE JOB DID NOT COMPLETE PROPERLY. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.')
elif 'FinalAutoMLJobObjectiveMetric' not in candidates[0]:
    print('[ERROR] THE JOB DID NOT COMPLETE PROPERLY. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.')
else:
    print('[OK]')

[OK]


In [38]:
print(json.dumps(candidates, indent=4, sort_keys=True, default=str))

[
    {
        "CandidateName": "tuning-job-1-f0b555e532ea4856a2-003-79e5e79f",
        "CandidateStatus": "Completed",
        "CandidateSteps": [
            {
                "CandidateStepArn": "arn:aws:sagemaker:us-west-2:085964654406:processing-job/db-1-667ac3ae1341459aaf9a5820cfa42fd329118a231f484f40aab37c19a8",
                "CandidateStepName": "db-1-667ac3ae1341459aaf9a5820cfa42fd329118a231f484f40aab37c19a8",
                "CandidateStepType": "AWS::SageMaker::ProcessingJob"
            },
            {
                "CandidateStepArn": "arn:aws:sagemaker:us-west-2:085964654406:training-job/automl-dm--dpp2-1-d4672888ec294b6e81bde5bdd6915697815849129e314",
                "CandidateStepName": "automl-dm--dpp2-1-d4672888ec294b6e81bde5bdd6915697815849129e314",
                "CandidateStepType": "AWS::SageMaker::TrainingJob"
            },
            {
                "CandidateStepArn": "arn:aws:sagemaker:us-west-2:085964654406:transform-job/automl-dm--dpp2-rpb-1-26107

In [39]:
for index, candidate in enumerate(candidates):
    print(str(index) + "  " 
        + candidate['CandidateName'] + "  " 
        + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))

0  tuning-job-1-f0b555e532ea4856a2-003-79e5e79f  0.38075000047683716
1  tuning-job-1-f0b555e532ea4856a2-002-b9cff5d3  0.3007600009441376
2  tuning-job-1-f0b555e532ea4856a2-001-692a1be2  0.2910799980163574
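
The listing above can also be produced as a ranked list of `(name, value)` pairs, which is convenient when the candidates arrive unsorted or when some lack a final metric. This is a minimal sketch; `rank_candidates` is a hypothetical helper, and `sample_candidates` reuses the names and metric values printed above, trimmed to the fields the helper reads.

```python
def rank_candidates(candidates, higher_is_better=True):
    """Return (CandidateName, metric value) pairs, best candidate first."""
    scored = [
        (c['CandidateName'], c['FinalAutoMLJobObjectiveMetric']['Value'])
        for c in candidates
        if 'FinalAutoMLJobObjectiveMetric' in c  # skip candidates without a final metric
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=higher_is_better)


# Trimmed-down copies of the candidates shown in the output above.
sample_candidates = [
    {'CandidateName': 'tuning-job-1-f0b555e532ea4856a2-001-692a1be2',
     'FinalAutoMLJobObjectiveMetric': {'MetricName': 'validation:accuracy',
                                       'Value': 0.2910799980163574}},
    {'CandidateName': 'tuning-job-1-f0b555e532ea4856a2-003-79e5e79f',
     'FinalAutoMLJobObjectiveMetric': {'MetricName': 'validation:accuracy',
                                       'Value': 0.38075000047683716}},
    {'CandidateName': 'tuning-job-1-f0b555e532ea4856a2-002-b9cff5d3',
     'FinalAutoMLJobObjectiveMetric': {'MetricName': 'validation:accuracy',
                                       'Value': 0.3007600009441376}},
]

for rank, (name, value) in enumerate(rank_candidates(sample_candidates)):
    print(rank, name, value)
```

Accuracy is a higher-is-better metric; for a loss-style objective, pass `higher_is_better=False`.
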


# Inspect Trials using Experiments API

SageMaker Autopilot automatically creates a new experiment, and pushes information for each trial. 

In [40]:
from sagemaker.analytics import ExperimentAnalytics, TrainingJobAnalytics

exp = ExperimentAnalytics(
    sagemaker_session=sess, 
    experiment_name=auto_ml_job_name + '-aws-auto-ml-job',
)

df = exp.dataframe()
print(df)

                                  TrialComponentName  \
0  tuning-job-1-f0b555e532ea4856a2-002-b9cff5d3-a...   
1  tuning-job-1-f0b555e532ea4856a2-003-79e5e79f-a...   
2  tuning-job-1-f0b555e532ea4856a2-001-692a1be2-a...   
3  automl-dm--dpp1-csv-1-c175c35a63a8483a96750d7f...   
4  automl-dm--dpp2-rpb-1-2610746b182d45b0babf3609...   
5  automl-dm--dpp1-1-f84e699827854961b80e6c63fd5f...   
6  automl-dm--dpp2-1-d4672888ec294b6e81bde5bdd691...   
7  db-1-667ac3ae1341459aaf9a5820cfa42fd329118a231...   

                                         DisplayName  \
0  tuning-job-1-f0b555e532ea4856a2-002-b9cff5d3-a...   
1  tuning-job-1-f0b555e532ea4856a2-003-79e5e79f-a...   
2  tuning-job-1-f0b555e532ea4856a2-001-692a1be2-a...   
3  automl-dm--dpp1-csv-1-c175c35a63a8483a96750d7f...   
4  automl-dm--dpp2-rpb-1-2610746b182d45b0babf3609...   
5  automl-dm--dpp1-1-f84e699827854961b80e6c63fd5f...   
6  automl-dm--dpp2-1-d4672888ec294b6e81bde5bdd691...   
7  db-1-667ac3ae1341459aaf9a5820cfa42fd329118a2

# Explore the Best Candidate
Now that we have successfully completed the AutoML job on our dataset and visualized the trials, we can create a model from any of the trials with a single API call and then deploy that model for online or batch prediction using [Inference Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html). For this notebook, we deploy only the best performing trial for inference.

The best candidate is the one we're really interested in.

In [41]:
best_candidate_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)

In [42]:
if not best_candidate_response:
    print('[ERROR] THE JOB DID NOT COMPLETE PROPERLY. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.')
else:
    best_candidate = best_candidate_response['BestCandidate']
    print('[OK]')

[OK]


In [43]:
print(json.dumps(best_candidate_response, indent=4, sort_keys=True, default=str))

{
    "AutoMLJobArn": "arn:aws:sagemaker:us-west-2:085964654406:automl-job/automl-dm-26-16-00-25",
    "AutoMLJobArtifacts": {
        "CandidateDefinitionNotebookLocation": "s3://sagemaker-us-west-2-085964654406/models/autopilot/automl-dm-26-16-00-25/sagemaker-automl-candidates/pr-1-59ccd780cff2465fb4222cdff6e916091d72abf795724dd7990e026f5c/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb",
        "DataExplorationNotebookLocation": "s3://sagemaker-us-west-2-085964654406/models/autopilot/automl-dm-26-16-00-25/sagemaker-automl-candidates/pr-1-59ccd780cff2465fb4222cdff6e916091d72abf795724dd7990e026f5c/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb"
    },
    "AutoMLJobConfig": {
        "CompletionCriteria": {
            "MaxAutoMLJobRuntimeInSeconds": 3600,
            "MaxCandidates": 3,
            "MaxRuntimePerTrainingJobInSeconds": 600
        }
    },
    "AutoMLJobName": "automl-dm-26-16-00-25",
    "AutoMLJobSecondaryStatus": "MaxCandidatesReached",
  

In [44]:
if not best_candidate:
    print('[ERROR] THE JOB DID NOT COMPLETE PROPERLY. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.')
elif 'CandidateName' not in best_candidate:
    print('[ERROR] THE JOB DID NOT COMPLETE PROPERLY. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.')
elif 'FinalAutoMLJobObjectiveMetric' not in best_candidate:
    print('[ERROR] THE JOB DID NOT COMPLETE PROPERLY. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.')
else:
    best_candidate_identifier = best_candidate['CandidateName']
    print("Candidate name: " + best_candidate_identifier)
    print("Metric name: " + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
    print("Metric value: " + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))


Candidate name: tuning-job-1-f0b555e532ea4856a2-003-79e5e79f
Metric name: validation:accuracy
Metric value: 0.38075000047683716


In [45]:
print(json.dumps(best_candidate, indent=4, sort_keys=True, default=str))

{
    "CandidateName": "tuning-job-1-f0b555e532ea4856a2-003-79e5e79f",
    "CandidateStatus": "Completed",
    "CandidateSteps": [
        {
            "CandidateStepArn": "arn:aws:sagemaker:us-west-2:085964654406:processing-job/db-1-667ac3ae1341459aaf9a5820cfa42fd329118a231f484f40aab37c19a8",
            "CandidateStepName": "db-1-667ac3ae1341459aaf9a5820cfa42fd329118a231f484f40aab37c19a8",
            "CandidateStepType": "AWS::SageMaker::ProcessingJob"
        },
        {
            "CandidateStepArn": "arn:aws:sagemaker:us-west-2:085964654406:training-job/automl-dm--dpp2-1-d4672888ec294b6e81bde5bdd6915697815849129e314",
            "CandidateStepName": "automl-dm--dpp2-1-d4672888ec294b6e81bde5bdd6915697815849129e314",
            "CandidateStepType": "AWS::SageMaker::TrainingJob"
        },
        {
            "CandidateStepArn": "arn:aws:sagemaker:us-west-2:085964654406:transform-job/automl-dm--dpp2-rpb-1-2610746b182d45b0babf3609292d01f76dd731cb4",
            "CandidateStepN

# View Individual Autopilot Jobs

In [46]:
steps = []
if not best_candidate:
    print('[ERROR] THE JOB DID NOT COMPLETE PROPERLY. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.')
elif 'CandidateSteps' not in best_candidate:
    print('[ERROR] THE JOB DID NOT COMPLETE PROPERLY. LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.')
else:
    for step in best_candidate['CandidateSteps']:
        print('Candidate Step Type: {}'.format(step['CandidateStepType']))
        print('Candidate Step Name: {}'.format(step['CandidateStepName']))
        steps.append(step['CandidateStepName'])

Candidate Step Type: AWS::SageMaker::ProcessingJob
Candidate Step Name: db-1-667ac3ae1341459aaf9a5820cfa42fd329118a231f484f40aab37c19a8
Candidate Step Type: AWS::SageMaker::TrainingJob
Candidate Step Name: automl-dm--dpp2-1-d4672888ec294b6e81bde5bdd6915697815849129e314
Candidate Step Type: AWS::SageMaker::TransformJob
Candidate Step Name: automl-dm--dpp2-rpb-1-2610746b182d45b0babf3609292d01f76dd731cb4
Candidate Step Type: AWS::SageMaker::TrainingJob
Candidate Step Name: tuning-job-1-f0b555e532ea4856a2-003-79e5e79f


In [47]:
from IPython.display import display, HTML

display(HTML('<b>Review Best Candidate <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/{}">Processing Job</a></b>'.format(region, steps[0])))

In [48]:
from IPython.display import display, HTML

display(HTML('<b>Review Best Candidate <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job</a></b>'.format(region, steps[1])))

In [49]:
from IPython.display import display, HTML

display(HTML('<b>Review Best Candidate <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/transform-jobs/{}">Transform Job</a></b>'.format(region, steps[2])))

In [50]:
from IPython.display import display, HTML

display(HTML('<b>Review Best Candidate <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job (Tuning)</a></b>'.format(region, steps[3])))

# See the containers and models composing the Inference Pipeline

In [51]:
best_candidate_containers = best_candidate['InferenceContainers']

In [52]:
for container in best_candidate_containers:
    print(container['Image'])
    print(container['ModelDataUrl'])
    print('======================')

246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-sklearn-automl:0.2-1-cpu-py3
s3://sagemaker-us-west-2-085964654406/models/autopilot/automl-dm-26-16-00-25/data-processor-models/automl-dm--dpp2-1-d4672888ec294b6e81bde5bdd6915697815849129e314/output/model.tar.gz
246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3
s3://sagemaker-us-west-2-085964654406/models/autopilot/automl-dm-26-16-00-25/tuning/automl-dm--dpp2-xgb/tuning-job-1-f0b555e532ea4856a2-003-79e5e79f/output/model.tar.gz
246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-sklearn-automl:0.2-1-cpu-py3
s3://sagemaker-us-west-2-085964654406/models/autopilot/automl-dm-26-16-00-25/data-processor-models/automl-dm--dpp2-1-d4672888ec294b6e81bde5bdd6915697815849129e314/output/model.tar.gz


# Update Containers To Show Predicted Label and Confidence Score

In [53]:
for container in best_candidate_containers:
    print(container['Environment'])
    print('======================')

{'AUTOML_SPARSE_ENCODE_RECORDIO_PROTOBUF': '1', 'AUTOML_TRANSFORM_MODE': 'feature-transform', 'SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT': 'application/x-recordio-protobuf', 'SAGEMAKER_PROGRAM': 'sagemaker_serve', 'SAGEMAKER_SUBMIT_DIRECTORY': '/opt/ml/model/sagemaker_serve.py'}
{'MAX_CONTENT_LENGTH': '20971520', 'SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT': 'text/csv', 'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label', 'SAGEMAKER_INFERENCE_SUPPORTED': 'predicted_label,probability,probabilities'}
{'AUTOML_TRANSFORM_MODE': 'inverse-label-transform', 'SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT': 'text/csv', 'SAGEMAKER_INFERENCE_INPUT': 'predicted_label', 'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label', 'SAGEMAKER_INFERENCE_SUPPORTED': 'predicted_label,probability,labels,probabilities', 'SAGEMAKER_PROGRAM': 'sagemaker_serve', 'SAGEMAKER_SUBMIT_DIRECTORY': '/opt/ml/model/sagemaker_serve.py'}


In [54]:
best_candidate_containers[1]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label, probability'})
best_candidate_containers[2]['Environment'].update({'SAGEMAKER_INFERENCE_INPUT': 'predicted_label, probability'})
best_candidate_containers[2]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label, probability'})

In [55]:
for container in best_candidate_containers:
    print(container['Environment'])
    print('======================')

{'AUTOML_SPARSE_ENCODE_RECORDIO_PROTOBUF': '1', 'AUTOML_TRANSFORM_MODE': 'feature-transform', 'SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT': 'application/x-recordio-protobuf', 'SAGEMAKER_PROGRAM': 'sagemaker_serve', 'SAGEMAKER_SUBMIT_DIRECTORY': '/opt/ml/model/sagemaker_serve.py'}
{'MAX_CONTENT_LENGTH': '20971520', 'SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT': 'text/csv', 'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label, probability', 'SAGEMAKER_INFERENCE_SUPPORTED': 'predicted_label,probability,probabilities'}
{'AUTOML_TRANSFORM_MODE': 'inverse-label-transform', 'SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT': 'text/csv', 'SAGEMAKER_INFERENCE_INPUT': 'predicted_label, probability', 'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label, probability', 'SAGEMAKER_INFERENCE_SUPPORTED': 'predicted_label,probability,labels,probabilities', 'SAGEMAKER_PROGRAM': 'sagemaker_serve', 'SAGEMAKER_SUBMIT_DIRECTORY': '/opt/ml/model/sagemaker_serve.py'}
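
The three `update` calls above can also be written as a small helper that checks each requested key against a container's `SAGEMAKER_INFERENCE_SUPPORTED` list before changing anything. This is a minimal sketch, not part of the Autopilot API: `set_inference_output` is a hypothetical name, the example environments are trimmed-down copies of the ones printed above, and the `', '` separator simply mirrors the value used in this notebook.

```python
import copy


def set_inference_output(containers, keys):
    """Return a copy of `containers` whose inference output is set to `keys`.

    Hypothetical helper (not part of boto3 or the SageMaker SDK).
    Containers that do not declare SAGEMAKER_INFERENCE_SUPPORTED, such as
    the feature-transform container, are left unchanged.
    """
    updated = copy.deepcopy(containers)
    value = ', '.join(keys)  # same separator this notebook uses
    for container in updated:
        env = container['Environment']
        if 'SAGEMAKER_INFERENCE_SUPPORTED' not in env:
            continue
        supported = {k.strip() for k in env['SAGEMAKER_INFERENCE_SUPPORTED'].split(',')}
        missing = [k for k in keys if k not in supported]
        if missing:
            raise ValueError('Unsupported inference keys: {}'.format(missing))
        # A container that already declares SAGEMAKER_INFERENCE_INPUT consumes
        # the previous container's output, so keep input and output aligned.
        if 'SAGEMAKER_INFERENCE_INPUT' in env:
            env['SAGEMAKER_INFERENCE_INPUT'] = value
        env['SAGEMAKER_INFERENCE_OUTPUT'] = value
    return updated


# Trimmed-down copies of the environments printed above.
example_containers = [
    {'Environment': {'AUTOML_TRANSFORM_MODE': 'feature-transform'}},
    {'Environment': {'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label',
                     'SAGEMAKER_INFERENCE_SUPPORTED': 'predicted_label,probability,probabilities'}},
    {'Environment': {'AUTOML_TRANSFORM_MODE': 'inverse-label-transform',
                     'SAGEMAKER_INFERENCE_INPUT': 'predicted_label',
                     'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label',
                     'SAGEMAKER_INFERENCE_SUPPORTED': 'predicted_label,probability,labels,probabilities'}},
]

result = set_inference_output(example_containers, ['predicted_label', 'probability'])
for container in result:
    print(container['Environment'])
```

Because the helper works on a deep copy, the original container list is untouched until the result is explicitly passed on, which makes a mistyped key fail early rather than at inference time.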


# Autopilot Chooses XGBoost as Best Candidate!

Note that Autopilot chose different hyperparameters and feature transformations from those we used in our own XGBoost model.

# Deploy the Model as a REST Endpoint
Batch transform jobs are also supported, but for now we will deploy the model to a real-time REST endpoint.

In [56]:
print(best_candidate['InferenceContainers'])

[{'Image': '246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-sklearn-automl:0.2-1-cpu-py3', 'ModelDataUrl': 's3://sagemaker-us-west-2-085964654406/models/autopilot/automl-dm-26-16-00-25/data-processor-models/automl-dm--dpp2-1-d4672888ec294b6e81bde5bdd6915697815849129e314/output/model.tar.gz', 'Environment': {'AUTOML_SPARSE_ENCODE_RECORDIO_PROTOBUF': '1', 'AUTOML_TRANSFORM_MODE': 'feature-transform', 'SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT': 'application/x-recordio-protobuf', 'SAGEMAKER_PROGRAM': 'sagemaker_serve', 'SAGEMAKER_SUBMIT_DIRECTORY': '/opt/ml/model/sagemaker_serve.py'}}, {'Image': '246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3', 'ModelDataUrl': 's3://sagemaker-us-west-2-085964654406/models/autopilot/automl-dm-26-16-00-25/tuning/automl-dm--dpp2-xgb/tuning-job-1-f0b555e532ea4856a2-003-79e5e79f/output/model.tar.gz', 'Environment': {'MAX_CONTENT_LENGTH': '20971520', 'SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT': 'text/csv', 'SAGEMAKER_INFERENCE_OUTPUT'

In [57]:
model_name = 'automl-dm-model-' + timestamp_suffix

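# create_model returns a response dict; the ARN is stored under the 'ModelArn' key.
create_model_response = sm.create_model(Containers=best_candidate['InferenceContainers'],
                                        ModelName=model_name,
                                        ExecutionRoleArn=role)

print('Best candidate model ARN: ', create_model_response['ModelArn'])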

Best candidate model ARN:  arn:aws:sagemaker:us-west-2:085964654406:model/automl-dm-model-26-16-00-25


# Define EndpointConfig Name

In [58]:
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
epc_name = 'automl-dm-epc-' + timestamp_suffix

print(epc_name)

automl-dm-epc-26-16-21-49


# Define REST Endpoint Name for Autopilot Model

In [59]:
autopilot_endpoint_name = 'automl-dm-ep-' + timestamp_suffix
variant_name = 'automl-dm-variant-' + timestamp_suffix

print(autopilot_endpoint_name)
print(variant_name)

automl-dm-ep-26-16-21-49
automl-dm-variant-26-16-21-49
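
The naming cells above derive all three names from one timestamp suffix. The sketch below wraps that pattern in a single function and validates the result before any API call is made. `build_endpoint_names` is a hypothetical helper, and the naming rule it checks (alphanumerics and hyphens, at most 63 characters) is stated here as an assumption about SageMaker's limits rather than taken from this notebook. Note that the `'%d-%H-%M-%S'` format carries no month or year, so names can repeat from one month to the next.

```python
import re
from time import gmtime, strftime

# Assumed SageMaker naming rule: alphanumerics and hyphens, at most 63 characters.
NAME_PATTERN = re.compile(r'[a-zA-Z0-9](-*[a-zA-Z0-9])*')
MAX_NAME_LENGTH = 63


def build_endpoint_names(prefix='automl-dm', when=None):
    """Build endpoint config, endpoint, and variant names from one timestamp.

    Hypothetical helper; `when` is an optional time.struct_time so the
    result can be reproduced. It defaults to the current UTC time.
    """
    suffix = strftime('%d-%H-%M-%S', gmtime() if when is None else when)
    names = {
        'endpoint_config': '{}-epc-{}'.format(prefix, suffix),
        'endpoint': '{}-ep-{}'.format(prefix, suffix),
        'variant': '{}-variant-{}'.format(prefix, suffix),
    }
    for name in names.values():
        if len(name) > MAX_NAME_LENGTH or not NAME_PATTERN.fullmatch(name):
            raise ValueError('Invalid resource name: {}'.format(name))
    return names


names = build_endpoint_names()
print(names['endpoint_config'])
print(names['endpoint'])
print(names['variant'])
```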


In [60]:
ep_config = sm.create_endpoint_config(EndpointConfigName=epc_name,
                                      ProductionVariants=[{'InstanceType': 'ml.m5.large',
                                                           'InitialInstanceCount': 1,
                                                           'ModelName': model_name,
                                                           'VariantName': variant_name}])

In [61]:
create_endpoint_response = sm.create_endpoint(EndpointName=autopilot_endpoint_name,
                                              EndpointConfigName=epc_name)
print(create_endpoint_response['EndpointArn'])

arn:aws:sagemaker:us-west-2:085964654406:endpoint/automl-dm-ep-26-16-21-49
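
`create_endpoint` returns as soon as the request is accepted, and the endpoint cannot serve predictions until its status reaches `InService`. The sketch below shows one way to poll for that status with `describe_endpoint`. `wait_for_endpoint` and `FakeSageMakerClient` are hypothetical names; the fake client exists only so the example runs without AWS access, and the polling interval and limit are arbitrary defaults.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, max_polls=60,
                      sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService.

    Raises RuntimeError if the endpoint reports 'Failed' and TimeoutError
    if it is still not InService after `max_polls` checks.
    """
    for _ in range(max_polls):
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description['EndpointStatus']
        if status == 'InService':
            return status
        if status == 'Failed':
            reason = description.get('FailureReason', 'no reason reported')
            raise RuntimeError('Endpoint {} failed: {}'.format(endpoint_name, reason))
        sleep(poll_seconds)
    raise TimeoutError('Endpoint {} not InService after {} polls'.format(
        endpoint_name, max_polls))


class FakeSageMakerClient:
    """Stand-in for the boto3 client so the sketch runs without AWS access."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        return {'EndpointName': EndpointName,
                'EndpointStatus': next(self._statuses)}


fake_client = FakeSageMakerClient(['Creating', 'Creating', 'InService'])
final_status = wait_for_endpoint(fake_client, 'automl-dm-ep-example',
                                 sleep=lambda seconds: None)
print(final_status)
```

In this notebook the call would be `wait_for_endpoint(sm, autopilot_endpoint_name)`. boto3 also ships a built-in waiter for the same purpose: `sm.get_waiter('endpoint_in_service').wait(EndpointName=autopilot_endpoint_name)`.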


In [62]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/endpoints/{}">SageMaker REST Endpoint</a></b>'.format(region, autopilot_endpoint_name)))

# Store Variables for the Next Notebooks

In [63]:
%store autopilot_endpoint_name

Stored 'autopilot_endpoint_name' (str)


In [64]:
%store

Stored variables and their in-db values:
auto_ml_job_name                        -> 'automl-dm-26-16-00-25'
autopilot_endpoint_name                 -> 'automl-dm-ep-26-16-21-49'
autopilot_train_s3_uri                  -> 's3://sagemaker-us-west-2-085964654406/data/amazon
setup_dependencies_passed               -> True
setup_iam_roles_passed                  -> True
setup_instance_check_passed             -> True
setup_s3_bucket_passed                  -> True


# Summary
We used Autopilot to automatically find the best model, hyperparameters, and feature-engineering scripts for our dataset.

Autopilot takes a transparent approach: it generates reusable exploration Jupyter notebooks and transformation Python scripts, so we can continue to train and deploy our model on new data well after this initial interaction with the Autopilot service.

In [None]:
%%javascript
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();