## Introduction
Amazon SageMaker Autopilot is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use SageMaker Autopilot in different ways: on autopilot (hence the name) or with human guidance, without code through SageMaker Studio, or using the AWS SDKs.

In [2]:
%%time
import sagemaker
import os
import boto3
import re
import pandas as pd
import numpy as np
from sagemaker import get_execution_role
import matplotlib.pyplot as plt

role = get_execution_role()
region = boto3.Session().region_name
session = sagemaker.Session()
bucket = 'sainsbury-workshop-autopilot'  # replace with the name of an S3 bucket you have created
prefix = 'sagemaker/DEMO-xgboost-autopilot'
sm = boto3.Session().client(service_name='sagemaker',region_name=region)

CPU times: user 781 ms, sys: 96.9 ms, total: 878 ms
Wall time: 1.47 s


In [3]:
# customize to your bucket where you have stored the data
bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region,bucket)

In [4]:
items = pd.read_csv('data/items.csv')
holidays_events = pd.read_csv('data/holidays_events.csv')
oil = pd.read_csv('data/oil.csv')
stores = pd.read_csv('data/stores.csv')
transactions = pd.read_csv('data/transactions.csv')
df_train = pd.read_csv('data/train.csv')



In [5]:

holidays_events = holidays_events[['date','type']]
holidays_events.head()


Unnamed: 0,date,type
0,2012-03-02,Holiday
1,2012-04-01,Holiday
2,2012-04-12,Holiday
3,2012-04-14,Holiday
4,2012-04-21,Holiday


In [6]:
df_train = df_train.sample(frac=0.5, replace=False, random_state=1)
df_train.shape


(62748520, 6)

In [7]:
train = pd.merge(df_train, stores, on= "store_nbr")
train = pd.merge(train, items, on= "item_nbr")
train = pd.merge(train, holidays_events, on="date")
train = pd.merge(train, oil, on ="date")

In [8]:
# Overwrite negative unit_sales with 1; log1p (used below) is undefined for values <= -1.
train.loc[train.unit_sales < 0, 'unit_sales'] = 1

def lagged_rolling_mean(window):
    # Mean of the previous `window` rows in each group; shift() excludes the current row.
    return lambda x: x.shift().rolling(window, min_periods=1).mean()

# transform() keeps the original row index, so results align with `train` on assignment.
# The windows follow the current row order of `train`; sort by date first if chronological windows are intended.
by_item = train.groupby(['item_nbr', 'store_nbr'])['unit_sales']
by_family = train.groupby(['family', 'store_nbr'])['unit_sales']
train['unit_rolling_mean_5'] = by_item.transform(lagged_rolling_mean(5))
train['unit_rolling_mean_family_5'] = by_family.transform(lagged_rolling_mean(5))
train['unit_rolling_mean_30'] = by_item.transform(lagged_rolling_mean(30))
train['unit_rolling_mean_family_30'] = by_family.transform(lagged_rolling_mean(30))

In [9]:
train['month'] = pd.DatetimeIndex(train['date']).month
train['dow'] = pd.DatetimeIndex(train['date']).dayofweek

In [10]:
# pd.np was removed from pandas; call NumPy directly (np is imported above).
train['unit_log_sales'] = np.log1p(train['unit_sales'])
train['unit_log_rolling_mean_5'] = np.log1p(train['unit_rolling_mean_5'])
train['unit_log_rolling_mean_family_5'] = np.log1p(train['unit_rolling_mean_family_5'])
train['unit_log_rolling_mean_30'] = np.log1p(train['unit_rolling_mean_30'])
train['unit_log_rolling_mean_family_30'] = np.log1p(train['unit_rolling_mean_family_30'])
train.tail()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion,city,state,type_x,cluster,...,unit_rolling_mean_family_5,unit_rolling_mean_30,unit_rolling_mean_family_30,month,dow,unit_log_sales,unit_log_rolling_mean_5,unit_log_rolling_mean_family_5,unit_log_rolling_mean_30,unit_log_rolling_mean_family_30
7599991,446,2013-01-01,25,911429,7.0,,Salinas,Santa Elena,D,1,...,1.4,5.233333,2.666667,1,1,2.079442,2.028148,0.875469,1.829911,1.299283
7599992,296,2013-01-01,25,638327,2.0,,Salinas,Santa Elena,D,1,...,2.6,1.333333,2.866667,1,1,1.098612,0.847298,1.280934,0.847298,1.352393
7599993,544,2013-01-01,25,1071949,6.0,,Salinas,Santa Elena,D,1,...,3.4,5.466667,8.833333,1,1,1.94591,1.526056,1.481605,1.866661,2.285778
7599994,185,2013-01-01,25,420720,2.0,,Salinas,Santa Elena,D,1,...,3.8,2.733333,4.833333,1,1,1.098612,1.163151,1.568616,1.317301,1.763589
7599995,149,2013-01-01,25,363868,2.0,,Salinas,Santa Elena,D,1,...,2.8,4.6,2.9,1,1,1.098612,1.648659,1.335001,1.722767,1.360977


In [11]:
train = train.drop(['date','id','store_nbr', 'item_nbr','unit_sales','unit_rolling_mean_5','unit_rolling_mean_family_5','unit_rolling_mean_30','unit_rolling_mean_family_30'], axis=1)
for col in ['cluster', 'class', 'perishable','month','dow']:
    train[col] = train[col].astype('category')
type(train['cluster'])
train.head()

Unnamed: 0,onpromotion,city,state,type_x,cluster,family,class,perishable,type_y,dcoilwtico,month,dow,unit_log_sales,unit_log_rolling_mean_5,unit_log_rolling_mean_family_5,unit_log_rolling_mean_30,unit_log_rolling_mean_family_30
0,False,Machala,El Oro,D,4,CLEANING,3034,0,Holiday,99.69,5,3,1.098612,,,,
1,False,Quito,Pichincha,A,5,CLEANING,3034,0,Holiday,99.69,5,3,1.098612,,,,
2,False,Cuenca,Azuay,D,2,CLEANING,3034,0,Holiday,99.69,5,3,1.098612,,,,
3,False,Latacunga,Cotopaxi,C,15,CLEANING,3034,0,Holiday,99.69,5,3,0.693147,,,,
4,False,Guayaquil,Guayas,E,10,CLEANING,3034,0,Holiday,99.69,5,3,2.70805,,,,


In [14]:
train_data, test_data = np.split(train.sample(frac=1, random_state=1729), [int(0.8 * len(train))])

In [15]:
print(train_data.head())

        onpromotion       city       state type_x cluster     family class  \
4166185       False  Guayaquil      Guayas      D       1  BEVERAGES  1114   
7331994         NaN     Ambato  Tungurahua      A      14  GROCERY I  1072   
1141056       False      Quito   Pichincha      C      12   CLEANING  3032   
2022293       False      Quito   Pichincha      D       8  GROCERY I  1040   
3978783       False  Guayaquil      Guayas      A      17   CLEANING  3044   

        perishable    type_y  dcoilwtico month dow  unit_log_sales  \
4166185          0   Holiday       47.83     5   4        1.098612   
7331994          0   Holiday       90.74     5   2        2.639057   
1141056          0     Event       43.65     5   1        1.609438   
2022293          0     Event      104.06     7   1        2.833213   
3978783          0  Transfer       49.58     5   4        1.609438   

         unit_log_rolling_mean_5  unit_log_rolling_mean_family_5  \
4166185                 1.609438          

In [17]:
# The training file keeps its header row; Autopilot needs it to find the target column.
train_file = 'train_data.csv'
train_data.to_csv(train_file, index=False, header=True)
# The test file for batch inference has no target column and no header row.
test_file = 'test_data.csv'
test_data_no_target = test_data.drop(columns=['unit_log_sales'])
test_data_no_target.to_csv(test_file, index=False, header=False)

s3 = boto3.Session().resource('s3')
s3.Bucket(bucket).Object(os.path.join(prefix, 'train', train_file)).upload_file(train_file)
s3.Bucket(bucket).Object(os.path.join(prefix, 'test', test_file)).upload_file(test_file)


## Setting up the SageMaker Autopilot Job<a name="Settingup"></a>

After uploading the dataset to Amazon S3, you can invoke Autopilot to find the best ML pipeline to train a model on this dataset. 

The required inputs for invoking an Autopilot job are:
* Amazon S3 locations for the input dataset and for all output artifacts
* Name of the dataset column you want to predict (`unit_log_sales` in this case)
* An IAM role

Currently Autopilot supports only tabular datasets in CSV format. Either all files should have a header row, or the first file of the dataset, when sorted in alphabetical/lexical order, is expected to have a header row.
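
Because Autopilot locates the target by column name, it can help to confirm that the header row of the training CSV actually contains that name before starting a job. The following is a minimal sketch of such a check; the helper `header_has_target` and the sample rows are illustrative assumptions, not part of the original notebook or of any AWS API.

```python
import csv
import io


def header_has_target(csv_text, target_column):
    """Return True if the first row of csv_text contains target_column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    return target_column in header


# Tiny in-memory stand-in for train_data.csv (illustrative values only).
sample = "onpromotion,city,unit_log_sales\nFalse,Quito,1.098612\n"
print(header_has_target(sample, "unit_log_sales"))
```

For the real file, the same check can be run on just the first line, for example `header_has_target(open(train_file).readline(), 'unit_log_sales')`.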

In [18]:
input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': 's3://{}/{}/train'.format(bucket,prefix)
        }
      },
      'TargetAttributeName': 'unit_log_sales'
    }
  ]

output_data_config = {
    'S3OutputPath': 's3://{}/{}/output'.format(bucket,prefix)
  }

In [21]:
from time import gmtime, strftime, sleep
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())

auto_ml_job_name = 'automl-kaggle-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      RoleArn=role)

AutoMLJobName: automl-kaggle-25-20-30-03


{'AutoMLJobArn': 'arn:aws:sagemaker:us-east-1:208480242416:automl-job/automl-kaggle-25-20-30-03',
 'ResponseMetadata': {'RequestId': '7dde419a-767e-4a30-bed8-cc78b5650ace',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '7dde419a-767e-4a30-bed8-cc78b5650ace',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '96',
   'date': 'Tue, 25 Feb 2020 20:30:03 GMT'},
  'RetryAttempts': 0}}

## Tracking SageMaker Autopilot job progress<a name="Tracking"></a>
A SageMaker Autopilot job consists of the following high-level steps:
* Analyzing Data, where the dataset is analyzed and Autopilot comes up with a list of ML pipelines that should be tried out on the dataset. The dataset is also split into train and validation sets.
* Feature Engineering, where Autopilot performs feature transformation on individual features of the dataset as well as at an aggregate level.
* Model Tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline). 
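
Working through these steps can take a long time, so a polling loop with an upper time limit may be safer than waiting indefinitely. The sketch below is one possible shape for such a helper. The name `wait_for_job` and its parameters are assumptions made for illustration; the terminal states are the ones this notebook checks for, and `describe` stands for any callable that returns a status dictionary.

```python
import time

TERMINAL_STATES = ("Failed", "Completed", "Stopped")


def wait_for_job(describe, status_key, poll_seconds=30, timeout_seconds=3600,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll describe() until status_key is terminal or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        response = describe()
        status = response[status_key]
        if status in TERMINAL_STATES:
            return response
        if clock() >= deadline:
            raise TimeoutError(
                "job still {} after {}s".format(status, timeout_seconds))
        sleep(poll_seconds)
```

With the client created earlier, a call could look like `wait_for_job(lambda: sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name), 'AutoMLJobStatus')`. Passing `sleep` and `clock` as parameters keeps the helper testable without real waiting.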

In [24]:
print('JobStatus - Secondary Status')
print('------------------------------')

describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_run_status = describe_response['AutoMLJobStatus']
print(job_run_status + " - " + describe_response['AutoMLJobSecondaryStatus'])

# Poll every 30 seconds until the job reaches a terminal state.
while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    sleep(30)
    describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response['AutoMLJobStatus']
    print(job_run_status + " - " + describe_response['AutoMLJobSecondaryStatus'])

JobStatus - Secondary Status
------------------------------
Completed - MaxCandidatesReached


In [25]:
best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']
best_candidate_name = best_candidate['CandidateName']
print(best_candidate)
print('\n')
print("CandidateName: " + best_candidate_name)
print("FinalAutoMLJobObjectiveMetricName: " + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
print("FinalAutoMLJobObjectiveMetricValue: " + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))

{'CandidateName': 'tuning-job-1-16dfc46c15f94841b4-202-f414791f', 'FinalAutoMLJobObjectiveMetric': {'MetricName': 'validation:mse', 'Value': 0.29326900839805603}, 'ObjectiveStatus': 'Succeeded', 'CandidateSteps': [{'CandidateStepType': 'AWS::SageMaker::ProcessingJob', 'CandidateStepArn': 'arn:aws:sagemaker:us-east-1:208480242416:processing-job/db-1-a11387f1ba86405ea94da87383456cf0ffa0a0ae17924decaf375e0a24', 'CandidateStepName': 'db-1-a11387f1ba86405ea94da87383456cf0ffa0a0ae17924decaf375e0a24'}, {'CandidateStepType': 'AWS::SageMaker::TrainingJob', 'CandidateStepArn': 'arn:aws:sagemaker:us-east-1:208480242416:training-job/automl-kag-dpp1-1-57da327e049e4d95ac88003f24fa897d899e2caf77a44', 'CandidateStepName': 'automl-kag-dpp1-1-57da327e049e4d95ac88003f24fa897d899e2caf77a44'}, {'CandidateStepType': 'AWS::SageMaker::TransformJob', 'CandidateStepArn': 'arn:aws:sagemaker:us-east-1:208480242416:transform-job/automl-kag-dpp1-rpb-1-429ae0423a494abc886765f1859f917ad716d099e', 'CandidateStepName':

## Perform batch inference using the best candidate

In [26]:
model_name = 'automl-kaggle-regression-model-' + timestamp_suffix

model = sm.create_model(Containers=best_candidate['InferenceContainers'],
                            ModelName=model_name,
                            ExecutionRoleArn=role)

print('Model ARN corresponding to the best candidate is : {}'.format(model['ModelArn']))

Model ARN corresponding to the best candidate is : arn:aws:sagemaker:us-east-1:208480242416:model/automl-kaggle-regression-model-25-20-30-03


You can run batch inference with Amazon SageMaker batch transform. The same model can also be deployed for online inference using Amazon SageMaker hosting.
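
The notebook continues with batch transform. For the hosting alternative, a minimal sketch of the two requests involved is shown here. The helper `build_endpoint_requests`, the endpoint and model names, the variant name, and the `ml.m5.xlarge` instance type are all assumptions chosen for illustration; only the request field names follow the boto3 `create_endpoint_config` and `create_endpoint` calls.

```python
def build_endpoint_requests(model_name, endpoint_name,
                            instance_type="ml.m5.xlarge", instance_count=1):
    """Build keyword arguments for create_endpoint_config and create_endpoint."""
    config_name = endpoint_name + "-config"
    endpoint_config = {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InitialInstanceCount": instance_count,
            "InstanceType": instance_type,
        }],
    }
    endpoint = {"EndpointName": endpoint_name,
                "EndpointConfigName": config_name}
    return endpoint_config, endpoint


# Hypothetical names; in the notebook, model_name comes from the cell above.
config_kwargs, endpoint_kwargs = build_endpoint_requests(
    "automl-kaggle-regression-model-demo", "automl-kaggle-endpoint-demo")
print(config_kwargs["ProductionVariants"][0]["ModelName"])

# With the notebook's SageMaker client `sm`, the calls would be:
#   sm.create_endpoint_config(**config_kwargs)
#   sm.create_endpoint(**endpoint_kwargs)
```

Building the request dictionaries separately from the client calls makes it possible to inspect them before any resources are created.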

In [27]:
transform_job_name = 'automl-kaggle-regression-transform-' + timestamp_suffix

transform_input = {
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://{}/{}/test/test_data.csv'.format(bucket,prefix)
            }
        },
        'ContentType': 'text/csv',
        'CompressionType': 'None',
        'SplitType': 'Line'
    }

transform_output = {
        'S3OutputPath': 's3://{}/{}/inference-results'.format(bucket,prefix),
    }

transform_resources = {
        'InstanceType': 'ml.m5.4xlarge',
        'InstanceCount': 1
    }

sm.create_transform_job(TransformJobName = transform_job_name,
                        ModelName = model_name,
                        TransformInput = transform_input,
                        TransformOutput = transform_output,
                        TransformResources = transform_resources
)

{'TransformJobArn': 'arn:aws:sagemaker:us-east-1:208480242416:transform-job/automl-kaggle-regression-transform-25-20-30-03',
 'ResponseMetadata': {'RequestId': '31d06955-16c2-4a43-bcdf-adf25703a04a',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '31d06955-16c2-4a43-bcdf-adf25703a04a',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '123',
   'date': 'Wed, 26 Feb 2020 09:55:58 GMT'},
  'RetryAttempts': 0}}

In [None]:
print('JobStatus')
print('----------')

describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
job_run_status = describe_response['TransformJobStatus']
print(job_run_status)

# Poll every 30 seconds until the transform job reaches a terminal state.
while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    sleep(30)
    describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
    job_run_status = describe_response['TransformJobStatus']
    print(job_run_status)

JobStatus
----------
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
InProgress
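
Once the transform job completes, its results are written under the `S3OutputPath` configured above. Because the model was trained on `unit_log_sales`, which is `log1p(unit_sales)`, its predictions are on the same log scale and need `expm1` to be read as unit sales. The sketch below assumes the output contains one numeric prediction per line; the helper `to_unit_sales` and the example values are illustrative, not real job output.

```python
import math


def to_unit_sales(prediction_lines):
    """Convert log1p-scale predictions (one per line) back to unit sales."""
    values = []
    for line in prediction_lines.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            values.append(math.expm1(float(line)))
    return values


# Illustrative predictions, not real job output.
example_output = "1.098612\n2.079442\n"
print([round(v, 2) for v in to_unit_sales(example_output)])
```

The two example values correspond to roughly 2 and 7 units, matching the `unit_sales` and `unit_log_sales` pairs shown in the training data earlier.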


## Candidate Generation Notebook

SageMaker Autopilot also auto-generates a Candidate Definitions notebook. You can use it to step interactively through the stages SageMaker Autopilot followed to arrive at the best candidate, and to override runtime parameters such as parallelism, hardware used, algorithms explored, and feature extraction scripts.

The notebook can be downloaded from the following Amazon S3 location:


In [None]:
sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation']


### Data Exploration Notebook
SageMaker Autopilot also auto-generates a Data Exploration notebook, which can be downloaded from the following Amazon S3 location:

In [None]:
sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobArtifacts']['DataExplorationNotebookLocation']
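
Both locations are returned as `s3://` URIs. A minimal sketch for fetching either notebook to local disk follows. The helpers `split_s3_uri` and `download_notebook` and the example URI are assumptions for illustration; the download itself uses the boto3 S3 client's `download_file` and needs valid AWS credentials.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


def download_notebook(uri, local_path):
    """Download the object at an S3 URI to local_path."""
    import boto3  # imported here so the parsing helper works without boto3
    bucket_name, key = split_s3_uri(uri)
    boto3.client("s3").download_file(bucket_name, key, local_path)


# Hypothetical URI; real locations come from describe_auto_ml_job.
print(split_s3_uri("s3://example-bucket/output/notebooks/Candidate.ipynb"))
```

For example, `download_notebook(location, 'candidate_definition.ipynb')` would save the notebook locally, where `location` is the string returned by one of the cells above and the file name is your choice.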