## Extending Amazon SageMaker Autopilot to Custom Code

Amazon SageMaker Autopilot generates a series of artifacts during the AutoML job, allowing you to download, explore, re-use, or customize any part of the Autopilot pipeline.

In this notebook, we will learn how to re-use the artifacts that an Autopilot job generates for a given model.

The following diagram illustrates, at a high level, the steps Autopilot follows and the artifacts it generates and stores in Amazon S3.

<img src="./images/Autopilot_diagram.png" width="1000"/>



-----

### 1. Setting up Libraries and Variables

Let's start by ensuring we have an updated SageMaker SDK in our kernel, and importing some libraries.

In [2]:
!pip install -qU awscli boto3 sagemaker sagemaker-experiments


In [39]:
import boto3, sagemaker
import pandas as pd
import re, os

**Replace the following variables with the artifact locations shown in the Model Details of your Autopilot candidate**

You can find this information in SageMaker Studio as follows:
* Open the "SageMaker Resources" tab (left menu in Studio)
* Select "Experiments and trials" from the dropdown
* Select your Autopilot job name, right-click, and choose "Describe AutoML Job"
* In the window that opens, select the model you want to use as a base, right-click, and choose "Open in model details"
* In the new tab, select the "Artifacts" tab and copy the URLs shown into the variables in the following cell

<img src="./images/studio.png" width="1000"/>

In [40]:
#Input artifacts...

input_data = 'https://eu-west-1.console.aws.amazon.com/s3/object/rodzanto2021ml/verisure/all_data/output_1659100501/part-00000-f967beaa-76d8-45cc-ba3a-989f1938acbe-c000.csv'
shuffled_split_data = 'https://console.aws.amazon.com/s3/buckets/rodzanto2021ml/verisure/output_1659100501/verisure-02/preprocessed-data/tuning_data/'
transformed_data = 'https://console.aws.amazon.com/s3/buckets/rodzanto2021ml/verisure/output_1659100501/verisure-02/transformed-data/dpp4/rpb/'
feature_engineering_code = 'https://eu-west-1.console.aws.amazon.com/s3/object/rodzanto2021ml/verisure/output_1659100501/verisure-02/sagemaker-automl-candidates/verisure-02-pr-1-5f3f81f452c545f29ec1075d7ff1211c0d06d394097340/generated_module/candidate_data_processors/dpp4.py'
feature_engineering_model = 'https://eu-west-1.console.aws.amazon.com/s3/object/rodzanto2021ml/verisure/output_1659100501/verisure-02/data-processor-models/verisure-02-dpp4-1-78a0076f02bd4f9c8871d70155985ee554cc6b54e7e0/output/model.tar.gz'
algorithm_model = 'https://eu-west-1.console.aws.amazon.com/s3/object/rodzanto2021ml/verisure/output_1659100501/verisure-02/tuning/verisure-0-dpp4-xgb/verisure-02v7z745kGnyMtSvlKTHFsP-006-aeea7101/output/model.tar.gz'
explainability = 'https://console.aws.amazon.com/s3/buckets/rodzanto2021ml/verisure/output_1659100501/verisure-02/documentation/explainability/output/verisure-02v7z745kGnyMtSvlKTHFsP-006-aeea7101/'


------

### 2. Explore data

Autopilot shuffles and splits the original input dataset into training and validation folders. It also splits the data into CSV chunks for better performance.

In [41]:
if not os.path.exists('./artifacts'):
    os.makedirs('./artifacts')

s3 = boto3.client('s3')

bucket = input_data[input_data.index('object/')+len('object/') : input_data.index('/', input_data.index('object/')+len('object/')+1)]
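The index arithmetic used in this notebook works for the URLs above, but it is easy to get wrong when a URL has a slightly different shape. As a minimal sketch, assuming path-style console URLs like the ones pasted in the previous cell (`/s3/object/<bucket>/<key>` or `/s3/buckets/<bucket>/<prefix>`), a small helper can return the bucket and key in one step. The function name `console_url_to_s3` and the example URLs are hypothetical.

```python
from urllib.parse import unquote, urlparse


def console_url_to_s3(url):
    """Split a path-style S3 console URL into (bucket, key_or_prefix)."""
    path = unquote(urlparse(url).path)
    for marker in ("/s3/object/", "/s3/buckets/"):
        if marker in path:
            # Everything after the marker is "<bucket>/<key or prefix>"
            remainder = path.split(marker, 1)[1]
            bucket_name, _, key = remainder.partition("/")
            return bucket_name, key
    raise ValueError("Not a recognised S3 console URL: {}".format(url))


# Hypothetical URLs with the same shape as the artifact URLs above
print(console_url_to_s3(
    "https://eu-west-1.console.aws.amazon.com/s3/object/my-bucket/job/output/model.tar.gz"
))
print(console_url_to_s3(
    "https://console.aws.amazon.com/s3/buckets/my-bucket/job/transformed-data/"
))
```

The returned key can be passed directly to `s3.download_file`, and the prefix form keeps its trailing slash so it can be concatenated with file names.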

The results of the data exploration performed by SageMaker Autopilot can be checked directly from the **Data Exploration Notebook**. Let's download this notebook for review:

In [42]:
#Data Exploration notebook...
s3_exploration_notebook = feature_engineering_code[feature_engineering_code.index(bucket)+len(bucket)+1 : feature_engineering_code.index('generated_module')] + 'notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb'
print('data_exploration_notebook:\ns3://{}/{}'.format(bucket, s3_exploration_notebook))
s3.download_file(bucket, s3_exploration_notebook, 'artifacts/SageMakerAutopilotDataExplorationNotebook.ipynb')

data_exploration_notebook:
s3://rodzanto2021ml/verisure/output_1659100501/verisure-02/sagemaker-automl-candidates/verisure-02-pr-1-5f3f81f452c545f29ec1075d7ff1211c0d06d394097340/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb


Also, the whole Autopilot process can be reproduced from the generated **Candidate Definition Notebook**. Let's download this notebook for further exploration as well:

In [43]:
#Candidate Definition notebook...
s3_candidates_notebook = feature_engineering_code[feature_engineering_code.index(bucket)+len(bucket)+1 : feature_engineering_code.index('generated_module')] + 'notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb'
print('candidate_definition_notebook:\ns3://{}/{}'.format(bucket, s3_candidates_notebook))
s3.download_file(bucket, s3_candidates_notebook, 'artifacts/SageMakerAutopilotCandidateDefinitionNotebook.ipynb')

candidate_definition_notebook:
s3://rodzanto2021ml/verisure/output_1659100501/verisure-02/sagemaker-automl-candidates/verisure-02-pr-1-5f3f81f452c545f29ec1075d7ff1211c0d06d394097340/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb


In [45]:
#Shuffled & split data sample...
s3_shuffled_split = shuffled_split_data[shuffled_split_data.index(bucket)+len(bucket)+1 : shuffled_split_data.index('tuning_data')+12]
print('bucket: {}'.format(bucket))
print('s3_shuffled_split: s3://{}/{}'.format(bucket, s3_shuffled_split))

bucket: rodzanto2021ml
s3_shuffled_split: s3://rodzanto2021ml/verisure/output_1659100501/verisure-02/preprocessed-data/tuning_data/


In [46]:
s3.download_file(bucket, s3_shuffled_split + 'train/chunk_0.csv', 'artifacts/train_chunk_0.csv')
train_data = pd.read_csv('artifacts/train_chunk_0.csv', header=None)
train_data.head()

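|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 11875800 | 4291 | 499.0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 4291.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 95.0 | 46.0 | 44.0 | 185.0 |
| 1 | 0 | 14304458 | 577 | 199.0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 577.0 | 0.0 | 0.0 | 29.0 | 23.0 | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0 | 14108665 | 730 | 79.0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 730.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 213.0 | 99.0 | 82.0 | 394.0 |
| 3 | 0 | 14323153 | 546 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 546.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 2.0 | 2.0 | 11.0 |
| 4 | 0 | 14297649 | 577 | 49.0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 577.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 3.0 | 5.0 |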
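To make the shuffle, split, and chunk idea concrete, here is a minimal sketch in plain Python. It is not Autopilot's implementation: the function name `shuffle_split_chunk`, the validation fraction, and the chunk size are all assumptions chosen for illustration.

```python
import random


def shuffle_split_chunk(rows, validation_fraction=0.2, chunk_size=3, seed=0):
    """Shuffle rows, split them into train/validation, and cut each split into chunks."""
    rows = list(rows)
    # A seeded generator keeps the shuffle reproducible between runs
    random.Random(seed).shuffle(rows)
    n_validation = int(len(rows) * validation_fraction)
    splits = {"validation": rows[:n_validation], "train": rows[n_validation:]}
    # Each split becomes a list of chunks, analogous to chunk_0.csv, chunk_1.csv, ...
    return {
        name: [part[i:i + chunk_size] for i in range(0, len(part), chunk_size)]
        for name, part in splits.items()
    }


chunks = shuffle_split_chunk(range(10), validation_fraction=0.2, chunk_size=3)
print({name: [len(chunk) for chunk in parts] for name, parts in chunks.items()})
```

Working on chunks rather than a single file lets downstream jobs read and process the pieces in parallel, which is the performance benefit mentioned above.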


------

### 3. Pre-processing

Autopilot performs the feature engineering required for each candidate, as shown in the generated notebooks.

As we have chosen a specific model, let's explore the processing script and output data generated after this step.

#### 3.1 Processing script

This is the processing code used for feature engineering by the candidate selected:

In [47]:
#Pre-processing script...
s3_processing_code = feature_engineering_code[feature_engineering_code.index(bucket)+len(bucket)+1 : feature_engineering_code.index('.py')+3]
print('feature_engineering_code:\ns3://{}/{}'.format(bucket, s3_processing_code))
s3.download_file(bucket, s3_processing_code, 'artifacts/processing.py')

feature_engineering_code:
s3://rodzanto2021ml/verisure/output_1659100501/verisure-02/sagemaker-automl-candidates/verisure-02-pr-1-5f3f81f452c545f29ec1075d7ff1211c0d06d394097340/generated_module/candidate_data_processors/dpp4.py


In [48]:
!pygmentize artifacts/processing.py

from numpy import nan
from sagemaker_sklearn_extension.externals import Header
from sagemaker_sklearn_extension.impute import RobustImputer
from sagemaker_sklearn_extension.preprocessing import RobustLabelEncoder
from sagemaker_sklearn_extension.preprocessing import RobustStandardScaler
from sagemaker_sklearn_extension.preprocessing import ThresholdOneHotEncoder
from sklearn.compose im

#### 3.2 Processing Pipeline

This is the pipeline definition. Remember that SageMaker Autopilot relies on scikit-learn Pipelines for feature engineering:

In [49]:
#Pre-processing pipeline...
s3_processing_pipeline = feature_engineering_code[feature_engineering_code.index(bucket)+len(bucket)+1 : feature_engineering_code.index('.py')-4] + 'trainer.py'
print('feature_engineering_pipeline:\ns3://{}/{}'.format(bucket, s3_processing_pipeline))
s3.download_file(bucket, s3_processing_pipeline, 'artifacts/trainer.py')

feature_engineering_pipeline:
s3://rodzanto2021ml/verisure/output_1659100501/verisure-02/sagemaker-automl-candidates/verisure-02-pr-1-5f3f81f452c545f29ec1075d7ff1211c0d06d394097340/generated_module/candidate_data_processors/trainer.py


In [50]:
!pygmentize artifacts/trainer.py

# This code is auto-generated.

import argparse
import importlib
import logging
import os
import shutil

from joblib import dump

from sagemaker_sklearn_extension.externals import AutoMLTransformer
from sagemaker_sklearn_extension.externals.read_data import read_csv_data



def train(X, y, header, feature_transformer, label_transformer):
    """Trains the data processing model.

    Splits training data to features and labels based on the he

#### 3.3 Processing Execution

Once the processing model has been generated, a Transform Job runs to produce the training data. The transformation is performed by the serving script below:

In [51]:
#Pre-processing serving code...
s3_processing_serving = feature_engineering_code[feature_engineering_code.index(bucket)+len(bucket)+1 : feature_engineering_code.index('.py')-4] + 'sagemaker_serve.py'
print('feature_engineering_serving:\ns3://{}/{}'.format(bucket, s3_processing_serving))
s3.download_file(bucket, s3_processing_serving, 'artifacts/sagemaker_serve.py')

feature_engineering_serving:
s3://rodzanto2021ml/verisure/output_1659100501/verisure-02/sagemaker-automl-candidates/verisure-02-pr-1-5f3f81f452c545f29ec1075d7ff1211c0d06d394097340/generated_module/candidate_data_processors/sagemaker_serve.py


In [52]:
!pygmentize artifacts/sagemaker_serve.py

# This code is auto-generated.
import http.client as http_client
import io
import json
import logging
import os

import numpy as np
from joblib import load
from scipy import sparse

from sagemaker_containers.beta.framework import encoders
from sagemaker_containers.beta.framework

#### 3.4 Generating Training Data

The result of this job is the training data, delivered in file chunks for performance efficiency. Let's explore one of the files below:

In [53]:
s3_transformed = transformed_data[transformed_data.index(bucket)+len(bucket)+1 : transformed_data.index('rpb')+3]
print('transformed_data:\ns3://{}/{}'.format(bucket, s3_transformed))

transformed_data:
s3://rodzanto2021ml/verisure/output_1659100501/verisure-02/transformed-data/dpp4/rpb


In [54]:
s3.download_file(bucket, s3_transformed+'/train/chunk_0.csv.out', 'artifacts/transform_chunk_0.out')
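The `.csv.out` file name comes from the way Batch Transform names its results: each output object keeps the input file name and gains an `.out` suffix. The following sketch maps input keys to the output keys you would expect to find. It assumes a flat output layout directly under the output prefix, and the helper name `expected_transform_outputs` and the example keys are hypothetical.

```python
import posixpath


def expected_transform_outputs(input_keys, output_prefix):
    """Map each input object key to the output key expected from a transform job."""
    prefix = output_prefix.rstrip("/")
    return {
        # Output objects reuse the input file name and append ".out"
        key: "{}/{}.out".format(prefix, posixpath.basename(key))
        for key in input_keys
    }


mapping = expected_transform_outputs(
    ["preprocessed-data/train/chunk_0.csv", "preprocessed-data/train/chunk_1.csv"],
    "transformed-data/train/",
)
for source, target in mapping.items():
    print(source, "->", target)
```

A mapping like this is handy for checking that every input chunk produced an output before starting a training job.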

In [55]:
import sagemaker.amazon.common as smac

def read_recordio_file(filename, records_to_print=1):
    # Print the first few RecordIO-protobuf records stored in a local file
    with open(filename, 'rb') as f:
        records = smac.read_records(f)
        for i, record in enumerate(records):
            if i >= records_to_print:
                break
            print("record: {}".format(i))
            print(record)

read_recordio_file('artifacts/transform_chunk_0.out')


record: 0
features {
  key: "values"
  value {
    float32_tensor {
      values: 5.061580657958984
      values: 2.3908212184906006
      values: 2.2058229446411133
      values: 2.006274938583374
      values: 2.65960693359375
      values: 2.0958914756774902
      values: 10.30570125579834
      values: 10.696402549743652
      values: 3.1209700107574463
      values: 0.13086290657520294
      values: 4.037381649017334
      values: 0.04754875972867012
      values: 5.05332612991333
      values: 1.1885870695114136
      values: 1.1604565382003784
      values: 1.0153844356536865
      values: 1.906218409538269
      values: 0.45210689306259155
      values: 4.754154682159424
      values: 5.0775275230407715
      values: 2.3733115196228027
      values: 0.19784241914749146
      values: 6.486716270446777
      values: 0.7017309665679932
      values: 0.029747100546956062
      values: 0.030027609318494797
      values: 20.231672286987305
      values: 0.614924967288971
      values

#### 3.5 Processing with your own script

If you want to customize this processing code, either by adapting it to your own transformations or by adding them to the existing pipeline, follow the instructions in this blog post:

https://aws.amazon.com/blogs/machine-learning/customizing-and-reusing-models-generated-by-amazon-sagemaker-autopilot/

Following those instructions, you will replace the processing code with your own script, add your own transformations to the scikit-learn Pipeline definition, or both.
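As an illustration of what "your own transformation" can look like, here is a dependency-free sketch of a transformer that follows the `fit`/`transform` convention used by scikit-learn Pipelines. The class name `Log1pTransformer` and the choice of a log transform are assumptions for illustration. To plug such a class into a real scikit-learn Pipeline you would typically also inherit from `sklearn.base.BaseEstimator` and `sklearn.base.TransformerMixin` and operate on arrays rather than lists.

```python
import math


class Log1pTransformer:
    """Applies log(1 + x) to selected column positions, leaving the others unchanged."""

    def __init__(self, columns):
        self.columns = set(columns)

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn, but fit returns self by convention
        return self

    def transform(self, X):
        return [
            [math.log1p(value) if i in self.columns else value
             for i, value in enumerate(row)]
            for row in X
        ]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


rows = [[0.0, 10.0], [99.0, 20.0]]
print(Log1pTransformer(columns=[0]).fit_transform(rows))
```

Keeping the transformer stateless makes it safe to apply at both training and inference time, since no fitted parameters need to be stored with the model.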

You can also simplify the whole process by taking the processing code above and using it directly, e.g. in a SageMaker Processing job. For this task, see examples like this one:

https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker_processing/basic_sagemaker_data_processing/basic_sagemaker_processing.ipynb

Also, consider using the Candidate Definition Notebook generated by Autopilot to reproduce any part of the process you are interested in.

-------

### 4. Training

In this section, let's assume we have already processed our data and just want to use it in our own Training Job.

We can use any example as a reference, such as:

https://github.com/aws/amazon-sagemaker-examples/blob/main/aws_sagemaker_studio/getting_started/xgboost_customer_churn_studio.ipynb

In [56]:
sess = boto3.Session()
sm = sess.client("sagemaker")
role = sagemaker.get_execution_role()

In [85]:
from time import strftime, gmtime
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

docker_image_name = sagemaker.image_uris.retrieve("xgboost", boto3.Session().region_name, "1.3-1", image_scope="training")
print(docker_image_name)

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


141502667606.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-xgboost:1.3-1


In [96]:
# The transformed data is stored as RecordIO-protobuf (the "rpb" prefix, as read above), not CSV
s3_input_train = sagemaker.TrainingInput(
    s3_data="s3://{}/{}/train/".format(bucket, s3_transformed),
    content_type="application/x-recordio-protobuf"
)
print("s3://{}/{}/train/".format(bucket, s3_transformed))

s3_input_validation = sagemaker.TrainingInput(
    s3_data="s3://{}/{}/validation/".format(bucket, s3_transformed),
    content_type="application/x-recordio-protobuf"
)
print("s3://{}/{}/validation/".format(bucket, s3_transformed))

s3://rodzanto2021ml/verisure/output_1659100501/verisure-02/transformed-data/dpp4/rpb/train/
s3://rodzanto2021ml/verisure/output_1659100501/verisure-02/transformed-data/dpp4/rpb/validation/


In [97]:
sess = sagemaker.session.Session()

create_date = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
customer_experiment = Experiment.create(
    experiment_name="automl-to-custom-{}".format(create_date),
    description="Reusing Autopilot generated training",
    sagemaker_boto_client=boto3.client("sagemaker"),
)

In [98]:
hyperparams = {
    "max_depth": 8,
    "subsample": 0.7855482055675881,
    "num_round": 195,
    "eta": 0.091189676348378,
    "gamma": 1.248783399604081,
    "min_child_weight": 8.8593717950024e-05,
    "objective": "binary:logistic",
}

In [None]:
trial = Trial.create(
    trial_name="algorithm-mode-trial-{}".format(strftime("%Y-%m-%d-%H-%M-%S", gmtime())),
    experiment_name=customer_experiment.experiment_name,
    sagemaker_boto_client=boto3.client("sagemaker"),
)

xgb = sagemaker.estimator.Estimator(
    image_uri=docker_image_name,
    role=role,
    hyperparameters=hyperparams,
    instance_count=1,
    instance_type="ml.m5.12xlarge",
    output_path="s3://{}/{}/output".format(bucket, customer_experiment.experiment_name),
    base_job_name="automl-to-custom",
    sagemaker_session=sess,
)

xgb.fit(
    {"train": s3_input_train, "validation": s3_input_validation},
    experiment_config={
        "ExperimentName": customer_experiment.experiment_name,
        "TrialName": trial.trial_name,
        "TrialComponentDisplayName": "Training",
    },
)

### 5. Hosting

#### 5.1 Option 1: Re-using Autopilot's best model

5.1.1 Real-time Endpoint

In [None]:
endpoint_name = "demo-xgboost-customer-churn-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("EndpointName = {}".format(endpoint_name))
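SageMaker endpoint names may contain only alphanumeric characters and hyphens and are limited to 63 characters, so a timestamped name built from a free-form prefix can be rejected. The sketch below builds a compliant name. The helper name `unique_name` and its parameters are hypothetical.

```python
import re
from time import gmtime, strftime


def unique_name(prefix, max_length=63, when=None):
    """Build a timestamped resource name using only letters, digits, and hyphens."""
    stamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime() if when is None else when)
    # Replace every run of disallowed characters with a single hyphen
    cleaned = re.sub(r"[^a-zA-Z0-9-]+", "-", prefix).strip("-")
    if not cleaned:
        raise ValueError("prefix must contain at least one letter or digit")
    # Leave room for the joining hyphen and the timestamp
    room = max_length - len(stamp) - 1
    return "{}-{}".format(cleaned[:room].rstrip("-"), stamp)


print(unique_name("automl to custom_endpoint"))
```

Truncating the prefix rather than the timestamp keeps names unique even when the prefix is long.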

In [None]:
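# Host the algorithm model produced by Autopilot instead of the estimator trained above.
# Assumptions: the artifact is compatible with the XGBoost image retrieved earlier, and
# requests carry features already transformed by the feature engineering model.
autopilot_model = sagemaker.model.Model(
    image_uri=docker_image_name,
    model_data="s3://{}/{}".format(bucket, algorithm_model[algorithm_model.index(bucket)+len(bucket)+1:]),
    role=role,
    sagemaker_session=sess,
)
autopilot_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name=endpoint_name,
)

5.1.2 Batch Transform

In [None]:
# Placeholder locations, not defined elsewhere in this notebook: replace with your own S3 prefixes
s3_batch_input = "s3://{}/automl-to-custom/batch/input/".format(bucket)
s3_batch_output = "s3://{}/automl-to-custom/batch/output/".format(bucket)

# creates a transformer object from the Autopilot model
transformer = autopilot_model.transformer(
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=s3_batch_output)

# calls that object's transform method to create a transform job
transformer.transform(data=s3_batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line')

transformer.wait()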
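Autopilot's best candidate is usually served as a chain of containers (feature engineering followed by the algorithm), and the `DescribeAutoMLJob` API reports them under `BestCandidate.InferenceContainers`. The sketch below extracts those container definitions from such a response. The helper name `inference_containers` and the sample response are hypothetical; with AWS access you would obtain the response from `boto3.client("sagemaker").describe_auto_ml_job(AutoMLJobName=...)`.

```python
def inference_containers(describe_response):
    """Return the best candidate's container definitions, in serving order."""
    best = describe_response.get("BestCandidate")
    if not best:
        raise ValueError("The job has no BestCandidate yet (is it still running?)")
    return [
        {
            "Image": container["Image"],
            "ModelDataUrl": container["ModelDataUrl"],
            # Environment is optional in the response, so default to an empty dict
            "Environment": container.get("Environment", {}),
        }
        for container in best["InferenceContainers"]
    ]


# Hypothetical response fragment with placeholder image URIs and S3 locations
sample_response = {
    "BestCandidate": {
        "CandidateName": "example-candidate",
        "InferenceContainers": [
            {
                "Image": "<feature-engineering-image-uri>",
                "ModelDataUrl": "s3://my-bucket/data-processor-models/model.tar.gz",
                "Environment": {"EXAMPLE_KEY": "value"},
            },
            {
                "Image": "<algorithm-image-uri>",
                "ModelDataUrl": "s3://my-bucket/tuning/model.tar.gz",
            },
        ],
    }
}

for container in inference_containers(sample_response):
    print(container["ModelDataUrl"])
```

The resulting list has the shape expected by the `Containers` parameter of the SageMaker `create_model` API, which is one way to host the full pipeline so that raw, untransformed records can be sent to the endpoint.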

#### 5.2 Option 2: Hosting your own trained model (from steps above)

5.2.1 Real-time Endpoint

In [None]:
endpoint_name = "demo-xgboost-customer-churn-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("EndpointName = {}".format(endpoint_name))

In [None]:
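from sagemaker.model_monitor import DataCaptureConfig

# Placeholder prefix for captured requests and responses: replace with your own
data_capture_prefix = "automl-to-custom/datacapture"

predictor = xgb.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
    endpoint_name=endpoint_name,
    data_capture_config=DataCaptureConfig(
        enable_capture=True,
        sampling_percentage=100,
        destination_s3_uri="s3://{}/{}".format(bucket, data_capture_prefix),
    ),
)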

5.2.2 Batch Transform

In [None]:
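# Placeholder locations, not defined elsewhere in this notebook: replace with your own S3 prefixes
s3_batch_input = "s3://{}/automl-to-custom/batch/input/".format(bucket)
s3_batch_output = "s3://{}/automl-to-custom/batch/output/".format(bucket)

# creates a transformer object from the trained model
transformer = xgb.transformer(
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=s3_batch_output)

# calls that object's transform method to create a transform job
transformer.transform(data=s3_batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line')

transformer.wait()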