# Assignment 2: use SageMaker processing and training jobs
In this assignment you move your data processing, feature engineering, and model training code to SageMaker jobs.

The following diagram shows the anatomy of a SageMaker container:

![](../img/container-anatomy.png)

Refer to the notebook [`02-sagemaker-containers.ipynb`](../02-sagemaker-containers.ipynb) for code snippets and general guidance on the exercises in this assignment.

## Import packages

In [2]:
import time
import boto3
import botocore
import numpy as np  
import pandas as pd  
import sagemaker
from time import gmtime, strftime, sleep
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sklearn.metrics import roc_auc_score
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

sagemaker.__version__

'2.165.0'

In [59]:
%store -r 

%store

try:
    initialized
except NameError:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN 00-start-here notebook   ")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++")

Stored variables and their in-db values:
athena_table_name                      -> 'sagemaker_workshop_e2e_churn_1686747619'
baseline_s3_url                        -> 's3://sagemaker-us-east-1-531485126105/from-idea-t
bucket                                 -> 'sagemaker-studio-us-east-1-531485126105'
bucket_name                            -> 'sagemaker-us-east-1-531485126105'
bucket_prefix                          -> 'from-idea-to-prod/xgboost'
churn_feature_group_name               -> 'sagemaker-workshop-e2e-churn'
docker_image_name                      -> '683313688378.dkr.ecr.us-east-1.amazonaws.com/sage
domain_id                              -> 'd-1qvmpqvqiuve'
evaluation_s3_url                      -> 's3://sagemaker-us-east-1-531485126105/from-idea-t
experiment_name                        -> 'from-idea-to-prod-experiment-18-07-41-20'
framework_version                      -> '1.3-1'
initialized                            -> True
input_s3_url                           -> 's3://sag

In [60]:
session = sagemaker.Session()
sm = session.sagemaker_client

## [Optional] Load an existing or create a new experiment
Load an existing or create a new experiment to track parameters, metrics, and artifacts in this notebook.

In [None]:
# Load an experiment based on its name
# experiment = Experiment.load(experiment_name, sagemaker_boto_client=sm)

In [None]:
# Alternatively, create a new experiment
#experiment_name = f"from-idea-to-prod-experiment-{strftime('%d-%H-%M-%S', gmtime())}"
#experiment = Experiment.create(
#    experiment_name=experiment_name,
#    description="Direct marketing binary classification",
#    sagemaker_boto_client=sm,
#)

## Exercise 1: Process data
- Use the SageMaker session object to [upload](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html#sagemaker.session.Session.upload_data) the dataset to an Amazon S3 bucket. Use the SageMaker [default bucket](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html#sagemaker.session.Session.default_bucket)
- Move the data processing code from the previous notebook to an executable Python script. You can pass parameters to the script to parametrize the data processing
- Set the Amazon S3 paths for the output datasets
- Use the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html) [`SKLearnProcessor`](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#sagemaker.sklearn.processing.SKLearnProcessor) class to set up a processing job
- Configure processing job's [inputs](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ProcessingInput) and [outputs](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ProcessingOutput) to point the processing job to Amazon S3 locations
- [Run](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor.run) the processing job
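As a minimal sketch of how a processing script can receive parameters, the snippet below parses command-line flags with `argparse` and `parse_known_args()`, so that flags the script does not recognize are collected instead of raising an error. The `--train-split` flag and the sample values are illustrative assumptions and are not required by SageMaker; values like these can be supplied through the `arguments` list of the processor's `run()` call.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags this script understands and keep the rest aside."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--filename", type=str, default="bank-additional-full.csv")
    # Hypothetical flag, used only to illustrate a numeric parameter
    parser.add_argument("--train-split", type=float, default=0.7)
    # parse_known_args returns (namespace, list_of_unrecognized_arguments)
    return parser.parse_known_args(argv)


# Simulate the argument list a job could pass to the script
args, unknown = parse_args(["--filename", "sample.csv", "--extra-flag", "1"])
print(args.filename, args.train_split, unknown)
```

Calling `parse_args()` without an argument list reads `sys.argv`, which is what happens when the script runs inside a container.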

### Python SDK processor classes
Use the most suitable class to implement a processor for your use case:
    
![](../img/python-sdk-processors.png)

In [4]:
session = sagemaker.Session()

In [6]:
# Write data upload code
# S3 key to the full dataset
# input_s3_url = session.upload_data()

In [16]:
# import sagemaker

# # Create a SageMaker session
# sagemaker_session = sagemaker.Session()

# # Specify the S3 bucket and prefix for the data
# # bucket = sagemaker_session.default_bucket()
# # prefix = 'assignment_day_4'  # Specify your desired prefix

# # Upload the dataset to S3
# input_data = sagemaker_session.upload_data(path='data/bank-additional/bank-additional-full.csv', bucket=bucket, key_prefix=prefix)

In [61]:
input_s3_url = session.upload_data(
    path="data/bank-additional/bank-additional-full.csv",
    bucket=bucket_name,
    key_prefix=f"{bucket_prefix}/input"
)

%store input_s3_url

Stored 'input_s3_url' (str)


In [17]:
# output_path = f's3://{bucket}/{prefix}/output'  # Specify the desired output path in S3
# output_path

's3://sagemaker-us-east-1-531485126105/assignment_day_4/output'

In [62]:
%%writefile preprocessing.py

import pandas as pd
import numpy as np
import argparse
import os

def _parse_args():
    
    parser = argparse.ArgumentParser()
    # Input and output locations inside the processing container (under /opt/ml/processing)
    parser.add_argument('--filepath', type=str, default='/opt/ml/processing/input/')
    parser.add_argument('--filename', type=str, default='bank-additional-full.csv')
    parser.add_argument('--outputpath', type=str, default='/opt/ml/processing/output/')
    
    return parser.parse_known_args()


if __name__=="__main__":
    # Process arguments
    args, _ = _parse_args()
    
    target_col = "y"
    
    # Load data
    df_data = pd.read_csv(os.path.join(args.filepath, args.filename), sep=";")

    # Indicator variable to capture when pdays takes a value of 999
    df_data["no_previous_contact"] = np.where(df_data["pdays"] == 999, 1, 0)

    # Indicator for individuals not actively employed
    df_data["not_working"] = np.where(
        np.in1d(df_data["job"], ["student", "retired", "unemployed"]), 1, 0
    )

    # remove unnecessary data
    df_model_data = df_data.drop(
        ["duration", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"],
        axis=1,
    )

    df_model_data = pd.get_dummies(df_model_data)  # Convert categorical variables to sets of indicators

    # Replace "y_no" and "y_yes" with a single label column, and bring it to the front:
    df_model_data = pd.concat(
        [
            df_model_data["y_yes"].rename(target_col),
            df_model_data.drop(["y_no", "y_yes"], axis=1),
        ],
        axis=1,
    )

    # Shuffle and split the dataset
    train_data, validation_data, test_data = np.split(
        df_model_data.sample(frac=1, random_state=1729),
        [int(0.7 * len(df_model_data)), int(0.9 * len(df_model_data))],
    )

    print(f"Data split > train:{train_data.shape} | validation:{validation_data.shape} | test:{test_data.shape}")
    
    # Save datasets locally
    train_data.to_csv(os.path.join(args.outputpath, 'train/train.csv'), index=False, header=False)
    validation_data.to_csv(os.path.join(args.outputpath, 'validation/validation.csv'), index=False, header=False)
    test_data[target_col].to_csv(os.path.join(args.outputpath, 'test/test_y.csv'), index=False, header=False)
    test_data.drop([target_col], axis=1).to_csv(os.path.join(args.outputpath, 'test/test_x.csv'), index=False, header=False)
    
    # Save the baseline dataset for model monitoring
    df_model_data.drop([target_col], axis=1).to_csv(os.path.join(args.outputpath, 'baseline/baseline.csv'), index=False, header=False)
    
    print("## Processing complete. Exiting.")

Overwriting preprocessing.py
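To check the split logic locally before starting a job, the sketch below reproduces the 70/20/10 shuffle-and-split on a made-up DataFrame. It uses `iloc` slicing at the same cut points as the script's `np.split` call; the function name `shuffle_and_split` and the toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def shuffle_and_split(df, train_end=0.7, validation_end=0.9, seed=1729):
    """Shuffle rows, then cut them at the two fractional positions."""
    shuffled = df.sample(frac=1, random_state=seed)
    first_cut = int(train_end * len(shuffled))
    second_cut = int(validation_end * len(shuffled))
    return (
        shuffled.iloc[:first_cut],             # train
        shuffled.iloc[first_cut:second_cut],   # validation
        shuffled.iloc[second_cut:],            # test
    )


# Toy data standing in for the real dataset
toy = pd.DataFrame({"y": np.arange(1000) % 2, "feature": np.arange(1000)})
train, validation, test = shuffle_and_split(toy)
print(train.shape, validation.shape, test.shape)
```

Because the three parts are slices of one shuffled frame, every row lands in exactly one of them.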


In [None]:
# Set the Amazon S3 paths for the output datasets
# train_s3_url = 
# validation_s3_url =
# test_s3_url = 
# baseline_s3_url = 

In [64]:
train_s3_url = f"s3://{bucket_name}/{bucket_prefix}/train"
validation_s3_url = f"s3://{bucket_name}/{bucket_prefix}/validation"
test_s3_url = f"s3://{bucket_name}/{bucket_prefix}/test"
baseline_s3_url = f"s3://{bucket_name}/{bucket_prefix}/baseline"

In [67]:
%store train_s3_url
%store validation_s3_url
%store test_s3_url
%store baseline_s3_url

Stored 'train_s3_url' (str)
Stored 'validation_s3_url' (str)
Stored 'test_s3_url' (str)
Stored 'baseline_s3_url' (str)


### [Optional] Create a trial
If you use an experiment, you must create a trial to capture processing and training output from this notebook. 

Use the [`Trial`](https://sagemaker-experiments.readthedocs.io/en/latest/trial.html) class to interact with trials and the [`Tracker`](https://sagemaker-experiments.readthedocs.io/en/latest/tracker.html) class to record information to a trial component. 

SageMaker processing and training jobs automatically handle trial components and save metrics, parameters, metadata, and artifacts in the trial components if you provide an `experiment_config` in `Processor.run()` or `Estimator.fit()` calls.

In [None]:
# trial = experiment.create_trial(trial_name_prefix="Container-training")

In [None]:
# with Tracker.create(display_name="Preprocessing-split", sagemaker_boto_client=sm) as tracker:
#    tracker.log_parameters()
#    tracker.log_input()

In [None]:
# Create experiment config to use in the processing and training jobs
#experiment_config = {
#    "ExperimentName": experiment.experiment_name,
#    "TrialName": trial.trial_name,
#    "TrialComponentDisplayName": "Preprocessing",
#}

### Create a processor

In [33]:
import time
import boto3
import botocore
import numpy as np  
import pandas as pd  
import sagemaker
from time import gmtime, strftime, sleep
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sklearn.metrics import roc_auc_score
from sagemaker.experiments.run import Run, load_run

sagemaker.__version__

'2.165.0'

In [54]:
# Create a new experiment name for this assignment
experiment_name = f"assignment-lab3-{strftime('%d-%H-%M-%S', gmtime())}"
experiment_name

'assignment-lab3-18-08-48-57'

In [68]:
run_suffix = strftime('%Y-%m-%d-%H-%M-%S', gmtime())
run_name = f"container-processing-{run_suffix}"

with Run(experiment_name=experiment_name,
         run_name=run_name,
         run_display_name="container-processing",
         sagemaker_session=session
        ) as run:
    run.log_parameters(
        {
            "train": 0.7,
            "validate": 0.2,
            "test": 0.1
        }
    )
   
    experiment_config = run.experiment_config
    # time.sleep(8) # wait until resource tags are propagated to the run

In [69]:
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

framework_version = "0.23-1"
processing_instance_type = "ml.m5.large"
processing_instance_count = 1

sm_role = get_execution_role()

sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    role=sm_role,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count, 
    base_job_name='from-idea-to-prod-processing',
    sagemaker_session=session,
)

INFO:sagemaker.image_uris:Defaulting to only available Python version: py3


In [70]:
# Define processing inputs and outputs
processing_inputs = [] # use input_s3_url as pointer to the full dataset

processing_outputs = [] # map local directories in the processing container to Amazon S3 locations

In [72]:
processing_inputs = [
        ProcessingInput(
            source=input_s3_url, 
            destination="/opt/ml/processing/input",
            s3_input_mode="File",
            s3_data_distribution_type="ShardedByS3Key"
        )
    ]

processing_outputs = [
        ProcessingOutput(
            output_name="train_data", 
            source="/opt/ml/processing/output/train",
            destination=train_s3_url,
        ),
        ProcessingOutput(
            output_name="validation_data", 
            source="/opt/ml/processing/output/validation", 
            destination=validation_s3_url
        ),
        ProcessingOutput(
            output_name="test_data", 
            source="/opt/ml/processing/output/test", 
            destination=test_s3_url
        ),
        ProcessingOutput(
            output_name="baseline_data", 
            source="/opt/ml/processing/output/baseline", 
            destination=baseline_s3_url
        ),
    ]

In [None]:
# Start the processing job, pass an experiment_config parameter if you use experiments
# sklearn_processor.run() 

In [73]:
try:
    sklearn_processor.run(
        inputs=processing_inputs,
        outputs=processing_outputs,
        code='preprocessing.py',
        wait=True,
        experiment_config=experiment_config,
        # arguments = ['arg1', 'arg2'],
    )
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == 'AccessDeniedException':
        print(f"Ignore AccessDeniedException: {e.response['Error']['Message']} because of the slow resource tag auto propagation")
    else:
        raise e

INFO:sagemaker:Creating processing-job with name from-idea-to-prod-processing-2023-06-18-08-55-54-992


.........................Data split > train:(28831, 60) | validation:(8238, 60) | test:(4119, 60)
## Processing complete. Exiting.



## Exercise 2: Model training
- Get the container image URI for the built-in SageMaker algorithm with the SageMaker SDK [helper](https://sagemaker.readthedocs.io/en/stable/api/utility/image_uris.html#sagemaker.image_uris.retrieve)
- Configure data [input channels](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.TrainingInput) for the training job
- Use the [`Estimator`](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) class to set up a training job
- Set [hyperparameters](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.set_hyperparameters)
- [Run](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase.fit) the training job

In [None]:
# Write code to retrieve a container image URI


In [74]:
# get training container uri
training_image = sagemaker.image_uris.retrieve("xgboost", region=region, version="1.5-1")

print(training_image)

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.5-1


In [None]:
# Set the input data channels
# s3_input_train = 
# s3_input_validation =

In [75]:
s3_input_train = sagemaker.inputs.TrainingInput(train_s3_url, content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(validation_s3_url, content_type='csv')

In [None]:
# Set an Amazon S3 path for a model artifact
# output_s3_url = 

In [76]:
train_instance_count = 1
train_instance_type = "ml.m5.xlarge"

# Define where the training job stores the model artifact
output_s3_url = f"s3://{bucket_name}/{bucket_prefix}/output"

%store output_s3_url

Stored 'output_s3_url' (str)


### Python SDK estimator classes
The SageMaker Python SDK provides [`EstimatorBase`](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase)-derived classes for the built-in algorithms. You can extend the [`Framework`](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Framework) class to implement training with a custom framework.

![](../img/python-sdk-estimators.png)

In [None]:
# Create an estimator
train_instance_count = 1
train_instance_type = "ml.m5.xlarge"

# estimator = sagemaker.estimator.Estimator()

In [77]:
# Instantiate an XGBoost estimator object
estimator = sagemaker.estimator.Estimator(
    image_uri=training_image,  # XGBoost algorithm container
    instance_type=train_instance_type,  # type of training instance
    instance_count=train_instance_count,  # number of instances to be used
    role=sm_role,  # IAM execution role to be used
    max_run=20 * 60,  # Maximum allowed active runtime
    # use_spot_instances=True,  # Use spot instances to reduce cost
    # max_wait=30 * 60,  # Maximum clock time (including spot delays)
    output_path=output_s3_url, # S3 location for saving the training result
    sagemaker_session=session, # Session object which manages interactions with SageMaker API and AWS services
    base_job_name="from-idea-to-prod-training", # Prefix for training job name
)

# define its hyperparameters
estimator.set_hyperparameters(
    num_round=150, # the number of rounds to run the training
    max_depth=3, # maximum depth of a tree
    eta=0.5, # step size shrinkage used in updates to prevent overfitting
    alpha=2.5, # L1 regularization term on weights
    objective="binary:logistic",
    eval_metric="auc", # evaluation metrics for validation data
    subsample=0.8, # subsample ratio of the training instance
    colsample_bytree=0.8, # subsample ratio of columns when constructing each tree
    min_child_weight=3, # minimum sum of instance weight (hessian) needed in a child
    early_stopping_rounds=10, # the model trains until the validation score stops improving
    verbosity=1, # verbosity of printing messages
)

In [None]:
# Set hyperparameters for the estimator algorithm
# estimator.set_hyperparameters()

In [None]:
# Set the training inputs
# training_inputs = {}

In [78]:
training_inputs = {'train': s3_input_train, 'validation': s3_input_validation}

In [79]:
try:
    run_suffix = strftime('%Y-%m-%d-%H-%M-%S', gmtime())
    run_name = f"container-training-{run_suffix}"

    with Run(experiment_name=experiment_name,
             run_name=run_name,
             run_display_name="container-training",
             sagemaker_session=session
            ) as run:
        
        estimator.fit(
            training_inputs,
            wait=True,
            logs=False,
        ) 
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == 'AccessDeniedException':
        print(f"Ignore AccessDeniedException: {e.response['Error']['Message']} because of the slow resource tag auto propagation")
    else:
        raise e

INFO:sagemaker:Creating training-job with name: from-idea-to-prod-training-2023-06-18-09-00-49-889



2023-06-18 09:00:49 Starting - Starting the training job....
2023-06-18 09:01:17 Starting - Preparing the instances for training...........
2023-06-18 09:02:18 Downloading - Downloading input data.....
2023-06-18 09:02:48 Training - Downloading the training image...
2023-06-18 09:03:09 Training - Training image download completed. Training in progress...
2023-06-18 09:03:24 Uploading - Uploading generated training model.
2023-06-18 09:03:35 Completed - Training job completed


In [None]:
# Run the training job, optionally use an experiment_config parameter
# estimator.fit(training_inputs)

Wait until the training job is done.

In [None]:
# Describe the training job
# training_job_name = estimator._current_job_name
# boto3.client("sagemaker", region_name=region).describe_training_job(TrainingJobName=training_job_name)

In [None]:
# Get the model metrics from the describe job result

# print(f"Train-auc:{train_auc:.2f}, Validate-auc:{validate_auc:.2f}")
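One way to read the final metrics is to walk the `FinalMetricDataList` field of the `describe_training_job` response. The sketch below works on a hand-written excerpt of such a response, so it runs without AWS access; the job name and metric values are invented, and the metric names `train:auc` and `validation:auc` are assumed to be the ones the XGBoost container reports when `eval_metric="auc"` is set.

```python
def final_metrics(description):
    """Map metric names to their final values from a DescribeTrainingJob response."""
    return {
        metric["MetricName"]: metric["Value"]
        for metric in description.get("FinalMetricDataList", [])
    }


# Hypothetical excerpt of a response; real responses contain many more fields
sample_description = {
    "TrainingJobName": "example-training-job",
    "FinalMetricDataList": [
        {"MetricName": "train:auc", "Value": 0.81},
        {"MetricName": "validation:auc", "Value": 0.77},
    ],
}

metrics = final_metrics(sample_description)
train_auc = metrics["train:auc"]
validate_auc = metrics["validation:auc"]
print(f"Train-auc:{train_auc:.2f}, Validate-auc:{validate_auc:.2f}")
```

In the notebook, the dictionary returned by `describe_training_job(TrainingJobName=...)` would take the place of `sample_description`.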

## Exercise 3: Validate model
To validate the model, you use the model artifact from the training job to run predictions on the test dataset. You can either create a [real-time inference endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) or create a [batch transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html).

### Option 1: Real-time inference
- Use the [Estimator.deploy](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase.deploy) function to provision a real-time inference endpoint
- Load the test dataset
- Send the test dataset to the endpoint with the [Predictor.predict](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html#sagemaker.predictor.Predictor.predict) function
- Evaluate the predictions

In [None]:
# Create a predictor
# Remember, the processing job saved the test dataset in the specified S3 location

# predictor = estimator.deploy()

In [83]:
# Real-time endpoint
endpoint_name = f"from-idea-to-prod-endpoint-{strftime('%d-%H-%M-%S', gmtime())}"

try:
    predictor = estimator.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.large",
        wait=False,  # Remember, predictor.predict() won't work until deployment finishes!
        # Turn on data capture here, in case you want to experiment with monitoring:
        data_capture_config=sagemaker.model_monitor.DataCaptureConfig(
            enable_capture=True,
            sampling_percentage=100,
            destination_s3_uri=f"s3://{bucket_name}/{bucket_prefix}/data-capture",
        ),
        endpoint_name=endpoint_name,
        serializer=sagemaker.serializers.CSVSerializer(),
        deserializer=sagemaker.deserializers.CSVDeserializer(),
    )
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == 'AccessDeniedException':
        print(f"Ignore AccessDeniedException: {e.response['Error']['Message']} because of the slow resource tag auto propagation")
        predictor = sagemaker.predictor.Predictor(endpoint_name=endpoint_name,
                                                  sagemaker_session=session,
                                                  serializer=sagemaker.serializers.CSVSerializer(),
                                                  deserializer=sagemaker.deserializers.CSVDeserializer(),
                                                 )
    else:
        raise e

INFO:sagemaker:Creating model with name: from-idea-to-prod-training-2023-06-18-09-04-02-239
INFO:sagemaker:Creating endpoint-config with name from-idea-to-prod-endpoint-18-09-04-01
INFO:sagemaker:Creating endpoint with name from-idea-to-prod-endpoint-18-09-04-01


In [None]:
# Load the test dataset
# test_x = pd.read_csv()
# test_y = pd.read_csv()

In [87]:
!aws s3 cp $test_s3_url/test_x.csv tmp/test_x.csv
!aws s3 cp $test_s3_url/test_y.csv tmp/test_y.csv

download: s3://sagemaker-us-east-1-531485126105/from-idea-to-prod/xgboost/test/test_x.csv to tmp/test_x.csv
download: s3://sagemaker-us-east-1-531485126105/from-idea-to-prod/xgboost/test/test_y.csv to tmp/test_y.csv


In [88]:
test_x = pd.read_csv("tmp/test_x.csv", names=[f'{i}' for i in range(59)])
test_y = pd.read_csv("tmp/test_y.csv", names=['y'])

In [91]:
# Predict
predictions = np.array(predictor.predict(test_x.values), dtype=float).squeeze()
predictions

array([0.06094802, 0.09423219, 0.20767137, ..., 0.04564261, 0.03911222,
       0.03352626])

In [92]:
# Evaluate predictions
# Compare the predicted label to ground truth label
test_results = pd.concat(
    [
        pd.Series(predictions, name="y_pred", index=test_x.index),
        test_x,
    ],
    axis=1,
)
test_results.head()

Unnamed: 0,y_pred,0,1,2,3,4,5,6,7,8,...,49,50,51,52,53,54,55,56,57,58
0,0.060948,25,1,999,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
1,0.094232,28,3,999,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
2,0.207671,38,1,999,0,1,0,1,0,0,...,0,0,0,1,0,0,0,0,1,0
3,0.256129,32,1,999,1,1,0,1,0,0,...,0,0,0,1,0,0,0,1,0,0
4,0.039112,40,1,999,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0


In [93]:
pd.crosstab(
    index=test_y['y'].values,
    columns=np.round(predictions), 
    rownames=['actuals'], 
    colnames=['predictions']
)

predictions,0.0,1.0
actuals,Unnamed: 1_level_1,Unnamed: 2_level_1
0,3602,34
1,401,82


### Option 2: Batch transform
For offline, asynchronous inference you can use a SageMaker [transform job](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html).
- Use the [Estimator.transformer](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase.transformer) function to create a transformer
- [Run](https://sagemaker.readthedocs.io/en/stable/api/inference/transformer.html#sagemaker.transformer.Transformer.transform) a transform job
- Download the dataset from an S3 output location
- Evaluate the predictions

In [None]:
# Create a transformer
# transformer = estimator.transformer()

In [None]:
# Run a transform, use an experiment_config parameter
# transformer.transform()

Wait until the transform job is done.

The transformer outputs the prediction probabilities and stores them as a `csv` file in the specified S3 location. The S3 path is stored in `transformer.output_path`. To compare the predictions with the ground truth labels, you must download the dataset from S3 and load it into a Pandas DataFrame.
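A minimal sketch of the loading and comparison step follows. It assumes the transform output is a CSV file with one probability per line and that the labels file has one label per line, which matches how the processing script writes `test_y.csv`. The in-memory strings stand in for the downloaded files, and their values are invented. Batch transform names each output object after its input object with an `.out` suffix.

```python
import io

import pandas as pd

# Stand-ins for the downloaded files (for example test_x.csv.out and test_y.csv)
raw_predictions = "0.06\n0.91\n0.35\n0.72\n"
raw_labels = "0\n1\n0\n0\n"

predictions = pd.read_csv(io.StringIO(raw_predictions), names=["y_pred"])
test_y = pd.read_csv(io.StringIO(raw_labels), names=["y"])

# Put labels and probabilities side by side and apply a 0.5 threshold
results = pd.concat([test_y, predictions], axis=1)
results["y_label"] = (results["y_pred"] >= 0.5).astype(int)

confusion = pd.crosstab(
    index=results["y"],
    columns=results["y_label"],
    rownames=["actuals"],
    colnames=["predictions"],
)
print(confusion)
```

For real files, replace the `io.StringIO(...)` arguments with the local paths of the downloaded objects.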

In [None]:
# Download the predictions and the ground truth labels from S3


In [None]:
# Load the output dataset and the ground truth label
# predictions = pd.read_csv()
# test_y = pd.read_csv()

In [None]:
# Show the confusion matrix
# pd.crosstab()


In [None]:
# Calculate AUC
# test_auc = roc_auc_score(test_y, predictions)
# print(f"Test-auc: {test_auc:.2f}")

### [Optional] build ROC and precision-recall curves
You can create various charts using the [`sklearn.metrics`](https://scikit-learn.org/stable/modules/model_evaluation.html) package.
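As a small sketch of what `sklearn.metrics` provides for these charts, the snippet below computes the points of a ROC curve and a precision-recall curve. The four labels and scores are toy values, not results from this notebook; in practice `test_y` and `predictions` would be passed instead.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_curve

# Toy ground-truth labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve: false positive rate against true positive rate
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Points of the precision-recall curve
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)

print(f"ROC AUC: {roc_auc:.2f}")
print("fpr:", fpr, "tpr:", tpr)
print("precision:", precision, "recall:", recall)
```

The returned arrays can be passed to any plotting library, for example as the x and y values of a line plot.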

### [Optional] Save charts to the trial component
You can use the Tracker class to save various charts to a trial component of the trial of your experiment.

A Jupyter notebook trick: Press `Ctrl` + `/` to comment or uncomment all selected lines in the cell.

In [None]:
# Find a trial component name of the current trial based on the display name
#batch_transform_trial_component = [
#    tc for tc in trial.list_trial_components() 
#    if tc.display_name == <DISPLAY NAME OF THE TRIAL COMPONENT>][0]

In [None]:
# Add charts
# with Tracker.load(
#    trial_component_name=batch_transform_trial_component.trial_component_name,
#    sagemaker_boto_client=sm
# ) as tracker:
#    tracker.log_precision_recall()
#    tracker.log_confusion_matrix()
#    tracker.log_roc_curve()

### [Optional] Explore experiments, trials, and trial components in Studio
In **SageMaker resources** select **Experiment and trials**, choose **Open in trial component list** from the context menu:

<img src="../img/experiment-and-trials-context-menu.png" width="400"/>

## Exercise 4: [Optional] hyperparameter optimization (HPO)
- Use [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html#sagemaker.tuner.HyperparameterTuner) to run an HPO job
- Specify hyperparameter ranges and a tuning strategy
- [Run](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html#sagemaker.tuner.HyperparameterTuner.fit) the tuning job
- Compare performance of the tuned and non-tuned models

In [None]:
# import required HPO objects
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

In [None]:
# set up hyperparameter ranges
# hp_ranges = {}


In [None]:
# set up the objective metric
objective = "validation:auc"

In [None]:
# instantiate an HPO object
# tuner = HyperparameterTuner()

In [None]:
# evaluate performance

## Clean-up
Remove all real-time endpoints you created.

In [94]:
predictor.delete_endpoint(delete_endpoint_config=True)

INFO:sagemaker:Deleting endpoint configuration with name: from-idea-to-prod-endpoint-18-09-04-01
INFO:sagemaker:Deleting endpoint with name: from-idea-to-prod-endpoint-18-09-04-01


In [None]:
# run if you created a tuned predictor after HPO
# hpo_predictor.delete_endpoint(delete_endpoint_config=True)


In [95]:
# run if you created a tuned predictor after HPO
hpo_predictor.delete_endpoint(delete_endpoint_config=True)

NameError: name 'hpo_predictor' is not defined

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

## Continue with assignment 3
Navigate to the [assignment 3](03-assignment-sagemaker-pipeline.ipynb) notebook.