# Improve Data Science Productivity Using SageMaker Studio

Using a machine learning model to predict customer churn for a music streaming service

The notebook is organized into the following sections:

* Background
* Dataset Exploration
* Centralize the feature set with SageMaker Feature Store
* Train a model using a SageMaker Training job
* Track and organize data science projects using SageMaker Experiments
* Use SageMaker Debugger to monitor utilization of system resources such as GPUs, CPUs, network, and memory, and to profile training jobs to collect detailed ML framework metrics
* Automate the hyperparameter tuning process to find the best model hyperparameters
* Identify bias in the data and the model, and measure feature importance using SHAP through SageMaker Clarify
* Register the model in SageMaker Model Registry
* Deploy the model to SageMaker Inference to serve churn predictions

# Background

This particular challenge was originally introduced as a Kaggle competition in 2018. The goal was to build an algorithm that predicts 
whether a subscription user will churn using a donated dataset from KKBOX. 

For a subscription business, accurately predicting churn is critical to long-term success. 

Even slight variations in churn can drastically affect profits.

KKBOX is Asia’s leading music streaming service, holding the world’s most comprehensive Asia-Pop music library with over 30 million tracks. They offer a generous, unlimited version of their service to millions of people, supported by advertising and paid subscriptions. This delicate model is dependent on accurately predicting churn of their paid users.

In this notebook, we'll use a machine learning model called XGBoost to predict whether a user will churn after their subscription expires. Currently, the company uses survival analysis techniques to determine the residual membership lifetime for each subscriber.

# Data

We combine multiple datasets, including the subscription, membership, and user activity logs, to extract the signals for training a machine learning model. We use an EMR cluster to perform the feature engineering work directly from within SageMaker Studio. For details about using EMR and PySpark, please refer to the notebook [here](processing_pyspark.ipynb).

In the following section, we'll explore the curated dataset in greater detail. 

In [2]:
!pip install sagemaker -U


## Setup

Let's import the Python libraries we'll need for this project.

In [217]:
import sagemaker
import boto3
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput
from sagemaker.experiments.run import Run
from sagemaker.xgboost.estimator import XGBoost
import pandas as pd
import logging
import time
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    HyperbandStrategyConfig,
    StrategyConfig
)
from sagemaker.experiments import load_run
from sagemaker.processing import ScriptProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.xgboost.model import XGBoostModel
from sagemaker.model_metrics import MetricsSource, ModelMetrics, FileSource

In [124]:
role = sagemaker.get_execution_role()
sagemaker_session = Session()
bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name

prefix = "data/kkbox-customer-churn-model"
experiment_name = "kkbox-customer-churn-model-experiment"
content_type = "csv"
customer_churn_dataset_s3 = f"{prefix}/processed/all/customer_churn.csv"
s3_model_evaluation_prefix = "data/kkbox-customer-churn-model/evaluation"
inference_serving_instance_type = "ml.m5.xlarge"

In [5]:
# Pass the module's __name__ variable, not the literal string '__name__'
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

# Explore the Dataset

In [216]:
feature_cols = ['msno', 'is_churn', 'regist_trans', 'mst_frq_plan_days', 
            'revenue', 'regist_cancels', 'bd', 'tenure', 'num_25', 
            'num_50', 'num_75', 'num_985', 'num_100', 'num_unq', 
            'total_secs', 'city', 'gender', 'registered_via', 'qtr_trans', 'mst_frq_pay_met', 'is_auto_renew']

churn_df = pd.read_csv(f"s3://{bucket}/{customer_churn_dataset_s3}", names=feature_cols)

In [7]:
churn_df.head()

Unnamed: 0,msno,is_churn,regist_trans,mst_frq_plan_days,revenue,regist_cancels,bd,tenure,num_25,num_50,...,num_985,num_100,num_unq,total_secs,city,gender,registered_via,qtr_trans,mst_frq_pay_met,is_auto_renew
0,++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=,0,3.295837,3.433987,8.230844,0.0,0,7.341067,1.292343,0.886311,...,13.603248,21.37355,7.728394,2.715068,0.0,0.0,0.0,0.0,0.0,0.0
1,+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=,0,3.135494,3.433987,8.095294,0.0,0,20.822034,1.423729,1.186441,...,28.584746,46.016949,8.457757,10.832877,5.0,1.0,1.0,0.0,6.0,0.0
2,+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=,0,3.135494,3.433987,8.095294,0.0,0,4.432647,1.139461,0.838352,...,31.03962,30.091918,8.62362,13.010959,10.0,1.0,1.0,0.0,6.0,0.0
3,+0jTOa6KGPk1vtNTwRDMZc/McUo41AeuwV3ndo54Y+Q=,0,3.091042,3.433987,8.048788,0.0,0,5.446634,2.706076,1.07225,...,16.692939,21.883415,7.874394,3.032877,3.0,2.0,1.0,2.0,7.0,0.0
4,+0l+FDuhyjaZnu0APnrg5L9QqgaRw4RmdQMvqOtKDmU=,0,3.218876,3.433987,8.18228,0.0,0,6.696237,1.327957,0.704301,...,23.787634,29.215054,8.276365,2.043836,2.0,1.0,3.0,0.0,5.0,0.0


In [8]:
print(f"total number of rows: {churn_df.shape[0]}, columns: {churn_df.shape[1]}")

total number of rows: 1570131, columns: 21


## Ingest Features into SageMaker Feature Store

Since the data is already preprocessed, we'll simply work on ingesting the data into a SageMaker offline store. 

# Feature Store Overview

Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. Features are inputs to ML models used during training and inference. Features are used repeatedly by multiple teams and feature quality is critical to ensure a highly accurate model. Also, when features used to train models offline in batch are made available for real-time inference, it’s hard to keep the two feature stores synchronized. SageMaker Feature Store provides a secured and unified store for feature use across the ML lifecycle.

![how feature store works](img/sm_feature_store.png)

The following cells demonstrate how to get started with Feature Store, create feature groups, and ingest data into them. 
These feature groups are stored in your Feature Store.

Feature groups are resources that contain metadata for all data stored in your Feature Store. A feature group is a logical grouping of features, defined in the feature store to describe records. A feature group’s definition is composed of a list of feature definitions, a record identifier name, and configurations for its online and offline store.

* Create a feature group
* Ingest data into a feature group




## Create a Feature Group 

In [9]:
from time import gmtime, strftime, sleep
import pandas as pd
import botocore
from datetime import datetime
from datetime import timezone
customer_churn_feature_group_name = "kkbox-customer-churn-features"
customer_churn_feature_group_s3_prefix = f"{prefix}/feature-store"

First, we want to check whether the feature group already exists. We can use the boto3 client for SageMaker to do that.

In [10]:
sm_client = boto3.client("sagemaker")

In [11]:
create_new_fs = False

In [12]:
try:
    # describe_feature_group raises a ClientError if the feature group does not exist
    response = sm_client.describe_feature_group(
        FeatureGroupName=customer_churn_feature_group_name)
except botocore.exceptions.ClientError as error:
    if error.response['Error']['Code'] == "ResourceNotFound":
        create_new_fs = True
    else:
        raise

In [13]:
def generate_event_timestamp():
    # naive datetime representing local time
    naive_dt = datetime.now()
    # take timezone into account
    aware_dt = naive_dt.astimezone()
    # time in UTC
    utc_dt = aware_dt.astimezone(timezone.utc)
    # transform to ISO-8601 format
    event_time = utc_dt.isoformat(timespec='milliseconds')
    event_time = event_time.replace('+00:00', 'Z')
    return event_time
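
The function above reaches UTC by way of local time. As a minimal sketch of a shorter route, assuming the goal is the same millisecond-precision UTC string with a `Z` suffix, we could ask for UTC directly. The name `utc_event_timestamp` is hypothetical and not part of the original notebook.

```python
import re
from datetime import datetime, timezone


def utc_event_timestamp() -> str:
    """Return the current UTC time as an ISO-8601 string with a 'Z' suffix."""
    # Ask for a timezone-aware UTC datetime directly instead of converting from local time.
    now = datetime.now(timezone.utc)
    # isoformat() renders the UTC offset as "+00:00"; swap it for the "Z" suffix.
    return now.isoformat(timespec="milliseconds").replace("+00:00", "Z")


# Sanity check on the shape of the result: yyyy-MM-ddTHH:mm:ss.SSSZ
ISO_MILLIS_UTC = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z$")
print(utc_event_timestamp())
```

Both versions should produce strings of the same shape, so either could populate the event_time column.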

In [14]:
def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get('FeatureGroupStatus')
    print(f'Initial status: {status}')
    while status == 'Creating':
        logger.info(f'Waiting for feature group: {feature_group.name} to be created ...')
        time.sleep(5)
        status = feature_group.describe().get('FeatureGroupStatus')
    if status != 'Created':
        raise SystemExit(f'Failed to create feature group {feature_group.name}: {status}')
    logger.info(f'FeatureGroup {feature_group.name} was successfully created.')
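
The loop above waits for as long as the status stays at Creating. A minimal sketch of the same polling pattern with an upper bound on the wait is shown below. The helper `wait_for_status` and its parameters are hypothetical, and the status callable is a stand-in for `feature_group.describe().get('FeatureGroupStatus')`, so the sketch runs without AWS access.

```python
import time
from typing import Callable


def wait_for_status(
    describe_status: Callable[[], str],
    pending: str = "Creating",
    done: str = "Created",
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> None:
    """Poll describe_status() until it leaves `pending`, or give up after the timeout."""
    deadline = time.monotonic() + timeout_seconds
    status = describe_status()
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {pending} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
        status = describe_status()
    if status != done:
        raise RuntimeError(f"Unexpected final status: {status}")


# Stand-in status source: reports "Creating" twice, then "Created".
fake_statuses = iter(["Creating", "Creating", "Created"])
wait_for_status(lambda: next(fake_statuses), poll_seconds=0.01)
print("feature group is ready")
```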

Since we are working with a pandas DataFrame, we can use the SageMaker Feature Store API to infer the schema from it.

In [16]:
from sagemaker.feature_store.feature_group import FeatureGroup

if create_new_fs:
    customer_churn_feature_group = FeatureGroup(
        name=customer_churn_feature_group_name, sagemaker_session=sagemaker_session
    )
    churn_df['event_time'] = generate_event_timestamp()
    customer_churn_feature_group.load_feature_definitions(data_frame=churn_df)
    customer_churn_feature_group.create(s3_uri=f's3://{bucket}/{customer_churn_feature_group_s3_prefix}', 
                               record_identifier_name='msno', 
                               event_time_feature_name='event_time', 
                               role_arn=role, 
                               enable_online_store=True)
    wait_for_feature_group_creation_complete(customer_churn_feature_group)
else:
    customer_churn_feature_group = FeatureGroup(
        name=customer_churn_feature_group_name, sagemaker_session=sagemaker_session
    )
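
To illustrate what inferring a schema amounts to, here is a simplified sketch in plain Python that maps each column's values to one of the three Feature Store feature types (Integral, Fractional, String). This is not the SDK's implementation. The helpers `infer_feature_type` and `infer_schema` and the sample values are hypothetical.

```python
from typing import Any, Dict, List


def infer_feature_type(values: List[Any]) -> str:
    """Map a column's values to a Feature Store type: Integral, Fractional, or String."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "Integral"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "Fractional"
    return "String"


def infer_schema(columns: Dict[str, List[Any]]) -> List[Dict[str, str]]:
    """Build one feature definition per column."""
    return [
        {"FeatureName": name, "FeatureType": infer_feature_type(values)}
        for name, values in columns.items()
    ]


# Hypothetical sample rows, shaped like a few of the columns in churn_df.
sample = {
    "msno": ["user-a", "user-b"],
    "is_churn": [0, 1],
    "tenure": [7.34, 20.82],
    "event_time": ["2023-02-02T00:44:54.410Z"] * 2,
}
for definition in infer_schema(sample):
    print(definition)
```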

## Ingest the data into the newly created feature group

In [None]:
logger.info(f'Ingesting data into feature group: {customer_churn_feature_group.name} ...')
customer_churn_feature_group.ingest(data_frame=churn_df, max_processes=16, wait=True)
logger.info(f'{len(churn_df)} customer records ingested into feature group: {customer_churn_feature_group.name}')

## Using the Amazon SageMaker Python SDK to get your data from your feature groups
You can use the Feature Store APIs to create a dataset from your feature groups. Data scientists create ML datasets for training by retrieving ML feature data from one or more feature groups in the offline store. Use the create_dataset() function to create the dataset. You can use the SDK to do the following:

* Create a dataset from multiple feature groups.
* Create a dataset from the feature groups and a pandas data frame.

By default, Feature Store doesn't include records that you've deleted from the dataset. It also doesn't include duplicated records. A duplicate record has the same record ID and the same timestamp value in the event time column as another record. Before you use the SDK to create a dataset, you must start a SageMaker session. Use the following code to start the session.

In [20]:
from sagemaker.feature_store.feature_store import FeatureStore

featurestore_runtime = boto3.client(service_name="sagemaker-featurestore-runtime",region_name=region)
feature_store_output_prefix=f"{prefix}/feature-store-output"
feature_store_session = sagemaker.Session(
    sagemaker_client=sm_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime,
)

In [111]:
feature_cols = ['msno', 'is_churn', 'regist_trans', 'mst_frq_plan_days', 
            'revenue', 'regist_cancels', 'bd', 'tenure', 'num_25', 
            'num_50', 'num_75', 'num_985', 'num_100', 'num_unq', 
            'total_secs', 'city', 'gender', 'registered_via', 'qtr_trans', 'mst_frq_pay_met', 'is_auto_renew']
feature_store = FeatureStore(feature_store_session)
builder = feature_store.create_dataset(
    base=customer_churn_feature_group,
    output_path=f"s3://{bucket}/{feature_store_output_prefix}",
    included_feature_names=feature_cols
)

In [22]:
fs_output = builder.to_dataframe()

INFO:sagemaker:Query 2cbce736-b6ce-4877-8306-ca897be07fd4 is being executed.
INFO:sagemaker:Query 2cbce736-b6ce-4877-8306-ca897be07fd4 successfully executed.


Because the output from the query above is a tuple (dataframe and the SQL query used in fetching the data from the feature store),
we'll reference the dataframe for our use case. 

In [23]:
df = fs_output[0]

# Train an XGBoost model using SageMaker Training
In the following section, we'll use the dataset retrieved from the feature store to train a model. 
Specifically, we'll use the column named "is_churn" as the target label, and train a binary classification model to predict customer churn given the 
input data.

First, we will set some hyperparameters for the model, then evaluate the performance based on these hyperparameters. We could either tune this set of parameters manually 
based on the performance, or use the SageMaker automatic model tuning feature to help us find the most effective parameters. 
We'll need to capture the model performance metrics so that we can review the model and decide whether to retrain it or proceed to deploy it in the downstream process.





Before we train a model, we'll need to split the data into train, validation, and test datasets. We'll use the scikit-learn library to do this.

In [25]:
from sklearn.model_selection import train_test_split

In [80]:
# df.drop(["event_time"], axis=1, inplace=True) # dropping event_time from the columns since it was added for feature store
train, test = train_test_split(df, test_size=0.33, random_state=42)
train, val = train_test_split(train, test_size=0.125, random_state=42)
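
Because the validation set is carved out of what remains after the test split, its share of the full dataset is 0.125 × 0.67, not 0.125. The short sketch below works through that arithmetic. The helper `split_fractions` is hypothetical, and the actual row counts from scikit-learn may differ slightly because of rounding.

```python
def split_fractions(test_size: float, val_size: float) -> dict:
    """Fractions of the full dataset after a test split followed by a validation split."""
    remaining = 1.0 - test_size   # share left after holding out the test set
    val = remaining * val_size    # validation is taken from the remainder
    train = remaining - val
    return {"train": train, "val": val, "test": test_size}


fractions = split_fractions(test_size=0.33, val_size=0.125)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.4f}")
```

With the values used in this notebook, that works out to roughly 58.6% train, 8.4% validation, and 33% test.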

We'll upload the train, test, and validation datasets to S3 for the next step.

In [92]:
train_dataset_s3_prefix = f"{prefix}/input/train"
test_dataset_s3_prefix = f"{prefix}/input/test"
val_dataset_s3_prefix = f"{prefix}/input/val"

train.to_csv(f"s3://{bucket}/{train_dataset_s3_prefix}/train.csv", index=False, header=False)
test.to_csv(f"s3://{bucket}/{test_dataset_s3_prefix}/test.csv", index=False, header=False)
val.to_csv(f"s3://{bucket}/{val_dataset_s3_prefix}/val.csv", index=False, header=False)

In [29]:
train.head()

Unnamed: 0,msno,is_churn,regist_trans,mst_frq_plan_days,revenue,regist_cancels,bd,tenure,num_25,num_50,...,num_985,num_100,num_unq,total_secs,city,gender,registered_via,qtr_trans,mst_frq_pay_met,is_auto_renew
496039,HSKFilVIP2kVbGVmzcr3vnBLkZ5Z5VnDMDPNQDb5vFM=,0,3.258097,3.433987,8.239593,1.098612,0,4.923841,0.534768,0.495033,...,26.849338,28.970199,8.482085,4.09863,2.0,2.0,1.0,2.0,1.0,0.0
267882,+w0pdN48/gWA22LFznPjXuUQnXqLfjrGHQ/ubboZ0Mk=,0,2.397895,3.433987,7.509883,1.098612,0,9.505917,4.677515,3.221893,...,84.772189,60.778107,9.85664,1.010959,4.0,1.0,4.0,2.0,1.0,1.0
220198,JCidxiRvGGq9C+YZD9o9c0vWXEBjDCwlCCsV0trekyM=,0,3.178054,3.433987,8.048788,0.693147,0,6.110368,2.0,1.588629,...,117.899666,77.341137,9.540826,10.268493,10.0,1.0,1.0,3.0,1.0,0.0
646547,FzMY8Z7yxKVh2cQ/QfumjSiGspsjctozK+06wGKadbI=,0,2.772589,3.433987,7.594381,0.693147,0,5.941994,1.493841,0.897251,...,24.619346,25.098488,7.977952,3.575661,1.0,0.0,2.0,0.0,0.0,0.0
30860,8UqlJXmPq8BwWdK49bCWPIyzxIud1pJlDDe7iUeduys=,0,3.044522,3.433987,7.591357,0.0,0,8.883436,2.297546,0.984663,...,16.880368,23.634969,7.841221,1.715068,0.0,0.0,0.0,2.0,0.0,0.0


## SageMaker Experiment
Amazon SageMaker Experiments is a capability of Amazon SageMaker that lets you create, manage, analyze, and compare your machine learning experiments.
Because machine learning is an iterative process, we need to experiment with multiple combinations of data, algorithms, and parameters, all while observing the impact of incremental changes on model accuracy. Over time, this iterative experimentation can result in thousands of model training runs and model versions. This makes it hard to track the best performing models and their input configurations. It’s also difficult to compare active experiments with past experiments to identify opportunities for further incremental improvements. Use SageMaker Experiments to organize, view, analyze, and compare iterative ML experimentation to gain comparative insights and track your best performing models.

SageMaker Experiments automatically tracks the inputs, parameters, configurations, and results of your iterations as runs. You can assign, group, and organize these runs into experiments. SageMaker Experiments is integrated with Amazon SageMaker Studio, providing a visual interface to browse your active and past experiments, compare runs on key performance metrics, and identify the best performing models. SageMaker Experiments tracks all of the steps and artifacts that went into creating a model, and you can quickly revisit the origins of a model when you are troubleshooting issues in production, or auditing your models for compliance verifications.

# Amazon SageMaker Debugger
With Amazon SageMaker Debugger, you can debug models during training. During training, Debugger periodically saves tensors, which capture the state of the model at that point in time. Debugger saves the tensors to Amazon S3 for analysis and visualization. This allows you to diagnose training issues with Studio.

# Specify Debugger rules
To enable automated detection of common issues during training, you can attach a list of rules to evaluate the training job against.

For our project, we'll configure the following rules:

* LossNotDecreasing rule -- Triggered when the loss stops decreasing at an adequate rate during training.
* Overtraining rule -- Triggered when the model approaches a minimum of the loss function and no longer improves.
* Overfit rule -- Detects whether the model is overfitting to the training data by comparing the validation and training losses.
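
To build intuition for what rules like these look for, here is a deliberately simplified sketch in plain Python. It is not how Debugger implements its built-in rules. The function names, window size, and thresholds are all hypothetical.

```python
from typing import List


def loss_not_decreasing(losses: List[float], window: int = 3, min_drop: float = 0.0) -> bool:
    """Flag when the latest loss is no better than the loss `window` steps earlier."""
    if len(losses) <= window:
        return False  # not enough history yet
    return losses[-1 - window] - losses[-1] <= min_drop


def overfit(train_losses: List[float], val_losses: List[float], max_gap_ratio: float = 0.1) -> bool:
    """Flag when validation loss exceeds training loss by more than `max_gap_ratio`."""
    train, val = train_losses[-1], val_losses[-1]
    if val <= 0:
        return False  # avoid dividing by zero on a degenerate loss
    return (val - train) / val > max_gap_ratio


healthy = [0.69, 0.55, 0.47, 0.42, 0.39]        # loss keeps falling
stalled = [0.69, 0.55, 0.47, 0.47, 0.47, 0.47]  # loss has plateaued
print(loss_not_decreasing(healthy))
print(loss_not_decreasing(stalled))
print(overfit(train_losses=[0.30], val_losses=[0.31]))
print(overfit(train_losses=[0.10], val_losses=[0.40]))
```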

In [50]:
from sagemaker.debugger import rule_configs, Rule

debug_rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.overfit())
]

In [51]:
hyperparameters = {
    "max_depth":5,
    "eta":0.2,
    "gamma":4,
    "min_child_weight":6,
    "subsample":0.7,
    "n_estimators":500,
    "region" : region
}

with Run(experiment_name=experiment_name, sagemaker_session=sagemaker_session) as run: 
    # initialize hyperparameters
    output_path = 's3://{}/{}/output'.format(bucket, prefix)    
    estimator = XGBoost(entry_point = "pipelines/cust_churn_prediction/train-fs.py", 
                    framework_version='1.5-1',
                    hyperparameters=hyperparameters,
                    role=sagemaker.get_execution_role(),
                    instance_count=1,
                    instance_type='ml.m5.2xlarge',
                    volume_size =10,
                    output_path=output_path, 
                    base_job_name="kkbox-customer-churn-training",
                    rules=debug_rules)

    train_input = TrainingInput(f"s3://{bucket}/{train_dataset_s3_prefix}")
    test_input = TrainingInput(f"s3://{bucket}/{test_dataset_s3_prefix}")

    # execute the XGBoost training job
    estimator.fit({'train': train_input, 'test': test_input})
    

INFO:sagemaker.image_uris:Ignoring unnecessary Python version: py3.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: ml.m5.2xlarge.
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: kkbox-customer-churn-training-2023-02-02-00-44-54-410


2023-02-02 00:44:54 Starting - Starting the training job...
2023-02-02 00:45:14 Starting - Preparing the instances for training
LossNotDecreasing: InProgress
Overtraining: InProgress
Overfit: InProgress
......
2023-02-02 00:46:25 Downloading - Downloading input data......
2023-02-02 00:47:26 Training - Training image download completed. Training in progress.
[2023-02-02 00:47:11.647 ip-10-0-74-164.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-02-02:00:47:11:INFO] Imported framework sagemaker_xgboost_container.training
[2023-02-02:00:47:11:INFO] No GPUs detected (normal if no gpus installed)
[2023-02-02:00:47:11:INFO] Invoking user training script.
[2023-02-02:00:47:11:INFO] Module train-fs does not provide a setup.py. 
Generating setup.py
[2023-02-02:00:47:11:INFO] Generating setup.cfg
[2023-02-02:00:47:11:INFO] Generating MANIFEST.in
[2023-02-02:00:47:11:INFO] Installing module with

## Visualize Training performance using SageMaker Experiment

From the training run, we captured a few performance metrics, including the training and testing loss, the confusion matrix, and the precision-recall and ROC curves.
Additionally, we can use SageMaker Experiments to visualize these metrics as charts to help us understand the model. 

![sagemaker experiment metrics](img/sm_experiment_metrics.png)


Here are the visualizations that provide insights into validation loss over time, the confusion matrix, and the precision-recall and ROC curves.

![sagemaker experiment pr](img/sm-experiment-pr-curve.png)
![sagemaker experiment roc](img/sm-experiment-cm-roc-curve.png)
![sagemaker experiment logloss](img/sm-experiment-logloss.png)


For SageMaker Debugger, here are the rule statuses and the system utilization over the course of training.
![sagemaker experiment debugger rules](img/sm-debugger-rules.png)

![sagemaker experiment debugger system](img/sm-debugger-system.png)

![sagemaker experiment debugger gpu system](img/sm-debugger-gpu-system.png)


## Automatic Hyperparameter Tuning

Amazon SageMaker automatic model tuning (AMT), also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset. To do this, AMT uses the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that create a model that performs the best, as measured by a metric that you choose.

Here are the tuning strategies supported by SageMaker hyperparameter tuning:

## Grid Search
Hyperparameter tuning chooses combinations of values from the range of categorical values that you specify when you create the job. Only categorical parameters are supported when using the grid search strategy. 
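
As a rough illustration of the idea, the sketch below enumerates every combination of two categorical ranges in plain Python. The values are hypothetical and are not the ranges tuned in this notebook. It shows the concept only, not the SageMaker API.

```python
from itertools import product

# Hypothetical categorical ranges; grid search tries every combination.
grid = {
    "max_depth": [3, 5, 7],
    "eta": [0.1, 0.2],
}

names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(combinations))  # 3 values * 2 values = 6 combinations
for params in combinations:
    print(params)
```

The number of combinations grows multiplicatively with each added hyperparameter, which is why grid search is usually kept to small, discrete ranges.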

## Random Search
Hyperparameter tuning chooses a random combination of values from within the ranges that you specify for hyperparameters for each training job it launches. 
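
A minimal sketch of the idea in plain Python follows, using the same ranges this notebook tunes later (eta in [0, 1] and max_depth from 1 to 10). The helper `sample_hyperparameters` is hypothetical. It illustrates the concept only, not how SageMaker draws its samples.

```python
import random


def sample_hyperparameters(rng: random.Random) -> dict:
    """Draw one random configuration from the given ranges."""
    return {
        "eta": rng.uniform(0.0, 1.0),     # continuous range [0, 1]
        "max_depth": rng.randint(1, 10),  # integer range 1..10, inclusive
    }


rng = random.Random(42)  # fixed seed so the draws are reproducible
candidates = [sample_hyperparameters(rng) for _ in range(6)]
for params in candidates:
    print(params)
```

Each training job would receive one such draw, and the draws are independent of each other, so jobs can run in parallel.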

## Bayesian Optimization
Bayesian optimization treats hyperparameter tuning like a regression problem. Given a set of input features (the hyperparameters), hyperparameter tuning optimizes a model for the metric that you choose. To solve a regression problem, hyperparameter tuning makes guesses about which hyperparameter combinations are likely to get the best results, and runs training jobs to test these values. After testing a set of hyperparameter values, hyperparameter tuning uses regression to choose the next set of hyperparameter values to test.

## Hyperband 
Hyperband is a multi-fidelity based tuning strategy that dynamically reallocates resources. It uses both intermediate and final results of training jobs to reallocate epochs to well-utilized hyperparameter configurations and automatically stops those that underperform.
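
The core idea of stopping weak configurations early and giving survivors more budget can be sketched with successive halving, the building block that Hyperband repeats with different starting budgets. The code below is a simplified illustration, not SageMaker's implementation. The function names, the reduction factor, and the toy loss are hypothetical.

```python
import random
from typing import Callable, Dict, List


def successive_halving(
    configs: List[Dict[str, float]],
    evaluate: Callable[[Dict[str, float], int], float],
    min_resource: int = 1,
    max_resource: int = 8,
    reduction_factor: int = 2,
) -> Dict[str, float]:
    """Each round, keep the best 1/reduction_factor of configs and grow their budget."""
    survivors = list(configs)
    resource = min_resource
    while len(survivors) > 1 and resource <= max_resource:
        # Score every surviving config at the current budget (lower loss is better).
        scored = sorted(survivors, key=lambda cfg: evaluate(cfg, resource))
        keep = max(1, len(scored) // reduction_factor)
        survivors = scored[:keep]
        resource *= reduction_factor
    return survivors[0]


def toy_loss(cfg: Dict[str, float], resource: int) -> float:
    """Hypothetical loss: best near eta = 0.3, improving as the budget grows."""
    return abs(cfg["eta"] - 0.3) + 1.0 / resource


rng = random.Random(0)
configs = [{"eta": rng.random()} for _ in range(8)]
best = successive_halving(configs, toy_loss)
print(best)
```

With eight starting configurations, the sketch evaluates eight at budget 1, four at budget 2, and two at budget 4, so most of the budget goes to the configurations that looked promising early.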

For this experiment, we'll use hyperparameter tuning to automatically tune both the eta and max_depth parameters. We'll use Hyperband as the tuning strategy to train the XGBoost model and observe the results.

In [77]:
hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 1),
    "max_depth": IntegerParameter(1, 10),
}

In [76]:
hyperparameters = {
    "max_depth":5,
    "eta":0.2,
    "gamma":4,
    "min_child_weight":6,
    "subsample":0.7,
    "n_estimators":100,
    "region" : region,
    "sm_experiment" : experiment_name,
    "sm_run" : "default"
}

objective_metric_name = "validation:logloss"
output_path = 's3://{}/{}/output'.format(bucket, prefix)    
estimator = XGBoost(entry_point = "pipelines/cust_churn_prediction/train-fs.py", 
                framework_version='1.5-1',
                hyperparameters=hyperparameters,
                role=sagemaker.get_execution_role(),
                instance_count=1,
                instance_type='ml.m5.2xlarge',
                volume_size =10,
                output_path=output_path, 
                base_job_name="kkbox-customer-churn-training")

train_input = TrainingInput(f"s3://{bucket}/{train_dataset_s3_prefix}")
test_input = TrainingInput(f"s3://{bucket}/{test_dataset_s3_prefix}")

strategy_config = StrategyConfig(
    HyperbandStrategyConfig(
        max_resource=10,
        min_resource=1
    )
)

# metric_definitions = [ { 'Name': 'validation:logloss', 'Regex': "validation:logloss=(.*?);"} ]

tuner = HyperparameterTuner(
    estimator, 
    objective_metric_name, 
    hyperparameter_ranges, 
    max_jobs=6, 
    max_parallel_jobs=3,
    strategy="Hyperband",
    objective_type="Minimize",
    strategy_config=strategy_config)
    
tuner.fit({'train': train_input, 'test': test_input})

INFO:sagemaker.image_uris:Ignoring unnecessary Python version: py3.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: ml.m5.2xlarge.
INFO:sagemaker:Creating hyperparameter tuning job with name: sagemaker-xgboost-230202-0327


............................................................................................!


# HyperParameter Tuner Job Analysis

After the HPO jobs complete, we can use SageMaker Experiments to evaluate the metrics and determine the best model. In our example, we defined 
validation:logloss as our objective metric. In SageMaker Experiments, we plot a bar chart to show the metric for each job triggered by the HPO. 
    
![sagemaker experiment hpo analysis](img/sm-hpo-analysis.png)

We can also retrieve the best-performing model using the SDK directly.

In [78]:

best_model = tuner.best_estimator()

INFO:sagemaker.image_uris:Ignoring unnecessary Python version: py3.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: ml.m5.2xlarge.


2023-02-02 03:33:10 Starting - Found matching resource for reuse
2023-02-02 03:33:10 Downloading - Downloading input data
2023-02-02 03:33:10 Training - Training image download completed. Training in progress.
2023-02-02 03:33:10 Uploading - Uploading generated training model
2023-02-02 03:33:10 Completed - Resource released due to keep alive period expiry
[2023-02-02 03:31:17.713 ip-10-0-89-11.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-02-02:03:31:17:INFO] Imported framework sagemaker_xgboost_container.training
[2023-02-02:03:31:17:INFO] Failed to parse hyperparameter _tuning_objective_metric value validation:logloss to Json.
Returning the value itself
[2023-02-02:03:31:17:INFO] No GPUs detected (normal if no gpus installed)
[2023-02-02:03:31:17:INFO] Invoking user training script.
[2023-02-02:03:31:17:INFO] Module train-fs does not provide a setup.py. 
Generating setup.py
[2023

In [97]:
best_model.model_data

's3://sagemaker-us-east-1-602900100639/data/kkbox-customer-churn-model/output/sagemaker-xgboost-230202-0327-004-094f7b87/output/model.tar.gz'

# Model Performance Evaluation

Next, we'll perform one final performance evaluation using the validation dataset that we held out while splitting the data for training. 
The metrics to capture in our example are:

* AUC score
* recall
* precision
* accuracy
* F1 score

These metrics summarize the model's performance and will be recorded when we register the model with SageMaker Model Registry.
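
For reference, the sketch below shows how the metrics listed above can be computed from labels and predicted scores in plain Python. It is an illustration on hypothetical toy data, not the contents of the evaluation script used in this notebook. The function names and the 0.5 threshold are assumptions.

```python
from typing import Dict, List


def classification_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc_score(y_true: List[int], scores: List[float]) -> float:
    """AUC as the chance that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Count a win for each correctly ordered pair, and half a win for ties.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted churn probabilities for eight users.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold of 0.5

print(classification_metrics(y_true, y_pred))
print(auc_score(y_true, scores))
```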

We'll use a SageMaker Processing job to run the model evaluation step and upload the metrics to the given S3 bucket. 

In [105]:
s3_model_evaluation_prefix = "data/kkbox-customer-churn-model/evaluation"

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.5-1"
)

# define model evaluation step to evaluate the trained model
script_eval = ScriptProcessor(
    image_uri=image_uri,
    command=["python3"],
    instance_type="ml.m5.large",
    instance_count=1,
    base_job_name="model-evaluation",
    role=role,
    sagemaker_session=sagemaker_session,
)

script_eval.run(
    code="pipelines/cust_churn_prediction/evaluate.py",
    inputs=[
        ProcessingInput(source=best_model.model_data, 
                           destination="/opt/ml/processing/model"),
        ProcessingInput(
            source=f"s3://{bucket}/{val_dataset_s3_prefix}",
            destination="/opt/ml/processing/validation")
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation",\
                        destination=f"s3://{bucket}/{s3_model_evaluation_prefix}"),
    ]
)

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating processing-job with name model-evaluation-2023-02-02-04-35-49-688


..............................
..

# Bias Detection and Model Explainability

Algorithmic bias, discrimination, fairness, and related topics have been studied across disciplines such as law, policy, and computer science. Machine learning models learn from data, and this data may reflect disparities or other inherent biases. For example, the training data may not have sufficient representation of various feature groups or may contain biased labels. Models trained on such data could end up learning these biases and then reproduce or even exacerbate them in their predictions. The field of machine learning provides an opportunity to address biases by detecting and measuring them at each stage of the ML lifecycle.

Amazon SageMaker Clarify can detect potential bias during data preparation, after model training, and in your deployed model. For instance, you can check for bias related to gender in your dataset or in your trained model and receive a detailed report that quantifies different types of potential bias. 

SageMaker Clarify also includes feature importance scores that help you explain how your model makes predictions, and it produces explainability reports in bulk or in real time via online explainability. You can use these reports to support customer or internal presentations, or to identify potential issues with your model.

Here's a diagram that outlines how SageMaker Clarify can be integrated into the ML development lifecycle:

![sagemaker clarify](img/sm-clarify.png)


In the following section, we'll explore SageMaker Clarify's ability to provide bias and explainability analysis for our XGBoost model.

In [107]:
from sagemaker import clarify

# First we'll instantiate a Clarify Processor.
clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role, 
    instance_count=1, 
    instance_type="ml.m5.4xlarge", 
    sagemaker_session=sagemaker_session)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: 1.0.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


In [116]:
bias_report_output_path = f"s3://{bucket}/{prefix}/clarify-bias"

bias_data_config = clarify.DataConfig(
    s3_data_input_path=f"s3://{bucket}/{train_dataset_s3_prefix}/train.csv",
    s3_output_path=bias_report_output_path,
    label="is_churn",
    headers=feature_cols,
    dataset_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1], facet_name="gender", facet_values_or_threshold=[1], group_name="bd")

job_name = f"clarify-pretrain-bias-{int(time.time())}"
run_name = "clarify-pretrain-bias"
with load_run(experiment_name=experiment_name, 
              run_name=run_name,
              sagemaker_session=sagemaker_session) as run:    
    clarify_processor.run_pre_training_bias(data_config=bias_data_config, 
                          data_bias_config=bias_config, 
                          methods='all', 
                          wait=True, 
                          logs=True, 
                          job_name=job_name)

INFO:sagemaker.clarify:Analysis Config: {'dataset_type': 'text/csv', 'headers': ['msno', 'is_churn', 'regist_trans', 'mst_frq_plan_days', 'revenue', 'regist_cancels', 'bd', 'tenure', 'num_25', 'num_50', 'num_75', 'num_985', 'num_100', 'num_unq', 'total_secs', 'city', 'gender', 'registered_via', 'qtr_trans', 'mst_frq_pay_met', 'is_auto_renew'], 'label': 'is_churn', 'label_values_or_threshold': [1], 'facet': [{'name_or_index': 'gender', 'value_or_threshold': [1]}], 'group_variable': 'bd', 'methods': {'report': {'name': 'report', 'title': 'Analysis Report'}, 'pre_training_bias': {'methods': 'all'}}}
INFO:sagemaker:Creating processing-job with name clarify-pretrain-bias-1675314395


...............................
2023-02-02 05:11:35,789 logging.conf not found when configuring logging, using default logging configuration.
2023-02-02 05:11:35,790 Starting SageMaker Clarify Processing job
2023-02-02 05:11:35,790 Analysis config path: /opt/ml/processing/input/config/analysis_config.json
2023-02-02 05:11:35,790 Analysis result path: /opt/ml/processing/output
2023-02-02 05:11:35,790 This host is algo-1.
2023-02-02 05:11:35,790 This host is the leader.
2023-02-02 05:11:35,790 Number of hosts in the cluster is 1.
2023-02-02 05:11:35,792 Running Python / Pandas based analyzer.
2023-02-02 05:11:35,792 Dataset type: text/csv uri: /opt/ml/processing/input/data
2023-02-02 05:11:35,803 Loading dataset...
  df = df.append(df_tmp, ignore_index=True)
2023-02-02 05:11:37,620 Loaded dataset. Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 582105 entrie

## Pretrain Bias Analysis
Once the Clarify job is complete, we can view the Pretrain Bias Evaluation analysis report in SageMaker Experiment.

![sagemaker clarify pretrain bias](img/sm-clarify-pretrain-bias.png)
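
To help read the report, here is a minimal sketch of two of the pretraining metrics that Clarify reports, class imbalance (CI) and difference in proportions of labels (DPL), computed by hand. The toy rows and the helper name `pretraining_bias` are assumptions made for illustration; they are not taken from the KKBOX data or from the Clarify implementation. Facet `d` is the group selected by the facet value (here `gender == 1`, matching the `BiasConfig` above), and facet `a` is everyone else.

```python
def pretraining_bias(rows, facet_key, facet_value, label_key, positive_label):
    """Return (CI, DPL) for a list of dict rows.

    Facet d holds the rows where rows[facet_key] == facet_value;
    facet a holds all remaining rows.
    """
    facet_d = [r for r in rows if r[facet_key] == facet_value]
    facet_a = [r for r in rows if r[facet_key] != facet_value]
    n_d, n_a = len(facet_d), len(facet_a)

    # Class imbalance: (n_a - n_d) / (n_a + n_d), a value in [-1, 1].
    ci = (n_a - n_d) / (n_a + n_d)

    # Proportion of positive labels within each facet.
    q_a = sum(r[label_key] == positive_label for r in facet_a) / n_a
    q_d = sum(r[label_key] == positive_label for r in facet_d) / n_d

    # Difference in proportions of labels: q_a - q_d.
    dpl = q_a - q_d
    return ci, dpl


# Hypothetical toy sample: gender encoded as 0/1/2, is_churn as 0/1.
toy_rows = [
    {"gender": 0, "is_churn": 1},
    {"gender": 0, "is_churn": 0},
    {"gender": 0, "is_churn": 0},
    {"gender": 0, "is_churn": 0},
    {"gender": 2, "is_churn": 1},
    {"gender": 2, "is_churn": 0},
    {"gender": 1, "is_churn": 1},
    {"gender": 1, "is_churn": 1},
    {"gender": 1, "is_churn": 0},
    {"gender": 1, "is_churn": 0},
]

ci, dpl = pretraining_bias(toy_rows, "gender", 1, "is_churn", 1)
print(f"CI = {ci:.3f}, DPL = {dpl:.3f}")
```

A CI near 0 means the two facets are about equally represented, and a DPL near 0 means the positive label (churn) occurs at a similar rate in both facets.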

## Model Bias Analysis

Similar to the bias analysis for the training dataset, we are also interested in evaluating the bias that the model might exhibit given the signal it learned from the training data.

Amazon SageMaker Clarify provides eleven posttraining data and model bias metrics to help quantify various conceptions of fairness. These concepts cannot all be satisfied simultaneously and the selection depends on specifics of the cases involving potential bias being analyzed. Most of these metrics are a combination of the numbers taken from the binary classification confusion matrices for the different demographic groups. Because fairness and bias can be defined by a wide range of metrics, human judgment is required to understand and choose which metrics are relevant to the individual use case, and customers should consult with appropriate stakeholders to determine the appropriate measure of fairness for their application.
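
As a rough illustration of how such metrics are built from per-group confusion matrices, the sketch below computes three of them by hand: difference in positive proportions in predicted labels (DPPL), disparate impact (DI), and accuracy difference (AD). The toy labels and scores and the helper names are hypothetical. The sketch assumes a score strictly above the threshold counts as a positive prediction, and it uses 0.8 to mirror the `probability_threshold` configured later in this notebook.

```python
def confusion_counts(labels, scores, threshold):
    """Return (TP, FP, TN, FN) after thresholding scores into 0/1 predictions."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        pred = 1 if s > threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def posttraining_bias(facet_a, facet_d, threshold=0.8):
    """facet_a and facet_d are (labels, scores) pairs for the two groups."""
    stats = {}
    for name, (labels, scores) in (("a", facet_a), ("d", facet_d)):
        tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
        n = tp + fp + tn + fn
        stats[name] = {
            "pos_rate": (tp + fp) / n,  # share of rows predicted positive
            "accuracy": (tp + tn) / n,  # share of rows predicted correctly
        }
    return {
        "DPPL": stats["a"]["pos_rate"] - stats["d"]["pos_rate"],
        # DI is undefined when facet a has no positive predictions.
        "DI": stats["d"]["pos_rate"] / stats["a"]["pos_rate"],
        "AD": stats["a"]["accuracy"] - stats["d"]["accuracy"],
    }


# Hypothetical true labels and predicted churn probabilities per facet.
facet_a = ([1, 0, 0, 1], [0.90, 0.20, 0.85, 0.40])
facet_d = ([1, 0, 0, 0], [0.95, 0.10, 0.30, 0.50])

metrics = posttraining_bias(facet_a, facet_d)
print(metrics)
```

Which of these numbers matters most depends on the use case, which is why the choice of metric is left to human judgment.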

In the following section, we'll launch a SageMaker Clarify processor job to conduct the model bias analysis on the best XGBoost model trained in this notebook.

In [126]:
model_name=f"kkbox-customer-churn-model-{strftime('%Y%m%d%H%M%S', gmtime())}"
model = XGBoostModel(
    name = model_name,
    model_data=best_model.model_data,
    sagemaker_session=sagemaker_session,
    role=role,
    entry_point="pipelines/cust_churn_prediction/inference.py",
    framework_version="1.5-1"
)
model.create(instance_type=inference_serving_instance_type)

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: ml.m5.xlarge.
INFO:sagemaker:Creating model with name: kkbox-customer-churn-model-20230202053205


In [129]:
model_config = clarify.ModelConfig(
    model_name=model_name,
    instance_type=inference_serving_instance_type,
    instance_count=1,
    accept_type="text/csv",
    content_type="text/csv",
)
predictions_config = clarify.ModelPredictedLabelConfig(probability_threshold=0.8)
bias_report_output_path = f"s3://{bucket}/{prefix}/clarify-bias"

post_train_bias_data_config = clarify.DataConfig(
    s3_data_input_path=f"s3://{bucket}/{train_dataset_s3_prefix}/train.csv",
    s3_output_path=bias_report_output_path,
    headers=feature_cols,
    label="is_churn",
    dataset_type="text/csv",
    excluded_columns=["msno"]
)
post_train_bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1], facet_name="gender", facet_values_or_threshold=[1], group_name="bd")


In [130]:
post_train_bias_run_name = "clarify-post-train-bias"
with load_run(experiment_name=experiment_name, 
              run_name=post_train_bias_run_name,
              sagemaker_session=sagemaker_session) as run:
    clarify_processor.run_post_training_bias(
        data_config=post_train_bias_data_config,
        data_bias_config=post_train_bias_config,
        model_config=model_config,
        model_predicted_label_config=predictions_config)

INFO:sagemaker.clarify:Analysis Config: {'dataset_type': 'text/csv', 'headers': ['msno', 'is_churn', 'regist_trans', 'mst_frq_plan_days', 'revenue', 'regist_cancels', 'bd', 'tenure', 'num_25', 'num_50', 'num_75', 'num_985', 'num_100', 'num_unq', 'total_secs', 'city', 'gender', 'registered_via', 'qtr_trans', 'mst_frq_pay_met', 'is_auto_renew'], 'label': 'is_churn', 'excluded_columns': ['msno'], 'label_values_or_threshold': [1], 'facet': [{'name_or_index': 'gender', 'value_or_threshold': [1]}], 'group_variable': 'bd', 'methods': {'report': {'name': 'report', 'title': 'Analysis Report'}, 'post_training_bias': {'methods': 'all'}}, 'predictor': {'model_name': 'kkbox-customer-churn-model-20230202053205', 'instance_type': 'ml.m5.xlarge', 'initial_instance_count': 1, 'accept_type': 'text/csv', 'content_type': 'text/csv'}, 'probability_threshold': 0.8}
INFO:sagemaker:Creating processing-job with name Clarify-Posttraining-Bias-2023-02-02-05-39-45-687


................................
2023-02-02 05:44:52,549 logging.conf not found when configuring logging, using default logging configuration.
2023-02-02 05:44:52,550 Starting SageMaker Clarify Processing job
2023-02-02 05:44:52,550 Analysis config path: /opt/ml/processing/input/config/analysis_config.json
2023-02-02 05:44:52,550 Analysis result path: /opt/ml/processing/output
2023-02-02 05:44:52,550 This host is algo-1.
2023-02-02 05:44:52,550 This host is the leader.
2023-02-02 05:44:52,550 Number of hosts in the cluster is 1.
2023-02-02 05:44:52,764 Running Python / Pandas based analyzer.
2023-02-02 05:44:52,764 Dataset type: text/csv uri: /opt/ml/processing/input/data
2023-02-02 05:44:52,774 Loading dataset...
  df = df.append(df_tmp, ignore_index=True)
2023-02-02 05:44:54,649 Loaded dataset. Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 582105 entri

## Post Training Bias Analysis
Once the Clarify job is complete, we can view the Post Train Bias Evaluation analysis report in SageMaker Experiment.

![sagemaker clarify post train bias](img/sm-clarify-post-train-bias.png)

# Model Explainability

So far we've seen how SageMaker Clarify can be used to analyze bias in both the data and the model.
In addition to bias analysis, Clarify can also be used to explain how machine learning (ML) models make predictions. These tools can help ML modelers, developers, and other internal stakeholders understand model characteristics as a whole prior to deployment and debug predictions provided by the model after it's deployed. SageMaker Clarify uses a model-agnostic **feature attribution** approach to explain why a model made a prediction after training, and to provide per-instance explanations during inference. It includes a scalable and efficient implementation of SHAP, which is based on the concept of a Shapley value from the field of cooperative game theory and assigns each feature an importance value for a particular prediction.

Explanations are typically contrastive (that is, they account for deviations from a baseline). As a result, for the same model prediction, you can expect to get different explanations with respect to different baselines. Therefore, your choice of a baseline is crucial. In an ML context, the baseline corresponds to a hypothetical instance that can be either uninformative or informative. During the computation of Shapley values, SageMaker Clarify generates several new instances between the baseline and the given instance, in which the absence of a feature is modeled by setting the feature value to that of the baseline and the presence of a feature is modeled by setting the feature value to that of the given instance. Thus, the absence of all features corresponds to the baseline and the presence of all features corresponds to the given instance.
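
The idea of "present" and "absent" features can be sketched with an exact Shapley computation on a tiny model. This is only an illustration: `toy_model`, the instance, and the all-zero baseline are made up, and the code enumerates every coalition, which is feasible only for a handful of features. Clarify itself relies on a sampling-based approximation (Kernel SHAP) rather than exhaustive enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values of `model` at `instance` relative to `baseline`.

    A feature that is absent from a coalition takes its baseline value;
    a feature that is present takes its value from the instance.
    """
    n = len(instance)

    def value(coalition):
        mixed = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(x):
    # Hypothetical scoring function with an interaction between x[1] and x[2].
    return 2.0 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The attributions sum to the difference between the model output for the instance and for the baseline, which is why a different baseline yields a different explanation for the same prediction.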

## Choosing a Baseline For Explainability
How can you choose good baselines? Often it is desirable to select a baseline with very low information content. For example, you can construct an average instance from the training dataset by taking either the median or average for numerical features and the mode for categorical features. In a college admissions setting, for instance, you might be interested in explaining why a particular applicant was accepted as compared to a baseline of acceptances based on an average applicant. If not provided, a baseline is calculated automatically by SageMaker Clarify using K-means or K-prototypes in the input dataset.


In the following section, we'll use Clarify to analyze the Model and provide Explainability in the form of SHAP values for each feature. 



Since our dataset contains both categorical and numerical columns, we'll use the following technique to construct our baseline dataset for SHAP analysis.

* For numeric features, we'll use the mean value for the feature in the training dataset.
* For categorical features, we'll use the mode for each feature in the training dataset.
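
The two rules above can be sketched in plain Python on a few made-up rows, so the example runs without the notebook's data. The rows, column subsets, and the helper name `build_baseline` are assumptions for illustration only; the notebook applies the same idea with pandas on the full training set.

```python
from statistics import mean, mode

# Hypothetical miniature training set.
toy_train = [
    {"tenure": 2.0, "revenue": 7.5, "city": 0, "is_auto_renew": 1},
    {"tenure": 6.0, "revenue": 8.5, "city": 2, "is_auto_renew": 1},
    {"tenure": 4.0, "revenue": 8.0, "city": 0, "is_auto_renew": 0},
]
numeric_cols = ["tenure", "revenue"]
categorical_cols = ["city", "is_auto_renew"]


def build_baseline(rows, numeric_cols, categorical_cols):
    """Build one baseline row: mean for numeric columns, mode for categorical ones."""
    baseline = {}
    for col in numeric_cols:
        baseline[col] = mean([r[col] for r in rows])
    for col in categorical_cols:
        baseline[col] = mode([r[col] for r in rows])
    return baseline


toy_baseline = build_baseline(toy_train, numeric_cols, categorical_cols)
print(toy_baseline)
```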

In [200]:
train.head()

Unnamed: 0,msno,is_churn,regist_trans,mst_frq_plan_days,revenue,regist_cancels,bd,tenure,num_25,num_50,...,num_985,num_100,num_unq,total_secs,city,gender,registered_via,qtr_trans,mst_frq_pay_met,is_auto_renew
452844,x7jIfnVqYQJQ/rUpXQ6a9qD3o3op5BOprCBV2AFOoX4=,0,2.639057,3.433987,8.273592,0.0,0,3.734242,1.357751,0.727428,...,15.051107,16.144804,7.85009,5.876712,6.0,0.0,1.0,0.0,4.0,0.0
840421,/qcOoQQtPFBQHIoa/be/3672SArOmNqfzH+Z16l07ng=,0,1.386294,3.433987,6.293419,0.0,0,28.158768,1.187204,0.881517,...,18.751185,45.950237,8.07988,3.246575,2.0,2.0,1.0,0.0,4.0,0.0
716023,NXlC99Lxpo4RNFIa+dHhSSXhYDi8K9T19cuT4nPFSEs=,0,2.564949,3.433987,7.423568,0.0,0,6.538835,1.315534,1.019417,...,21.17233,19.15534,8.178232,8.758904,2.0,1.0,1.0,0.0,3.0,0.0
966444,xRZdxjljVoblzXnN3lASe+bbYdkYUH1GRt60MMxzo/s=,0,2.995732,3.433987,7.54009,0.0,0,10.146853,2.538462,1.153846,...,12.48951,23.482517,7.650415,1.715068,0.0,0.0,0.0,2.0,0.0,0.0
469497,1wbBkYB1AOyJrfitu2KuJC9+klX6PGreWmHgMj7vXbI=,0,3.295837,3.433987,8.262301,0.0,0,2.020833,1.106771,0.440104,...,18.140625,13.997396,7.802782,3.558904,0.0,0.0,0.0,0.0,0.0,0.0


In [144]:
category_feature_names = [ 'city', 'gender', 'registered_via', 'qtr_trans', 'mst_frq_pay_met', 'is_auto_renew' ]

In [151]:
numeric_feature_names = [ x for x in feature_cols if x not in category_feature_names and x not in [ "msno", "is_churn" ] ]

In [177]:
baseline_df = pd.DataFrame(columns= ['regist_trans', 'mst_frq_plan_days', 'revenue', 'regist_cancels', 'bd', 'tenure', 'num_25', 
    'num_50', 'num_75', 'num_985', 'num_100', 'num_unq', 
    'total_secs', 'city', 'gender', 'registered_via', 'qtr_trans', 'mst_frq_pay_met', 'is_auto_renew'])

In [178]:
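# Fill the categorical columns first: assigning the mode Series gives the empty
# DataFrame its row index, and row 0 holds the most frequent value.
for f in category_feature_names:
    baseline_df[f] = train[f].mode()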

In [179]:
# The scalar mean is broadcast to the rows that already exist in baseline_df.
for f in numeric_feature_names:
    baseline_df[f] = train[f].mean()

In [180]:
baseline = baseline_df.loc[0,:].values.tolist()

In [181]:
shap_config = clarify.SHAPConfig(
    baseline=[baseline],
    agg_method="mean_abs",
    num_samples=59,
    save_local_shap_values=True,
)

In [185]:
explainability_output_path = f"s3://{bucket}/{prefix}/clarify-explainability"
explainability_data_config = clarify.DataConfig(
    s3_data_input_path=f"s3://{bucket}/{val_dataset_s3_prefix}/val.csv",
    s3_output_path=explainability_output_path,
    label="is_churn",
    headers=feature_cols,
    dataset_type="text/csv",
    excluded_columns=["msno"]
)

In [187]:
explainability_run_name = "clarify-explainability"
with load_run(experiment_name=experiment_name, 
              run_name=explainability_run_name,
              sagemaker_session=sagemaker_session) as run:    

    clarify_processor.run_explainability(
        data_config=explainability_data_config,
        model_config=model_config,
        explainability_config=shap_config,
)

INFO:sagemaker.clarify:Analysis Config: {'dataset_type': 'text/csv', 'headers': ['msno', 'is_churn', 'regist_trans', 'mst_frq_plan_days', 'revenue', 'regist_cancels', 'bd', 'tenure', 'num_25', 'num_50', 'num_75', 'num_985', 'num_100', 'num_unq', 'total_secs', 'city', 'gender', 'registered_via', 'qtr_trans', 'mst_frq_pay_met', 'is_auto_renew'], 'label': 'is_churn', 'excluded_columns': ['msno'], 'predictor': {'model_name': 'kkbox-customer-churn-model-20230202053205', 'instance_type': 'ml.m5.xlarge', 'initial_instance_count': 1, 'accept_type': 'text/csv', 'content_type': 'text/csv'}, 'methods': {'report': {'name': 'report', 'title': 'Analysis Report'}, 'shap': {'use_logit': False, 'save_local_shap_values': True, 'baseline': [[2.663893229857712, 3.465653240347559, 7.443359819138155, 0.17611774303500843, 0.0, 5.922386381609899, 1.4932106804795426, 0.8961970300898964, 0.9524666352497243, 24.61423117387386, 25.068232366720203, 7.977549570168414, 3.5766200219489708, 0.0, 0.0, 0.0, 0.0, 0.0, 0.

.............................
2023-02-02 06:35:26,864 logging.conf not found when configuring logging, using default logging configuration.
2023-02-02 06:35:26,865 Starting SageMaker Clarify Processing job
2023-02-02 06:35:26,865 Analysis config path: /opt/ml/processing/input/config/analysis_config.json
2023-02-02 06:35:26,865 Analysis result path: /opt/ml/processing/output
2023-02-02 06:35:26,865 This host is algo-1.
2023-02-02 06:35:26,865 This host is the leader.
2023-02-02 06:35:26,865 Number of hosts in the cluster is 1.
2023-02-02 06:35:27,093 Running Python / Pandas based analyzer.
2023-02-02 06:35:27,093 Dataset type: text/csv uri: /opt/ml/processing/input/data
2023-02-02 06:35:27,103 Loading dataset...
  df = df.append(df_tmp, ignore_index=True)
2023-02-02 06:35:27,408 Loaded dataset. Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83158 entries, 

## Model Explainability Analysis
Once the Clarify job is complete, we can view the Model Explainability analysis report in SageMaker Experiment.

![sagemaker clarify model explainability](img/sm-clarify-explainability.png)
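
The global feature importance in the report is an aggregate of the per-instance (local) SHAP values. With `agg_method="mean_abs"`, as configured above, each feature's global score is the mean of the absolute local values. A minimal sketch of that aggregation follows; the local SHAP values and the helper name `aggregate_mean_abs` are made up for illustration and are not output from the Clarify job.

```python
def aggregate_mean_abs(local_shap_rows):
    """Global importance per feature: mean of |local SHAP value| over all instances."""
    n_rows = len(local_shap_rows)
    features = list(local_shap_rows[0].keys())
    return {f: sum(abs(row[f]) for row in local_shap_rows) / n_rows for f in features}


# Hypothetical local SHAP values for three instances.
local_shap = [
    {"tenure": -0.30, "revenue": 0.10, "is_auto_renew": -0.05},
    {"tenure": 0.20, "revenue": -0.10, "is_auto_renew": 0.05},
    {"tenure": -0.10, "revenue": 0.10, "is_auto_renew": 0.05},
]

global_importance = aggregate_mean_abs(local_shap)
# Rank features from most to least important.
ranked = sorted(global_importance, key=global_importance.get, reverse=True)
print(ranked)
```

Taking absolute values before averaging keeps positive and negative contributions from cancelling out, so a feature that pushes predictions strongly in both directions still ranks as important.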

# Model Registry

From an MLOps perspective, a model registry is a central repository that lets model developers publish production-ready models for ease of access. With the registry, developers can also work with other teams and stakeholders to collaboratively manage the lifecycle of all models in the organization.

A data scientist can push trained models to the model registry. Once in the registry, your models are ready to be tested, validated, and deployed to production.

SageMaker provides a Model Registry for your models. Here are some key features of SageMaker Model Registry:

* Catalog models for production

* Manage model versions

* Associate metadata, such as training metrics, with a model

* Manage the approval status of a model

* Deploy models to production

* Automate model deployment with CI/CD

## Model Registry Structure

The SageMaker Model Registry is structured as several model groups with model packages in each group. Each model package in a model group corresponds to a trained model. The version of each model package is a numerical value that starts at 1 and is incremented with each new model package added to a model group. For example, if 5 model packages are added to a model group, the model package versions will be 1, 2, 3, 4, and 5. The example Model Registry shown in the following image contains 3 model groups, where each group contains the model packages related to a particular ML problem.

The following diagram shows how a model registry is organized in SageMaker:

![sagemaker model registry](img/sm-model-registry.png)
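
The group and version structure can be pictured with a small in-memory sketch. `ToyModelPackage`, `ToyModelGroup`, and the S3 URIs below are hypothetical stand-ins and are not part of the SageMaker SDK; the point is only that each group numbers its own packages from 1 upward.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModelPackage:
    version: int
    model_data: str
    approval_status: str = "PendingManualApproval"


@dataclass
class ToyModelGroup:
    name: str
    packages: list = field(default_factory=list)

    def register(self, model_data):
        """Add a package; its version is one more than the current package count."""
        package = ToyModelPackage(version=len(self.packages) + 1, model_data=model_data)
        self.packages.append(package)
        return package


# One group per ML problem; versions are counted independently in each group.
churn_group = ToyModelGroup("demo-kkbox-customer-churn-model-group")
for uri in (
    "s3://example-bucket/churn/model-a.tar.gz",  # placeholder URIs
    "s3://example-bucket/churn/model-b.tar.gz",
    "s3://example-bucket/churn/model-c.tar.gz",
):
    churn_group.register(uri)

versions = [p.version for p in churn_group.packages]
print(versions)
```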

In the following section, we'll register our best model trained using SageMaker HPO into a new model group. 
We'll also provide the model metrics generated from the model evaluation processor.

In [194]:
model_approval_status = "PendingManualApproval"

model_metrics = ModelMetrics(
    model_statistics=MetricsSource(
        s3_uri=f"s3://{bucket}/{s3_model_evaluation_prefix}/evaluation.json",
        content_type="application/json",
    )
)

In [195]:
model_package_group_name = "demo-kkbox-customer-churn-model-group"
model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name=model_package_group_name,
    approval_status=model_approval_status,
    model_metrics=model_metrics
)

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: ml.m5.xlarge.


<sagemaker.model.ModelPackage at 0x7efcc2ac5450>

Here's a screenshot of a model version registered with Model Registry:

![sm model registry version](img/sm-model-registry-version.png)

# Model Deployment

SageMaker offers several options for serving model inference. Here are the common inference modes it supports:

![sagemaker inference](img/sm-deployment-modes.png)

In this example, we'll deploy a real-time endpoint and run some inference requests against it to validate the results.

In [218]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

In [219]:
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    serializer = CSVSerializer(),
    deserializer = CSVDeserializer())

INFO:sagemaker:Creating model with name: kkbox-customer-churn-model-20230202053205
INFO:sagemaker:Creating endpoint-config with name kkbox-customer-churn-model-202302020532-2023-02-02-16-26-14-949
INFO:sagemaker:Creating endpoint with name kkbox-customer-churn-model-202302020532-2023-02-02-16-26-14-949


-----!

Test the endpoint by sending traffic to it.

In [212]:
test_data = test.iloc[:10, 2:].to_csv(header=False, index=False)

In [213]:
predictions = predictor.predict(test_data)

In [214]:
predictions

[['0'], ['1'], ['0'], ['0'], ['0'], ['0'], ['0'], ['0'], ['0'], ['0']]
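
The deserialized response is a list of single-element rows containing strings. A small sketch, assuming the response keeps the shape shown above, converts it to integer labels and summarizes the predicted churn rate for the batch. The variable `raw_predictions` is a copy of the output above so the snippet runs on its own.

```python
# Copy of the endpoint response shown above (one row per input record).
raw_predictions = [['0'], ['1'], ['0'], ['0'], ['0'], ['0'], ['0'], ['0'], ['0'], ['0']]

# Each row holds one string; go through float first in case a value arrives as "1.0".
predicted_labels = [int(float(row[0])) for row in raw_predictions]

# Share of records in this batch predicted to churn.
predicted_churn_rate = sum(predicted_labels) / len(predicted_labels)

print(predicted_labels)
print(f"Predicted churn rate: {predicted_churn_rate:.0%}")
```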