# Improving Data Science Productivity Using SageMaker Studio

Using a machine learning model to predict customer churn for a music streaming service

The notebook is organized into the following sections:

* Background
* Dataset Exploration
* Model Training Using SageMaker Studio
* Use SageMaker Experiments to organize data science experiments
* Use SageMaker Debugger to monitor the utilization of system resources such as GPUs, CPUs, network, and memory, and to profile training jobs to collect detailed ML framework metrics
* Deploy the trained model to SageMaker Inference to serve churn predictions
* Monitor model quality using SageMaker Model Monitor

# Background

This particular challenge was originally introduced as a Kaggle competition in 2018. The goal was to build an algorithm that predicts 
whether a subscription user will churn using a donated dataset from KKBOX. 

For a subscription business, accurately predicting churn is critical to long-term success. 

Even slight variations in churn can drastically affect profits.

KKBOX is Asia’s leading music streaming service, holding the world’s most comprehensive Asia-Pop music library with over 30 million tracks. They offer a generous, unlimited version of their service to millions of people, supported by advertising and paid subscriptions. This delicate model is dependent on accurately predicting churn of their paid users.

Currently, the company uses survival analysis techniques to determine the residual membership lifetime for each subscriber. In this notebook, we'll explore a machine learning model called XGBoost to predict whether a user will churn after their subscription expires.

# Data

We combine multiple datasets, including the subscription, membership, and user activity logs, to extract the signals for training a machine learning model. We use an EMR cluster to perform the feature engineering work directly from within SageMaker Studio. For details about using EMR and PySpark, please refer to the notebook [here](processing_pyspark.ipynb).

In the following section, we'll explore the curated dataset in greater detail. 
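
A useful first check when exploring a churn dataset is how imbalanced the label is. The snippet below is a minimal sketch of that check using only the standard library. It assumes a header-less CSV with the label in the first column; that layout and the sample rows are illustrative assumptions, not taken from the curated dataset.

```python
import csv
import io
from collections import Counter


def label_balance(csv_file, label_index=0):
    """Return the share of rows per label value in a header-less CSV file."""
    counts = Counter()
    for row in csv.reader(csv_file):
        if row:  # skip blank lines
            counts[row[label_index]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Map each label value to its fraction of all rows.
    return {label: n / total for label, n in counts.items()}


# Made-up rows standing in for the curated training file (assumption).
sample = io.StringIO("1,0.3,12\n0,0.9,40\n0,0.7,35\n0,0.8,31\n")
balance = label_balance(sample)
print(balance)
```

If the churn class turns out to be rare, accuracy alone is likely to be a misleading metric, so it may be worth tracking a metric such as AUC as well.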

In [3]:
!pip install sagemaker -U

Keyring is skipped due to an exception: 'keyring.backends'
Collecting sagemaker
  Using cached sagemaker-2.129.0-py2.py3-none-any.whl
Collecting importlib-metadata<5.0,>=1.4.0
  Using cached importlib_metadata-4.13.0-py3-none-any.whl (23 kB)
Collecting boto3<2.0,>=1.26.28
  Using cached boto3-1.26.54-py3-none-any.whl (132 kB)
Collecting botocore<1.30.0,>=1.29.54
  Using cached botocore-1.29.54-py3-none-any.whl (10.3 MB)
Installing collected packages: importlib-metadata, botocore, boto3, sagemaker
  Attempting uninstall: importlib-metadata
    Found existing installation: importlib-metadata 5.1.0
    Uninstalling importlib-metadata-5.1.0:
      Successfully uninstalled importlib-metadata-5.1.0
  Attempting uninstall: botocore
    Found existing installation: botocore 1.29.24
    Uninstalling botocore-1.29.24:
      Successfully uninstalled botocore-1.29.24
  Attempting uninstall: boto3
    Found existing installation: boto3 1.26.24
    Uninstalling boto3-1.26.24:
      Successfully unin

## Setup

Let's import the Python libraries we'll need for this exercise.

In [6]:
import sagemaker
import boto3
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput
from sagemaker.experiments.run import Run
from sagemaker.xgboost.estimator import XGBoost

In [7]:
role = sagemaker.get_execution_role()
sagemaker_session = Session()
bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name

prefix = "data/kkbox-customer-churn-model"
experiment_name = "kkbox-customer-churn-model-experiment"
content_type = "csv"

In [6]:
hyperparameters = {
    "max_depth": 5,
    "eta": 0.2,
    "gamma": 4,
    "min_child_weight": 6,
    "subsample": 0.7,
    "n_estimators": 50,
    "region": region,
}

with Run(experiment_name=experiment_name, sagemaker_session=sagemaker_session) as run: 
    # S3 location where the training job writes its model artifacts
    output_path = 's3://{}/{}/output'.format(bucket, prefix)

    # this line automatically looks for the XGBoost image URI and builds an XGBoost container.
    # specify the repo_version depending on your preference.
    # xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")
    # construct a SageMaker estimator that calls the xgboost-container
    # estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
    #                                           hyperparameters=hyperparameters,
    #                                           role=sagemaker.get_execution_role(),
    #                                           instance_count=1, 
    #                                           instance_type='ml.m5.2xlarge', 
    #                                           volume_size=10, # 10 GB
    #                                           output_path=output_path)
    
    # script-mode estimator: runs scripts/train.py inside the XGBoost framework container
    estimator = XGBoost(entry_point="scripts/train.py",
                        framework_version='1.5-1',
                        hyperparameters=hyperparameters,
                        role=role,
                        instance_count=1,
                        instance_type='ml.m5.2xlarge',
                        volume_size=10,  # size of the training volume in GB
                        output_path=output_path)

    train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'))
    validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'))

    # execute the XGBoost training job
    estimator.fit({'train': train_input, 'validation': validation_input})
    

INFO:sagemaker.image_uris:Ignoring unnecessary Python version: py3.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: ml.m5.2xlarge.
INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2023-01-23-02-14-47-668


2023-01-23 02:14:47 Starting - Starting the training job...
2023-01-23 02:15:03 Starting - Preparing the instances for training......
2023-01-23 02:15:51 Downloading - Downloading input data.....
[2023-01-23 02:16:58.538 ip-10-0-215-214.us-east-2.compute.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-01-23:02:16:58:INFO] Imported framework sagemaker_xgboost_container.training
[2023-01-23:02:16:58:INFO] No GPUs detected (normal if no gpus installed)
[2023-01-23:02:16:58:INFO] Invoking user training script.
[2023-01-23:02:16:58:INFO] Module train does not provide a setup.py.
Generating setup.py
[2023-01-23:02:16:58:INFO] Generating setup.cfg
[2023-01-23:02:16:58:INFO] Generating MANIFEST.in
[2023-01-23:02:16:58:INFO] Installing module with the following command:
/miniconda3/bin/python3 -m pip install .
Processing /opt/ml/code
  Preparing metadata (setup.py): started
  P
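
The contents of `scripts/train.py` are not shown in this notebook. In script mode, SageMaker passes each entry of the `hyperparameters` dictionary to the entry point as a command-line argument and exposes the data channels and model directory through environment variables such as `SM_CHANNEL_TRAIN`, `SM_CHANNEL_VALIDATION`, and `SM_MODEL_DIR`. The following is a minimal sketch of how an entry point might read those values. The function name `parse_args` and the default values are assumptions for illustration, not the actual script.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters the way a script-mode entry point might (sketch)."""
    parser = argparse.ArgumentParser()
    # Hyperparameters: names mirror the keys of the `hyperparameters` dict.
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--gamma", type=float, default=4)
    parser.add_argument("--min_child_weight", type=float, default=6)
    parser.add_argument("--subsample", type=float, default=0.7)
    parser.add_argument("--n_estimators", type=int, default=50)
    parser.add_argument("--region", type=str, default=None)
    # Data and model locations: read from the environment inside a training
    # container; they default to None when run elsewhere.
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION"))
    # Tolerate extra arguments instead of failing on them.
    args, _ = parser.parse_known_args(argv)
    return args


# Local check with an explicit argument list instead of sys.argv.
args = parse_args(["--max_depth", "5", "--eta", "0.2", "--region", "us-east-2"])
print(args.max_depth, args.eta, args.region)
```

Taking the argument list as a parameter keeps the parsing logic testable outside a training job.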

In [7]:
estimator.model_data

's3://sagemaker-us-east-2-869530972998/data/kkbox-customer-churn-model/output/sagemaker-xgboost-2023-01-23-02-14-47-668/output/model.tar.gz'
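
`estimator.model_data` is an S3 URI pointing to the `model.tar.gz` artifact. To inspect the artifact locally, the URI first has to be split into a bucket and an object key. Below is a minimal sketch using only the standard library; the helper name `split_s3_uri` and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI with the same shape as `estimator.model_data`.
bucket_name, key = split_s3_uri("s3://example-bucket/data/output/model.tar.gz")
print(bucket_name, key)
```

The resulting pair can then be passed to the boto3 S3 client, for example `boto3.client("s3").download_file(bucket_name, key, "model.tar.gz")`.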