# Customer churn prediction with Network and Customer data using Amazon SageMaker XGBoost
_**Supervised Learning with Gradient Boosted Trees: A Binary Prediction Problem With Unbalanced Classes**_

---


## Background
Customer churn is a business challenge that all Communication Service Providers encounter on a daily basis. The goal of this machine learning notebook is to use historical data from the network, subscribed products, usage, and pricing to predict the conditions under which a particular customer is likely to churn. The steps include:

* Preparing your Amazon SageMaker notebook
* Downloading data from S3 into Amazon SageMaker
* Investigating and transforming the data so that it can be fed to Amazon SageMaker algorithms
* Estimating a model using the Gradient Boosting algorithm
* Evaluating the effectiveness of the model
* Setting the model up to make on-going predictions

---

## Setup

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

### Step 1 - Uncomment and run cell 1 only once

In [2]:
# cell 01
#!pip install awswrangler


### Step 2 - Import all required Libraries

In [20]:
# cell 02 - Import libraries
import awswrangler as wr
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import sagemaker                                  # Amazon SageMaker's Python SDK provides many helper functions
import zipfile                                    # For working with zip archives
import getpass
import time

### Step 3 - Change bucket name to the bucket in your S3. Set path as below with your bucket name

In [3]:
# cell 03
import sagemaker
bucket = 'tlc301assets'  # It should be of the form tlc301-account_number
path = f"s3://{bucket}/"

# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

### Step 4 - Read parquet data that was an output from the previous section of the workshop

In [4]:
# cell 04
df = wr.s3.read_parquet(path+'joined', dataset=True)
df.head()

Unnamed: 0,location,area_code,phone,intl_plan,vmail_plan,vmail_message,day_mins,day_calls,day_charge,eve_mins,...,partition,location5gcell,avg_5g_ran_health_index_mwc_1,avg_5g_availability_mwc_1,avg_5g_accessibility_mwc_1,avg_5g_mobility_mwc_1,avg_5g_retainability_mwc_1,avg_5g_cell_downlink_avg_throughput_den_huaw,avg_5g_cell_uplink_avg_throughput_den_huaw,avg_5g_user_downlink_avg_throughput_den_huaw
0,BASAKLAPUCEBN,778,970-7188,no,no,0,11.809295,5,7.88944,7.582861,...,partition-limit-500000-offset-0,BASAKLAPUCEBN,0.979645,0.999954,0.91785,0.995625,0.99983,6.950833,26.417546,6721416.0
1,BASAKLAPUCEBN,777,596-5803,yes,no,0,10.994388,3,10.213156,6.963251,...,partition-limit-500000-offset-0,BASAKLAPUCEBN,0.979645,0.999954,0.91785,0.995625,0.99983,6.950833,26.417546,6721416.0
2,BASAKLAPUCEBN,678,975-1551,yes,yes,100,7.801079,2,6.190318,6.651484,...,partition-limit-500000-offset-0,BASAKLAPUCEBN,0.979645,0.999954,0.91785,0.995625,0.99983,6.950833,26.417546,6721416.0
3,BASAKLAPUCEBN,798,468-9308,no,no,0,10.455195,3,4.531812,3.691651,...,partition-limit-500000-offset-0,BASAKLAPUCEBN,0.979645,0.999954,0.91785,0.995625,0.99983,6.950833,26.417546,6721416.0
4,BASAKLAPUCEBN,676,344-9464,no,no,0,5.63576,1,7.729696,6.227893,...,partition-limit-500000-offset-0,BASAKLAPUCEBN,0.979645,0.999954,0.91785,0.995625,0.99983,6.950833,26.417546,6721416.0


#### Data Exploration - Use Amazon QuickSight in the next section
Please see the next section, where we do data exploration in Amazon QuickSight.

### Step 5 -  Data Transformation
Let's transform the data to make it ready for machine learning.

In [5]:
# Make the target churn column binary:
# change 'True.' to '1' and 'False.' to '0' (the dot is escaped so it matches a literal period)
df['churn'] = df['churn'].replace(regex=r'True\.', value='1')
df['churn'] = df['churn'].replace(regex=r'False\.', value='0')

# Remove the dash from phone numbers
df['phone'] = df['phone'].replace(regex=r'-', value='')

# Drop the second location column, which is redundant
df = df.drop('location5gcell', axis=1)

# One-hot encode the categorical columns
df = pd.get_dummies(df, columns=['location', 'intl_plan', 'vmail_plan'])

# Drop the partition column, which is not needed for modeling
df = df.drop('partition', axis=1)
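
To make the effect of these transformations easier to see, here is a minimal sketch on a three-row toy frame. The values are made up for illustration; only the column names and the `replace` / `get_dummies` calls mirror the cell above.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the notebook's.
toy = pd.DataFrame({
    "churn": ["True.", "False.", "False."],
    "phone": ["970-7188", "596-5803", "975-1551"],
    "intl_plan": ["no", "yes", "yes"],
})

# Map the string labels to '1'/'0' (the dot is escaped so it matches literally).
toy["churn"] = toy["churn"].replace(regex=r"True\.", value="1")
toy["churn"] = toy["churn"].replace(regex=r"False\.", value="0")

# Strip the dash from phone numbers.
toy["phone"] = toy["phone"].replace(regex=r"-", value="")

# One-hot encode the categorical column: one indicator column per category.
toy = pd.get_dummies(toy, columns=["intl_plan"])

print(toy.columns.tolist())
```

Each category of `intl_plan` becomes its own indicator column, which is why the real dataset grows to well over a hundred columns once `location` is encoded.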


### Step 6 -  Split into Test & Train

In [6]:
# Copy the original dataframe into a new dataframe
model_data = df.copy()

# Inspect the head of the transformed dataframe
model_data.head()

Unnamed: 0,area_code,phone,vmail_message,day_mins,day_calls,day_charge,eve_mins,eve_calls,eve_charge,night_mins,...,location_BULUA2N,location_BULUACDON,location_BUNGADQCR,location_BURGOSPANDACANMLANCRR-401,location_BUTINGN,location_BYNSUNRISEPQUEN,intl_plan_no,intl_plan_yes,vmail_plan_no,vmail_plan_yes
0,778,9707188,0,11.809295,5,7.88944,7.582861,5,6.618473,4.973284,...,0,0,0,0,0,0,1,0,1,0
1,777,5965803,0,10.994388,3,10.213156,6.963251,7,5.333787,3.733399,...,0,0,0,0,0,0,0,1,1,0
2,678,9751551,100,7.801079,2,6.190318,6.651484,1,6.879444,3.820463,...,0,0,0,0,0,0,0,1,0,1
3,798,4689308,0,10.455195,3,4.531812,3.691651,0,8.389032,4.682667,...,0,0,0,0,0,0,1,0,1,0
4,676,3449464,0,5.63576,1,7.729696,6.227893,3,4.184465,1.778058,...,0,0,0,0,0,0,1,0,1,0


When building a model whose primary goal is to predict a target value on new data, it is important to understand overfitting.  Supervised learning models are designed to minimize error between their predictions of the target value and actuals, in the data they are given.  This last part is key, as frequently in their quest for greater accuracy, machine learning models bias themselves toward picking up on minor idiosyncrasies within the data they are shown.  These idiosyncrasies then don't repeat themselves in subsequent data, so predictions on new data can actually become less accurate even as predictions on the training data become more accurate.

The most common way of preventing this is to judge a model not only on its fit to the data it was trained on, but also on "new" data.  There are several different ways of operationalizing this: holdout validation, cross-validation, leave-one-out validation, etc.  For our purposes, we'll simply randomly split the data into 3 uneven groups.  The model will be trained on 70% of the data, it will then be evaluated on 20% of the data to give us an estimate of the accuracy we hope to have on "new" data, and 10% will be held back as a final testing dataset which will be used later on.

In [7]:
# cell 08
# Randomly shuffle the data, then split out the first 70%, the next 20%, and the last 10%
train_data, validation_data, test_data = np.split(
    model_data.sample(frac=1, random_state=1729),
    [int(0.7 * len(model_data)), int(0.9 * len(model_data))],
)
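
The same 70/20/10 split can be checked on a tiny frame. This is a sketch under the assumption that positional slicing with `iloc` is an acceptable substitute for `np.split`; the ten-row frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for model_data: 10 rows, one feature, one target.
toy = pd.DataFrame({"x": np.arange(10), "churn": [0, 1] * 5})

# Shuffle reproducibly, then cut at the 70% and 90% marks.
shuffled = toy.sample(frac=1, random_state=1729)
n = len(shuffled)
train_end, valid_end = int(0.7 * n), int(0.9 * n)

train = shuffled.iloc[:train_end]
valid = shuffled.iloc[train_end:valid_end]
test = shuffled.iloc[valid_end:]

print(len(train), len(valid), len(test))
```

Every row lands in exactly one of the three groups, so nothing the model is evaluated on was seen during training.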

### Step 7 -  Create training matrix for XGBoost algorithm
Amazon SageMaker's XGBoost container expects data in the libSVM or CSV data format.  For this example, we'll stick to CSV.  Note that the first column must be the target variable and the CSV should not include headers.  Also, notice that although repetitive it's easiest to do this after the train|validation|test split rather than before.  This avoids any misalignment issues due to random reordering.

In [28]:
# Export data to CSV files on your local EBS volume (target column first, no header or index).
# The upload to S3 below reads these files.
pd.concat([train_data['churn'], train_data.drop(['churn'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([validation_data['churn'], validation_data.drop(['churn'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)
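
As a small illustration of the CSV layout described above (target first, no header), the sketch below writes a two-row frame to an in-memory buffer. The helper name `to_xgboost_csv` and the sample values are hypothetical and not part of the workshop code.

```python
import io
import pandas as pd

def to_xgboost_csv(frame: pd.DataFrame, target: str) -> str:
    """Return CSV text with the target column first and no header or index."""
    ordered = pd.concat([frame[target], frame.drop(columns=[target])], axis=1)
    buffer = io.StringIO()
    ordered.to_csv(buffer, index=False, header=False)
    return buffer.getvalue()

# Hypothetical two-row frame in which the target is not the first column.
toy = pd.DataFrame({"day_mins": [11.8, 10.9], "churn": ["0", "1"], "day_calls": [5, 3]})
csv_text = to_xgboost_csv(toy, "churn")
print(csv_text)
```

Each output line starts with the label, followed by the features in their original order.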

In [29]:
# inspect test data
test_data

Unnamed: 0,area_code,phone,vmail_message,day_mins,day_calls,day_charge,eve_mins,eve_calls,eve_charge,night_mins,...,location_BULUA2N,location_BULUACDON,location_BUNGADQCR,location_BURGOSPANDACANMLANCRR-401,location_BUTINGN,location_BYNSUNRISEPQUEN,intl_plan_no,intl_plan_yes,vmail_plan_no,vmail_plan_yes
203,878,9218943,0,4.892757,6,1.925049,4.821187,0,5.448975,5.771010,...,0,0,0,0,0,0,0,1,1,0
3567,659,7113017,0,7.234755,2,6.601624,2.906104,3,6.187517,5.412194,...,0,0,0,0,0,0,1,0,1,0
3199,716,3505720,300,3.054276,3,5.097610,6.923417,6,2.673867,4.722747,...,0,0,0,0,0,0,1,0,0,1
4996,716,3227415,0,9.215266,6,4.957564,5.335900,4,5.970772,6.669805,...,0,0,0,0,0,0,0,1,1,0
1024,736,1111222,0,2.014212,5,3.245725,6.622683,0,4.608945,1.447456,...,0,0,1,0,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3454,847,6865740,600,2.725406,2,4.090509,2.809921,4,3.827986,5.439851,...,0,0,0,0,0,0,0,1,0,1
3605,878,3461910,0,5.821524,4,0.041467,1.599231,2,1.254876,0.297150,...,0,0,0,0,0,0,0,1,1,0
1267,787,4351229,400,11.761176,4,8.557125,6.741170,4,8.463224,5.373586,...,0,0,0,0,0,0,1,0,0,1
3174,676,3797746,0,1.518489,0,3.951135,2.099465,4,2.660238,6.303480,...,0,0,0,0,0,0,1,0,1,0


Now we'll copy the files to S3 for Amazon SageMaker's managed training to pick up.

In [12]:
# cell 10
prefix = 'prepared'
s3 = boto3.Session().resource('s3')
s3.Bucket(bucket).Object(f'{prefix}/train/train.csv').upload_file('train.csv')
s3.Bucket(bucket).Object(f'{prefix}/validation/validation.csv').upload_file('validation.csv')

---

## Training
Now we know that most of our features have skewed distributions, some are highly correlated with one another, and some appear to have non-linear relationships with our target variable.  Also, for identifying customers who are likely to churn, good predictive accuracy is preferred to being able to explain why a particular customer was flagged.  Taken together, these aspects make gradient boosted trees a good candidate algorithm.

There are several intricacies to understanding the algorithm, but at a high level, gradient boosted trees work by combining predictions from many simple models, each of which tries to address the weaknesses of the previous models.  By doing this, the collection of simple models can actually outperform large, complex models.  Other Amazon SageMaker notebooks elaborate further on gradient boosted trees and how they differ from similar algorithms.
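
To make the idea of "each model addresses the weaknesses of the previous ones" concrete, here is a deliberately simplified sketch. It is not how `xgboost` is implemented: it assumes a squared-error regression objective, one-split decision stumps, a single feature, and synthetic data. The function names are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that best fits the residual (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, rounds=50, eta=0.2):
    """Fit stumps sequentially, each one to the residual left by the ensemble so far."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    stumps = []
    for _ in range(rounds):
        stump = fit_stump(x, y - prediction)
        # eta shrinks each stump's contribution, like the learning rate in xgboost
        prediction = prediction + eta * predict_stump(stump, x)
        stumps.append(stump)
    return stumps, prediction

# Synthetic one-feature data with a non-linear relationship.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

baseline_mse = np.mean((y - y.mean()) ** 2)
stumps, fitted = boost(x, y)
boosted_mse = np.mean((y - fitted) ** 2)
print(baseline_mse, boosted_mse)
```

No single stump can follow the curve, but the sum of many small corrections does, which is the intuition behind boosting.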

`xgboost` is an extremely popular, open-source package for gradient boosted trees.  It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.  Let's start with a simple `xgboost` model, trained using Amazon SageMaker's managed, distributed training framework.

First we'll need to specify the ECR container location for Amazon SageMaker's implementation of XGBoost.

In [13]:
# cell 11
container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='latest')

Then, because we're training with the CSV file format, we'll create `TrainingInput` objects that our training function can use as pointers to the files in S3, and which also specify that the content type is CSV.

In [14]:
# cell 12
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

Next, we'll need to specify training parameters to the estimator.  This includes:
1. The `xgboost` algorithm container
1. The IAM role to use
1. Training instance type and count
1. S3 location for output data
1. Algorithm hyperparameters

And then a `.fit()` function which specifies:
1. S3 location for input data.  In this case we have both a training and validation set which are passed in.

In [15]:
# cell 13
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

2022-11-18 13:46:11 Starting - Starting the training job...
2022-11-18 13:46:36 Starting - Preparing the instances for training
ProfilerReport-1668779170: InProgress
.........
2022-11-18 13:47:54 Downloading - Downloading input data...
2022-11-18 13:48:34 Training - Downloading the training image.....
Arguments: train
[2022-11-18:13:49:27:INFO] Running standalone xgboost training.
[2022-11-18:13:49:27:INFO] File size need to be processed in the node: 2.18mb. Available memory size in the node: 8823.05mb
[2022-11-18:13:49:27:INFO] Determined delimiter of CSV input is ','
[13:49:27] S3DistributionType set as FullyReplicated
[13:49:27] 3500x148 matrix with 518000 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2022-11-18:13:49:27:INFO] Determined delimiter of CSV input is ','
[13:49:27] S3DistributionType set as FullyReplicated
[13:49:27] 1000x148 matrix with 148000 entries loaded

---

## Hosting
Now that we've trained the `xgboost` algorithm on our data, let's deploy a model that's hosted behind a real-time endpoint.

In [16]:
# cell 14
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')

------!

---

## Evaluation
There are many ways to compare the performance of a machine learning model, but let's start by simply comparing actual to predicted values.  In this case, we're simply predicting whether the customer churned (`1`) or not (`0`), which produces a simple confusion matrix.

First we'll need to determine how we pass data into and receive data from our endpoint.  Our data is currently stored as NumPy arrays in memory of our notebook instance.  To send it in an HTTP POST request, we'll serialize it as a CSV string and then decode the resulting CSV.

*Note: For inference with CSV format, SageMaker XGBoost requires that the data does NOT include the target variable.*

In [17]:
# cell 15
xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batches to CSV string payloads (notice, we drop the target variable from our dataset first)
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array

In [18]:
# cell 16
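def predict(data, predictor, rows=500):
    # Split the data into mini-batches of roughly `rows` rows each
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # Each response is a CSV string of scores; append it to the running string
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])

    # Drop the leading comma and parse the scores into a NumPy array
    return np.fromstring(predictions[1:], sep=',')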

predictions = predict(test_data.drop(['churn'], axis=1).to_numpy(), xgb_predictor)
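
The batching logic can be exercised without an endpoint. The sketch below assumes a stand-in `FakePredictor` class (hypothetical, not a SageMaker class) that returns one score per row as CSV bytes, and a variant `predict_in_batches` that uses a ceiling division for the batch count rather than the `+ 1` used in the cell above.

```python
import numpy as np

class FakePredictor:
    """Hypothetical stand-in for a SageMaker predictor: returns one score per row as CSV bytes."""
    def __init__(self):
        self.calls = 0

    def predict(self, array):
        self.calls += 1
        scores = array.mean(axis=1)  # placeholder "model": the row mean
        return ",".join(str(s) for s in scores).encode("utf-8")

def predict_in_batches(data, predictor, rows=500):
    # Number of batches chosen so that no batch exceeds `rows` rows.
    n_batches = max(1, int(np.ceil(data.shape[0] / rows)))
    parts = []
    for batch in np.array_split(data, n_batches):
        parts.append(predictor.predict(batch).decode("utf-8"))
    # Join the per-batch CSV strings and parse them into one float array.
    return np.array(",".join(parts).split(","), dtype=float)

data = np.arange(24, dtype=float).reshape(12, 2)
fake = FakePredictor()
scores = predict_in_batches(data, fake, rows=5)
print(fake.calls, scores.shape)
```

With 12 rows and a batch size of 5, the data is sent in three requests and the scores come back in the original row order.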

Now we'll check our confusion matrix to see how well we predicted versus actuals.

In [22]:
# cell 17
pd.crosstab(index=test_data['churn'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

| actuals / predictions | 0.0 | 1.0 |
|---|---|---|
| 0 | 256 | 6 |
| 1 | 5 | 233 |


So, of the 500 customers in the test set, we predicted 239 would churn and 233 of them actually did.  We also had 261 customers who we predicted would not churn, out of which 5 did.  This leaves room for improvement, and the model can (and should) be tuned further.  Most importantly, note that with minimal effort, our model produced accuracies similar to those published [here](http://media.salford-systems.com/video/tutorial/2015/targeted_marketing.pdf).

_Note that because there is some element of randomness in the algorithm's subsample, your results may differ slightly from the text written above._
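
The counts in the confusion matrix above can be turned into a few summary metrics. This is a minimal sketch that hard-codes the four counts from that output; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above (rows = actuals, columns = predictions).
tn, fp = 256, 6
fn, tp = 5, 233

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted churners who actually churned
recall = tp / (tp + fn)      # share of actual churners the model caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

For churn, recall is often the metric to watch, since a missed churner (false negative) is usually a lost customer.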

Hyperparameter optimization and balancing of the classes is typically recommended to improve model performance. This is not part of today's session. 

---

## Extensions

This example analyzed a relatively small dataset, but utilized Amazon SageMaker features such as distributed, managed training and real-time model hosting, which could easily be applied to much larger problems.  In order to improve predictive accuracy further, we could tweak the value at which we threshold our predictions to alter the mix of false positives and false negatives, or we could explore techniques like hyperparameter tuning.  In a real-world scenario, we would also spend more time engineering features by hand and would likely look for additional datasets to include which contain customer information not available in our initial dataset.
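
The effect of moving the prediction threshold can be seen on a handful of scores. The probabilities and labels below are hypothetical and chosen only to show the trade-off; they do not come from the model trained above.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels, for illustration only.
scores = np.array([0.05, 0.20, 0.35, 0.45, 0.55, 0.65, 0.80, 0.95])
actual = np.array([0, 0, 1, 0, 1, 0, 1, 1])

def false_counts(threshold):
    """Return (false positives, false negatives) when predicting 1 for scores >= threshold."""
    predicted = (scores >= threshold).astype(int)
    false_pos = int(((predicted == 1) & (actual == 0)).sum())
    false_neg = int(((predicted == 0) & (actual == 1)).sum())
    return false_pos, false_neg

for threshold in (0.3, 0.5, 0.7):
    print(threshold, false_counts(threshold))
```

Lowering the threshold catches more churners at the cost of more false alarms, and raising it does the opposite, so the right value depends on the relative cost of each kind of error.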

### (Optional) Clean-up

If you are done with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [22]:
# cell 28
xgb_predictor.delete_endpoint(delete_endpoint_config=True)