# Churn Prediction for Mobile Phone Customers with XGBoost Demo
_**Using Gradient Boosted Trees to Predict Mobile Phone Customer Departure**_

---
In this demo notebook, we send some data to an already deployed endpoint, retrieve the predictions, and examine the results.

**For more details on end-to-end model training with hyperparameter optimization and deployment using SageMaker, see the [xgboost_customer_churn.ipynb](./xgboost_customer_churn.ipynb) notebook.**

---

## Runtime

This notebook takes less than a minute to run.

## Contents

1. Background
1. Setup
1. Data
1. Using the endpoint
    1. Evaluate
1. Extensions

---

## Background

_This notebook has been adapted from an [AWS blog post](https://aws.amazon.com/blogs/ai/predicting-customer-churn-with-amazon-machine-learning/)_

Losing customers is costly for any business.  Identifying unhappy customers early on gives you a chance to offer them incentives to stay.  This notebook describes using machine learning (ML) for the automated identification of unhappy customers, also known as customer churn prediction. ML models rarely give perfect predictions though, so this notebook is also about how to incorporate the relative costs of prediction mistakes when determining the financial outcome of using ML.

We use a familiar example of churn: leaving a mobile phone operator.  It seems one can always find fault with one's provider du jour! If the provider knows that a customer is thinking of leaving, it can offer timely incentives, such as a phone upgrade or a newly activated feature, and the customer may stick around. Incentives are often much more cost-effective than losing and reacquiring a customer.

---

## Setup

Let's start by updating the required packages (the SageMaker Python SDK, `pandas`, and `numpy`) and specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance or Studio, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.  Note: if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with the appropriate full IAM role ARN string(s).

In [None]:
import sys

!{sys.executable} -m pip install sagemaker pandas numpy --upgrade --quiet

This solution relies on a config file that describes the provisioned AWS resources. Run the cells below to generate that file.

In [None]:
import boto3
import os
import json

In [None]:
# The provisioned product name is the directory that follows 'S3Downloads' in the working path.
client = boto3.client('servicecatalog')
cwd = os.getcwd().split('/')
i = cwd.index('S3Downloads')
pp_name = cwd[i + 1]
pp = client.describe_provisioned_product(Name=pp_name)
record_id = pp['ProvisionedProductDetail']['LastSuccessfulProvisioningRecordId']
record = client.describe_record(Id=record_id)

# Keep only the outputs that carry both a key and a value.
stack_output = {
    x['OutputKey']: x['OutputValue']
    for x in record['RecordOutputs']
    if 'OutputKey' in x and 'OutputValue' in x
}

with open(f'/root/S3Downloads/{pp_name}/stack_outputs.json', 'w') as f:
    json.dump(stack_output, f)
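
To see what this step produces without calling AWS, here is a minimal sketch that applies the same key/value filtering to a hand-written record. The contents of `sample_record` (names, values, and the entry with a missing value) are invented for illustration; only the `RecordOutputs`, `OutputKey`, and `OutputValue` field names come from the cell above.

```python
import json

# Hypothetical stand-in for the response of describe_record(); all values are invented.
sample_record = {
    "RecordOutputs": [
        {"OutputKey": "DemoEndpointName", "OutputValue": "example-endpoint"},
        {"OutputKey": "AWSRegion", "OutputValue": "us-west-2"},
        {"OutputKey": "Incomplete"},  # no OutputValue, so it is skipped
    ]
}


def outputs_to_dict(record):
    """Map each output key to its value, skipping entries missing either field."""
    return {
        x["OutputKey"]: x["OutputValue"]
        for x in record["RecordOutputs"]
        if "OutputKey" in x and "OutputValue" in x
    }


stack_output = outputs_to_dict(sample_record)
print(json.dumps(stack_output, indent=2))
```

The resulting dictionary has the same shape as the `stack_outputs.json` file that the later cells read their settings from.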

In [None]:
import sagemaker

sess = sagemaker.Session()

# Get config (the context manager closes the file after reading)
with open("stack_outputs.json") as f:
    sagemaker_config = json.load(f)
endpoint_name = sagemaker_config["DemoEndpointName"]
solution_bucket = sagemaker_config["SolutionS3Bucket"]
region = sagemaker_config["AWSRegion"]
library_version = sagemaker_config["LibraryVersion"]
solution_name = sagemaker_config["SolutionName"]
source_bucket = f"s3://{solution_bucket}-{region}/{library_version}/{solution_name}"
data_prefix = "artifacts/data"

Next, we'll import the Python libraries we'll need for the remainder of the example.

In [None]:
import pandas as pd
import numpy as np
from sagemaker.serializers import CSVSerializer

---
## Data

Mobile operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct an ML model of one mobile operator’s churn using a process called training. After training the model, we can pass the profile information of an arbitrary customer (the same profile information that we used to train the model) to the model, and have the model predict whether this customer is going to churn. Of course, we expect the model to make mistakes. After all, predicting the future is tricky business! But we'll learn how to deal with prediction errors.

The dataset we use is publicly available and was mentioned in the book [Discovering Knowledge in Data](https://www.amazon.com/dp/0470908742/) by Daniel T. Larose. It is attributed by the author to the University of California Irvine Repository of Machine Learning Datasets. Let's read that dataset in now:

In [None]:
DATASET_NAME = "churn.txt"
data_source = f"{source_bucket}/{data_prefix}/{DATASET_NAME}"
print("original data: ")
!aws s3 cp $data_source .

In [None]:
churn = pd.read_csv("churn.txt")
churn.head(5)

In [None]:
len(churn.columns)

By modern standards, it’s a relatively small dataset, with only 5,000 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. The attributes are:

- `State`: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ
- `Account Length`: the number of days that this account has been active
- `Area Code`: the three-digit area code of the corresponding customer’s phone number
- `Phone`: the remaining seven-digit phone number
- `Int’l Plan`: whether the customer has an international calling plan: yes/no
- `VMail Plan`: whether the customer has a voice mail feature: yes/no
- `VMail Message`: the average number of voice mail messages per month
- `Day Mins`: the total number of calling minutes used during the day
- `Day Calls`: the total number of calls placed during the day
- `Day Charge`: the billed cost of daytime calls
- `Eve Mins`, `Eve Calls`, `Eve Charge`: the minutes used, number of calls placed, and billed cost for calls placed during the evening
- `Night Mins`, `Night Calls`, `Night Charge`: the minutes used, number of calls placed, and billed cost for calls placed during nighttime
- `Intl Mins`, `Intl Calls`, `Intl Charge`: the minutes used, number of calls placed, and billed cost for international calls
- `CustServ Calls`: the number of calls placed to Customer Service
- `Churn?`: whether the customer left the service: true/false

The last attribute, `Churn?`, is known as the target attribute: the attribute that we want the ML model to predict.  Because the target attribute is binary, our model will be performing binary prediction, also known as binary classification.

We have cleaned up the dataset by removing one feature from each of the highly correlated pairs: `Day Charge` from the pair with `Day Mins`, `Night Charge` from the pair with `Night Mins`, and `Intl Charge` from the pair with `Intl Mins`. We do this because including such feature pairs can create catastrophic problems in some machine learning algorithms, while in others it only introduces minor redundancy and bias.
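
As a minimal sketch of how such a pair can be spotted and pruned with `pandas`, consider the toy example below. The data is synthetic, and the flat per-minute rate of 0.17 is an assumption chosen for illustration, not a value taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for a few columns of the churn data (all values are invented).
day_mins = rng.uniform(0, 350, size=200)
toy = pd.DataFrame(
    {
        "Day Mins": day_mins,
        "Day Charge": day_mins * 0.17,  # assumed flat per-minute rate
        "CustServ Calls": rng.integers(0, 10, size=200),
    }
)

# A charge that is a fixed multiple of the minutes carries no extra information.
corr = toy["Day Mins"].corr(toy["Day Charge"])
print(f"correlation between Day Mins and Day Charge: {corr:.3f}")

# Keep one member of the pair and drop the other.
toy = toy.drop(columns=["Day Charge"])
print(list(toy.columns))
```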

We randomly sampled 10% of the churn data for testing and used the rest of the data for training and evaluation.
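
This notebook does not show how the split was made. One common way to hold out a random 10% with `pandas` is sketched below on a toy frame; the frame contents and the seed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the churn data: 50 rows of invented values.
toy = pd.DataFrame({"feature": np.arange(50), "label": np.arange(50) % 2})

# Hold out a random 10% for testing; the seed is arbitrary.
test_part = toy.sample(frac=0.1, random_state=1729)

# Everything that was not sampled remains available for training and evaluation.
train_part = toy.drop(test_part.index)

print(len(train_part), len(test_part))
```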

In [None]:
DATASET_NAME = "test.csv"
data_source = f"{source_bucket}/{data_prefix}/{DATASET_NAME}"
print("test data: ")
!aws s3 cp $data_source .

In [None]:
test_data = pd.read_csv("test.csv", header=None)
test_data.head(5)

---
## Using the endpoint

Let's determine which algorithm to use. There appear to be some variables where both high and low (but not intermediate) values are predictive of churn.  To accommodate this in an algorithm like linear regression, we'd need to generate polynomial (or bucketed) terms.  Instead, let's model this problem using gradient boosted trees.  Amazon SageMaker provides an XGBoost container that we can use to train in a managed, distributed setting, and then host as a real-time prediction endpoint.  XGBoost uses gradient boosted trees, which naturally account for non-linear relationships between features and the target variable and accommodate complex interactions between features.

We have deployed the demo endpoint for you.

In [None]:
xgb_predictor = sagemaker.predictor.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=CSVSerializer()
)

### Evaluate

Now that we have a hosted endpoint running, we can make real-time predictions from our model simply by making an HTTP POST request.  The predictor above is configured with a `CSVSerializer`, so the NumPy arrays we build from `test_data` are sent to the model behind the endpoint as CSV.

Next, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows
1. Convert those mini-batches to CSV string payloads
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array

In [None]:
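def predict(data, rows=500):
    # Split the rows into roughly equal mini-batches of at most `rows` rows each.
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ""
    for array in split_array:
        # Each response is a CSV string of scores; append it to the running string.
        predictions = ",".join([predictions, xgb_predictor.predict(array).decode("utf-8")])
    # Drop the leading comma and parse the scores into a NumPy array.
    return np.fromstring(predictions[1:], sep=",")

# Column 0 holds the label, so only the remaining feature columns are sent.
predictions = predict(test_data.to_numpy()[:, 1:])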

In [None]:
print(predictions)
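
The batching logic can also be tried without a live endpoint. The sketch below swaps the predictor for a hypothetical stub, `fake_predict`, that returns one constant score per row as CSV bytes. The stub, its output format, and the helper name `predict_local` are assumptions made for illustration.

```python
import numpy as np


def fake_predict(batch):
    """Hypothetical stand-in for xgb_predictor.predict: one score per row, as CSV bytes."""
    return ",".join(["0.5"] * batch.shape[0]).encode("utf-8")


def predict_local(data, rows=500):
    # Same splitting rule as predict(): floor(n / rows) + 1 batches.
    batches = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    # Decode each non-empty batch response and join everything into one CSV string.
    parts = [fake_predict(batch).decode("utf-8") for batch in batches if batch.shape[0] > 0]
    return batches, np.array(",".join(parts).split(","), dtype=float)


data = np.zeros((1200, 3))
batches, scores = predict_local(data)
print([b.shape[0] for b in batches], scores.shape)
```

With 1,200 rows and `rows=500`, the rule yields three batches of 400 rows rather than two full batches and a short one, because `np.array_split` balances the batch sizes.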

There are many ways to evaluate the performance of a machine learning model, but let's start by simply comparing actual to predicted values.  In this case, we're predicting whether the customer churned (`1`) or not (`0`), which produces a confusion matrix.

In [None]:
pd.crosstab(
    index=test_data.iloc[:, 0],
    columns=np.round(predictions),
    rownames=["actual"],
    colnames=["predictions"],
)

_Note, due to randomized elements of the algorithm, your results may differ slightly._

Of the 247 churners, we've correctly predicted 236 (true positives). We also incorrectly predicted that 18 customers would churn who then ended up not doing so (false positives).  There are also 11 customers who ended up churning whom we predicted would not (false negatives).
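
The counts quoted above can be summarized as precision and recall. The short calculation below uses only those three numbers; your own run may produce slightly different counts.

```python
# Counts quoted in the text above.
true_positives = 236
false_positives = 18
false_negatives = 11

# Precision: of the customers flagged as churners, the share who actually churned.
precision = true_positives / (true_positives + false_positives)

# Recall: of the customers who actually churned, the share we flagged.
recall = true_positives / (true_positives + false_negatives)

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```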

An important point here is that because of the `np.round()` function above, we are using a simple threshold (or cutoff) of 0.5.  Our predictions from `xgboost` yield continuous values between 0 and 1, and we force them into the binary classes that we began with.  However, because a customer that churns is expected to cost the company more than proactively trying to retain a customer who we think might churn, we should consider lowering this cutoff.  That will almost certainly increase the number of false positives, but it can also be expected to increase the number of true positives and reduce the number of false negatives.
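
To make the effect of the cutoff concrete, here is a minimal sketch that counts outcomes at two cutoffs. The scores and labels are invented for illustration and are not output from the endpoint; `confusion_counts` is a hypothetical helper.

```python
import numpy as np

# Invented scores and labels for illustration only.
scores = np.array([0.05, 0.20, 0.35, 0.45, 0.55, 0.70, 0.90, 0.95])
actual = np.array([0, 0, 1, 0, 1, 1, 1, 1])


def confusion_counts(actual, scores, cutoff):
    """Return (TP, FP, FN, TN) when scores above the cutoff are labeled as churn."""
    predicted = np.where(scores > cutoff, 1, 0)
    tp = int(((predicted == 1) & (actual == 1)).sum())
    fp = int(((predicted == 1) & (actual == 0)).sum())
    fn = int(((predicted == 0) & (actual == 1)).sum())
    tn = int(((predicted == 0) & (actual == 0)).sum())
    return tp, fp, fn, tn


for cutoff in (0.5, 0.3):
    print(cutoff, confusion_counts(actual, scores, cutoff))
```

On this toy data, moving the cutoff from 0.5 to 0.3 catches the one churner who was missed, at the price of one extra false positive.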

---
## Extensions

This notebook showcased how to use a pre-trained model that predicts whether a customer is likely to churn. There are several ways to extend it:
- Some customers who receive retention incentives will still churn.  Including a probability of churning despite receiving an incentive in our cost function would provide a better ROI on our retention programs.
- Customers who switch to a lower-priced plan or who deactivate a paid feature represent different kinds of churn that could be modeled separately.
- The evolution of customer behavior could be modeled. If usage is dropping and the number of calls placed to Customer Service is increasing, you are more likely to experience churn than if the trend is the opposite. A customer profile should incorporate behavior trends.
- Actual training data and monetary cost assignments could be more complex.
- Multiple models for each type of churn could be needed.
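
As a rough sketch of the first idea, the function below adds a probability of churning despite an incentive to a simple cost calculation. The cost structure, the dollar amounts, and the 20% probability are all hypothetical; only the counts (236, 18, 11) come from the confusion matrix discussed earlier.

```python
def expected_cost(tp, fp, fn, churn_cost, incentive_cost, p_churn_despite_incentive):
    """Expected campaign cost under the assumed (hypothetical) cost structure."""
    missed = fn * churn_cost      # churners we never contacted
    wasted = fp * incentive_cost  # incentives given to customers who would have stayed
    # Flagged churners all receive the incentive, and some still leave.
    targeted = tp * (incentive_cost + p_churn_despite_incentive * churn_cost)
    return missed + wasted + targeted


# Hypothetical dollar values: losing a customer costs 500, an incentive costs 100.
baseline = expected_cost(236, 18, 11, churn_cost=500, incentive_cost=100,
                         p_churn_despite_incentive=0.0)
adjusted = expected_cost(236, 18, 11, churn_cost=500, incentive_cost=100,
                         p_churn_despite_incentive=0.2)

print(baseline, adjusted)
```

Under these made-up numbers, ignoring the chance that an incentive fails understates the expected cost, which is why the term matters when estimating ROI.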

Regardless of the additional complexity, principles similar to those described in this notebook are likely to apply.

**For more details on end-to-end model training with hyperparameter optimization and deployment using SageMaker, see the [xgboost_customer_churn.ipynb](./xgboost_customer_churn.ipynb) notebook.**