# Churn Estimator
![image.png](attachment:image.png)

## Data Acquisition

In [2]:
# download data
!kaggle datasets download -d barun2104/telecom-churn

Downloading telecom-churn.zip to /home/simonmijares/ml-churn-estimator
  0%|                                               | 0.00/45.5k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 45.5k/45.5k [00:00<00:00, 1.05MB/s]


In [3]:
# unzip it
!unzip telecom-churn.zip

Archive:  telecom-churn.zip
  inflating: telecom_churn.csv       


## Data Exploration

In [1]:
# import libraries
import pandas as pd
import numpy as np
import pandas_profiling
import os
from sklearn.model_selection import train_test_split

# Column Description from source
https://www.kaggle.com/barun2104/telecom-churn

### Churn 
1 if customer cancelled service, 0 if not

### AccountWeeks
number of weeks customer has had active account

### ContractRenewal
1 if customer recently renewed contract, 0 if not

### DataPlan
1 if customer has data plan, 0 if not

### DataUsage
gigabytes of monthly data usage

### CustServCalls
number of calls into customer service

### DayMins
average daytime minutes per month

### DayCalls
average number of daytime calls

### MonthlyCharge
average monthly bill

### OverageFee
largest overage fee in last 12 months

In [2]:
csv_file = 'telecom_churn.csv'
churn_df = pd.read_csv(csv_file)

# print out the first few rows of data info
churn_df.head(10)

Unnamed: 0,Churn,AccountWeeks,ContractRenewal,DataPlan,DataUsage,CustServCalls,DayMins,DayCalls,MonthlyCharge,OverageFee,RoamMins
0,0,128,1,1,2.7,1,265.1,110,89.0,9.87,10.0
1,0,107,1,1,3.7,1,161.6,123,82.0,9.78,13.7
2,0,137,1,0,0.0,0,243.4,114,52.0,6.06,12.2
3,0,84,0,0,0.0,2,299.4,71,57.0,3.1,6.6
4,0,75,0,0,0.0,3,166.7,113,41.0,7.42,10.1
5,0,118,0,0,0.0,0,223.4,98,57.0,11.03,6.3
6,0,121,1,1,2.03,3,218.2,88,87.3,17.43,7.5
7,0,147,0,0,0.0,0,157.0,79,36.0,5.16,7.1
8,0,117,1,0,0.19,1,184.5,97,63.9,17.58,8.7
9,0,141,0,1,3.02,0,258.6,84,93.2,11.1,11.2


In [3]:
# print out some stats about the data
print('Number of samples: ', churn_df.shape[0],' [Churns:',churn_df[churn_df['Churn']==1].shape[0],'(',int(churn_df[churn_df['Churn']==1].shape[0]/churn_df.shape[0]*100),'%) + Not Churns:',churn_df[churn_df['Churn']==0].shape[0],']') 

Number of samples:  3333  [Churns: 483 ( 14 %) + Not Churns: 2850 ]
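
The class balance printed above can also be computed more compactly. The following is a minimal sketch using only the standard library; `class_balance` is a hypothetical helper, and the label list is rebuilt from the counts reported above rather than read from the CSV.

```python
from collections import Counter


def class_balance(labels):
    """Return (total, churns, non_churns, churn_percent) for a list of 0/1 labels."""
    counts = Counter(labels)
    total = len(labels)
    churns = counts.get(1, 0)
    non_churns = counts.get(0, 0)
    churn_percent = 100 * churns / total if total else 0.0
    return total, churns, non_churns, churn_percent


# Labels rebuilt from the counts reported above (483 churns, 2850 non-churns).
labels = [1] * 483 + [0] * 2850
total, churns, non_churns, churn_percent = class_balance(labels)
print(f"{total} samples: {churns} churns ({churn_percent:.1f}%), {non_churns} non-churns")
```

Roughly one customer in seven churns, so the classes are imbalanced and accuracy alone is a weak yardstick for this problem.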


In [4]:
pd.options.display.float_format = "{:,.2f}".format

churn_df.describe()

Unnamed: 0,Churn,AccountWeeks,ContractRenewal,DataPlan,DataUsage,CustServCalls,DayMins,DayCalls,MonthlyCharge,OverageFee,RoamMins
count,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0
mean,0.14,101.06,0.9,0.28,0.82,1.56,179.78,100.44,56.31,10.05,10.24
std,0.35,39.82,0.3,0.45,1.27,1.32,54.47,20.07,16.43,2.54,2.79
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,14.0,0.0,0.0
25%,0.0,74.0,1.0,0.0,0.0,1.0,143.7,87.0,45.0,8.33,8.5
50%,0.0,101.0,1.0,0.0,0.0,1.0,179.4,101.0,53.5,10.07,10.3
75%,0.0,127.0,1.0,1.0,1.78,2.0,216.4,114.0,66.2,11.77,12.1
max,1.0,243.0,1.0,1.0,5.4,9.0,350.8,165.0,111.3,18.19,20.0


In [5]:
churn_df.profile_report()

Summarize dataset:   0%|          | 0/24 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]



Let's take a closer look at the correlation matrix.

In [6]:
# Absolute pairwise correlations between all columns (features and target),
# rounded to 2 decimals, to spot redundant features
corr_matrix = churn_df.corr().abs().round(2)

# display renders the full DataFrame
display(corr_matrix)

Unnamed: 0,Churn,AccountWeeks,ContractRenewal,DataPlan,DataUsage,CustServCalls,DayMins,DayCalls,MonthlyCharge,OverageFee,RoamMins
Churn,1.0,0.02,0.26,0.1,0.09,0.21,0.21,0.02,0.07,0.09,0.07
AccountWeeks,0.02,1.0,0.02,0.0,0.01,0.0,0.01,0.04,0.01,0.01,0.01
ContractRenewal,0.26,0.02,1.0,0.01,0.02,0.02,0.05,0.0,0.05,0.02,0.05
DataPlan,0.1,0.0,0.01,1.0,0.95,0.02,0.0,0.01,0.74,0.02,0.0
DataUsage,0.09,0.01,0.02,0.95,1.0,0.02,0.0,0.01,0.78,0.02,0.16
CustServCalls,0.21,0.0,0.02,0.02,0.02,1.0,0.01,0.02,0.03,0.01,0.01
DayMins,0.21,0.01,0.05,0.0,0.0,0.01,1.0,0.01,0.57,0.01,0.01
DayCalls,0.02,0.04,0.0,0.01,0.01,0.02,0.01,1.0,0.01,0.02,0.02
MonthlyCharge,0.07,0.01,0.05,0.74,0.78,0.03,0.57,0.01,1.0,0.28,0.12
OverageFee,0.09,0.01,0.02,0.02,0.02,0.01,0.01,0.02,0.28,1.0,0.01


We can see that DataPlan and DataUsage have a correlation of **0.95**. Since DataUsage is the richer of the two (continuous usage rather than a binary flag), we will keep DataUsage and drop DataPlan for this analysis.

In [7]:
churn_decorr_df=churn_df.drop('DataPlan', axis=1)
churn_decorr_df.head(10)

Unnamed: 0,Churn,AccountWeeks,ContractRenewal,DataUsage,CustServCalls,DayMins,DayCalls,MonthlyCharge,OverageFee,RoamMins
0,0,128,1,2.7,1,265.1,110,89.0,9.87,10.0
1,0,107,1,3.7,1,161.6,123,82.0,9.78,13.7
2,0,137,1,0.0,0,243.4,114,52.0,6.06,12.2
3,0,84,0,0.0,2,299.4,71,57.0,3.1,6.6
4,0,75,0,0.0,3,166.7,113,41.0,7.42,10.1
5,0,118,0,0.0,0,223.4,98,57.0,11.03,6.3
6,0,121,1,2.03,3,218.2,88,87.3,17.43,7.5
7,0,147,0,0.0,0,157.0,79,36.0,5.16,7.1
8,0,117,1,0.19,1,184.5,97,63.9,17.58,8.7
9,0,141,0,3.02,0,258.6,84,93.2,11.1,11.2
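
The decision to drop DataPlan was made by reading the correlation matrix by eye. As a rough sketch of how that check could be automated, the snippet below flags column pairs whose absolute Pearson correlation exceeds a threshold. It uses only the standard library; `pearson`, `correlated_pairs`, and the 0.9 threshold are assumptions for illustration, and the sample holds just the first ten rows shown earlier, so its value differs slightly from the 0.95 computed on the full dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (assumes neither is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, |r|) for every column pair with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = abs(pearson(a, b))
        if r >= threshold:
            pairs.append((name_a, name_b, round(r, 2)))
    return pairs


# First ten rows of three columns, copied from the head() output above.
sample = {
    "DataPlan": [1, 1, 0, 0, 0, 0, 1, 0, 0, 1],
    "DataUsage": [2.7, 3.7, 0.0, 0.0, 0.0, 0.0, 2.03, 0.0, 0.19, 3.02],
    "DayCalls": [110, 123, 114, 71, 113, 98, 88, 79, 97, 84],
}
pairs = correlated_pairs(sample)
print(pairs)
```

Only the DataPlan/DataUsage pair is flagged on this sample, which agrees with the choice to keep a single one of the two columns.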


Sample and split the data for training, validation and testing

In [8]:
x=churn_decorr_df.drop('Churn', axis=1)
x.head(10)

Unnamed: 0,AccountWeeks,ContractRenewal,DataUsage,CustServCalls,DayMins,DayCalls,MonthlyCharge,OverageFee,RoamMins
0,128,1,2.7,1,265.1,110,89.0,9.87,10.0
1,107,1,3.7,1,161.6,123,82.0,9.78,13.7
2,137,1,0.0,0,243.4,114,52.0,6.06,12.2
3,84,0,0.0,2,299.4,71,57.0,3.1,6.6
4,75,0,0.0,3,166.7,113,41.0,7.42,10.1
5,118,0,0.0,0,223.4,98,57.0,11.03,6.3
6,121,1,2.03,3,218.2,88,87.3,17.43,7.5
7,147,0,0.0,0,157.0,79,36.0,5.16,7.1
8,117,1,0.19,1,184.5,97,63.9,17.58,8.7
9,141,0,3.02,0,258.6,84,93.2,11.1,11.2


In [9]:
y=churn_decorr_df[['Churn']]
y.head(10)

Unnamed: 0,Churn
0,0
1,0
2,0
3,0
4,0
5,0
6,0
7,0
8,0
9,0


In [10]:
# Split into train (60%), test (20%) and validation (20%) sets, stratified on Churn
x_train, x_not_train, y_train, y_not_train = train_test_split(x,y,test_size=0.4,random_state=21,stratify=y)
x_test, x_val, y_test, y_val = train_test_split(x_not_train,y_not_train,test_size=0.5,random_state=21,stratify=y_not_train)

In [11]:
print('Number of Train samples: ', x_train.shape[0],' [Churns:',y_train[y_train['Churn']==1].shape[0],'(',int(y_train[y_train['Churn']==1].shape[0]/y_train.shape[0]*100),'%) + Not Churns:',y_train[y_train['Churn']==0].shape[0],']')
print('Number of Validation samples: ', x_val.shape[0],' [Churns:',y_val[y_val['Churn']==1].shape[0],'(',int(y_val[y_val['Churn']==1].shape[0]/y_val.shape[0]*100),'%) + Not Churns:',y_val[y_val['Churn']==0].shape[0],']')
print('Number of Test samples: ', x_test.shape[0],' [Churns:',y_test[y_test['Churn']==1].shape[0],'(',int(y_test[y_test['Churn']==1].shape[0]/y_test.shape[0]*100),'%) + Not Churns:',y_test[y_test['Churn']==0].shape[0],']')

Number of Train samples:  1999  [Churns: 290 ( 14 %) + Not Churns: 1709 ]
Number of Validation samples:  667  [Churns: 97 ( 14 %) + Not Churns: 570 ]
Number of Test samples:  667  [Churns: 96 ( 14 %) + Not Churns: 571 ]
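
The sample counts above follow from applying the two splits in sequence: 40% is held out first, then that hold-out is halved. The sketch below reproduces the arithmetic; `split_sizes` is a hypothetical helper, and it assumes scikit-learn rounds a fractional `test_size` up to the next whole sample.

```python
from math import ceil


def split_sizes(n_samples, holdout_fraction=0.4, val_fraction_of_holdout=0.5):
    """Return (n_train, n_test, n_val) for a two-stage split.

    Assumption: a fractional test_size is rounded up (ceil) and the
    remaining samples go to the other side of the split.
    """
    n_holdout = ceil(holdout_fraction * n_samples)
    n_train = n_samples - n_holdout
    n_val = ceil(val_fraction_of_holdout * n_holdout)
    n_test = n_holdout - n_val
    return n_train, n_test, n_val


n_train, n_test, n_val = split_sizes(3333)
print(n_train, n_test, n_val)  # 1999 667 667
```

This gives an approximate 60/20/20 split, and `stratify` keeps the churn rate near 14% in each part.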


## File Creation in csv

In [12]:
def make_csv(x, y, filename, data_dir):
    '''Merges features and labels into one CSV file with the labels in the first column.
       :param x: Data features
       :param y: Data labels
       :param filename: Name of the csv file, e.g. 'train.csv'
       :param data_dir: The directory where the file will be saved
       '''
    # make data dir, if it does not exist
    os.makedirs(data_dir, exist_ok=True)

    # labels first, then features; rows are aligned on the shared index
    data = pd.concat([pd.DataFrame(y), pd.DataFrame(x)], axis=1)

    key = os.path.join(data_dir, filename)

    # no header and no index column, so the first column is the label
    data.to_csv(key, header=False, index=False)

    # nothing is returned, but a print statement indicates that the function has run
    print('Path created: ' + key)

In [49]:
# directory to save the data
data_dir = 'churn_data'

# train and validation files carry the label in the first column, as the training job expects
make_csv(x_train, y_train, filename='train.csv', data_dir=data_dir)
make_csv(x_val, y_val, filename='validation.csv', data_dir=data_dir)
# test.csv holds features only, because it is used for batch inference
pd.DataFrame(x_test).to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)

Path created: churn_data/train.csv
Path created: churn_data/validation.csv
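
The files written above follow the layout the training job reads later: no header row and the label in the first column (the training log reports `label_column=0`). As a small illustration of that layout, here is a sketch using only the standard library; `rows_label_first` is a hypothetical helper and the two feature rows are truncated to three columns for brevity.

```python
import csv
import io


def rows_label_first(features, labels):
    """Prepend each label to its feature row, giving label-first CSV rows."""
    return [[label, *row] for row, label in zip(features, labels)]


# Two shortened feature rows (AccountWeeks, ContractRenewal, DataUsage) and their labels.
features = [[128, 1, 2.7], [84, 0, 0.0]]
labels = [0, 0]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows_label_first(features, labels))
csv_text = buffer.getvalue()
print(csv_text)
```

The test file is the exception: it contains only the feature columns, since its labels are held back for scoring the predictions.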


## Load File to S3

In [25]:
import pandas as pd
import boto3
import sagemaker
import numpy as np

In [41]:
# session and role
session = sagemaker.Session()
role = sagemaker.get_execution_role()

# use the session's default S3 bucket (it is created if it does not exist yet)
bucket = session.default_bucket()

In [50]:
# should be the name of directory you created to save your features data
data_dir = 'churn_data'

# set prefix, a descriptive name for a directory  
prefix = 'Project_churn_predictor'

# upload all data to S3
#input_data = session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)
print(test_location)
print(val_location)
print(train_location)

s3://sagemaker-us-east-1-701904821656/Project_churn_predictor/test.csv
s3://sagemaker-us-east-1-701904821656/Project_churn_predictor/validation.csv
s3://sagemaker-us-east-1-701904821656/Project_churn_predictor/train.csv
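
The printed locations show how each upload is addressed: bucket, then the key prefix, then the local file name. The sketch below only mirrors that pattern so the URIs can be predicted or checked; `expected_s3_uri` is a hypothetical helper, not part of the SageMaker SDK, and `example-bucket` is a placeholder name.

```python
import os
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Build the s3:// URI pattern seen above: bucket / key_prefix / file name."""
    file_name = os.path.basename(local_path)
    return "s3://" + posixpath.join(bucket, key_prefix, file_name)


uri = expected_s3_uri("example-bucket", "Project_churn_predictor", "churn_data/test.csv")
print(uri)  # s3://example-bucket/Project_churn_predictor/test.csv
```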


## Modeling

In [43]:
from sagemaker import get_execution_role, image_uris
from sklearn.metrics import accuracy_score

In [44]:
# Our current execution role is required when creating the model, as the training
# and inference code will need to access the model artifacts.
role = get_execution_role()

In [45]:
# We need to retrieve the location of the container which is provided by Amazon for using XGBoost.
# As a matter of convenience, the training and inference code both use the same container.

container = image_uris.retrieve("xgboost", session.boto_region_name, version="latest")

In [51]:
xgb = sagemaker.estimator.Estimator(container,                     # The location of the container we wish to use
                                    role,                          # Our current IAM role
                                    instance_count=1,              # How many compute instances
                                    instance_type='ml.m4.xlarge',  # What kind of compute instances
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),  # Where to save the model artifacts
                                    sagemaker_session=session)


xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        early_stopping_rounds=10,
                        num_round=500)

In [52]:
s3_input_train = sagemaker.TrainingInput(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.TrainingInput(s3_data=val_location, content_type='csv')

In [53]:
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

2021-05-29 20:20:53 Starting - Starting the training job...
2021-05-29 20:21:17 Starting - Launching requested ML instancesProfilerReport-1622319653: InProgress
.........
2021-05-29 20:22:39 Starting - Preparing the instances for training......
2021-05-29 20:23:53 Downloading - Downloading input data...
2021-05-29 20:24:18 Training - Downloading the training image..
Arguments: train
[2021-05-29:20:24:31:INFO] Running standalone xgboost training.
[2021-05-29:20:24:31:INFO] File size need to be processed in the node: 0.1mb. Available memory size in the node: 8402.37mb
[2021-05-29:20:24:31:INFO] Determined delimiter of CSV input is ','
[20:24:31] S3DistributionType set as FullyReplicated
[20:24:31] 1999x9 matrix with 17991 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2021-05-29:20:24:31:INFO] Determined delimiter of CSV input is ','
[20:24:31] S3DistributionType set as FullyReplicated

In [54]:
xgb_transformer = xgb.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')

In [55]:
xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')

...........................
Arguments: serve
[2021-05-29 20:29:46 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2021-05-29 20:29:46 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2021-05-29 20:29:46 +0000] [1] [INFO] Using worker: gevent
[2021-05-29 20:29:46 +0000] [20] [INFO] Booting worker with pid: 20
[2021-05-29 20:29:46 +0000] [21] [INFO] Booting worker with pid: 21
[2021-05-29 20:29:46 +0000] [22] [INFO] Booting worker with pid: 22
[2021-05-29 20:29:46 +0000] [23] [INFO] Booting worker with pid: 23
  monkey.patch_all(subprocess=True)
[2021-05-29:20:29:46:INFO] Model loaded successfully for worker : 20
  monkey.patch_all(subprocess=True)
[2021-05-29:20:29:46:INFO] Model loaded successfully for worker : 21
  monkey.patch_all(subprocess=True)
[2021-05-29:20:29:46:INFO] Model loaded successfully for worker : 22
  monkey.patch_all(subprocess=True)
[2021-05-29

In [56]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir

Completed 12.8 KiB/12.8 KiB (143.8 KiB/s) with 1 file(s) remaining
download: s3://sagemaker-us-east-1-701904821656/xgboost-2021-05-29-20-25-23-010/test.csv.out to churn_data/test.csv.out


In [57]:
predictions = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)
predictions = [round(num) for num in predictions.squeeze().values]
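
With the `binary:logistic` objective, each line of the transform output is a probability, and `round` turns it into a 0/1 label. Python's built-in `round` sends exact halves to the nearest even integer, so a score of exactly 0.5 becomes 0. A minimal sketch with an explicit cut-off makes that choice visible and adjustable; `to_labels` and the default threshold of 0.5 are assumptions for illustration.

```python
def to_labels(scores, threshold=0.5):
    """Map predicted probabilities to 0/1 labels using an explicit threshold."""
    return [1 if score >= threshold else 0 for score in scores]


scores = [0.12, 0.5, 0.87]
print(to_labels(scores))                 # [0, 1, 1]
print(to_labels(scores, threshold=0.9))  # [0, 0, 0]
print(round(0.5), round(1.5))            # 0 2  (halves go to the even integer)
```

Lowering the threshold trades precision for recall, which can matter when churners are the minority class.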

In [58]:
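# predictions were generated from test.csv, so score them against y_test
# (the value recorded below came from comparing against y_val)
accuracy_score(y_test, predictions)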

0.7766116941529235

## Evaluation

In [60]:
from sklearn.metrics import confusion_matrix

In [61]:
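# confusion matrix for XGBoost, scored against the test labels that match the predictions
# (the matrix recorded below came from comparing against y_val)
confusion = confusion_matrix(y_test, predictions)
print("confusion matrix with XGBoost:\n{}".format(confusion))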

confusion matrix with XGBoost:
[[508  62]
 [ 87  10]]


In [62]:
print("Accuracy: {:.2f}".format((confusion[0][0]+confusion[1][1])/sum(sum(confusion))))
print("Precision: {:.2f}".format(confusion[1][1]/sum(np.transpose(confusion)[1])))
print("Recall: {:.2f}".format(confusion[1][1]/sum(confusion[1])))

Accuracy: 0.78
Precision: 0.14
Recall: 0.10
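
The three figures above can be read straight off the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The sketch below names the four cells explicitly instead of indexing and transposing; `binary_metrics` is a hypothetical helper, and the input is the matrix printed above.

```python
def binary_metrics(confusion):
    """Return (accuracy, precision, recall) from a 2x2 matrix [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against division by zero when nothing is predicted or labelled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


accuracy, precision, recall = binary_metrics([[508, 62], [87, 10]])
print(f"Accuracy: {accuracy:.2f}")    # Accuracy: 0.78
print(f"Precision: {precision:.2f}")  # Precision: 0.14
print(f"Recall: {recall:.2f}")        # Recall: 0.10
```

Precision is 10 / (10 + 62) and recall is 10 / (10 + 87), so the high accuracy comes almost entirely from the majority class.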


## Tuning

In [72]:
xgb_tunned = sagemaker.estimator.Estimator(container,                     # The location of the container we wish to use
                                           role,                          # Our current IAM role
                                           instance_count=1,              # How many compute instances
                                           instance_type='ml.m4.xlarge',  # What kind of compute instances
                                           output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),  # Where to save the model artifacts
                                           sagemaker_session=session)

# https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters
# https://machinelearningmastery.com/xgboost-for-imbalanced-classification/
xgb_tunned.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        eval_metric='auc',
                        scale_pos_weight=6,
                        early_stopping_rounds=10,
                        num_round=500)
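
The value `scale_pos_weight=6` is consistent with a common heuristic for imbalanced data: the number of negative training samples divided by the number of positive ones. Whether the author derived it this way is an assumption; the sketch below simply applies the heuristic to the training counts reported earlier (1709 non-churns, 290 churns). `suggested_scale_pos_weight` is a hypothetical helper.

```python
def suggested_scale_pos_weight(n_negative, n_positive):
    """Heuristic starting point: ratio of negative to positive training samples."""
    if n_positive == 0:
        raise ValueError("at least one positive sample is required")
    return n_negative / n_positive


weight = suggested_scale_pos_weight(1709, 290)
print(round(weight, 2))  # 5.89
```

Rounded to the nearest integer this gives 6, which is a starting point to tune rather than a fixed rule.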

In [73]:
xgb_tunned.fit({'train': s3_input_train, 'validation': s3_input_validation})

2021-05-29 20:51:52 Starting - Starting the training job...
2021-05-29 20:52:16 Starting - Launching requested ML instancesProfilerReport-1622321512: InProgress
......
2021-05-29 20:53:16 Starting - Preparing the instances for training......
2021-05-29 20:54:16 Downloading - Downloading input data...
2021-05-29 20:54:53 Training - Training image download completed. Training in progress.
2021-05-29 20:54:53 Uploading - Uploading generated training model.
Arguments: train
[2021-05-29:20:54:48:INFO] Running standalone xgboost training.
[2021-05-29:20:54:48:INFO] File size need to be processed in the node: 0.1mb. Available memory size in the node: 8407.77mb
[2021-05-29:20:54:48:INFO] Determined delimiter of CSV input is ','
[20:54:48] S3DistributionType set as FullyReplicated
[20:54:48] 1999x9 matrix with 17991 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2021-05-29:20:54:48:INFO] Determined de

In [78]:
xgb_tunned_transformer = xgb_tunned.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')

In [79]:
xgb_tunned_transformer.transform(test_location, content_type='text/csv', split_type='Line')

........................................
Arguments: serve
Arguments: serve
[2021-05-29 21:09:48 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2021-05-29 21:09:48 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2021-05-29 21:09:48 +0000] [1] [INFO] Using worker: gevent
[2021-05-29 21:09:48 +0000] [21] [INFO] Booting worker with pid: 21
  monkey.patch_all(subprocess=True)
[2021-05-29:21:09:48:INFO] Model loaded successfully for worker : 21
[2021-05-29 21:09:48 +0000] [22] [INFO] Booting worker with pid: 22
  monkey.patch_all(subprocess=True)
[2021-05-29:21:09:48:INFO] Model loaded successfully for worker : 22
[2021-05-29 21:09:48 +0000] [23] [INFO] Booting worker with pid: 23
[2021-05-29 21:09:48 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2021-05-29 21:09:48 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2021-05-29 21:09:48 +0000] [1] [INFO] Using wor

In [80]:
!aws s3 cp --recursive $xgb_tunned_transformer.output_path $data_dir+'tunned'

Completed 12.8 KiB/12.8 KiB (165.6 KiB/s) with 1 file(s) remaining
download: s3://sagemaker-us-east-1-701904821656/xgboost-2021-05-29-21-03-23-174/test.csv.out to churn_data+tunned/test.csv.out


In [81]:
predictions_tunned = pd.read_csv(os.path.join(data_dir+'+tunned', 'test.csv.out'), header=None)
predictions_tunned = [round(num) for num in predictions_tunned.squeeze().values]

In [82]:
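# predictions_tunned were generated from test.csv, so score them against y_test
# (the value recorded below came from comparing against y_val)
accuracy_score(y_test, predictions_tunned)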

0.7211394302848576

In [70]:
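# confusion matrix for the tuned XGBoost model, scored against the matching test labels
# (the matrix recorded below came from comparing against y_val)
confusion_tunned = confusion_matrix(y_test, predictions_tunned)
print("confusion matrix with XGBoost:\n{}".format(confusion_tunned))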

confusion matrix with XGBoost:
[[459 111]
 [ 77  20]]


In [71]:
print("Accuracy: {:.2f}".format((confusion_tunned[0][0]+confusion_tunned[1][1])/sum(sum(confusion_tunned))))
print("Precision: {:.2f}".format(confusion_tunned[1][1]/sum(np.transpose(confusion_tunned)[1])))
print("Recall: {:.2f}".format(confusion_tunned[1][1]/sum(confusion_tunned[1])))

Accuracy: 0.72
Precision: 0.15
Recall: 0.21


### Linear Learner