# Customer Churn Prediction

The dataset has 9 columns: the 8 features listed below and the target.
* Cust ID
* Name
* Age
* Gender
* Location
* Subscription Length in Months
* Monthly Bill
* Total Usage in GBs

The target variable is **Churn**.

## Preprocessing

The data did not require much preprocessing, as there were no missing values or empty columns.
Scaling the data reduced the accuracy of the models, so it was not applied.

The Name and Cust ID columns serve only to identify customers, so they were dropped.
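
The checks described above can be sketched as follows. This is a minimal illustration on three rows copied from the `head()` output in the notebook, with only a subset of the columns; the use of `drop_first=True` to keep a single `Gender_Male` indicator is a shorthand chosen for this example, not necessarily how the notebook does it.

```python
import pandas as pd

# Three rows taken from the notebook's head() output (subset of columns).
df = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Name": ["Customer_1", "Customer_2", "Customer_3"],
    "Age": [63, 62, 24],
    "Gender": ["Male", "Female", "Female"],
    "Monthly_Bill": [73.36, 48.76, 85.47],
    "Churn": [0, 0, 0],
})

# Count missing values per column; all zeros means nothing to impute.
missing_per_column = df.isna().sum()

# Identifier columns carry no predictive signal, so drop them.
features = df.drop(columns=["CustomerID", "Name"])

# One-hot encode Gender, keeping a single indicator column (Gender_Male).
features = pd.get_dummies(features, columns=["Gender"], drop_first=True, dtype=int)

print(missing_per_column.to_dict())
print(list(features.columns))
```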


## Feature Engineering

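I created the following features:
* usage_to_bill: Total Usage in GBs divided by Monthly Bill
* usage_subs: Total Usage in GBs divided by Subscription Length in Months
* usage_age: Total Usage in GBs divided by Age
* relative_bill: Monthly Bill divided by the mean Monthly Bill
* cost_per_month: Monthly Bill divided by Subscription Length in Months
* diff: absolute difference between Total Usage in GBs and Monthly Bill (created in the notebook)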

## Best Model
I tried several techniques: ANNs, XGBoost, logistic regression and random forest classifiers (RFCs). None of them produced a good model, because the features are almost uncorrelated with the target.

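The covariance of each variable with the target, in column order (these values match the last column of `encoded.cov()` in the notebook), was:

| Variable | Covariance with Churn |
| --- | --- |
| `Age` | 0.011910 |
| `Subscription_Length_Months` | 0.008063 |
| `Monthly_Bill` | -0.002137 |
| `Total_Usage_GB` | -0.185357 |
| `Gender_Male` | 0.000530 |
| `usage_to_bill` | -0.001046 |
| `usage_subs` | 0.014573 |
| `usage_age` | -0.007460 |
| `relative_bill` | -0.000033 |
| `cost_per_month` | 0.005081 |
| `diff` | -0.162085 |
| `y` (the target's own variance) | 0.249998 |

None of these relationships is strong enough to predict churn.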
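
A covariance depends on the scale of the feature, so the strength of each relationship is easier to judge on the correlation scale. The sketch below converts a few covariances with the target (from `encoded.cov()` in the notebook) into Pearson correlations, using the standard deviations printed by `encoded.describe()`. The helper name `pearson_from_cov` is an illustrative assumption.

```python
def pearson_from_cov(cov_xy: float, std_x: float, std_y: float) -> float:
    """Pearson correlation r = cov(x, y) / (std_x * std_y)."""
    return cov_xy / (std_x * std_y)

STD_TARGET = 0.499998  # standard deviation of the churn label

# feature: (covariance with the target, standard deviation of the feature)
stats = {
    "Age": (0.011910, 15.280283),
    "Monthly_Bill": (-0.002137, 20.230696),
    "Total_Usage_GB": (-0.185357, 130.463063),
}

correlations = {
    name: pearson_from_cov(cov, std, STD_TARGET)
    for name, (cov, std) in stats.items()
}

for name, r in correlations.items():
    print(f"{name}: r = {r:+.5f}")
```

With these inputs every correlation comes out below 0.003 in absolute value, including `Total_Usage_GB`, whose covariance looks large only because the feature itself has a large spread.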

Hence the best accuracy I achieved, 53%, was with AutoGluon's **WeightedEnsemble_L2** model.
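
To put 53% in context, it helps to compare it with the majority-class baseline and with the sampling noise of an accuracy estimate. The sketch below assumes the score was measured on a 20,000-row hold-out set like the one used in the notebook; that test-set size is an assumption, since the evaluation set for this score is not stated.

```python
import math

churn_rate = 0.49779   # mean of the churn label from encoded.describe()
n_test = 20_000        # ASSUMPTION: size of the hold-out set
accuracy = 0.53        # reported best accuracy

# Accuracy obtained by always predicting the majority class (no churn).
majority_baseline = max(churn_rate, 1 - churn_rate)

# Standard error of the accuracy of a coin-flip classifier on n_test rows.
standard_error = math.sqrt(0.5 * 0.5 / n_test)

# Number of standard errors between the reported score and 50%.
z_score = (accuracy - 0.5) / standard_error

print(f"majority baseline: {majority_baseline:.4f}")
print(f"standard error:    {standard_error:.4f}")
print(f"z-score vs chance: {z_score:.1f}")
```

Under that assumption the score is unlikely to be pure noise, but the gain over the baseline is still only about three percentage points.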

## Deploying the model

I deployed the model as a Streamlit app. It can be started by running
> streamlit run model.py

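The contents of `model.py` are not shown here. As a hedged sketch of one piece such an app would need, the function below rebuilds the engineered features for a single customer from raw inputs, in the column order used for training. The function name `build_features` is hypothetical, and `MEAN_MONTHLY_BILL` is the `Monthly_Bill` mean printed in the notebook (computed over the full dataset), which would have to be stored alongside the model.

```python
# Mean of Monthly_Bill as printed in the notebook; needed for relative_bill.
MEAN_MONTHLY_BILL = 65.0531968

def build_features(age, months, monthly_bill, usage_gb, is_male):
    """Return one row of model inputs as a dict, in training column order."""
    return {
        "Age": age,
        "Subscription_Length_Months": months,
        "Monthly_Bill": monthly_bill,
        "Total_Usage_GB": usage_gb,
        "Gender_Male": int(is_male),
        "usage_to_bill": usage_gb / monthly_bill,
        "usage_subs": usage_gb / months,
        "usage_age": usage_gb / age,
        "relative_bill": monthly_bill / MEAN_MONTHLY_BILL,
        "cost_per_month": monthly_bill / months,
        "diff": abs(usage_gb - monthly_bill),
    }

# First customer from the notebook's head() output.
row = build_features(age=63, months=17, monthly_bill=73.36, usage_gb=236, is_male=True)
for name, value in row.items():
    print(f"{name}: {value}")
```

In a Streamlit script the raw inputs would typically come from widgets such as `st.number_input`, and the resulting row would be wrapped in a one-row DataFrame before calling the saved model's `predict`.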

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler



In [3]:
db = pd.read_csv("/kaggle/input/custchurn/customer_churn_large_dataset.csv")
db.head()

Unnamed: 0,CustomerID,Name,Age,Gender,Location,Subscription_Length_Months,Monthly_Bill,Total_Usage_GB,Churn
0,1,Customer_1,63,Male,Los Angeles,17,73.36,236,0
1,2,Customer_2,62,Female,New York,1,48.76,172,0
2,3,Customer_3,24,Female,Los Angeles,5,85.47,460,0
3,4,Customer_4,36,Female,Miami,3,97.94,297,1
4,5,Customer_5,46,Female,Miami,19,58.14,266,0


In [4]:
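# One-hot encode Gender into Gender_Female / Gender_Male indicator columns
encoded = pd.get_dummies(db, columns = ["Gender"])
encoded["Y"] = encoded["Churn"]
# Drop the original target, the identifier columns, the redundant dummy and Location
encoded = encoded.drop(["Churn", "Name", "CustomerID", 'Gender_Female', "Location"], axis = 1)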

# Feature Engineering

* Creating new, relevant features: GBs used per unit of bill, usage per subscription month, usage relative to age, the customer's bill relative to the mean bill, and cost per subscription month

In [5]:
encoded["usage_to_bill"] = encoded["Total_Usage_GB"]/encoded["Monthly_Bill"]
encoded["usage_subs"] = encoded["Total_Usage_GB"]/encoded["Subscription_Length_Months"]
encoded["usage_age"] = encoded["Total_Usage_GB"]/encoded["Age"]
encoded["relative_bill"] = encoded["Monthly_Bill"]/encoded["Monthly_Bill"].mean()
encoded["cost_per_month"] = encoded["Monthly_Bill"] / encoded["Subscription_Length_Months"]
encoded["diff"] = abs(encoded["Total_Usage_GB"] - encoded["Monthly_Bill"])

In [6]:
encoded.describe()

Unnamed: 0,Age,Subscription_Length_Months,Monthly_Bill,Total_Usage_GB,Gender_Male,Y,usage_to_bill,usage_subs,usage_age,relative_bill,cost_per_month,diff
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,44.02702,12.4901,65.053197,274.39365,0.49784,0.49779,4.713915,43.349682,7.203311,1.0,10.310682,210.713169
std,15.280283,6.926461,20.230696,130.463063,0.499998,0.499998,2.911449,65.786483,4.756042,0.310987,14.505246,129.755348
min,18.0,1.0,30.0,50.0,0.0,0.0,0.5002,2.083333,0.714286,0.461161,1.250417,0.01
25%,31.0,6.0,47.54,161.0,0.0,0.0,2.482949,12.6875,3.666667,0.730787,3.409121,96.09
50%,44.0,12.0,65.01,274.0,0.0,0.0,4.210801,21.909091,6.235294,0.999336,5.195,209.06
75%,57.0,19.0,82.64,387.0,1.0,1.0,6.204849,42.545455,9.382979,1.270345,10.016667,322.24
max,70.0,24.0,100.0,500.0,1.0,1.0,16.616717,500.0,27.777778,1.537203,99.98,469.68


In [7]:
encoded["Monthly_Bill"].mean()

65.05319680000001

In [8]:
# Rename the target to "y" and move it to the last column
encoded["y"] = encoded["Y"]
encoded = encoded.drop(["Y"], axis = 1)

In [9]:
encoded.head()

Unnamed: 0,Age,Subscription_Length_Months,Monthly_Bill,Total_Usage_GB,Gender_Male,usage_to_bill,usage_subs,usage_age,relative_bill,cost_per_month,diff,y
0,63,17,73.36,236,1,3.217012,13.882353,3.746032,1.127692,4.315294,162.64,0
1,62,1,48.76,172,0,3.527482,172.0,2.774194,0.74954,48.76,123.24,0
2,24,5,85.47,460,0,5.382005,92.0,19.166667,1.313848,17.094,374.53,0
3,36,3,97.94,297,0,3.032469,99.0,8.25,1.505537,32.646667,199.06,1
4,46,19,58.14,266,0,4.575163,14.0,5.782609,0.89373,3.06,207.86,0


In [10]:
encoded.cov()

Unnamed: 0,Age,Subscription_Length_Months,Monthly_Bill,Total_Usage_GB,Gender_Male,usage_to_bill,usage_subs,usage_age,relative_bill,cost_per_month,diff,y
Age,233.487045,0.357911,0.343111,3.840942,0.006358,0.080272,-1.426894,-42.747096,0.005274,-0.173255,3.606867,0.01191
Subscription_Length_Months,0.357911,47.975862,-0.741784,-1.990928,0.001109,-0.006388,-267.050882,-0.108104,-0.011403,-63.728888,-1.351199,0.008063
Monthly_Bill,0.343111,-0.741784,409.281055,8.410608,0.022648,-32.261882,7.058745,0.081725,6.291483,66.351177,-370.109609,-0.002137
Total_Usage_GB,3.840942,-1.990928,8.410608,17020.610716,-0.090376,292.090367,2670.08459,445.899835,0.129288,-2.0916,16720.986218,-0.185357
Gender_Male,0.006358,0.001109,0.022648,-0.090376,0.249998,-0.00433,0.044297,-0.002468,0.000348,0.007331,-0.099078,0.00053
usage_to_bill,0.080272,-0.006388,-32.261882,292.090367,-0.00433,8.476533,45.377524,7.652759,-0.495931,-5.254044,318.865388,-0.001046
usage_subs,-1.426894,-267.050882,7.058745,2670.08459,0.044297,45.377524,4327.861403,70.609402,0.108507,763.81999,2616.965138,0.014573
usage_age,-42.747096,-0.108104,0.081725,445.899835,-0.002468,7.652759,70.609402,22.619938,0.001256,0.000485,438.169078,-0.00746
relative_bill,0.005274,-0.011403,6.291483,0.129288,0.000348,-0.495931,0.108507,0.001256,0.096713,1.019953,-5.689338,-3.3e-05
cost_per_month,-0.173255,-63.728888,66.351177,-2.0916,0.007331,-5.254044,763.81999,0.000485,1.019953,210.402171,-63.739985,0.005081


In [11]:
from sklearn.feature_selection import f_regression, f_classif, mutual_info_classif, chi2

In [12]:
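# Feature matrix without the target column
trains = encoded.drop("y", axis = 1)

# Univariate scores of each feature against the target
f = f_regression(trains, encoded["y"])  # (F statistics, p-values)
f1 = f_classif(trains, encoded["y"])  # ANOVA F-test; computed but not used in the table below
mi = mutual_info_classif(trains,encoded["y"])  # mutual information estimates
chi = chi2(trains,encoded["y"])  # (chi-squared statistics, p-values); needs non-negative features

# Collect the scores in one table, sorted by mutual information (ascending)
cols = [x for x in trains]
vals = pd.DataFrame({'cols':cols, 'f_score' : f[0], 'p_value': f[1], 'mi':mi,'chi':chi[0],'p_chi':chi[1]})
vals = vals.sort_values(by="mi")
vals.head(n=10)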

Unnamed: 0,cols,f_score,p_value,mi,chi,p_chi
0,Age,0.242999,0.622049,0.0,1.2887,0.2562876
2,Monthly_Bill,0.004465,0.946727,0.0,0.02809,0.8668984
3,Total_Usage_GB,0.807423,0.368885,0.0,50.084482,1.47267e-12
5,usage_to_bill,0.051598,0.820307,0.0,0.092783,0.760668
6,usage_subs,0.019628,0.88858,0.000406,1.959647,0.1615511
7,usage_age,0.984169,0.321174,0.000621,3.090502,0.07875051
1,Subscription_Length_Months,0.542063,0.461581,0.000717,2.082135,0.1490316
8,relative_bill,0.004465,0.946727,0.00105,0.000432,0.9834214
10,diff,0.624159,0.429508,0.001178,49.871909,1.641175e-12
9,cost_per_month,0.049087,0.82466,0.003874,1.001693,0.3169012
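
In the table above, `Total_Usage_GB` and `diff` have tiny chi-squared p-values while their F-test p-values are unremarkable. One possible explanation is that the chi-squared score treats feature values as non-negative counts, so its statistic grows with the scale of the feature. The pure-Python sketch below is a simplified reimplementation of an observed-versus-expected chi-squared computation; treating it as equivalent to `sklearn.feature_selection.chi2` is an assumption. The data is made up, and `chi2_stat` is a hypothetical helper.

```python
def chi2_stat(values, labels):
    """Chi-squared statistic of one non-negative feature against class labels.

    Observed = sum of the feature within each class;
    expected = class frequency * total of the feature.
    """
    total = sum(values)
    n = len(labels)
    stat = 0.0
    for cls in set(labels):
        observed = sum(v for v, y in zip(values, labels) if y == cls)
        expected = total * labels.count(cls) / n
        stat += (observed - expected) ** 2 / expected
    return stat

# Made-up feature values and binary labels.
usage = [50, 120, 300, 410, 90, 275, 330, 60]
churn = [0, 1, 0, 1, 0, 1, 0, 1]

small = chi2_stat(usage, churn)
large = chi2_stat([v * 100 for v in usage], churn)

# Rescaling the feature by 100 rescales the statistic by the same factor.
print(small, large, large / small)
```

Because the statistic scales with the feature's units, a wide-ranging column such as `Total_Usage_GB` can look significant under this test even when its correlation with the target is negligible.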


In [13]:
x = encoded.iloc[:, 0:-1]
y = encoded["y"]

In [14]:
# Single-feature subset; not used by the models below
required = ["Monthly_Bill"]
x1 = x[required]

In [15]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.8)

logreg = LogisticRegression(class_weight = "balanced")
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)

print(pd.Series(y_pred).value_counts())
print(classification_report(y_test, y_pred))
print(logreg.score(x_test, y_test))

0    11542
1     8458
dtype: int64
              precision    recall  f1-score   support

           0       0.50      0.57      0.54     10109
           1       0.49      0.42      0.45      9891

    accuracy                           0.50     20000
   macro avg       0.50      0.50      0.49     20000
weighted avg       0.50      0.50      0.50     20000

0.49815
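
The support counts differ between the reports in this notebook (for example 10109/9891 here versus 10112/9888 in the XGBoost cell), which suggests the split was redrawn between runs. A minimal sketch of a reproducible, stratified split follows; the synthetic data and the seed value 42 are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))   # synthetic features
y = np.array([0, 1] * 500)       # balanced synthetic labels

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42, stratify=y
)

# Calling it again with the same seed yields the same partition.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, train_size=0.8, random_state=42, stratify=y
)

print(len(X_train), len(X_test), y_test.mean())
print(np.array_equal(X_test, X_test2))
```

With a fixed seed, every model is scored on the same test rows, so differences between reports reflect the models rather than the split.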


# XGBoost

In [16]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [18]:
x_train.columns

Index(['Age', 'Subscription_Length_Months', 'Monthly_Bill', 'Total_Usage_GB',
       'Gender_Male', 'usage_to_bill', 'usage_subs', 'usage_age',
       'relative_bill', 'cost_per_month', 'diff'],
      dtype='object')

In [19]:
model = XGBClassifier()
model.fit(x_train, y_train)
y_pred_x = model.predict(x_test)
predictions = [round(value) for value in y_pred_x]

print(f'Accuracy score is {accuracy_score(y_test, predictions)}')
print()
print(f'the classification report is \n {classification_report(y_test, predictions)}')

Accuracy score is 0.5034

the classification report is 
               precision    recall  f1-score   support

           0       0.51      0.51      0.51     10112
           1       0.50      0.49      0.50      9888

    accuracy                           0.50     20000
   macro avg       0.50      0.50      0.50     20000
weighted avg       0.50      0.50      0.50     20000



In [20]:
model.save_model('xgboost.json')

# AutoGluon

In [20]:
!pip install autogluon --target=/kaggle/working/mysitepackages

Collecting autogluon
  Downloading autogluon-0.8.2-py3-none-any.whl (9.7 kB)
Collecting autogluon.core[all]==0.8.2 (from autogluon)
  Downloading autogluon.core-0.8.2-py3-none-any.whl (224 kB)
Collecting autogluon.features==0.8.2 (from autogluon)
  Downloading autogluon.features-0.8.2-py3-none-any.whl (62 kB)
Collecting autogluon.tabular[all]==0.8.2 (from autogluon)
  Downloading autogluon.tabular-0.8.2-py3-none-any.whl (285 kB)
Collecting autogluon.multimodal==0.8.2 (from autogluon)
  Downloading autogluon.multimodal-0.8.2-py3-none-any.whl (372 kB)

In [23]:
import sys
sys.path.append('/kaggle/working/mysitepackages')

In [25]:
from autogluon.tabular import TabularDataset, TabularPredictor

tr = pd.concat([x_train, y_train], axis = 1)
te = pd.concat([x_test,y_test], axis = 1)

train_data1 = TabularDataset(tr)
test_data1 = TabularDataset(te)


predictor = TabularPredictor(label="y").fit(train_data=train_data1)

No path specified. Models will be saved in: "AutogluonModels/ag-20230825_174102/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230825_174102/"
AutoGluon Version:  0.8.2
Python Version:     3.10.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Aug 24 17:32:58 UTC 2023
Disk Space Avail:   15.19 GB / 20.96 GB (72.5%)
Train Data Rows:    80000
Train Data Columns: 11
Label Column: y
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:          

In [32]:
train_data1

Unnamed: 0,Age,Subscription_Length_Months,Monthly_Bill,Total_Usage_GB,Gender_Male,usage_to_bill,usage_subs,usage_age,relative_bill,cost_per_month,diff,y
50030,56,8,85.16,291,1,3.417097,36.375000,5.196429,1.309082,10.645000,205.84,0
87572,30,16,73.53,348,0,4.732762,21.750000,11.600000,1.130306,4.595625,274.47,1
5303,54,23,93.56,392,0,4.189825,17.043478,7.259259,1.438208,4.067826,298.44,1
28394,24,15,93.32,360,0,3.857694,24.000000,15.000000,1.434518,6.221333,266.68,1
92526,47,17,85.77,175,0,2.040340,10.294118,3.723404,1.318459,5.045294,89.23,1
...,...,...,...,...,...,...,...,...,...,...,...,...
2420,48,23,84.88,212,0,2.497644,9.217391,4.416667,1.304778,3.690435,127.12,0
3396,45,20,94.76,187,1,1.973407,9.350000,4.155556,1.456654,4.738000,92.24,1
93856,62,7,69.77,461,1,6.607424,65.857143,7.435484,1.072507,9.967143,391.23,1
19991,49,17,60.70,310,0,5.107084,18.235294,6.326531,0.933083,3.570588,249.30,0


# RFC

In [25]:
from sklearn.ensemble import RandomForestClassifier

In [47]:
rfc1 = RandomForestClassifier(n_estimators = 200,criterion='entropy', max_depth = 15)
rfc1.fit(x_train, y_train)
y_pred_rfc = rfc1.predict(x_test)
print(classification_report(y_test, y_pred_rfc))
rfc1.score(x_test, y_test)

              precision    recall  f1-score   support

           0       0.51      0.58      0.54     10184
           1       0.49      0.42      0.45      9816

    accuracy                           0.50     20000
   macro avg       0.50      0.50      0.50     20000
weighted avg       0.50      0.50      0.50     20000



0.50155

# ANN

In [55]:
!pip install scikeras
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

Collecting scikeras
  Downloading scikeras-0.11.0-py3-none-any.whl (27 kB)
Installing collected packages: scikeras
Successfully installed scikeras-0.11.0


In [58]:
def create_baseline():
    model = Sequential()
    model.add(Dense(60, input_shape=(11,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [62]:
estimator = KerasClassifier(model=create_baseline, epochs=20, batch_size=200, verbose=0)
kfold = StratifiedKFold(n_splits=5, shuffle=True)
results = cross_val_score(estimator, x, y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Baseline: 50.03% (0.20%)


In [64]:
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(model=create_baseline, epochs=20, batch_size=200, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=5, shuffle=True)
results = cross_val_score(pipeline, x, y, cv=kfold)
print("Standardized: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Standardized: 50.27% (0.28%)


In [68]:
def create_smaller():
    model = Sequential()
    model.add(Dense(30, input_shape=(11,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [69]:
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(model=create_smaller, epochs=100, batch_size=400, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
results = cross_val_score(pipeline, x, y, cv=kfold)
print("Smaller: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Smaller: 50.16% (0.36%)


In [78]:
def create_larger():
    model = Sequential()
    model.add(Dense(60, input_shape=(11,), activation='relu'))
    model.add(Dense(30, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [81]:
import tensorflow as tf

# Log which device (CPU/GPU) each operation is placed on
tf.debugging.set_log_device_placement(True)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(model=create_larger, epochs=60, batch_size=250, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=5, shuffle=True)
results = cross_val_score(pipeline, x, y, cv=kfold)
print("Larger: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Larger: 50.06% (0.31%)
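
The four cross-validated scores above can be compared with the 50% chance level relative to their fold-to-fold spread. The sketch below only restates the printed means and standard deviations; dividing the gap by the standard deviation is a rough heuristic, not a formal test.

```python
# (mean accuracy %, std across folds %) as printed by the cells above
cv_results = {
    "Baseline": (50.03, 0.20),
    "Standardized": (50.27, 0.28),
    "Smaller": (50.16, 0.36),
    "Larger": (50.06, 0.31),
}

CHANCE = 50.0

# Gap to chance, expressed in units of the fold-to-fold standard deviation.
gaps = {
    name: (mean - CHANCE) / std
    for name, (mean, std) in cv_results.items()
}

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} fold-std above chance")
```

Every variant sits less than one standard deviation above 50%, so by this heuristic none of the network architectures is distinguishable from guessing.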
