# Customer Churn Prediction
A bank wants to improve customer retention for one of its products, savings accounts. It wants to identify customers whose balances are likely to fall below the minimum balance, which is how churn is defined here. The available data covers customer information such as age, gender, and other demographics, along with each customer's transactions with the bank.
Your task as a data scientist is to predict the propensity to churn for each customer.

## Data Dictionary
The dataset contains multiple variables, which fall into 3 categories:

## I. Demographic information about customers

* customer_id - Customer ID
* vintage - Vintage of the customer with the bank, in days
* age - Age of customer
* gender - Gender of customer
* dependents - Number of dependents
* occupation - Occupation of the customer
* city - City of the customer (anonymized)
       
## II. Customer Bank Relationship

* customer_nw_category - Net worth of customer (3: Low, 2: Medium, 1: High)
* branch_code - Branch code for a customer account
* days_since_last_transaction - Number of days since the last credit in the last 1 year

## III. Transactional Information

* current_balance - Balance as of today
* previous_month_end_balance - End-of-month balance of the previous month
* average_monthly_balance_prevQ - Average monthly balance (AMB) in the previous quarter
* average_monthly_balance_prevQ2 - Average monthly balance (AMB) in the quarter before the previous quarter
* current_month_credit - Total credit amount in the current month
* previous_month_credit - Total credit amount in the previous month
* current_month_debit - Total debit amount in the current month
* previous_month_debit - Total debit amount in the previous month
* current_month_balance - Average balance of the current month
* previous_month_balance - Average balance of the previous month
* churn - Whether the customer's average balance falls below the minimum balance in the next quarter (1/0)

# Importing Modules

In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import recall_score, roc_curve, roc_auc_score

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Importing data

In [None]:
df = pd.read_csv("filled_churn_prediction.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,customer_id,vintage,age,gender,dependents,occupation,city,customer_nw_category,branch_code,days_since_last_transaction,current_balance,previous_month_end_balance,average_monthly_balance_prevQ,average_monthly_balance_prevQ2,current_month_credit,previous_month_credit,current_month_debit,previous_month_debit,current_month_balance,previous_month_balance,churn,dependents_with_missing_data,occupation_with_missing_data,city_with_missing_data,gender_with_missing_data,days_since_last_transaction_with_missing_data
0,0,1,3135,66,Male,0,self_employed,187,2,755,224.0,1458.71,1458.71,1458.71,1449.07,0.2,0.2,0.2,0.2,1458.71,1458.71,0,False,False,False,False,False
1,1,2,310,35,Male,0,self_employed,0,2,3214,60.0,5390.37,8704.66,7799.26,12419.41,0.56,0.56,5486.27,100.56,6496.78,8787.61,0,False,False,True,False,False
2,2,4,2356,31,Male,0,salaried,146,2,41,30.0,3913.16,5815.29,4910.17,2815.94,0.61,0.61,6046.73,259.23,5006.28,5070.14,0,False,False,False,False,True
3,3,5,478,90,missing,0,self_employed,1020,2,582,147.0,2291.91,2291.91,2084.54,1006.54,0.47,0.47,0.47,2143.33,2291.91,1669.79,1,True,False,False,True,False
4,4,6,2531,42,Male,2,self_employed,1494,3,388,58.0,927.72,1401.72,1643.31,1871.12,0.33,714.61,588.62,1538.06,1157.15,1677.16,1,False,False,False,False,False


In [None]:
df.columns

Index(['Unnamed: 0', 'customer_id', 'vintage', 'age', 'gender', 'dependents',
       'occupation', 'city', 'customer_nw_category', 'branch_code',
       'days_since_last_transaction', 'current_balance',
       'previous_month_end_balance', 'average_monthly_balance_prevQ',
       'average_monthly_balance_prevQ2', 'current_month_credit',
       'previous_month_credit', 'current_month_debit', 'previous_month_debit',
       'current_month_balance', 'previous_month_balance', 'churn',
       'dependents_with_missing_data', 'occupation_with_missing_data',
       'city_with_missing_data', 'gender_with_missing_data',
       'days_since_last_transaction_with_missing_data'],
      dtype='object')

In [None]:
# Drop the redundant index column
df.drop(["Unnamed: 0"], axis=1, inplace=True)

# Selecting the target column for prediction

* y is the target column to predict.
* X holds the features used to predict y.

In [None]:
X = df.drop(["churn"], axis = 1)
y = df["churn"]

# OneHotEncoder

OneHotEncoder converts each categorical column into a set of binary (0/1) columns, one per category.
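
To make the idea concrete, here is a minimal sketch on a made-up `occupation` column. The three rows are illustrative and not taken from the dataset. Each distinct category becomes its own 0/1 column, and exactly one of those columns is set in every row.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy column; the values are illustrative, not rows from the dataset.
occupation = np.array([["salaried"], ["self_employed"], ["salaried"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(occupation).toarray()

# One output column per category, in sorted category order.
print(encoder.categories_[0])
print(encoded)
```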

In [None]:
df.dtypes

customer_id                                        int64
vintage                                            int64
age                                                int64
gender                                            object
dependents                                         int64
occupation                                        object
city                                               int64
customer_nw_category                               int64
branch_code                                        int64
days_since_last_transaction                      float64
current_balance                                  float64
previous_month_end_balance                       float64
average_monthly_balance_prevQ                    float64
average_monthly_balance_prevQ2                   float64
current_month_credit                             float64
previous_month_credit                            float64
current_month_debit                              float64
previous_month_debit           

In [None]:
from sklearn.compose import ColumnTransformer

categorical_features = ["gender", "occupation"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", 
                                 one_hot, 
                                 categorical_features)],
                                 remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

array([[0.0, 1.0, 0.0, ..., False, False, False],
       [0.0, 1.0, 0.0, ..., True, False, False],
       [0.0, 1.0, 0.0, ..., False, False, True],
       ...,
       [0.0, 1.0, 0.0, ..., False, False, False],
       [0.0, 1.0, 0.0, ..., False, False, True],
       [0.0, 1.0, 0.0, ..., False, False, False]], dtype=object)

# Using MinMaxScaler

MinMaxScaler rescales each feature to a given range, by default zero to one.
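
As a rough illustration of what the scaler computes, the sketch below applies the min-max formula, `(x - min) / (max - min)`, by hand to the five `vintage` values visible in the `df.head()` preview. The variable names are hypothetical, and because only five rows are used, the results will not match the notebook's scaled values, which depend on the full column's range.

```python
import numpy as np

# Five vintage values from the df.head() preview (not the full column).
vintage = np.array([3135.0, 310.0, 2356.0, 478.0, 2531.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
scaled = (vintage - vintage.min()) / (vintage.max() - vintage.min())
print(scaled)
```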

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler_X = scaler.fit_transform(transformed_X)

In [None]:
dframe = pd.DataFrame(scaler_X)
dframe.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.23233,0.730337,0.0,0.113402,0.5,0.157708,0.613699,0.001178,0.000802,5e-06,0.003572,1.548512e-08,8.044683e-08,2.487609e-08,1.343546e-07,0.000836,0.001158,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,3.3e-05,0.010221,0.382022,0.0,0.0,0.5,0.672035,0.164384,0.001843,0.002064,0.001118,0.005754,4.482534e-08,2.328724e-07,0.0007182983,7.110188e-05,0.001707,0.002438,0.0,0.0,1.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,9.9e-05,0.171083,0.337079,0.0,0.088539,0.5,0.008366,0.082192,0.001593,0.001561,0.000611,0.003844,4.890037e-08,2.540426e-07,0.0007916775,0.0001833021,0.00145,0.001789,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.000132,0.02343,1.0,0.0,0.618557,0.5,0.121523,0.40274,0.001319,0.000947,0.000115,0.003484,3.749028e-08,1.94766e-07,6.022631e-08,0.001515605,0.00098,0.001195,1.0,0.0,0.0,1.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000165,0.184842,0.460674,0.038462,0.906004,1.0,0.080945,0.158904,0.001088,0.000792,3.8e-05,0.003656,2.60802e-08,0.0003025648,7.706481e-05,0.001087601,0.000784,0.001196,0.0,0.0,0.0,0.0,0.0


# Different classification models

* RandomForestClassifier
* LogisticRegression
* KNeighborsClassifier

In [None]:
#RandomForestClassifier

np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(dframe, y, test_size = 0.2)

n_est = [10,20,30,50,100,500]

for i in n_est:
  clf = RandomForestClassifier(n_estimators = i)
  clf.fit(X_train, y_train)
  print(clf.score(X_test, y_test))

0.8548529152721508
0.8633080852562973
0.8629557865069579
0.866654923375022
0.8638365333803065
0.8671833714990311


In [None]:
model = LogisticRegression(max_iter = 100, n_jobs = -1)
model.fit(X_train, y_train)
model.score(X_test,y_test)

0.8173330984675005

In [None]:
preds = model.predict(X_test)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.82      1.00      0.90      4639
           1       1.00      0.00      0.00      1038

    accuracy                           0.82      5677
   macro avg       0.91      0.50      0.45      5677
weighted avg       0.85      0.82      0.74      5677



In [None]:
confusion_matrix(y_test, preds)

array([[4639,    0],
       [1037,    1]])
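
The confusion matrix shows why an accuracy of about 82% is misleading here. The short sketch below uses only the four counts printed above to recompute accuracy and churn-class recall by hand, and compares them with the accuracy of always predicting "no churn". scikit-learn lays the matrix out with true labels as rows and predicted labels as columns.

```python
# Counts from the confusion matrix above (rows: true class, columns: predicted).
tn, fp = 4639, 0
fn, tp = 1037, 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total            # share of all customers classified correctly
recall = tp / (tp + fn)                 # share of actual churners that were caught
majority_baseline = (tn + fp) / total   # accuracy of always predicting "no churn"

print(f"accuracy={accuracy:.4f} recall={recall:.4f} baseline={majority_baseline:.4f}")
```

The model beats the always-"no churn" baseline by a single customer, so its accuracy reflects the class imbalance rather than any ability to find churners.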

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model2 = KNeighborsClassifier()
model2.fit(X_train,y_train)
model2.score(X_test,y_test)

0.7902060947683636

In [None]:
preds = model2.predict(X_test)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.82      0.95      0.88      4639
           1       0.25      0.07      0.11      1038

    accuracy                           0.79      5677
   macro avg       0.53      0.51      0.50      5677
weighted avg       0.72      0.79      0.74      5677



In [None]:
clf = RandomForestClassifier(n_estimators = 50)
clf.fit(X_train,y_train)
cros_clf = cross_val_score(clf, dframe, y, scoring = "precision", cv= 5, n_jobs = -1)
cros_clf

array([0.75243665, 0.72058824, 0.7032967 , 0.74624374, 0.75638507])

In [None]:
clf = RandomForestClassifier(n_estimators = 50)
clf.fit(X_train,y_train)
cros_clf = cross_val_score(clf, dframe, y, scoring = "recall", cv= 5, n_jobs = -1)
cros_clf

array([0.37357414, 0.42585551, 0.42870722, 0.43060837, 0.39638783])

In [None]:
clf = RandomForestClassifier(n_estimators = 50)
clf.fit(X_train,y_train)
cros_clf = cross_val_score(clf, dframe, y, scoring = "accuracy", cv= 5, n_jobs = -1)
cros_clf

array([0.86031355, 0.86524573, 0.85782241, 0.86522199, 0.85870331])

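Of the three models, **RandomForestClassifier** has the highest *accuracy*, roughly **86**% (about 86.1% averaged over the 5 cross-validation folds), but its cross-validated *recall* averages only about **41**%, so it misses more than half of the customers who churn.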
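
The per-fold arrays above are easier to compare once averaged. A small sketch, with the fold scores copied from the three cross-validation outputs for the random forest:

```python
from statistics import mean

# Fold scores copied from the cross_val_score outputs above.
fold_scores = {
    "precision": [0.75243665, 0.72058824, 0.7032967, 0.74624374, 0.75638507],
    "recall": [0.37357414, 0.42585551, 0.42870722, 0.43060837, 0.39638783],
    "accuracy": [0.86031355, 0.86524573, 0.85782241, 0.86522199, 0.85870331],
}

# Average each metric over the 5 folds.
summary = {name: mean(scores) for name, scores in fold_scores.items()}
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```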

# StandardScaler

StandardScaler transforms each feature so that it has a *mean* of 0 and a *standard deviation* of 1. In short, it standardizes the data.
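
A minimal sketch of the same calculation done by hand, using the five `age` values from the `df.head()` preview. The variable names are hypothetical, and the numbers will differ from the notebook's output because the mean and standard deviation here come from five rows only. `StandardScaler` uses the population standard deviation, which is also NumPy's default (`ddof=0`).

```python
import numpy as np

# Five age values from the df.head() preview (not the full column).
age = np.array([66.0, 35.0, 31.0, 90.0, 42.0])

# Standardize: subtract the mean, divide by the population standard deviation.
z = (age - age.mean()) / age.std()
print(z)
print(z.mean(), z.std())
```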

In [None]:
from sklearn.preprocessing import StandardScaler

s_scaler = StandardScaler()
s_scaler_X = s_scaler.fit_transform(transformed_X)

In [None]:
dframe2 = pd.DataFrame(s_scaler_X)
dframe2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
0,-0.813874,0.845655,-0.137282,-0.037568,-0.277108,-0.556106,0.785274,-0.279606,-1.731304,0.478644,0.999147,-0.330877,-1.313306,-0.341489,-0.182318,1.927027,-0.139017,-0.141953,-0.144709,-0.127317,-0.044545,-0.109858,-0.070378,-0.137427,-0.142564,-0.142265,-0.308264,-0.053166,-0.170635,-0.137282,-0.357918
1,-0.813874,0.845655,-0.137282,-0.037568,-0.277108,-0.556106,0.785274,-0.279606,-1.73119,-1.275909,-0.741756,-0.330877,-1.73198,-0.341489,2.439824,-0.066312,-0.04672,0.028425,0.007249,0.118793,-0.04454,-0.109846,0.035155,-0.133297,-0.022705,0.030459,-0.308264,-0.053166,5.860457,-0.137282,-0.357918
2,-0.813874,0.845655,-0.137282,-0.037568,-0.277108,1.798219,-1.273441,-0.279606,-1.730961,-0.005178,-0.966389,-0.330877,-1.405101,-0.341489,-0.943689,-0.430947,-0.081398,-0.039514,-0.061991,-0.096652,-0.044539,-0.109844,0.045936,-0.126767,-0.058165,-0.057152,-0.308264,-0.053166,-0.170635,-0.137282,2.793937
3,-0.813874,-1.182516,7.284295,-0.037568,-0.277108,-0.556106,0.785274,-0.279606,-1.730847,-1.171568,2.346943,-0.330877,0.551697,-0.341489,-0.366796,0.99113,-0.119457,-0.122361,-0.129711,-0.137244,-0.044541,-0.109849,-0.070372,-0.049234,-0.122742,-0.13729,3.243971,-0.053166,-0.170635,7.284295,-0.357918
4,-0.813874,0.845655,-0.137282,-0.037568,-0.277108,-0.556106,0.785274,-0.279606,-1.730733,0.103512,-0.348649,1.756001,1.612935,1.172672,-0.573667,-0.090621,-0.151482,-0.143293,-0.140285,-0.117848,-0.044543,-0.085794,-0.059058,-0.074142,-0.149738,-0.137116,-0.308264,-0.053166,-0.170635,-0.137282,-0.357918
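
Both scalers above are fitted on the whole dataset before `train_test_split`, so statistics from the test rows influence the training features. One common alternative, shown here as a sketch rather than as this notebook's method, is to wrap the encoder, scaler, and classifier in a `Pipeline` so that every step is fitted on the training rows only. The small DataFrame below is synthetic, and its columns are an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the churn data (random values, for illustration only).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "gender": rng.choice(["Male", "Female"], size=n),
    "occupation": rng.choice(["salaried", "self_employed", "student"], size=n),
    "age": rng.integers(18, 90, size=n),
    "current_balance": rng.normal(5000, 2000, size=n),
})
toy_y = rng.integers(0, 2, size=n)

# Encode the categorical columns and scale the numeric ones.
preprocess = ColumnTransformer([
    ("one_hot", OneHotEncoder(handle_unknown="ignore"), ["gender", "occupation"]),
    ("scale", StandardScaler(), ["age", "current_balance"]),
])
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_y, test_size=0.2, random_state=42)
pipe.fit(X_tr, y_tr)            # encoder and scaler learn from X_tr only
test_score = pipe.score(X_te, y_te)  # X_te is transformed with training statistics
print(test_score)
```

Because the labels here are random, the printed score is meaningless; the point is only the order in which the steps are fitted.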


# Different classification models

* RandomForestClassifier
* LogisticRegression
* KNeighborsClassifier

In [None]:
#RandomForestClassifier

np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(dframe2, y, test_size = 0.2)

n_est = [10,20,30,50,100,500]

for i in n_est:
  clf = RandomForestClassifier(n_estimators = i)
  clf.fit(X_train, y_train)
  print(clf.score(X_test, y_test))

0.8527391227761142
0.8618988902589396
0.8631319358816276
0.8671833714990311
0.8657741765016734
0.866654923375022


In [None]:
clf = RandomForestClassifier(n_estimators = 50)
clf.fit(X_train,y_train)
cros_clf = cross_val_score(clf, dframe2, y, scoring = "recall", cv= 5, n_jobs = -1)
cros_clf

array([0.36311787, 0.42870722, 0.42680608, 0.42585551, 0.39068441])

In [None]:
clf = RandomForestClassifier(n_estimators = 50)
clf.fit(X_train,y_train)
cros_clf = cross_val_score(clf, dframe2, y, scoring = "precision", cv= 5, n_jobs = -1)
cros_clf

array([0.72128378, 0.723229  , 0.70578778, 0.75340136, 0.77124183])

In [None]:
model = LogisticRegression(max_iter = 100, n_jobs = -1)
model.fit(X_train, y_train)
model.score(X_test,y_test)

0.8277259115730139

In [None]:
preds = model.predict(X_test)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.83      0.99      0.90      4639
           1       0.77      0.08      0.15      1038

    accuracy                           0.83      5677
   macro avg       0.80      0.54      0.53      5677
weighted avg       0.82      0.83      0.77      5677



In [None]:
cros_model = cross_val_score(model, dframe2, y, scoring = "recall", cv= 5, n_jobs = -1)
cros_model

array([0.07129278, 0.07414449, 0.06368821, 0.0769962 , 0.06844106])

In [None]:
cros_model = cross_val_score(model, dframe2, y, scoring = "precision", cv= 5, n_jobs = -1)
cros_model

array([0.77319588, 0.72222222, 0.67      , 0.77142857, 0.75      ])

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model2 = KNeighborsClassifier()
model2.fit(X_train,y_train)
model2.score(X_test,y_test)

0.8007750572485468

In [None]:
preds = model2.predict(X_test)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.82      0.96      0.89      4639
           1       0.31      0.07      0.12      1038

    accuracy                           0.80      5677
   macro avg       0.57      0.52      0.50      5677
weighted avg       0.73      0.80      0.75      5677



In [None]:
confusion_matrix(y_test,preds)

array([[4469,  170],
       [ 961,   77]])

In [None]:
from sklearn.metrics import recall_score
recall_score(y_test, preds)

0.07418111753371869

In [None]:
cros_model2 = cross_val_score(model2, dframe2, y, scoring = "recall", cv= 5, n_jobs = -1)
cros_model2

array([0.05608365, 0.07509506, 0.07224335, 0.07794677, 0.05798479])

In [None]:
cros_model2 = cross_val_score(model2, dframe2, y, scoring = "precision", cv= 5, n_jobs = -1)
cros_model2

array([0.26222222, 0.28832117, 0.25      , 0.3129771 , 0.2699115 ])

# Hyperparameter tuning

## Hyperparameter tuning for RandomForestClassifier

### RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV

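# Note: for classifiers, "auto" means the same as "sqrt";
# scikit-learn 1.3 and later no longer accept "auto".
rs_grid = {"n_estimators": [50, 200],
           "max_depth": [None, 10, 20],
           "min_samples_split": [2],
           "max_features": ["auto", "sqrt"],
           "min_samples_leaf": [1, 4]}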

clf = RandomForestClassifier()

rs_clf = RandomizedSearchCV(estimator = clf,
                            param_distributions = rs_grid,
                            n_jobs = -1,
                            n_iter = 10,
                            cv = 5,
                            verbose = 2,
                            random_state = 42,
                            refit = True)

rs_clf.fit(X_train,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  4.7min finished


RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
               

In [None]:
rs_clf.best_params_

{'max_depth': None,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 200}
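
Because the search runs with `refit = True`, the fitted search object already holds a model trained with the best parameters, and the parameter dictionary can be unpacked instead of retyped. A minimal sketch on synthetic data follows; the data, the tiny grid, and the variable names are assumptions chosen so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic classification problem, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": [10, 20], "max_depth": [None, 5]},
    n_iter=3,
    cv=3,
    random_state=42,
    refit=True,
)
search.fit(X_demo, y_demo)

# Option 1: the model refitted on all of X_demo with the best parameters.
best_model = search.best_estimator_

# Option 2: a fresh, unfitted model built from the winning parameters.
fresh_model = RandomForestClassifier(**search.best_params_)
print(search.best_params_)
```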

In [None]:
rs_clf.score(X_test,y_test)

0.8668310727496917

In [None]:
clf = RandomForestClassifier( max_depth = None,
                          max_features = 'sqrt',
                          min_samples_leaf= 1,
                          min_samples_split= 2,
                          n_estimators = 200)

In [None]:
clf.fit(X_train,y_train)
clf.score(X_test,y_test)

0.8689448652457283

In [None]:
cros_clf = cross_val_score(clf, dframe2, y, scoring = "recall", cv= 5, n_jobs = -1)
cros_clf

array([0.4134981 , 0.43821293, 0.43536122, 0.43631179, 0.37642586])

In [None]:
cros_clf = cross_val_score(clf, dframe2, y, scoring = "accuracy", cv= 5, n_jobs = -1)
cros_clf

array([0.86718337, 0.86489343, 0.86275546, 0.86698379, 0.86275546])

### GridSearchCV

In [None]:
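# Note: for classifiers, "auto" means the same as "sqrt";
# scikit-learn 1.3 and later no longer accept "auto".
gs_grid = {"n_estimators": [50, 100, 200],
           "max_depth": [10, 20],
           "max_features": ["auto", "sqrt"],
           "min_samples_split": [2],
           "min_samples_leaf": [1, 2, 4]}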

clf = RandomForestClassifier()

gs_clf = GridSearchCV(estimator = clf,
                      param_grid = gs_grid,
                      cv = 5,
                      n_jobs = -1,
                      verbose = 2,
                      refit = True
                     )

gs_clf.fit(X_train,y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed: 10.9min
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed: 13.0min finished
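
The candidate counts in the logs follow from the grids themselves. A grid search tries every combination, so its size is the product of the list lengths, while the randomized search samples `n_iter` combinations from its space. A quick check in plain Python, with both grids copied from the cells above:

```python
from math import prod

rs_grid = {"n_estimators": [50, 200],
           "max_depth": [None, 10, 20],
           "min_samples_split": [2],
           "max_features": ["auto", "sqrt"],
           "min_samples_leaf": [1, 4]}

gs_grid = {"n_estimators": [50, 100, 200],
           "max_depth": [10, 20],
           "max_features": ["auto", "sqrt"],
           "min_samples_split": [2],
           "min_samples_leaf": [1, 2, 4]}

rs_space = prod(len(v) for v in rs_grid.values())  # combinations available to sample
gs_space = prod(len(v) for v in gs_grid.values())  # combinations GridSearchCV tries
cv = 5
n_iter = 10

print(rs_space, n_iter * cv)   # size of the random-search space, fits actually run
print(gs_space, gs_space * cv) # grid-search candidates, fits actually run
```

With 5 folds, the 36 grid candidates account for the 180 fits reported by `GridSearchCV`, and the 10 sampled candidates account for the 50 fits reported by `RandomizedSearchCV`.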


GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rando

In [None]:
gs_clf.best_params_

{'max_depth': 20,
 'max_features': 'sqrt',
 'min_samples_leaf': 2,
 'min_samples_split': 2,
 'n_estimators': 200}

In [None]:
gs_clf.score(X_test,y_test)

0.8655980271270037

In [None]:
clf = RandomForestClassifier( max_depth = 20,
                          max_features = 'sqrt',
                          min_samples_leaf= 2,
                          min_samples_split= 2,
                          n_estimators = 200)

In [None]:
clf.fit(X_train,y_train)
clf.score(X_test,y_test)

0.8664787740003523

In [None]:
cros_clf = cross_val_score(clf, dframe2, y, scoring = "recall", cv= 5, n_jobs = -1)
cros_clf

array([0.40494297, 0.42775665, 0.42395437, 0.42585551, 0.38688213])

In [None]:
cros_clf = cross_val_score(clf, dframe2, y, scoring = "accuracy", cv= 5, n_jobs = -1)
cros_clf

array([0.86489343, 0.86542188, 0.86099366, 0.86980268, 0.86222692])

# Exporting the model

In [None]:
import pickle

filename = "finalized_model.save"
# Use a context manager so the file is closed after writing.
with open(filename, "wb") as f:
    pickle.dump(clf, f)

In [None]:
with open(filename, "rb") as f:
    loaded_model = pickle.load(f)

# The restored model is already fitted, so it is scored directly
# without calling fit() again.
result = loaded_model.score(X_test, y_test)
print(result)

0.8648934296283248
