**Problem Statement:**
 To decrease the Customer Churning, a key goal for any business. Predicting Customer Churn represents an additional potential revenue source for any business. Higher Customer Churn leads to loss in revenue.
This would help the bank to have the right engagement with customers at the right time.

**Objective:**
Our objective is to build a machine learning model to predict whether the customer will churn or not in the next six months.

In [1]:
#install missing libraries 
#!pip install pandas_profiling --upgrade
#!pip install catboost

In [2]:
#import the libraries
import pandas as pd
import pandas_profiling
from pandas_profiling import ProfileReport
import numpy as np
import missingno as msno
pd.set_option("display.max_columns", 100)

# Importing visualization packages
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score, accuracy_score
from sklearn.metrics import f1_score, classification_report
import warnings
import requests
warnings.filterwarnings("ignore")
from pylab import rcParams
rcParams['figure.figsize'] = 14, 8
RANDOM_SEED = 42
pd.set_option('display.max_columns', 500)

In [3]:
#import datasets for jupyter notebook
train = pd.read_csv("/content/drive/MyDrive/data/womens_hackathon_AV/train_PDjVQMB.csv")
test = pd.read_csv("/content/drive/MyDrive/data/womens_hackathon_AV/test_lTY72QC.csv")
submission = pd.read_csv("/content/drive/MyDrive/data/womens_hackathon_AV/sample_OoSmYo5.csv")

In [4]:
train.head(5)

Unnamed: 0,ID,Age,Gender,Income,Balance,Vintage,Transaction_Status,Product_Holdings,Credit_Card,Credit_Category,Is_Churn
0,84e2fcc9,36,Female,5L - 10L,563266.44,4,0,1,0,Average,1
1,57fea15e,53,Female,Less than 5L,875572.11,2,1,1,1,Poor,0
2,8df34ef3,35,Female,More than 15L,701607.06,2,1,2,0,Poor,0
3,c5c0788b,43,Female,More than 15L,1393922.16,0,1,2,1,Poor,1
4,951d69c4,39,Female,More than 15L,893146.23,1,1,1,1,Good,1


In [5]:
test.head(5)

Unnamed: 0,ID,Age,Gender,Income,Balance,Vintage,Transaction_Status,Product_Holdings,Credit_Card,Credit_Category
0,55480787,50,Female,More than 15L,1008636.39,2,1,2,1,Average
1,9aededf2,36,Male,5L - 10L,341460.72,2,0,2,1,Average
2,a5034a09,25,Female,10L - 15L,439460.1,0,0,2,1,Good
3,b3256702,41,Male,Less than 5L,28581.93,0,1,2,1,Poor
4,dc28adb5,48,Male,More than 15L,1104540.03,2,1,3+,0,Good


In [6]:
train.describe()

Unnamed: 0,Age,Balance,Vintage,Transaction_Status,Credit_Card,Is_Churn
count,6650.0,6650.0,6650.0,6650.0,6650.0,6650.0
mean,41.130226,804595.4,2.250226,0.515789,0.664361,0.231128
std,9.685747,515754.9,1.458795,0.499788,0.472249,0.421586
min,21.0,63.0,0.0,0.0,0.0,0.0
25%,34.0,392264.2,1.0,0.0,0.0,0.0
50%,40.0,764938.6,2.0,1.0,1.0,0.0
75%,47.0,1147124.0,3.0,1.0,1.0,0.0
max,72.0,2436616.0,5.0,1.0,1.0,1.0


In [7]:
profile = ProfileReport(train)
profile

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]



**Observation from Profile report**

1. There is no missing value in the dataset
2. There is no duplicate rows in dataset.
3. "Income", "Product_Holdings", "Credit_Category" these variables have values of multiple groups/ types. We will go for ordinal encoding for these variables in the pre-processing steps.
4. Target column(Is_Churn) is im-balanced. Hence we will need to undergo sampling for model building. Class 0 = 5113, Class 1 = 1537

Hence, we can start exploring our dataset

In [8]:
#Drop ID column from both train & test dataset
train.drop('ID',axis='columns', inplace= True)
df_test = test.drop(['ID'], axis=1)

In [9]:
train.columns

Index(['Age', 'Gender', 'Income', 'Balance', 'Vintage', 'Transaction_Status',
       'Product_Holdings', 'Credit_Card', 'Credit_Category', 'Is_Churn'],
      dtype='object')

In [10]:
#Encoding Categorical variables. 
encoder = LabelEncoder()
train["Gender"] = encoder.fit_transform(train["Gender"])
df_test["Gender"] = encoder.fit_transform(df_test["Gender"])

As "Income", "Product_Holdings", "Credit_Category" have more than 2 categories, hence opted for ordinal encoding

In [11]:
or_encoder = OrdinalEncoder()
oe_col = ["Income", "Product_Holdings", "Credit_Category"]

def ordinal_encode(df, column):
    df[column] = or_encoder.fit_transform(df[column])
    return df

train=ordinal_encode(train, oe_col)
df_test=ordinal_encode(df_test, oe_col)

In [12]:
# Outlier Observation Analysis
for feature in train[['Balance', 'Vintage','Transaction_Status']]:
    
    Q1 = train[feature].quantile(0.25)
    Q3 = train[feature].quantile(0.75)
    IQR = Q3-Q1
    lower = Q1- 1.5*IQR
    upper = Q3 + 1.5*IQR
    
    if train[(train[feature] > upper)].any(axis=None):
        print(feature,"yes")
    else:
        print(feature, "no")

Balance yes
Vintage no
Transaction_Status no


##Model building

In [13]:
#Split dataset as feature variables & target variable
y= train.Is_Churn
X = train.drop("Is_Churn", axis = 1)

In [14]:
## FEature Scaling with Minmax Scaler
# Standarscaler
#scaler = StandardScaler()

#Minmax Scaler
scaler = MinMaxScaler()
# all columns to all_cols
all_cols = X.columns
test_cols = df_test.columns

# fit scaler
X_scaled = scaler.fit_transform(X[all_cols])
test_scaled = scaler.fit_transform(df_test[test_cols])

In [15]:
# Spltting the into 80:20 train test size
X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, test_size=0.2, random_state = 42)

In [16]:
X_train.shape

(5320, 9)

As a part of model building we are implementing below models:
1. Random Forest Classifier
2. Catboost Algorithm
3. Artificial Neural Network
4. Extra Tree Classifier
5. Light GBM Classifier

As a Base model, First applied random forest classifier on an unsampled dataset.
Later, Performed SMOTEEN for oversampling and have built the models and checked for macro f1 score.

In [17]:
# Building and fitting Random Forest
model_RF = RandomForestClassifier()

In [18]:
model_RF.fit(X_train, y_train)
y_pred_rf = model_RF.predict(X_val)

In [19]:
preds = model_RF.predict(test_scaled)

In [20]:
fmod_acc_rf = (f1_score(y_val, y_pred_rf, average='macro'))*100
print("F1 score for the Best Model is:", fmod_acc_rf)

F1 score for the Best Model is: 50.15526287611094


In [21]:
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

#sm = SMOTE(random_state=0)
sm = SMOTEENN()
X_train_smote, y_train_smote = sm.fit_resample(X_scaled, y)

In [22]:
# Spltting the into 80:20 train test size
X_tr_smote, X_val_smote, y_tr_smote, y_val_smote = train_test_split(X_train_smote, y_train_smote, test_size=0.2, random_state = 42)

In [23]:
#train RF model 
model_RF.fit(X_tr_smote, y_tr_smote)

RandomForestClassifier()

In [24]:
y_pred_smote = model_RF.predict(X_val_smote)
test_smote = model_RF.predict(test_scaled)

In [25]:
fmod_acc_rf_smote = (f1_score(y_val_smote, y_pred_smote, average='macro'))*100
print("F1 score for the Best Model is:", fmod_acc_rf_smote)

F1 score for the Best Model is: 90.0111507911167


In [26]:
test['Is_Churn'] = test_smote
test[['ID', 'Is_Churn']].to_csv('final_submission_rf.csv', index=False)

In [27]:
#Building model with Catboost
import catboost
classifier = catboost.CatBoostClassifier(verbose=1)

In [28]:
classifier.fit(X_tr_smote, y_tr_smote)

Learning rate set to 0.018833
0:	learn: 0.6873354	total: 56.7ms	remaining: 56.6s
1:	learn: 0.6808943	total: 73.3ms	remaining: 36.6s
2:	learn: 0.6751983	total: 92.4ms	remaining: 30.7s
3:	learn: 0.6701993	total: 105ms	remaining: 26s
4:	learn: 0.6648975	total: 112ms	remaining: 22.3s
5:	learn: 0.6598785	total: 124ms	remaining: 20.6s
6:	learn: 0.6556561	total: 140ms	remaining: 19.9s
7:	learn: 0.6508028	total: 147ms	remaining: 18.2s
8:	learn: 0.6456379	total: 152ms	remaining: 16.7s
9:	learn: 0.6411740	total: 161ms	remaining: 15.9s
10:	learn: 0.6368221	total: 173ms	remaining: 15.6s
11:	learn: 0.6325462	total: 187ms	remaining: 15.4s
12:	learn: 0.6291737	total: 192ms	remaining: 14.6s
13:	learn: 0.6254658	total: 200ms	remaining: 14.1s
14:	learn: 0.6222923	total: 208ms	remaining: 13.7s
15:	learn: 0.6185763	total: 217ms	remaining: 13.3s
16:	learn: 0.6146587	total: 225ms	remaining: 13s
17:	learn: 0.6116429	total: 237ms	remaining: 12.9s
18:	learn: 0.6079290	total: 241ms	remaining: 12.5s
19:	learn: 0

<catboost.core.CatBoostClassifier at 0x7f8a1d1182d0>

In [29]:
y_cat_pred = classifier.predict(X_val_smote)
test_cat = classifier.predict(test_scaled)

In [30]:
fmod_acc_cat = (f1_score(y_val_smote, y_cat_pred, average='macro'))*100
print("F1 score for the Best Model is:", fmod_acc_cat)

F1 score for the Best Model is: 88.26289153731418


In [31]:
#Using Tensorflow backend for building model with ANN
import tensorflow as tf
from tensorflow import keras
import keras
from keras.models import Sequential 
from keras.layers import Dense
from keras.layers import Dropout
from sklearn.metrics import classification_report

In [32]:
from sklearn.utils import class_weight
from keras.layers import LeakyReLU

In [33]:
model = Sequential()
model.add(Dense(100, activation=LeakyReLU(alpha=0.3)))
model.add(Dense(32,activation='sigmoid'))
model.add(Dense(1,activation='sigmoid'))

In [34]:
model.compile(optimizer='Adam',loss='binary_crossentropy',metrics=['accuracy'])

In [35]:
history = model.fit(X_tr_smote,y_tr_smote,batch_size=32,epochs=50,verbose=1,validation_split=0.2)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


In [36]:
y_eval = model.predict(X_val_smote)
y_eval1 = np.where(y_eval > 0.5, 1, 0)

In [37]:
y_pred = model.predict(test_scaled)
y_pred1 = np.where(y_pred > 0.5, 1, 0)

In [38]:
test['Is_Churn'] = y_pred

In [39]:
test['Is_Churn'] = np.where(y_pred > 0.5, 1, 0)

In [40]:
test[['ID', 'Is_Churn']].to_csv('final_submission_ann_smoteen.csv', index=False)

In [41]:
#con = tf.math.confusion_matrix(labels = y_val_smote, predictions = y_pred1)
fmod_acc_ann = (f1_score(y_val_smote, y_eval1, average='macro'))*100
print("F1 score for the Best Model is:", fmod_acc_ann)

F1 score for the Best Model is: 73.19851670766866


In [42]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

from keras.models import Sequential
from keras.layers import Dense
def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(output_dim = 6, init = "uniform", activation = 'relu', input_dim = 9))
    classifier.add(Dense(output_dim = 6, init = "uniform", activation = "relu"))
    classifier.add(Dense(output_dim = 1, init = "uniform", activation = "sigmoid"))
    classifier.compile(optimizer = keras.optimizers.Adam(lr = 0.001), loss = "binary_crossentropy", metrics = ["accuracy"])
    return classifier
classifier = KerasClassifier(build_fn = build_classifier, batch_size = 10, epochs = 100)
accuracies = cross_val_score(estimator = classifier, X = X_tr_smote, y = y_tr_smote, cv = 10)

display(accuracies.mean())

nan

In [43]:
accuracies.mean()

nan

In [44]:
#Building model with Extra tree Classifier & Light GBM
from lightgbm import LGBMClassifier
from sklearn.ensemble import ExtraTreesClassifier

In [114]:
et_clf = ExtraTreesClassifier(n_estimators = 90, max_features = 1.0, criterion= 'entropy', max_depth=20, n_jobs= 5, random_state= 525)

In [115]:
et_clf.fit(X_tr_smote, y_tr_smote)
y_et_smote = et_clf.predict(X_val_smote)
et_test_smote = et_clf.predict(test_scaled)

In [116]:
fmod_acc_et = (f1_score(y_val_smote, y_et_smote, average='macro'))*100
print("F1 score for the Best Model is:", fmod_acc_et)

F1 score for the Best Model is: 91.72951972407508


In [98]:
test['Is_Churn'] = et_test_smote
test[['ID', 'Is_Churn']].to_csv('final_submission_extra_tree.csv', index=False)

In [60]:
lgbm_Model = LGBMClassifier(learning_rate=0.1, max_depth=30, n_estimators=100, n_jobs=-2, random_state=10, 
                             num_iterations=100 , boosting_type='gbdt', num_leaves= 45)
Classifier = lgbm_Model.fit(X_tr_smote, y_tr_smote)
fmod_pred = lgbm_Model.predict(X_val_smote)
fmod_acc_lgbm = (f1_score(y_val_smote, fmod_pred, average='macro'))*100
print("F1 score for the Best Model is:", fmod_acc_lgbm)

F1 score for the Best Model is: 89.81792521053491


In [50]:
test_lgbm = lgbm_Model.predict(test_scaled)

In [51]:
test['Is_Churn'] = test_lgbm
test[['ID', 'Is_Churn']].to_csv('final_submission_lgbm.csv', index=False)

**Conclusion**: 

Extra Tree Classifier worked for me. It is giving an F1 score of 91.72%.