# Predict whether a customer will churn or not
Customer churn, or attrition, in banking occurs when a customer stops using the bank's credit cards. This notebook covers the following tasks:

1. Data Visualisation of features
2. Data Cleaning and Preprocessing
3. Applying models (KNN, Random Forest, SVC, XGBoost, etc.) and choosing the model with the best score


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

First, let's look at the top 5 rows of the data.

In [None]:
data=pd.read_csv('/kaggle/input/credit-card-customers/BankChurners.csv')
data.head()

We do not need the client ID column (CLIENTNUM) for our analysis and prediction, so let's drop it. The last two columns are also unnecessary, so we drop them as well.

In [None]:
data=data.drop(['CLIENTNUM'],axis=1)
data = data.drop(['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1'], axis=1)
data = data.drop(['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'], axis=1)
data.head()

# Data Visualisation
 
Since our target column is 'Attrition_Flag', we will first look at the percentage of existing and attrited customers.

In [None]:
import matplotlib.pyplot as plt
# value_counts() sorts by frequency, so take the labels from its index
# instead of hard-coding them in a fixed order
count=data['Attrition_Flag'].value_counts()
plt.figure(figsize=(11,11))
plt.title("Percentage of Attrited Customer and Existing Customer")
plt.pie(x=count.tolist(),labels=count.index.tolist(),autopct='%.2f%%')


It turns out that the **ratio is highly imbalanced**. Existing customers account for a much higher percentage of the total data than attrited customers.

Let's check with respect to **gender (male and female)**.

In [None]:
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(15,15))

# Take the labels from the value_counts() index so each slice carries its own gender code
attrited_gender = data.loc[data["Attrition_Flag"] == "Attrited Customer", "Gender"].value_counts()
ax1.pie(x=attrited_gender.tolist(),labels=attrited_gender.index.tolist(),autopct='%.2f%%')
ax1.set_title('Gender vs Attrited Customer')

existing_gender=data.loc[data["Attrition_Flag"] == "Existing Customer", "Gender"].value_counts()
ax2.pie(x=existing_gender.tolist(),labels=existing_gender.index.tolist(),autopct='%.2f%%')
ax2.set_title('Gender vs Existing Customer')

In both cases, the male-to-female ratio is **roughly the same**.

In [None]:
import seaborn as sns
plt.figure(figsize=(28,11))
plt.title("Distribution of Age with respect to Churned or not")
sns.countplot(data=data,x="Customer_Age",hue="Attrition_Flag")

The age distribution is **roughly bell-shaped (approximately Gaussian)**, meaning that most customers are clustered around the mean age.


In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15, 15), sharey=True)
fig.suptitle('Group by Attrition Flag')
sns.countplot(x="Gender", hue = "Attrition_Flag",  data=data, ax=axes[0,0], palette="Set2")
axes[0,0].set_title("GENDER & ATTRITION FLAG")

sns.countplot(x="Income_Category", hue = "Attrition_Flag",  data=data, ax=axes[0,1], palette="Set2",order=data["Income_Category"].value_counts().index)
axes[0,1].set_title("INCOME CATEGORY & ATTRITION FLAG")

sns.countplot(x="Education_Level", hue = "Attrition_Flag",  data=data, ax=axes[1,0], palette="Set1",order=data["Education_Level"].value_counts().index)
axes[1,0].set_title("EDUCATION LEVEL & ATTRITION FLAG")

sns.countplot(x="Card_Category", hue = "Attrition_Flag",  data=data, ax=axes[1,1], palette="Set1",order=data["Card_Category"].value_counts().index)
axes[1,1].set_title("CARD CATEGORY & ATTRITION FLAG")

# Correlation using heatmap 

Correlation heatmaps show, in a visually appealing way, which variables are correlated, to what degree, and in which direction.

In [None]:
fig,ax=plt.subplots(figsize=(10,10))
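# Restrict to numeric columns; recent pandas versions raise an error on text columns in corr()
sns.heatmap(data.select_dtypes(include="number").corr(),annot=True)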

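From the heatmap, it can be seen that the column pairs **'Avg_Open_To_Buy' and 'Credit_Limit'**, **'Total_Revolving_Bal' and 'Avg_Utilization_Ratio'**, and **'Months_on_book' and 'Customer_Age'** are strongly positively correlated. Only the first pair comes close to a coefficient of 1; the other two are strong but clearly below it.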
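Reading pairs off a heatmap by eye gets harder as the number of columns grows. Here is a minimal sketch of listing strongly correlated pairs programmatically. The helper `high_corr_pairs`, the 0.9 threshold and the toy column names are assumptions made for illustration and are not part of the original notebook. The toy data mimics the situation where one column is almost another column minus a small balance.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Hypothetical toy data, not the real BankChurners values
rng = np.random.default_rng(0)
credit_limit = rng.uniform(1500, 35000, size=500)
revolving_bal = rng.uniform(0, 2500, size=500)
toy = pd.DataFrame({
    "credit_limit": credit_limit,
    "revolving_bal": revolving_bal,
    "open_to_buy": credit_limit - revolving_bal,
})

result = high_corr_pairs(toy)
print(result)
```

On the notebook's dataframe the same helper could be called as `high_corr_pairs(data)`. A pair with a coefficient near 1 carries almost the same information twice, so dropping one of the two columns is an option worth considering.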

We check if there are any null values in the dataframe.

In [None]:
data.isnull().any()

# Checking for Outliers and removing them

Using a **boxplot**, it is easy to see whether there are any outliers in the data. For example, outliers can be seen in the column "Total_Ct_Chng_Q4_Q1".


In [None]:
import seaborn as sns
sns.boxplot(x=data['Total_Ct_Chng_Q4_Q1'])

It can be seen from the plot that values above roughly 3 lie far from the bulk of the data, so we consider them outliers.

# Removing Outliers

Using the **z-score**, it is possible to check for outliers in the columns. We set a threshold (3 in this case) and keep only the rows whose absolute z-score is below 3, removing the rest.

In [None]:
#outlier cleanup
from scipy import stats
import numpy as np
columns = ["Customer_Age", 'Dependent_count', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
print(data.shape)
for column in columns : 
    z = np.abs(stats.zscore(data[column]))
    data=data[(z < 3)]
print(data.shape)
    
    

After removing outliers, the shape of the data frame is **(9317, 20)**, i.e. reduced from (10127, 20) in the original data.
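To make the z-score rule concrete, here is a small sketch on made-up numbers (the array `sample` and the helper `zscore_mask` are illustrative assumptions, not taken from the dataset). The z-score of a value is its distance from the column mean measured in standard deviations, and a row is kept only when the absolute z-score is below the threshold.

```python
import numpy as np

def zscore_mask(values, threshold=3.0):
    """Return a boolean mask that is True where |z-score| < threshold."""
    values = np.asarray(values, dtype=float)
    # Population standard deviation (ddof=0), the same default as scipy.stats.zscore
    z = (values - values.mean()) / values.std()
    return np.abs(z) < threshold

# Thirty ordinary values plus one extreme value
sample = np.array([0.5] * 10 + [0.6] * 10 + [0.7] * 10 + [5.0])
mask = zscore_mask(sample)
kept = sample[mask]

print(len(sample), "->", len(kept))
```

Note that the notebook's loop filters one column at a time, so each column's z-scores are computed on the rows that survived the previous columns. Computing all z-scores on the original data first and filtering once would be a slightly different rule and could keep a different number of rows.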

In [None]:
#boxplot after removing outlier
import seaborn as sns
sns.boxplot(x=data['Total_Ct_Chng_Q4_Q1'],color=sns.xkcd_rgb["pale red"])


Since the target column for prediction is "Attrition_Flag", we separate it from the rest of the dataframe.

In [None]:
X=data.drop("Attrition_Flag",axis=1)
Y=data["Attrition_Flag"]
Y.shape

# Data Preprocessing 

Since the predictive models require **numerical values**, the categorical columns need to be transformed. Hence **label encoding** is applied.

In [None]:
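from sklearn.preprocessing import LabelEncoder
categorical_col =['Gender','Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']
for cols in categorical_col :
    le=LabelEncoder()
    X[cols]=le.fit_transform(X[cols])
# Use a separate encoder for the target so its class mapping stays available
target_le=LabelEncoder()
Y=target_le.fit_transform(Y)
# Inspect the encoded feature frame (data itself is left unencoded)
X.info()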
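As a minimal sketch of what label encoding does to the target (the three sample values below are illustrative, not read from the CSV), `LabelEncoder` sorts the distinct labels and numbers them from 0. This means 'Attrited Customer' becomes 0 and 'Existing Customer' becomes 1, which matters when reading scores and the confusion matrix later.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative sample of target values
sample_flags = ["Existing Customer", "Attrited Customer", "Existing Customer"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sample_flags)

# classes_ holds the labels in sorted order; the position of a label is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(codes.tolist())
print(mapping)
```

Because the codes follow alphabetical order, they do not reflect any natural ranking in columns such as 'Income_Category' or 'Education_Level'. One-hot encoding or an explicit ordinal mapping are possible alternatives for those features.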

# Prediction Part 

We split the data into train and test sets. The predictive models are then fitted and compared, and we choose the one with the best score. The **model score**, the **cross-validation score** and the **ROC-AUC** score are calculated for each.

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=42)
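Because the target classes are imbalanced, a plain random split can leave the train and test sets with slightly different class ratios. One option, not used in the cell above, is the `stratify` argument of `train_test_split`. The sketch below uses a made-up target (`y_toy`, 16% positives) purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy target: 160 positives out of 1000 rows (16%)
y_toy = np.array([1] * 160 + [0] * 840)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class ratio (almost) identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

train_share = y_tr.mean()
test_share = y_te.mean()
print(round(train_share, 3), round(test_share, 3))
```

In the notebook this would amount to adding `stratify=Y` to the existing call. Doing so would change the split and therefore the scores reported further down.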

In [None]:
# Only the imports that the comparison loop below actually uses
from sklearn.metrics import roc_auc_score
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

classifiers = [[XGBClassifier(),'XGB Classifier'], [RandomForestClassifier(),'Random Forest'], 
    [KNeighborsClassifier(), 'K-Nearest Neighbours'], [SGDClassifier(),'SGD Classifier'], [SVC(),'SVC']]
score_list=[]
roc_auc_list=[]
cross_val_list=[]

for classifier in classifiers :
    model=classifier[0]
    model.fit(X_train,Y_train)
    model_name=classifier[1]
    prediction=model.predict(X_test)
    
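    scores=model.score(X_test,Y_test)
    # Cross-validate on the training split so the test set stays held out
    cross_val=cross_val_score(model,X_train,Y_train).mean()
    # Note: this ROC AUC is computed from hard class predictions, not probabilities
    roc_auc = roc_auc_score(Y_test, prediction)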
    
    score_list.append(scores)
    cross_val_list.append(cross_val)
    roc_auc_list.append(roc_auc)
    
    print(model_name,"Score :"+str(round(scores*100,2))+'%')
    print(model_name,"Cross Validation Score :"+str(round(cross_val*100,2))+'%')
    print(model_name,"ROC AUC score:"+str(round(roc_auc*100,2))+'%')
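With an imbalanced target, accuracy alone can look flattering, so the cross-validation and ROC-AUC numbers deserve some care. The following is a minimal sketch on synthetic data from `make_classification`, not on the bank dataset; all variable names and parameter values here are assumptions. It cross-validates on the training split only and computes ROC-AUC from predicted probabilities rather than from hard class labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for the real features and target
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.84, 0.16], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=50, random_state=42)

# 5-fold cross-validation on the training data; the test split is never touched here
cv_auc = cross_val_score(clf, X_tr, y_tr, cv=5, scoring="roc_auc").mean()

# Fit once on the full training split, then score the held-out test split
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class
test_auc = roc_auc_score(y_te, proba)

print("CV ROC AUC:", round(cv_auc, 3))
print("Test ROC AUC:", round(test_auc, 3))
```

Not every classifier exposes `predict_proba` by default. For `SVC()` and `SGDClassifier()` with default settings, `decision_function` scores can be passed to `roc_auc_score` instead.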

# Best Score

From the above output, it can be seen that **XGBoost** performs the best with a model score of **97.48%**. The **Random Forest classifier** performs comparably well with a score of **96.41%**.

So we use XGBoost and plot the confusion matrix for this model's prediction.

In [None]:
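from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
model = XGBClassifier()
model.fit(X_train, Y_train)
pred = model.predict(X_test)
ConfusionMatrixDisplay.from_estimator(model, X_test, Y_test)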

I hope this notebook was useful for Kaggle learners. I appreciate all feedback, and if you think my code can be improved, please let me know in the comments.