# **Who is going to leave the bank?**

<img src="https://media.giphy.com/media/3ov9jWgOYIJ9k5Elyw/giphy.gif">

**In this notebook, I analyze the dataset with seaborn to decide which columns can be dropped, which reduces the amount of work my models have to do. I then apply Machine Learning and Deep Learning methods to it. Let's see how it turns out.**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/credit-card-customers/BankChurners.csv')

In [None]:
df.head()

In [None]:
df = df.drop(['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2',
             'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1'], axis=1)

In [None]:
print(df.shape)
df.head()

Before we start, I want to see whether any of the columns are unnecessary and can be dropped. I will look at how some columns affect the important ones. For me, the important columns include *Income, Credit Limit, Amount of change between Q1 and Q4, and Total transaction amount*.

In [None]:
sns.set(rc={'figure.figsize':(15,7)})
sns.barplot(x="Education_Level", y="Total_Trans_Amt", hue='Attrition_Flag', data=df)

Here we see that *Education Level* does not noticeably affect *Total Transaction Amount* or *Attrition.*
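The bar plot above can be backed up with numbers. This is a minimal sketch, assuming a small made-up frame called `toy` (the column names mirror the dataset, the values are invented for illustration). It computes the mean transaction amount per group, which is the quantity each bar shows by default.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values are invented.
toy = pd.DataFrame({
    "Education_Level": ["Graduate", "Graduate", "High School", "High School"],
    "Attrition_Flag": ["Existing Customer", "Attrited Customer",
                       "Existing Customer", "Attrited Customer"],
    "Total_Trans_Amt": [4000, 3000, 4200, 3100],
})

# Mean transaction amount per (education level, attrition) pair.
group_means = (
    toy.groupby(["Education_Level", "Attrition_Flag"])["Total_Trans_Amt"]
       .mean()
       .unstack("Attrition_Flag")
)
print(group_means)
```

On the real data, the same call on `df` would give one row per education level, so similar rows would support dropping the column.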

In [None]:
sns.barplot(x="Card_Category", y="Total_Trans_Amt", hue='Attrition_Flag', data=df)

In [None]:
sns.barplot(x="Attrition_Flag", y="Dependent_count", data=df)

There is a slight difference between the groups, but since we have enough variables, *dependent count* can be dropped.

In [None]:
sns.barplot(x="Months_on_book", y="Total_Trans_Amt", hue='Attrition_Flag', data=df)

We can say that newer members spend more money. But membership length does not affect churn: the pattern is inconsistent across membership lengths.

In [None]:
sns.barplot(x="Marital_Status", y="Total_Trans_Amt", hue='Attrition_Flag', data=df)
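The plots show mean spending, but the share of customers who left in each group is also easy to tabulate. This is a minimal sketch, assuming a made-up frame called `toy` (the rows are invented, only the column names come from the dataset). It is an optional extra check, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data: 4 married and 4 single customers, one churner in each group.
toy = pd.DataFrame({
    "Marital_Status": ["Married"] * 4 + ["Single"] * 4,
    "Attrition_Flag": ["Existing Customer"] * 3 + ["Attrited Customer"]
                      + ["Existing Customer"] * 3 + ["Attrited Customer"],
})

# normalize="index" turns counts into row-wise shares, i.e. the churn rate per group.
rates = pd.crosstab(toy["Marital_Status"], toy["Attrition_Flag"], normalize="index")
print(rates)
```

If the attrited share is about the same in every row, the column says little about churn.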

In [None]:
sns.set(rc={'figure.figsize':(25,6)})
sns.barplot(x="Customer_Age", y="Total_Trans_Amt", hue='Attrition_Flag', data=df, palette=["C0", "C1", "k"])

So far we are dropping *Client number, Dependent count, Education level, Marital status and Months on book*, because they did not affect attrition. Client number is just an identifier, so it is not important for us either.

Let's see how our data looks after we drop these columns.

In [None]:
df = df.drop(['CLIENTNUM', 'Dependent_count', 'Education_Level', 'Marital_Status', 'Months_on_book'], axis=1)
df.head()

I split my analysis into two parts because there are many columns in the data. Now we can continue.

In [None]:
sns.set(rc={'figure.figsize':(5,5)})
sns.barplot(x="Attrition_Flag", y="Months_Inactive_12_mon", hue='Gender', data=df)

In [None]:
sns.barplot(x="Attrition_Flag", y="Total_Relationship_Count", data=df)

In [None]:
sns.barplot(x="Attrition_Flag", y="Contacts_Count_12_mon", data=df)

In [None]:
sns.barplot(x="Attrition_Flag", y="Total_Revolving_Bal", data=df)

In [None]:
sns.barplot(x="Attrition_Flag", y="Credit_Limit", data=df)

In [None]:
sns.barplot(x="Attrition_Flag", y="Avg_Open_To_Buy", data=df)

In [None]:
sns.barplot(x="Attrition_Flag", y="Total_Amt_Chng_Q4_Q1", data=df)

In [None]:
sns.barplot(x="Attrition_Flag", y="Total_Trans_Amt", hue='Income_Category', data=df)

In [None]:
sns.barplot(x="Attrition_Flag", y="Total_Trans_Ct", data=df)

In [None]:
sns.barplot(x="Attrition_Flag", y="Avg_Utilization_Ratio", data=df)

*Number of inactive months, total relationship count, contact count, revolving balance, amount of change between quarters, transaction amount and count, and average utilization ratio* are valuable information for this problem. The rest we can drop.
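One rough way to rank numeric columns is to compare their means between the two attrition groups. This is a minimal sketch under stated assumptions: the `toy` frame and its values are invented, and the "relative gap" score is a hypothetical heuristic used only for illustration, not a method from this notebook.

```python
import pandas as pd

# Hypothetical toy data: transaction count differs between groups, credit limit does not.
toy = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer"] * 3 + ["Attrited Customer"] * 3,
    "Total_Trans_Ct": [70, 60, 80, 40, 50, 30],
    "Credit_Limit": [5000, 9000, 7000, 6000, 8000, 7000],
})

# Mean of every numeric column for each attrition group.
means = toy.groupby("Attrition_Flag").mean(numeric_only=True)

# Absolute gap between the two group means, scaled by the average of the group means.
rel_gap = (
    (means.loc["Existing Customer"] - means.loc["Attrited Customer"]).abs()
    / means.mean()
)
print(rel_gap.sort_values(ascending=False))
```

Columns with a gap near zero behave the same in both groups, which matches the visual argument made with the bar plots.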

In [None]:
df = df.drop(['Gender', 'Income_Category', 'Credit_Limit', 'Avg_Open_To_Buy', 'Total_Ct_Chng_Q4_Q1'], axis=1)

In [None]:
df.head()

Every variable except *Attrition Flag* and *Card Category* is numeric. We need to convert these non-numeric categories to numeric ones before modeling.

In [None]:
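# dtype=int gives 0/1 integer columns (recent pandas versions return booleans by default)
customer = pd.get_dummies(df['Attrition_Flag'], drop_first=True, dtype=int)
card = pd.get_dummies(df['Card_Category'], drop_first=False, dtype=int)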
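To make the effect of `drop_first` concrete, here is a minimal sketch on a made-up three-row frame (the rows are invented). Passing `dtype=int` is an addition in this sketch so the output is 0/1 integers rather than booleans.

```python
import pandas as pd

# Hypothetical toy data with the two categorical columns.
toy = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"],
    "Card_Category": ["Blue", "Gold", "Blue"],
})

# drop_first=True removes the first category in sorted order ("Attrited Customer"),
# leaving a single column named "Existing Customer".
flag = pd.get_dummies(toy["Attrition_Flag"], drop_first=True, dtype=int)

# drop_first=False keeps one column per card category.
card_toy = pd.get_dummies(toy["Card_Category"], drop_first=False, dtype=int)

print(flag.columns.tolist())
print(card_toy.columns.tolist())
```

This is why the target column later in the notebook is called `Existing Customer`: a value of 1 means the customer stayed.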

In [None]:
df = pd.concat([df, customer, card], axis=1)
df = df.drop(['Attrition_Flag', 'Card_Category'], axis=1)

In [None]:
df.head()

**Now, we are ready.**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [None]:
x = df.drop(['Existing Customer'], axis=1)
y = df['Existing Customer']

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
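If one class is much rarer than the other, a plain random split can leave the two sets with slightly different class ratios. This is a minimal sketch of a stratified split on made-up labels. `stratify` is an existing `train_test_split` parameter, but using it here is a suggestion and not something the split above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 ones and 20 zeros.
y_toy = np.array([1] * 80 + [0] * 20)
x_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 80/20 class ratio in both the train and test sets.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```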

**LOGISTIC REGRESSION**

In [None]:
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
logistic.fit(x_train, y_train)
prediction_lr = logistic.predict(x_test)
print(classification_report(y_test,prediction_lr))

**DECISION TREE CLASSIFIER**

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(x_train, y_train)
prediction_dt = tree.predict(x_test)
print(classification_report(y_test, prediction_dt))

**RANDOM FOREST CLASSIFIER**

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier()
forest.fit(x_train, y_train)
prediction_rf = forest.predict(x_test)
print(classification_report(y_test, prediction_rf))

**XGBOOST**

In [None]:
import xgboost
xgb = xgboost.XGBClassifier()
xgb.fit(x_train,y_train)
prediction_xgb = xgb.predict(x_test)
print(classification_report(y_test, prediction_xgb))
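The model cells above all repeat the same fit, predict and score steps, so they can also be written as one loop. This is a minimal sketch on synthetic data from `make_classification`, not the churn dataset. XGBoost is left out so the sketch depends only on scikit-learn, and `max_iter=1000` is an assumption added to help the logistic regression converge.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data (13 features, like the real feature matrix).
X_demo, y_demo = make_classification(n_samples=500, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=42),
    "forest": RandomForestClassifier(random_state=42),
}

# Fit each model and record its accuracy on the held-out set.
scores = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, clf.predict(X_te))

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```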

We had very successful results with Decision Tree, Random Forest and XGBoost, up to **96% accuracy**. I want to try Neural Networks just for the fun of it.

**DNN**

In [None]:
df.shape

In [None]:
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.layers import Dropout

In [None]:
model = Sequential([
    Dense(32, activation='relu', input_dim=13),
    Dropout(0.5),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
model.summary()
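The parameter counts that `model.summary()` reports can be checked by hand. A dense layer has one weight per input-unit pair plus one bias per unit, and dropout has no trainable parameters. This is a small sketch of that arithmetic; `dense_params` is a helper name made up for this example.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# (inputs, units) for the three Dense layers in the model above.
layer_sizes = [(13, 32), (32, 32), (32, 1)]
per_layer = [dense_params(n_in, n_out) for n_in, n_out in layer_sizes]
total = sum(per_layer)
print(per_layer, total)
```

The three layers come to 448, 1056 and 33 parameters, 1537 in total.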

In [None]:
model.fit(x_train, y_train, batch_size=32, epochs=50,verbose=0)

In [None]:
prediction_nn = model.predict(x_test)
# Turn sigmoid probabilities into class labels with a 0.5 threshold
prediction_nn = [1 if p >= 0.5 else 0 for p in prediction_nn]
print(classification_report(y_test, prediction_nn))
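The thresholding step can also be done with NumPy instead of a list comprehension. This is a minimal sketch on made-up probabilities, shaped `(n_samples, 1)` the way a single sigmoid output is returned.

```python
import numpy as np

# Hypothetical sigmoid outputs, one row per sample.
probs = np.array([[0.91], [0.12], [0.50], [0.49]])

# Flatten to 1-D, then label a sample 1 when its probability is at least 0.5.
labels = (probs.ravel() >= 0.5).astype(int)
print(labels.tolist())
```

Both versions give the same labels. The vectorized form just avoids looping in Python.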

I have used ***logistic regression, decision tree, random forest, XGBoost and neural network*** approaches with this dataset. I was able to achieve ***96% accuracy*** with XGBoost, which is pretty good. Overall, every model performed well, but some of them were more suitable for this problem.

**We have come to the end of the notebook. Thank you for sticking with me this far! I hope it was a good experience for you.**

<img src="https://media.giphy.com/media/xUPOqo6E1XvWXwlCyQ/giphy.gif">