# Credit Card Customer - EDA & Modelling

* [Import Data and Libraries](#import_data)
    * [Libraries](#Libraries)
    * [Import Data](#import_data)
    * [The Function](#function)
* [Data Visualization](#Data_Visualization)
* [Data Preprocessing](#Data_Preprocessing)
* [Prediction](#Prediction)
* [Conclusion](#Conclusion)
* [Feedback](#Feedback)


<a id="import_data"></a>
# Import data and Libraries

<a id="Libraries"></a>
### Libraries

In [None]:
#Libraries
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
np.set_printoptions(precision=4)
from sklearn.metrics import accuracy_score, recall_score, precision_score , confusion_matrix
import warnings
warnings.filterwarnings('ignore')

<a id="import_data"></a>
### Import data

In [None]:
df = pd.read_csv('/kaggle/input/credit-card-customers/BankChurners.csv')

# Drop unnecessary features: the two Naive_Bayes_Classifier_* columns and the CLIENTNUM identifier
df = df.drop([
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2',
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
    'CLIENTNUM',
], axis=1)

In [None]:
#check missing data
df.isnull().sum()

In [None]:
#all population
df_all = df.copy()

#churned population
df_churned = df[df['Attrition_Flag'] == "Attrited Customer"]

#non churned population
df_nonchurned = df[df['Attrition_Flag'] == "Existing Customer"]

<a id="function"></a>
### The Function

In [None]:
def plot_pie(column, title=""):
    data = df_all[column].value_counts()
    plt.pie(data,autopct='%1.2f%%',labels=data.index)
    plt.title(title)
    plt.show()
    
def plot_hist(column, title=""):
    plt.hist(df_all[column],density=True)
    plt.title(title)
    plt.show()

def plot_bar(column, sort=False, title=""):
    if sort:
        data_all = df_all[column].value_counts().sort_index()
    else:
        data_all = df_all[column].value_counts()
    plt.bar(data_all.index,data_all)
    plt.title(title)
    plt.show()
    
def plot_bar_compare(column, sort=False):
    if sort:
        data_churned = df_churned[column].value_counts().sort_index()
        data_nonchurned = df_nonchurned[column].value_counts().sort_index()
    else:
        data_churned = df_churned[column].value_counts()
        data_nonchurned = df_nonchurned[column].value_counts()
    
    fig,axs = plt.subplots(2,1)
    plt.subplots_adjust(left=0, bottom=0, right=1, top=2, wspace=0, hspace=0.2)
    axs[0].bar(data_nonchurned.index,data_nonchurned)
    axs[0].title.set_text('Existing')
    axs[1].bar(data_churned.index,data_churned)
    axs[1].title.set_text('Attrited')
    plt.show()

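def plot_hist_compare(column, bins=10):
    # bins is now passed through; the default of 10 matches matplotlib's own default,
    # so existing calls produce the same plots as before
    plt.hist([df_nonchurned[column], df_churned[column]], bins=bins, color=['c','r'])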
    plt.legend(('Existing Customer', 'Attrited Customer'))
    plt.show()
    
def plot_pie_compare(column):
    data_churned = df_churned[column].value_counts()
    data_nonchurned = df_nonchurned[column].value_counts()
    
    fig,axs = plt.subplots(2,1)
    plt.subplots_adjust(left=0, bottom=0, right=1, top=2, wspace=0, hspace=0.2)
    axs[0].pie(data_nonchurned,autopct='%1.2f%%',labels=data_nonchurned.index)
    axs[0].title.set_text('Existing')
    axs[1].pie(data_churned,autopct='%1.2f%%',labels=data_churned.index)
    axs[1].title.set_text('Attrited')
    plt.show()

def plot_boxplot(column, title=""):
    sns.boxplot(x="Attrition_Flag", y=column, palette=["c", "r"],
            hue="Attrition_Flag",  data=df_all).set_title(title, fontsize=15)
    plt.show()

def check_median(column):
    data_churned = df_churned[column].describe()
    data_nonchurned = df_nonchurned[column].describe()
    print('Existing Customer: {}'.format(data_nonchurned['50%']))
    print('Attrited Customer: {}'.format(data_churned['50%']))

def check_most(column):
    data_churned = df_churned[column].value_counts()
    data_nonchurned = df_nonchurned[column].value_counts()
    print('Existing Customer: {}'.format(data_nonchurned.index[0]))
    print('Attrited Customer: {}'.format(data_churned.index[0]))

<a id="Data_Visualization"></a>
# Data Visualization

### Attrition Flag

In [None]:
plot_pie("Attrition_Flag")

The majority of the customers are Existing Customers.
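The class balance shown in the pie chart can also be read off as plain numbers. The following is a minimal sketch on made-up toy data; the `flags` series is a hypothetical stand-in for `df['Attrition_Flag']`, and its 80/20 split is not the real dataset's ratio.

```python
import pandas as pd

# Toy stand-in for df['Attrition_Flag']; the values are made up for illustration.
flags = pd.Series(
    ["Existing Customer"] * 8 + ["Attrited Customer"] * 2,
    name="Attrition_Flag",
)

# normalize=True returns the share of each class instead of raw counts.
class_share = flags.value_counts(normalize=True)
print(class_share)
```

Knowing the exact share matters later: with an imbalanced target, accuracy alone can look good even when few churned customers are detected.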

### Customer_Age

In [None]:
plot_hist('Customer_Age', title="All")
plot_hist_compare('Customer_Age')

In [None]:
plot_boxplot("Customer_Age")

In [None]:
check_median('Customer_Age')

The median age is 46 for the Existing population and 47 for the Attrited population, and there is no clear difference in distribution between the two populations.

### Gender 

In [None]:
plot_pie('Gender',"All Population")
plot_pie_compare('Gender')

In [None]:
check_most('Gender')

The difference is very small: the majority of customers in both populations are female.

### Dependent count

In [None]:
plot_bar('Dependent_count', sort=True, title="All")

In [None]:
plot_bar_compare('Dependent_count', sort=True)

In [None]:
plot_boxplot('Dependent_count')

In [None]:
check_median('Dependent_count')

The median is the same (2) for both populations, and there is no clear difference in distribution between them.

### Education Level

In [None]:
plot_pie('Education_Level', title="All Population")
plot_pie_compare('Education_Level')

In [None]:
plot_bar_compare('Education_Level')

In [None]:
check_most('Education_Level')

There is no clear difference in distribution between the two populations.

### Marital Status

In [None]:
plot_pie('Marital_Status', title="All Population")
plot_pie_compare('Marital_Status')

In [None]:
plot_bar_compare('Marital_Status')

In [None]:
check_most('Marital_Status')

There is no clear difference in distribution between the two populations.

### Income Category

In [None]:
plot_pie('Income_Category', title="All Population")
plot_pie_compare('Income_Category')

In [None]:
plot_bar_compare('Income_Category')

In [None]:
check_most('Income_Category')

The most common income category is the same in both populations: "Less than $40K".

Beyond that, there is no clear difference in distribution between the two populations.

### Card Category

In [None]:
plot_pie('Card_Category', title="All Population")
plot_pie_compare('Card_Category')

In [None]:
plot_bar_compare('Card_Category')

In [None]:
check_most('Card_Category')

The majority of customers in both populations hold the "Blue" card.

Beyond that, there is no clear difference in distribution between the two populations.

### Months on book

In [None]:
plot_bar('Months_on_book', title="All Population")
plot_bar_compare('Months_on_book')

In [None]:
plot_boxplot('Months_on_book')

In [None]:
check_median('Months_on_book')

There is no clear difference in distribution between the two populations.

### Total Relationship Count

In [None]:
plot_bar('Total_Relationship_Count', title='All Population')
plot_bar_compare('Total_Relationship_Count')

In [None]:
plot_boxplot('Total_Relationship_Count')

In [None]:
check_median('Total_Relationship_Count')

The median for Attrited Customers is lower than for Existing Customers (3 versus 4), which suggests that Existing Customers tend to hold more products.

### Months Inactive 12 mon

In [None]:
plot_bar('Months_Inactive_12_mon', title='All')
plot_bar_compare('Months_Inactive_12_mon')

In [None]:
plot_boxplot('Months_Inactive_12_mon')

In [None]:
check_median('Months_Inactive_12_mon')

The majority of customers have 2-3 months of inactivity.

However, the Existing Customers include more members with shorter inactivity periods (median of 2 months versus 3 for Attrited Customers).

### Contacts Count 12 mon

In [None]:
plot_bar('Contacts_Count_12_mon', title='All')
plot_bar_compare('Contacts_Count_12_mon')

In [None]:
plot_boxplot('Contacts_Count_12_mon')

In [None]:
check_median('Contacts_Count_12_mon')

Attrited Customers have more contacts in the last 12 months than Existing Customers (median of 3 versus 2).

### Credit Limit

In [None]:
plot_hist('Credit_Limit', title='All')
plot_hist_compare('Credit_Limit')

In [None]:
plot_boxplot('Credit_Limit')

In [None]:
check_median('Credit_Limit')

There is no clear difference in distribution between the two populations.

### Total Revolving Bal

In [None]:
plot_hist('Total_Revolving_Bal',title='All')
plot_hist_compare('Total_Revolving_Bal')

In [None]:
plot_boxplot('Total_Revolving_Bal')

In [None]:
check_median('Total_Revolving_Bal')

Attrited Customers have a lower Total_Revolving_Bal (median of 0 versus 1364).

### Avg_Open_To_Buy

In [None]:
plot_hist('Avg_Open_To_Buy',title='All')
plot_hist_compare('Avg_Open_To_Buy')

In [None]:
plot_boxplot('Avg_Open_To_Buy')

In [None]:
check_median('Avg_Open_To_Buy')

There is no clear difference in distribution between the two populations.

### Total_Amt_Chng_Q4_Q1

In [None]:
plot_hist('Total_Amt_Chng_Q4_Q1', title='all')
plot_hist_compare('Total_Amt_Chng_Q4_Q1')

In [None]:
plot_boxplot('Total_Amt_Chng_Q4_Q1')

In [None]:
check_median('Total_Amt_Chng_Q4_Q1')

There is no clear difference in distribution between the two populations.

### Total_Trans_Amt

In [None]:
plot_hist('Total_Trans_Amt', title='All')
plot_hist_compare('Total_Trans_Amt')

In [None]:
plot_boxplot('Total_Trans_Amt')

In [None]:
check_median('Total_Trans_Amt')

Attrited Customers have a lower Total_Trans_Amt (median of 2329 versus 4100).

### Total_Trans_Ct

In [None]:
plot_hist('Total_Trans_Ct', title='All Population')
plot_hist_compare('Total_Trans_Ct')

In [None]:
plot_boxplot('Total_Trans_Ct')

In [None]:
check_median('Total_Trans_Ct')

Attrited Customers have a lower Total_Trans_Ct (median of 43 versus 71).

### Total_Ct_Chng_Q4_Q1

In [None]:
plot_hist('Total_Ct_Chng_Q4_Q1', title='all')
plot_hist_compare('Total_Ct_Chng_Q4_Q1')

In [None]:
plot_boxplot('Total_Ct_Chng_Q4_Q1')

In [None]:
check_median('Total_Ct_Chng_Q4_Q1')

Attrited Customers have a lower Total_Ct_Chng_Q4_Q1 (median of 0.531 versus 0.721).

### Avg_Utilization_Ratio

In [None]:
plot_hist('Avg_Utilization_Ratio', title='all')
plot_hist_compare('Avg_Utilization_Ratio')

In [None]:
plot_boxplot('Avg_Utilization_Ratio')

In [None]:
check_median('Avg_Utilization_Ratio')

Attrited Customers have a lower Avg_Utilization_Ratio (median of 0 versus 0.211).

### Result

| | Existing Customer | Attrited Customer |
| :- | :-: | :-: |
| Customer_Age (Median) | 46 | 47 |
| Gender (Most) | F | F |
| Dependent_count (Median) | 2 | 2 |
| Education_Level (Most) | Graduate | Graduate |
| Marital_Status (Most) | Married | Married |
| Income_Category (Most) | Less than 40K | Less than 40K |
| Card_Category (Most) | Blue | Blue |
| Months_on_book (Median) | 36 | 36 |
| Total_Relationship_Count (Median) | 4 | 3 |
| Months_Inactive_12_mon (Median) | 2 | 3 |
| Contacts_Count_12_mon (Median) | 2 | 3 |
| Credit_Limit (Median) | 4643.5 | 4178 |
| Total_Revolving_Bal (Median) | 1364 | 0 |
| Avg_Open_To_Buy (Median) | 3469.5 | 3488.0 |
| Total_Amt_Chng_Q4_Q1 (Median) | 0.743 | 0.701 |
| Total_Trans_Amt (Median) | 4100 | 2329 |
| Total_Trans_Ct (Median) | 71 | 43 |
| Total_Ct_Chng_Q4_Q1 (Median) | 0.721 | 0.531 |
| Avg_Utilization_Ratio (Median) | 0.211 | 0 |
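A table like the one above can also be assembled programmatically rather than by calling `check_median` and `check_most` column by column. The sketch below is one possible approach, run on made-up toy data; `summarize_by_group` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def summarize_by_group(frame, target, numeric_cols, categorical_cols):
    """Median of numeric columns and most frequent value of categorical columns, per group."""
    rows = {}
    for col in numeric_cols:
        rows[f"{col} (Median)"] = frame.groupby(target)[col].median()
    for col in categorical_cols:
        # mode() can return several values on ties; take the first one.
        rows[f"{col} (Most)"] = frame.groupby(target)[col].agg(lambda s: s.mode().iloc[0])
    # One row per feature, one column per group.
    return pd.DataFrame(rows).T

# Toy data for illustration only; not taken from BankChurners.csv.
toy = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer"] * 3 + ["Attrited Customer"] * 3,
    "Customer_Age": [40, 46, 50, 45, 47, 52],
    "Gender": ["F", "F", "M", "M", "F", "F"],
})

summary = summarize_by_group(toy, "Attrition_Flag", ["Customer_Age"], ["Gender"])
print(summary)
```

On the real data, the same call would take `df` together with lists of its numeric and categorical column names.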

<a id="Data_Preprocessing"></a>
# Data Preprocessing

In [None]:
X = df.copy()

y = X['Attrition_Flag']

#Drop the Attrition_Flag Column
X = X.drop(['Attrition_Flag'], axis=1)

In [None]:
# transform categorical data
X = pd.get_dummies(X, columns=[ 'Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category'])

In [None]:
X.isnull().sum()

In [None]:
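# Drop one dummy column per categorical feature; the dropped level acts as the
# reference category and removes the redundant column from each one-hot group
X = X.drop(columns=['Gender_F', 'Education_Level_College',
                    'Marital_Status_Divorced', 'Income_Category_Unknown', 'Card_Category_Blue'])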
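Dropping one dummy column per category by hand, as above, keeps control over which level is the reference. For comparison, `pd.get_dummies` can drop a level automatically with `drop_first=True`. The sketch below uses made-up toy data; note that `drop_first` removes the first level in sorted order, which may differ from the hand-picked reference levels in this notebook.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Card_Category": ["Blue", "Silver", "Blue"],
})

# drop_first=True removes the first level (in sorted order) of each encoded column.
encoded = pd.get_dummies(toy, columns=["Gender", "Card_Category"], drop_first=True)
print(encoded.columns.tolist())
```

Here `Gender_F` and `Card_Category_Blue` are dropped because "F" and "Blue" sort first within their columns.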

In [None]:
X.columns

In [None]:
#Split to data train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1234)
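Because the target is imbalanced, a purely random split can leave the test set with a slightly different churn rate than the training set. One common option, shown here as a sketch on synthetic data rather than as the notebook's method, is to pass `stratify=y` so that both parts keep the same class ratio. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 20% of them attrited (made-up ratio).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array(["Existing Customer"] * 80 + ["Attrited Customer"] * 20)

# stratify=y_toy keeps the class ratio equal in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1234, stratify=y_toy
)

test_share = (y_te == "Attrited Customer").mean()
train_share = (y_tr == "Attrited Customer").mean()
print(train_share, test_share)
```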

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
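The scaler above is fitted on the training data only and then applied to the test data, which avoids leaking test statistics into training. The same pattern can be packaged in a scikit-learn pipeline so the scaler and the model always travel together. This is a sketch on synthetic data, with `KNeighborsClassifier` chosen only as an example of a scale-sensitive model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the encoded feature matrix.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=1234)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1234)

# fit() learns the scaling statistics from the training data only;
# score() reuses them to transform the test data before predicting.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(round(test_accuracy, 3))
```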

In [None]:
X_test

<a id="Prediction"></a>
# Prediction

In [None]:
#Model Evaluation

from xgboost import XGBClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.inspection import permutation_importance

# Each entry is [model instance, display name]
classifiers = [
    [CatBoostClassifier(verbose=0), 'CatBoost Classifier'],
    [XGBClassifier(), 'XGB Classifier'],
    [RandomForestClassifier(), 'Random Forest'],
    [KNeighborsClassifier(), 'K-Nearest Neighbours'],
    [SGDClassifier(), 'SGD Classifier'],
    [SVC(), 'SVC'],
    [LGBMClassifier(), 'LGBM Classifier'],
    [GaussianNB(), 'GaussianNB'],
    [DecisionTreeClassifier(), 'Decision Tree Classifier'],
]
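Depending on the installed version, `XGBClassifier` may reject string class labels such as `'Attrited Customer'` and expect integers instead. If that happens, one possible workaround is to encode the target before fitting. The sketch below uses a hypothetical mapping, `label_map`, on made-up data; with this encoding, the `pos_label` passed to the metrics would be `1` rather than `'Attrited Customer'`.

```python
import pandas as pd

# Toy stand-in for y = df['Attrition_Flag']; values are made up.
y_toy = pd.Series(["Existing Customer", "Attrited Customer", "Existing Customer"])

# Hypothetical mapping: churn is treated as the positive class (1).
label_map = {"Existing Customer": 0, "Attrited Customer": 1}
y_encoded = y_toy.map(label_map)

print(y_encoded.tolist())
```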

In [None]:
for cls in classifiers:
    model = cls[0]
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    print(cls[1])
    print ('Confusion Matrix:')
    print(confusion_matrix(y_test, y_pred))
    print("Accuracy : ", accuracy_score(y_test, y_pred) *  100)
    print("Recall : ", recall_score(y_test, y_pred, pos_label = 'Attrited Customer') *  100)
    print("Precision : ", precision_score(y_test, y_pred, pos_label = 'Attrited Customer') *  100)
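Printed metrics are hard to compare across nine models. As a sketch of an alternative, the same loop can collect the scores into a DataFrame and sort them. The example runs on synthetic data with only two scikit-learn models standing in for the full list; the names `candidates` and `results` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data with the same string labels as the notebook.
X_toy, y_num = make_classification(
    n_samples=300, n_features=6, weights=[0.84], random_state=1234
)
y_toy = pd.Series(y_num).map({0: "Existing Customer", 1: "Attrited Customer"})
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1234, stratify=y_toy
)

candidates = [
    (GaussianNB(), "GaussianNB"),
    (DecisionTreeClassifier(random_state=1234), "Decision Tree"),
]

rows = []
for model, name in candidates:
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, y_pred),
        "recall": recall_score(y_te, y_pred, pos_label="Attrited Customer"),
        "precision": precision_score(
            y_te, y_pred, pos_label="Attrited Customer", zero_division=0
        ),
    })

# One row per model, sorted so the best recall on the churn class comes first.
results = pd.DataFrame(rows).set_index("model").sort_values("recall", ascending=False)
print(results)
```

Sorting by recall is a choice made for this sketch: in a churn setting, missing an attrited customer is often costlier than a false alarm, but another metric can be substituted.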

In [None]:
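# Permutation importance
# Re-split the unscaled DataFrame so that X_test keeps its column names for the plot labels.
# The same random_state gives the same split as before.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1234)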

model = CatBoostClassifier(verbose=0)
model.fit(X_train, y_train)

feature = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=1234, n_jobs=2)

idx_sort = feature.importances_mean.argsort()

fig, ax = plt.subplots(figsize=(15,15))
ax.boxplot(feature.importances[idx_sort].T,
           vert=False, labels=X_test.columns[idx_sort])
ax.set_title("Permutation Importances (test set)")
fig.tight_layout()
plt.show()
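Besides the box plot, the mean importances can be ranked in a small table, which makes it easier to name the top features. The sketch below uses synthetic data, and `RandomForestClassifier` is swapped in for CatBoost only to keep the example dependent on scikit-learn alone; the feature names are made up.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical feature names.
X_num, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=1234
)
X_toy = pd.DataFrame(X_num, columns=[f"feature_{i}" for i in range(5)])
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1234)

model = RandomForestClassifier(n_estimators=50, random_state=1234).fit(X_tr, y_tr)

# Shuffle each column of the test set in turn and measure the drop in score.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=1234)

# Rank features by mean importance, largest first.
ranking = pd.Series(result.importances_mean, index=X_te.columns).sort_values(ascending=False)
top_features = ranking.head(3)
print(top_features)
```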

<a id="Conclusion"></a>
# Conclusion

The best classifier is CatBoost, with 97.6% accuracy and 89% recall on the Attrited Customer class.

The five most influential features are "Total_Trans_Ct", "Total_Trans_Amt", "Total_Amt_Chng_Q4_Q1", "Total_Relationship_Count", and "Total_Revolving_Bal".

<a id="Feedback"></a>
# Feedback

Hi everyone, I'm new here and would appreciate your feedback on this kernel. Thank you!