# What will you find here?

* Exploring and visualising the data
* Transforming the data to build better models
* Comparing the results to choose the best model easily
* Trying dimensionality reduction and scaling
* Mixing PCA components with the original features
* Explanations of the correlations and other plots

This notebook is an extended version of the original posted on Kaggle. You can find it [here](https://www.kaggle.com/alpertml/credit-card-customers-eda-ml-97-5-accuracy).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Importing necessary packages

import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns
plt.style.use('ggplot') # default plot style.

from scipy import stats
from scipy.stats import norm


from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer 

import pickle
import math
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Adjusting the plotting style
sns.set(color_codes=True)

# What do we have? Let's see

We have a single dataset with 10127 observations and 21 features. Our target column is the binary **Attrition_Flag**, which we will try to predict. The dataset does not include any missing values (NaN/Null), which is good news, and it consists mostly of numerical data.

In [None]:
full_df = pd.read_csv('/kaggle/input/credit-card-customers/BankChurners.csv')

display(full_df.shape)
# display 5 random samples
full_df.sample(5)

In [None]:
# We don't need the unique IDs or the two precomputed Naive Bayes columns
full_df.drop(columns=['CLIENTNUM',
                      'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
                      'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'], axis=1, inplace=True)

display(full_df.shape)

The columns were dropped successfully!

In [None]:
full_df[full_df.duplicated()]

There's no duplicated data.

In [None]:
# Checking dtypes (info() prints its report itself and returns None)
full_df.info()
# Checking numeric value stats
display(full_df.describe())

We can see that there are no missing or NaN values, and that the data consists of object columns and, mostly, numeric columns.

Attrition_Flag is the target variable; we should encode it in binary form to train a classification model correctly.

# Exploring the Data

## Categorical Features

* **Attrition_Flag** (1: Existing Customer, 0: Attrited Customer): whether the customer left or not
* **Gender** (1: Male, 0: Female)
* **Education_Level** (Graduate , High School, Unknown, Uneducated, College, Post-Graduate, Doctorate)
* **Marital_Status** (Married, Single, Unknown, Divorced)
* **Income_Category** (Less than 40K, 40K - 60K, 60K - 80K, 80K - 120K, 120K +, Unknown), in dollars
* **Card_Category** (Blue, Silver, Gold, Platinum)

In [None]:
cats = ['Attrition_Flag', 'Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']

def pltCountplot(cats):
    
    fig, axis = plt.subplots(len(cats) // 3,3, figsize=(20,12))  

    index = 0
    for i in range(len(cats) // 3):
        for j in range(3):
            
            # Pass the column as x=...; recent seaborn versions reject it as a positional argument
            ax = sns.countplot(x=cats[index], data=full_df, ax=axis[i][j])
            
            if cats[index] in ['Education_Level', 'Income_Category']:
                for item in ax.get_xticklabels():
                    item.set_rotation(15)
                
            for p in ax.patches:
                height = p.get_height()
                ax.text(p.get_x()+p.get_width()/2.,
                        height + 3,
                        '{:1.2f}%'.format(height/len(full_df)*100),
                        ha="center") 
            index += 1

In [None]:
pltCountplot(cats)

### Observations

* The dataset is not evenly distributed with respect to Attrition_Flag: most samples are existing customers.
* The counts suggest that credit card usage decreases as education level increases.
* Most people hold the Blue card; this is probably correlated with income.
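
The imbalance noted in the first observation can be quantified directly. The sketch below uses a hypothetical 84/16 split (made up for illustration, not read from the dataset) to show how the class shares translate into the accuracy of a trivial model that always predicts the majority class.

```python
import pandas as pd

# Hypothetical labels: the 84/16 split is an assumption chosen for illustration only.
target = pd.Series(['Existing Customer'] * 84 + ['Attrited Customer'] * 16)

# Share of each class in the target column.
class_share = target.value_counts(normalize=True)

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = class_share.max()

print(class_share)
print(f'Majority-class baseline accuracy: {majority_baseline:.2f}')
```

On the real data, `target` would be `full_df['Attrition_Flag']`. Any accuracy reported later is only meaningful when compared against this baseline.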

In [None]:
def pltCountplotHueTarget(cats, target):
    
    fig, axis = plt.subplots(len(cats) // 3,3, figsize=(20,12))  

    index = 0
    for i in range(len(cats) // 3):
        for j in range(3):
            
            # Pass the column as x=...; recent seaborn versions reject it as a positional argument
            ax = sns.countplot(x=cats[index], data=full_df, hue=target, ax=axis[i][j])
            
            ax.legend(title='Customer exit?',
                      loc='upper right',
                      labels=['Yes', 'No'])
            
            if cats[index] in ['Education_Level', 'Income_Category']:
                for item in ax.get_xticklabels():
                    item.set_rotation(15)
                
            for p in ax.patches:
                height = p.get_height()
                ax.text(p.get_x()+p.get_width()/2.,
                        height + 3,
                        '{:1.2f}%'.format(height/len(full_df)*100),
                        ha="center") 
            index += 1

In [None]:
pltCountplotHueTarget(cats, 'Attrition_Flag')

The higher the income, the less likely a person is to use this credit card; those who do use it are more likely to keep using it.

People with the lowest income apparently use the Blue card the most. The churn could be caused by a change in the country's financial situation or by the appearance of a more suitable payment method, although the data alone cannot confirm either explanation.
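
The percentages printed on the bars are shares of the whole dataset, so comparing churn between categories is easier with within-category rates. The sketch below shows one way to compute them with `pd.crosstab`; the toy frame is hypothetical and only reuses the notebook's column names.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    'Income_Category': ['Less than $40K'] * 4 + ['$120K +'] * 4,
    'Attrition_Flag': ['Attrited Customer', 'Existing Customer',
                       'Existing Customer', 'Existing Customer',
                       'Attrited Customer', 'Attrited Customer',
                       'Existing Customer', 'Existing Customer'],
})

# normalize='index' makes every row sum to 1, so each cell is a share within its category.
rates = pd.crosstab(toy['Income_Category'], toy['Attrition_Flag'], normalize='index')

print(rates)
```

Replacing `toy` with `full_df` would give the attrition rate per income category for the real data.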

## Numerical Features

* **Customer_Age**: Customer's Age in Years
* **Dependent_count:** Number of dependents
* **Months_on_book:** Period of relationship with bank
* **Total_Relationship_Count:** Total no. of products held by the customer
* **Months_Inactive_12_mon:** No. of months inactive in the last 12 months
* **Contacts_Count_12_mon:** No. of Contacts in the last 12 months
* **Credit_Limit:** Credit Limit on the Credit Card
* **Total_Revolving_Bal:** Total Revolving Balance on the Credit Card
* **Avg_Open_To_Buy:** Open to Buy Credit Line (Average of last 12 months)
* **Total_Amt_Chng_Q4_Q1:** Change in Transaction Amount (Q4 over Q1)
* **Total_Trans_Amt:** Total Transaction Amount (Last 12 months)
* **Total_Trans_Ct:** Total Transaction Count (Last 12 months)
* **Total_Ct_Chng_Q4_Q1:** Change in Transaction Count (Q4 over Q1)
* **Avg_Utilization_Ratio:** Average Card Utilization Ratio

In [None]:
numeric_columns = ['Customer_Age','Credit_Limit','Months_on_book','Avg_Utilization_Ratio','Avg_Open_To_Buy','Total_Trans_Amt','Dependent_count',
                  'Total_Relationship_Count','Months_Inactive_12_mon','Contacts_Count_12_mon','Total_Revolving_Bal',
                  'Total_Amt_Chng_Q4_Q1','Total_Trans_Ct','Total_Ct_Chng_Q4_Q1']

some_columns = ['Customer_Age','Credit_Limit','Months_on_book','Avg_Utilization_Ratio','Avg_Open_To_Buy','Total_Trans_Amt']


def plotDistPlot(columns):
    fig, ax = plt.subplots(len(columns)//3, 3,figsize=(20, 12))
    
    index = 0
    for i in range(len(columns) // 3):  # match the number of subplot rows created above
        for j in range(3):
            sns.distplot(full_df.loc[:, columns[index]],
                         hist=True,
                         fit=norm,
                         kde=True,
                         ax=ax[i][j])
            ax[i][j].set_title(columns[index])
            # distplot draws the KDE line first and the fitted normal curve second
            ax[i][j].legend(labels=['Actual (KDE)', 'Normal fit'])
            index += 1

In [None]:
plotDistPlot(some_columns)

In [None]:
corr_data = full_df.loc[:, numeric_columns].corr()

plt.figure(figsize=(20,12))
sns.heatmap(corr_data.abs(), annot=True, fmt='.3f',cmap='coolwarm',square=True)
plt.show()

### NOTICE:  
In general, the features are not strongly correlated with each other. This does not mean they are unrelated: the correlation matrix only captures linear relationships. Features may still be related through a quadratic or higher-degree polynomial relationship, which this matrix cannot reveal.
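
A minimal sketch of that limitation, using synthetic values rather than the notebook's data: `y` is fully determined by `x`, yet their linear correlation is essentially zero until the squared term is used as the feature.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # synthetic feature, symmetric around zero
y = x ** 2                        # fully determined by x, but not linearly

# Pearson correlation between x and y.
r_linear = np.corrcoef(x, y)[0, 1]

# Pearson correlation after replacing x by its square.
r_squared_feature = np.corrcoef(x ** 2, y)[0, 1]

print(f'corr(x, y)    = {r_linear:.3f}')
print(f'corr(x**2, y) = {r_squared_feature:.3f}')
```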

## Missing Values

At the beginning of the notebook, I said that the dataset does not include missing values (the dataset page on Kaggle says so too). Still, we should check the data ourselves to be sure. The heatmap below is uniformly dark, which means there is no missing data.

In [None]:
# detecting the missing data

fig, ax = plt.subplots(figsize=(20, 6))

ax.set_title('Train Data Missing Values')
plt.xticks(rotation=90)

# The two Naive Bayes columns were already dropped, so every remaining column is checked
sns.heatmap(full_df.isnull(),
            yticklabels=False,
            cbar=False,
            cmap='magma',
            ax=ax)

plt.show()
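
A numeric check can complement the heatmap. The sketch below counts true missing values per column and, separately, the `Unknown` labels that appear in some categorical columns and play a similar role. The toy frame is hypothetical; only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    'Education_Level': ['Graduate', 'Unknown', 'College', 'Unknown'],
    'Credit_Limit': [3000.0, np.nan, 12000.0, 4500.0],
})

# True missing values (NaN/None) per column.
nan_counts = toy.isnull().sum()

# 'Unknown' placeholders per column; isnull() does not count these.
unknown_counts = toy.isin(['Unknown']).sum()

print(nan_counts)
print(unknown_counts)
```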

## Time for Feature Engineering!

We will play with the data.

## Encoding Object and Category Columns as Numbers

ML algorithms work on numeric values, so we need to transform object and category columns into numbers.
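
One pitfall worth knowing before mapping labels to numbers: `Series.map` with a dictionary turns every label that is missing from the dictionary into NaN without raising an error. The sketch below uses a hypothetical toy column in which `'Other'` stands in for such an unexpected label.

```python
import pandas as pd

# Hypothetical toy column; 'Other' is an invented label that the mapping does not cover.
gender = pd.Series(['M', 'F', 'F', 'Other'])
mapping = {'M': 1, 'F': 0}

# Labels without an entry in the mapping become NaN.
encoded = gender.map(mapping)

# Collect the labels that were lost so they can be handled explicitly.
unmapped = gender[encoded.isna()].unique().tolist()

print(encoded.tolist())
print(unmapped)
```

Checking `unmapped` after each mapping step is a cheap way to confirm that no category was silently dropped.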

### Binary Flags

In [None]:
updated_df = pd.DataFrame()

def tobinary():
    
    # Both columns have exactly two labels, so a direct 0/1 mapping is enough
    updated_df['Attrition'] = full_df.Attrition_Flag.map({'Existing Customer':1, 'Attrited Customer':0})
    
    updated_df['Gender'] = full_df.Gender.map({'M':1, 'F':0})

### String to integer

In [None]:
def stringtoint():
    # The integer order follows the trend seen in the plots,
    # which is why 'Unknown' sits between two known levels

    income_data = full_df['Income_Category'].replace({ 'Less than $40K':0, '$40K - $60K':1, '$60K - $80K':2,
                                                      '$80K - $120K':3,'Unknown': 4 , '$120K +':5})
    education_data = full_df['Education_Level'].replace({'Uneducated': 0, 'High School':1, 'Graduate':2, 'Unknown':3,
                                                         'College':4,'Post-Graduate':5,'Doctorate':6})
    
    updated_df['Income_Category'] = income_data
    updated_df['Education_Level'] = education_data
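
As an alternative to a hand-written dictionary, an ordered categorical can carry the level order explicitly. This is a sketch of a different technique, not the encoding used above: it treats `Unknown` as missing (code -1) instead of placing it between known levels. The toy series is hypothetical.

```python
import pandas as pd

# Known income levels from lowest to highest.
income_order = ['Less than $40K', '$40K - $60K', '$60K - $80K',
                '$80K - $120K', '$120K +']

# Hypothetical toy values; 'Unknown' is not one of the ordered levels.
income = pd.Series(['$40K - $60K', 'Unknown', '$120K +', 'Less than $40K'])

# Turn labels outside the known levels into NaN before building the categorical.
income_known = income.where(income.isin(income_order))

income_cat = pd.Categorical(income_known, categories=income_order, ordered=True)

# Integer codes follow the order of income_order; missing values get -1.
codes = pd.Series(income_cat.codes)

print(codes.tolist())
```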

### Dummies

In [None]:
def encode():
    global updated_df
    card_dummies = pd.get_dummies(full_df['Card_Category'], prefix='Card')
    marital_dummies = pd.get_dummies(full_df['Marital_Status'], prefix='Marital')
    updated_df = pd.concat([updated_df, marital_dummies, card_dummies], axis=1)

In [None]:
def concat_with_numerics():
    global updated_df
    updated_df = pd.concat([updated_df, full_df.loc[:, numeric_columns]], axis=1)

Let's execute all the previous functions.

In [None]:
tobinary()
stringtoint()
encode()
concat_with_numerics()

In [None]:
print('Data shapes """including target value"""')
print(f'Old shape : {full_df.shape}')
print(f'Updated shape : {updated_df.shape}')

## A Look at the Updated Data

We need to make sure the data is ready for modeling. Let's look at the updated data as a whole.

In [None]:
updated_df.sample(5)

In [None]:
updated_df.describe()

In [None]:
updated_df.info()
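
The same readiness check can be automated. The helper below is a hypothetical sketch (the name `check_model_ready` is not part of the notebook): it reports every column that is non-numeric or contains missing values.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_model_ready(df):
    """Return a list of problems; an empty list means the frame looks ready."""
    problems = []
    for col in df.columns:
        if not is_numeric_dtype(df[col]):
            problems.append(f'{col}: non-numeric dtype {df[col].dtype}')
        if df[col].isnull().any():
            problems.append(f'{col}: contains missing values')
    return problems


# Hypothetical toy frames: one clean, one with a text column that has a missing value.
toy_ok = pd.DataFrame({'Gender': [1, 0], 'Credit_Limit': [3000.0, 12000.0]})
toy_bad = pd.DataFrame({'Gender': ['M', None], 'Credit_Limit': [3000.0, 12000.0]})

print(check_model_ready(toy_ok))
print(check_model_ready(toy_bad))
```

Calling `check_model_ready(updated_df)` should return an empty list if all the encoding steps worked as intended.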

### Saving the DataFrame for future work

In [None]:
# index=False avoids writing the row index as an extra unnamed column
updated_df.to_csv('./BankChurners_all_numeric.csv', index=False)

# Modelling

In [None]:
# Importing packages for modelling.

import xgboost as xgb
import lightgbm as lgb

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate, learning_curve
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [None]:
def estimates(X_data, y_data, models, cv):
    
    train_acc_dict = dict()
    test_acc_dict = dict()
    time_dict = dict()
    
    for model in models:
        
        current_model_name = model.__class__.__name__
        
        cv_results = cross_validate(model, X_data, y_data, cv=cv,
                                    return_train_score=True, scoring='accuracy')
        
        train_acc_dict[current_model_name] = cv_results['train_score'].mean()
        test_acc_dict[current_model_name] = cv_results['test_score'].mean()
        time_dict[current_model_name] = cv_results['fit_time'].mean()
        
    return train_acc_dict, test_acc_dict, time_dict

In [None]:
#m_logreg = LogisticRegression()

m_gbc = GradientBoostingClassifier(random_state=14)

#m_rfc = RandomForestClassifier(criterion='gini', n_estimators=999,
                            #max_depth=4, random_state=14)

# n_estimators is the scikit-learn style name for LightGBM's num_iterations
m_lgb = lgb.LGBMClassifier(n_estimators=550, learning_rate=0.01055,
                        max_depth=3, random_state=14)

m_xgb = xgb.XGBClassifier(n_estimators=2250,
                       max_depth=2, random_state=14)

#m_gnb = GaussianNB()

#m_mlpc = MLPClassifier(random_state=14)

#m_svc = SVC(probability=True)

In [None]:
cv = StratifiedKFold(11, shuffle=True, random_state=14)
# We will use only the three best models from the original example
models = [m_gbc, m_lgb, m_xgb]

X = updated_df.drop('Attrition', axis=1)
y = updated_df['Attrition']

print(X.shape)
print(y.shape)

train_acc_dict, test_acc_dict, time_dict = estimates(X, y, models, cv)

# Model results

In [None]:
# Training accuracy
for key, value in train_acc_dict.items():
    print('{} - {:.1f}%'.format(key, value*100))

In [None]:
# Test accuracy
for key, value in test_acc_dict.items():
    print('{} - {:.1f}%'.format(key, value*100))

In [None]:
# Fitting time
for key, value in time_dict.items():
    print('{} - {:.1f} seconds'.format(key, value))

### NOTICE:
Fitting time depends on the hardware the notebook runs on, so the numbers you get may differ from the ones shown here.

# Feature Importance

In [None]:
def plot_importance_features(models, X, y):
    
    fig, axes = plt.subplots(3, len(models) // 2, figsize=(23, 12))

    for ax, model in zip(axes.flatten(), models):
        try:
            model.fit(X, y)
            importance_features = pd.DataFrame(sorted(
                zip(model.feature_importances_, X.columns)),
                                       columns=['Value', 'Feature'])

            importance_features = importance_features.sort_values('Value', ascending=False)
            sns.barplot(y="Feature", x="Value", ax=ax,
                        data=importance_features)
            current_model_name = model.__class__.__name__
            ax.set(title=f'{current_model_name} Feature Importances')
            ax.xaxis.set_major_locator(MaxNLocator(nbins=11))
        except AttributeError:
            # Skip estimators that do not expose feature_importances_
            pass

In [None]:
# Some estimators don't expose feature_importances_, so we only pass the ones that do
plot_importance_features(models[0:3], X, y)

We can see that each model relies on a different set of variables for its predictions.
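
Built-in importances are computed differently by each library, which is one reason the rankings differ. A model-agnostic alternative is permutation importance, sketched below on synthetic data where only the first feature carries signal. This is an additional technique, not the one used in the plots above, and in practice it is better computed on held-out data than on the training set used here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(14)
X_demo = rng.normal(size=(400, 3))          # synthetic features, not the notebook's data
y_demo = (X_demo[:, 0] > 0).astype(int)     # only feature 0 determines the label

model = RandomForestClassifier(n_estimators=50, random_state=14).fit(X_demo, y_demo)

# Shuffle one feature at a time and measure how much the score drops.
result = permutation_importance(model, X_demo, y_demo, n_repeats=5, random_state=14)

# Feature indices sorted from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]

print(ranking)
print(result.importances_mean)
```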

**Total revolving balance** and **total transaction count** seem to be two powerful predictors. Let's check.

In [None]:
corr_data = updated_df.corr()

plt.figure(figsize=(30,22))
sns.heatmap(corr_data, annot=True, fmt='.3f',cmap='coolwarm',square=True)
plt.show()

The updated DataFrame, with the categorical variables in numeric form and the churn flag included, shows potential correlations between some variables and Attrition, keeping in mind that a correlation plot only captures linear relationships.

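In this plot we dropped the absolute value in order to see whether each correlation is positive or negative. Because Attrition is encoded as 1 for existing customers and 0 for attrited ones, a positive coefficient means that higher values of a feature go with **staying**, and a negative coefficient means they go with **quitting**.

Read the usage features (such as the total transaction count) and the number of contacts in the last 12 months with that encoding in mind: a positive sign points to customers who are less likely to quit, and a negative sign to customers who are more likely to quit.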
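
The sign of a correlation with a binary target depends on which class is coded as 1. The sketch below uses synthetic values (the names `usage`, `stays` and `leaves` are illustrative) to show that swapping the encoding flips the sign while keeping the magnitude.

```python
import numpy as np

rng = np.random.default_rng(14)
usage = rng.normal(size=500)                              # synthetic feature

# 1 = existing customer, 0 = attrited; higher usage makes staying more likely here.
stays = (usage + rng.normal(size=500) > 0).astype(int)

# Opposite encoding: 1 = attrited customer, 0 = existing.
leaves = 1 - stays

r_stays = np.corrcoef(usage, stays)[0, 1]
r_leaves = np.corrcoef(usage, leaves)[0, 1]

print(f'corr(usage, stays)  = {r_stays:.3f}')
print(f'corr(usage, leaves) = {r_leaves:.3f}')
```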

### Dimensionality Reduction & Fitting the Models Again

Some features may be hurting our models' accuracy. We reduce the dimensionality and check the accuracy again. We also try to improve the models' accuracy using StandardScaler.

In [None]:
# creates pipeline
my_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('reducer', PCA(n_components=2)),
])

X_red = my_pipe.fit_transform(X)
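
The cell above fits the scaler and PCA on all rows before cross-validation, so every validation fold has already influenced the transformation. A stricter variant, sketched below on synthetic data, puts the estimator inside the pipeline so that each fold fits its own scaler and PCA. The names `leak_free`, `X_demo`, `y_demo` and `cv_demo` are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y; not the credit card data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=4, random_state=14)

# Scaling, PCA and the model are refit together on each training fold.
leak_free = Pipeline([
    ('scaler', StandardScaler()),
    ('reducer', PCA(n_components=2)),
    ('model', GradientBoostingClassifier(n_estimators=50, random_state=14)),
])

cv_demo = StratifiedKFold(5, shuffle=True, random_state=14)
cv_results = cross_validate(leak_free, X_demo, y_demo, cv=cv_demo,
                            scoring='accuracy', return_train_score=True)

print(cv_results['test_score'].mean())
```

If such pipelines were passed to `estimates`, every entry would be stored under the class name `Pipeline`, so the dictionary keys would need to be built differently.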

**Fit again**

In [None]:
train_acc_dict_red, test_acc_dict_red, time_dict_red = estimates(X_red, y, models, cv)

### Print the models' accuracy again

In [None]:
# Training accuracy
for key, value in train_acc_dict_red.items():
    print('{} - {:.1f}%'.format(key, value*100))

In [None]:
# Test accuracy
for key, value in test_acc_dict_red.items():
    print('{} - {:.1f}%'.format(key, value*100))

In [None]:
# Fitting time
for key, value in time_dict_red.items():  # timings of the PCA-reduced run
    print('{} - {:.1f} seconds'.format(key, value))

### Adding it all up
A mix of the PCA components with the previous features.

In [None]:
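# Name the PCA columns and reuse X's index so that rows align and all column names are strings
X_pca = pd.DataFrame(X_red, columns=['PC1', 'PC2'], index=X.index)
X_full = pd.concat([X_pca, X], axis=1)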

In [None]:
X_full.head(3)

In [None]:
train_acc_dict_full, test_acc_dict_full, time_dict_full = estimates(X_full, y, models, cv)

In [None]:
# Training accuracy
for key, value in train_acc_dict_full.items():
    print('{} - {:.1f}%'.format(key, value*100))

In [None]:
# Test accuracy
for key, value in test_acc_dict_full.items():
    print('{} - {:.1f}%'.format(key, value*100))

In [None]:
# Fitting time
for key, value in time_dict_full.items():  # timings of the combined-features run
    print('{} - {:.1f} seconds'.format(key, value))

We can see that combining the PCA-generated features with the original ones gives the best results. This may be due to the extra degrees of freedom that the additional features give the models.
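
One way to probe this is to check how much of the total variance the two components retain. The sketch below does so on synthetic data with one deliberately correlated pair of features. On the real data, the same attribute can be read from the fitted `reducer` step of the pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(14)
X_demo = rng.normal(size=(200, 6))                          # synthetic features
X_demo[:, 1] = X_demo[:, 0] + 0.1 * rng.normal(size=200)    # make two columns correlated

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X_demo))

# Fraction of the total variance captured by each of the two components.
ratios = pca.explained_variance_ratio_
retained = ratios.sum()

print(ratios)
print(f'Variance retained by 2 components: {retained:.2f}')
```

A low retained share would suggest that the two components alone discard much of the information, which is consistent with them helping more as additions than as replacements.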

This implementation is based on a previous Kaggle notebook by **Alper Temel**; you can find the original implementation [here](https://www.kaggle.com/alpertml/credit-card-customers-eda-ml-97-5-accuracy).