# Introduction
<p>Welcome! In this notebook I'm going to analyze credit card customer data and implement a machine learning classifier to predict the probability of customer attrition.</p>
<h3>My main objectives for this project are:</h3>
<ul>
    <li>Applying exploratory data analysis and trying to get some insights about our dataset</li>
    <li>Getting the data into better shape through transformation and feature engineering, to help us build better models</li>
    <li>Building and tuning different ML algorithms to get some results on predicting Attrition</li>
</ul>

<h2>Importing Libraries</h2>
<p>Let's start by importing the packages we are going to need.</p>

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.ticker import MaxNLocator
import seaborn as sns
from sklearn.preprocessing import LabelBinarizer

# Meeting the data
<p>Let's open the data and see what we have.</p>

In [None]:
#Opening the data
data = pd.read_csv('../input/credit-card-customers/BankChurners.csv')

In [None]:
# Let's check the shape of the data (rows, columns) so we know what we are dealing with
data.shape

In [None]:
# Let's look at the first 10 rows
data.head(10)

In [None]:
# Let's drop the last two columns: judging by their names they hold the output of a Naive Bayes classifier, not customer attributes
data.drop(columns=["Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1",
                  "Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2"],
         inplace=True)


In [None]:
# Split the data into features X and label y, dropping the customer id because it is irrelevant for the analysis and modeling
X = data.copy()
X.drop(columns=['CLIENTNUM', 'Attrition_Flag'], inplace=True)
y = data['Attrition_Flag']

In [None]:
# Using a label binarizer to convert y label into 1's and 0's
labelBinarizer = LabelBinarizer()
y = labelBinarizer.fit_transform(y)
y = np.reshape(y, -1)
y = pd.Series(y)
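
<p>It is worth checking which class ends up as 1. The sketch below is a toy example, not the real data: it assumes the two values of Attrition_Flag are "Existing Customer" and "Attrited Customer". LabelBinarizer sorts the class names, so under that assumption attrited customers are encoded as 0 and existing customers as 1.</p>

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Toy labels; the two class names are an assumption about the Attrition_Flag column
labels = np.array(["Existing Customer", "Attrited Customer", "Existing Customer"])

binarizer = LabelBinarizer()
encoded = binarizer.fit_transform(labels).ravel()

# classes_ is sorted: index 0 is encoded as 0, index 1 is encoded as 1
print(binarizer.classes_)
print(encoded)
```

<p>On the real data, <code>labelBinarizer.classes_</code> shows the same mapping for the fitted binarizer.</p>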

# EDA
<p>Exploratory Data Analysis</p>

<p>Let's draw a heatmap to see the correlation between the different features.</p>

In [None]:
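# For this purpose, I'll add y back to a copy of X
analysisData = X.copy()
analysisData['Attrition_Flag'] = y
# Correlation is only defined for numeric columns, so keep only those
correlation = analysisData.select_dtypes(include='number').corr()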

f, ax = plt.subplots(figsize=(14,12))
plt.title('Correlation of numerical attributes', size=16)
sns.heatmap(correlation)
plt.show()

<h4>Observations</h4>
<p>Let's focus on the lighter parts of the graph:</p>
<ol>
    <li>Customer_Age and Months_on_book are highly correlated, presumably because customers can only get a credit card once they reach the eligible age, so older customers tend to have been with the bank longer</li>
    <li>Avg_Open_To_Buy and Credit_Limit are highly correlated because they are telling almost the "same thing": the open-to-buy amount is the part of the credit limit that is still available</li>
    <li>Total_Trans_Amt is highly correlated with Total_Trans_Ct because the total amount usually grows as the number of transactions grows</li>
</ol>
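
<p>Reading pairs off a heatmap by eye is error-prone, so here is a minimal sketch that lists the most correlated pairs directly. The helper <code>top_correlated_pairs</code> and the toy columns are hypothetical names introduced for illustration; the data is synthetic.</p>

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Synthetic data for illustration only: b is a noisy copy of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
print(top_correlated_pairs(toy, n=1))
```

<p>Calling <code>top_correlated_pairs(analysisData)</code> on the notebook's data should surface the same pairs that stand out in the heatmap.</p>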

In [None]:
# To make this clearer, let's plot Total_Trans_Amt as a function of Total_Trans_Ct

amount = analysisData['Total_Trans_Amt']
count = analysisData['Total_Trans_Ct']

fig, ax = plt.subplots()
# Each customer is one point; a line plot would join the rows in file order and hide the pattern
ax.scatter(count, amount, s=5, alpha=0.3)

ax.set(xlabel='Number of transactions', ylabel='Total amount of transactions',
       title='Total transaction amount as a function of the number of transactions')
ax.grid()

plt.show()

In [None]:
# Let's look at the distribution and some summary statistics of the categorical columns
cat_columns = X.select_dtypes(include=['object']).columns
for col in cat_columns:
    print(X[col].value_counts(ascending=True, normalize=True))
    print(X[col].describe())
    print("---------------------------------------------------")

# Missing Data
<ul>
    <li>Let's check whether there are any missing values</li>
</ul>

In [None]:
X.isnull().sum()

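<p>Luckily we don't have any missing values, so we can proceed with modeling. Keep in mind that this check only counts NaN entries: a placeholder category such as "Unknown" in the value counts above would not show up here.</p>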

# Preprocessing + Pipeline
<p>First, let's split the data into train and test sets.</p>
<p>Pipeline Steps:</p>
<ol>
    <li>One Hot Encoding</li>
    <li>Quantile Processing</li>
    <li>Fit the model</li>
</ol>

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=10)
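
<p>If the two classes are imbalanced, a plain random split can leave train and test with slightly different class ratios. The sketch below uses synthetic labels (the 20/80 ratio is made up for illustration) to show how the <code>stratify</code> argument of <code>train_test_split</code> keeps the ratio the same in both splits. It is an optional variation, not what the split above does.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels for illustration: 20 zeros and 80 ones
y_toy = np.array([0] * 20 + [1] * 80)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=10, stratify=y_toy
)

# Share of class 1 in each split; with stratify both stay close to the original 0.8
print(y_tr.mean(), y_te.mean())
```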

In [None]:
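from xgboost import XGBClassifier
# 'binary:logistic' is the objective for a two-class target; 'reg:linear' is a deprecated regression objective
# tree_method='gpu_hist' needs a CUDA GPU; in XGBoost 2.0+ the equivalent is tree_method='hist' with device='cuda'
xgb = XGBClassifier(colsample_bytree=0.7, learning_rate=0.07, max_depth=7, min_child_weight=4,
                    n_estimators=500, n_jobs=4, objective='binary:logistic', subsample=0.7, tree_method='gpu_hist')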

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import QuantileTransformer

In [None]:
catTransformer = ColumnTransformer([('encoder', OneHotEncoder(), cat_columns)], remainder='passthrough')

In [None]:
from sklearn.pipeline import Pipeline

model_pipeline = Pipeline(steps=[
                                ('One Hot Encoding', catTransformer),
                                ('Quantile_Processing', QuantileTransformer(n_quantiles=10, random_state=0)),
                                ('XGBoost', xgb)
                                ])
model_pipeline.fit(X_train, y_train)
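
<p>To see what the quantile step does on its own, here is a small sketch on a synthetic right-skewed feature (not the real dataset). With the default settings, QuantileTransformer maps each feature to a roughly uniform distribution on [0, 1] while keeping the order of the values.</p>

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic right-skewed feature for illustration only
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1000.0, size=(200, 1))

qt = QuantileTransformer(n_quantiles=10, random_state=0)
transformed = qt.fit_transform(skewed)

# The default output distribution is uniform, so values land in [0, 1]
print(transformed.min(), transformed.max())
```

<p>Tree-based models such as XGBoost mostly care about the ordering of values, so this step is likely to matter less here than it would for a linear model.</p>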

In [None]:
# Accuracy on the held-out test set
from sklearn.metrics import accuracy_score

# predict() already returns class labels (0 or 1), so no rounding is needed
y_pred = model_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
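
<p>Accuracy alone can hide how well the rarer class is predicted, and the goal stated in the introduction is a probability, not just a label. The sketch below shows both ideas on synthetic data, with LogisticRegression as a stand-in for the XGBoost pipeline. It assumes class 0 plays the role of attrited customers, which is what a sorted label encoding would give if the class names are "Attrited Customer" and "Existing Customer".</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score

# Synthetic, imbalanced data for illustration; class 0 stands for "attrited" (assumption)
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(-2.0, 1.0, size=(40, 2)),   # 40 samples of class 0
    rng.normal(2.0, 1.0, size=(160, 2)),   # 160 samples of class 1
])
y_toy = np.array([0] * 40 + [1] * 160)

clf = LogisticRegression().fit(X_toy, y_toy)  # stand-in for the XGBoost pipeline
y_hat = clf.predict(X_toy)

# predict_proba columns follow clf.classes_, so column 0 is the probability of class 0
attrition_proba = clf.predict_proba(X_toy)[:, 0]

print(confusion_matrix(y_toy, y_hat))
print("Recall on class 0:", recall_score(y_toy, y_hat, pos_label=0))
```

<p>With the notebook's own objects, the equivalent calls would be <code>model_pipeline.predict_proba(X_test)</code> and <code>confusion_matrix(y_test, y_pred)</code>.</p>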

# End
<p>Thanks for going all the way through my notebook! I hope you were able to get something useful from it. Feel free to ask questions and use my code.</p>