# Credit Card Customer Churn - EDA & Modelling

## Table of contents
* [1. Introduction](#Introduction)
    * [1.1. Goals](#Goals)
    * [1.2. Libraries](#Libraries)
* [2. The Data](#TheData)
    * [2.1. Data Sample](#DataSample)
    * [2.2. Data Preprocessing](#DataPreprocessing)
* [3. Customer Profiles](#CustomerProfiles)
    * [3.1. Exploratory Data Analysis](#EDA)
    * [3.2. Churn and Non Churn Profiles](#Profiles)
* [4. Customer Churn Prediction](#CustomerChurnPrediction)
    * [4.1. Data Preparation](#DataPrep)
    * [4.2. Model Training](#ModelTraining)
    * [4.3. Model Evaluation](#ModelEvaluation)
    * [4.4. Hyperparameter tuning](#Hyperparameter)
    * [4.5. Feature Importance](#FeatureImportance)
* [5. Conclusion](#Conclusion)


<a id="Introduction"></a>
# 1. Introduction

<a id="Goals"></a>
## 1.1. Goals
The goal of this notebook is to complete both tasks set for the ["Credit Card Customers"](https://www.kaggle.com/sakshigoyal7/credit-card-customers/tasks) dataset.

The first goal of this project is to provide an analysis that shows the **difference** between a **non-churning and a churning customer**. This gives us insight into which customers are likely to churn.

The top priority of this case is to identify whether a customer will churn. It is especially important that we don't **predict** a churning customer as non-churning (a false negative). That's why the model is evaluated on the **"Recall"** metric (goal > 62%).
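
Recall is the share of truly churning customers that the model flags: TP / (TP + FN). The sketch below checks this on a handful of hypothetical labels (made up for illustration, not taken from the dataset), once by hand from the confusion matrix and once with `recall_score`.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for illustration only: 1 = churned, 0 = stayed
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

manual_recall = tp / (tp + fn)            # churners caught / all churners
sk_recall = recall_score(y_true, y_pred)  # same quantity via scikit-learn

print("Recall by hand:", manual_recall)
print("Recall via sklearn:", sk_recall)
```

One missed churner out of four lowers recall, while the false alarm on a staying customer only affects precision.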

<a id="Libraries"></a>
## 1.2. Libraries
The libraries used can be found in the code block below.

In [None]:
!pip install imbalanced-learn

In [None]:
# Libraries
import os

# Used for EDA, Customer profiling
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from imblearn.over_sampling import SMOTE

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, learning_curve, train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, roc_curve
from sklearn.inspection import permutation_importance

from xgboost import XGBClassifier

from scipy import stats
from scipy.stats import randint
from scipy.stats import uniform


# Presets
%matplotlib inline
sns.set()

<a id="TheData"></a>
# 2. The Data
<a id="DataSample"></a>
## 2.1. Data sample

The building block of any data science project is the data. Below you can find a few data records that will be used in the further analysis. The dataset consists of roughly 10,000 samples describing the customers and their behavior.

The following columns/features can be split up in the following groups:

* ***Basic information***:
    * **CLIENTNUM** : Unique identifier for the customer holding the account.


* ***Target/Label***:
    * **Attrition_Flag**: Internal event (customer activity) variable - "Attrited Customer" if the account is closed, otherwise "Existing Customer" (encoded as 1 and 0 in section 4.1).


* ***Demographic variables***:
    * **Customer_Age**: Demographic variable - Customer's Age in Years.
    * **Gender**: Demographic variable - M=Male, F=Female.
    * **Dependent_count**: Demographic variable - Number of dependents.
    * **Education_Level**: Demographic variable - Educational Qualification of the account holder (example: high school, college graduate, etc.).
    * **Marital_Status**: Demographic variable - Married, Single, Divorced, Unknown.
    * **Income_Category**: Demographic variable - Annual Income Category of the account holder (Less than $40K, $40K - $60K, $60K - $80K, $80K - $120K, $120K +, Unknown).
    

* ***Product variables***:
    * **Card_Category**: Product Variable - Type of Card (Blue, Silver, Gold, Platinum).
    * **Months_on_book**: Period of relationship with bank.
    * **Total_Relationship_Count**: Total no. of products held by the customer.
    * **Months_Inactive_12_mon**: No. of months inactive in the last 12 months.
    * **Contacts_Count_12_mon**: No. of Contacts in the last 12 months.
    * **Credit_Limit**: Credit Limit on the Credit Card.
    * **Total_Revolving_Bal**: Total Revolving Balance on the Credit Card.
    * **Avg_Open_To_Buy**: Open to Buy Credit Line (Average of last 12 months).
    * **Total_Amt_Chng_Q4_Q1**: Change in Transaction Amount (Q4 over Q1).
    * **Total_Trans_Amt**: Total Transaction Amount (Last 12 months).
    * **Total_Trans_Ct**: Total Transaction Count (Last 12 months).
    * **Total_Ct_Chng_Q4_Q1**: Change in Transaction Count (Q4 over Q1).
    * **Avg_Utilization_Ratio**: Average Card Utilization Ratio.


* ***Unimportant variables***:
    * **Naive_Bayes**: It was mentioned that all columns containing the "N.B."-tag should be disregarded.


In [None]:
data = pd.read_csv('/kaggle/input/credit-card-customers/BankChurners.csv')
data.head(5)

<a id="DataPreprocessing" ></a>
## 2.2. Data Preprocessing
In this phase we'll quickly explore the data and remove or impute incorrect values, so that a cleaned dataset can be used for further analysis and modelling.

* Remove unnecessary columns.
* Check for duplicates.
* Use the client number as the index.
* Check for null values.



### Remove N.B. columns

In [None]:
# Removing the N.B. columns
data = data.drop(columns= ['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1','Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'])

In [None]:
for column in data.columns:
    print("Column name: " + column)

print("Column count: " + str(len(data.columns)))

### Check for duplicates and change ID to ClientNumber

In [None]:
# True if duplicates are present
data.duplicated().any()

In [None]:
# Change the ID to the ClientNumber
data = data.set_index("CLIENTNUM")

### Null values?

In [None]:
# Describe columns
data.isnull().any()

No null values are found.
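
A null check only catches real `NaN` values. Missing information can also be stored as a placeholder string such as "Unknown", which appears in several categorical columns of this dataset. The sketch below uses a small hypothetical frame (the rows are made up) to show how both kinds of gaps can be counted side by side.

```python
import pandas as pd

# Hypothetical toy frame; in the notebook this would be `data`
toy = pd.DataFrame({
    "Education_Level": ["Graduate", "Unknown", "High School", "Unknown"],
    "Marital_Status": ["Married", "Single", "Unknown", "Married"],
    "Customer_Age": [45, 51, 38, 60],
})

null_counts = toy.isnull().sum()          # real NaN values per column
unknown_counts = toy.eq("Unknown").sum()  # placeholder strings per column

print(pd.DataFrame({"nulls": null_counts, "unknown": unknown_counts}))
```

Whether "Unknown" should be treated as a category of its own or imputed is a modelling choice. This notebook keeps it as a category.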

<a id="CustomerProfiles"></a>
# 3. Customer Profiles
Let's now explore and understand our data! 
<a id="EDA"></a>
## 3.1. Exploratory Data Analysis (EDA)

EDA tasks:

* Check the target variable:
    * Amount of attrition.


* Check the demographic variables:
    * Age vs attrition.
    * Gender vs churn.
    * Number of dependents vs churn.
    * Education level vs churn.
    * Marital status vs churn.
    * Income category vs churn.


* Check the product variables:
    * Type of card vs churn.
    * Relationship with the bank vs churn.
    * Number of products vs churn.
    * Inactive months vs churn.
    * Number of contacts vs churn.
    * Credit Limit vs churn.
    * Total revolving balance vs churn.
    * Openness To Buy Credit Line vs churn.
    * Transaction Amount Change vs churn.
    * Transaction Count Change vs churn.
    * Average Card Utilization Ratio vs churn.

--> click [here](#Profiles) to skip forward to the profiling result!


## Check the target variable
### Amount of churned customers
How many customers have churned?

In [None]:
target = data["Attrition_Flag"].value_counts()

fig1, ax1 = plt.subplots()

ax1.pie(target, labels=target.index, autopct='%1.1f%%', shadow=None)
ax1.axis('equal')
plt.title("Amount of churned customers", fontsize=14)
plt.show()

It's clear that the majority of our customers (83.9 %) stay. Since the "attrited" (churned) label accounts for less than 20% of all customers, the dataset is imbalanced. Upsampling will be needed to get better results.
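
The imbalance can also be expressed as numbers instead of being read off the pie chart. The sketch below uses a hypothetical label series (84 existing and 16 attrited customers, chosen only to resemble the split above) and computes the class shares and the majority-to-minority ratio.

```python
import pandas as pd

# Hypothetical toy labels; in the notebook this would be data["Attrition_Flag"]
flags = pd.Series(["Existing Customer"] * 84 + ["Attrited Customer"] * 16)

shares = flags.value_counts(normalize=True)     # class proportions, summing to 1
imbalance_ratio = shares.max() / shares.min()   # majority-to-minority ratio

print(shares.round(3))
print("Majority/minority ratio:", round(imbalance_ratio, 2))
```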

In [None]:
# Helper subsets and plotting functions used throughout the EDA

churned = data[data['Attrition_Flag'] == "Attrited Customer"]
nonchurned = data[data['Attrition_Flag'] == "Existing Customer"]

def plot_pie(column):
    target = data[column].value_counts()    
    fig1, ax1 = plt.subplots()    
    ax1.pie(target, labels=target.index, autopct='%1.1f%%', shadow=None)
    ax1.axis('equal')
    plt.title("All customers", fontsize=14)
    plt.show()
    

def plot_compare(column, category_name):
    NChurned = len(churned[column].unique())
    NNonChurned = len(nonchurned[column].unique())
    
    ChurnedCounts = churned[column].value_counts().sort_index()
    NonChurnedCounts = nonchurned[column].value_counts().sort_index()
    
    indchurned = np.arange(NChurned)    # the x locations for the groups
    indnonchurned = np.arange(NNonChurned)    
    width = 1       # the width of the bars: can also be len(x) sequence
    
    figs, axs = plt.subplots(1,2, figsize=(12,5))
    
    axs[1].bar(indchurned, ChurnedCounts, width, color='#DD8452')
    axs[1].set_title('Churned ' + category_name, fontsize=20)
    axs[1].set_xticks(indchurned)
    axs[1].set_xticklabels(ChurnedCounts.index.tolist(), rotation=45)
    
    axs[0].bar(indnonchurned, NonChurnedCounts, width, color='b')
    axs[0].set_title('Non Churned ' + category_name, fontsize=20)
    axs[0].set_ylabel('Amount of People')
    axs[0].set_xticks(indnonchurned)
    axs[0].set_xticklabels(NonChurnedCounts.index.tolist(), rotation=45)
    
    
    plt.show()

## Checking the demographic variables
### Age compared to the churn

In [None]:
# Checking the overall distribution
data["Customer_Age"].hist()
plt.xlabel("Age")
plt.ylabel("Amount of customers")
plt.title("Age distribution", fontsize=15)
plt.show()

In [None]:
# Comparing the age distribution vs the target
sns.boxplot(x="Attrition_Flag", y="Customer_Age",
            hue="Attrition_Flag", palette=["b", "#DD8452"],
            data=data).set_title("Age vs Churn", fontsize=15)
sns.despine(offset=10, trim=True)

The age is approximately normally distributed. There is no clear difference between the age distributions of churned and existing customers.

### Gender vs churn
Are males or females more likely to churn?

In [None]:
churnedtarget = churned["Gender"].value_counts()
nonchurnedtarget = nonchurned["Gender"].value_counts()

fig1, axs = plt.subplots(1, 2)

# Left pie: gender split among churned customers
axs[0].pie(churnedtarget, labels=churnedtarget.index, autopct='%1.1f%%', shadow=None)
axs[0].axis('equal')
axs[0].set_title('Churning customers')

# Right pie: gender split among existing customers
axs[1].pie(nonchurnedtarget, labels=nonchurnedtarget.index, autopct='%1.1f%%', shadow=None)
axs[1].axis('equal')
axs[1].set_title('Existing customers')

plt.show()

The difference is too small to say that one gender is more likely to churn.
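
Because the two groups differ a lot in size, comparing raw counts can be misleading. A row-normalized cross table gives the churn rate per category directly. The sketch below uses a hypothetical toy frame (the rows are made up), and the same pattern could be applied to any categorical column in this EDA.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be `data`
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Attrition_Flag": [
        "Attrited Customer", "Existing Customer", "Existing Customer", "Existing Customer",
        "Attrited Customer", "Attrited Customer", "Existing Customer", "Existing Customer",
    ],
})

# normalize="index" makes every row sum to 1, so each cell is a within-gender share
rates = pd.crosstab(toy["Gender"], toy["Attrition_Flag"], normalize="index")
churn_rate = rates["Attrited Customer"]

print(rates)
```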

### Number of dependents vs churn

In [None]:
N = 6
ChurnedCounts = churned["Dependent_count"].value_counts().sort_index()
NonChurnedCounts = nonchurned["Dependent_count"].value_counts().sort_index()

ind = np.arange(N)    # the x locations for the groups
width = 0.3       # the width of the bars: can also be len(x) sequence

figs, axs = plt.subplots(figsize=(10,7))

axs.bar(ind - width/2, ChurnedCounts, width, color = "#DD8452")
axs.bar(ind + width/2, NonChurnedCounts, width)

axs.set_xlabel('Dependent Count')
axs.set_ylabel('Amount of People')
axs.set_title('Distribution of the dependent count', fontsize=20)
axs.set_xticks(ind)
axs.set_xticklabels(ChurnedCounts.index.tolist())
axs.legend(('Churned Customers','Existing Customers',))

plt.show()

In [None]:
churned['Dependent_count'].describe()

In [None]:
nonchurned['Dependent_count'].describe()

The dependent count is roughly bell-shaped. No clear shift is visible when comparing the churned and non-churned distributions.

### Education level vs churn

In [None]:
plot_pie("Education_Level")

The largest group of customers has a graduate education level, followed by high school. 15% of the population has an unknown education level.

In [None]:
plot_compare("Education_Level", "Education Level")

The "Education level" distribution of the churned and non-churned customers shows no clear difference.

### Marital status vs churn

In [None]:
plot_pie("Marital_Status")

In [None]:
plot_compare("Marital_Status", "Marital Status")

The largest part of the population is married. Churned and non-churned customers have the same distribution.

### Income category vs churn

In [None]:
plot_pie("Income_Category")

In [None]:
plot_compare("Income_Category", "Income Categories")

We notice that the largest group of our customers earns less than $40K a year. As with the other demographic variables, no clear shift in the distributions can be seen.

## Checking the product variables
### Types of cards vs churn

In [None]:
plot_pie("Card_Category")

In [None]:
plot_compare("Card_Category", "Types of cards")

We can clearly see that most of our customers have the "Blue" card. The distribution of churned/not churned is the same.

### Relationship with bank vs churn

In [None]:
column = "Months_on_book"

N = len(data[column].unique())
DataCounts = data[column].value_counts().sort_index()

ind = np.arange(N) 
width = 1       

figs, axs = plt.subplots(figsize=(12,5))

axs.bar(ind, DataCounts, width, color='b')
axs.set_ylabel('Amount of People')
axs.set_title('Length of relationship with the bank', fontsize=20)
axs.set_xticks(ind)
axs.set_xticklabels(DataCounts.index.tolist(), rotation=45)

plt.show()

In [None]:
plot_compare("Months_on_book", "Length of relationship")

In [None]:
# Comparing the length of relationship vs the target
sns.boxplot(x="Attrition_Flag", y="Months_on_book",
            hue="Attrition_Flag", palette=["b", "#DD8452"],
            data=data).set_title("Length Of Relationship vs Churn", fontsize=15)
sns.despine(offset=10, trim=True)

### Number of products bought vs churn

In [None]:
plot_pie("Total_Relationship_Count")

In [None]:
plot_compare("Total_Relationship_Count", "Number Of Products")

Here we see a shift in the distribution when we compare the churned and the non-churned customers. Non-churned customers tend to hold more products than churned customers.

### Months inactive vs churn

In [None]:
plot_pie("Months_Inactive_12_mon")

In [None]:
plot_compare("Months_Inactive_12_mon", "Inactive Months")

In [None]:
# Comparing the number of inactive months vs the target
sns.boxplot(x="Attrition_Flag", y="Months_Inactive_12_mon",
            hue="Attrition_Flag", palette=["b", "#DD8452"],
            data=data).set_title("Inactive months vs Churn", fontsize=15)
sns.despine(offset=10, trim=True)

Most customers show 3 months of inactivity. Most of the more active customers (fewer than 3 inactive months) are found among the non-churning customers.

### Number of contacts vs churn

In [None]:
plot_pie("Contacts_Count_12_mon")

In [None]:
plot_compare("Contacts_Count_12_mon", "Number Of Contacts")

In [None]:
# Comparing the number of contacts vs the target
sns.boxplot(x="Attrition_Flag", y="Contacts_Count_12_mon",
            hue="Attrition_Flag", palette=["b", "#DD8452"],
            data=data).set_title("Number Of Contacts vs Churn", fontsize=15)
sns.despine(offset=10, trim=True)

Churned customers tend to have had more contacts than non-churned customers. Although the difference isn't large, it's still noticeable.

### Credit limit vs churn

In [None]:
# Comparing the credit limit vs the target
sns.boxplot(x="Attrition_Flag", y="Credit_Limit",
            hue="Attrition_Flag", palette=["b", "#DD8452"],
            data=data).set_title("Credit Limit vs Churn", fontsize=15)
sns.despine(offset=10, trim=True)

In [None]:
column = "Credit_Limit"
category_name = "Credit Limit" 

NChurned = len(churned[column].unique())
NNonChurned = len(nonchurned[column].unique())

figs, axs = plt.subplots(figsize=(12,5))
    
axs.hist([churned[column], nonchurned[column]], color=['#DD8452', 'b'])

axs.set_ylabel('Amount of People')
axs.set_title(category_name + ': churned vs existing customers', fontsize=20)
axs.legend(('Churned Customers', 'Existing Customers'))


plt.show()

There is no clear difference in the credit limit.

### Total revolving balance vs churn

In [None]:
# Comparing the revolving balance vs the target
sns.boxplot(x="Attrition_Flag", y="Total_Revolving_Bal",
            hue="Attrition_Flag", palette=["b", "#DD8452"],
            data=data).set_title("Revolving Balance vs Churn", fontsize=15)
sns.despine(offset=10, trim=True)

In [None]:
column = "Total_Revolving_Bal"
category_name = "Revolving Balance" 

NChurned = len(churned[column].unique())
NNonChurned = len(nonchurned[column].unique())

figs, axs = plt.subplots(figsize=(12,5))
    
axs.hist([churned[column], nonchurned[column]], color=['#DD8452', 'b'])

axs.set_ylabel('Amount of People')
axs.set_title(category_name + ': churned vs existing customers', fontsize=20)
axs.legend(('Churned Customers', 'Existing Customers'))


plt.show()

It's clear that the churned customers have a lower revolving balance than the existing customers.

### Openness To Buy Credit Line vs churn

In [None]:
# Comparing the open-to-buy credit line vs the target
sns.boxplot(x="Attrition_Flag", y="Avg_Open_To_Buy",
            hue="Attrition_Flag", palette=["b", "#DD8452"],
            data=data).set_title("Open To Buy Credit Line vs Churn", fontsize=15)
sns.despine(offset=10, trim=True)

In [None]:
column = "Avg_Open_To_Buy"
category_name = "Open To Buy Credit Line"

NChurned = len(churned[column].unique())
NNonChurned = len(nonchurned[column].unique())

figs, axs = plt.subplots(figsize=(12,5))
    
axs.hist([churned[column], nonchurned[column]], color=['#DD8452', 'b'])

axs.set_ylabel('Amount of People')
axs.set_title(category_name + ': churned vs existing customers', fontsize=20)
axs.legend(('Churned Customers', 'Existing Customers'))


plt.show()

No distinctive difference.


### Change in Transaction vs Churn

In [None]:
sns.boxplot(x="Attrition_Flag", y="Total_Amt_Chng_Q4_Q1",
            hue="Attrition_Flag", palette=["b", "#DD8452"],
            data=data).set_title("Change in Transaction Amount vs Churn", fontsize=15)
sns.despine(offset=10, trim=True)

In [None]:
column = "Total_Amt_Chng_Q4_Q1"
category_name = "Change in Transaction Amount" 

NChurned = len(churned[column].unique())
NNonChurned = len(nonchurned[column].unique())

figs, axs = plt.subplots(figsize=(12,5))
    
axs.hist([churned[column], nonchurned[column]], color=['#DD8452', 'b'])

axs.set_ylabel('Amount of People')
axs.set_title(category_name + ': churned vs existing customers', fontsize=20)
axs.legend(('Churned Customers', 'Existing Customers'))


plt.show()

No clear difference.

### Total transaction amount vs churn

In [None]:
sns.boxplot(x="Attrition_Flag", y="Total_Trans_Amt",
            hue="Attrition_Flag", palette=["b", "#DD8452"],
            data=data).set_title("Transaction Amount vs Churn", fontsize=15)
sns.despine(offset=10, trim=True)

In [None]:
column = "Total_Trans_Amt"
category_name = "Transaction Amount" 

NChurned = len(churned[column].unique())
NNonChurned = len(nonchurned[column].unique())

figs, axs = plt.subplots(figsize=(12,5))
    
axs.hist([churned[column], nonchurned[column]], color=['#DD8452', 'b'])

axs.set_ylabel('Amount of People')
axs.set_title(category_name + ': churned vs existing customers', fontsize=20)
axs.legend(('Churned Customers', 'Existing Customers'))


plt.show()

It's clear that the transaction amount is lower for the churned customers compared to the existing customers.

### Total transaction count vs Churn 

In [None]:
sns.boxplot(x="Attrition_Flag", y="Total_Trans_Ct",
            hue="Attrition_Flag", palette=["b", "#DD8452"],
            data=data).set_title("Transaction Count vs Churn", fontsize=15)
sns.despine(offset=10, trim=True)

In [None]:
column = "Total_Trans_Ct"
category_name = "Transaction Count" 

NChurned = len(churned[column].unique())
NNonChurned = len(nonchurned[column].unique())

figs, axs = plt.subplots(figsize=(12,5))
    
axs.hist([churned[column], nonchurned[column]], color=['#DD8452', 'b'])

axs.set_ylabel('Amount of People')
axs.set_title(category_name + ': churned vs existing customers', fontsize=20)
axs.legend(('Churned Customers', 'Existing Customers'))


plt.show()

It's clear that churned customers mostly have a lower transaction count than the existing customers.

### Change in transaction count vs Churn

In [None]:
sns.boxplot(x="Attrition_Flag", y="Total_Ct_Chng_Q4_Q1",
            hue="Attrition_Flag", palette=["b", "#DD8452"],
            data=data).set_title("Transaction Count Change vs Churn", fontsize=15)
sns.despine(offset=10, trim=True)

In [None]:
column = "Total_Ct_Chng_Q4_Q1"
category_name = "Transaction Count Change" 

NChurned = len(churned[column].unique())
NNonChurned = len(nonchurned[column].unique())

figs, axs = plt.subplots(figsize=(12,5))
    
axs.hist([churned[column], nonchurned[column]], color=['#DD8452', 'b'])

axs.set_ylabel('Amount of People')
axs.set_title(category_name + ': churned vs existing customers', fontsize=20)
axs.legend(('Churned Customers', 'Existing Customers'))

plt.show()

Again there's a clear difference between the distributions. The average is higher for the existing customers.

### Average Card Utilization Ratio

In [None]:
sns.boxplot(x="Attrition_Flag", y="Avg_Utilization_Ratio",
            hue="Attrition_Flag", palette=["b", "#DD8452"],
            data=data).set_title("Card Utilization vs Churn", fontsize=15)
sns.despine(offset=10, trim=True)

In [None]:
column = "Avg_Utilization_Ratio"
category_name = "Card Utilization Ratio" 

NChurned = len(churned[column].unique())
NNonChurned = len(nonchurned[column].unique())

figs, axs = plt.subplots(figsize=(12,5))
    
axs.hist([churned[column], nonchurned[column]], color=['#DD8452', 'b'])

axs.set_ylabel('Amount of People')
axs.set_title(category_name + ': churned vs existing customers', fontsize=20)
axs.legend(('Churned Customers', 'Existing Customers'))

plt.show()

It's clear that the average card utilization ratio is higher for the existing customers.

<a id="Profiles"></a> 
## 3.2. Non Churn and Churn Profiles

Based on the EDA above, the profiles below can be made. The main difference lies in the "product variables" of the customers. A churning customer tends to be less active than an existing customer, so the most influential parameters are the features related to customer activity.


| Variable | Non-Churning Customer | Churning Customer |
|:----------:|:-------------:|:--------:|
| ***Demographic variables*** | | |
| Age | 47 | 46 |
| Gender | F/M | F/M |
| Dependents | 2 | 2 |
| Education Level | Graduate | Graduate |
| Marital Status | Married/Single | Married/Single |
| Income Category | Less than \$40K | Less than \$40K |
| ***Product variables*** | | |
| Type Of Card | Blue | Blue |
| Length Of Relationship | 36 months | 36 months |
| Products Bought | 4 | 3 |
| Inactive Months | 2 | 3 |
| Number Of Contacts | 2 | 3 |
| Credit Limit | \$8726 | \$8136 |
| Revolving Balance | \$1256 | \$672 |
| Open To Buy Credit Line | \$7470 | \$7463 |
| Transaction Amount Change | 0.77 | 0.69 |
| Total Transaction Amount | \$4650 | \$3095 |
| Total Transaction Count | 69 | 45 |
| Transaction Count Change | 0.74 | 0.55 |
| Card Utilization Ratio | 0.3 | 0.16 |

In [None]:
churned.describe()

In [None]:
nonchurned.describe()
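
A profile table like the one above can also be produced in a single step instead of reading two `describe()` outputs side by side. The sketch below is one possible way, assuming the mean for numeric columns and the most frequent value for categorical ones. The table above may have been derived differently, and the toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be `data`
toy = pd.DataFrame({
    "Attrition_Flag": ["Attrited Customer", "Attrited Customer",
                       "Existing Customer", "Existing Customer", "Existing Customer"],
    "Total_Trans_Ct": [40, 50, 60, 70, 80],
    "Card_Category": ["Blue", "Blue", "Blue", "Silver", "Blue"],
})

# One row per variable, one column per customer group
profile = toy.groupby("Attrition_Flag").agg(
    Total_Trans_Ct=("Total_Trans_Ct", "mean"),                    # numeric: mean
    Card_Category=("Card_Category", lambda s: s.mode().iloc[0]),  # categorical: most frequent
).T

print(profile)
```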

<a id="CustomerChurnPrediction"></a>
# 4. Customer Churn Prediction
Here we will train an optimized (tree-based) model that predicts whether a customer will churn.

<a id="DataPrep"></a>
## 4.1. Data Preparation
Before we start training a model we must prepare our data. Possible steps are:
* Encode all categorical data (be careful with one-hot encoding and tree-based models...).
* Scale the data.
* Check correlation matrix to extract the most influential features.
* Generate new columns from data.
* Upsample the imbalanced dataset (SMOTE/ADASYN).

In this notebook we shall focus on the upsampling method. The data wrangling performed is to make sure that the upsampling is performed in a correct manner.



### SMOTE (Synthetic Minority Oversampling Technique)
We saw that our dataset is imbalanced. This can cause problems when creating a classification model, since the model might not learn the decision boundary for the minority class. This can be addressed with upsampling.

One technique for this is SMOTE, which creates new synthetic samples that can be used for training.

> SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.
[SMOTE Paper](https://arxiv.org/abs/1106.1813)

To use SMOTE we'll need to encode our categorical features.

Note: It's important to upsample only the training data, so that no synthetic data is present in the validation dataset.
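
To make the quoted description concrete, the interpolation step can be sketched in plain NumPy. This is a simplified illustration and not the `imblearn` implementation: the helper `smote_like_sample` and the toy points are hypothetical, and the real algorithm adds details such as per-class sampling strategies.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like_sample(minority, k=3, rng=rng):
    """Create one synthetic point on the segment between a random minority
    point `a` and one of its k nearest minority neighbours `b`."""
    a = minority[rng.integers(len(minority))]
    dists = np.linalg.norm(minority - a, axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # position 0 is the point itself
    b = minority[rng.choice(neighbours)]
    lam = rng.random()                        # interpolation weight in [0, 1)
    return a + lam * (b - a)                  # convex combination of a and b

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = np.array([smote_like_sample(minority) for _ in range(5)])

print(synthetic)
```

Every synthetic point lies between two existing minority points, which is also why the categorical features need to be encoded as numbers first.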

In [None]:
encoder = LabelEncoder()

def make_categorical(data: pd.DataFrame, column: str, categories: list, ordered: bool = False):
    data[column] = pd.Categorical(data[column], categories=categories, ordered=ordered)

In [None]:
make_categorical(data, 'Marital_Status', ['Unknown', 'Single', 'Divorced','Married'])

make_categorical(data, 'Income_Category', ['Unknown','Less than $40K', '$40K - $60K', '$60K - $80K', '$80K - $120K', '$120K +'], True)

make_categorical(data, 'Card_Category', ['Blue', 'Silver', 'Gold', 'Platinum'], True)

In [None]:
data["Attrition_Flag"] = data["Attrition_Flag"].replace({'Attrited Customer':1,'Existing Customer':0})
data["Gender"] = data["Gender"].replace({'F':1,'M':0})

In [None]:
ClassesToEncode = ['Education_Level' ,'Marital_Status', 'Income_Category', 'Card_Category']

In [None]:
for Class in ClassesToEncode:
    data[Class] = encoder.fit_transform(data[Class])
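
`LabelEncoder` assigns integer codes by the sorted order of the labels, so the category order declared with `pd.Categorical` does not determine the resulting numbers. Tree-based models can usually cope with that. If the declared order should be kept, the categorical codes can be used instead. The sketch below shows this alternative on a hypothetical toy series; it is not what the notebook does.

```python
import pandas as pd

# Hypothetical toy values; in the notebook this would be data["Income_Category"]
income = pd.Series(["$40K - $60K", "Less than $40K", "$120K +", "Unknown"])

order = ["Unknown", "Less than $40K", "$40K - $60K",
         "$60K - $80K", "$80K - $120K", "$120K +"]
income_cat = pd.Series(pd.Categorical(income, categories=order, ordered=True))

# .cat.codes returns each value's position in `order`
codes = income_cat.cat.codes

print(codes.tolist())
```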

In [None]:
y_data = data["Attrition_Flag"]
X_data = data.drop(columns = ["Attrition_Flag"])

In [None]:
# Hold out 30% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=0)

In [None]:
# Transform the dataset (only training data)
oversample = SMOTE()
X_up, y_up = oversample.fit_resample(X_train, y_train)
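
The split above does not pass `stratify`, so the churn share in the training and test sets can drift slightly from the overall share. With imbalanced data, a stratified split is a common alternative. The sketch below demonstrates it on hypothetical toy arrays and is not part of the original pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 20% positives
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 20 + [0] * 80)

# stratify=y keeps the class proportions (almost) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print("Train churn share:", y_tr.mean())
print("Test churn share:", y_te.mean())
```

Passing a fixed `random_state` to `SMOTE` would similarly make the upsampling step reproducible.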

<a id="ModelTraining"></a>
## 4.2. Model Training
Two tree-based models are trained on the upsampled training data: a RandomForestClassifier and an XGBClassifier.

### RandomForestClassifier

In [None]:
rf = RandomForestClassifier()
rf.fit(X_up, y_up)

### XGBClassifier

In [None]:
xgb = XGBClassifier()
xgb.fit(X_up, y_up)

<a id="ModelEvaluation"></a>
## 4.3. Model Evaluation

### RandomForestClassifier

In [None]:
rfpred = rf.predict(X_test)
print(classification_report(y_test, rfpred))

In [None]:

# Evaluate the random forest on the untouched test set
ypred = rfpred
model = rf
print('Confusion Matrix:')
print(confusion_matrix(y_test, ypred))
print('\nAccuracy:', accuracy_score(y_test, ypred))
print("Overall Precision:", precision_score(y_test, ypred))
print("Overall Recall:", recall_score(y_test, ypred))
print("Overall f1-score:", f1_score(y_test, ypred))
# AUC is computed here from the hard class predictions
auc = roc_auc_score(y_test, ypred)
print("AUC:", auc)

### XGBClassifier

In [None]:
xgbpred = xgb.predict(X_test)
print(classification_report(y_test, xgbpred))

In [None]:

# Evaluate XGBoost on the untouched test set
ypred = xgbpred
model = xgb
print('Confusion Matrix:')
print(confusion_matrix(y_test, ypred))
print('\nAccuracy:', accuracy_score(y_test, ypred))
print("Overall Precision:", precision_score(y_test, ypred))
print("Overall Recall:", recall_score(y_test, ypred))
print("Overall f1-score:", f1_score(y_test, ypred))
# AUC is computed here from the hard class predictions
auc = roc_auc_score(y_test, ypred)
print("AUC:", auc)

It's clear that the XGBClassifier performs better.
With a recall of 92.5 % we clearly reached our goal (recall > 0.62).

<a id="Hyperparameter"></a>
## 4.4. Hyperparameter tuning

### RandomizedSearchCV

First we'll use RandomizedSearchCV to narrow down the most promising parameters. GridSearchCV will then be used for further fine-tuning.


In [None]:
# Tuning hyperparameters with RandomizedSearchCV

#params = {
#    "colsample_bytree": uniform(0.3, 0.7),
#    "min_child_weight": [1,2,3,4],
#    "learning_rate": uniform(0.1, 0.5), # default 0.1 
#    "max_depth": randint(6, 9), # default 3
#    "n_estimators": randint(100, 300), # default 100
#    "subsample": uniform(0.6, 0.4)
#}
#xgbnew = XGBClassifier()

#search = RandomizedSearchCV(xgbnew, param_distributions=params, random_state=123, n_iter=100, cv=3, verbose=2, n_jobs=-1)

#search.fit(X_up, y_up)

In [None]:
#search.best_params_

In [None]:
#myxgb = search.best_estimator_
#thisypred = myxgb.predict(X_test)

#print(classification_report(y_test, thisypred))
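
The search above does not set `scoring`, so candidates are ranked by the estimator's default score, which is accuracy for classifiers. Since the goal of this notebook is recall, one option is to pass `scoring="recall"`. The sketch below shows this on synthetic data with a small random forest to keep it quick. The data, the parameter ranges and the model are assumptions for illustration, not the search that produced the parameters reported here.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the training data (about 16% positives)
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.84, 0.16], random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(20, 60),
        "max_depth": randint(2, 6),
    },
    n_iter=5,
    cv=3,
    scoring="recall",   # rank candidates by recall of the positive class
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print("Best cross-validated recall:", search.best_score_)
```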

In [None]:
# Parameters from RandomizedSearchCV
#{'colsample_bytree': 0.7025947001725772,
# 'learning_rate': 0.2612838738188591,
# 'max_depth': 7,
# 'min_child_weight': 1,
# 'n_estimators': 229,
# 'subsample': 0.8518910536188189}

myxgb = XGBClassifier(
    colsample_bytree=0.7025947001725772,
    learning_rate=0.2612838738188591,
    max_depth=7,
    min_child_weight=1,
    n_estimators=229,
    subsample=0.8518910536188189,
)
myxgb.fit(X_up, y_up)
thisypred = myxgb.predict(X_test)


ypred = thisypred
model = myxgb
print ('Confusion Matrix:')
print(confusion_matrix(y_test, ypred))
print('Accuracy:', accuracy_score(y_test, ypred))
print("Overall Precision:",precision_score(y_test, ypred))
print("Overall Recall:",recall_score(y_test, ypred))
auc = roc_auc_score(y_test,ypred)

print("AUC:", auc)
plt.show()

### GridSearchCV

In [None]:
# GridSearchCV for finetuning

#params = {
#    "colsample_bytree": [0.670, 0.680, 0.690],
#    "min_child_weight": [1],
#    "learning_rate": [0.275, 0.3, 0.325], # default 0.1 
#    "max_depth": [7,8,9], # default 3
#    "n_estimators": [212, 215, 217], # default 100
#    "subsample": [0.75, 0.80, 0.85]
#}

#gridxgb = XGBClassifier()

#gridsearch = GridSearchCV(estimator = gridxgb, param_grid = params, cv = 3, n_jobs = -1, verbose = 2)

#gridsearch.fit(X_up, y_up)

In [None]:
#gridsearch.best_params_

In [None]:
#mymodel = gridsearch.best_estimator_
#mymodelpred = mymodel.predict(X_test)
#recall_score(y_test, mymodelpred)

In [None]:
# {'colsample_bytree': 0.67,
# 'learning_rate': 0.3,
# 'max_depth': 8,
# 'min_child_weight': 1,
# 'n_estimators': 215,
# 'subsample': 0.8}

mymodel = XGBClassifier(
    colsample_bytree=0.67,
    learning_rate=0.3,
    max_depth=8,
    min_child_weight=1,
    n_estimators=215,
    subsample=0.8,
)
mymodel.fit(X_up, y_up)
mymodelpred = mymodel.predict(X_test)

ypred = mymodelpred
model = mymodel
print ('Confusion Matrix:')
print(confusion_matrix(y_test, ypred))
print('Accuracy:', accuracy_score(y_test, ypred))
print("Overall Precision:",precision_score(y_test, ypred))
print("Overall Recall:",recall_score(y_test, ypred))
auc = roc_auc_score(y_test,ypred)

print("AUC:", auc)
plt.show()

<a id="FeatureImportance"></a>
## 4.5. Feature Importance
In this step we'll have a look at the relative importance of each feature used in the predictions.

In [None]:
resultmymodel = permutation_importance(mymodel, X_test, y_test, n_repeats=10,
                                random_state=42, n_jobs=2)
sorted_idx = resultmymodel.importances_mean.argsort()

fig, ax = plt.subplots(figsize=(10,10))
ax.boxplot(resultmymodel.importances[sorted_idx].T,
           vert=False, labels=X_test.columns[sorted_idx])
ax.set_title("Permutation Importances (test set)")
fig.tight_layout()
plt.show()

As we noticed in the EDA, the top 3 most important features are within the product variables, more specifically: "Total_Trans_Ct", "Total_Trans_Amt", "Total_Amt_Chng_Q4_Q1".
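
The ranking behind such a plot can also be read as a table of mean importances. The sketch below uses synthetic data and a small random forest as stand-ins (both are assumptions, not the tuned XGBoost model) and scores the permutations on recall, the metric this notebook cares about.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Each feature is shuffled n_repeats times; the drop in recall is its importance
result = permutation_importance(model, X_te, y_te, n_repeats=5,
                                random_state=0, scoring="recall")

importance_table = pd.DataFrame(
    {"mean": result.importances_mean, "std": result.importances_std},
    index=X_te.columns,
).sort_values("mean", ascending=False)

print(importance_table)
```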

<a id="Conclusion"></a>
# 5. Conclusion

We can conclude that the top 3 most influential features are the product variables: "Total_Trans_Ct", "Total_Trans_Amt", "Total_Amt_Chng_Q4_Q1". Using the existing data we managed to train a model with upsampled data which reaches a recall score of 92%.


### Future improvements
* Use a correlation matrix in the EDA to find the most influential features.
* Use an iterative imputer to get rid of the "Unknown" values?
* Use PCA for dimensionality reduction.
* Create a training and inference pipeline.
* Upsample the data with ADASYN instead of SMOTE.
