Introduction: Credit cards are a good source of income for banks due to the different types of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees, among others.

Some fees are charged to all users, regardless of usage, while others are charged under certain circumstances.

Customers abandoning credit card services is a loss to the bank. The bank therefore wants to analyze customer data to identify which customers leave and why, so that it can improve in those areas.

A churn model is a mathematical representation of how churn affects your business. Churn calculations are built from existing data (the number of customers who abandoned your service during a given time period). A predictive churn model extrapolates this data to show churn rates of potential customers in the future.

Churn (also known as customer defection) is a problem for subscription companies. When your revenue is based on recurring monthly or annual contracts, every customer that leaves puts a dent in your cash flow. High retention rates are vital to your survival. What if we told you that there is a way to predict, at least to some extent, how and when your customers will cancel?

Creating a churn prediction model will help you make proactive changes to your retention efforts that reduce churn rates. Understanding the impact of churn on your current revenue goals and making predictions about how to manage those issues in the future also helps you stem the flow of lost customers. If you don't take action against churn now, any business growth you experience will not be sustainable.

Comprehensive customer profiles help you see what types of customers are canceling their accounts. Now is the time to find out how and why they churn. Ask yourself the following questions to learn more about the pain points in your product and customer experience that cause customers to decide to churn.

What is churn? Customer churn (or customer attrition) is the tendency of customers to abandon a brand and stop being paying customers of a given company. The percentage of customers who stop using a company's products or services over a given period of time is called the customer churn rate (attrition). One way to calculate the attrition rate is to divide the number of customers lost during a given time interval by the number of customers acquired, and then multiply that figure by 100%. For example, if you have acquired 150 customers and lost three in the past month, your monthly churn rate is 2%.
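The calculation described above can be sketched in a few lines. The helper name `churn_rate` is hypothetical and not part of this notebook. Note that many definitions use the number of customers at the start of the period as the denominator rather than the number acquired; the arithmetic is the same either way.

```python
def churn_rate(customers_lost: int, customer_base: int) -> float:
    """Return churn as a percentage of the chosen customer base."""
    if customer_base <= 0:
        raise ValueError("customer_base must be positive")
    return 100.0 * customers_lost / customer_base


# The example from the text: 3 customers lost out of 150
monthly_churn = churn_rate(3, 150)
print(f"Monthly churn rate: {monthly_churn:.1f}%")
```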

The churn rate is a health indicator for companies whose customers subscribe and pay for services on a recurring basis, because such customers remain open to more interesting or advantageous offers. In addition, each time their current engagement ends, customers have the opportunity to reconsider and choose not to continue with the company. Of course, some natural churn is inevitable, and the figure differs from industry to industry. But a churn figure above the norm for the industry is a sign that a company is doing something wrong.

There are many things brands can do wrong, from complicated onboarding where customers are not given easy-to-understand information about product usage and capabilities, to poor communication, e.g., lack of feedback or delayed responses to queries. Another situation: Long-time customers may feel unappreciated because they do not receive as many bonuses as new customers.

Objective Explore and visualize the data set. Build a classification model to predict whether or not the customer will drop out. Optimize the model using appropriate techniques. Generate a set of ideas and recommendations to help the bank.

Problem Does income have any effect on dropout? Does gender have any relationship with dropout? What are the signs of attrition?

Members Martin Viera Matias Franco

Problem Indication of the source of the dataset and the selection criteria (Data Acquisition)

Dataset https://www.kaggle.com/code/xavier14/predicting-churn-with-tree-based-models





Data Dictionary
CLIENTNUM: Client number. Unique identifier for the customer holding the account
 
Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
 
Customer_Age: Age in Years
 
Gender: Gender of the account holder
 
Dependent_count: Number of dependents
 
Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to a college student), Post-Graduate, Doctorate.
 
Marital_Status: Marital Status of the account holder
 
Income_Category: Annual Income Category of the account holder
 
Card_Category: Type of Card
 
Months_on_book: Period of relationship with the bank
 
Total_Relationship_Count: Total no. of products held by the customer
 
Months_Inactive_12_mon: No. of months inactive in the last 12 months
 
Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
 
Credit_Limit: Credit Limit on the Credit Card
 
Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
 
Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
 
Total_Trans_Amt: Total Transaction Amount (Last 12 months)
 
Total_Trans_Ct: Total Transaction Count (Last 12 months)
 
Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
 
Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
 
Avg_Utilization_Ratio: Represents how much of the available credit the customer spent
 

In [1]:
# To help with reading and manipulating data
import pandas as pd
import numpy as np
!pip install pyforest
from pyforest import *
lazy_imports()
import pyforest
import warnings
warnings.filterwarnings("ignore")
from sklearn import metrics
from sklearn.metrics import accuracy_score

# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl

# To be used for missing value imputation
from sklearn.impute import SimpleImputer

# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
    )
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

!pip install lightgbm
import lightgbm as lgb

from sklearn.dummy import DummyClassifier

# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    plot_confusion_matrix,
    plot_roc_curve,
)

# To be used for data scaling and encoding
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    OneHotEncoder,
    RobustScaler,
)
from sklearn.impute import SimpleImputer

# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import TransformerMixin
# To be used for normalization, dimensionality reduction, and clustering
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To suppress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# set the background for the graphs
plt.style.use("ggplot")

# For pandas profiling
!pip install pandas-profiling
from pandas_profiling import ProfileReport

# Printing style
!pip install tabulate
from tabulate import tabulate

# date time
from datetime import datetime

# To suppress warnings (the warnings module is already imported above)
warnings.filterwarnings("ignore")



In [2]:
# Loading the dataset
churner =pd.read_csv("/content/sample_data/credit_card_churn.csv")

In [3]:
# Checking the number of rows and columns in the data
churner.shape

(10127, 23)

In [4]:
additional_droppable_columns = [
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
    'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'
]

# Drop the precomputed Naive Bayes columns if present; errors="ignore" skips missing ones
churner = churner.drop(columns=additional_droppable_columns, errors="ignore")

In [5]:
# Creating a copy dataset for analysis

data = churner.copy()

In [6]:
# let's view the first 5 rows of the data
data.head()

Unnamed: 0,CLIENTNUM,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
0,768805383,Existing Customer,45,M,3,High School,Married,$60K - $80K,Blue,39,5,1,3,12691.0,777,11914.0,1.335,1144,42,1.625,0.061
1,818770008,Existing Customer,49,F,5,Graduate,Single,Less than $40K,Blue,44,6,1,2,8256.0,864,7392.0,1.541,1291,33,3.714,0.105
2,713982108,Existing Customer,51,M,3,Graduate,Married,$80K - $120K,Blue,36,4,1,0,3418.0,0,3418.0,2.594,1887,20,2.333,0.0
3,769911858,Existing Customer,40,F,4,High School,Unknown,Less than $40K,Blue,34,3,4,1,3313.0,2517,796.0,1.405,1171,20,2.333,0.76
4,709106358,Existing Customer,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,5,1,0,4716.0,0,4716.0,2.175,816,28,2.5,0.0


In [7]:
# let's view the last 5 rows of the data
data.tail()

Unnamed: 0,CLIENTNUM,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
10122,772366833,Existing Customer,50,M,2,Graduate,Single,$40K - $60K,Blue,40,3,2,3,4003.0,1851,2152.0,0.703,15476,117,0.857,0.462
10123,710638233,Attrited Customer,41,M,2,Unknown,Divorced,$40K - $60K,Blue,25,4,2,3,4277.0,2186,2091.0,0.804,8764,69,0.683,0.511
10124,716506083,Attrited Customer,44,F,1,High School,Married,Less than $40K,Blue,36,5,3,4,5409.0,0,5409.0,0.819,10291,60,0.818,0.0
10125,717406983,Attrited Customer,30,M,2,Graduate,Unknown,$40K - $60K,Blue,36,4,3,3,5281.0,0,5281.0,0.535,8395,62,0.722,0.0
10126,714337233,Attrited Customer,43,F,2,Graduate,Married,Less than $40K,Silver,25,6,2,4,10388.0,1961,8427.0,0.703,10294,61,0.649,0.189


In [8]:
# let's check the Data Types of the columns in the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           10127 non-null  object 
 6   Marital_Status            10127 non-null  object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_

-There are a total of 21 columns and 10,127 observations in the dataset
-All columns listed above show 10,127 non-null values, i.e. no nulls are reported. Education_Level and Marital_Status do, however, contain the placeholder value Unknown.

In [9]:
# let's check for duplicate values in the data
data.duplicated().sum()

0

In [10]:
# let's check for missing values in the data
df_null_summary = pd.concat(
    [data.isnull().sum(), data.isnull().sum() * 100 / data.isnull().count()], axis=1
)
df_null_summary.columns = ["Null Record Count", "Percentage of Null Records"]
df_null_summary[df_null_summary["Null Record Count"] > 0].sort_values(
    by="Percentage of Null Records", ascending=False
).style.background_gradient(cmap="YlOrRd")

Unnamed: 0,Null Record Count,Percentage of Null Records


No missing values

Let's check the number of unique values in each column

In [11]:
data.select_dtypes(include="object").nunique()

Attrition_Flag     2
Gender             2
Education_Level    7
Marital_Status     4
Income_Category    6
Card_Category      4
dtype: int64

In [12]:
#Unique values for Numerical Columns
data.select_dtypes(exclude="object").nunique()

CLIENTNUM                   10127
Customer_Age                   45
Dependent_count                 6
Months_on_book                 44
Total_Relationship_Count        6
Months_Inactive_12_mon          7
Contacts_Count_12_mon           7
Credit_Limit                 6205
Total_Revolving_Bal          1974
Avg_Open_To_Buy              6813
Total_Amt_Chng_Q4_Q1         1158
Total_Trans_Amt              5033
Total_Trans_Ct                126
Total_Ct_Chng_Q4_Q1           830
Avg_Utilization_Ratio         964
dtype: int64

Customer_Age has only 45 unique values, which is expected for ages recorded in whole years; on its own this says little about how similar the customers' ages are

In [13]:
# let's view the statistical summary of the numerical columns in the data
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
CLIENTNUM,10127.0,739177606.334,36903783.45,708082083.0,713036770.5,717926358.0,773143533.0,828343083.0
Customer_Age,10127.0,46.326,8.017,26.0,41.0,46.0,52.0,73.0
Dependent_count,10127.0,2.346,1.299,0.0,1.0,2.0,3.0,5.0
Months_on_book,10127.0,35.928,7.986,13.0,31.0,36.0,40.0,56.0
Total_Relationship_Count,10127.0,3.813,1.554,1.0,3.0,4.0,5.0,6.0
Months_Inactive_12_mon,10127.0,2.341,1.011,0.0,2.0,2.0,3.0,6.0
Contacts_Count_12_mon,10127.0,2.455,1.106,0.0,2.0,2.0,3.0,6.0
Credit_Limit,10127.0,8631.954,9088.777,1438.3,2555.0,4549.0,11067.5,34516.0
Total_Revolving_Bal,10127.0,1162.814,814.987,0.0,359.0,1276.0,1784.0,2517.0
Avg_Open_To_Buy,10127.0,7469.14,9090.685,3.0,1324.5,3474.0,9859.0,34516.0


-Mean value for the Customer Age column is approx 46 and the median is also 46. A mean equal to the median suggests a roughly symmetric distribution, with about half of the customers aged 46 or younger.
-Dependent Count column has mean and median of ~2
-Months on Book column has mean and median of 36 months. Minimum value is 13 months, showing that the dataset only captures customers who have been with the bank for at least one full year
-Total Relationship Count has mean and median of ~4
-Credit Limit has a wide range of 1.4K to 34.5K, with the median (4.5K) well below the mean (8.6K), which points to a right-skewed distribution
-Total Transaction Count has mean of ~65 and median of 67
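The gap between mean and median noted for Credit Limit is a quick indicator of skew. The sketch below illustrates the idea on made-up values (they are not taken from the dataset): a few very large values pull the mean up while the median stays put.

```python
import statistics

# Hypothetical credit limits: mostly small values plus a few very large ones
credit_limits = [1500, 2000, 2500, 3000, 4500, 5000, 9000, 30000, 34000]

mean_limit = statistics.mean(credit_limits)
median_limit = statistics.median(credit_limits)

# A mean well above the median suggests a long right tail
is_right_skewed = mean_limit > median_limit
print(f"mean={mean_limit:.1f}, median={median_limit:.1f}, right-skewed={is_right_skewed}")
```

On the notebook's DataFrame, the same comparison could be made with `data["Credit_Limit"].mean()` and `data["Credit_Limit"].median()`.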

In [14]:
# let's view the statistical summary of the categorical columns in the data
data.describe(include="object").T

Unnamed: 0,count,unique,top,freq
Attrition_Flag,10127,2,Existing Customer,8500
Gender,10127,2,F,5358
Education_Level,10127,7,Graduate,3128
Marital_Status,10127,4,Married,4687
Income_Category,10127,6,Less than $40K,3561
Card_Category,10127,4,Blue,9436


In [15]:
# Below function prints unique value counts and percentages for the category/object type variables


def category_unique_value():
    for cat_cols in (
        data.select_dtypes(exclude=[np.int64, np.float64]).columns.unique().to_list()
    ):
        print("Unique values and corresponding data counts for feature: " + cat_cols)
        print("-" * 90)
        df_temp = pd.concat(
            [
                data[cat_cols].value_counts(),
                data[cat_cols].value_counts(normalize=True) * 100,
            ],
            axis=1,
        )
        df_temp.columns = ["Count", "Percentage"]
        print(df_temp)
        print("-" * 90)

In [16]:
category_unique_value()

Unique values and corresponding data counts for feature: Attrition_Flag
------------------------------------------------------------------------------------------
                   Count  Percentage
Existing Customer   8500      83.934
Attrited Customer   1627      16.066
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Gender
------------------------------------------------------------------------------------------
   Count  Percentage
F   5358      52.908
M   4769      47.092
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Education_Level
------------------------------------------------------------------------------------------
               Count  Percentage
Graduate        3128      30.888
High School     2013      19.878
Unknown         1519      15.000
Uneducated      1487      14.6

The target variable Attrition Flag has an Existing to Attrited ratio of 83.9 : 16.1, so the dataset is imbalanced
~93% of customers have a Blue card
Income Category has the value abc for 10% of records, which we'll change to Unknown
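One practical consequence of this imbalance, sketched below with the class counts printed above: a trivial model that always predicts "Existing Customer" already reaches roughly 84% accuracy while catching no churners at all. This is an illustration of why accuracy alone can mislead here, not a step of the original analysis.

```python
# Class counts taken from the Attrition_Flag output above
existing, attrited = 8500, 1627
total = existing + attrited

# A trivial model that always predicts the majority class ("Existing Customer")
baseline_accuracy = existing / total
# Such a model never flags a churner, so recall on the attrited class is zero
baseline_recall_attrited = 0 / attrited

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Baseline recall (attrited): {baseline_recall_attrited:.3f}")
```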

Pre-EDA data processing

In [17]:
#drop column Clientnum
data.drop(columns=["CLIENTNUM"], inplace=True)

Treating missing values in Education Level and Marital Status
Note:   Missing value treatment should normally be done after splitting the data into Train, Validation and Test sets, so that no information from the validation or test rows leaks into the imputation. In this case, however, the treatment is generic: we fill with the constant Unknown, which does not depend on any statistic computed from the data. Thus, the treatment can be done on the overall dataset. The same strategy applies to the Income Category value abc.
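To make the distinction in the note concrete, here is a minimal sketch on a hypothetical toy frame (the values and the 4/4 split are made up). A constant fill gives the same result whether it is applied before or after splitting, whereas a learned fill such as the mode has to be computed from the training rows only.

```python
import pandas as pd

# Hypothetical toy frame; None marks a missing marital status
toy = pd.DataFrame(
    {"Marital_Status": ["Married", None, "Single", "Married", None, "Single", "Single", "Single"]}
)
train, test = toy.iloc[:4], toy.iloc[4:]

# Constant fill: uses no statistic from the data, so it is safe before splitting
constant_filled = toy["Marital_Status"].fillna("Unknown")

# Learned fill: the mode must come from the training rows only
train_mode = train["Marital_Status"].mode()[0]
test_filled = test["Marital_Status"].fillna(train_mode)

# Computing the mode on the full data would give a different answer (leakage)
full_mode = toy["Marital_Status"].mode()[0]
print(train_mode, full_mode)
```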

In [18]:
data["Education_Level"] = data["Education_Level"].fillna("Unknown")
data["Marital_Status"] = data["Marital_Status"].fillna("Unknown")

Treating Income Category = abc

In [19]:
data.loc[data["Income_Category"] == "abc", "Income_Category"] = "Unknown"

In [20]:
#Checking operation outcome
category_unique_value()

Unique values and corresponding data counts for feature: Attrition_Flag
------------------------------------------------------------------------------------------
                   Count  Percentage
Existing Customer   8500      83.934
Attrited Customer   1627      16.066
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Gender
------------------------------------------------------------------------------------------
   Count  Percentage
F   5358      52.908
M   4769      47.092
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Education_Level
------------------------------------------------------------------------------------------
               Count  Percentage
Graduate        3128      30.888
High School     2013      19.878
Unknown         1519      15.000
Uneducated      1487      14.6

In [21]:
# let's check for missing values in the data
df_null_summary = pd.concat(
    [data.isnull().sum(), data.isnull().sum() * 100 / data.isnull().count()], axis=1
)
df_null_summary.columns = ["Null Record Count", "Percentage of Null Records"]
df_null_summary[df_null_summary["Null Record Count"] > 0].sort_values(
    by="Percentage of Null Records", ascending=False
).style.background_gradient(cmap="YlOrRd")

Unnamed: 0,Null Record Count,Percentage of Null Records


All the null data values have been treated along with the incorrect/junk data in Income Category column

Data type conversions
Converting the data type of the category variables from object/float to category

In [22]:
category_columns = data.select_dtypes(include="object").columns.tolist()
data[category_columns] = data[category_columns].astype("category")

Standardizing column names
Removing the spaces from column names, and standardizing the column names to lower case

In [23]:
data.columns = [i.replace(" ", "_").lower() for i in data.columns]

In [24]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   attrition_flag            10127 non-null  category
 1   customer_age              10127 non-null  int64   
 2   gender                    10127 non-null  category
 3   dependent_count           10127 non-null  int64   
 4   education_level           10127 non-null  category
 5   marital_status            10127 non-null  category
 6   income_category           10127 non-null  category
 7   card_category             10127 non-null  category
 8   months_on_book            10127 non-null  int64   
 9   total_relationship_count  10127 non-null  int64   
 10  months_inactive_12_mon    10127 non-null  int64   
 11  contacts_count_12_mon     10127 non-null  int64   
 12  credit_limit              10127 non-null  float64 
 13  total_revolving_bal       10127 non-null  int6

We'll move on to data analysis now

Exploratory Data Analysis
Univariate Analysis
Numerical Feature Summary
The first step of univariate analysis is to check the distribution/spread of the data. This is done primarily using histograms and box plots. Additionally, we'll plot each numerical feature on a violin plot and a cumulative density distribution plot. To produce these 4 kinds of plots for each numerical attribute, we build the summary() function below. It also displays the feature-wise 5 point summary.

In [25]:
def summary(data: pd.DataFrame, x: str):
    """
    Print the 5 point summary of a numerical feature and plot its
    histogram, box plot, violin plot, and cumulative density
    distribution.

    Parameters:
    ----------

    data: pd.DataFrame, the dataset containing the feature
    x: str, feature name

    Usage:
    ------------

    summary(data, 'customer_age')
    """
    x_min = data[x].min()
    x_max = data[x].max()
    Q1 = data[x].quantile(0.25)
    Q2 = data[x].quantile(0.50)
    Q3 = data[x].quantile(0.75)

    five_point = {"Min": x_min, "Q1": Q1, "Q2": Q2, "Q3": Q3, "Max": x_max}  # avoid shadowing the built-in dict
    df = pd.DataFrame(data=five_point, index=["Value"])
    print(f"5 Point Summary of {x.capitalize()} Attribute:\n")
    print(tabulate(df, headers="keys", tablefmt="psql"))

    fig = plt.figure(figsize=(16, 8))
    plt.subplots_adjust(hspace=0.6)
    sns.set_palette("Pastel1")

    plt.subplot(221, frameon=True)
    ax1 = sns.histplot(data[x], kde=True, stat="density", color="purple")  # distplot is deprecated in seaborn
    ax1.axvline(
        np.mean(data[x]), color="purple", linestyle="--"
    )  # Add mean to the histogram
    ax1.axvline(
        np.median(data[x]), color="black", linestyle="-"
    )  # Add median to the histogram
    plt.title(f"{x.capitalize()} Density Distribution")

    plt.subplot(222, frameon=True)
    ax2 = sns.violinplot(x=data[x], palette="Accent", split=True)
    plt.title(f"{x.capitalize()} Violinplot")

    plt.subplot(223, frameon=True, sharex=ax1)
    ax3 = sns.boxplot(
        x=data[x], palette="cool", width=0.7, linewidth=0.6, showmeans=True
    )
    plt.title(f"{x.capitalize()} Boxplot")

    plt.subplot(224, frameon=True, sharex=ax2)
    ax4 = sns.kdeplot(data[x], cumulative=True)
    plt.title(f"{x.capitalize()} Cumulative Density Distribution")

    plt.show()

In [26]:
summary(data, "customer_age")

5 Point Summary of Customer_age Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |    26 |   41 |   46 |   52 |    73 |
+-------+-------+------+------+------+-------+


The data is approximately normally distributed, with only 2 outliers on the right side (higher end)

In [27]:
summary(data, "dependent_count")

5 Point Summary of Dependent_count Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |    1 |    2 |    3 |     5 |
+-------+-------+------+------+------+-------+


Dependent Count is mostly 2 or 3

In [28]:
summary(data, "months_on_book")

5 Point Summary of Months_on_book Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |    13 |   31 |   36 |   40 |    56 |
+-------+-------+------+------+------+-------+


The median tenure is 36 months, i.e. a typical customer has been on the books for about 3 years
There are outliers on both lower and higher end
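Box plots conventionally flag points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to the Months_on_book summary above; the helper name `tukey_fences` is hypothetical and not part of the notebook.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences used by the standard box-plot rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and extremes from the Months_on_book summary above
lower, upper = tukey_fences(q1=31, q3=40)
has_low_outliers = 13 < lower    # minimum is 13
has_high_outliers = 56 > upper   # maximum is 56
print(lower, upper, has_low_outliers, has_high_outliers)
```

Both the minimum and the maximum fall outside the fences, which is consistent with outliers appearing on both ends of the box plot.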

In [29]:
summary(data, "total_relationship_count")

5 Point Summary of Total_relationship_count Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     1 |    3 |    4 |    5 |     6 |
+-------+-------+------+------+------+-------+


At least half of the customers hold 4 or more products with the bank (the median is 4)

In [30]:
summary(data, "months_inactive_12_mon")

5 Point Summary of Months_inactive_12_mon Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |    2 |    2 |    3 |     6 |
+-------+-------+------+------+------+-------+


There are lower and higher end outliers for Months inactive in last 12 months
Lower end outliers are not concerning, since a value of 0 means the customer was active in every month. The customers who were inactive for 5 or more months are the ones to be concerned about.

In [31]:
summary(data, "contacts_count_12_mon")

5 Point Summary of Contacts_count_12_mon Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |    2 |    2 |    3 |     6 |
+-------+-------+------+------+------+-------+


Again, lower and higher end outliers are noticed.
Here, customers with few contacts with the bank would be interesting to check

In [32]:
summary(data, "credit_limit")

5 Point Summary of Credit_limit Attribute:

+-------+--------+------+------+---------+-------+
|       |    Min |   Q1 |   Q2 |      Q3 |   Max |
|-------+--------+------+------+---------+-------|
| Value | 1438.3 | 2555 | 4549 | 11067.5 | 34516 |
+-------+--------+------+------+---------+-------+


There are higher end outliers in Credit Limit. This might be because the customers are high end.

In [33]:
data[data["credit_limit"] > 23000]["income_category"].value_counts(normalize=True)

$80K - $120K     0.421
$120K +          0.302
$60K - $80K      0.156
Unknown          0.110
$40K - $60K      0.012
Less than $40K   0.000
Name: income_category, dtype: float64

In [34]:
data[data["credit_limit"] > 23000]["card_category"].value_counts(normalize=True)

Blue       0.592
Silver     0.310
Gold       0.083
Platinum   0.015
Name: card_category, dtype: float64

Among customers with a credit limit above 23K, ~88% earn $60K or more, and ~90% hold a Blue or Silver card

In [35]:
summary(data, "total_revolving_bal")

5 Point Summary of Total_revolving_bal Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |  359 | 1276 | 1784 |  2517 |
+-------+-------+------+------+------+-------+


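A total revolving balance of 0 means the customer carries no balance over from one month to the next, either because the card is unused or because the balance is paid in full each month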

In [36]:
summary(data, "avg_open_to_buy")

5 Point Summary of Avg_open_to_buy Attribute:

+-------+-------+--------+------+------+-------+
|       |   Min |     Q1 |   Q2 |   Q3 |   Max |
|-------+-------+--------+------+------+-------|
| Value |     3 | 1324.5 | 3474 | 9859 | 34516 |
+-------+-------+--------+------+------+-------+


Average Open to Buy has many higher end outliers, which means there are customers who use only a very small part of their credit limit
Data is right skewed

In [37]:
summary(data, "total_amt_chng_q4_q1")

5 Point Summary of Total_amt_chng_q4_q1 Attribute:

+-------+-------+-------+-------+-------+-------+
|       |   Min |    Q1 |    Q2 |    Q3 |   Max |
|-------+-------+-------+-------+-------+-------|
| Value |     0 | 0.631 | 0.736 | 0.859 | 3.397 |
+-------+-------+-------+-------+-------+-------+


Outliers are on both higher and lower end

In [38]:
summary(data, "total_trans_amt")

5 Point Summary of Total_trans_amt Attribute:

+-------+-------+--------+------+------+-------+
|       |   Min |     Q1 |   Q2 |   Q3 |   Max |
|-------+-------+--------+------+------+-------|
| Value |   510 | 2155.5 | 3899 | 4741 | 18484 |
+-------+-------+--------+------+------+-------+


Total Transaction Amount has lots of higher end outliers

In [39]:
summary(data, "total_trans_ct")

5 Point Summary of Total_trans_ct Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |    10 |   45 |   67 |   81 |   139 |
+-------+-------+------+------+------+-------+


In [40]:
summary(data, "total_ct_chng_q4_q1")

5 Point Summary of Total_ct_chng_q4_q1 Attribute:

+-------+-------+-------+-------+-------+-------+
|       |   Min |    Q1 |    Q2 |    Q3 |   Max |
|-------+-------+-------+-------+-------+-------|
| Value |     0 | 0.582 | 0.702 | 0.818 | 3.714 |
+-------+-------+-------+-------+-------+-------+


Outliers are on both higher and lower end

In [41]:
summary(data, "avg_utilization_ratio")

5 Point Summary of Avg_utilization_ratio Attribute:

+-------+-------+-------+-------+-------+-------+
|       |   Min |    Q1 |    Q2 |    Q3 |   Max |
|-------+-------+-------+-------+-------+-------|
| Value |     0 | 0.023 | 0.176 | 0.503 | 0.999 |
+-------+-------+-------+-------+-------+-------+


Average utilization is right skewed

Percentage on bar chart for Categorical Features
For the categorical variables, it is best to analyze them as a percentage of the total on bar charts. The function below takes a list of category columns as input and plots a bar chart for each, with percentages on top of each bar.

In [42]:
# Below code plots grouped bar for each categorical feature

def perc_on_bar(data: pd.DataFrame, cat_columns, target, hue=None, perc=True):
    '''
    The function takes a category column as input and plots bar chart with percentages on top of each bar
    
    Usage:
    ------
    
    perc_on_bar(df, ['age'], 'prodtaken')
    '''
    
    subplot_cols = 2
    subplot_rows = int(len(cat_columns)/2 + 1)
    plt.figure(figsize=(16,3*subplot_rows))
    for i, col in enumerate(cat_columns):
        plt.subplot(subplot_rows,subplot_cols,i+1)
        order = data[col].value_counts(ascending=False).index  # Data order  
        ax=sns.countplot(data=data, x=col, palette = 'crest', order=order, hue=hue);
        for p in ax.patches:
            percentage = '{:.1f}%\n({})'.format(100 * p.get_height()/len(data[target]), p.get_height())
            # Added percentage and actual value
            x = p.get_x() + p.get_width() / 2
            y = p.get_y() + p.get_height() + 40
            if perc:
                plt.annotate(percentage, (x, y), ha='center', color='black', fontsize='medium'); # Annotation on top of bars
        # Style the axis once per subplot rather than once per bar
        plt.xticks(color='black', fontsize='medium', rotation= (-90 if col=='region' else 0));
        plt.tight_layout()
        plt.title(col.capitalize() + ' Percentage Bar Charts\n\n')

In [43]:
category_columns = data.select_dtypes(include="category").columns.tolist()
target_variable = "attrition_flag"
perc_on_bar(data, category_columns, target_variable)

High imbalance in the data: the ratio of existing to attrited customers is 84:16
Data is almost equally distributed between males and females
31% of customers are graduates
~85% of customers are either Single or Married; 46.7% of customers are Married
35% of customers earn less than $40K and 36% earn $60K or more
~93% of customers have a Blue card

Bi-variate Analysis
Goal of Bi-variate analysis is to find inter-dependencies between features.

In [44]:
#Target vs. All numerical columns
# Below code plots box charts for each numerical feature, grouped by the target (attrition_flag: Existing vs. Attrited Customer)
def box_by_target(data: pd.DataFrame, numeric_columns, target, include_outliers):
    """
    The function takes a dataframe, a list of numeric columns, the target column, and whether to
    include outliers or not as input, and plots one box plot per numeric column, split by the target

    Usage:
    ------

    box_by_target(df, ['age'], 'prodtaken', True)
    """
    subplot_cols = 2
    subplot_rows = int(len(numeric_columns) / 2 + 1)
    plt.figure(figsize=(16, 3 * subplot_rows))
    for i, col in enumerate(numeric_columns):
        plt.subplot(subplot_rows, subplot_cols, i + 1)
        sns.boxplot(
            data=data,
            x=target,
            y=col,
            orient="vertical",
            palette="Blues",
            showfliers=include_outliers,
        )
        plt.tight_layout()
        plt.title(str(i + 1) + ": " + target + " vs. " + col, color="black")

In [45]:
#With outliers
numeric_columns = data.select_dtypes(exclude="category").columns.tolist()
target_variable = "attrition_flag"
box_by_target(data, numeric_columns, target_variable, True)

In [46]:
#Without outliers
box_by_target(data, numeric_columns, target_variable, False)

Attrited customers have  

Lower total transaction amount
Lower total transaction count
Lower utilization ratio
Lower transaction count change Q4 to Q1
Higher number of times contacted with or by the bank

In [47]:
#Target vs. All Categorical Columns
# Create a function that returns a Pie chart and a Bar Graph for the categorical variables:
def cat_view(df: pd.DataFrame, x, target):
    """
    Function to create a Bar chart and a Pie chart for categorical variables.
    """
    from matplotlib import cm

    color1 = cm.inferno(np.linspace(0.4, 0.8, 30))
    color2 = cm.viridis(np.linspace(0.4, 0.8, 30))
    sns.set_palette("cubehelix")
    fig, ax = plt.subplots(1, 2, figsize=(16, 4))

    """
    Draw a Pie Chart on first subplot.
    """
    s = df.groupby(x).size()  # use the dataframe passed in, not the global `data`

    mydata_values = s.values.tolist()
    mydata_index = s.index.tolist()

    def func(pct, allvals):
        absolute = int(pct / 100.0 * np.sum(allvals))
        return "{:.1f}%\n({:d})".format(pct, absolute)

    wedges, texts, autotexts = ax[0].pie(
        mydata_values,
        autopct=lambda pct: func(pct, mydata_values),
        textprops=dict(color="w"),
    )

    ax[0].legend(
        wedges,
        mydata_index,
        title=x.capitalize(),
        loc="center left",
        bbox_to_anchor=(1, 0, 0.5, 1),
    )

    plt.setp(autotexts, size=12)

    ax[0].set_title(f"{x.capitalize()} Pie Chart")

    """
    Draw a Bar Graph on second subplot.
    """

    # Use a separate name so the `df` argument is not overwritten
    pivot = pd.pivot_table(
        df, index=[x], columns=[target], values=["credit_limit"], aggfunc=len
    )

    labels = pivot.index.tolist()
    no = pivot.values[:, 1].tolist()
    yes = pivot.values[:, 0].tolist()

    l = np.arange(len(labels))  # the label locations
    width = 0.35  # the width of the bars

    rects1 = ax[1].bar(
        l - width / 2, no, width, label="Existing Customer", color=color1
    )
    rects2 = ax[1].bar(
        l + width / 2, yes, width, label="Attrited Customer", color=color2
    )

    # Add some text for labels, title and custom x-axis tick labels, etc.
    ax[1].set_ylabel("Count")
    ax[1].set_title(f"{x.capitalize()} Bar Graph")
    ax[1].set_xticks(l)
    ax[1].set_xticklabels(labels)
    ax[1].legend()

    def autolabel(rects):

        """Attach a text label above each bar in *rects*, displaying its height."""

        for rect in rects:
            height = rect.get_height()
            ax[1].annotate(
                "{}".format(height),
                xy=(rect.get_x() + rect.get_width() / 2, height),
                xytext=(0, 3),  # 3 points vertical offset
                textcoords="offset points",
                fontsize="medium",
                ha="center",
                va="bottom",
            )

    autolabel(rects1)
    autolabel(rects2)

    fig.tight_layout()
    plt.show()

    """
    Draw a Stacked Bar Graph on bottom.
    """

    sns.set(palette="tab10")
    tab = pd.crosstab(data[x], data[target], normalize="index")

    tab.plot.bar(stacked=True, figsize=(16, 3))
    plt.title(x.capitalize() + " Stacked Bar Plot")
    plt.legend(loc="upper right", bbox_to_anchor=(0, 1))
    plt.show()
cat_view(data, "gender", "attrition_flag")

Attrition does not seem to be related to Gender

In [48]:
cat_view(data, "education_level", "attrition_flag")

Attrition does not seem to be related to Education

In [49]:
cat_view(data, "marital_status", "attrition_flag")

Attrition does not seem to be related to Marital Status

In [50]:
cat_view(data, "income_category", "attrition_flag")

Attrition does not seem to be related to Income Category

In [51]:
cat_view(data, "card_category", "attrition_flag")

Platinum card holders appear to have a higher tendency to attrite; however, since there are only 20 data points for Platinum card holders, this observation is not reliable

Multi-variate Plots
Pairplot of all available numeric columns, hued by attrition flag

In [52]:
#Below plot shows correlations between the numerical features in the dataset

plt.figure(figsize=(20, 20))
sns.set(palette="nipy_spectral")
sns.pairplot(data=data, hue="attrition_flag", corner=True)


Clusters form with respect to attrition for total revolving balance, total amount change Q4 to Q1, total transaction amount, total transaction count, and total transaction count change Q4 to Q1
There is also strong correlation between a few columns, which we'll check in the correlation heatmap below.

#Heatmap to understand correlations between independent and dependent variables
#Plotting correlation heatmap of the features


In [53]:
codes = {'Existing Customer':0, 'Attrited Customer':1}
data_clean = data.copy()
data_clean['attrition_flag'] = data_clean['attrition_flag'].map(codes).astype(int)


sns.set(rc={"figure.figsize": (15, 15)})
sns.heatmap(
    data_clean.select_dtypes(include="number").corr(),  # numeric columns only; category columns cannot be correlated
    annot=True,
    linewidths=0.5,
    center=0,
    cbar=False,
    cmap="YlGnBu",
    fmt="0.2f",
)
plt.show()

-Credit Limit and Average Open to Buy are perfectly collinear (100% correlation)
-Months on Book and Customer Age are quite strongly correlated
-Average Utilization Ratio and Total Revolving Balance also appear to be somewhat correlated
-Attrition Flag is not strongly correlated with any of the numeric variables
-Customer churn appears to be uncorrelated with Customer Age, Dependent Count, Months on Book, Open to Buy, and Credit Limit, so we'll remove these from the dataset

Data Preprocessing
Pre-processing steps:

Data Split into Independent (feature) and Dependent (target) sets
Data Split to Train, Test and Validation sets
Standardize feature names
Drop unnecessary columns (Client Number, Customer Age, Dependent Count, Months on Book, Open to Buy, Credit Limit)
Missing Value/Incorrect Value treatment
Encoding
Scaling/Outlier treatment

In [54]:
# Building a function to standardize columns

def feature_name_standardize(df: pd.DataFrame):
    df_ = df.copy()
    df_.columns = [i.replace(" ", "_").lower() for i in df_.columns]
    return df_

# Building a function to drop features

def drop_feature(df: pd.DataFrame, features: list = None):
    df_ = df.copy()
    if features:  # None or an empty list means there is nothing to drop
        df_ = df_.drop(columns=features)
        
    return df_

# Building a function to treat incorrect value

def mask_value(df: pd.DataFrame, feature: str = None, value_to_mask: str = None, masked_value: str = None):
    df_ = df.copy()
    if feature is not None and value_to_mask is not None:
        if feature in df_.columns:
            df_[feature] = df_[feature].astype('object')
            df_.loc[df_[feature] == value_to_mask, feature] = masked_value
            df_[feature] = df_[feature].astype('category')
            
    return df_

# Building a custom imputer

def impute_category_unknown(df: pd.DataFrame, fill_value: str):
    df_ = df.copy()
    for col in df_.select_dtypes(include='category').columns.tolist():
        df_[col] = df_[col].astype('object')
        df_[col] = df_[col].fillna(fill_value)  # use the argument instead of a hard-coded value
        df_[col] = df_[col].astype('category')
    return df_

# Building a custom data preprocessing class with fit and transform methods for standardizing column names

class FeatureNamesStandardizer(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        """All SciKit-Learn compatible transformers and classifiers have the
        same interface. `fit` always returns the same object."""
        return self

    def transform(self, X):
        """Returns dataframe with column names in lower case with underscores in place of spaces."""
        X_ = feature_name_standardize(X)
        return X_
    
    
# Building a custom data preprocessing class with fit and transform methods for dropping columns

class ColumnDropper(TransformerMixin):
    def __init__(self, features: list):
        self.features = features

    def fit(self, X, y=None):
        """All SciKit-Learn compatible transformers and classifiers have the
        same interface. `fit` always returns the same object."""
        return self

    def transform(self, X):
        """Given a list of columns, returns a dataframe without those columns."""
        X_ = drop_feature(X, features=self.features)
        return X_
        
    

# Building a custom data preprocessing class with fit and transform methods for custom value masking

class CustomValueMasker(TransformerMixin):
    def __init__(self, feature: str, value_to_mask: str, masked_value: str):
        self.feature = feature
        self.value_to_mask = value_to_mask
        self.masked_value = masked_value

    def fit(self, X, y=None):
        """All SciKit-Learn compatible transformers and classifiers have the
        same interface. `fit` always returns the same object."""
        return self

    def transform(self, X):
        """Return a dataframe with the required feature value masked as required."""
        X_ = mask_value(X, self.feature, self.value_to_mask, self.masked_value)
        return X_
    
    
# Building a custom class to one-hot encode using pandas
class PandasOneHot(TransformerMixin):
    def __init__(self, columns: list = None):
        self.columns = columns

    def fit(self, X, y=None):
        """All SciKit-Learn compatible transformers and classifiers have the
        same interface. `fit` always returns the same object."""
        return self

    def transform(self, X):
        """Return a dataframe with the categorical columns one-hot encoded (first level of each dropped)."""
        X_ = pd.get_dummies(X, columns = self.columns, drop_first=True)
        return X_
    
# Building a custom class to fill nulls with Unknown
class FillUnknown(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        """All SciKit-Learn compatible transformers and classifiers have the
        same interface. `fit` always returns the same object."""
        return self

    def transform(self, X):
        """Return a dataframe with nulls in the categorical columns filled with 'Unknown'."""
        X_ = impute_category_unknown(X, fill_value='Unknown')
        return X_
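Because `PandasOneHot` calls `pd.get_dummies` on each split independently, a category that is absent from one split could yield a different set of columns there. A minimal sketch of one possible guard, assuming toy data and a hypothetical extra step that reindexes an encoded split to the training columns (this step is not part of the notebook's pipeline):

```python
import pandas as pd

# Toy splits: "Platinum" never appears in train, "Gold"/"Silver" never appear in test
train = pd.DataFrame({"card_category": ["Blue", "Silver", "Gold"]})
test = pd.DataFrame({"card_category": ["Blue", "Blue", "Platinum"]})

train_enc = pd.get_dummies(train, drop_first=True)
test_enc = pd.get_dummies(test, drop_first=True)

# Align the test columns to the training columns:
# columns missing from test are added as 0, columns unseen in train are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
print(list(test_aligned.columns))
```

In this notebook all three splits end up with the same 27 columns, so the guard would be a no-op here; it matters mainly for rare categories or small splits.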

First, we'll build models individually after data pre-processing; later, we'll build an ML pipeline that runs pre-processing and model building end to end. We create a copy of the data for the first part.

In [55]:
#Creating data copy
df = churner.copy()
df.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
CLIENTNUM,10127.0,,,,739177606.334,36903783.45,708082083.0,713036770.5,717926358.0,773143533.0,828343083.0
Attrition_Flag,10127.0,2.0,Existing Customer,8500.0,,,,,,,
Customer_Age,10127.0,,,,46.326,8.017,26.0,41.0,46.0,52.0,73.0
Gender,10127.0,2.0,F,5358.0,,,,,,,
Dependent_count,10127.0,,,,2.346,1.299,0.0,1.0,2.0,3.0,5.0
Education_Level,10127.0,7.0,Graduate,3128.0,,,,,,,
Marital_Status,10127.0,4.0,Married,4687.0,,,,,,,
Income_Category,10127.0,6.0,Less than $40K,3561.0,,,,,,,
Card_Category,10127.0,4.0,Blue,9436.0,,,,,,,
Months_on_book,10127.0,,,,35.928,7.986,13.0,31.0,36.0,40.0,56.0


Defining the static variables

In [56]:
# The static variables

# For dropping columns
columns_to_drop = [
    "clientnum",
    "credit_limit",
    "dependent_count",
    "months_on_book",
    "avg_open_to_buy",
    "customer_age",
]

# For masking a particular value in a feature
column_to_mask_value = "income_category"
value_to_mask = "abc"
masked_value = "Unknown"

# Random state and loss
seed = 1
loss_func = "logloss"

# Test and Validation sizes
test_size = 0.2
val_size = 0.25

# Dependent Variable Value map
target_mapper = {"Attrited Customer": 1, "Existing Customer": 0}

Data Type Conversions
Here we are converting Object data type to Category

In [57]:
cat_columns = df.select_dtypes(include="object").columns.tolist()
df[cat_columns] = df[cat_columns].astype("category")

Dependent and independent variables
Splitting the dataset into dependent and independent variable sets



In [58]:
X = df.drop(columns=["Attrition_Flag"])
y = df["Attrition_Flag"].map(target_mapper)

In [59]:
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=test_size, random_state=seed, stratify=y
)

# then we split the temporary set into train and validation

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=val_size, random_state=seed, stratify=y_temp
)
print(
    "Training data shape: \n\n",
    X_train.shape,
    "\n\nValidation Data Shape: \n\n",
    X_val.shape,
    "\n\nTesting Data Shape: \n\n",
    X_test.shape,
)

Training data shape: 

 (6075, 20) 

Validation Data Shape: 

 (2026, 20) 

Testing Data Shape: 

 (2026, 20)
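The shapes above follow from the two-stage split: 20% is held out for test, then 25% of the remaining 80% becomes validation, giving roughly a 60/20/20 split. A short arithmetic check, assuming scikit-learn's rounding for `train_test_split` when only `test_size` is given (test size rounded up, the remainder used for training):

```python
import math

n_rows = 10127      # row count from df.describe() above
test_size = 0.2
val_size = 0.25

n_test = math.ceil(n_rows * test_size)   # first split: hold out the test set
n_temp = n_rows - n_test                 # rows left for train + validation
n_val = math.ceil(n_temp * val_size)     # second split: hold out validation
n_train = n_temp - n_val

print(n_train, n_val, n_test)
```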


In [60]:
#Checking the ratio of labels in the target column for each of the data segments
print("Training: \n", y_train.value_counts(normalize=True))
print("\n\nValidation: \n", y_val.value_counts(normalize=True))
print("\n\nTest: \n", y_test.value_counts(normalize=True))

Training: 
 0   0.839
1   0.161
Name: Attrition_Flag, dtype: float64


Validation: 
 0   0.839
1   0.161
Name: Attrition_Flag, dtype: float64


Test: 
 0   0.840
1   0.160
Name: Attrition_Flag, dtype: float64


Data processing
Data pre-processing is one of the most important parts of the job before training a model on the dataset. We need to impute missing values, fix any illogical values in columns, convert category columns to numeric (either ordinal, or binary using one-hot encoding), and scale the data to deal with distribution skewness and outliers before feeding the data to a model.

We use the pre-available transformation classes and the custom classes we created to first fit on the training data and then transform the train, validation and test datasets. Fitting on the training data only is standard practice: it keeps information from the validation and test data out of the training process, which avoids data leakage while training or validating the model.
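The fit-on-train, transform-everywhere pattern can be illustrated without scikit-learn. A minimal sketch, assuming made-up numbers and a hypothetical `IQRScaler` class that divides by the interquartile range without centering (the same idea as `RobustScaler(with_centering=False)`, though not its actual implementation):

```python
from statistics import quantiles


class IQRScaler:
    """Hypothetical stand-in: divides values by the IQR learned in fit()."""

    def fit(self, values):
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        self.iqr_ = q3 - q1
        return self

    def transform(self, values):
        return [v / self.iqr_ for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy training feature
val = [10.0, 20.0]                  # toy validation feature

scaler = IQRScaler().fit(train)     # statistics come from the training data only
train_scaled = scaler.transform(train)
val_scaled = scaler.transform(val)  # validation reuses the training IQR

print(scaler.iqr_, val_scaled)
```

The validation values never influence `iqr_`, which is the property that prevents leakage.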

In [61]:
# To Standardize feature names
feature_name_standardizer = FeatureNamesStandardizer()

X_train = feature_name_standardizer.fit_transform(X_train)
X_val = feature_name_standardizer.transform(X_val)
X_test = feature_name_standardizer.transform(X_test)

# To Drop unnecessary columns
column_dropper = ColumnDropper(features=columns_to_drop)

X_train = column_dropper.fit_transform(X_train)
X_val = column_dropper.transform(X_val)
X_test = column_dropper.transform(X_test)

# To Mask incorrect/meaningless value of a feature
value_masker = CustomValueMasker(
    feature=column_to_mask_value, value_to_mask=value_to_mask, masked_value=masked_value
)

X_train = value_masker.fit_transform(X_train)
X_val = value_masker.transform(X_val)
X_test = value_masker.transform(X_test)

# To impute categorical Nulls to Unknown
cat_columns = X_train.select_dtypes(include="category").columns.tolist()
imputer = FillUnknown()

X_train[cat_columns] = imputer.fit_transform(X_train[cat_columns])
X_val[cat_columns] = imputer.transform(X_val[cat_columns])
X_test[cat_columns] = imputer.transform(X_test[cat_columns])

# To encode the data
one_hot = PandasOneHot()

X_train = one_hot.fit_transform(X_train)
X_val = one_hot.transform(X_val)
X_test = one_hot.transform(X_test)


# Scale the numerical columns
robust_scaler = RobustScaler(with_centering=False, with_scaling=True)
num_columns = [
    "total_relationship_count",
    "months_inactive_12_mon",
    "contacts_count_12_mon",
    "total_revolving_bal",
    "total_amt_chng_q4_q1",
    "total_trans_amt",
    "total_trans_ct",
    "total_ct_chng_q4_q1",
    "avg_utilization_ratio",
]

X_train[num_columns] = pd.DataFrame(
    robust_scaler.fit_transform(X_train[num_columns]),
    columns=num_columns,
    index=X_train.index,
)
X_val[num_columns] = pd.DataFrame(
    robust_scaler.transform(X_val[num_columns]), columns=num_columns, index=X_val.index
)
X_test[num_columns] = pd.DataFrame(
    robust_scaler.transform(X_test[num_columns]),
    columns=num_columns,
    index=X_test.index,
)

In [62]:
X_train.head(3)

Unnamed: 0,total_relationship_count,months_inactive_12_mon,contacts_count_12_mon,total_revolving_bal,total_amt_chng_q4_q1,total_trans_amt,total_trans_ct,total_ct_chng_q4_q1,avg_utilization_ratio,gender_M,education_level_Doctorate,education_level_Graduate,education_level_High School,education_level_Post-Graduate,education_level_Uneducated,education_level_Unknown,marital_status_Married,marital_status_Single,marital_status_Unknown,income_category_$40K - $60K,income_category_$60K - $80K,income_category_$80K - $120K,income_category_Less than $40K,income_category_Unknown,card_category_Gold,card_category_Platinum,card_category_Silver
800,3.0,4.0,3.0,1.226,2.044,0.648,1.278,2.249,0.168,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0
498,3.0,2.0,0.0,1.45,1.697,0.524,0.861,2.667,1.376,1,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0
4356,2.5,1.0,2.0,1.926,3.829,1.661,2.194,3.717,0.775,1,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0


In [63]:
X_val.head(3)

Unnamed: 0,total_relationship_count,months_inactive_12_mon,contacts_count_12_mon,total_revolving_bal,total_amt_chng_q4_q1,total_trans_amt,total_trans_ct,total_ct_chng_q4_q1,avg_utilization_ratio,gender_M,education_level_Doctorate,education_level_Graduate,education_level_High School,education_level_Post-Graduate,education_level_Uneducated,education_level_Unknown,marital_status_Married,marital_status_Single,marital_status_Unknown,income_category_$40K - $60K,income_category_$60K - $80K,income_category_$80K - $120K,income_category_Less than $40K,income_category_Unknown,card_category_Gold,card_category_Platinum,card_category_Silver
2894,2.5,2.0,3.0,0.0,5.083,1.148,1.528,4.068,0.0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0
9158,0.5,3.0,1.0,0.0,3.982,3.148,1.639,3.81,0.0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0
9618,1.5,4.0,3.0,1.584,3.86,5.291,2.833,2.3,0.126,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0


In [64]:
X_test.head(3)

Unnamed: 0,total_relationship_count,months_inactive_12_mon,contacts_count_12_mon,total_revolving_bal,total_amt_chng_q4_q1,total_trans_amt,total_trans_ct,total_ct_chng_q4_q1,avg_utilization_ratio,gender_M,education_level_Doctorate,education_level_Graduate,education_level_High School,education_level_Post-Graduate,education_level_Uneducated,education_level_Unknown,marital_status_Married,marital_status_Single,marital_status_Unknown,income_category_$40K - $60K,income_category_$60K - $80K,income_category_$80K - $120K,income_category_Less than $40K,income_category_Unknown,card_category_Gold,card_category_Platinum,card_category_Silver
9760,1.0,3.0,2.0,0.865,3.316,5.556,2.583,2.544,0.369,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0
7413,2.0,3.0,2.0,0.0,3.219,0.85,1.139,2.19,0.0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0
6074,1.5,3.0,3.0,0.0,3.237,1.658,2.056,3.215,0.0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0


In [65]:
print(
    "Training data shape: \n\n",
    X_train.shape,
    "\n\nValidation Data Shape: \n\n",
    X_val.shape,
    "\n\nTesting Data Shape: \n\n",
    X_test.shape,
)

Training data shape: 

 (6075, 27) 

Validation Data Shape: 

 (2026, 27) 

Testing Data Shape: 

 (2026, 27)


We are now all set to build, train and validate the model

Model Building Considerations
Model evaluation criterion:
Model can make wrong predictions as:
Predicting a customer will attrite and the customer does not attrite - Loss of resources
Predicting a customer will not attrite and the customer attrites - Loss of opportunity for retaining the customer
Which case is more important?
Predicting that a customer will not attrite when they actually do would result in a loss for the bank: had the prediction been correct, the marketing/sales team could have contacted the customer to retain them. So, the false negatives should be minimized.
How to reduce this loss, i.e. how to reduce False Negatives?
The company wants Recall to be maximized: the greater the Recall, the lower the chance of false negatives.
Let's start by building different models using StratifiedKFold and cross_val_score, and tune the best model using RandomizedSearchCV
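To make the criterion concrete: recall is TP / (TP + FN), so every missed attriter (false negative) lowers it. A minimal sketch with made-up labels and probabilities and a hypothetical `recall` helper (1 = attrited, 0 = existing):

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN) for the positive class (1 = attrited)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 customers actually attrited
probs = [0.9, 0.6, 0.4, 0.2, 0.7, 0.35, 0.2, 0.1, 0.1, 0.05]  # toy predicted probabilities

recall_by_threshold = {}
for threshold in (0.5, 0.3):
    y_pred = [int(p > threshold) for p in probs]
    recall_by_threshold[threshold] = recall(y_true, y_pred)
    print(threshold, recall_by_threshold[threshold])
```

In this toy data, lowering the threshold from 0.5 to 0.3 catches one more attriter and also flags one more existing customer, which is the usual trade-off between recall and precision.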

Stratified K-Folds cross-validation provides dataset indices to split data into train/validation sets. Split dataset into k consecutive folds (without shuffling by default) keeping the distribution of both classes in each fold the same as the target variable. Each fold is then used once as validation while the k - 1 remaining folds form the training set.
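The idea of stratification can be sketched in a few lines. This is a simplified illustration with a hypothetical `stratified_folds` helper that deals each class's indices round-robin into folds; it is not scikit-learn's `StratifiedKFold` implementation, and it does not shuffle:

```python
from collections import defaultdict


def stratified_folds(y, k):
    """Assign each index to one of k folds, keeping class proportions similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the indices of each class out to the folds in turn
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


y = [0] * 16 + [1] * 4        # toy target with an 80:20 imbalance
folds = stratified_folds(y, k=4)

for i, val_idx in enumerate(folds):
    # Each fold serves once as validation; the other k - 1 folds form the training set
    train_idx = [j for f in folds if f is not val_idx for j in f]
    positives = sum(y[j] for j in val_idx)
    print(f"fold {i}: {len(train_idx)} train, {len(val_idx)} validation, {positives} positive")
```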
Model Evaluation Functions - Scoring & Confusion Matrix
We are creating a few functions to score the models, show the confusion matrix

In [66]:
#Function to Get Scores
def get_metrics_score(
    model, train, test, train_y, test_y, threshold=0.5, flag=False, roc=True
):
    """
    Function to calculate different metric scores of the model - Accuracy, Recall, Precision, F1 score and ROC-AUC
    model: classifier to predict values of X
    train, test: Independent features
    train_y,test_y: Dependent variable
    threshold: threshold for classifying the observation as 1
    flag: If the flag is set to True, the print statements showing the different scores will be displayed. The default value is False.
    roc: If roc is set to True (and flag is True), the ROC-AUC scores will also be displayed. The default value is True.
    """
    # defining an empty list to store train and test results

    score_list = []

    pred_train = model.predict_proba(train)[:, 1] > threshold
    pred_test = model.predict_proba(test)[:, 1] > threshold

    pred_train = np.round(pred_train)
    pred_test = np.round(pred_test)

    train_acc = accuracy_score(pred_train, train_y)
    test_acc = accuracy_score(pred_test, test_y)

    train_recall = recall_score(train_y, pred_train)
    test_recall = recall_score(test_y, pred_test)

    train_precision = precision_score(train_y, pred_train)
    test_precision = precision_score(test_y, pred_test)

    train_f1 = f1_score(train_y, pred_train)
    test_f1 = f1_score(test_y, pred_test)

    pred_train_proba = model.predict_proba(train)[:, 1]
    pred_test_proba = model.predict_proba(test)[:, 1]

    train_roc_auc = roc_auc_score(train_y, pred_train_proba)
    test_roc_auc = roc_auc_score(test_y, pred_test_proba)

    score_list.extend(
        (
            train_acc,
            test_acc,
            train_recall,
            test_recall,
            train_precision,
            test_precision,
            train_f1,
            test_f1,
            train_roc_auc,
            test_roc_auc,
        )
    )

    if flag == True:

        print("Accuracy on training set : ", accuracy_score(pred_train, train_y))
        print("Accuracy on test set : ", accuracy_score(pred_test, test_y))
        print("Recall on training set : ", recall_score(train_y, pred_train))
        print("Recall on test set : ", recall_score(test_y, pred_test))
        print("Precision on training set : ", precision_score(train_y, pred_train))
        print("Precision on test set : ", precision_score(test_y, pred_test))
        print("F1 on training set : ", f1_score(train_y, pred_train))
        print("F1 on test set : ", f1_score(test_y, pred_test))

    if roc == True:
        if flag == True:
            print(
                "ROC-AUC Score on training set : ",
                roc_auc_score(train_y, pred_train_proba),
            )
            print(
                "ROC-AUC Score on test set : ", roc_auc_score(test_y, pred_test_proba)
            )

    return score_list  # returning the list with train and test scores

In [67]:
def make_confusion_matrix(model, test_X, y_actual, labels=[1, 0]):
    """
    model : classifier to predict values of X
    test_X: test set
    y_actual : ground truth

    """
    y_predict = model.predict(test_X)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[1, 0])
    df_cm = pd.DataFrame(
        cm,
        index=[i for i in ["Actual - Attrited", "Actual - Existing"]],
        columns=[i for i in ["Predicted - Attrited", "Predicted - Existing"]],
    )
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(5, 3))
    sns.heatmap(df_cm, annot=labels, fmt="", cmap="Blues").set(title="Confusion Matrix")

In [68]:
# defining empty lists to add train and test results

model_names = []
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
roc_auc_train = []
roc_auc_test = []
cross_val_train = []


def add_score_model(model_name, score, cv_res):
    """Add scores to list so that we can compare all models score together"""
    model_names.append(model_name)
    acc_train.append(score[0])
    acc_test.append(score[1])
    recall_train.append(score[2])
    recall_test.append(score[3])
    precision_train.append(score[4])
    precision_test.append(score[5])
    f1_train.append(score[6])
    f1_test.append(score[7])
    roc_auc_train.append(score[8])
    roc_auc_test.append(score[9])
    cross_val_train.append(cv_res)

Building Models
We are making 9 models: Logistic Regression, Bagging, Random Forest, Gradient Boosting, Ada Boosting, Extreme Gradient Boosting, Decision Tree, KNN, and Light Gradient Boosting.

Build and Train Models
We are building the 9 models below:
 

Bagging
Random Forest Classification
Gradient Boosting Machine
Adaptive Boosting
eXtreme Gradient Boosting
Decision Tree Classification (Classification and Regression Trees - CART)
Light Gradient Boosting Machine
Logistic Regression
KNN
 

Light GBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithm, used for ranking, classification and many other machine learning tasks.
 

Light GBM grows trees leaf-wise, always splitting the leaf with the best fit, whereas many other boosting algorithms grow trees depth-wise (level-wise). When growing the same number of leaves, the leaf-wise algorithm can reduce more loss than the level-wise algorithm, which often results in better accuracy, although leaf-wise growth can overfit on small datasets. Below is a diagrammatic representation by the makers of Light GBM to explain the difference clearly.

In [69]:
models = []  # Empty list to store all the models
cv_results = []

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=seed)))
models.append(("Random forest", RandomForestClassifier(random_state=seed)))
models.append(("GBM", GradientBoostingClassifier(random_state=seed)))
models.append(("Adaboost", AdaBoostClassifier(random_state=seed)))
models.append(("Xgboost", XGBClassifier(random_state=seed, eval_metric=loss_func)))
models.append(("dtree", DecisionTreeClassifier(random_state=seed)))
models.append(("Light GBM", lgb.LGBMClassifier(random_state=seed)))
models.append(("Logistic Regression", LogisticRegression(random_state=seed)))
models.append(("KNN",KNeighborsClassifier(n_neighbors=7)))
# For each model, run cross validation on 9 folds (+ 1 validation fold) with scoring for recall
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=10, shuffle=True, random_state=1
    )  # Setting number of splits equal to 10

    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    cv_results.append(cv_result)

    model.fit(X_train, y_train)
    model_score = get_metrics_score(model, X_train, X_test, y_train, y_test)
    add_score_model(name, model_score, cv_result.mean())

print("Operation Completed!")

Operation Completed!


In [70]:
comparison_frame = pd.DataFrame(
    {
        "Model": model_names,
        "Cross_Val_Score_Train": cross_val_train,
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
        "Train_F1": f1_train,
        "Test_F1": f1_test,
        "Train_ROC_AUC": roc_auc_train,
        "Test_ROC_AUC": roc_auc_test,
    }
)

# Sorting models in decreasing order of cross-validation score, then test recall
comparison_frame.sort_values(
    by=["Cross_Val_Score_Train", "Test_Recall"], ascending=False
).style.highlight_max(color="lightgreen", axis=0).highlight_min(color="pink", axis=0)

Unnamed: 0,Model,Cross_Val_Score_Train,Train_Accuracy,Test_Accuracy,Train_Recall,Test_Recall,Train_Precision,Test_Precision,Train_F1,Test_F1,Train_ROC_AUC,Test_ROC_AUC
6,Light GBM,0.844267,0.999342,0.976308,0.998975,0.929231,0.996933,0.923547,0.997953,0.92638,0.999988,0.993571
2,GBM,0.81762,0.969712,0.96693,0.873975,0.876923,0.93326,0.913462,0.902646,0.894819,0.992689,0.989508
4,Xgboost,0.809426,0.966255,0.968904,0.849385,0.88,0.934611,0.922581,0.889962,0.900787,0.992539,0.990775
3,Adaboost,0.799137,0.956379,0.958045,0.830943,0.864615,0.890231,0.872671,0.859565,0.868624,0.987073,0.985719
0,Bagging,0.785862,0.996049,0.955084,0.980533,0.846154,0.994802,0.870253,0.987616,0.858034,0.999899,0.974716
1,Random forest,0.77044,1.0,0.95459,1.0,0.803077,1.0,0.903114,1.0,0.850163,1.0,0.989199
5,dtree,0.754113,1.0,0.929911,1.0,0.821538,1.0,0.760684,1.0,0.789941,1.0,0.886078
7,Logistic Regression,0.562487,0.901235,0.904245,0.563525,0.6,0.759669,0.752896,0.647059,0.667808,0.920375,0.934017
8,KNN,0.502051,0.922798,0.90227,0.597336,0.501538,0.884674,0.819095,0.71315,0.622137,0.968172,0.925656


The best model with respect to both cross-validation score and test recall is Light GBM.
The next best models are GBM, XGBoost and AdaBoost, in decreasing order of cross-validation score.

Plotting the cross-validation result comparison
We plot the cross-validation results for the 9 models in a box plot to check which models are potentially good.

In [71]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(18, 18))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(cv_results)
ax.set_xticklabels(model_names)

plt.show()

It appears that Light GBM, XGBoost and GBM are the models with good potential. AdaBoost also looks good, given its high-end outlier score.

Oversampling train data using SMOTE
Our dataset has a large imbalance in the target variable labels. Several techniques exist for such datasets; they fall under the heading of imbalanced classification.

Imbalanced classification involves developing predictive models on classification datasets that have a severe class imbalance.

The challenge of working with imbalanced datasets is that most machine learning techniques will ignore the minority class and in turn perform poorly on it, although it is typically performance on the minority class that matters most, as is the case in our study.

One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these examples don’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.
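
The interpolation step at the heart of SMOTE can be sketched in a few lines. This is a simplified illustration using only the standard library, not the imblearn implementation: the function name smote_like_sample and the toy points are hypothetical, and neighbors are found by brute force.

```python
import math
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a random
    minority sample and one of its k nearest minority neighbors."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Brute-force nearest neighbors among the other minority samples
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbor = rng.choice(others[:k])
    # Pick a random position on the segment between base and neighbor
    gap = rng.random()
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))


# Hypothetical minority-class points in two dimensions
minority_points = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 1.8)]
synthetic = smote_like_sample(minority_points, k=2, rng=random.Random(42))
print(synthetic)
```

Because the new point lies on a segment between two real minority samples, it adds variation rather than an exact duplicate, which is the difference from plain duplication described above.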

In [72]:
print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

sm = SMOTE(
    sampling_strategy="minority", k_neighbors=10, random_state=seed
)  # Synthetic Minority Over Sampling Technique

X_train_over, y_train_over = sm.fit_resample(X_train, y_train)


print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))


print("After UpSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After UpSampling, the shape of train_y: {} \n".format(y_train_over.shape))

Before UpSampling, counts of label 'Yes': 976
Before UpSampling, counts of label 'No': 5099 

After UpSampling, counts of label 'Yes': 5099
After UpSampling, counts of label 'No': 5099 

After UpSampling, the shape of train_X: (10198, 27)
After UpSampling, the shape of train_y: (10198,) 



Build Models with Oversampled Data
Build and Train Models
We are building and training the same 9 models as before, this time on the over-sampled training data.

Next, using Principal Component Analysis (PCA), we will reduce the dimensionality of the dataset as far as possible while retaining 95% of the variance.

In [73]:
models_over = []

# Appending models into the list

models_over.append(("Bagging UpSampling", BaggingClassifier(random_state=seed)))
models_over.append(
    ("Random forest UpSampling", RandomForestClassifier(random_state=seed))
)
models_over.append(("GBM UpSampling", GradientBoostingClassifier(random_state=seed)))
models_over.append(("Adaboost UpSampling", AdaBoostClassifier(random_state=seed)))
models_over.append(
    ("Xgboost UpSampling", XGBClassifier(random_state=seed, eval_metric=loss_func))
)
models_over.append(("dtree UpSampling", DecisionTreeClassifier(random_state=seed)))
models_over.append(("Light GBM UpSampling", lgb.LGBMClassifier(random_state=seed)))
models_over.append(("Logistic Regression UpSampling", LogisticRegression(random_state=seed)))
models_over.append(("KNN UpSampling",KNeighborsClassifier(n_neighbors=7)))

for name, model in models_over:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=10, shuffle=True, random_state=1
    )  # Setting number of splits equal to 10

    cv_result_over = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
    )
    cv_results.append(cv_result_over)

    model.fit(X_train_over, y_train_over)
    model_score_over = get_metrics_score(
        model, X_train_over, X_val, y_train_over, y_val
    )
    add_score_model(name, model_score_over, cv_result_over.mean())

print("Operation Completed!")

Operation Completed!


In [74]:
#Comparing models
comparison_frame = pd.DataFrame(
    {
        "Model": model_names,
        "Cross_Val_Score_Train": cross_val_train,
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
        "Train_F1": f1_train,
        "Test_F1": f1_test,
        "Train_ROC_AUC": roc_auc_train,
        "Test_ROC_AUC": roc_auc_test,
    }
)

# Sorting models in decreasing order of test recall
comparison_frame.sort_values(
    by=["Test_Recall", "Cross_Val_Score_Train"], ascending=False
).style.highlight_max(color="lightgreen", axis=0).highlight_min(color="pink", axis=0)

Unnamed: 0,Model,Cross_Val_Score_Train,Train_Accuracy,Test_Accuracy,Train_Recall,Test_Recall,Train_Precision,Test_Precision,Train_F1,Test_F1,Train_ROC_AUC,Test_ROC_AUC
6,Light GBM,0.844267,0.999342,0.976308,0.998975,0.929231,0.996933,0.923547,0.997953,0.92638,0.999988,0.993571
15,Light GBM UpSampling,0.977642,0.998137,0.967917,0.999216,0.920245,0.997065,0.884956,0.998139,0.902256,0.999984,0.992721
11,GBM UpSampling,0.967837,0.971269,0.957552,0.974701,0.91411,0.968056,0.837079,0.971367,0.8739,0.996154,0.987595
12,Adaboost UpSampling,0.956857,0.952442,0.942251,0.959012,0.911043,0.946574,0.771429,0.952752,0.835443,0.991587,0.981916
13,Xgboost UpSampling,0.965092,0.967739,0.951629,0.971367,0.907975,0.964369,0.813187,0.967855,0.857971,0.995584,0.98816
17,KNN UpSampling,0.989604,0.942146,0.857848,0.997647,0.892638,0.89797,0.534926,0.945188,0.668966,0.996504,0.935066
4,Xgboost,0.809426,0.966255,0.968904,0.849385,0.88,0.934611,0.922581,0.889962,0.900787,0.992539,0.990775
2,GBM,0.81762,0.969712,0.96693,0.873975,0.876923,0.93326,0.913462,0.902646,0.894819,0.992689,0.989508
3,Adaboost,0.799137,0.956379,0.958045,0.830943,0.864615,0.890231,0.872671,0.859565,0.868624,0.987073,0.985719
10,Random forest UpSampling,0.979801,1.0,0.951629,1.0,0.861963,1.0,0.841317,1.0,0.851515,1.0,0.985757


The best 4 models with respect to validation recall and cross validation score, are as follows:
Light GBM trained with over/up-sampled data
GBM trained with over/up-sampled data
AdaBoost trained with over/up-sampled data
XGBoost trained with over/up-sampled data

In [75]:
#PCA Reducing Dimensions
#Using elbow-plot variance/dimensions
from sklearn.decomposition import PCA
import matplotlib.ticker as mtick

pca = PCA()
pca.fit(X_train_over)

cumsum = np.cumsum(pca.explained_variance_ratio_)*100
d = [n for n in range(len(cumsum))]

# Visualize
fig, ax = plt.subplots(figsize =(10, 10))
fig.patch.set_facecolor('white')
mpl.rcParams['font.family'] = 'DejaVu Sans'
mpl.rcParams['font.size'] = 14

ax.plot(d,cumsum, color = '#00538F', label='Cumulative Explained Variance')

ax.axhline(y = 95, color='black', linestyle=':', label = '95% Explained Variance')
ax.legend(loc='best')

# Remove ticks
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')

# Remove axes spines
for i in ['top','right']:
    ax.spines[i].set_visible(False)

# Set percentages
ax.yaxis.set_major_formatter(mtick.PercentFormatter())

# annotation arrow
arrowprops = dict(arrowstyle="->", connectionstyle="angle3,angleA=0,angleB=-90")
plt.annotate('Principal Component Number 19', 
             xy=(19, 95), 
             xytext=(19+5, 95+10), 
             arrowprops=arrowprops,
             size = 14)

plt.legend(bbox_to_anchor = (1, 0.2))

plt.suptitle('Explained Variance vs Dimensions', size=26)
plt.ylabel('Cumulative Explained Variance')
plt.xlabel('PC');

In [76]:
pca = PCA(.95) 
pca.fit(X_train_over)

X_pca = pca.transform(X_train_over)
X_pca = pd.DataFrame(X_pca)

print(f'''Shape of X before PCA: {X_train_over.shape}
Shape of X after PCA: {X_pca.shape}''')

Shape of X before PCA: (10198, 27)
Shape of X after PCA: (10198, 16)
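
Passing a fraction such as .95 to PCA asks scikit-learn to keep just enough leading components for the cumulative explained variance to exceed that fraction. The selection rule can be sketched as follows; the helper name components_for_variance and the ratios are hypothetical and are not taken from the dataset above.

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Return the smallest number of leading components whose cumulative
    explained variance ratio exceeds the threshold."""
    for k, total in enumerate(accumulate(explained_variance_ratio), start=1):
        if total > threshold:
            return k
    return len(explained_variance_ratio)


# Hypothetical ratios for illustration (sorted in decreasing order, as PCA returns them)
ratios = [0.50, 0.30, 0.10, 0.06, 0.04]
n_keep = components_for_variance(ratios, threshold=0.95)
print(n_keep)
```

The same rule, applied to the real explained variance ratios, is what reduces the 27 columns to 16 components in the cell above.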


In [77]:
# Grid search with cross-validation for KNN
from sklearn.model_selection import GridSearchCV

grid = {"n_neighbors": np.arange(1, 50)}
knn = KNeighborsClassifier()

knn_cv = GridSearchCV(knn, grid, cv=10)  # default scoring for a classifier is accuracy
knn_cv.fit(X_train_over, y_train_over)

# Print the tuned hyperparameter (the value of K in the KNN algorithm)
print("tuned hyperparameter K: ", knn_cv.best_params_)
print("best accuracy for the tuned parameter (best score): ", knn_cv.best_score_)

tuned hyperparameter K:  {'n_neighbors': 2}
best accuracy for the tuned parameter (best score):  0.9556789624583887


In [78]:
# Find the best K value using stratified K-fold cross-validation.
# With a classifier, cv=3 in cross_val_score uses a 3-fold StratifiedKFold.
k_range = range(1, 50)  # assumed range, matching the grid search above
scores = []
for k in k_range:
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    # cross_val_score fits a clone on each fold, so no separate fit is needed
    score = cross_val_score(knn_clf, X_train_over, y_train_over, cv=3, n_jobs=-1)
    scores.append(score.mean())

# Misclassification error (1 - accuracy); the name mse is kept for the plotting cell below
mse = [1 - x for x in scores]

In [None]:
# Plot error and accuracy against K to choose the optimal K value

plt.figure(figsize=(20, 8))
plt.subplot(121)
sns.lineplot(x=list(k_range), y=mse, markers=True, dashes=False)
plt.xlabel("Value of K")
plt.ylabel("Misclassification Error")
plt.subplot(122)
sns.lineplot(x=list(k_range), y=scores, markers=True, dashes=False)
plt.xlabel("Value of K")
plt.ylabel("Cross Validation Accuracy")

plt.show()

In [None]:
# From the previous graph, the optimum K value for KNN is 7

In [None]:
# KNN
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

KNN = KNeighborsClassifier(n_neighbors=7)
# Fit
KNN.fit(X_train, y_train)

# Predict
y_predict = KNN.predict(X_test)

# Confusion matrix
print("Confusion Matrix:\n\n")
print(confusion_matrix(y_test, y_predict))

# Prediction summary
print(classification_report(y_test, y_predict))

# Accuracy score (share of all customers classified correctly, not recall)
KNN_SC = metrics.accuracy_score(y_test, y_predict)
print(f"Accuracy on the test set: {round(KNN_SC*100,2)}% of all customers classified correctly")
# plot_confusion_matrix was removed from recent scikit-learn releases
ConfusionMatrixDisplay.from_estimator(KNN, X_test, y_test)

In [None]:
print('Accuracy of K-NN classifier on training set: {:.2f}'
     .format(KNN.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'
     .format(KNN.score(X_test, y_test)))

In [None]:
#DECISION TREE
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import plot_tree, export_text
clf = DecisionTreeClassifier(max_depth=5,random_state = 20)
clf.fit(X_train_over, y_train_over)

In [None]:
fig, ax = plt.subplots(figsize=(12, 5))

print(f"Tree depth: {clf.get_depth()}")
print(f"Number of terminal nodes: {clf.get_n_leaves()}")

# Feature names must match the columns the tree was trained on;
# class names follow label order (0 = existing, 1 = attrited)
plot = plot_tree(
            decision_tree = clf,
            feature_names = list(X_train_over.columns),
            class_names   = ["Existing Customer", "Attrited Customer"],
            filled        = True,
            impurity      = False,
            fontsize      = 10,
            precision     = 2,
            ax            = ax
       )

Undersampling train data using Random Under Sampler
Undersampling is another way of dealing with imbalance in the dataset.

Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset until a balanced dataset is created.
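
The idea can be sketched without any library. This is an illustration only, not the RandomUnderSampler implementation; the function name random_undersample and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=1):
    """Randomly drop rows until every class has as many rows as the
    smallest class."""
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(y):
        by_label.setdefault(label, []).append(idx)
    target = min(len(idxs) for idxs in by_label.values())
    keep = []
    for idxs in by_label.values():
        keep.extend(rng.sample(idxs, target))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 6 majority rows (label 0) and 2 minority rows (label 1)
X_toy = [[i] for i in range(8)]
y_toy = [0, 0, 0, 1, 0, 0, 1, 0]
X_bal, y_bal = random_undersample(X_toy, y_toy)
print(Counter(y_bal))
```

All minority rows survive, while the majority class is cut down to the same size, so information from the discarded majority rows is lost. That is the trade-off compared with oversampling.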

In [None]:
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)

In [None]:
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))

print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))

Build Models with Undersampled Data
Build and Train Models
We are again building the same 9 models as before, training them on the undersampled dataset and using the validation dataset to score them.



In [None]:
models_under = []

# Appending models into the list

models_under.append(("Bagging DownSampling", BaggingClassifier(random_state=seed)))
models_under.append(
    ("Random forest DownSampling", RandomForestClassifier(random_state=seed))
)
models_under.append(("GBM DownSampling", GradientBoostingClassifier(random_state=seed)))
models_under.append(("Adaboost DownSampling", AdaBoostClassifier(random_state=seed)))
models_under.append(
    ("Xgboost DownSampling", XGBClassifier(random_state=seed, eval_metric=loss_func))
)
models_under.append(("dtree DownSampling", DecisionTreeClassifier(random_state=seed)))
models_under.append(("Light GBM DownSampling", lgb.LGBMClassifier(random_state=seed)))
models_under.append(("Logistic Regression DownSampling", LogisticRegression(random_state=seed)))
models_under.append(("KNN DownSampling", KNeighborsClassifier(n_neighbors=7)))

for name, model in models_under:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=10, shuffle=True, random_state=1
    )  # Setting number of splits equal to 10

    cv_result_under = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
    )
    cv_results.append(cv_result_under)

    model.fit(X_train_un, y_train_un)
    model_score_under = get_metrics_score(model, X_train_un, X_val, y_train_un, y_val)
    add_score_model(name, model_score_under, cv_result_under.mean())

print("Operation Completed!")

In [None]:
#Comparing models
comparison_frame = pd.DataFrame(
    {
        "Model": model_names,
        "Cross_Val_Score_Train": cross_val_train,
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
        "Train_F1": f1_train,
        "Test_F1": f1_test,
        "Train_ROC_AUC": roc_auc_train,
        "Test_ROC_AUC": roc_auc_test,
    }
)

# Sorting models in decreasing order of test recall
comparison_frame.sort_values(
    by=["Test_Recall", "Cross_Val_Score_Train"], ascending=False
).style.highlight_max(color="lightgreen", axis=0).highlight_min(color="pink", axis=0)

The 4 best models are:
XGBoost trained with undersampled data
AdaBoost trained with undersampled data
Light GBM trained with undersampled data
Random forest trained with undersampled data  
We will now try to tune these 4 models using Random Search CV

Choice of models for tuning
XGBoost with down-sampling has the best validation recall of 96.3%, along with a 95% cross-validation score on train and 0.99 AUC, which means it has a high chance of performing well on unseen data. There is a bit of over-fitting, which I expect to resolve by tuning.

AdaBoost generalizes very well: it is neither over-fitting nor showing signs of bias. AUC is 0.985, the cross-validation score on train is 93%, and recall on the validation set is the same as XGBoost (96.3%). I expect to improve the model (~94% on the validation set) via tuning.

Light GBM works well in all aspects, but there is a slight over-fitting problem, which I expect to resolve by tuning. Accuracy on validation is 94%, with a cross-validation score on train of 95%, recall on validation of ~96%, and AUC of 0.99. This looks like a very promising model.

GBM is not over-fitting, and it suffers from neither high bias nor high variance. Recall on validation is ~96%, accuracy on validation is ~94%, AUC is ~0.99, and the cross-validation score on train is ~95%. This would be my top choice because none of the training scores are 100%, meaning it is not trying to explain every single aspect of the training data by over-fitting it.

Model Tuning using RandomizedSearchCV
Typically a hyperparameter has a known effect on a model in the general sense, but it is not clear how to best set a hyperparameter for a given dataset. Further, many machine learning models have a range of hyperparameters and they may interact in nonlinear ways.

As such, it is often required to search for a set of hyperparameters that result in the best performance of a model on a dataset. This is called hyperparameter optimization, hyperparameter tuning, or hyperparameter search.

An optimization procedure involves defining a search space. This can be thought of geometrically as an n-dimensional volume, where each hyperparameter represents a different dimension and the scale of each dimension is the set of values that the hyperparameter may take on, such as real-valued, integer-valued, or categorical.

Search Space: Volume to be searched where each dimension represents a hyperparameter and each point represents one model configuration. A point in the search space is a vector with a specific value for each hyperparameter value. The goal of the optimization procedure is to find a vector that results in the best performance of the model after learning, such as maximum accuracy or minimum error.

A range of different optimization algorithms may be used, although two of the simplest and most common methods are random search and grid search.

Random Search: Define a search space as a bounded domain of hyperparameter values and randomly sample points in that domain.  

Grid Search: Define a search space as a grid of hyperparameter values and evaluate every position in the grid.
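
The difference between the two strategies can be illustrated on a small search space. This is a sketch using only the standard library, with a hypothetical search space; it samples with replacement for simplicity, whereas RandomizedSearchCV samples without replacement when every parameter is given as a list.

```python
import itertools
import random

# Hypothetical search space: each key is one dimension of the space
search_space = {
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [4, 8, 12],
    "subsample": [0.8, 0.9, 1.0],
}

names = list(search_space)

# Grid search: evaluate every point of the Cartesian product
grid_points = [
    dict(zip(names, values))
    for values in itertools.product(*search_space.values())
]

# Random search: sample a fixed budget of points from the same space
rng = random.Random(1)
n_iter = 10
random_points = [
    {name: rng.choice(options) for name, options in search_space.items()}
    for _ in range(n_iter)
]

print(len(grid_points), "grid points vs", len(random_points), "random points")
```

The grid grows multiplicatively with every added hyperparameter, while the random search budget stays fixed at n_iter, which is why random search is used for the larger spaces tuned below.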



Tuning XGBOOST with Down Sampled data
Finding best parameter for high recall using Random Search with cross validation

In [None]:
%%time

# defining model
model = XGBClassifier(random_state=seed, eval_metric=loss_func)


# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,500,50),
            'scale_pos_weight':[2,5,10],
            'learning_rate':[0.01,0.1,0.2,0.05],
            'gamma':[0,1,3,5],
            'subsample':[0.8,0.9,1],
            'max_depth':np.arange(4,20,1),
            'reg_lambda':[5,10, 15, 20]} 

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
xgb_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=10, random_state=seed, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
xgb_tuned.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(xgb_tuned.best_params_,xgb_tuned.best_score_))

In [None]:
# building model with best parameters
xgb_tuned_model = XGBClassifier(
    n_estimators=150,
    scale_pos_weight=10,
    subsample=1,
    reg_lambda=20,
    max_depth=5,
    learning_rate=0.01,
    gamma=0,
    eval_metric=loss_func,
    random_state=seed,
)
# Fit the model on training data
xgb_tuned_model.fit(X_train_un, y_train_un)

In [None]:
#get scores
xgb_tuned_model_score = get_metrics_score(
    xgb_tuned_model, X_train, X_val, y_train, y_val
)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

scoring = "recall"
xgb_down_cv = cross_val_score(
    estimator=xgb_tuned_model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)


add_score_model(
    "XGB Tuned with Down Sampling", xgb_tuned_model_score, xgb_down_cv.mean()
)

In [None]:
#confusion matrix
make_confusion_matrix(xgb_tuned_model, X_val, y_val)


Tuning AdaBoost with Down Sampled data
Finding best parameter for high recall using Random Search with cross validation

In [None]:
%%time

# defining model
model = AdaBoostClassifier(random_state=seed)



# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,2000,50),
            'learning_rate':[0.01,0.1,0.2,0.05]} 

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
ada_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=10, random_state=seed, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
ada_tuned.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(ada_tuned.best_params_,ada_tuned.best_score_))

In [None]:
# building model with best parameters
ada_tuned_model = AdaBoostClassifier(
    n_estimators=1050, learning_rate=0.1, random_state=seed
)
# Fit the model on training data
ada_tuned_model.fit(X_train_un, y_train_un)

In [None]:
#get scores
ada_tuned_model_score = get_metrics_score(
    ada_tuned_model, X_train, X_val, y_train, y_val
)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

scoring = "recall"
ada_down_cv = cross_val_score(
    estimator=ada_tuned_model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)


add_score_model(
    "AdaBoost Tuned with Down Sampling", ada_tuned_model_score, ada_down_cv.mean()
)

In [None]:
#confusion matrix
make_confusion_matrix(ada_tuned_model, X_val, y_val)

Tuning Light GBM with Down-Sampled data
Finding best parameter for high recall using Random Search with cross validation

In [None]:
%%time

# defining model
model = lgb.LGBMClassifier(random_state=seed)

# Hyper parameters
min_gain_to_split = [0.01, 0.1, 0.2, 0.3]
min_data_in_leaf = [10, 20, 30, 40, 50]
feature_fraction = [0.8, 0.9, 1.0]
max_depth = [5, 8, 15, 25, 30]
extra_trees = [True, False]
learning_rate = [0.01,0.1,0.2,0.05]

# Parameter grid to pass in RandomizedSearchCV
param_grid={'min_gain_to_split': min_gain_to_split,
               'min_data_in_leaf': min_data_in_leaf,
               'feature_fraction': feature_fraction,
               'max_depth': max_depth,
               'extra_trees': extra_trees,
               'learning_rate': learning_rate,
               'boosting_type': ['gbdt'],
               'objective': ['binary'],
               'is_unbalance': [True],
               'metric': ['binary_logloss'],} 

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
lgbm_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=10, random_state=seed, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
lgbm_tuned.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(lgbm_tuned.best_params_,lgbm_tuned.best_score_))

In [None]:
# building model with best parameters
lgbm_tuned_model = lgb.LGBMClassifier(
               min_gain_to_split = 0.01,
               min_data_in_leaf = 50,
               feature_fraction = 0.8,
               max_depth = 8,
               extra_trees = False,
               learning_rate = 0.2,
               objective = 'binary',
               metric = 'binary_logloss',
               is_unbalance = True,
               boosting_type = 'gbdt',
               random_state = seed
)
# Fit the model on training data
lgbm_tuned_model.fit(X_train_un, y_train_un)

In [None]:
#confusion matrix
make_confusion_matrix(lgbm_tuned_model, X_val, y_val)

Tuning GBM with Down Sampled data
Finding best parameter for high recall using Random Search with cross validation

In [None]:
%%time

# defining model
model = GradientBoostingClassifier(random_state=seed)

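# Number of boosting stages (trees) in the gradient boosting model
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 2000, num = 10)]
# "auto" was an alias of "sqrt" for this classifier and is rejected by newer scikit-learn releases
max_features = ['sqrt']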
max_depth = [5, 8, 15, 25, 30]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10, 15] 


# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf} 

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
gbm_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=10, random_state=seed, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
gbm_tuned.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(gbm_tuned.best_params_,gbm_tuned.best_score_))

In [None]:
# building model with best parameters
gbm_tuned_model = GradientBoostingClassifier(
    n_estimators=700,
    max_features="sqrt",  # "auto" meant sqrt(n_features) for this classifier; newer scikit-learn releases reject it
    max_depth=25,
    min_samples_split=2,
    min_samples_leaf=15,
    random_state=seed,
)
# Fit the model on training data
gbm_tuned_model.fit(X_train_un, y_train_un)

In [None]:
gbm_tuned_model_score = get_metrics_score(
    gbm_tuned_model, X_train, X_val, y_train, y_val
)


kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

scoring = "recall"
gbm_down_cv = cross_val_score(
    estimator=gbm_tuned_model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)


add_score_model(
    "GBM Tuned with Down Sampling", gbm_tuned_model_score, gbm_down_cv.mean()
)

In [None]:
#Confusion matrix
make_confusion_matrix(gbm_tuned_model, X_val, y_val)

In [None]:
# Comparing models
comparison_frame = pd.DataFrame(
    {
        "Model": model_names,
        "Cross_Val_Score_Train": cross_val_train,
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
        "Train_F1": f1_train,
        "Test_F1": f1_test,
        "Train_ROC_AUC": roc_auc_train,
        "Test_ROC_AUC": roc_auc_test,
    }
)


for col in comparison_frame.select_dtypes(include="float64").columns.tolist():
    comparison_frame[col] = round(comparison_frame[col] * 100, 0).astype(int)


comparison_frame.tail(4).sort_values(
    by=["Cross_Val_Score_Train", "Test_Recall"], ascending=False
)

Final Model Selection

The XGBoost model with hyperparameter tuning, trained on the undersampled dataset, has the best recall on the validation set (~99%), but its accuracy is lower than the baseline accuracy (i.e., classifying everyone as a non-attriting customer). Thus, we are not selecting this model as the final model.

GBM with hyperparameter tuning, trained on the undersampled dataset, provides a validation recall of ~97%, with validation accuracy of ~94%, precision of ~74%, validation AUC of ~99%, and a cross-validation mean of 96%. The model also suffers from neither high bias nor high variance. We are selecting the GBM Tuned with Down Sampling model as our final model.
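
The baseline mentioned above, classifying every customer as non-attriting, can be worked out from the class counts printed for the training split before resampling. The validation split is stratified, so its baseline should be close to this figure, though that is an assumption rather than a value reported in the notebook.

```python
# Class counts in the training split, as printed before resampling
n_attrited = 976
n_existing = 5099

# A "model" that labels every customer as non-attriting is right for all
# existing customers and wrong for all attrited ones
baseline_accuracy = n_existing / (n_existing + n_attrited)
baseline_recall = 0 / n_attrited  # it never finds an attriting customer

print(f"baseline accuracy: {baseline_accuracy:.3f}, recall: {baseline_recall:.1f}")
```

A baseline near 84% accuracy with zero recall shows why accuracy alone is a weak criterion here, and why a candidate model should beat this baseline on accuracy while being chosen primarily on recall.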

Check Test Data on GBM Tuned and Trained with Downsampled Data
Feature Importance

In [None]:
feature_names = X_train.columns
importances = gbm_tuned_model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Test scores
Let's check the performance of the model on Test (unseen) dataset.

In [None]:
gbm_tuned_model_test_score = get_metrics_score(
    gbm_tuned_model, X_train, X_test, y_train, y_test
)

final_model_names = ["gbm Tuned Down-sampled Trained"]
final_acc_train = [gbm_tuned_model_test_score[0]]
final_acc_test = [gbm_tuned_model_test_score[1]]
final_recall_train = [gbm_tuned_model_test_score[2]]
final_recall_test = [gbm_tuned_model_test_score[3]]
final_precision_train = [gbm_tuned_model_test_score[4]]
final_precision_test = [gbm_tuned_model_test_score[5]]
final_f1_train = [gbm_tuned_model_test_score[6]]
final_f1_test = [gbm_tuned_model_test_score[7]]
final_roc_auc_train = [gbm_tuned_model_test_score[8]]
final_roc_auc_test = [gbm_tuned_model_test_score[9]]

final_result_score = pd.DataFrame(
    {
        "Model": final_model_names,
        "Train_Accuracy": final_acc_train,
        "Test_Accuracy": final_acc_test,
        "Train_Recall": final_recall_train,
        "Test_Recall": final_recall_test,
        "Train_Precision": final_precision_train,
        "Test_Precision": final_precision_test,
        "Train_F1": final_f1_train,
        "Test_F1": final_f1_test,
        "Train_ROC_AUC": final_roc_auc_train,
        "Test_ROC_AUC": final_roc_auc_test,
    }
)


for col in final_result_score.select_dtypes(include="float64").columns.tolist():
    final_result_score[col] = final_result_score[col] * 100


final_result_score

The performance of the model on the test data is very close to its performance on the validation dataset.

In [None]:
#confusion matrix
make_confusion_matrix(gbm_tuned_model, X_test, y_test)

Cumulative Lift/Gain Chart
A cumulative lift/gain chart is important for understanding how a model would perform in a production system with unseen data.

In [None]:
!pip install scikit-plot
import scikitplot as skplt

y_pred_prob = gbm_tuned_model.predict_proba(X_test)

skplt.metrics.plot_cumulative_gain(y_test, y_pred_prob, figsize=(15, 5))
plt.show()

The chart above shows that if we sort the customers in descending order of predicted probability of attrition (class 1) and target the top ~30% of the population, we are likely to find nearly 100% of the customers who would actually attrite.
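
The quantity read off a cumulative gain chart can be computed by hand. This is a minimal sketch with hypothetical labels and probabilities, not the scikit-plot implementation; the function name cumulative_gain is an assumption.

```python
def cumulative_gain(y_true, scores, top_fraction):
    """Share of all positives found in the top `top_fraction` of rows
    when rows are ranked by descending score."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_top = round(top_fraction * len(ranked))
    found = sum(label for _, label in ranked[:n_top])
    return found / sum(y_true)


# Hypothetical labels and predicted attrition probabilities
y_toy = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
p_toy = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.3, 0.05, 0.15, 0.25]

gain_at_30 = cumulative_gain(y_toy, p_toy, top_fraction=0.3)
print(gain_at_30)
```

In this toy case the three attriting customers hold the three highest scores, so targeting the top 30% captures all of them, which mirrors the shape of the curve described above.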

ROC-AUC Curve
The ROC AUC characterizes how well the model separates the classes.

If the model is very good at identifying the classes, the area under the curve is high, close to 1.

If the model cannot distinguish the classes well, the area under the curve is low, close to 0.5.
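
These two extremes follow from one interpretation of AUC: the probability that a randomly chosen positive is scored higher than a randomly chosen negative. The sketch below computes it by comparing every positive-negative pair; the function name roc_auc and the toy scores are hypothetical, and this brute-force form is for illustration rather than a replacement for the scikit-learn metric.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive is scored
    higher than a randomly chosen negative (ties count as one half)."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels with two sets of scores
labels = [1, 1, 0, 0, 0]
perfect = roc_auc(labels, [0.9, 0.8, 0.3, 0.2, 0.1])
uninformative = roc_auc(labels, [0.5, 0.5, 0.5, 0.5, 0.5])
print(perfect, uninformative)
```

Scores that rank every positive above every negative give an AUC of 1, while scores that carry no ranking information give 0.5.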

In [None]:
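# plot_roc_curve was removed from recent scikit-learn releases;
# RocCurveDisplay.from_estimator is its replacement
from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(gbm_tuned_model, X_test, y_test)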
plt.title("Receiver Operating Characteristic")
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1], "b--")
plt.xlim([-0.05, 1])
plt.ylim([0, 1.05])
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.show()

Our model appears to be really good, since the AUC is almost 1.

Productionizing the model
Now that we have finalized our model, we'll build a model pipeline to streamline all the steps of model building. We'll start with the initial dataset and proceed with the pipeline building steps.

A machine learning (ML) pipeline represents the sequence of steps, including data transformation and prediction, through which data passes. The outcome of the pipeline is the trained model, which can be used for making predictions. sklearn.pipeline is a Python implementation of an ML pipeline. Instead of going through the model fitting and data transformation steps for the training and test datasets separately, we can use sklearn.pipeline to automate these steps. Here is the diagram representing the pipeline for training our supervised machine learning model and then using test data to predict the labels.

In [None]:
# The static variables
# Random state and loss
seed = 1
loss_func = "logloss"

# Test and Validation sizes
test_size = 0.2
val_size = 0.25

# Dependent variable value map
target_mapper = {"Attrited Customer": 1, "Existing Customer": 0}

df_pipe = churner.copy()
cat_columns = df_pipe.select_dtypes(include="object").columns.tolist()
df_pipe[cat_columns] = df_pipe[cat_columns].astype("category")

Dependent and independent variables

In [None]:
X = df_pipe.drop(columns=["Attrition_Flag"])
y = df_pipe["Attrition_Flag"].map(target_mapper)

In [None]:
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=test_size, random_state=seed, stratify=y
)

# then we split the temporary set into train and validation

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=val_size, random_state=seed, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)

In [None]:
print(y_train.value_counts(normalize=True))
print(y_val.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

Undersample the training data, since the model generalized well when trained on undersampled data

In [None]:
under_sample = RandomUnderSampler(random_state=seed)
# Use the sampler defined above (the original referenced an undefined name, rus)
X_train_un, y_train_un = under_sample.fit_resample(X_train, y_train)
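For intuition, random undersampling keeps a random subset of each larger class so that every class ends up with as many rows as the smallest one. The pure-Python sketch below illustrates that idea on toy data. `random_undersample` is a hypothetical helper, not imbalanced-learn's implementation, and its default behaviour is assumed here to mirror the balanced result that `RandomUnderSampler` produces with default settings.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=1):
    """Randomly keep min-class-size rows from every class."""
    rng = random.Random(seed)
    indices_by_class = {}
    for idx, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(idx)
    n_keep = min(len(indices) for indices in indices_by_class.values())
    kept = []
    for indices in indices_by_class.values():
        # Sample without replacement; the smallest class is kept in full
        kept.extend(rng.sample(indices, n_keep))
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data (hypothetical): 2 attrited customers (1) among 10
toy_rows = [[i] for i in range(10)]
toy_y = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

rows_balanced, y_balanced = random_undersample(toy_rows, toy_y)
balanced_counts = Counter(y_balanced)
print(balanced_counts)
```

The cost of this approach is that majority-class rows are discarded, so it is applied to the training set only, while validation and test sets keep their original class proportions.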

In [None]:
# For dropping columns
columns_to_drop = [
    "clientnum",
    "credit_limit",
    "dependent_count",
    "months_on_book",
    "avg_open_to_buy",
    "customer_age",
]

# For masking a particular value in a feature
column_to_mask_value = "income_category"
value_to_mask = "abc"
masked_value = "Unknown"

# One-hot encoding columns
columns_to_encode = [
    "gender",
    "education_level",
    "marital_status",
    "income_category",
    "card_category",
]

# Numerical Columns
num_columns = [
    "total_relationship_count",
    "months_inactive_12_mon",
    "contacts_count_12_mon",
    "total_revolving_bal",
    "total_amt_chng_q4_q1",
    "total_trans_amt",
    "total_trans_ct",
    "total_ct_chng_q4_q1",
    "avg_utilization_ratio",
]

# Columns for null imputation with Unknown
columns_to_null_imp_unknown = ["education_level", "marital_status"]

Build the pipeline

In [None]:
# To Standardize feature names
feature_name_standardizer = FeatureNamesStandardizer()

# To Drop unnecessary columns
column_dropper = ColumnDropper(features=columns_to_drop)

# To Mask incorrect/meaningless value of a feature
value_masker = CustomValueMasker(
    feature=column_to_mask_value, value_to_mask=value_to_mask, masked_value=masked_value
)

# Missing value imputation
imputer = FillUnknown()

# To encode the categorical data
one_hot = OneHotEncoder(handle_unknown="ignore")

# To scale numerical columns
scaler = RobustScaler()


# creating a transformer for feature name standardization and dropping columns
cleanser = Pipeline(
    steps=[
        ("feature_name_standardizer", feature_name_standardizer),
        ("column_dropper", column_dropper),
        ("value_mask", value_masker),
        ("imputation", imputer),
    ]
)

# creating a transformer for data encoding

encode_transformer = Pipeline(steps=[("onehot", one_hot)])
num_scaler = Pipeline(steps=[("scale", scaler)])

preprocessor = ColumnTransformer(
    transformers=[
        ("encoding", encode_transformer, columns_to_encode),
        ("scaling", num_scaler, num_columns),
    ],
    remainder="passthrough",
)

# Model

gbm_tuned_model = GradientBoostingClassifier(
    n_estimators=700,
    max_features="sqrt",  # "auto" meant sqrt(n_features) for this classifier; removed in scikit-learn 1.3
    max_depth=25,
    min_samples_split=2,
    min_samples_leaf=15,
    random_state=seed,
)

# Creating new pipeline with best parameters
model_pipe = Pipeline(
    steps=[
        ("cleanse", cleanser),
        ("preprocess", preprocessor),
        ("model", gbm_tuned_model),
    ]
)
# Fit the model on training data
model_pipe.fit(X_train_un, y_train_un)

Score the pipeline using test data
Accuracy

In [None]:
print(
    "Accuracy on Test is: {}%".format(round(model_pipe.score(X_test, y_test) * 100, 0))
)

In [None]:
# Recall (and accuracy) at a 0.5 probability threshold
pred_train_p = (model_pipe.predict_proba(X_train_un)[:, 1] > 0.5).astype(int)
pred_test_p = (model_pipe.predict_proba(X_test)[:, 1] > 0.5).astype(int)

# scikit-learn metrics take (y_true, y_pred) in that order
train_acc_p = accuracy_score(y_train_un, pred_train_p)
test_acc_p = accuracy_score(y_test, pred_test_p)

train_recall_p = recall_score(y_train_un, pred_train_p)
test_recall_p = recall_score(y_test, pred_test_p)

In [None]:
print("Recall on Test is: {}%".format(round(test_recall_p * 100, 0)))

In [None]:
# np.bool was removed in NumPy 1.24; the built-in bool is the replacement
mask = np.zeros_like(data_clean.corr(), dtype=bool)
mask[np.triu_indices_from(mask)] = True


sns.set(rc={"figure.figsize": (15, 15)})

sns.heatmap(
    data_clean.corr(),
    cmap=sns.diverging_palette(20, 220, n=200),
    annot=True,
    mask=mask,
    center=0,
)
plt.show()

The most important features for understanding credit card churn are:
Total Transaction Count
Total Transaction Amount
Total Revolving Balance
Total Amount Change Q4 to Q1
Total Count Change Q4 to Q1
Total Relationship Count

All of these features are negatively correlated with the Attrition Flag: the lower their values, the more likely a customer is to attrite.
 

The bank should contact customers more often to strengthen engagement, and offer them schemes that deepen their relationship with the bank.

The bank should offer cashback schemes on credit cards, which may encourage customers to use their cards more often.

The bank should also offer credit limit increases to customers who use their cards regularly, which should increase card spend and transaction amounts.

A 0% interest EMI (equated monthly installment) offer could encourage customers to buy higher-priced products on the card and convert the purchase to EMI, increasing both total transaction amount and transaction count. It would also keep a revolving balance on the card.

Alongside the existing card types, the bank could introduce cards tailored to online shopping (with percentage cashback offers) or online food ordering, so that the card is used more frequently.

With our model, we can predict which customers are likely to attrite. Ranked by predicted probability, at least the top 20-30% of customers can be contacted to discuss card offers, credit limit increases, and similar incentives in an effort to retain them.
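A minimal sketch of how such an outreach list could be produced is shown below. The helper `top_risk_customers`, the customer IDs, and the probabilities are hypothetical. In the notebook, the probabilities would come from something like `model_pipe.predict_proba(X)[:, 1]`.

```python
def top_risk_customers(customer_ids, churn_probabilities, fraction=0.3):
    """Return the `fraction` of customers with the highest predicted churn probability."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(
        zip(customer_ids, churn_probabilities),
        key=lambda pair: pair[1],
        reverse=True,
    )
    n_select = max(1, round(len(ranked) * fraction))
    return ranked[:n_select]


# Hypothetical customers and predicted attrition probabilities
demo_ids = ["C01", "C02", "C03", "C04", "C05", "C06", "C07", "C08", "C09", "C10"]
demo_probs = [0.05, 0.91, 0.12, 0.67, 0.08, 0.33, 0.74, 0.02, 0.15, 0.21]

outreach_list = top_risk_customers(demo_ids, demo_probs, fraction=0.3)
for customer_id, probability in outreach_list:
    print(customer_id, probability)
```

The fraction is a business decision: a larger value reaches more potential attriters at the cost of contacting more customers who would have stayed anyway.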