<a id="top"></a>
<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Table of Contents</h3>

- [1. Reading the Data](#1)
- [2. EDA: Exploring Insights](#2)
    - [2.1 An Overview from the Data](#2.1)
    - [2.2 Demographic Analysis](#2.2)
    - [2.3 Financial Profile](#2.3)
- [3. Prep: Building Pipelines](#3) 
    - [3.1 Initial Pipeline](#3.1)
        - [3.1.1 Candidate Features](#3.1.1)
        - [3.1.2 Duplicated Data](#3.1.2)
        - [3.1.3 Target Definition](#3.1.3)
        - [3.1.4 Training and Testing Data](#3.1.4)
    - [3.2 Numerical Pipeline](#3.2)
        - [3.2.1 Null Data](#3.2.1)
        - [3.2.2 Log Transformation](#3.2.2)
        - [3.2.3 Normalization](#3.2.3)
    - [3.3 Categorical Pipeline](#3.3)
        - [3.3.1 Dummies Encoding](#3.3.1)
    - [3.4 Complete Pipelines](#3.4)
- [4. Modelling: Predicting Churn](#4)
    - [4.1 Structuring Variables](#4.1)
    - [4.2 Training Models](#4.2)
    - [4.3 Evaluating Models](#4.3)

This notebook documents an exploratory analysis of the [Credit Card Customers](https://www.kaggle.com/sakshigoyal7/credit-card-customers) dataset, taken from the Kaggle platform, with the goal of improving skills in Data Science and Machine Learning.

___
**_Description and context:_**
_A bank manager is in a scenario where several customers are leaving their credit card services. It would be extremely interesting for the company to be able to predict the customers most likely to leave such services so that, in this way, the bank can act preventively in order to offer better services in favor of maintaining the customer._

_[...]
The data set has 10,000 customers with attributes such as age, salary, marital status, credit limit, card category, among others. There are approximately 18 features in the whole set and there are only 16.0% of customers with churn_
___

In [None]:
!pip install pycomp --upgrade --no-cache-dir

In [None]:
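# Importing libraries
import os
from warnings import filterwarnings

# Imported explicitly because np, plt and sns are used directly in later cells
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pycomp.viz.insights import *

pd.options.display.max_columns = 500
filterwarnings('ignore')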

# Project variables
DATA_PATH = '../input/credit-card-customers/'
FILENAME = 'BankChurners.csv'

<a id="1"></a>
<font color="darkslateblue" size=+2.5><b>1. Reading the Data</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

After a formal definition of the project context, the alignment of the objectives of this work and, finally, the definition of the project variables, it is possible to start the investigations by reading the database available for analysis.

In this first contact, it is expected to understand a little more about the available content and the possibilities of analysis within the defined context. It is at this point that the data analyst/scientist takes the first impressions of the data and sets a macro direction for the project while looking for relevant insights to the business problem.

In [None]:
# Reading data
df = pd.read_csv(os.path.join(DATA_PATH, FILENAME))
print(f'Shape of the data: {df.shape}')
df.head()

Looking at the [metadata](https://www.kaggle.com/sakshigoyal7/credit-card-customers) of the dataset available in Kaggle, it is possible to detail each of the 23 columns in the database as:

- **_CLIENTNUM_** unique identifier of the cardholder;
- **_Attrition_Flag_** internal event related to customer activity;
- **_Customer_Age_** age of the customer (in years);
- **_Gender_** gender of the client (M = Male, F = Female);
- **_Dependent_count_** number of customer dependents;
- **_Education_Level_** customer's school level;
- **_Marital_Status_** marital status of the client;
- **_Income_Category_** category related to the client's annual salary;
- **_Card_Category_** credit card category (Blue, Silver, Gold or Platinum);
- **_Months_on_book_** period of relationship with the bank (in months)

___

_The other attributes present in the database do not have detailed descriptions on the metadata page. However, in an intuitive way, it is possible to extract some meaning from these from the name of the registered columns_
___

- **_Total_Relationship_Count:_** indicator of the customer's general relationship with the bank;
- **_Months_Inactive_12_mon:_** number of months of customer inactivity considering the last 12 months;
- **_Contacts_Count_12_mon:_** number of contacts registered by the customer considering the last 12 months;
- **_Credit_Limit:_** customer's credit limit;
- **_Total_Revolving_Bal:_** total revolving balance on the credit card;
- **_Avg_Open_To_Buy:_** average credit still available to the customer (open-to-buy credit line);
- **_Total_Amt_Chng_Q4_Q1:_** probably indicates the change in transaction amount between Q4 and Q1;
- **_Total_Trans_Amt:_** total amount transacted by the customer;
- **_Total_Trans_Ct:_** total number of transactions made by the customer on the card;
- **_Total_Ct_Chng_Q4_Q1:_** probably indicates the change in the number of card transactions between Q4 and Q1;
- **_Avg_Utilization_Ratio:_** indicator of average customer usage of the card;

___
_The last two attributes present in the database indicate, in some way, the results of classification processes and the construction of scores for customers_
___

<a id="2"></a>
<font color="darkslateblue" size=+2.5><b>2. EDA: Exploring Insights</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

At this point, the project's objective is well defined and the database has already been read into a pandas DataFrame. From this moment on, a thorough scan of the data will be carried out through a detailed descriptive analysis, in order to gather relevant insights for the business context.

This section uses the homemade package [pycomp](https://github.com/ThiagoPanini/pycomp), built precisely to make the work of data scientists easier across the pillars of insights, prep and modeling. By the end of it, the goal is to have a full understanding of the available data and a clear idea of the steps required in the prep and modeling stages.

<img src="https://i.imgur.com/WcAaq1P.png" alt="pycomp Logo">

<a id="2.1"></a>
<font color="dimgrey" size=+2.0><b>2.1 An Overview from the Data</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

The first proposed analysis is based on the extraction of metadata from the available database itself. Performing this work is extremely important to have a clear idea about the attributes contained in the data set and the existing possibilities given the characteristics of the features.

In [None]:
# Returning an overview from the data
df_overview = data_overview(df=df)
df_overview

The `data_overview()` function, taken from the [pycomp](https://github.com/ThiagoPanini/pycomp) package, returns an overview of a given database, informing the user of important factors such as the quantity of null records, the primitive type and the number of categorical entries for each column. Observing the result generated for the database in question, it is possible to state:

- There are no null records for the database available;
- Of the 23 columns available, 6 are categorical and 17 are numeric;
- The categorical column with the most registered entries is **_Education_Level_** with 7 different entries;
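For readers without `pycomp` installed, the kind of summary described above can be approximated with plain pandas. This is only a minimal sketch: the function name `simple_overview`, the output column names and the toy values are assumptions made for illustration, not the package's actual implementation or output.

```python
import pandas as pd

def simple_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: null count, dtype and, for non-numeric
    columns, the number of distinct entries (hypothetical helper)."""
    overview = pd.DataFrame({
        'feature': df.columns,
        'null_count': df.isnull().sum().values,
        'dtype': df.dtypes.astype(str).values,
    })
    # Distinct entries are only counted for non-numeric (categorical) columns
    overview['n_categories'] = [
        0 if pd.api.types.is_numeric_dtype(df[col]) else df[col].nunique()
        for col in df.columns
    ]
    return overview

# Tiny illustrative frame (made-up values, not the real dataset)
toy = pd.DataFrame({
    'Customer_Age': [45, 49, None, 40],
    'Gender': ['M', 'F', 'F', 'M'],
    'Card_Category': ['Blue', 'Blue', 'Silver', 'Gold'],
})
overview = simple_overview(toy)
print(overview)
```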

<a id="2.2"></a>
<font color="dimgrey" size=+2.0><b>2.2 Demographic Analysis</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

In this first exploratory analysis section, demographic variables present in the database, such as age, gender, dependents and education, will be discussed. The objective is to understand the public of this banking institution a little better and to cross these factors with other key variables that may help explain why customers leave.

___
**_Customer analysis by age and gender_**
___

In [None]:
# Customers age distribution
plot_distplot(df=df, col='Customer_Age', hist=True, title='Age distribution of the customers on the data')

The graph above shows an approximately normal distribution for the age variable, as expected, so the customers in the dataset do not appear to be biased towards any particular age group. Now let's look at the age distribution by other demographic factors:

In [None]:
# Public by gender
plot_donut_chart(df=df, col='Gender', colors=['lightcoral', 'lightskyblue'],
                 title='Total customers by gender')

In [None]:
# Age by gender
plot_distplot(df=df, col='Customer_Age', hue='Gender', kind='kde', color_list=['lightcoral', 'lightskyblue'],
              title='Age distribution of the customers by its gender')

The distribution charts above allow us to extract some important information about the demographic attributes of the customers present in the database:

1. The public of analysis is mainly made up of customers between 45 and 55 years old;
2. There is a good balance of clients by gender: 53% are women and 47% are men, with no distinction between these two groups in relation to age (very similar distribution curves);

___
**_Analysis of the public by family dependents_**
___

In [None]:
# Dependents
plot_pie_chart(df=df, col='Dependent_count', explode=(0.02, 0.02, 0, 0, 0, 0),
               title='Customer analysis by its dependents')

In [None]:
# Age by dependents
plot_distplot(df=df, col='Customer_Age', hue='Dependent_count', kind='boxen', palette='plasma',
              title="Customer's age distribution by dependents count")

In general, from the analysis of the public in relation to family dependents, it is possible to state:

1. Most customers have 1, 2 or 3 dependents;
2. Customers with a high number of dependents are usually concentrated in a narrower age range (between 35 and 55 years old), while customers with a low number of dependents (none, 1 or 2) show a greater spread in relation to age;

___
**_Analysis of the public by marital status and education level_**
___

In [None]:
# Marital status
plot_donut_chart(df=df, col='Marital_Status', title="Total customers by marital status")

In [None]:
# Dependents and marital status
plot_countplot(df=df, col='Dependent_count', hue='Marital_Status', figsize=(17, 8),
               title="Customer analysis by marital status and dependents count")

An analysis that is probably interesting for the financial institution is the joint study of the marital status and the number of dependents of each client. Different decisions can be made taking into account the marital situation and the "family size" of each client. The bar graph above shows:

1. Married clients make up most of the base and are the largest group in nearly every dependents count. The exception is clients without dependents, where single clients outnumber married ones.
2. The financial institution may want to analyze further the single customers who have a large number of dependents (green bars). This audience may have specific needs and specific spending behaviors.

In [None]:
# Education
plot_countplot(df=df, col='Education_Level', order=True, palette='Blues_r',
               title='Total customers by education level')

The graph above is important to have a better understanding of the audience present at the base in relation to the level of education of customers.

___
**_Analysis by income category_**
___

In [None]:
# Income category
plot_countplot(df=df, col='Income_Category', order=False, palette='cividis',
               title='Total customers by salary range (income category in annual earnings $)')

In [None]:
# Salary range by age
plot_distplot(df=df, col='Customer_Age', hue='Income_Category', kind='box', order=True, palette='cividis',
              title='Age distribution by income category')

The above graphs allow us to infer that:

1. The largest group of this institution's clients falls into the portion with annual earnings of less than $40K;
2. There is a subtle relationship between age and salary range, indicating that customers with high annual earnings are usually part of an older audience.

<a id="2.3"></a>
<font color="dimgrey" size=+2.0><b>2.3 Financial Profile</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

After a brief analysis of the base's public through customer demographic variables, it is possible to continue the study using variables that, in some way, characterize the public from the financial consumption and / or use of the bank's resources.

From this point, it will be possible to understand a little better the target of analysis of the project as a whole.

___
**_Attrited Customers_**
___

In [None]:
# Attrition flag
plot_pie_chart(df=df, col='Attrition_Flag', explode=(0, 0.03),
               title='How many customers have some attrition with the bank?')

The graph above reveals that approximately 16% of the customers in the base are flagged as attrited. This is an important slice of analysis, since it represents the customers who, in some way, were not comfortable with the services offered by this financial institution.

According to the metadata, this column describes exactly the customers who left the bank (churn) and thus represents the target for much of the subsequent analysis. Before diving deeper into this slice, let's look at some other important categories for a better understanding of the audience.

___
**_Credit Card Category_**
___

In [None]:
# Card category
plot_countplot(df=df, col='Card_Category', palette=['darkslateblue', 'silver', 'gold', 'cadetblue'],
               order=True, title='Total customers by card category')

In [None]:
plot_distplot(df=df, col='Months_on_book', hue='Card_Category', kind='box', 
              palette=['darkslateblue', 'gold', 'silver', 'cadetblue'],
              title='Relationship time by card category')

In [None]:
# Average credit limit by card category
plot_aggregation(df=df, group_col='Card_Category', value_col='Credit_Limit', aggreg='mean',
                 palette=['cadetblue', 'gold', 'silver', 'darkslateblue'],
                 title="Average credit limit by card category")

The charts above show important factors related to how customers can be categorized in terms of consumption variables at this financial institution. At first, it is possible to state that:

1. 93% of base customers have a "Blue" card, followed by 5.5% of customers with a Silver card, 1.1% in the Gold category and only 0.2% from the Platinum category;
2. Regarding customer relationship time, it is possible to perceive a slight positive correlation between "card level" and "long relationship time" with the bank.
3. Analyzing the pre-approved limit of customers by type of card, it is noticed that the "level" of the card is directly proportional to the average pre-approved limit, following the order Platinum, Gold, Silver and Blue;
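The numbers behind charts like these can be checked with plain pandas. The sketch below uses a small made-up frame (the values are assumptions, not the real dataset) to show how the share of customers per category and the mean limit per category could be computed.

```python
import pandas as pd

# Toy data with invented values, only to illustrate the aggregation
toy = pd.DataFrame({
    'Card_Category': ['Blue', 'Blue', 'Silver', 'Gold', 'Blue', 'Silver'],
    'Credit_Limit': [2000.0, 4000.0, 12000.0, 30000.0, 3000.0, 18000.0],
})

# Share of customers per card category, in percent
share = toy['Card_Category'].value_counts(normalize=True).mul(100).round(1)

# Mean credit limit per category, highest first
mean_limit = (toy.groupby('Card_Category')['Credit_Limit']
                 .mean()
                 .sort_values(ascending=False))

print(share)
print(mean_limit)
```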

___
**_Correlation matrix_**
___

After an initial look at important variables that describe the profile of the financial institution's customers, it is possible to analyze all numerical attributes at once in terms of correlation. This approach gives an overview, in a single view, of how the variables are correlated with each other and with a target variable (`Attrition_Flag`).

In [None]:
# Correlation matrix
tmp = df.copy()
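# Note: the flag is 1 for existing customers and 0 for attrited ones,
# so a positive correlation with it points towards retention, not churn
tmp['churned'] = tmp['Attrition_Flag'].map({'Existing Customer': 1, 'Attrited Customer': 0})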
clf_drop_cols = ['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
                 'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2']
plot_corr_matrix(df=tmp.drop(clf_drop_cols, axis=1), corr_col='churned', figsize=(13, 13))

From the correlation matrix above, it is possible to analyze the main factors possibly related to the _churn_ of this financial institution's clients. The `plot_corr_matrix()` function of the `pycomp` package analyzes the correlation (default = `Pearson`) of the numeric variables with a target variable (argument `corr_col`) and also among themselves, allowing a detailed analysis of the most correlated variables. Note that the target flag built in the cell above is 1 for existing customers and 0 for attrited ones, so a positive correlation points towards retention. With that in mind:

1. The variables `Total_Trans_Ct`, `Total_Ct_Chng_Q4_Q1` and `Total_Revolving_Bal` are the top 3 features with the strongest positive correlation with the target flag. Given the encoding used, higher values of these 3 variables are associated with customers who stay, that is, with lower _churn_.
2. At the other end of the spectrum, `Contacts_Count_12_mon` and `Months_Inactive_12_mon` are the 2 main features with a negative correlation with the target flag. This means that the higher the value of these 2 variables, the more likely the customer is to have churned.
3. Analyzing the correlations of the variables with each other, it is possible to mention:
    * The variables `Total_Trans_Amt` and `Total_Trans_Ct` have a high positive correlation: the higher the number of transactions, the higher the total amount transacted on the card
    * The variables `Avg_Utilization_Ratio` and `Total_Revolving_Bal` have a high positive correlation
    * The variables `Customer_Age` and `Months_on_book` have a high positive correlation: the older the customer, the longer the relationship with the bank
    * There is an inverse relationship between `Credit_Limit` and `Avg_Utilization_Ratio`: the higher the pre-approved limit, the lower the average utilization of the card;
    * The same inverse relationship occurs between `Avg_Open_To_Buy` and `Avg_Utilization_Ratio`: customers with more credit still available use a smaller share of their limit.
4. The pre-approved limit (`Credit_Limit`) shows almost no correlation with _churn_
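Since the sign of a correlation depends entirely on how the binary flag is encoded, it can help to see this on a small example. The sketch below uses synthetic data (the numbers and the simulated relationship are assumptions, not the real dataset) and pandas' `corrwith` to show that swapping the 0/1 encoding flips every sign while keeping the magnitudes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Synthetic flag: 1 = attrited, 0 = existing (about 16% attrited)
attrited = rng.binomial(1, 0.16, size=n)

# Simulated features: churners are given fewer transactions on purpose
toy = pd.DataFrame({
    'Total_Trans_Ct': rng.normal(70, 15, size=n) - 25 * attrited,
    'Customer_Age': rng.normal(46, 8, size=n),  # unrelated to the flag here
})

# Pearson correlation of each feature with both encodings of the flag
corr_attrited = toy.corrwith(pd.Series(attrited))
corr_existing = toy.corrwith(pd.Series(1 - attrited))

print(pd.DataFrame({'flag=attrited': corr_attrited,
                    'flag=existing': corr_existing}))
```

Reading a correlation matrix therefore always starts with checking which class was mapped to 1.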

In [None]:
plot_distplot(df=df, col='Total_Trans_Ct', hue='Attrition_Flag', kind='kde',
              title='Total transactions on credit card by attrition flag\nDo attrited customers use more or less the credit card?')

Building on the correlations with a closer look at the variables most associated with customers who left the bank, it is possible to see that attrited customers concentrate at lower transaction counts on the card, which may indicate dissatisfaction with the services and a possible migration to other institutions.

In [None]:
plot_distplot(df=df, col='Total_Ct_Chng_Q4_Q1', hue='Attrition_Flag', kind='strip', 
              palette=['cadetblue', 'crimson'],
              title='Total transactions changed Q4-Q1 by attrition flag')

In [None]:
# Correlation between age and relationship time
fig, ax = plt.subplots(figsize=(15, 10))
sns.scatterplot(x='Customer_Age', y='Months_on_book', data=df, hue='Attrition_Flag',
                palette=['cadetblue', 'crimson'])
ax.set_title("Correlation between customer's age and relationship time\n(analysis by attrition flag)", size=16)
format_spines(ax, right_border=False)

The scatter plot above shows the positive relationship between age and length of relationship with the bank. The breakdown by attrition flag indicates no clear link between these two variables and the clients who left: the red points are spread across both dimensions rather than concentrated in a specific region of the axes.

In [None]:
fig, ax = plt.subplots(figsize=(15, 10))
sns.scatterplot(x='Credit_Limit', y='Avg_Utilization_Ratio', data=df, hue='Attrition_Flag',
                palette=['cadetblue', 'crimson'])
ax.set_title('Correlation between utilization ratio and credit limit\n(analysis by attrition)', size=16)
format_spines(ax, right_border=False)

In [None]:
fig, ax = plt.subplots(figsize=(15, 10))
sns.scatterplot(x='Avg_Open_To_Buy', y='Avg_Utilization_Ratio', data=df, hue='Attrition_Flag',
                palette=['cadetblue', 'crimson'])
ax.set_title('Correlation between utilization ratio and opening to buy\n(analysis by attrition flag)', size=16)
format_spines(ax, right_border=False)

The two scatter plots above show an inverse relationship involving the column that indicates the average use of the card by the customer. In the first case, we have the behavior of this variable against the pre-approved credit limit and, in the second case, against the credit the customer still has available to spend (`Avg_Open_To_Buy`).

Again, the breakdown by attrition flag does not show a specific niche or concentration in the plotted dimensions.

<a id="3"></a>
<font color="darkslateblue" size=+2.5><b>3. Prep: Building Pipelines</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

After a detailed exploratory analysis session on the available basis, it was possible to create familiarity with the database and to draw some valuable insights related to the defined business problem. The visions and graphical plots allowed a better understanding of the target audience of analysis and a clear idea about the most promising variables for the prediction of _churn_ of clients of this financial institution.

From that point on, a series of necessary steps will be proposed for the application of a complete _DataPrep_ process in the database in search of training a predictive model capable of returning the probability of _churn_ of each client.

For that, some features present in the `pycomp` library will be used, more specifically in its `pycomp.ml.transformers` module.

<a id="3.1"></a>
<font color="dimgrey" size=+2.0><b>3.1 Initial Pipeline</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

In this section, changes that apply to the whole dataset will be proposed. The main goal of an initial data preparation pipeline is to ensure that some steps are applied before the main pipeline comes into play, for example, an initial drop of features or the removal of duplicate data from a training base.

<a id="3.1.1"></a>
<font color="dimgrey" size=+1.0><b>3.1.1 Candidate Features</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

The objectives of this step are:
1. Filter the initial columns to be used in the modeling (elimination of key or non-representative columns for a predictive model)
2. Prepare a transformer that can carry out this process automatically, in case there is a need to repeat the entire training process
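To make the idea of a reusable column-filtering step concrete, here is a minimal sketch of a scikit-learn compatible transformer. The class name `ColumnSelector` and the toy frame are hypothetical and only illustrate the pattern; this is not the `pycomp` implementation.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that keeps only the listed columns."""

    def __init__(self, features):
        self.features = features

    def fit(self, X, y=None):
        # Nothing is learned from the data; fit exists for pipeline compatibility
        return self

    def transform(self, X):
        # Return a copy so later steps never modify the caller's frame
        return X[self.features].copy()

# Toy frame with invented values
toy = pd.DataFrame({
    'CLIENTNUM': [1, 2],
    'Customer_Age': [45, 49],
    'Gender': ['M', 'F'],
})
keep = [col for col in toy.columns if col != 'CLIENTNUM']
selected = ColumnSelector(features=keep).fit_transform(toy)
print(selected.columns.tolist())
```

Because the column list is a constructor argument, the same step can be rerun unchanged whenever the training process is repeated.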

In [None]:
# Importing class
from pycomp.ml.transformers import FiltraColunas

# Initial definition
TO_DROP = ['CLIENTNUM', 'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
           'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2']
INITIAL_FEATURES = list(df.drop(TO_DROP, axis=1).columns)

# Creating and applying the feature selection transformer
selector = FiltraColunas(features=INITIAL_FEATURES)
df_slct = selector.fit_transform(df)

# Results
print(f'Shape of original dataset: {df.shape}')
print(f'Shape of dataset after selecting candidate features: {df_slct.shape}')

<a id="3.1.2"></a>
<font color="dimgrey" size=+1.0><b>3.1.2 Duplicated Data</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

The objectives of this step are:

1. Check for the presence of null and duplicate data in the database
2. Treat null and duplicate data (if applicable)

In [None]:
# Looking at null and duplicated data
print(f'Total of null data: {df_slct.isnull().sum().sum()}')
print(f'Total of duplicated data: {df_slct.duplicated().sum()}')

Earlier, just after reading the database, the `data_overview()` function showed no null data in any column. With this confirmed, and having also verified the absence of duplicate data, we can move on to the next prep steps.

<a id="3.1.3"></a>
<font color="dimgrey" size=+1.0><b>3.1.3 Target Definition</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

In our original database, the column that identifies the target of the business problem is `Attrition_Flag`. To train a predictive model, this column must first be converted into 0s and 1s. For this, it is possible to use the `DefineTarget` class, which transforms a database given a target column and the entry to be treated as the positive class.
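Conceptually, this kind of target definition boils down to a comparison against the positive class. The plain pandas sketch below illustrates the idea on made-up rows; the output column name `target` follows the notebook, while everything else is an assumption rather than the transformer's internals.

```python
import pandas as pd

# Toy rows with the two labels found in Attrition_Flag
toy = pd.DataFrame({
    'Attrition_Flag': ['Existing Customer', 'Attrited Customer',
                       'Existing Customer', 'Existing Customer'],
})

pos_class = 'Attrited Customer'

encoded = toy.copy()
# 1 where the row matches the positive class, 0 otherwise
encoded['target'] = (encoded['Attrition_Flag'] == pos_class).astype(int)
encoded = encoded.drop(columns='Attrition_Flag')

print(encoded['target'].value_counts())
```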

In [None]:
# Importing class
from pycomp.ml.transformers import DefineTarget

# Applying transformation
target_transformer = DefineTarget(target_col='Attrition_Flag', pos_class='Attrited Customer')
df_tgt = target_transformer.fit_transform(df_slct)

df_tgt['target'].value_counts()

<a id="3.1.4"></a>
<font color="dimgrey" size=+1.0><b>3.1.4 Training and Testing Data</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

The objectives of this step are:

1. Define the target column of the model;
2. Apply data separation in training and testing
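For reference, the same kind of separation can be sketched with scikit-learn's `train_test_split`. The test size, random seed and the use of stratification below are assumptions chosen for illustration; they are not necessarily the defaults of the `pycomp` splitter.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with 20% positives
toy = pd.DataFrame({
    'Customer_Age': np.arange(100),
    'target': [1] * 20 + [0] * 80,
})

X = toy.drop(columns='target')
y = toy['target']

# stratify=y keeps the positive rate the same in both splits,
# which matters when the positive class is a minority
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape, y_tr.mean(), y_te.mean())
```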

In [None]:
# Importing class
from pycomp.ml.transformers import SplitDados

# Creating object and applying transformer
TARGET = 'target'
splitter = SplitDados(target=TARGET)
X_train, X_test, y_train, y_test = splitter.fit_transform(df_tgt)

# Results
print(f'Shape of X_train: {X_train.shape}')
print(f'Shape of X_test: {X_test.shape}')
print(f'Shape of y_train: {y_train.shape}')
print(f'Shape of y_test: {y_test.shape}')

<a id="3.2"></a>
<font color="dimgrey" size=+2.0><b>3.2 Numerical Pipeline</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

After building an initial pipeline capable of receiving a raw database, applying a feature selection process, defining the target and, finally, separating the data into training and testing sets, we will use the resulting training base to build two separate pipelines:

* **Numerical pipeline:** preparation of the numerical data contained in the database;
* **Categorical pipeline:** preparation of categorical data contained in the database.

In [None]:
# Splitting features by its dtype
num_features = [col for col, dtype in X_train.dtypes.items() if dtype != 'object']
cat_features = [col for col, dtype in X_train.dtypes.items() if dtype == 'object']

# Validating
print(f'Total of num_features: {len(num_features)}')
print(f'Total of cat_features: {len(cat_features)}')
print(f'Total of features after initial drop: {X_train.shape[1]}')

# Splitting data
X_train_num = X_train[num_features]
X_train_cat = X_train[cat_features]

Once the numerical and categorical sets of our training base are separated, we will start the steps of building individual pipelines for each of the two primitive types, starting with the numerical pipeline and its particularities.
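As a side note, the dtype-based separation can also be written with pandas' `select_dtypes`, which avoids comparing against the string `'object'`. A minimal sketch on a toy frame with invented values:

```python
import pandas as pd

toy = pd.DataFrame({
    'Customer_Age': [45, 49],
    'Credit_Limit': [12691.0, 8256.0],
    'Gender': ['M', 'F'],
})

# Numeric columns on one side, everything else treated as categorical
num_cols = toy.select_dtypes(include='number').columns.tolist()
cat_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(num_cols, cat_cols)
```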

<a id="3.2.1"></a>
<font color="dimgrey" size=+1.0><b>3.2.1 Null Data</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Perhaps the first step to investigate in terms of numerical pipelines is the presence of null data in the database. In this context, we saw, from the function `data_overview()` proposed at the beginning of the EDA process, that the data set does not have any null data. Thus, it will not be necessary to provide any transformers responsible for filling or dropping null data.

See the confirmation below.

In [None]:
# Returning null data
print(f'Null data on X_train_num:')
X_train_num.isnull().sum()

<a id="3.2.2"></a>
<font color="dimgrey" size=+1.0><b>3.2.2 Log Transformation</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

To validate the impact of the logarithmic transformation on candidate predictive models, we will optionally propose a step in the pipeline that applies this procedure to the numerical features present in the base. With this, we can validate whether the final performance of the model is sensitive to this type of transformation.

In [None]:
# Distribution example
log_ex = 'Credit_Limit'
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(17, 7))
plot_distplot(df=X_train_num, col=log_ex, ax=axs[0], hist=True,
              title=f'Original {log_ex} Distribution')

# log1p computes log(1 + x), which stays defined when x is 0
tmp_data = X_train_num.copy()
tmp_data[log_ex] = np.log1p(tmp_data[log_ex])
plot_distplot(df=tmp_data, col=log_ex, ax=axs[1], color='mediumseagreen', hist=True,
              title=f'{log_ex} After Log Transformation')

Two highly relevant statistical measures for distribution analysis are `skew` and `kurtosis`. This [link](https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa) gives a clear idea of what each of these measures is and how to interpret continuous distributions through their values.

The logarithmic transformation tends to help with distributions that have positive skewness (a long tail to the right), since it compresses large values. Thus, we will analyze the numerical features again and rank those with the greatest opportunity for improvement through this type of transformation.
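The effect on skewness can be seen on synthetic data. The sketch below draws right-skewed values from a lognormal distribution (the parameters are arbitrary assumptions, not fitted to the dataset) and compares the skew before and after `np.log1p`.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Right-skewed synthetic values, loosely shaped like a credit limit
values = rng.lognormal(mean=8.5, sigma=0.9, size=5000)

skew_before = skew(values)
skew_after = skew(np.log1p(values))

print(f'Skew before log1p: {skew_before:.2f}')
print(f'Skew after log1p:  {skew_after:.2f}')
```

A skew close to zero after the transformation indicates a roughly symmetric distribution.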

In [None]:
from scipy.stats import skew, kurtosis

tmp_ov = df_overview.copy()
tmp_ov['skew'] = tmp_ov.query('feature in @num_features')['feature'].apply(lambda x: skew(X_train_num[x]))
tmp_ov['kurtosis'] = tmp_ov.query('feature in @num_features')['feature'].apply(lambda x: kurtosis(X_train_num[x]))
tmp_ov[~tmp_ov['skew'].isnull()].sort_values(by='skew', ascending=False).loc[:, ['feature', 'skew', 'kurtosis']]

The table above ranks the features by their skewness and kurtosis. In the code block below, we run the `DynamicLogTransformation` class, whose role is to apply the logarithmic transformation to a database inside a preparation pipeline. The advantage of this class is that the user defines in advance the list of features to which the transformation will be applied.

In [None]:
# Importing class
from pycomp.ml.transformers import DynamicLogTransformation

# Defining parameters
COLS_TO_LOG = ['Total_Trans_Amt', 'Credit_Limit', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1',
               'Avg_Utilization_Ratio']
log_tr = DynamicLogTransformation(num_features=num_features, cols_to_log=COLS_TO_LOG)
X_train_num_ori = X_train_num.copy()
X_train_num_log = log_tr.fit_transform(X_train_num)

# Plotting some results
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(17, 12))

for i in range(3):
    ax = axs[0, i]
    plot_distplot(df=df, col=COLS_TO_LOG[i], ax=ax, hist=True,
                  title=f'{COLS_TO_LOG[i]} Distribution $Before$ \nLog Transformation')
    
for i in range(3):
    ax = axs[1, i]
    plot_distplot(df=X_train_num_log, col=COLS_TO_LOG[i], ax=ax, hist=True, color='mediumseagreen',
                  title=f'{COLS_TO_LOG[i]} Distribution $After$ \nLog Transformation')

plt.tight_layout()

Additionally, it is worth mentioning that the `DynamicLogTransformation` class has a Boolean attribute called `application` that can be used in the future in searches with `GridSearch` or `RandomizedSearch`. Its objective is to enable performance analysis of models **with** or **without** the logarithmic transformation.
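
To make the idea of a switchable transformation more concrete, here is a minimal sketch of a scikit-learn compatible transformer with an on/off flag. The class name `ToggleableLogTransformer` is hypothetical, and the use of `log1p` is an assumption; this is not the `pycomp` implementation, only an illustration of the concept.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ToggleableLogTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: applies log1p to selected columns when enabled."""

    def __init__(self, cols_to_log=None, application=True):
        self.cols_to_log = cols_to_log
        self.application = application

    def fit(self, X, y=None):
        # Nothing to learn: the transformation is stateless
        return self

    def transform(self, X):
        X = X.copy()
        if self.application and self.cols_to_log:
            # log1p(x) = log(1 + x), which keeps zero values finite
            X[self.cols_to_log] = np.log1p(X[self.cols_to_log])
        return X


# Toy data (values are illustrative only)
demo = pd.DataFrame({'Credit_Limit': [1500.0, 3000.0, 34000.0],
                     'Customer_Age': [35, 47, 61]})

with_log = ToggleableLogTransformer(cols_to_log=['Credit_Limit'], application=True).fit_transform(demo)
without_log = ToggleableLogTransformer(cols_to_log=['Credit_Limit'], application=False).fit_transform(demo)

print(with_log)
print(without_log)
```

Because the flag is a constructor argument, a scikit-learn search over a pipeline with a step named `log_transformer` could, in principle, include `{'log_transformer__application': [True, False]}` in its parameter grid.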

<a id="3.2.3"></a>
<font color="dimgrey" size=+1.0><b>3.2.3 Normalization</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Another interesting procedure that helps a given predictive model converge to the optimal value more quickly is the `normalization` of the data. In the context of machine learning, it is possible to use ready-made sklearn classes such as `MinMaxScaler` or `StandardScaler`.

This type of standardization / normalization can optionally be applied directly in the numerical pipeline. The cell below demonstrates how this transformation can be applied to our numerical dataset.

In [None]:
# Importing class
from pycomp.ml.transformers import DynamicScaler

scaler = DynamicScaler(scaler_type='Standard')
X_train_num_scaled = scaler.fit_transform(X_train_num_log)
X_train_num_scaled = pd.DataFrame(X_train_num_scaled, columns=num_features)
X_train_num_scaled.head()

With that, we end the preparation steps in the numerical pipeline of the project. Later on, we will consolidate each of these _steps_ into a single preparation block using the `sklearn` class `Pipeline`. As a next step, let's look at the categorical part of the set.

<a id="3.3"></a>
<font color="dimgrey" size=+2.0><b>3.3 Categorical Pipeline</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Continuing with the data transformation step, we now have the mission of applying specific transformers to the categorical part of the data set. As a reminder of the main features in this group, the block below brings back the categorical set extracted previously:

In [None]:
# Categorical dataset
X_train_cat.head()

In principle, the only transformation required in this categorical block is encoding. This is essential to feed the predictive models correctly, since they can only read numeric inputs, which is what the categorical inputs become after encoding. For this, we will use the `DummiesEncoding` class present in the `pycomp` package, which is responsible for applying the pandas `get_dummies()` method in order to transform the set appropriately.

<a id="3.3.1"></a>
<font color="dimgrey" size=+1.0><b>3.3.1 Dummies Encoding</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

In [None]:
# Importing class
from pycomp.ml.transformers import DummiesEncoding

# Applying encoding 
encoder = DummiesEncoding(dummy_na=False)
X_train_cat_encoded = encoder.fit_transform(X_train_cat)

# Results after encoding
X_train_cat_encoded.head()

After applying the encoding method, it is possible to notice a significant growth in the number of features in our dataset. This is due to the large number of categorical variables present, each contributing a reasonable number of categories. When applying the `DummiesEncoding` class, each category is pivoted into a new column of its own (example: `Gender_F`, `Education_Level_College`, `Marital_Status_Single`, among others).
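
The column naming pattern described above can be reproduced with `pd.get_dummies()` directly. The sketch below uses toy data; the `reindex` step is one possible way to keep new data aligned with the training columns and is an assumption for illustration, not necessarily what `DummiesEncoding` does internally.

```python
import pandas as pd

# Toy categorical frames (values are illustrative only)
train_cat = pd.DataFrame({'Gender': ['F', 'M', 'F'],
                          'Card_Category': ['Blue', 'Gold', 'Silver']})
new_cat = pd.DataFrame({'Gender': ['M'], 'Card_Category': ['Blue']})

# Each category becomes its own indicator column, named <column>_<category>
train_encoded = pd.get_dummies(train_cat, dummy_na=False)
print(list(train_encoded.columns))

# New data may lack some categories; reindexing restores the training layout
new_encoded = pd.get_dummies(new_cat, dummy_na=False)
new_encoded = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(new_encoded)
```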

<a id="3.4"></a>
<font color="dimgrey" size=+2.0><b>3.4 Complete Pipelines</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Now that the data preparation steps, both initial and final, have been defined, we can consolidate all _steps_ into single data transformation blocks. The cell below carries out this consolidation while defining some global project variables.

In [None]:
# Importing libraries
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

# Defining global variables
ORIGINAL_TARGET = 'Attrition_Flag'
TARGET = 'target'
TARGET_POSITIVE_CLASS = 'Attrited Customer'

INITIAL_FEATURES = ['Customer_Age', 'Gender', 'Dependent_count', 'Education_Level', 'Marital_Status',
                    'Income_Category', 'Card_Category', 'Months_on_book', 'Total_Relationship_Count',
                    'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Attrition_Flag',
                    'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
                    'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']

INITIAL_PRED_FEATURES = [col for col in INITIAL_FEATURES if col not in [ORIGINAL_TARGET, TARGET]]

NUM_FEATURES = ['Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count',
                'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
                'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct',
                'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']

CAT_FEATURES = ['Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']

MODEL_FEATURES = ['Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count',
                  'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
                  'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct',
                  'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio', 'Gender_F', 'Gender_M',
                  'Education_Level_College', 'Education_Level_Doctorate', 'Education_Level_Graduate',
                  'Education_Level_High School', 'Education_Level_Post-Graduate',
                  'Education_Level_Uneducated', 'Education_Level_Unknown', 'Marital_Status_Divorced',
                  'Marital_Status_Married', 'Marital_Status_Single', 'Marital_Status_Unknown',
                  'Income_Category_$120K +', 'Income_Category_$40K - $60K', 'Income_Category_$60K - $80K',
                  'Income_Category_$80K - $120K', 'Income_Category_Less than $40K',
                  'Income_Category_Unknown', 'Card_Category_Blue', 'Card_Category_Gold',
                  'Card_Category_Platinum', 'Card_Category_Silver']

SCALER_TYPE = 'Standard'
ENCODER_DUMMY_NA = False
LOG_APPLICATION = True
COLS_TO_LOG = ['Total_Trans_Amt', 'Credit_Limit', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1',
               'Avg_Utilization_Ratio']

# Building initial pipeline (train and prediction)
initial_train_pipeline = Pipeline([
    ('col_filter', FiltraColunas(features=INITIAL_FEATURES)),
    ('target_transformer', DefineTarget(target_col=ORIGINAL_TARGET, pos_class=TARGET_POSITIVE_CLASS))
])

initial_pred_pipeline = Pipeline([
    ('col_filter', FiltraColunas(features=INITIAL_PRED_FEATURES))
])

# Building numerical pipeline
num_pipeline = Pipeline([
    ('log_transformer', DynamicLogTransformation(application=LOG_APPLICATION, num_features=NUM_FEATURES, 
                                                 cols_to_log=COLS_TO_LOG)),
    ('scaler', DynamicScaler(scaler_type=SCALER_TYPE))
])

# Building categorical pipeline
cat_pipeline = Pipeline([
    ('encoder', DummiesEncoding(dummy_na=ENCODER_DUMMY_NA))
])

# Building a complete pipeline
prep_pipeline = ColumnTransformer([
    ('num', num_pipeline, NUM_FEATURES),
    ('cat', cat_pipeline, CAT_FEATURES)
])

In [None]:
# Reading raw data
df = pd.read_csv(os.path.join(DATA_PATH, FILENAME))

# Executing initial training pipeline
df_prep = initial_train_pipeline.fit_transform(df)

# Splitting training and testing data
X_train, X_test, y_train, y_test = train_test_split(df_prep.drop(TARGET, axis=1), df_prep[TARGET].values,
                                                    test_size=.20, random_state=42)

# Executing preparation pipeline: fit on training data only, then reuse the fitted steps on testing data
X_train_prep = prep_pipeline.fit_transform(X_train)
X_test_prep = prep_pipeline.transform(X_test)

# Results
print(f'After reading raw data and applying initial and preparation pipelines, we have:\n')
print(f'Shape of X_train_prep: {X_train_prep.shape}')
print(f'Shape of X_test_prep: {X_test_prep.shape}')
print(f'\nTotal features considered on MODEL_FEATURES list: {len(MODEL_FEATURES)}')
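
A common convention when splitting data is to learn preparation statistics (means, standard deviations, category lists) from the training rows only and then reuse them on the held-out rows. The sketch below illustrates this with plain scikit-learn components on toy data; `StandardScaler` and `OneHotEncoder` are stand-ins chosen for the example, not the `pycomp` classes used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (values are illustrative only)
toy = pd.DataFrame({
    'Credit_Limit': [1500.0, 3000.0, 12000.0, 34000.0, 8000.0, 2500.0, 5000.0, 9100.0],
    'Gender': ['F', 'M', 'F', 'M', 'F', 'M', 'M', 'F'],
})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)

# Stand-ins for the numerical and categorical pipelines
toy_prep = ColumnTransformer([
    ('num', StandardScaler(), ['Credit_Limit']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Gender']),
])

# Statistics and categories are learned from the training rows only...
toy_train_prep = toy_prep.fit_transform(toy_train)
# ...and then reused, unchanged, on the held-out rows
toy_test_prep = toy_prep.transform(toy_test)

print(toy_train_prep.shape, toy_test_prep.shape)
```

Keeping the fitted state fixed means the test set is scaled and encoded exactly as the training set was, so both arrays end up with the same number of columns.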

<a id="4"></a>
<font color="darkslateblue" size=+2.5><b>4. Modelling: Predicting Churn</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Finally, after extensive steps of exploratory analysis and data preparation, the time has come to apply Machine Learning concepts to develop a predictive model capable of predicting the loss or migration of customers to other banking institutions. With the final prepared dataset in hand, we will propose some algorithms capable of giving us this answer and, through their training and evaluation, we will choose the best model for the task in question.

All these steps will be built with the tools available in the `pycomp` package through its `pycomp.ml.trainer` module (more specifically, the `ClassificadorBinario` class). With ready-made code and functions, the package offers a wide range of components that make the development of predictive models much easier.

<a id="4.1"></a>
<font color="dimgrey" size=+2.0><b>4.1 Structuring Variables</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

In this step, some important variables will be defined for the use of the `ClassificadorBinario` class of the `pycomp` package. It is at this moment that we define the structures and objects that will serve as input for the training and evaluation of the models.

In [None]:
# Importing models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Instantiating objects
dtree = DecisionTreeClassifier()
forest = RandomForestClassifier()
lgbm = LGBMClassifier()
xgb = XGBClassifier()

# Creating the set_classifiers dictionary
model_obj = [dtree, forest, lgbm, xgb]
model_names = [type(model).__name__ for model in model_obj]
set_classifiers = {name: {'model': obj, 'params': {}} for (name, obj) in zip(model_names, model_obj)}

print(f'Classifiers that will be trained on next steps: \n\n{model_names}')

<a id="4.2"></a>
<font color="dimgrey" size=+2.0><b>4.2 Training Models</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Once the modeling structure has been prepared from specific objects, such as the `set_classifiers` dictionary, it is now possible to import the `ClassificadorBinario` class present in the `pycomp.ml.trainer` module to carry out all the training and evaluation of the candidate models.

This class was developed in order to greatly facilitate the work of the analyst / scientist in terms of implementing codes to train, evaluate and optimize predictive models for binary classification. Its methods include powerful features that perform various actions with just one call.

In [None]:
# Importing class
from pycomp.ml.trainer import ClassificadorBinario

# Creating an object and training models
trainer = ClassificadorBinario()
trainer.fit(set_classifiers, X_train_prep, y_train, random_search=False)

The `fit()` method of the created `trainer` object is responsible for training the models encapsulated in the `set_classifiers` dictionary created in the initial definitions stage.

By configuring the method, through the `random_search` argument (set to `False` above), to also apply `RandomizedSearchCV` (a random search for the best hyperparameters of each algorithm), it is possible to build models optimized according to the search space passed in the `set_classifiers` dictionary.
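
For readers unfamiliar with random hyperparameter search, the sketch below shows what a `RandomizedSearchCV` run looks like when called directly from scikit-learn. The data is synthetic and the search space is hypothetical; how `pycomp` wires the `params` entry of `set_classifiers` into the search is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, standing in for the prepared training set
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Hypothetical search space: combinations are sampled at random during the search
search_space = {
    'max_depth': [3, 5, 10, None],
    'min_samples_leaf': [1, 5, 10],
    'criterion': ['gini', 'entropy'],
}

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_distributions=search_space,
    n_iter=5,            # number of sampled parameter combinations
    scoring='accuracy',
    cv=3,                # 3-fold cross validation for each combination
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```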

<a id="4.3"></a>
<font color="dimgrey" size=+2.0><b>4.3 Evaluating Models</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Once the candidate models are trained through the `fit()` method, it is then possible to evaluate the performance obtained in each case, thus returning the main classification metrics capable of indicating the best direction for the given task.

To perform this process, we can use the `evaluate_performance()` or `plot_metrics()` methods of the `trainer` object. In the first case, the return is an analytical DataFrame containing the result of the evaluation of each model against the main metrics. In the second case, the return is a visual analysis of the metrics for each of the models.

In [None]:
# Training results
metrics = trainer.evaluate_performance(X_train_prep, y_train, X_test_prep, y_test)
metrics

As mentioned earlier, the `evaluate_performance()` method returns an analytical table containing the performance of each model (on training and testing data) for the main classification metrics. From this table, it is possible to point out that, in terms of accuracy, the `LightGBM` model performed slightly better than the others, despite the long time required for its calculations.
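
A comparable table can be approximated with scikit-learn alone, which may help to clarify what each column represents. The sketch below uses synthetic data with roughly 16% positives and two of the candidate algorithms; the column names and layout are assumptions, not the exact output of `evaluate_performance()`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data standing in for the prepared churn set
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.84, 0.16], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=42, stratify=y_demo)

models = {
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=42),
    'RandomForestClassifier': RandomForestClassifier(n_estimators=50, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Score each fitted model on both the training and the testing partition
    for approach, X_part, y_part in [('train', X_tr, y_tr), ('test', X_te, y_te)]:
        y_pred = model.predict(X_part)
        y_score = model.predict_proba(X_part)[:, 1]
        rows.append({
            'model': name,
            'approach': approach,
            'accuracy': accuracy_score(y_part, y_pred),
            'precision': precision_score(y_part, y_pred, zero_division=0),
            'recall': recall_score(y_part, y_pred, zero_division=0),
            'f1': f1_score(y_part, y_pred, zero_division=0),
            'auc': roc_auc_score(y_part, y_score),
        })

demo_metrics = pd.DataFrame(rows)
print(demo_metrics.round(3))
```

Comparing the `train` and `test` rows of the same model gives a first indication of overfitting.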

To set an actual optimization goal for choosing the best predictive model, let's consider accuracy as the metric for this decision. Another way to analyze the performance of candidate models is through the `plot_metrics()` method. Its result can be seen below:

In [None]:
# Visual analysis on metrics
trainer.plot_metrics()

_To be continued..._

<font size="+1" color="black"><b>Please visit my other kernels by clicking on the buttons</b></font><br>

<a href="https://www.kaggle.com/thiagopanini/pycomp-exploring-and-modeling-housing-prices" class="btn btn-primary" style="color:white;">Pycomp: Housing Prices</a>
<a href="https://www.kaggle.com/thiagopanini/pycomp-predicting-survival-on-titanic-disaster" class="btn btn-primary" style="color:white;">Pycomp: Titanic EDA</a>
<a href="https://www.kaggle.com/thiagopanini/predicting-restaurant-s-rate-in-bengaluru" class="btn btn-primary" style="color:white;">Bengaluru's Restaurants</a>
<a href="https://www.kaggle.com/thiagopanini/sentimental-analysis-on-e-commerce-reviews" class="btn btn-primary" style="color:white;">Sentimental Analysis E-Commerce</a>