# Santander Customer Satisfaction

The problem is relatively simple: we need to achieve the highest possible area under the ROC curve (AUC).

## In order to solve this problem, we are going to do the following steps:
- 1 Loading Data and Packages;
- 2 Basic Exploratory Analysis;
- 3 Dataset Split (train - test);
- 4 Features Selection;  
    - 4.1 Removing low variance features;  
    - 4.2 Removing repeated features;
    - 4.3 Using SelectKBest to compare "f_classif" & "mutual_info_classif" approaches;
- 5 Bayesian Optimization to the XGBClassifier model;
- 6 Model scoring;
- 7 Results Analysis;
- 8 Next steps;
- 9 References;

Let us begin!

### 1 Loading Data and Packages

In [None]:
# Loading packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

%matplotlib inline

In [None]:
# Loading the Train and Test datasets
df_train = pd.read_csv("../input/santander-customer-satisfaction/train.csv")
df_test = pd.read_csv("../input/santander-customer-satisfaction/test.csv")

### 2 Basic Exploratory Analysis
For this step, let us address the following points:
- Are the data in the columns numeric or do they need to be encoded?
- Can the test dataset really be used or is it useful only for a Kaggle competition?
- Are there any missing data?
- What is the proportion of dissatisfied customers (1) in the dataset df_train?
- Does it make sense to apply a features selection method on the data?

In [None]:
# Checking the general info of df_train
df_train.info()

In [None]:
# Checking the general info of df_test
df_test.info()

Looking at the outputs of the cells above, we can say that:
- All columns are already numeric, so no encoding is needed to convert other data types into numbers.  
- Since the dataset is anonymized, we have no clue whether any of the variables are categorical, so no categorical encoding will be applied either.
- Lastly, df_train has 371 columns and df_test has 370 columns.

In [None]:
# Checking the first 5 rows of df_train
df_train.head()

In [None]:
# Checking the first 5 rows of df_test
df_test.head()

By comparing the two cells above, it is clear that df_test doesn't have the TARGET variable.

As expected, since these datasets come originally from a Kaggle competition, the test dataset should not have the TARGET column.

To have a test set with known labels, I will split df_train into train and test sets and keep df_test as a final check.
After evaluating the model on the held-out split, we can double-check its performance by making predictions on df_test and uploading them to the Kaggle competition as a late submission.

In [None]:
# Checking whether there are any missing values in the train and test datasets
df_train.isnull().sum().sum(), df_test.isnull().sum().sum()

So, we can conclude that both datasets are free from any missing values.

In [None]:
# Investigating the proportion of dissatisfied customers in df_train
# (TARGET is 0/1, so its mean is the share of positives among all rows)
pct_dissatisfied = df_train.TARGET.mean() * 100
pct_dissatisfied

We have an **extremely unbalanced dataset**: roughly 3.96% of the rows are positive, i.e. about 4.12 dissatisfied customers for every 100 satisfied ones. This must be taken into account in two situations:
- When splitting the data into train and test sets.
- When choosing hyperparameters such as "class_weight" in a Random Forest.
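
As a small illustration of how the imbalance can be quantified, the sketch below computes both the share of positives and the negative-to-positive ratio on a toy target. The toy values are made up for the example; the ratio is the quantity that is often used as a starting value for XGBoost's `scale_pos_weight` parameter, which is an option this notebook does not tune.

```python
import pandas as pd

# Toy target with a similar kind of imbalance (values are illustrative only)
y_toy = pd.Series([0] * 96 + [1] * 4, name="TARGET")

counts = y_toy.value_counts()
positive_share = counts[1] / counts.sum()   # fraction of all rows that are positive
neg_to_pos = counts[0] / counts[1]          # negatives per positive

print(f"positive share: {positive_share:.2%}")
print(f"negatives per positive: {neg_to_pos:.1f}")
```

Note that the share of positives and the positive-to-negative ratio are different numbers, so it is worth being explicit about which one is reported.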

Since both the train and test datasets are relatively large (around 76k rows and 370 columns), and we don't know what each feature represents or how it affects the model, feature selection is warranted for three reasons:
1. To identify which features contribute the most predictive power to the model;
2. To avoid using features that could degrade model performance;
3. To minimize the computational cost by using the smallest set of features that gives the best model performance.

### 3 Dataset Split
Here we are going to split df_train into train and test sets. 

Because train_test_split samples at random, both sets should end up with roughly the same proportion of dissatisfied customers, even in an extremely unbalanced dataset.  
**However, a purely random split only matches the proportions approximately, so we make a stratified split on the TARGET variable to ensure that the proportion is practically identical in both sets.**

In [None]:
from sklearn.model_selection import train_test_split

# Splitting the dataset in a proportion of 80% for train and 20% for test.
X_train, X_test, y_train, y_test = train_test_split(df_train.drop('TARGET', axis = 1), df_train.TARGET, 
                                                    train_size = 0.8, stratify = df_train.TARGET,
                                                    random_state = 42)

# Checking the result of the split
X_train.shape, y_train.shape[0], X_test.shape, y_test.shape[0]
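
A quick way to confirm that stratification did its job is to compare the positive rate of the two resulting targets. The sketch below does this on synthetic data (the `_demo` names and the 4% positive rate are assumptions made for the example, not the notebook's variables).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data used only for this illustration
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})
y_demo = pd.Series([1] * 40 + [0] * 960, name="TARGET")

# Same kind of call as in the cell above: 80/20 split, stratified on the target
_, _, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=42
)

# With a 0/1 target, the mean is the share of positives
print("train positive rate:", y_tr_demo.mean())
print("test positive rate: ", y_te_demo.mean())
```

The same two-line check can be run on `y_train` and `y_test` from the real split.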

### 4 Feature Selection

Here we want to investigate the following questions:
- Are there constant and/or quasi-constant features that can be removed?
- Are there duplicate features?
- Does it make sense to perform some more filtering to reach a smaller group of features?

#### 4.1 Removing low variance features

In [None]:
# Making copies of X_train and X_test to work with in this section
X_train_clean = X_train.copy()
X_test_clean = X_test.copy()

In [None]:
# Investigating whether there are constant or quasi-constant features in X_train
from sklearn.feature_selection import VarianceThreshold

# Removing all features that have variance under 0.01
selector = VarianceThreshold(threshold = 0.01)
selector.fit(X_train_clean)
mask_clean = selector.get_support()
X_train_clean = X_train_clean[X_train_clean.columns[mask_clean]]

In [None]:
# Checking how many features were actually removed
(len(df_train.columns) - 1) - X_train_clean.shape[1]

In [None]:
# Total of remaining features
X_train_clean.shape[1]

With this filter, 104 features were removed. The dataset is now leaner, and little predictive power should be lost, since features that are constant or nearly constant carry almost no information that could help the model classify an instance.  
**We now have 266 features.**
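
One caveat worth keeping in mind is that a fixed variance threshold depends on the scale of each feature. The toy example below (column names and values are invented for the illustration) shows a feature that alternates just like an informative binary flag but is still dropped, simply because it is measured in small units.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "constant":    [0] * 10,                  # variance 0
    "informative": [0, 1] * 5,                # variance 0.25
    "small_scale": [0.00, 0.05] * 5,          # same pattern, variance 0.000625
})

# Same threshold as in the cell above
vt = VarianceThreshold(threshold=0.01).fit(toy)
kept = toy.columns[vt.get_support()].tolist()
print(kept)  # ['informative']
```

The list of kept column names can also be used to select the same columns from a test set, so that train and test stay aligned.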

#### 4.2 Removing repeated features

In [None]:
# Checking if there is any duplicated column
remove = []
cols = X_train_clean.columns
for i in range(len(cols)-1):
    if cols[i] in remove:
        continue  # already flagged as a copy of an earlier column
    column = X_train_clean[cols[i]].values
    for j in range(i+1,len(cols)):
        # Flag each duplicated column only once
        if cols[j] not in remove and np.array_equal(column, X_train_clean[cols[j]].values):
            remove.append(cols[j])


# If there are any, they are dropped here
X_train_clean.drop(remove, axis = 1, inplace=True)

In [None]:
# Checking if any column was dropped
X_train_clean.shape

There were 266 columns before checking for duplicate features and now there are 251. So 15 features were duplicates.
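
For reference, the same check can be written more compactly with pandas by transposing the frame and asking which rows (former columns) are duplicated. This is an alternative sketch on a toy frame, not the code used above; transposing a large frame costs extra memory.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],   # duplicate of "a"
    "c": [3, 2, 1],
    "d": [1, 2, 3],   # duplicate of "a"
})

# True for every column that is identical to an earlier column
duplicated_mask = toy.T.duplicated().to_numpy()

to_remove = toy.columns[duplicated_mask].tolist()
deduplicated = toy.loc[:, ~duplicated_mask]

print(to_remove)
print(deduplicated.columns.tolist())
```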

#### 4.3 Using SelectKBest to compare "f_classif" & "mutual_info_classif" approaches

Two scoring functions are commonly used with SelectKBest: f_classif (fc) and mutual_info_classif (mic). The first works best when the relationship between the features and the target is roughly linear. The second is more appropriate when there are non-linear relationships.  
**Note:** since Kaggle limits a session to 9 hours, there is not enough time to run the mutual_info_classif analysis here, even when a whole session is dedicated to it. Therefore, this notebook presents only the analysis for the f_classif method, but the code for mutual_info_classif is left in place for anyone who wants to use it.  
The analysis for both methods is available on my GitHub: https://github.com/PedroHCouto/Santander-Case/blob/master/Part%20A%20-%20Classification.ipynb

As the dataset is anonymized and the quantity of features is too large to make a quality study on the feature-target relationship, both methods will be tested and the one that produces a stable region with the highest AUC value will be chosen.

For this, different K values will be tested with the SelectKBest class, which will be used to train an XGBClassifier model and evaluated using the AUC metric. Having a collection of values, a graph for fc and another for mic will be created.

Thus, through a visual analysis, it is possible to choose the best K value as well as the best method for scoring features.
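
To make the difference between the two scoring functions concrete, here is a small synthetic sketch (all names and data are invented for the illustration). The target depends on the first feature only through its absolute value, so the class means of that feature are nearly identical and the F-test has little to work with, while mutual information can still pick up the dependence.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
x_nonlinear = rng.uniform(-1, 1, size=n)   # drives the target through |x|
x_noise = rng.normal(size=n)               # unrelated to the target

X_demo = np.column_stack([x_nonlinear, x_noise])
y_demo = (np.abs(x_nonlinear) > 0.5).astype(int)

f_scores, _ = f_classif(X_demo, y_demo)
mi_scores = mutual_info_classif(X_demo, y_demo, random_state=0)

print("f_classif scores:          ", f_scores)
print("mutual_info_classif scores:", mi_scores)
```

On real data the picture is rarely this clean, which is why comparing both methods empirically, as described above, is a reasonable approach.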

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.metrics import roc_auc_score as auc
from sklearn.model_selection import cross_val_score
import xgboost as xgb

Let's first analyse the method f_classif (fc).

In [None]:
# Create an automated routine to test different K values for the f_classif method

K_vs_score_fc = [] # List to store the AUC of each K with f_classif

total_start = time.time() # Timer for the whole loop

for k in range(2, 247, 2):
    start = time.time() # Timer for this iteration
    
    # Instantiating a SelectKBest object to obtain the K features with the highest f_classif score
    selector_fc = SelectKBest(score_func = f_classif, k = k)
    
    # Selecting K features and transforming the dataset
    X_train_selected_fc = selector_fc.fit_transform(X_train_clean, y_train)
    
    # Instantiating an XGBClassifier object
    clf = xgb.XGBClassifier(seed=42)
    
    # Using 10-fold CV to calculate the AUC for each K value, avoiding overfitting
    auc_fc = cross_val_score(clf, X_train_selected_fc, y_train, cv = 10, scoring = 'roc_auc')
    
    # Adding the average value obtained in the CV for further analysis
    K_vs_score_fc.append(auc_fc.mean())
    
    end = time.time()
    # Reporting the AUC for the tested K and the time spent on this iteration of the loop
    print("k = {} - auc_fc = {} - Time = {}s".format(k, auc_fc.mean(), end-start))
    
# Total time spent on the whole loop
print(time.time() - total_start)
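
One detail of the loop above is that SelectKBest is fitted on the whole training set before cross-validation, so every validation fold has already influenced which features were chosen. This can make the CV scores slightly optimistic. A possible variation, sketched below, wraps the selector and the classifier in a scikit-learn Pipeline so the selection is refitted inside each fold. The synthetic data and the LogisticRegression stand-in are assumptions made to keep the example light; an XGBClassifier could be placed in the same slot.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic imbalanced data, for illustration only
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=42
)

# The selector is part of the model, so it is refitted on each training fold
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("clf", LogisticRegression(max_iter=1000)),   # stand-in classifier
])

scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc")
print(scores.mean())
```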


Now let's analyse the mutual_info_classif (mic) method.

In [None]:
# Included only for the purpose of sharing this piece of code (not run in this notebook)
# Create an automated routine to test different K values for mutual_info_classif

K_vs_score_mic = [] #List to store AUC of each K with mutual_info_classif


for k in range(2, 247, 2):
    start = time.time()
    
    # Instantiating a KBest object for each of the metrics in order to obtain the K features with the highest value
    selector_mic = SelectKBest(score_func = mutual_info_classif, k = k)
    
    # Selecting K-features and modifying the dataset
    X_train_selected_mic = selector_mic.fit_transform(X_train_clean, y_train) 
    
    # Instantiating an XGBClassifier object
    clf = xgb.XGBClassifier(seed=42)
    
    # Using 10-fold CV to calculate the AUC for each K value, avoiding overfitting
    auc_mic = cross_val_score(clf, X_train_selected_mic, y_train, cv = 10, scoring = 'roc_auc')
    
    # Adding the average values obtained in the CV for further analysis.
    K_vs_score_mic.append(auc_mic.mean())
    
    end = time.time()
    # Returning the metrics related to the tested K and the time spent on this iteration of the loop
    print("k = {} - auc_mic = {} - Time = {}s".format(k, auc_mic.mean(), end-start))
    


In [None]:
# Checking that the list has 123 elements (one per tested K)
len(K_vs_score_fc)

In [None]:
# Plotting K_vs_score_fc (# of K-Best features vs AUC)

# Figure setup
fig, ax = plt.subplots(figsize = (20, 8))
plt.title('Score values for each K with f_classif method', fontsize=18)
plt.ylabel('Score', fontsize = 16)
plt.xlabel('Value of K', fontsize = 16)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)

# Create the lines
plt.plot(np.arange(2, 247, 2), K_vs_score_fc, color='blue', linewidth=2)

plt.show()

From the graph above, the best values lie between 0.80 and 0.82 AUC. However, the y-axis spans 0.70 to 0.82 because of the low scores at small K values, which makes that region hard to read.  
  
Thus, we zoom in on the range between 0.80 and 0.825, in order to better evaluate which K value to keep for the next steps.

In [None]:
# Plotting K_vs_score_fc (# of K-Best features vs AUC) 
import matplotlib.patches as patches

# Figure setup
fig, ax = plt.subplots(1, figsize = (20, 8))
plt.title('Score values for each K with f_classif method', fontsize=18)
plt.ylabel('Score', fontsize = 16)
plt.xlabel('Value of K', fontsize = 16)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)

# Create the lines
plt.plot(np.arange(2, 247, 2), K_vs_score_fc, color='blue', linewidth=2)
ax.set_ylim(0.80, 0.825);

# Create a Rectangle patch
rect = patches.Rectangle((82, 0.817), 20, (0.823 - 0.817), linewidth=2, edgecolor='r', facecolor='none')

# Add the patch to the Axes
ax.add_patch(rect)

plt.show()

When deciding which set of features to use, we look for two main points:

- The smallest number of features that yields the highest AUC value;
- A K value that lies in a stable region of the curve, because if K is just an isolated peak, this can bring some instability to the model.

Applying these conditions to the graph above, we observe that the region around K = 96 (red rectangle) satisfies both conditions.

Therefore, ___we selected for K the value of 96___, which is an intermediate point in this region.
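
The visual choice above could also be cross-checked programmatically. The sketch below uses a hypothetical helper, `pick_stable_k`, which smooths the curve with a centered rolling mean (so isolated peaks are damped) and returns the smallest K whose smoothed score is within a tolerance of the best smoothed score. The window size, the tolerance and the synthetic curve are all assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

def pick_stable_k(ks, scores, window=5, tol=0.001):
    """Return the smallest K in a stable, near-optimal region of the curve."""
    smoothed = (
        pd.Series(scores, index=ks)
        .rolling(window, center=True, min_periods=1)
        .mean()
    )
    near_best = smoothed[smoothed >= smoothed.max() - tol]
    return int(near_best.index[0]), smoothed

# Synthetic saturating curve with one isolated peak at K = 10
ks = np.arange(2, 62, 2)
scores = 0.82 - 0.12 * np.exp(-ks / 6.0)
scores[4] += 0.01

k_star, smoothed = pick_stable_k(ks, scores)
print(k_star)
```

With the real results, the call would be `pick_stable_k(np.arange(2, 247, 2), K_vs_score_fc)`; the output should be read as a sanity check on the visual choice, not as a replacement for it.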


In [None]:
# Selecting the 96 best features according to f_classif
selector_fc = SelectKBest(score_func = f_classif, k = 96)
selector_fc.fit(X_train_clean, y_train)
mask_selected = selector_fc.get_support()

# Saving the selected columns in a list
selected_col = X_train_clean.columns[mask_selected]
selected_col

We can then visualize the score that f_classif assigns to each feature.

In [None]:
# Plotting the f_classif score of the 30 best-ranked features
feature_score = pd.Series(selector_fc.scores_, index=X_train_clean.columns).sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(20, 12))
ax.barh(feature_score.index[0:30], feature_score[0:30])
plt.gca().invert_yaxis()


ax.set_xlabel('K-Score', fontsize=18);
ax.set_ylabel('Features', fontsize=18);
ax.set_title('30 best features by its K-Score', fontsize = 20)
plt.yticks(fontsize = 14)
plt.xticks(fontsize = 14)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False);

We can see that a small group of about 10 features has a large impact on the classification, while the rest have a much smaller one.  

The ten features with the greatest impact are:

In [None]:
for feature in feature_score.index[0:10]:
    print(feature)

Now that we have the list of features to use for this task, we can create datasets that contain only the desired features.

In [None]:
# Creating datasets that include only the 96 selected features
X_train_selected = X_train[selected_col]
X_test_selected = X_test[selected_col]

In [None]:
# Checking the first 5 rows of X_train_selected and its shape
X_train_selected.head()

In [None]:
# Checking the first 5 rows of X_test_selected
X_test_selected.head()

Now that we have a good understanding of the features and a good selection of which have the greatest impact on the model, we can move on to the next steps.

### 5 Bayesian Optimization to the XGBClassifier model

For this classification task, we are going to use the XGBoost classifier. This algorithm is known for its strong performance, robustness and a learning process that is relatively simple to understand.

To get the best performance the algorithm can offer, an interesting approach is to optimize its hyperparameters.  

For this task, we will use the Bayesian Optimization approach. Some articles report that it is more efficient than grid search and performs similarly to, or even better than, random search. Another advantage is that Bayesian Optimization allows us to optimize multiple hyperparameters at the same time.

The scikit-optimize package provides a convenient framework for Bayesian optimization of hyperparameters.

In [None]:
# Using a tree-based surrogate model (forest_minimize) to optimize
from skopt import forest_minimize

In [None]:
# Function for hyperparameter tuning
# Implementation learned from a lesson by Mario Filho (Kaggle Grandmaster) on parameter optimization.
# Link to the video: https://www.youtube.com/watch?v=WhnkeasZNHI
def tune_xgbc(params):
    """Function to be passed as scikit-optimize minimizer/maximizer input
    
    Parameters:
    Tuples with information about the range that the optimizer should use for that parameter, 
    as well as the behaviour that it should follow in that range.
    
    Returns:
    float: the metric that should be minimized. If the objective is maximization, then the negative 
    of the desired metric must be returned. In this case, the negative AUC average generated by CV is returned.
    """
    
    
    #Hyperparameters to be optimized
    print(params)
    learning_rate = params[0] 
    n_estimators = params[1] 
    max_depth = params[2]
    min_child_weight = params[3]
    gamma = params[4]
    subsample = params[5]
    colsample_bytree = params[6]
        
    
    #Model to be optimized
    mdl = xgb.XGBClassifier(learning_rate = learning_rate, n_estimators = n_estimators, max_depth = max_depth, 
                            min_child_weight = min_child_weight, gamma = gamma, subsample = subsample, 
                            colsample_bytree = colsample_bytree, seed = 42)
    

    #Cross-Validation in order to avoid overfitting
    # (named auc_scores so that it does not shadow the roc_auc_score alias "auc" imported earlier)
    auc_scores = cross_val_score(mdl, X_train_selected, y_train, cv = 10, scoring = 'roc_auc')
    
    print(auc_scores.mean())
    # as forest_minimize minimizes its objective, we need to return the negative of the desired metric (AUC)
    return -auc_scores.mean()

In [None]:
# Creating the search space from which the initial random samples are drawn
space = [(1e-3, 1e-1, 'log-uniform'), # learning rate
          (100, 2000), # n_estimators
          (1, 10), # max_depth 
          (1, 6.), # min_child_weight 
          (0, 0.5), # gamma 
          (0.5, 1.), # subsample 
          (0.5, 1.)] # colsample_bytree 

# Minimization with 25 calls in total: 20 random starting points followed by 5 model-guided evaluations.
result = forest_minimize(tune_xgbc, space, random_state = 42, n_random_starts = 20, n_calls  = 25, verbose = 1)
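
The 'log-uniform' prior used for the learning rate means that values are spread evenly across orders of magnitude rather than evenly across the raw interval. The NumPy sketch below illustrates the idea only; it is not scikit-optimize's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
low, high = 1e-3, 1e-1   # same bounds as the learning rate in the space above

# Log-uniform: draw the exponent uniformly, then exponentiate
log_uniform = 10 ** rng.uniform(np.log10(low), np.log10(high), size=10_000)

# Plain uniform sampling over the same interval, for comparison
uniform = rng.uniform(low, high, size=10_000)

share_log = (log_uniform < 1e-2).mean()
share_uni = (uniform < 1e-2).mean()

print(f"share of samples below 0.01 (log-uniform): {share_log:.2f}")
print(f"share of samples below 0.01 (uniform):     {share_uni:.2f}")
```

With the log-uniform prior, about half of the samples fall in the lower decade (0.001 to 0.01), whereas plain uniform sampling would spend roughly 9% of its budget there.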

In [None]:
# Hyperparameters optimized values
hyperparameters = ['learning rate', 'n_estimators', 'max_depth', 'min_child_weight', 'gamma', 'subsample',
                   'colsample_bytree']

for i in range(0, len(result.x)): 
    print('{}: {}'.format(hyperparameters[i], result.x[i]))

Visualizing the convergence of the optimization for the parameters that lead to the highest AUC.

In [None]:
from skopt.plots import plot_convergence

In [None]:
# Setting up the figure
fig, ax = plt.subplots(figsize = (20,8))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.grid(None) # passed positionally: newer Matplotlib versions renamed the keyword "b" to "visible"

# Plotting on the axes created above
plot_convergence(result, ax = ax)

# Setting up axes and title
ax.set_title('Convergence Plot', fontsize = 18)
ax.set_xlabel('Number of calls (n)', fontsize = 16)
ax.set_ylabel('min f(x) after n calls', fontsize = 16);

In practice, the negative of the AUC was minimized, that is, the lower the value, the better the chosen parameters performed.  

Thus, it is possible to notice some very relevant leaps along the iterations of the Bayesian optimization. Near the sixth iteration, the negative of the AUC already signals that the hyperparameters found had reached a stable region and values optimized for the AUC.  

Therefore, we proceed to the next steps with the optimal values found for the parameters.

### 6 Model scoring  

Now that the most important features and the best hyperparameters for the model are known, this model can be applied to the test data for the final evaluation of the model. Again, the AUC metric will be used here.

In [None]:
# Generating the model with the optimized hyperparameters
clf_optimized = xgb.XGBClassifier(learning_rate = result.x[0], n_estimators = result.x[1], max_depth = result.x[2], 
                            min_child_weight = result.x[3], gamma = result.x[4], subsample = result.x[5], 
                            colsample_bytree = result.x[6], seed = 42)

In [None]:
# Fitting the model to the X_train_selected dataset
clf_optimized.fit(X_train_selected, y_train)

In [None]:
# Evaluating the performance of the model in the test data (which have not been used so far).
y_predicted = clf_optimized.predict_proba(X_test_selected)[:,1]
auc(y_test, y_predicted)

So, our model AUC Score was **0.8477!** Pretty good so far!

Since the dataset comes from a Kaggle competition, we can also score the competition's test set (df_test) on the platform and see which AUC the model obtains on 75818 instances it has never seen.

In [None]:
# Making predictions on the Kaggle test dataset (df_test) with the selected features and optimized parameters
y_predicted_df_test = clf_optimized.predict_proba(df_test[selected_col])[:, 1]

In [None]:
# Saving the result to a csv file to be uploaded to Kaggle as a late submission 
# https://www.kaggle.com/c/santander-customer-satisfaction/submit
sub = pd.Series(y_predicted_df_test, index = df_test['ID'], name = 'TARGET')
sub.to_csv('submission.csv', header = True) # explicit header so the file starts with the ID and TARGET column names

Let us now upload the file and see what result we achieve!

It seems that our model also does a very good job on data that it has never seen before. That is great!

### 7 Results Analysis  
Throughout the model development process, features were selected and hyperparameters were optimized to produce a robust model that generalizes well to new data and maximizes profit by classifying customers correctly, that is, by minimizing the number of false positives (FP) and maximizing the number of true positives (TP).  

So, using the AUC metric, we arrived at a model that:

- On test data, split in step 3, **scored 0.8477 for AUC;**
- On Kaggle data, in 75818 new instances, **scored 0.8305 for AUC.**

It can therefore be concluded that the objective of creating a model that maximizes profits has been achieved satisfactorily.

In addition, we can analyze the ROC curve and better understand the AUC generated by the model.  
This brings a better understanding of how profit maximization can be done and we can see the optimum point for the classification decision threshold.

In [None]:
# Code based on this post: https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python
import sklearn.metrics as metrics

# Calculate FPR and TPR for all thresholds
fpr, tpr, threshold = metrics.roc_curve(y_test, y_predicted)
roc_auc = metrics.auc(fpr, tpr)

# Plotting the ROC curve
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize = (20, 8))
plt.title('Receiver Operating Characteristic', fontsize=18)
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.4f' % roc_auc)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.legend(loc = 'upper left', fontsize = 16)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate', fontsize = 16)
plt.xlabel('False Positive Rate', fontsize = 16)
plt.show()

Finally, by analyzing the ROC curve, we can choose the cut-off that maximizes profit. In this case, the point to choose is the one where the ROC curve comes closest (shortest distance) to the top-left corner of the plot, where FPR = 0 and TPR = 1. Thus, the chosen cut-off is the one that yields an FPR of approximately 0.12 and a TPR of approximately 0.61.
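
The "closest to the top-left corner" rule can also be applied numerically instead of reading the point off the plot. The sketch below does so on a tiny hand-made example (the labels and scores are invented); with the real results, the `fpr`, `tpr` and `threshold` arrays from the cell above could be used in the same way.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Tiny invented example: 7 negatives, 3 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90])

fpr_demo, tpr_demo, thr_demo = roc_curve(y_true, y_score)

# Euclidean distance of every ROC point to the ideal corner (FPR = 0, TPR = 1)
distance = np.sqrt(fpr_demo ** 2 + (1 - tpr_demo) ** 2)
best = np.argmin(distance)

print("threshold:", thr_demo[best])
print("FPR:", fpr_demo[best], "TPR:", tpr_demo[best])
```

Note that this geometric criterion treats false positives and false negatives as equally costly; if their costs differ, a profit-based criterion may select a different threshold.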

### 8 Next steps

For further iterations on this project, in order to improve the analysis and the results, I would suggest three main points:

- Work on feature engineering, creating new features if possible;
- Try out different ML algorithms and compare them to the XGBClassifier;
- As Caio Martins (https://github.com/CaioMar/) did and suggested to me, a nice improvement would be to create a function that calculates the total profit. This is possible once we have values for TP and FP.
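
As a starting point for the last suggestion, a profit function could look like the sketch below. The function name `total_profit` and the monetary values (`gain_per_tp`, `cost_per_fp`) are hypothetical placeholders chosen for the illustration, not figures from this project.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def total_profit(y_true, y_score, threshold, gain_per_tp=90.0, cost_per_fp=10.0):
    """Profit for a given decision threshold (gain and cost values are hypothetical)."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp * gain_per_tp - fp * cost_per_fp

# Tiny invented example: 7 negatives, 3 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90])

for thr in (0.40, 0.65):
    print(thr, total_profit(y_true, y_score, thr))
```

Evaluating such a function over a grid of thresholds would give the profit-maximizing cut-off directly, once realistic gain and cost values are known.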

### 9 References
[1] P. Banerjee, Comprehensive Guide on Feature Selection, https://www.kaggle.com/prashant111/comprehensive-guide-on-feature-selection  
[2] D. Beniaguev, Advanced Feature Exploration, https://www.kaggle.com/selfishgene/advanced-feature-exploration  
[3] M. Filho, A forma mais simples de selecionar as melhores variáveis usando Scikit-learn, https://www.youtube.com/watch?v=Bcn5e7LYMhg&t=2027s  
[4] M. Filho, Como Remover Variáveis Irrelevantes de um Modelo de Machine Learning, https://www.youtube.com/watch?v=6-mKATDSQmk&t=1454s  
[5] M. Filho, Como Tunar Hiperparâmetros de Machine Learning Sem Perder Tempo, https://www.youtube.com/watch?v=WhnkeasZNHI  
[6] G. Caponetto, Random Search vs Grid Search for hyperparameter optimization, https://towardsdatascience.com/random-search-vs-grid-search-for-hyperparameter-optimization-345e1422899d  
[7] A. Jain, Complete Guide to Parameter Tuning in XGBoost with codes in Python, https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/  
[8] How to plot ROC curve in Python, https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python  
[9] F. Santana, Algoritmo K-means: Aprenda essa Técnica Essêncial através de Exemplos Passo a Passo com Python, https://minerandodados.com.br/algoritmo-k-means-python-passo-passo/  
[10] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, Alta Books, Rio de Janeiro, 2019, 516 p.  
[11] W. McKinney, Python for data analysis, Novatec Editora Ltda, São Paulo, 2019, 613 p.  