
# Boosting


## Initialization

In [None]:
!pip install skorch

In [None]:
%matplotlib inline

# join one or more path components intelligently
from os.path import join

# Operating system interface
import os

# DataFrame manipulation
import pandas as pd

# Numerical array manipulation
import numpy as np

# Data visualization built on matplotlib
import seaborn as sns

# Plotting
from matplotlib import pyplot as plt

# Reading data from HDF5 files
import h5py

# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split

# Feature normalization
from sklearn.preprocessing import StandardScaler

# AdaBoost classifier
from sklearn.ensemble import AdaBoostClassifier

# Gradient Boosting for classification
from sklearn.ensemble import GradientBoostingClassifier

# Training and cross-validation splits for plotting the learning curve
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit    # Random permutation cross-validator

# Bagging and Random Forest classifiers
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Model evaluation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Exhaustive search over specified parameter values for an estimator (hyperparameter tuning)
from sklearn.model_selection import GridSearchCV

# Feature scaling
from sklearn.preprocessing import MinMaxScaler

# Multi-layer Perceptron classifier
from sklearn.neural_network import MLPClassifier

# Warning messages
import warnings

# Reading data from .mat files
from scipy.io import loadmat
import scipy.io as spio

# Function optimization
from scipy import optimize as opt

# Time-related functions
import time


# Ensemble Learning

**Details**:

- A (fictitious) financial institution has a database with the history of installment plans offered to its customers. 

- Based on the history of credit granted to its clients, the institution wants to investigate classification models that infer whether or not a new client who has submitted a loan application will repay the debt, should the bank decide to grant the loan.

- Objective: Predict whether or not a new customer would repay a contracted debt, based on that customer's characteristics. Once trained, a classification model for this problem can infer whether or not a new customer will honor a loan eventually granted to them.

- The dataset to be used for training has 1500 examples and contains data related to credits (loans) granted to clients of the financial institution. Records of credits granted to financial institution customers are contained in the `credtrain.txt` file. For each customer, 11 attributes (variables, characteristics) are defined. In addition, the last column of each example tells you whether or not the customer has honored the loan payment.

- Table 1 contains the description of the attributes.

<img src="https://github.com/cristianegea/PPCIC/blob/main/Aprendizado%20de%20Máquinas/Trabalho%203/img/img1.png?raw=true" width="50%"/>

**Objectives**:

- Perform experiments with *Boosting* algorithms to create classification models for the aforementioned problem. 
- Present the results (obtained with Scikit-Learn's `classification_report` function) for each classification model on the data contained in the `credtest.txt` file.

Note: for the models' hyperparameters, it is possible to use either the *default* values or values obtained through a hyperparameter search.

## Data processing

**Import and read data**

In [None]:
# Definition of variable names (according to the table contained in the statement)
colnames = ['ESCT', 'NDEP', 'INCOME', 'TIPOR', 'VBEM', 'NPARC',
            'VPARC', 'TEL', 'AGE', 'RESMS', 'ENTRY', 'CLASS']

In [None]:
# Reading training data
file = 'datasets/credtrain.txt'
data_train = pd.read_csv(file, sep='\t', header=None, names = colnames)

# Reading testing data
file = 'datasets/credtest.txt'
data_test = pd.read_csv(file, sep='\t', header=None, names = colnames)

**Data inspection**

In [None]:
# Dimension inspection
print(data_train.shape, data_test.shape)

In [None]:
# Training data structure
data_train.head()

In [None]:
# Testing data structure
data_test.head()

**Category variable treatment**

Note that the variable ESCT (Marital Status) is categorical and can take 4 different values (each value corresponds to a marital status). Thus, unlike NDEP (where each value is a count of dependents), each ESCT value is a category label with no numeric meaning. Treating these codes as numbers can lead to inconsistencies when creating and training models.

To mitigate this problem, one alternative is to transform ESCT into *dummy* (binary) variables, so that each category is represented by its own indicator.

A *dummy* variable is a binary variable used to represent a category: the value 1 marks the occurrence of that category and the value 0 its non-occurrence. For a variable with $n$ categories, it is recommended to create $n-1$ dummies, because the remaining category is implied when all dummies are 0. Since ESCT has 4 possible categories, the code below (with `drop_first=True`) produces 3 binary ESCT variables.
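The $n-1$ encoding can be checked on a toy table. This is a minimal sketch: the values below are made up for illustration and are not taken from `credtrain.txt`.

```python
import pandas as pd

# Toy table with 4 hypothetical marital-status codes (illustrative values only)
toy = pd.DataFrame({"ESCT": [0, 1, 2, 3, 1, 0],
                    "NDEP": [0, 2, 1, 0, 3, 1]})

# drop_first=True drops the indicator of the first category (ESCT = 0),
# so 4 categories yield 3 dummy columns
dummies = pd.get_dummies(toy, prefix="ESCT", columns=["ESCT"], drop_first=True)

print(list(dummies.columns))
print(dummies.head())
```

Rows whose original code was 0 have all three indicators equal to 0, which is how the dropped category remains recoverable.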

In [None]:
# Training set
data_train_new = pd.get_dummies(data = data_train, 
                                prefix='ESCT', 
                                columns=['ESCT'], 
                                drop_first=True)

# pd.get_dummies converts a categorical variable into dummy/indicator variables

data_train_new.head()

In [None]:
# Testing set
data_test_new = pd.get_dummies(data = data_test, prefix='ESCT', columns=['ESCT'], drop_first=True)

# pd.get_dummies converts a categorical variable into dummy/indicator variables

data_test_new.head()
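Encoding the training and testing sets separately only yields the same columns when both contain every ESCT category. The sketch below shows one possible safeguard, which is an assumption and not part of the original workflow: encode the test set without dropping a category, then reindex it to the training columns. The toy frames are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames: the test set lacks ESCT codes 2 and 3
train = pd.DataFrame({"ESCT": [0, 1, 2, 3], "NDEP": [0, 1, 2, 0]})
test = pd.DataFrame({"ESCT": [0, 1, 1], "NDEP": [1, 0, 2]})

train_d = pd.get_dummies(train, prefix="ESCT", columns=["ESCT"], drop_first=True)

# Keep every indicator on the test side, then let the training columns decide
# which ones survive; indicators missing from the test set are filled with 0
test_d = pd.get_dummies(test, prefix="ESCT", columns=["ESCT"], drop_first=False)
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_aligned.columns))
```

Reindexing to the training columns, instead of using `drop_first=True` on the test set, avoids dropping a different reference category when the first category is absent from the test data.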

**Separation of dataset into label ($\mathrm{y}$) and features ($\mathrm{x}$)**

The label ($\mathrm{y}$) corresponds to the vector containing the target variable (CLASS), while features ($\mathrm{x}$) corresponds to the data matrix.

In [None]:
# Transforming the target variable of the training set into vector
y_train = np.array(data_train_new['CLASS'])

y_train[:5]

In [None]:
# Transforming the target variable of the testing set into vector
y_test = np.array(data_test_new['CLASS'])

y_test[:5]

In [None]:
# Transforming the remaining training set into a data matrix
features_name_train = list(data_train_new.columns)               
features_name_train.remove('CLASS')                             
X_train = np.array(data_train_new.loc[:, features_name_train])   

X_train

In [None]:
# Transforming the remaining testing set into a data matrix
features_name_test = list(data_test_new.columns)               
features_name_test.remove('CLASS')                            
X_test = np.array(data_test_new.loc[:, features_name_test])   

X_test

**Important note**: Because the `learning_curve` function itself splits the data into training and validation subsets (through its `cv` argument) to compute the learning curves, there is no need to split the training set into training and validation subsets manually.
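As a rough illustration of what such a splitter does, the sketch below applies `ShuffleSplit` with a 20% validation share to a placeholder array of 1500 rows (the size stated for the training file). The placeholder array and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Placeholder data with the number of rows stated for the training file
X_placeholder = np.zeros((1500, 1))

# Each split shuffles the rows and holds out 20% of them for validation;
# n_splits is left at its default value of 10
splitter = ShuffleSplit(test_size=0.2, random_state=31)
split_sizes = [(len(train_idx), len(val_idx))
               for train_idx, val_idx in splitter.split(X_placeholder)]

print(len(split_sizes), split_sizes[0])
```

Every split uses the 80/20 proportion, and scores are later averaged over the splits.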

**Normalization of features**

As in the previous assignments, the features are normalized before proceeding, in order to avoid problems arising from differences in the order of magnitude of the features.

In [None]:
# Object creation for feature standardization
scaler = StandardScaler()

# Adjustment of StandardScaler to training dataset and standardization of training data
X_train_norm = scaler.fit_transform(X_train)

# Transformation of test data with parameters adjusted from training data
X_test_norm = scaler.transform(X_test)

In [None]:
# Datasets dimension
print(X_train_norm.shape, X_test_norm.shape, y_train.shape, y_test.shape)

## Boosting

**Objectives**:

- Create classification models with the estimators `sklearn.ensemble.AdaBoostClassifier` and `sklearn.ensemble.GradientBoostingClassifier`.
- For each learning algorithm, present graphs that plot accuracy values (measured on the training and validation sets) against the number of training iterations.
- Divide the examples contained in the `credtrain.txt` file into 2 subsets (training and validation), using the 80/20 ratio for training and validation, respectively.

### Creating and training models

**AdaBoost**

AdaBoost uses the complete training set to train weak classifiers. In addition, the training samples are reweighted at each iteration to build a strong classifier, which learns from the mistakes previously made by the weak classifiers.

This algorithm constructs a committee, $C^\star(x)$, as a linear combination (weighted sum) of $T$ weak classifiers (weak learners).

$$
C^\star(x) = \sum^T_{i = 1} \alpha_i C_i(x)
$$

where $C_i(x)$ is a weak classifier and $\alpha_i$ is the weight assigned to each classifier/learner.
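The weighted committee can be sketched from scratch. The code below follows the classic discrete AdaBoost formulation with labels in $\{-1, +1\}$ and decision stumps as weak classifiers; it is an illustrative assumption, not the implementation used by `AdaBoostClassifier`, and it runs on synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (illustrative only); labels mapped to {-1, +1}
X, y01 = make_classification(n_samples=300, n_features=6, random_state=0)
y = 2 * y01 - 1

T = 20                                # number of weak classifiers
w = np.full(len(y), 1 / len(y))       # uniform initial sample weights
stumps, alphas = [], []

for _ in range(T):
    # Weak classifier C_i: a depth-1 tree trained on the weighted samples
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    # Weighted error and classifier weight alpha_i
    err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)

    # Increase the weights of misclassified samples, then renormalize
    w = w * np.exp(-alpha * y * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

def committee(X_new):
    """Sign of the weighted sum of the weak classifiers' votes."""
    votes = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.sign(votes)

train_acc = np.mean(committee(X) == y)
print(f"Training accuracy of the committee: {train_acc:.3f}")
```

The predicted class is the sign of $C^\star(x)$, so classifiers with lower weighted error (larger $\alpha_i$) have more influence on the final vote.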

In [None]:
# Create the AdaBoost estimator with default hyperparameters
adaboost = AdaBoostClassifier()

In [None]:
# Training
adaboost.fit(X_train_norm, y_train)

**Gradient Boosting**

Like AdaBoost, Gradient Boosting works by sequentially adding predictors to a committee, where each one corrects the errors of its predecessor. However, instead of adjusting the instance weights at each iteration (as AdaBoost does), this method fits each new predictor to the residual errors left by the committee built so far.
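The residual-fitting idea is easiest to see in a regression setting with squared error, where the residual is simply the target minus the current prediction. The sketch below makes that simplifying assumption (`GradientBoostingClassifier` works with the gradient of a classification loss instead) and uses synthetic data; the learning rate and tree depth are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())          # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction                  # errors left by the current committee
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                      # new predictor targets the residuals
    prediction = prediction + learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Each added tree reduces the remaining training error a little, and the learning rate controls how much of each correction is applied.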

In [None]:
# Create the Gradient Boosting estimator with default hyperparameters
gradientboost = GradientBoostingClassifier()

In [None]:
# Training
gradientboost.fit(X_train_norm, y_train)

### Learning curves

Learning curves visualize the performance of a model over the training set and during cross-validation, as the number of observations in the training set increases. They are commonly used to determine whether learning algorithms could benefit from collecting additional training data.

**Definition of validation criteria**

In [None]:
# Each split randomly holds out 20% of the data for validation (10 splits by default)
cv = ShuffleSplit(test_size = 0.2, random_state = 31)

**AdaBoost Model Learning Curve**

In [None]:
# Compute training and validation accuracy for increasing training-set sizes
train_sizes_adaboost, acc_treino, acc_val = learning_curve(adaboost,
                                                           X_train_norm,
                                                           y_train,
                                                           scoring = 'accuracy',
                                                           random_state = 31,
                                                           cv = cv)

In [None]:
# Average the scores over the cross-validation splits
acc_treino_adaboost = np.mean(acc_treino, axis=1)
acc_val_adaboost = np.mean(acc_val, axis=1)

In [None]:
# Learning curve plot (x-axis: actual number of training examples used)
plt.figure(figsize = (10,6))
plt.plot(train_sizes_adaboost, acc_treino_adaboost, alpha=0.7)         # Train
plt.plot(train_sizes_adaboost, acc_val_adaboost, 'g--', alpha=0.7)     # Validation

# Axis labels, title, and legend
plt.xlabel('Number of training examples', fontweight='bold')
plt.ylabel('Accuracy', fontweight='bold')
plt.title('Learning curve: AdaBoost', fontweight='bold')
plt.legend(['Train', 'Validation'])

The plot above shows that, from the second plotted training-set size onwards, the algorithm starts to adjust to the examples of the training and validation sets: accuracy on the training data trends downward, while accuracy on the validation data trends upward. In other words, as the number of training examples increases, the training error tends to grow and the validation error tends to shrink. This pattern points to a possible *underfitting* problem.

**Gradient Boosting Model Learning Curve**

In [None]:
# Compute training and validation accuracy for increasing training-set sizes
train_sizes_gradientboost, acc_treino, acc_val = learning_curve(gradientboost,
                                                                X_train_norm,
                                                                y_train,
                                                                scoring = 'accuracy',
                                                                random_state = 31,
                                                                cv = cv)

In [None]:
# Average the scores over the cross-validation splits
acc_treino_gradientboost = np.mean(acc_treino, axis=1)
acc_val_gradientboost = np.mean(acc_val, axis=1)

In [None]:
# Learning curve plot (x-axis: actual number of training examples used)
plt.figure(figsize = (10,6))
plt.plot(train_sizes_gradientboost, acc_treino_gradientboost, alpha=0.7)         # Train
plt.plot(train_sizes_gradientboost, acc_val_gradientboost, 'g--', alpha=0.7)     # Validation

# Axis labels, title, and legend
plt.xlabel('Number of training examples', fontweight='bold')
plt.ylabel('Accuracy', fontweight='bold')
plt.title('Learning curve: Gradient Boosting', fontweight='bold')
plt.legend(['Train', 'Validation'])

As with AdaBoost, the plot above shows that, from the second plotted training-set size onwards, accuracy on the training data trends downward while accuracy on the validation data trends upward. In other words, as the number of training examples increases, the training error tends to grow and the validation error tends to shrink. This pattern again points to a possible *underfitting* problem.

**Comparison between learning curves**

In [None]:
fig = plt.figure(figsize=(20,6))
fig.suptitle('Comparison of Learning Curves', fontweight='bold')

ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

# x-axis: fraction of the available training examples used at each point
# (assumes learning_curve ran with its default train_sizes, np.linspace(0.1, 1.0, 5))
train_fraction = np.linspace(0.1, 1.0, len(acc_treino_adaboost))

# AdaBoost learning curve
ax1.plot(train_fraction, acc_treino_adaboost, alpha=0.7)         # Train
ax1.plot(train_fraction, acc_val_adaboost, 'g--', alpha=0.7)     # Validation

# Axis labels and title
ax1.set_xlabel('Fraction of training examples used', fontweight='bold')
ax1.set_ylabel('Accuracy', fontweight='bold')
ax1.set_title('Learning curve: AdaBoost', fontweight='bold')


# Gradient Boosting learning curve
ax2.plot(train_fraction, acc_treino_gradientboost, alpha=0.7)         # Train
ax2.plot(train_fraction, acc_val_gradientboost, 'g--', alpha=0.7)     # Validation

# Axis labels, title, and legend
ax2.legend(['Train', 'Validation'], loc='upper right')
ax2.set_xlabel('Fraction of training examples used', fontweight='bold')
ax2.set_title('Learning curve: Gradient Boosting', fontweight='bold')

# Final visual adjustments
sns.despine()

As discussed for each model individually, both learning curves indicate a possible *underfitting* problem.

The two cases also differ in several respects:

- In AdaBoost the training and validation curves visibly converge, whereas in Gradient Boosting they remain further apart.
- The training and validation curves follow smoother trajectories for Gradient Boosting than for AdaBoost.
- The accuracy of the Gradient Boosting learning curves remained, on average, higher than that of the AdaBoost learning curves.
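The curves above vary the training-set size. To plot accuracy against the number of boosting iterations with an 80/20 split, as the objectives mention, one possible approach is the `staged_predict` method that both estimators provide. The sketch below is an assumption about how this could be done, not the notebook's method: it uses synthetic data with 11 features, and the helper `staged_accuracy` and the choice of 50 estimators are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data (illustrative only)
X, y = make_classification(n_samples=500, n_features=11, random_state=31)

# 80/20 split into training and validation subsets
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=31)

def staged_accuracy(model, X_part, y_part):
    """Accuracy after each boosting iteration (hypothetical helper)."""
    return np.array([accuracy_score(y_part, pred)
                     for pred in model.staged_predict(X_part)])

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=31),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=31),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = (staged_accuracy(model, X_tr, y_tr),
                     staged_accuracy(model, X_val, y_val))
    print(name, "final validation accuracy:", round(results[name][1][-1], 3))
```

Each entry of `results` holds one training and one validation accuracy array, indexed by iteration, which can be passed to `plt.plot` in the same way as the learning curves.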

### Model Prediction and Evaluation

**AdaBoost**

In [None]:
# Using models for prediction
y_pred_adaboost = adaboost.predict(X_test_norm)

In [None]:
# Model evaluation
print(classification_report(y_test, y_pred_adaboost))

**Gradient Boosting**

In [None]:
# Using models for prediction
y_pred_gradientboost = gradientboost.predict(X_test_norm)

In [None]:
# Model evaluation
print(classification_report(y_test, y_pred_gradientboost))

**Comparison of performance between models**

Comparing the two classification reports above, it is possible to infer that Gradient Boosting had better predictive performance than AdaBoost. This was expected, since the Gradient Boosting learning curves were more stable and sat at a higher level of accuracy than those of the AdaBoost model.
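To place the two reports side by side, the headline metrics can be collected into one table with `classification_report(..., output_dict=True)`. This is a minimal sketch: the labels and predictions are made-up placeholders, not results from this notebook, and the helper `summary` is hypothetical.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Placeholder labels and predictions (illustrative only, not notebook results)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
pred_model_a = [0, 1, 1, 1, 0, 0, 1, 0]
pred_model_b = [0, 0, 1, 1, 0, 0, 1, 0]

def summary(y_true, y_pred):
    """Extract accuracy and macro-averaged F1 from the report (hypothetical helper)."""
    report = classification_report(y_true, y_pred, output_dict=True)
    return {"accuracy": report["accuracy"],
            "macro_f1": report["macro avg"]["f1-score"]}

comparison = pd.DataFrame({"model_a": summary(y_true, pred_model_a),
                           "model_b": summary(y_true, pred_model_b)})
print(comparison)
```

In the notebook, `y_test` with `y_pred_adaboost` and `y_pred_gradientboost` would take the place of the placeholder lists.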