# Summary

This work is organized as follows:

* 1. Introduction
* 2. Data analysis
* 3. Validation method
* 4. Model training
* 5. Submission


# 1. Introduction

This competition was chosen as an internship project for a study programme in Data Science.

It was hosted by Santander with the aim of identifying "which customers will make a specific transaction in the future, irrespective of the amount of money transacted" (as written in the description of the challenge).

The data provided were anonymized and contain only numeric feature variables. The target is binary. There are three files to work with: one with training data, another with test data, and a sample showing the correct submission format.

This work is inspired by the notebook "Projeto completo de Classificação Binária (Diabetes)", by Marcos Kalinowski and Tatiana Escovedo, used in the course "Engenharia de Software para Ciência de Dados - PUC-Rio".

Other notebooks and sites were used as references:
* https://towardsdatascience.com/kagglers-guide-to-lightgbm-hyperparameter-tuning-with-optuna-in-2021-ed048d9838b5
* https://www.kaggle.com/code/dimanjung/lgbm-with-parameters
* https://www.kaggle.com/code/dott1718/922-in-3-minutes/notebook
* https://www.kaggle.com/code/thomasandarilho/tp3-desafio-vivencial
* https://www.kaggle.com/code/prashant111/lightgbm-classifier-in-python/notebook
* https://lightgbm.readthedocs.io/en/latest/
* https://scikit-learn.org/stable/



## 1.1. Loading libraries and data

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)

# Sklearn preprocessing
from sklearn.preprocessing import StandardScaler # for standardization
from sklearn.preprocessing import MinMaxScaler # for normalization
from sklearn.decomposition import PCA

from sklearn.model_selection import train_test_split
from sklearn.model_selection import ParameterGrid

# Metrics
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neural_network import MLPClassifier
from lightgbm import LGBMClassifier
import lightgbm

# To balance data
from imblearn.over_sampling import SMOTE 

import time
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Load the data
df = pd.read_csv('/kaggle/input/santander-customer-transaction-prediction/train.csv')
df.set_index('ID_code',inplace=True)

test = pd.read_csv('/kaggle/input/santander-customer-transaction-prediction/test.csv')
test.set_index('ID_code',inplace=True)

# 2. Data analysis

The data is comprised of two hundred variables and one target. As the data is anonymized, it is not possible to assign meaning to the variables. There are two hundred thousand entries.

In [None]:
df.sample(5)

In [None]:
print('Data has {1} columns and {0} entries'.format(df.shape[0], df.shape[1]))

The "target" is an integer, while all the other columns are float.

In [None]:
df.info()

Variables have different means, minimum and maximum values.

In [None]:
df.describe()

"Positives" are 10.05% of the total entries, meaning that the data is imbalanced.

In [None]:
number_of_positive = df.groupby('target').target.count()[1]
number_of_negative = df.groupby('target').target.count()[0]

print("There are {0} entries 'Positive' and {1} entries 'Negative'. Positives are {2:.2f}% of total.".format(number_of_positive, number_of_negative, 100*number_of_positive/(number_of_positive+number_of_negative)))

The histograms of all variables are shown below.

In [None]:
fig, axs = plt.subplots(40, 5, figsize=(5*3, 40*3),sharey='row')
#fig.suptitle('Histogram of all variables')
i = 0
for col in df.columns[1:]:

    if i%5 == 0:
        legend=True
    else:
        legend=False
        
    sns.histplot(data=df, x=col,kde=True, hue='target',multiple='stack',ax=axs[i//5,i%5],legend=legend)
        
    axs[i//5,i%5].set_title(col)
    axs[i//5,i%5].set_xlabel('')
    axs[i//5,i%5].set_ylabel('')
    i+=1
plt.show()

In [None]:
median_values = df.iloc[:,1:].median()
mean_values = df.iloc[:,1:].mean()
std_values = df.iloc[:,1:].std()
coef_of_variation = std_values / mean_values  # coefficient of variation = std / mean


#plt.figure(figsize=(16, 16))
fig, axs = plt.subplots(1,2,figsize=(12, 4))


axs[0].scatter(x=mean_values,y=std_values)
axs[0].set_title("Mean x standard deviation")
axs[0].set_xlabel("Mean")
axs[0].set_ylabel("Standard deviation")

axs[1].scatter(x=mean_values,y=median_values)
axs[1].set_title("Mean x Median")
axs[1].set_xlabel("Mean")
axs[1].set_ylabel("Median")

plt.show()

The mean correlation between variables is 0.005, and the 99th percentile is 0.0064. According to these results and the histogram of the correlation values, the variables are essentially uncorrelated.

In [None]:
correlations = df.iloc[:,1:].corr()
array_of_correlations = correlations.values.reshape(200*200)
print("\nThe mean value of correlations between variables is {0:.2}".format(array_of_correlations.mean()))
print("\nThe 99th percentile of correlations between variables is {0:.2}".format(np.percentile(array_of_correlations,q=99)))
print('\n\n')

plt.hist(correlations.values.reshape(200*200))
plt.title('Histogram of correlations between variables')

plt.show()
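
Note that the full 200 x 200 correlation matrix includes the diagonal, where every variable correlates perfectly with itself; 200 ones spread over 40,000 cells contribute exactly 0.005 to the mean. A minimal sketch of restricting the statistics to the off-diagonal pairs, using small synthetic data (`rng`, `data` and `off_diagonal` are illustrative names, not part of the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 6))             # synthetic, independent columns

corr = np.corrcoef(data, rowvar=False)        # 6 x 6 correlation matrix
rows, cols = np.triu_indices_from(corr, k=1)  # indices strictly above the diagonal
off_diagonal = corr[rows, cols]               # each pair of variables counted once

print("pairs:", off_diagonal.size)
print("mean of all cells:    {:.4f}".format(corr.mean()))
print("mean of off-diagonal: {:.4f}".format(off_diagonal.mean()))
print("mean |off-diagonal|:  {:.4f}".format(np.abs(off_diagonal).mean()))
```

The same indexing could be applied to `correlations.values` to check whether the conclusion holds once the diagonal is excluded.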

Computing the mean of each variable separately for positive and negative entries and comparing the two shows that they are nearly identical: the scatter plot falls almost exactly on the line y = x.

In [None]:
values = df.columns.values[1:]
pivot_table = pd.pivot_table(df,values=values,columns='target',aggfunc=[np.mean])
print(pivot_table.sample(10),'\n\n')

negative_mean = pivot_table.loc[:,('mean',0)].values
positive_mean = pivot_table.loc[:,('mean',1)].values
difference = 1 * (negative_mean - positive_mean)
perc_difference = 100 * (negative_mean / positive_mean - 1)

fig, axs = plt.subplots(2,2,figsize=(4*2, 4*2),constrained_layout=True)


axs[0,0].scatter(x=positive_mean,y=negative_mean)
axs[0,0].set_xlabel('Positive mean')
axs[0,0].set_ylabel('Negative mean')
axs[0,0].set_title('Comparison between\n the mean of positive and negative entries')
axs[0,0].set_xlim(0,27)
axs[0,0].set_ylim(0,27)


axs[0,1].hist(difference)
axs[0,1].set_title('Absolute difference between\n the mean of positive and negative entries')

axs[1,0].hist(perc_difference,bins=100)
axs[1,0].set_title('Relative difference between\n the mean of positive and negative entries')

axs[1,1].hist(perc_difference,bins=100)
axs[1,1].set_title('Relative difference between\n the mean of positive and negative entries\n(Between -100% and 100%)')
axs[1,1].set_xlim(-100,100)


plt.show()
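
The relative difference becomes very large whenever the positive-class mean is close to zero. One possible alternative, offered only as a sketch on synthetic data, is to scale the difference of means by a pooled standard deviation (the arrays `negative` and `positive` and the chosen class means are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic classes: three variables whose true mean shifts are 0.0, 0.5 and 2.0
negative = rng.normal(loc=0.0, scale=2.0, size=(5000, 3))
positive = rng.normal(loc=[0.0, 0.5, 2.0], scale=2.0, size=(500, 3))

mean_diff = positive.mean(axis=0) - negative.mean(axis=0)
pooled_std = np.sqrt((positive.var(axis=0, ddof=1) + negative.var(axis=0, ddof=1)) / 2)
standardized_diff = mean_diff / pooled_std   # difference in units of standard deviation

print(np.round(standardized_diff, 2))
```

Because the denominator is a spread rather than a mean, the ratio stays bounded even for variables centred near zero.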

# 3. Validation method

For testing, the data will be split into "train" and "test" sets. Afterwards, the chosen model will be validated using the file "test.csv" provided by the competition.

For evaluating the model, the area under the ROC curve between the predicted probability and the observed target will be used.
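
AUC is a ranking metric, so it needs continuous scores rather than thresholded class labels. A minimal pure-Python sketch of its pairwise definition, which for binary targets should agree with what `roc_auc_score` reports (the function `auc` and the toy values are illustrative, not part of the notebook):

```python
def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, t in zip(scores, y_true) if t == 1]
    negatives = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 0, 1, 1]
probabilities = [0.10, 0.40, 0.35, 0.45, 0.80]   # hypothetical model output
labels = [int(p >= 0.5) for p in probabilities]  # thresholded at 0.5

auc_from_probabilities = auc(y_true, probabilities)
auc_from_labels = auc(y_true, labels)
print(auc_from_probabilities, auc_from_labels)   # 1.0 0.75
```

In this toy case the probabilities rank every positive above every negative, but thresholding discards that ordering and lowers the score.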

In [None]:
X = df.iloc[:,1:].values
y = df.iloc[:,0].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Model training

## 4.1. Resampling

As the data is imbalanced, SMOTE will be used to balance it.
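
SMOTE creates each synthetic minority sample by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below shows only that core idea with NumPy; it is not the imbalanced-learn implementation, and `synthetic_point` and the toy data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 2))   # toy minority-class samples

def synthetic_point(samples, index, k, rng):
    """Interpolate between samples[index] and one of its k nearest neighbours."""
    distances = np.linalg.norm(samples - samples[index], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]   # position 0 is the sample itself
    chosen = samples[rng.choice(neighbours)]
    gap = rng.random()                            # uniform in [0, 1)
    return samples[index] + gap * (chosen - samples[index])

new_point = synthetic_point(minority, index=0, k=5, rng=rng)
print(new_point)
```

The new point always lies on the segment joining the two original samples, which is why the `k_neighbors` setting controls how local the synthetic data stays.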

In [None]:
# Resample using SMOTE
sm = SMOTE(random_state=42,sampling_strategy='minority',k_neighbors=5)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

# Shuffle the data to avoid errors in cross-validation scoring (to avoid creating a fold with only one class)
matrix_shuffled = np.concatenate((y_resampled.reshape(len(y_resampled),1),X_resampled),axis=1)
np.random.shuffle(matrix_shuffled)

# Assign the shuffled data to the variables
X_resampled = matrix_shuffled[:,1:]
y_resampled = matrix_shuffled[:,0]

## 4.2. Test models

First, several models will be tested to find out which gives the best results.

The following models will be tested: Logistic Regression (LR), Decision Tree (CART), Naive Bayes (NB), Ada Boost Classifier (AB), Extra Trees Classifier (ET), Multi-layer Perceptron classifier (MLP) and Light GBM (lgbm).

These models were chosen according to their results and speed to run.

Standard Scaler and Principal component analysis (PCA) will be used as well.

In [None]:
def test_models(scalers, pcas, models, X_resampled_selected, y_resampled, X_test_selected):
    #kfold = KFold(n_splits=num_folds)
    results = []
    names = []
    durations = []

    runs = len(models) * len(pcas) * len(scalers)    
    run = 1

    for name_model, model in models:
        for name_pca, pca in pcas:
            for name_scaler, scaler in scalers:

                if name_scaler == 'Bypass' and name_pca != 'Bypass':
                    pipe = Pipeline(steps=[(name_pca, pca),(name_model,model)])
                elif name_scaler != 'Bypass' and name_pca == 'Bypass':
                    pipe = Pipeline(steps=[(name_scaler,scaler),(name_model,model)])
                elif name_scaler == 'Bypass' and name_pca == 'Bypass':
                    pipe = Pipeline(steps=[(name_model,model)])
                else:
                    pipe = Pipeline(steps=[(name_scaler,scaler),(name_pca, pca),(name_model,model)])
                
                # Resampled with SMOTE
                start_time = time.time()
                #cv_results = cross_val_score(pipe, X_resampled, y_resampled, cv=kfold, scoring=scoring).tolist()
                pipe.fit(X_resampled_selected, y_resampled)
                y_predict = pipe.predict(X_test_selected)
                # AUC is computed from the predicted probability of the positive class;
                # y_test is the global hold-out target defined in section 3
                y_proba = pipe.predict_proba(X_test_selected)[:, 1]
                roc_results = roc_auc_score(y_test, y_proba)
                f1 = f1_score(y_test,y_predict,average='weighted')

                names.append('Smote|' + name_scaler + ' / ' + name_pca + ' / ' + name_model)
                #results.append(cv_results)
                results.append(roc_results)
                duration = time.time() - start_time
                durations.append(duration)
                print('Run {0}/{1}: Smote|{2}|{3}|{4}, AUC: {5:.2f}, F1: {8:.2f}, duration: {6:.2f} s / {7:.2f} min'.format(run,runs,name_scaler,name_pca,name_model,roc_results,duration,duration/60,f1))
                run = run + 1
    return names, results, durations
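
The four branches that assemble the pipeline steps could be collapsed into a single filter that drops any stage named 'Bypass'. A minimal sketch, in which `build_steps` is a hypothetical helper and plain strings stand in for the real estimator objects:

```python
def build_steps(named_scaler, named_pca, named_model):
    """Return the (name, step) pairs, skipping any stage named 'Bypass'."""
    candidates = [named_scaler, named_pca, named_model]
    return [(name, step) for name, step in candidates if name != 'Bypass']

# Strings stand in for the real estimator objects in this sketch.
steps = build_steps(('standard_scaler', 'scaler'), ('Bypass', 'Bypass'), ('LR', 'model'))
print(steps)   # [('standard_scaler', 'scaler'), ('LR', 'model')]
```

The resulting list has the shape expected by `Pipeline(steps=...)`.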



Defining the models:

In [None]:
# Model creation

scalers = [('standard_scaler', StandardScaler())
          ]
pcas = [('pca_10', PCA(n_components=10))]

models = [('LR', LogisticRegression(solver='liblinear')),
          ('CART', DecisionTreeClassifier()),
          ('NB', GaussianNB()),
          ('AB', AdaBoostClassifier()),
          ('ET', ExtraTreesClassifier(n_estimators=10)),
          ('MLP', MLPClassifier(hidden_layer_sizes=(100,),activation='relu')),
          ('lgbm', LGBMClassifier())
         ]


Running the models:

In [None]:
names, results, durations = test_models(scalers, pcas, models,X_resampled, y_resampled, X_test)

The graphs below show the results and time to run of each model.

In [None]:
fig = plt.figure(figsize=(8,6)) 
sns.barplot(y=names,x=results)
plt.title('Result of the models (AUC)')
plt.show()

In [None]:
fig = plt.figure(figsize=(8,6)) 
sns.barplot(y=names,x=durations)
plt.title('Time to run')
plt.show()

According to the previous results and the time to run, the following models were chosen for a deeper investigation: Logistic Regression (LR), Naive Bayes (NB) and LightGBM (lgbm).

Furthermore, MinMaxScaler and other numbers of PCA components will be tested.

In [None]:
# Model creation

scalers = [('min_max_scaler', MinMaxScaler()), 
          ('standard_scaler', StandardScaler())
]

pcas = [('Bypass','Bypass'),
    ('pca_100', PCA(n_components=100)),
    ('pca_50', PCA(n_components=50)),
    ('pca_10', PCA(n_components=10)),
    ('pca_5', PCA(n_components=5))
]

models = [('LR', LogisticRegression(solver='liblinear')),
          ('NB', GaussianNB()),
          ('lgbm', LGBMClassifier())
         ]


Running the models:

In [None]:
names, results, durations = test_models(scalers, pcas, models,X_resampled, y_resampled, X_test)

The graphs below show the results and time to run of each model.

In [None]:
fig = plt.figure(figsize=(8,6)) 
sns.barplot(y=names,x=results)
plt.title('Result of the models (AUC)')
plt.show()

In [None]:
fig = plt.figure(figsize=(8,6)) 
sns.barplot(y=names,x=durations)
plt.title('Time to run')
plt.show()

## 4.3. Parameter adjustment

In this section, the parameters of the best models will be tuned.

### 4.3.1. Logistic Regression

In [None]:
# Parameters of pipelines
param_grid = {
    'dual': [True,False],
    'tol':[1e-4,1e-1],
    'C':[1.0,100.0],
    'solver': ['liblinear'],
    'intercept_scaling': [1,100],
    'max_iter':[300]
}

parameters = ParameterGrid(param_grid)
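
`ParameterGrid` expands the dictionary into every combination of the listed values. A rough pure-Python equivalent using `itertools.product` is sketched below to show what the loop will iterate over; the iteration order of scikit-learn's own class may differ, and `combinations` is an illustrative name:

```python
from itertools import product

param_grid = {
    'dual': [True, False],
    'tol': [1e-4, 1e-1],
    'C': [1.0, 100.0],
    'solver': ['liblinear'],
    'intercept_scaling': [1, 100],
    'max_iter': [300],
}

keys = sorted(param_grid)
# One dictionary per combination of values, e.g. {'C': 1.0, 'dual': True, ...}
combinations = [dict(zip(keys, values))
                for values in product(*(param_grid[k] for k in keys))]

print(len(combinations))   # 2 * 2 * 2 * 1 * 2 * 1 = 16
```

Each extra value added to any list multiplies the number of runs, so the grid grows quickly.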

In [None]:
runs = len(parameters)
run = 1
for parameter in parameters:
    pipe = Pipeline(steps=[('Scaler', StandardScaler()), 
                           ('pca_5', PCA(n_components=5)),
                           ('LR',LogisticRegression(intercept_scaling=parameter['intercept_scaling'],
                                                    solver=parameter['solver'],
                                                    tol=parameter['tol'],
                                                    C=parameter['C'],  
                                                    dual=parameter['dual'],
                                                    max_iter=parameter['max_iter']
                               
                                                                               ))])
    
    start_time = time.time()
    pipe.fit(X_resampled, y_resampled)
    y_predict = pipe.predict(X_test)
    # AUC uses the predicted probability of the positive class, not the hard labels
    y_proba = pipe.predict_proba(X_test)[:, 1]
    roc_results = roc_auc_score(y_test, y_proba)
    f1 = f1_score(y_test,y_predict,average='weighted')
    duration = time.time() - start_time

    print('Run {0}/{1}: {2}, AUC: {3:.5f}, F1: {4:.5f}, duration: {5:.2f} s'.format(run,runs,parameter,roc_results, f1, duration))
    run = run + 1


### 4.3.2. Gaussian Naive Bayes

In [None]:
# Parameters of pipelines
param_grid = {
    'var_smoothing': [1e-150,1e-9,1e-3],
}

parameters = ParameterGrid(param_grid)

In [None]:
runs = len(parameters)
run = 1
for parameter in parameters:
    pipe = Pipeline(steps=[('Scaler', StandardScaler()), 
                           ('pca_5', PCA(n_components=5)),
                           ('NB',GaussianNB(var_smoothing=parameter['var_smoothing']                           
                                                                               ))])
    
    start_time = time.time()
    pipe.fit(X_resampled, y_resampled)
    y_predict = pipe.predict(X_test)
    # AUC uses the predicted probability of the positive class, not the hard labels
    y_proba = pipe.predict_proba(X_test)[:, 1]
    roc_results = roc_auc_score(y_test, y_proba)
    f1 = f1_score(y_test,y_predict,average='weighted')
    duration = time.time() - start_time

    print('Run {0}/{1}: {2}, AUC: {3:.5f}, F1: {4:.5f}, duration: {5:.2f} s'.format(run,runs,parameter,roc_results, f1, duration))
    run = run + 1

### 4.3.3. Light GBM

In [None]:
# Reference: https://www.kaggle.com/code/dimanjung/lgbm-with-parameters

pipe = Pipeline(steps=[('Scaler', StandardScaler()),
                       ('pca_5', PCA(n_components=5))])
pipe.fit(X_resampled, y_resampled)
X_resampled_transformed = pipe.transform(X_resampled)
X_test_transformed = pipe.transform(X_test)
                
train_data = lightgbm.Dataset(X_resampled_transformed, label=y_resampled)
valid_data = lightgbm.Dataset(X_test_transformed, label=y_test)

In [None]:
param_grid = {
    'objective': ['binary'],  # lightgbm.train defaults to regression otherwise
    'n_estimators': [10,25,50],
    'learning_rate': [0.01,0.05],
    'metric':['auc'],
    'verbose': [-1],
    'boosting': ['gbdt'],
}


parameters = ParameterGrid(param_grid)

In [None]:
runs = len(parameters)
run = 1

for parameter in parameters:
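    # Note: 'n_estimators' in the parameters takes precedence over num_boost_round.
    # Early stopping is passed as a callback, which also works in recent LightGBM
    # releases where early_stopping_rounds and verbose_eval are no longer accepted.
    model_lgbm = lightgbm.train(parameter,
                                train_data,
                                valid_sets=[valid_data],
                                num_boost_round=20000,
                                callbacks=[lightgbm.early_stopping(stopping_rounds=100, verbose=False)])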

    y_predictions = model_lgbm.predict(X_test_transformed)
    print("Run {2}/{3}, Parameters: {0}, AUC: {1:.5f}".format(parameter,roc_auc_score(y_test, y_predictions,average='weighted'),run,runs))
    
    run = run + 1

## 4.4. Retrain the model with all data

Now all data will be used to build the model. Previously, the data had been split into train and test.

LightGBM was chosen for the submission.

In [None]:
# Resample using SMOTE
sm = SMOTE(random_state=42,sampling_strategy='minority',k_neighbors=5)
X_resampled, y_resampled = sm.fit_resample(X, y)

# Shuffle the data to avoid errors in cross-validation scoring (to avoid creating a fold with only one class)
matrix_shuffled = np.concatenate((y_resampled.reshape(len(y_resampled),1),X_resampled),axis=1)
np.random.shuffle(matrix_shuffled)

# Assign the shuffled data to the variables
X_resampled = matrix_shuffled[:,1:]
y_resampled = matrix_shuffled[:,0]

In [None]:
pipe = Pipeline(steps=[('Scaler', StandardScaler()),
                       ('pca_5', PCA(n_components=5))])
pipe.fit(X_resampled, y_resampled)
X_resampled_transformed = pipe.transform(X_resampled)
X_transformed = pipe.transform(X)
                
train_data = lightgbm.Dataset(X_resampled_transformed, label=y_resampled)
# The validation set must use the same transformed features as the training set
valid_data = lightgbm.Dataset(X_transformed, label=y)

In [None]:
param_grid = {
    'objective': ['binary'],  # lightgbm.train defaults to regression otherwise
    'n_estimators': [25],
    'learning_rate': [0.05],
    'metric':['auc'],
    'verbose': [-1],
    'boosting': ['gbdt']
}


parameters = ParameterGrid(param_grid)

In [None]:
runs = len(parameters)
run = 1

for parameter in parameters:
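    # Note: 'n_estimators' in the parameters takes precedence over num_boost_round.
    # Early stopping is passed as a callback, which also works in recent LightGBM
    # releases where early_stopping_rounds and verbose_eval are no longer accepted.
    model_lgbm = lightgbm.train(parameter,
                                train_data,
                                valid_sets=[valid_data],
                                num_boost_round=20000,
                                callbacks=[lightgbm.early_stopping(stopping_rounds=100, verbose=False)])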

    y_predictions = model_lgbm.predict(X_transformed)
    print("Run {2}/{3}, Parameters: {0}, AUC: {1:.5f}".format(parameter,roc_auc_score(y, y_predictions,average='weighted'),run,runs))
    
    run = run + 1

# 5. Submission

In [None]:
test_transformed = pipe.transform(test.values)
test_result=model_lgbm.predict(test_transformed)

In [None]:
test['target']=test_result
test.reset_index(inplace=True)
submit=test[['ID_code','target']]
submit.to_csv('submission.csv',index=False)