# Kaggle competition - Santander Customer Transaction Prediction

Marcelo Abbehusen

#### This kernel is part of the Petrobras Data Science Training Program final challenge.

* [Importing Packages](#section-one)
* [Functions](#section-two)
* [Loading Data](#section-three)
* [EDA (Exploratory Data Analysis)](#section-four)
* [Training a baseline model (XGBoost)](#section-five)
* [Tuned XGBoost Model](#section-six)
* [Tuned Weighted XGBoost Model](#section-seven)
* [Feature Engineering](#section-eight)
    - [Identifying Magic Numbers](#subsection-one)
* [Training the best model on the magic numbers dataset](#section-ten)
* [Training and ensembling 200 different models](#section-eleven)
* [Ensemble 200 models using VotingClassifier](#section-twelve)
* [Suggestions](#section-thirteen)

<a id="section-one"></a>
# Importing Packages

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_auc_score, f1_score
import matplotlib.pyplot as plt
import seaborn as sns
from xgboost import XGBClassifier
from collections import Counter
from sklearn.ensemble import VotingClassifier
import pickle
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings("ignore")

<a id="section-two"></a>
# Functions

In [None]:
def plot_boxplots(dataset):
    '''Plot boxplots of all the variables
    of the dataset in a row'''
    for i in range(dataset.shape[1]):
        sns.boxplot(x=dataset.iloc[:, i])
        plt.tight_layout()
        plt.show()

def reduce_mem_usage(df, verbose=True):
    numerics = ['int8','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

def f1_eval(y_pred, dtrain):
    '''Calculates f1 score error'''
    y_true = dtrain.get_label()
    err = 1-f1_score(y_true, np.round(y_pred))
    return 'f1_err', err

<a id="section-three"></a>
# Loading data

In [None]:
df_test = pd.read_csv('../input/santander-customer-transaction-prediction/test.csv')
df_train = pd.read_csv('../input/santander-customer-transaction-prediction/train.csv')

In [None]:
df_test.head()

In [None]:
df_train.head()

In [None]:
df_train.shape, df_test.shape

#### Before starting our analysis, we'll use the function reduce_mem_usage, which reduces memory usage by about 75% by changing the data type of each column. This function was obtained from this [kernel](https://www.kaggle.com/code/somang1418/tuning-hyperparameters-under-10-minutes-lgbm).

In [None]:
df_train = reduce_mem_usage(df_train)
df_test = reduce_mem_usage(df_test)
print("Shape of train set: ", df_train.shape)
print("Shape of test set: ", df_test.shape)
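#### To illustrate where the saving comes from, here is a minimal sketch on a toy frame (the `toy` data and its column names are made up for illustration and are not the competition files). It also shows a side effect worth keeping in mind: `float16` keeps only about three significant decimal digits, so distinct values can collapse into one.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(1000, 4)),
                   columns=[f'var_{i}' for i in range(4)])

# Memory used by the values only (index excluded), in bytes
mem_64 = toy.memory_usage(index=False).sum()
mem_32 = toy.astype(np.float32).memory_usage(index=False).sum()
mem_16 = toy.astype(np.float16).memory_usage(index=False).sum()

reduction_32 = 100 * (mem_64 - mem_32) / mem_64
reduction_16 = 100 * (mem_64 - mem_16) / mem_64

# Number of distinct values in the first column before and after downcasting
values_64 = toy['var_0'].to_numpy()
unique_64 = np.unique(values_64).size
unique_16 = np.unique(values_64.astype(np.float16)).size

print('float64 -> float32: {:.1f}% reduction'.format(reduction_32))
print('float64 -> float16: {:.1f}% reduction'.format(reduction_16))
print('distinct values: {} (float64) vs {} (float16)'.format(unique_64, unique_16))
```

#### Because downcasting can merge nearby values, it is safer to work from the original files for any step that depends on exact value counts.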

#### Defining df_test_ids, which will be used later in the submission file, and dropping ID_code from the test set.

In [None]:
df_test_ids = df_test.ID_code
df_test = df_test.drop('ID_code', axis=1)

#### Setting target as variable y and the rest of the variables as X

#### ID_code will be dropped, since it's just an index.

In [None]:
y = df_train['target'].copy()
X = df_train.drop(['target', 'ID_code'], axis=1)

<a id="section-four"></a>
# Exploratory Data Analysis (EDA)

#### The first step will be doing some basic exploratory data analysis (EDA) to check if there's any preprocessing that needs to be done before training the models.

#### Checking if there's any missing data.

In [None]:
df_train.isna().sum().sum()

In [None]:
df_test.isna().sum().sum()

#### No missing values. Now we'll check if the dataset is balanced or not.

In [None]:
plt.figure(figsize=(7,7))
sns.countplot(data=df_train, x='target')
plt.title('Target class distribution')
plt.xlabel('target')
plt.ylabel('Count')
#plt.savefig('cnn.png', dpi=500)
plt.tight_layout()
plt.show()

#### As we can see on the chart above, the dataset is pretty imbalanced, which means we'll probably have to deal with this problem later by adjusting the training algorithm to take into account this imbalanced distribution of classes and setting different weights to the classes.
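#### A short sketch of why imbalance matters for the metrics used later. The 90/10 split below is a made-up toy ratio, not the exact ratio of this dataset: a model that always predicts class 0 already reaches a high accuracy while never finding a single positive.

```python
import numpy as np

# Toy labels: 900 negatives and 100 positives (hypothetical ratio)
y_toy = np.array([0] * 900 + [1] * 100)

# A trivial "model" that always predicts the majority class
always_zero = np.zeros_like(y_toy)

accuracy = (always_zero == y_toy).mean()
recall_pos = ((always_zero == 1) & (y_toy == 1)).sum() / (y_toy == 1).sum()

print('accuracy: {:.2f}, recall on class 1: {:.2f}'.format(accuracy, recall_pos))
```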

#### Now we should check if there's any correlation between variables.

In [None]:
plt.figure(figsize=(10,10))
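# ID_code is a string column, so it is dropped before computing correlations
sns.heatmap(df_train.drop('ID_code', axis=1).corr(), cmap='YlGnBu');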

#### Even though the heatmap above suggests that there's no correlation between features, a heatmap isn't the best way of visualizing this on a large dataset with 200 columns. Let's instead build a histogram of the pairwise correlations. Note that we need to exclude correlations equal to 1, because they refer to each variable's correlation with itself.

In [None]:
train_corr = X.corr()
train_corr = train_corr.values.flatten()
train_corr = train_corr[train_corr != 1]

plt.figure(figsize=(18, 9))
sns.histplot(train_corr, color='blue', label='train', kde=True, stat='density', linewidth=1)
plt.xlabel('Correlations between variables on the training dataset')
plt.ylabel('Density')
plt.title('Distribution of correlations between features')
plt.tight_layout()
plt.show()

#### The histogram above confirms our previous suspicion. The variables are indeed not correlated.
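#### The `!= 1` filter used above relies on exact float equality and keeps every pair twice, since the correlation matrix is symmetric. An alternative sketch, shown here on random toy data rather than the competition features, keeps only the upper triangle above the diagonal, so each pair appears once.

```python
import numpy as np
import pandas as pd

# Toy data: 6 independent random features (hypothetical)
rng = np.random.default_rng(42)
toy = pd.DataFrame(rng.normal(size=(500, 6)),
                   columns=[f'var_{i}' for i in range(6)])

corr = toy.corr().to_numpy()

# Indices of the strict upper triangle (k=1 skips the diagonal)
rows, cols = np.triu_indices_from(corr, k=1)
pairwise_corr = corr[rows, cols]

n_features = toy.shape[1]
print('{} features -> {} distinct pairs'.format(n_features, pairwise_corr.size))
```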
#### Now we'll check if there are any duplicates.

In [None]:
df_train.duplicated().sum()

#### Now we'll build a baseline model using XGBoost, which is a machine learning algorithm based on gradient boosting decision trees.

<a id="section-five"></a>
# Training a baseline model (XGBoost)

In [None]:
# creating a 5 fold stratifiedkfold cross-validation object

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
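#### Stratification keeps the class ratio roughly constant across folds, which matters with an imbalanced target. A minimal sketch on toy labels (the 90/10 split and the names `y_toy`, `X_toy` and `cv_toy` are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 90 negatives and 10 positives (hypothetical ratio)
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # feature values are irrelevant for the split itself

cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Share of positives in the held-out part of each fold
fold_pos_rates = [y_toy[test_idx].mean() for _, test_idx in cv_toy.split(X_toy, y_toy)]
print(fold_pos_rates)
```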

In [None]:
# splitting the dataset into training (80%) and test (20%) data

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

In [None]:
# Creating a baseline XGBoost model

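# XGBClassifier has no class_weight parameter, so it is not passed here;
# class imbalance is handled later through scale_pos_weight
model_xgb = XGBClassifier(max_depth=7,
                          random_state=42)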

In [None]:
# Analyzing the model using 5-fold cross-validation

scores = cross_val_score(model_xgb, X_train, y_train, scoring='roc_auc', cv=cv, n_jobs=-1)

In [None]:
print('5-fold Cross validation Roc-auc and std: {:.2f} ({:.4f})'.format(np.mean(scores), np.std(scores)))

In [None]:
# creating a variable for the baseline model cross validation roc_auc

baseline_cv_roc_auc = round(np.mean(scores), 4)

In [None]:
# Training the model

model_xgb.fit(X_train, y_train, eval_metric=f1_eval)

In [None]:
# Predicting on X_test

y_pred = model_xgb.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

#### As we can see, the model has a high accuracy because of the highly imbalanced class distribution, although it performs poorly on class 1. Let's dive a bit deeper by checking the model's confusion matrix.

In [None]:
# Creating a variable for the classification report

report = classification_report(y_test, y_pred, output_dict=True)

In [None]:
# Creating variables for precision, recall, f1score, accuracy, roc_auc and class1 f1score

baseline_precision = round(report['macro avg']['precision'], 4)
baseline_recall = round(report['macro avg']['recall'], 4)
baseline_f1score = round(report['macro avg']['f1-score'], 4)
baseline_accuracy = round(report['accuracy'], 4)
baseline_roc_auc = round(roc_auc_score(y_test, y_pred), 4)
baseline_class1_f1score = round(report['1']['f1-score'], 4)
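#### ROC-AUC is a ranking metric, so it is usually computed from predicted scores rather than from hard 0/1 labels. The sketch below uses made-up toy values (not this model's output) to show how much the two can differ: the scores rank most positives above the negatives, but after thresholding at 0.5 every prediction becomes 0 and the ranking information is lost.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth and predicted probabilities (hypothetical values)
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.20, 0.30, 0.40, 0.35, 0.45])

# Hard labels with a 0.5 threshold: every sample is predicted as class 0
y_label = (y_score >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_label)

print('AUC from scores: {:.3f}'.format(auc_from_scores))
print('AUC from labels: {:.3f}'.format(auc_from_labels))
```

#### With a fitted classifier, the scores would typically come from `predict_proba(X_test)[:, 1]` instead of `predict(X_test)`.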

In [None]:
# Checking confusion matrix

target_names = ['0', '1']

plt.figure(figsize=(6, 5))
ax= plt.subplot()
sns.heatmap(confusion_matrix(y_test, y_pred),
            annot=True, fmt='g', ax=ax)

ax.set_xlabel('Predicted values', fontsize=14)
ax.set_ylabel('Expected values', fontsize=14)
ax.set_title('Confusion Matrix', fontsize=14)

ax.xaxis.set_ticklabels(target_names, fontsize=14)
ax.yaxis.set_ticklabels(target_names, fontsize=14)
plt.show()

#### The confusion matrix above confirms what we saw on the classification report. The model's accuracy is high just because of the imbalance of the dataset. As we can see, the model performs poorly on class 1.
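#### For reference, this is how accuracy, precision and recall for class 1 are read off a confusion matrix laid out like the plot above (rows are expected values, columns are predicted values). The counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical confusion matrix: rows = expected, columns = predicted
cm = np.array([[950, 50],
               [150, 50]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # of all predicted positives, how many are correct
recall_1 = tp / (tp + fn)      # of all actual positives, how many were found

print('accuracy: {:.3f}, precision: {:.2f}, recall: {:.2f}'.format(
    accuracy, precision_1, recall_1))
```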

In [None]:
# Creating a dictionary to compare models performances and parameters

dic = {
    'Test Precision': baseline_precision,
    'Test Recall': baseline_recall,
    'Test F1-Score': baseline_f1score,
    'Test Accuracy': baseline_accuracy,
    'Test Roc_auc': baseline_roc_auc,
    'Class 1 F1-score': baseline_class1_f1score, 
    'Validation Roc_auc': baseline_cv_roc_auc}

In [None]:
# Transforming the dictionary into a dataframe

df_performance = pd.DataFrame(dic, index=['Baseline Model (XGBoost)'])

In [None]:
df_performance

### Now we have a baseline model performance that needs to be beaten.

#### Even though we've achieved a pretty high accuracy (91%), the precision and recall for class 1 were 72% and 24%, respectively. The recall in particular is very low, which means that the accuracy is high mainly due to the imbalance of the dataset. The F1-score (~37%) confirms this suspicion. Because of that, we'll store the class 1 F1-score as a variable, for further comparison between the models we'll build.
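#### The F1-score is the harmonic mean of precision and recall. Plugging in the rounded class 1 values quoted above gives roughly 0.36, in line with the reported value (the small gap presumably comes from rounding the inputs).

```python
# Rounded class 1 metrics quoted in the text above
precision_1 = 0.72
recall_1 = 0.24

# Harmonic mean of precision and recall
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print('class 1 F1-score: {:.2f}'.format(f1_1))
```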

#### Before trying to improve the baseline model, we'll submit these predictions to the Kaggle platform and check the results. Before doing that, we'll have to build the model using the entire training dataset.

## Retraining the baseline model on the whole dataset and submitting the results to Kaggle

In [None]:
# Creating a baseline XGBoost model

# XGBClassifier has no class_weight parameter, so it is not passed here
model_xgb = XGBClassifier(max_depth=7,
                          random_state=42)

In [None]:
# Training the model on the entire dataset

model_xgb.fit(X, y, eval_metric=f1_eval)

In [None]:
# Predicting on the test set (df_test)

y_pred = model_xgb.predict(df_test)

In [None]:
submission_1 = pd.DataFrame()

In [None]:
submission_1['ID_code'] = df_test_ids.copy()
submission_1['target'] = y_pred

In [None]:
submission_1.head()

In [None]:
submission_1.to_csv('final_baseline_model.csv', index=False)

#### The model results on Kaggle were:

#### Score: 0.61632

#### Public score: 0.62048

In [None]:
submission_score = 0.61632
submission_public_score = 0.62048

In [None]:
# Creating a dictionary to compare submissions performances

dic = {
    'Submission Score': submission_score,
    'Submission Public Score': submission_public_score}

In [None]:
# Transforming the dictionary into a dataframe

df_submissions = pd.DataFrame(dic, index=['Baseline Model (XGBoost)'])

In [None]:
df_submissions

#### The next attempt will be checking if there's any room for improvement just by tuning a few hyperparameters. While building this kernel, GridSearchCV and RandomizedSearchCV were both used to search for the best hyperparameters, but training took too long and the iterative tuning process was too computationally expensive. In the end we decided to use the hyperparameter setup from this [kernel](https://www.kaggle.com/code/ricksun/xgboost-stratifiedkfold-for-beginner/notebook).
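#### To give a sense of why an exhaustive search gets expensive, the sketch below counts the model fits a grid search would need. The search space is purely hypothetical (it is not the grid that was actually tried); every candidate is fitted once per cross-validation fold.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical search space, for illustration only
param_grid = {
    'max_depth': [2, 4, 6],
    'learning_rate': [0.01, 0.02, 0.1],
    'colsample_bytree': [0.5, 0.7, 1.0],
    'scale_pos_weight': [1, 9, 12],
}
n_folds = 5

# Number of hyperparameter combinations and total number of fits
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * n_folds

print('{} candidates x {} folds = {} fits'.format(n_candidates, n_folds, n_fits))
```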

<a id="section-six"></a>
# Tuned XGBoost model

In [None]:
# creating a 5 fold stratifiedkfold cross-validation object

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

In [None]:
# Creating a tuned XGBoost model

model_xgb =  XGBClassifier(max_depth=2,
                           colsample_bytree=0.7,
                           n_estimators=20000,
                           learning_rate=0.02,
                           objective='binary:logistic', 
                           verbosity =1,
                           eval_metric  = f1_eval,
                           tree_method='gpu_hist',
                           n_jobs=-1)

In [None]:
# Analyzing the model using 5-fold cross-validation

scores = cross_val_score(model_xgb, X_train, y_train, scoring='roc_auc', cv=cv, n_jobs=-1)

In [None]:
print('5-fold Cross validation Roc-auc and std: {:.2f} ({:.4f})'.format(np.mean(scores), np.std(scores)))

In [None]:
# creating a variable for the tuned model cross validation roc_auc

tuned_roc_auc = round(np.mean(scores), 4)

In [None]:
# Training the model

model_xgb.fit(X_train, y_train)

In [None]:
# Predicting the model on the test set

y_pred = model_xgb.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

#### As we can see, the model performance had a significant improvement, just by tuning the hyperparameters. Now we should check the confusion matrix.

In [None]:
# Creating a variable for the classification report

report = classification_report(y_test, y_pred, output_dict=True)

In [None]:
# Creating variables for precision, recall, f1score, accuracy, roc_auc and class1 f1score

tuned_precision = round(report['macro avg']['precision'], 4)
tuned_recall = round(report['macro avg']['recall'], 4)
tuned_f1score = round(report['macro avg']['f1-score'], 4)
tuned_accuracy = round(report['accuracy'], 4)
tuned_roc_auc = round(roc_auc_score(y_test, y_pred), 4)
tuned_class1_f1score = round(report['1']['f1-score'], 4)

In [None]:
# Checking confusion matrix

target_names = ['0', '1']

plt.figure(figsize=(6, 5))
ax= plt.subplot()
sns.heatmap(confusion_matrix(y_test, y_pred),
            annot=True, fmt='g', ax=ax)

ax.set_xlabel('Predicted values', fontsize=14)
ax.set_ylabel('Expected values', fontsize=14)
ax.set_title('Confusion Matrix', fontsize=14)

ax.xaxis.set_ticklabels(target_names, fontsize=14)
ax.yaxis.set_ticklabels(target_names, fontsize=14)
#plt.savefig('baseline_model_heatmap.png', dpi=500)
plt.show()

#### The confusion matrix above shows that there was a significant improvement on the class 1 true positives, which can be observed on the F1-scores performances.

In [None]:
# Adding these results to the df_performance dataframe, for further comparison between models

df_performance.loc['Tuned XGBoost', :] = [tuned_precision,
                                          tuned_recall,
                                          tuned_f1score,
                                          tuned_accuracy,
                                          tuned_roc_auc,
                                          tuned_class1_f1score,
                                          # tuned_roc_auc was overwritten with the test value,
                                          # so the validation ROC-AUC is taken from the CV scores
                                          round(np.mean(scores), 4)]

In [None]:
df_performance

#### Now we'll do another submission on Kaggle and check if there's any improvement on the overall scores.

## Retraining the Tuned XGBoost model on the whole dataset and submitting the results to Kaggle

In [None]:
# Creating a tuned XGBoost model

model_xgb =  XGBClassifier(max_depth=2,
                           colsample_bytree=0.7,
                           n_estimators=20000,
                           learning_rate=0.02,
                           objective='binary:logistic', 
                           verbosity =1,
                           eval_metric  = 'auc',
                           tree_method='gpu_hist',
                           n_jobs=-1)

In [None]:
# Training the model on the entire dataset

model_xgb.fit(X, y, eval_metric=f1_eval)

In [None]:
# Predicting on test set (df_test)

y_pred = model_xgb.predict(df_test)

In [None]:
submission_2 = pd.DataFrame()

In [None]:
submission_2['ID_code'] = df_test_ids.copy()
submission_2['target'] = y_pred

In [None]:
submission_2.head()

In [None]:
submission_2.to_csv('final_submission_tuned.csv', index=False)

#### The model results on Kaggle were:

#### Score: 0.68044

#### Public score: 0.68329

In [None]:
submission_score = 0.68044
submission_public_score = 0.68329

In [None]:
# Adding these results to the df_submissions dataframe, for further comparison between models

df_submissions.loc['Tuned Model (XGBoost)', :] = [submission_score,
                                                  submission_public_score]

In [None]:
df_submissions

#### The tuned model had better performance than our baseline model.

#### The next attempt will be tuning the scale_pos_weight hyperparameter, which scales the weight of the positive (minority) class during training, making the model more cost-sensitive to errors on class 1.

<a id="section-seven"></a>
# Tuned weighted XGBoost Model

In [None]:
# Checking the proportion between majority and minority classes

counter = Counter(y)
weight = round(counter[0] / counter[1])
weight

#### As we can see, the majority class is about nine times more frequent than the minority class, so that's the weighting factor that we'll use to minimize the impact of class imbalance.
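#### The idea behind this weighting can be sketched with a weighted log loss: errors on positive samples are multiplied by the weighting factor, so they cost more than equally wrong predictions on negatives. This is a simplified illustration of the concept, not XGBoost's internal implementation, and `weighted_log_loss` is a hypothetical helper.

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, pos_weight=1.0):
    '''Per-sample binary log loss, with positive samples multiplied by pos_weight'''
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    loss = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    weights = np.where(y_true == 1, pos_weight, 1.0)
    return weights * loss

# One positive and one negative sample, both predicted equally badly
y_toy = [1, 0]
p_toy = [0.1, 0.9]

unweighted = weighted_log_loss(y_toy, p_toy)
weighted = weighted_log_loss(y_toy, p_toy, pos_weight=9)

print('unweighted losses:', unweighted.round(3))
print('weighted losses:  ', weighted.round(3))
```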

In [None]:
# creating a 5 fold stratifiedkfold cross-validation object

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

In [None]:
# Creating a tuned XGBoost model and setting scale_pos_weight to 9

model_xgb =  XGBClassifier(max_depth=2,
                           colsample_bytree=0.7,
                           n_estimators=20000,
                           learning_rate=0.02,
                           scale_pos_weight=weight,
                           objective='binary:logistic', 
                           verbosity =1,
                           eval_metric  = 'auc',
                           tree_method='gpu_hist',
                           n_jobs=-1)

In [None]:
# Analyzing the model using 5-fold cross-validation

scores = cross_val_score(model_xgb, X_train, y_train, scoring='roc_auc', cv=cv, n_jobs=-1)

In [None]:
print('5-fold Cross validation Roc-auc and std: {:.2f} ({:.4f})'.format(np.mean(scores), np.std(scores)))

In [None]:
# creating a variable for the tuned weighted model cross validation roc_auc

tuned_scale_weight_roc_auc = round(np.mean(scores), 4)

In [None]:
# Training the model

model_xgb.fit(X_train, y_train, eval_metric=f1_eval)

In [None]:
# Predicting the model on the test set (X_test)

y_pred = model_xgb.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

#### As we can see, the model performance had a slight improvement when we minimized the impact of class imbalance. Now we should check the confusion matrix.

In [None]:
# Creating a variable for the classification report

report = classification_report(y_test, y_pred, output_dict=True)

In [None]:
# Creating variables for precision, recall, f1score, accuracy, roc_auc and class1 f1score

tuned_scale_weight_precision = round(report['macro avg']['precision'], 4)
tuned_scale_weight_recall = round(report['macro avg']['recall'], 4)
tuned_scale_weight_f1score = round(report['macro avg']['f1-score'], 4)
tuned_scale_weight_accuracy = round(report['accuracy'], 4)
tuned_scale_weight_roc_auc = round(roc_auc_score(y_test, y_pred), 4)
tuned_scale_weight_class1_f1score = round(report['1']['f1-score'], 4)

In [None]:
# Checking confusion matrix

target_names = ['0', '1']

plt.figure(figsize=(6, 5))
ax= plt.subplot()
sns.heatmap(confusion_matrix(y_test, y_pred),
            annot=True, fmt='g', ax=ax)

ax.set_xlabel('Predicted values', fontsize=14)
ax.set_ylabel('Expected values', fontsize=14)
ax.set_title('Confusion Matrix', fontsize=14)

ax.xaxis.set_ticklabels(target_names, fontsize=14)
ax.yaxis.set_ticklabels(target_names, fontsize=14)
plt.show()

#### Despite the fact that the confusion matrix above seems to show that the model's performance got worse, we should check the performance on the test dataset, because there are a few hints on the classification report that could be related to a better overall performance, such as the validation and test roc_auc parameters.

In [None]:
df_performance.loc['Tuned Weighted XGBoost', :] = [tuned_scale_weight_precision,
                                                   tuned_scale_weight_recall,
                                                   tuned_scale_weight_f1score,
                                                   tuned_scale_weight_accuracy,
                                                   tuned_scale_weight_roc_auc,
                                                   tuned_scale_weight_class1_f1score,
                                                   # the variable above now holds the test value,
                                                   # so the validation ROC-AUC is taken from the CV scores
                                                   round(np.mean(scores), 4)]

In [None]:
df_performance

#### Now we'll do another submission on Kaggle and check if there's any improvement on the overall scores.

## Rebuilding the Tuned Weighted XGBoost model on the whole dataset and submitting the results to Kaggle

In [None]:
# Creating a tuned XGBoost model and setting scale_pos_weight to 9

model_xgb =  XGBClassifier(max_depth=2,
                           colsample_bytree=0.7,
                           n_estimators=20000,
                           scale_pos_weight = weight,
                           learning_rate=0.02,
                           objective='binary:logistic', 
                           verbosity =1,
                           eval_metric  = 'auc',
                           tree_method='gpu_hist',
                           n_jobs=-1)

In [None]:
# Training the model on the entire dataset

model_xgb.fit(X, y, eval_metric=f1_eval)

In [None]:
y_pred = model_xgb.predict(df_test)

In [None]:
submission_2 = pd.DataFrame()

In [None]:
submission_2['ID_code'] = df_test_ids.copy()
submission_2['target'] = y_pred

In [None]:
submission_2.head()

In [None]:
submission_2.to_csv('final_submission_tuned_weighted_9.csv', index=False)

#### The model results on Kaggle were:

#### Score: 0.80703

#### Public score: 0.81134

In [None]:
submission_score = 0.80703
submission_public_score = 0.81134

In [None]:
# Adding these results to the df_submissions dataframe, for further comparison between models

df_submissions.loc['Tuned Weighted Model (XGBoost) - weight 9', :] = [submission_score,
                                                                      submission_public_score]

In [None]:
df_submissions

#### Now we finally obtained a better performance, scoring about 81% on the Kaggle submission platform.

#### After a few more submissions with different scale_pos_weight values, to check whether any of them performs better, we got the following results:

In [None]:
# weight 11

submission_score_11 = 0.80956

submission_public_score_11 = 0.81494

# weight 12

submission_score_12 = 0.81040

submission_public_score_12 = 0.81520

# weight 13

submission_score_13 = 0.80975

submission_public_score_13 = 0.81488

# weight 14

submission_score_14 = 0.80734

submission_public_score_14 = 0.81356

In [None]:
# Adding these results to the df_submissions dataframe, for further comparison between models

df_submissions.loc['Tuned Weighted Model (XGBoost) - weight 11', :] = [submission_score_11,
                                                                       submission_public_score_11]

df_submissions.loc['Tuned Weighted Model (XGBoost) - weight 12', :] = [submission_score_12,
                                                                       submission_public_score_12]

df_submissions.loc['Tuned Weighted Model (XGBoost) - weight 13', :] = [submission_score_13,
                                                                       submission_public_score_13]

df_submissions.loc['Tuned Weighted Model (XGBoost) - weight 14', :] = [submission_score_14,
                                                                       submission_public_score_14]

In [None]:
df_submissions

#### As we can see on the dataframe above, the best performance was obtained using 12 as weighting factor.
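#### The same conclusion can be checked programmatically. The sketch below rebuilds the scores reported above in a small dataframe indexed by weight (the name `scores_by_weight` is introduced here for illustration) and looks up the best weight with `idxmax`.

```python
import pandas as pd

# Scores reported above, indexed by the scale_pos_weight value that produced them
scores_by_weight = pd.DataFrame(
    {'Submission Score': [0.80703, 0.80956, 0.81040, 0.80975, 0.80734],
     'Submission Public Score': [0.81134, 0.81494, 0.81520, 0.81488, 0.81356]},
    index=[9, 11, 12, 13, 14])
scores_by_weight.index.name = 'scale_pos_weight'

# Weight with the highest score in each column
best_weight = scores_by_weight['Submission Score'].idxmax()
best_public_weight = scores_by_weight['Submission Public Score'].idxmax()

print('best weight (score): {}, best weight (public score): {}'.format(
    best_weight, best_public_weight))
```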

#### The next step will be trying to improve our model by doing some feature engineering.

<a id="section-eight"></a>
# Feature Engineering

#### It's well documented amongst top Kagglers and in the best kernels of this competition that this dataset has so-called "magic numbers": the number of times each value occurs within its variable. Thus, we'll add a new "magic feature" for each variable. A nice explanation of these magic features can be seen [here](https://www.kaggle.com/code/felipemello/step-by-step-guide-to-the-magic-lb-0-922) and [here](https://www.kaggle.com/code/jganzabal/trying-to-understand-why-magic-counts-works).

#### Before starting the feature engineering, there's another relevant aspect of this dataset that must be explained. Many Kagglers found out that the test set consists of real samples as well as synthetic samples. To tell them apart, we'll check whether each sample has feature values that are unique within the test set. If a sample has at least one unique feature value, it is considered a real sample; if it has no unique values, it is considered a synthetic sample. Top Kagglers found out that counting values only over the real samples of the test set raised the model scores.

#### A good explanation about this topic can be found [here](https://www.kaggle.com/code/yag320/list-of-fake-samples-and-public-private-lb-split/notebook).

<a id="subsection-one"></a>
## Identifying magic numbers

#### The next step will be splitting the real and fake test data samples, so that we can properly count how often each value occurs in each variable.

In [None]:
df_test = pd.read_csv('../input/santander-customer-transaction-prediction/test.csv')
df_train = pd.read_csv('../input/santander-customer-transaction-prediction/train.csv')

In [None]:
train = df_train.copy()
test = df_test.copy()

In [None]:
# Creating a variable for column names

col_names = [f'var_{i}' for i in range(200)]

In [None]:
# Identifying unique values on the test set

for col in col_names:
    count = test[col].value_counts()
    uniques = count.index[count == 1]
    test[col + "_u"] = test[col].isin(uniques)

In [None]:
# Creating a column that tells if a sample has at least one unique value

test['has_unique'] = test[[col + '_u' for col in col_names]].any(axis=1)

In [None]:
# Creating variables for the real and fake test samples

real_test = test.loc[test['has_unique'], ['ID_code'] + col_names]
fake_test = test.loc[~test['has_unique'], ['ID_code'] + col_names]

In [None]:
# Checking how many real and fake samples were identified

real_test.shape, fake_test.shape

#### The numbers above are quite interesting, because they show that half of the test set consists of synthetic samples.
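#### The detection rule is easier to follow on a tiny example. In the toy frame below (made-up values, two columns only), the value 3.0 occurs once in `var_0`, so only the last row has a unique value and would be treated as a real sample.

```python
import pandas as pd

# Toy test set (hypothetical values)
toy_test = pd.DataFrame({
    'var_0': [1.0, 2.0, 2.0, 1.0, 3.0],
    'var_1': [5.0, 5.0, 6.0, 6.0, 5.0],
})

# For each column, flag the cells whose value occurs exactly once in that column
is_unique = pd.DataFrame(index=toy_test.index)
for col in toy_test.columns:
    counts = toy_test[col].value_counts()
    is_unique[col] = toy_test[col].map(counts) == 1

# A row is treated as real if at least one of its values is unique
has_unique = is_unique.any(axis=1)
print(has_unique.tolist())
```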

In [None]:
# Merging the original training set to the real samples extracted from the test set

train_and_test = pd.concat([train, real_test], axis=0)

In [None]:
train_df = df_train.copy()
test_df = df_test.copy()

In [None]:
# Creating new columns with the unique values count for each variable

for feat in ['var_' + str(x) for x in range(200)]:
    count_values = train_and_test.groupby(feat)[feat].count()
    train_df['new_' + feat] = count_values.loc[train_df[feat]].values
    test_df['new_' + feat] = count_values.loc[test_df[feat]].values
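#### A toy version of the count feature, assuming a single column and made-up values. It uses `Series.map` as an alternative lookup: unlike `.loc`, it returns NaN instead of raising for values that are absent from the counted pool, which are then filled with 0 here.

```python
import pandas as pd

# Toy frames (hypothetical values)
toy_train = pd.DataFrame({'var_0': [0.5, 0.5, 1.2, 3.4]})
toy_real_test = pd.DataFrame({'var_0': [0.5, 1.2]})
toy_other = pd.DataFrame({'var_0': [3.4, 7.7]})  # 7.7 never appears in the pool

# Counts are taken over the training set plus the real test samples
pool = pd.concat([toy_train, toy_real_test], axis=0)
counts = pool['var_0'].value_counts()

# Count feature: how often each value occurs in the pool
toy_train['new_var_0'] = toy_train['var_0'].map(counts)
toy_other['new_var_0'] = toy_other['var_0'].map(counts).fillna(0).astype(int)

print(toy_train)
print(toy_other)
```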

In [None]:
# Dropping the ID_code column

test_df_final = test_df.drop('ID_code', axis=1).copy()
test_df_final.head()

In [None]:
# Dropping ID_code and target columns

train_df_final = train_df.drop(['ID_code', 'target'], axis=1).copy()
train_df_final.head()

In [None]:
X = train_df_final.copy()
y = train_df.target.copy()

In [None]:
# Splitting the training dataset into train and test subsets

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

<a id="section-ten"></a>
# Training the best model on the magic numbers dataset

In [None]:
# creating a 5 fold stratifiedkfold cross-validation object

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

In [None]:
# Creating a tuned XGBoost model and setting scale_pos_weight to 12

model_xgb =  XGBClassifier(max_depth=2,
                           colsample_bytree=0.7,
                           n_estimators=20000,
                           learning_rate=0.02,
                           scale_pos_weight=12,
                           objective='binary:logistic', 
                           verbosity =1,
                           eval_metric  = 'auc',
                           tree_method='gpu_hist',
                           n_jobs=-1)

In [None]:
# Analyzing the model using 5-fold cross-validation

scores = cross_val_score(model_xgb, X_train, y_train, scoring='roc_auc', cv=cv, n_jobs=-1)

In [None]:
print('5-fold Cross validation Roc-auc and std: {:.2f} ({:.4f})'.format(np.mean(scores), np.std(scores)))

In [None]:
# creating a variable for the tuned weighted model cross validation roc_auc

tuned_scale_weight_roc_auc = round(np.mean(scores), 4)

In [None]:
# Training the model

model_xgb.fit(X_train, y_train, eval_metric=f1_eval)

In [None]:
# Predicting the model on the test set (X_test)

y_pred = model_xgb.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

#### As we can see, the model performance improved after adding the magic features. Now we should check the confusion matrix.

In [None]:
# Creating a variable for the classification report

report = classification_report(y_test, y_pred, output_dict=True)

In [None]:
# Creating variables for precision, recall, f1score, accuracy, roc_auc and class1 f1score

tuned_scale_weight_precision = round(report['macro avg']['precision'], 4)
tuned_scale_weight_recall = round(report['macro avg']['recall'], 4)
tuned_scale_weight_f1score = round(report['macro avg']['f1-score'], 4)
tuned_scale_weight_accuracy = round(report['accuracy'], 4)
tuned_scale_weight_roc_auc = round(roc_auc_score(y_test, y_pred), 4)
tuned_scale_weight_class1_f1score = round(report['1']['f1-score'], 4)

In [None]:
# Checking confusion matrix

target_names = ['0', '1']

plt.figure(figsize=(6, 5))
ax= plt.subplot()
sns.heatmap(confusion_matrix(y_test, y_pred),
            annot=True, fmt='g', ax=ax)

ax.set_xlabel('Predicted values', fontsize=14)
ax.set_ylabel('Expected values', fontsize=14)
ax.set_title('Confusion Matrix', fontsize=14)

ax.xaxis.set_ticklabels(target_names, fontsize=14)
ax.yaxis.set_ticklabels(target_names, fontsize=14)
plt.show()

#### The confusion matrix above shows that there was a significant improvement on the class 1 true positives.

In [None]:
df_performance.loc['Tuned Weighted XGBoost (Magic Numbers)', :] = [tuned_scale_weight_precision,
                                                                   tuned_scale_weight_recall,
                                                                   tuned_scale_weight_f1score,
                                                                   tuned_scale_weight_accuracy,
                                                                   tuned_scale_weight_roc_auc,
                                                                   tuned_scale_weight_class1_f1score,
                                                                   # the variable above now holds the test value,
                                                                   # so the validation ROC-AUC is taken from the CV scores
                                                                   round(np.mean(scores), 4)]

In [None]:
df_performance

#### The comparison between models confirms that the magic numbers really play an important role in improving the overall performance of our classification models.

#### Now we'll do another submission on Kaggle and check if there's any improvement on the overall scores.

## Rebuilding the Tuned Weighted XGBoost model on the whole dataset and submitting the results to Kaggle

In [None]:
# Creating a tuned XGBoost model; scale_pos_weight was varied between submissions (15 in this run)

model_xgb =  XGBClassifier(max_depth=2,
                           colsample_bytree=0.7,
                           n_estimators=20000,
                           scale_pos_weight = 15,
                           learning_rate=0.02,
                           objective='binary:logistic', 
                           verbosity =1,
                           eval_metric  = 'auc',
                           tree_method='gpu_hist',
                           n_jobs=-1)

In [None]:
# Training the model on the entire dataset

model_xgb.fit(X, y, eval_metric='auc')

In [None]:
y_pred = model_xgb.predict(test_df_final)

In [None]:
submission_2 = pd.DataFrame()

In [None]:
submission_2['ID_code'] = df_test_ids.copy()
submission_2['target'] = y_pred

In [None]:
submission_2.head()

In [None]:
submission_2.to_csv('final_submission_tuned_weighted_13_magic_numbers_child.csv', index=False)

#### The model results on kaggle were:

#### Score: 0.82466

#### Public score: 0.83206
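
#### A side note on these scores: the competition is evaluated on the area under the ROC curve, which depends on how the predictions rank the rows. Submitting hard 0/1 labels from predict creates many ties, and that usually gives a lower AUC than submitting the class 1 probabilities from predict_proba. The sketch below illustrates the effect with made-up numbers; pairwise_auc is a hypothetical helper written only for this illustration (in practice roc_auc_score does the same job).

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_toy = [0, 0, 1, 1, 1]                                # made-up targets
proba_toy = [0.1, 0.3, 0.2, 0.4, 0.9]                  # made-up class 1 probabilities
labels_toy = [1 if p > 0.5 else 0 for p in proba_toy]  # hard labels at a 0.5 threshold

auc_from_probas = pairwise_auc(y_toy, proba_toy)
auc_from_labels = pairwise_auc(y_toy, labels_toy)
print(auc_from_probas, auc_from_labels)
```

#### On this toy example the probabilities rank more pairs correctly than the thresholded labels do. Whether the same holds for the submissions in this kernel was not tested here.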

#### Now we'll do a few tests, changing the scale_pos_weight hyperparameter and checking the scores on the test data.

In [None]:
# weight 9

submission_score_9 = 0.82216

submission_public_score_9 = 0.82980

# weight 11

submission_score_11 = 0.82359

submission_public_score_11 = 0.83247

# weight 12

submission_score_12 = 0.82466

submission_public_score_12 = 0.83206

# weight 13

submission_score_13 = 0.82448

submission_public_score_13 = 0.83250

# weight 14

submission_score_14 = 0.82533

submission_public_score_14 = 0.83196

# weight 15

submission_score_15 = 0.82465

submission_public_score_15 = 0.83010

In [None]:
# Adding these results to the df_submissions dataframe, for further comparison between models

df_submissions.loc['Magic Numbers (XGBoost) - weight 9', :] = [submission_score_9,
                                                               submission_public_score_9]

df_submissions.loc['Magic Numbers (XGBoost) - weight 11', :] = [submission_score_11,
                                                                submission_public_score_11]

df_submissions.loc['Magic Numbers (XGBoost) - weight 12', :] = [submission_score_12,
                                                                submission_public_score_12]

df_submissions.loc['Magic Numbers (XGBoost) - weight 13', :] = [submission_score_13,
                                                                submission_public_score_13]

df_submissions.loc['Magic Numbers (XGBoost) - weight 14', :] = [submission_score_14,
                                                                submission_public_score_14]

df_submissions.loc['Magic Numbers (XGBoost) - weight 15', :] = [submission_score_15,
                                                                submission_public_score_15]

In [None]:
df_submissions

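#### As we can see, setting the scale_pos_weight hyperparameter to 13 gives the highest public score on the test data (0.83250), while 14 gives the highest value in the Score column (0.82533). The differences between weights are small, all within about 0.003.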
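
#### The same comparison can be done programmatically. The sketch below copies the scores reported above into a plain dictionary (scores_by_weight is a name introduced only for this illustration) and looks up the best weight for each score column.

```python
# Scores copied from the cells above: weight -> (score, public score)
scores_by_weight = {
    9:  (0.82216, 0.82980),
    11: (0.82359, 0.83247),
    12: (0.82466, 0.83206),
    13: (0.82448, 0.83250),
    14: (0.82533, 0.83196),
    15: (0.82465, 0.83010),
}

# Weight with the highest value in each column
best_by_score = max(scores_by_weight, key=lambda w: scores_by_weight[w][0])
best_by_public = max(scores_by_weight, key=lambda w: scores_by_weight[w][1])

print('Best weight by score:', best_by_score)
print('Best weight by public score:', best_by_public)
```

#### The two columns do not agree on a single best weight, so the choice depends on which score is used as the reference.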

#### The next step is to isolate the variables and train 200 different models, which means that each pair of variables (var_1 + new_var_1, var_2 + new_var_2, and so on) is used to train a single model. At the end, we'll try to ensemble the 200 models using a voting methodology. The idea comes from the fact that the variables are not correlated, something we noticed during the EDA process, and it was inspired by this [kernel](https://www.kaggle.com/code/cdeotte/200-magical-models-santander-0-920).

<a id="section-eleven"></a>
# Training and ensembling 200 different models

In [None]:
# The code below loops over the 200 original features and trains one model per pair (feature + its 'new_' counterpart)

train_probas4 = pd.DataFrame()
test_probas4 = pd.DataFrame()

train_accuracies4 = []
test_accuracies4 = []

train_preds4 = pd.DataFrame()
test_preds4 = pd.DataFrame()

models_dic4 = {}
models_list4 = list()

for i in range(200): # loop over all features
    features  = [X_train.columns[i], 'new_'+X_train.columns[i]] # selects a pair of features
    #features  = X_train.columns[i]
    print('')
    print('*'*24)
    print(features)
    print('*'*24)
    print('')
    
    model_xgb4 =  XGBClassifier(max_depth=2,
                              colsample_bytree=0.7,
                              n_estimators=20000,
                              scale_pos_weight = 9,
                              learning_rate=0.02,
                              objective='binary:logistic', 
                              verbosity =1,
                              eval_metric  = 'auc',
                              tree_method='gpu_hist',
                              n_jobs=-1)

    model_xgb4.fit(X_train[features], y_train, eval_metric=f1_eval)
        
    train_preds4[X_train.columns[i]] = model_xgb4.predict(X_train[features])
    train_probas4[X_train.columns[i]] = model_xgb4.predict_proba(X_train[features])[:, 0]
    train_accuracies4.append(accuracy_score(y_train, train_preds4[X_train.columns[i]]))
    
    test_preds4[X_train.columns[i]] = model_xgb4.predict(X_test[features])
    test_probas4[X_train.columns[i]] = model_xgb4.predict_proba(X_test[features])[:, 0]
    test_accuracies4.append(accuracy_score(y_test, test_preds4[X_test.columns[i]]))
    
    models_dic4['model_'+str(i)] = model_xgb4
    models_list4.append(('model_'+str(i), model_xgb4))
    
    print(accuracy_score(y_test, test_preds4[X_train.columns[i]]))

In [None]:
# Saving the models to pickle files

for i in range(200):
    model_name = 'model_'+str(i)
    pasta = r"C:\Users\U4R9\OneDrive - PETROBRAS\Repositorio\Desafio_Kaggle_BR\Modelx_xgb_magic_numbers"+'\\'
    filename = pasta+model_name+'.sav'
    # The context manager closes the file handle after each dump
    with open(filename, 'wb') as f:
        pickle.dump(models_dic4[model_name], f)

In [None]:
# Saving the test and train predictions to csv files

test_preds4.to_csv('test_preds4.csv')
train_preds4.to_csv('train_preds4.csv')
train_probas4.to_csv('train_probas4.csv')
test_probas4.to_csv('test_probas4.csv')

#### Now that we've built 200 different models, one for each pair of variables, we need to ensemble their predictions. The first attempt is a weighted average of the models' class probabilities, giving more weight to the most confident predictions, that is, probabilities far from 0.5, which can be verified in the test_probas4 dataframe. This is an iterative process that checks each instance (row) of both the test_preds4 and test_probas4 dataframes.
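
#### For reference, the same weighting scheme can be written without explicit Python loops. The sketch below is a NumPy version of the thresholds used in this kernel; BIN_EDGES, BIN_WEIGHTS and confidence_weighted_vote are names introduced only for this illustration, and the input is assumed to hold class 0 probabilities with one row per instance and one column per model.

```python
import numpy as np

# Bin edges and weights mirror the if/elif thresholds used in this kernel
BIN_EDGES = np.array([0.2, 0.3, 0.4, 0.6, 0.7, 0.8])
BIN_WEIGHTS = np.array([2.2, 1.9, 1.5, 1.0, 1.5, 1.9, 2.2])

def confidence_weighted_vote(class0_probas):
    """Weighted average of class 0 probabilities, one result per row."""
    p = np.asarray(class0_probas, dtype=float)
    # np.digitize returns the bin index of each probability (0 to 6)
    w = BIN_WEIGHTS[np.digitize(p, BIN_EDGES)]
    avg = (p * w).sum(axis=1) / w.sum(axis=1)
    # Class 0 is predicted when its averaged probability is above 0.5
    preds = np.where(avg > 0.5, 0, 1)
    return avg, preds

# Made-up probabilities: 2 instances, 3 models
toy_probas = [[0.9, 0.55, 0.45],
              [0.1, 0.50, 0.65]]
toy_avg, toy_preds = confidence_weighted_vote(toy_probas)
print(toy_avg, toy_preds)
```

#### With the dataframes of this kernel, the call would look like confidence_weighted_vote(test_probas4.to_numpy()).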

In [None]:
# Iterating through test_preds4 and test_probas4 dataframes
# Higher weights are given to the higher probabilities

final_preds = []
final_probas = []

for i in range(len(test_preds4)):
    weights = []
    for j in test_probas4.iloc[i]:
        if j >= 0.8:
            weights.append(2.2)
        elif 0.7 <= j <0.8:
            weights.append(1.9)
        elif 0.6 <= j <0.7:
            weights.append(1.5)
        elif 0.5 <= j <0.6:
            weights.append(1)
        elif 0.4 <= j <0.5:
            weights.append(1)
        elif 0.3 <= j <0.4:
            weights.append(1.5) 
        elif 0.2 <= j <0.3:
            weights.append(1.9) 
        else:
            weights.append(2.2)
            
    prediction_proba = sum(test_probas4.iloc[i] * weights) / sum(weights)
    final_probas.append(prediction_proba)
    if prediction_proba > 0.5:
        final_preds.append(0)
    else:
        final_preds.append(1)

In [None]:
# creating a variable for the ensemble roc_auc score

ensembling_roc_auc = round(roc_auc_score(y_test, final_preds), 4)

In [None]:
print(classification_report(y_test, final_preds))

In [None]:
# Creating a variable for the classification report

report = classification_report(y_test, final_preds, output_dict=True)

In [None]:
# Creating variables for precision, recall, f1score, accuracy, roc_auc and class1 f1score

ensembling_precision = round(report['macro avg']['precision'], 4)
ensembling_recall = round(report['macro avg']['recall'], 4)
ensembling_f1score = round(report['macro avg']['f1-score'], 4)
ensembling_accuracy = round(report['accuracy'], 4)
ensembling_roc_auc = round(roc_auc_score(y_test, final_preds), 4)
ensembling_class1_f1score = round(report['1']['f1-score'], 4)

In [None]:
# Checking confusion matrix

target_names = ['0', '1']

plt.figure(figsize=(6, 5))
ax= plt.subplot()
sns.heatmap(confusion_matrix(y_test, final_preds),
            annot=True, fmt='g', ax=ax)

ax.set_xlabel('Predicted values', fontsize=14)
ax.set_ylabel('Expected values', fontsize=14)
ax.set_title('Confusion Matrix', fontsize=14)

ax.xaxis.set_ticklabels(target_names, fontsize=14)
ax.yaxis.set_ticklabels(target_names, fontsize=14)
plt.show()

#### It seems that the performance didn't improve at all, based on the confusion matrix predictions.

In [None]:
df_performance.loc['Ensemble predictions', :] = [ensembling_precision,
                                                 ensembling_recall,
                                                 ensembling_f1score,
                                                 ensembling_accuracy,
                                                 ensembling_roc_auc,
                                                 ensembling_class1_f1score,
                                                 ensembling_roc_auc]

In [None]:
df_performance

#### Now we'll do another submission on kaggle and check if there's any improvement on the overall scores.

## Retraining the 200 different models on the whole dataset and submitting the results to kaggle

In [None]:
train_probas6 = pd.DataFrame()
train_accuracies6 = []
train_preds6 = pd.DataFrame()

models_dic6 = {}
models_list6 = list()

for i in range(200): # loop over all features
    features  = [X.columns[i], 'new_'+X.columns[i]]
    #features  = X_train.columns[i]
    print('')
    print('*'*24)
    print(features)
    print('*'*24)
    print('')
    
    model_xgb6 =  XGBClassifier(max_depth=2,
                              colsample_bytree=0.7,
                              n_estimators=20000,
                              scale_pos_weight = 9,
                              learning_rate=0.02,
                              objective='binary:logistic', 
                              verbosity =1,
                              eval_metric  = 'auc',
                              tree_method='gpu_hist',
                              n_jobs=-1)
    
    model_xgb6.fit(X[features], y, eval_metric=f1_eval)
        
    train_preds6[X.columns[i]] = model_xgb6.predict(X[features])
    train_probas6[X.columns[i]] = model_xgb6.predict_proba(X[features])[:, 0]
    train_accuracies6.append(accuracy_score(y, train_preds6[X.columns[i]]))
    
    models_dic6['model_'+str(i)] = model_xgb6
    models_list6.append(('model_'+str(i), model_xgb6))
    
    print(accuracy_score(y, train_preds6[X.columns[i]]))

In [None]:
# Iterating through train_probas6 (predictions for the training rows in X)
# Higher weights are given to the more confident probabilities

final_preds = []
final_probas = []

for i in range(len(train_preds6)):
    weights = []
    for j in train_probas6.iloc[i]:
        if j >= 0.8:
            weights.append(2.2)
        elif 0.7 <= j <0.8:
            weights.append(1.9)
        elif 0.6 <= j <0.7:
            weights.append(1.5)
        elif 0.5 <= j <0.6:
            weights.append(1)
        elif 0.4 <= j <0.5:
            weights.append(1)
        elif 0.3 <= j <0.4:
            weights.append(1.5) 
        elif 0.2 <= j <0.3:
            weights.append(1.9) 
        else:
            weights.append(2.2)
            
    prediction_proba = sum(train_probas6.iloc[i] * weights) / sum(weights)
    final_probas.append(prediction_proba)
    if prediction_proba > 0.5:
        final_preds.append(0)
    else:
        final_preds.append(1)

In [None]:
submission_2 = pd.DataFrame()

In [None]:
submission_2['ID_code'] = df_test_ids.copy()
submission_2['target'] = final_preds

In [None]:
submission_2.head()

In [None]:
submission_2.to_csv('final_submission_ensemble_magic.csv', index=False)

#### The model results on kaggle were:

#### Score: 0.50004

#### Public score: 0.49541

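#### Well, that's very disappointing: a score of about 0.5 is no better than random guessing. Note that the cell above builds final_preds from train_probas6, which holds predictions for the training rows in X and not for test_df_final, so the submitted labels are not predictions for the test IDs; this alone would explain a score near 0.5. Beyond that, even though the idea of using each variable to train a different model seems promising, the result could also be related to the choice of algorithm (XGBoost), which might not be suited to this kind of problem, or to a poor choice of weights in the final prediction calculation.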
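
#### A sketch of how the per-pair models could be applied to the rows that are actually being scored (for the submission, the test features). The function pair_model_probas and the layout of its models argument, a dictionary keyed by the base column name, are assumptions made for this illustration; they differ from the models_dic6 keys used above.

```python
import numpy as np
import pandas as pd

def pair_model_probas(models, frame, base_columns, prefix='new_'):
    """Return one column of class 1 probabilities per pair model.

    models: dict mapping a base column name to a fitted classifier
            that exposes predict_proba (hypothetical layout).
    frame:  the rows to score, containing both the base columns and
            their prefixed counterparts.
    """
    probas = {}
    for col in base_columns:
        features = [col, prefix + col]  # same pairing as in the training loop
        probas[col] = models[col].predict_proba(frame[features])[:, 1]
    return pd.DataFrame(probas, index=frame.index)
```

#### The resulting dataframe has one row per scored instance, so any ensembling step applied to it produces predictions aligned with the IDs of that same dataframe.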

In [None]:
ensemble_submission_score = 0.50004
ensemble_submission_public_score = 0.49541

In [None]:
# Adding these results to the df_submissions dataframe, for further comparison between models

df_submissions.loc['Ensemble 200 models', :] = [ensemble_submission_score,
                                                ensemble_submission_public_score]

In [None]:
df_submissions

#### The final attempt will be to perform the voting ensemble using scikit-learn's VotingClassifier.

<a id="section-twelve"></a>
# Ensemble 200 models using VotingClassifier

In [None]:
# Using test accuracies as weights

weights = []
for i in test_accuracies4:
    if i >= 0.6:
        weights.append(4)
    elif 0.5 <= i <0.6:
        weights.append(2)
    else:
        weights.append(1)

In [None]:
# Creating a votingclassifier object and fitting it to X_train and y_train

eclf = VotingClassifier(estimators=models_list4, voting='soft', weights=weights)
eclf.fit(X_train, y_train)
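
#### One detail worth keeping in mind here: VotingClassifier.fit does not reuse the models that were already fitted. It clones every estimator and refits the clone on all the columns of the data it receives, so the per-pair structure is not preserved. The sketch below shows this on toy data with LogisticRegression standing in for XGBoost; all the names ending in _demo are introduced only for this illustration, and the n_features_in_ attribute assumes a reasonably recent scikit-learn version.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))  # toy data with three columns
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# A "per-feature" model, fitted on the first column only
single_demo = LogisticRegression().fit(X_demo[:, [0]], y_demo)

# The ensemble clones single_demo and refits the clone on all of X_demo
eclf_demo = VotingClassifier(estimators=[('single', single_demo)], voting='soft')
eclf_demo.fit(X_demo, y_demo)

print(single_demo.n_features_in_)               # columns seen by the original model
print(eclf_demo.estimators_[0].n_features_in_)  # columns seen by the refitted clone
```

#### Keeping one model per pair of variables would therefore require combining the individual predict_proba outputs manually instead of relying on VotingClassifier.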

In [None]:
# Predicting on X_test

y_val_pred_weighted = eclf.predict_proba(X_test)
y_pred_voting_weighted = eclf.predict(X_test)

In [None]:
# creating a variable for the votingclassifier roc_auc score

ensembling_roc_auc = round(roc_auc_score(y_test, y_pred_voting_weighted), 4)

In [None]:
print(classification_report(y_test, y_pred_voting_weighted))

In [None]:
# Creating a variable for the classification report

report = classification_report(y_test, y_pred_voting_weighted, output_dict=True)

In [None]:
# Creating variables for precision, recall, f1score, accuracy, roc_auc and class1 f1score

voting_precision = round(report['macro avg']['precision'], 4)
voting_recall = round(report['macro avg']['recall'], 4)
voting_f1score = round(report['macro avg']['f1-score'], 4)
voting_accuracy = round(report['accuracy'], 4)
voting_roc_auc = round(roc_auc_score(y_test, y_pred_voting_weighted), 4)
voting_class1_f1score = round(report['1']['f1-score'], 4)

In [None]:
# Checking confusion matrix

target_names = ['0', '1']

plt.figure(figsize=(6, 5))
ax= plt.subplot()
sns.heatmap(confusion_matrix(y_test, y_pred_voting_weighted),
            annot=True, fmt='g', ax=ax)

ax.set_xlabel('Predicted values', fontsize=14)
ax.set_ylabel('Expected values', fontsize=14)
ax.set_title('Confusion Matrix', fontsize=14)

ax.xaxis.set_ticklabels(target_names, fontsize=14)
ax.yaxis.set_ticklabels(target_names, fontsize=14)
plt.show()

In [None]:
# Using the voting_* metrics computed above (not the ensembling_* ones from the previous section)
df_performance.loc['VotingClassifier predictions', :] = [voting_precision,
                                                         voting_recall,
                                                         voting_f1score,
                                                         voting_accuracy,
                                                         voting_roc_auc,
                                                         voting_class1_f1score,
                                                         voting_roc_auc]

In [None]:
df_performance

#### Now we'll do another submission on kaggle and check if there's any improvement on the overall scores.

## Retraining the VotingClassifier model on the whole dataset and submitting the results to kaggle

In [None]:
# Using train accuracies as weights

weights = []

for i in train_accuracies6:
    if i >= 0.6:
        weights.append(4)
    elif 0.5 <= i <0.6:
        weights.append(2)
    else:
        weights.append(1)

In [None]:
train_probas6 = pd.DataFrame()
train_accuracies6 = []
train_preds6 = pd.DataFrame()

models_dic6 = {}
models_list6 = list()

for i in range(200): # loop over all features
    features  = [X.columns[i], 'new_'+X.columns[i]]
    #features  = X_train.columns[i]
    print('')
    print('*'*24)
    print(features)
    print('*'*24)
    print('')
    
    model_xgb6 =  XGBClassifier(max_depth=2,
                              colsample_bytree=0.7,
                              n_estimators=20000,
                              scale_pos_weight = 9,
                              learning_rate=0.02,
                              objective='binary:logistic', 
                              verbosity =1,
                              eval_metric  = 'auc',
                              tree_method='gpu_hist',
                              n_jobs=-1)

    model_xgb6.fit(X[features], y, eval_metric=f1_eval)
        
    train_preds6[X.columns[i]] = model_xgb6.predict(X[features])
    train_probas6[X.columns[i]] = model_xgb6.predict_proba(X[features])[:, 0]
    train_accuracies6.append(accuracy_score(y, train_preds6[X.columns[i]]))
    
    models_dic6['model_'+str(i)] = model_xgb6
    models_list6.append(('model_'+str(i), model_xgb6))
    
    print(accuracy_score(y, train_preds6[X.columns[i]]))

In [None]:
# Creating a votingclassifier object and fitting it to X and y

eclf = VotingClassifier(estimators=models_list6, voting='soft', weights=weights)
eclf.fit(X, y)

In [None]:
# Predicting on test_df_final (test set)

y_val_pred_weighted = eclf.predict_proba(test_df_final)
y_pred_voting_weighted = eclf.predict(test_df_final)

In [None]:
submission_2 = pd.DataFrame()

In [None]:
submission_2['ID_code'] = df_test_ids.copy()
submission_2['target'] = y_pred_voting_weighted

In [None]:
submission_2.head()

In [None]:
submission_2.to_csv('final_submission_tuned_weighted_9_magic_numbers_votingclassifier.csv', index=False)

#### The model results on kaggle were:

#### Score: 0.82466

#### Public score: 0.83206

In [None]:
voting_submission_score = 0.82466
voting_submission_public_score = 0.83206

In [None]:
# Adding these results to the df_submissions dataframe, for further comparison between models

df_submissions.loc['Ensemble VotingClassifier 200 models', :] = [voting_submission_score,
                                                                 voting_submission_public_score]

In [None]:
df_submissions

<a id="section-thirteen"></a>
# Suggestions

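#### As we can see, the VotingClassifier didn't improve our overall score: the Kaggle scores reported above (0.82466 and 0.83206) are the same as those of the single tuned weighted XGBoost model.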

#### Here are a few suggestions for future tests based on this kernel:

* Optimizing weights of the 200 models in some other way;
* Training and tuning the 200 models individually using a gridsearch for each one of them, because the best hyperparameters might be different for each of the models;
* Testing the ensemble method using a meta-learner for final predictions;
* Testing other algorithms.
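
#### As an illustration of the first suggestion, one alternative to hand-picked weights is to add up the evidence of each model on the log-odds scale. This is only a sketch under a strong assumption: it treats the 200 variables as independent given the target (in the spirit of naive Bayes), which goes further than the lack of correlation seen in the EDA. The function combine_log_odds and its prior argument are hypothetical names; prior stands for the class 1 rate the individual models are calibrated to, which would need care for models trained with scale_pos_weight.

```python
import numpy as np

def combine_log_odds(class1_probas, prior):
    """Combine per-model class 1 probabilities by summing log-odds shifts.

    class1_probas: array of shape (n_instances, n_models).
    prior: class 1 rate shared by all the models (assumption).
    """
    # Clipping avoids infinite log-odds for probabilities of exactly 0 or 1
    p = np.clip(np.asarray(class1_probas, dtype=float), 1e-6, 1 - 1e-6)
    prior_logit = np.log(prior / (1 - prior))
    # Each model shifts the log-odds away from the prior by its own evidence
    shifts = np.log(p / (1 - p)) - prior_logit
    total = prior_logit + shifts.sum(axis=1)
    return 1.0 / (1.0 + np.exp(-total))

# Made-up example: two models that both raise the probability above a 10% prior
combined_demo = combine_log_odds([[0.5, 0.5]], prior=0.1)
print(combined_demo)
```

#### Models whose output equals the prior leave the combined probability unchanged, so only the variables that carry information for a given row move the final prediction.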

#### This is my first kernel, so I'd really appreciate it if you could comment and give me any advice.
#### Thanks!