# Quick Notes

This is a starter package for the Ubiquant competition. It is an ensemble of 2 models: a multilayer perceptron and a light gradient boosting model (LightGBM).

Here is the link to the training script for the multilayer perceptron (with an encoder-decoder block for extra features):
https://www.kaggle.com/ragnar123/ubiquant-tf-training-baseline-with-gpu

This model achieves a CV score of 0.1490 and an LB score of 0.146. I did some experiments with a PCC (Pearson correlation coefficient) loss: CV gets a big boost, reaching 0.1515, but LB is worse. My best guess is that averaging the fold predictions may not be the correct way to blend models trained with a PCC loss.
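One way to probe that guess: Pearson correlation ignores the scale and offset of the predictions, so a model trained with a PCC loss is free to output values at any scale. If the folds end up on different scales, a plain average is dominated by the largest one. The snippet below is only a sketch on synthetic data, not what this notebook does; `standardized_blend` is a hypothetical helper and the two "folds" are made up for illustration.

```python
import numpy as np

def standardized_blend(fold_predictions):
    """Z-score each fold's predictions, then average (hypothetical helper)."""
    preds = np.asarray(fold_predictions, dtype=np.float64)
    mean = preds.mean(axis=1, keepdims=True)
    std = preds.std(axis=1, keepdims=True)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant folds
    return ((preds - mean) / std).mean(axis=0)

rng = np.random.default_rng(0)
target = rng.normal(size=1000)
# Two synthetic "folds" with the same signal quality but very different scales
fold_a = target + rng.normal(size=1000)
fold_b = 100.0 * (target + rng.normal(size=1000))

plain = np.mean([fold_a, fold_b], axis=0)
scaled = standardized_blend([fold_a, fold_b])

corr_plain = np.corrcoef(plain, target)[0, 1]
corr_scaled = np.corrcoef(scaled, target)[0, 1]
print(f'plain average: {corr_plain:.4f}, standardized average: {corr_scaled:.4f}')
```

In the inference loop each batch holds a single `time_id`, so a per-batch standardization like this would match the per-`time_id` metric. Whether it actually helps on the LB is untested here.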

On the other hand, here is the link for the LGBM model:
https://www.kaggle.com/ragnar123/ubiquant-lgbm-training-baseline

This model achieves a CV score of 0.1395, much worse than the DNN. Both models were trained with the same KFold strategy and the same folds, so blending them is leak free. Nevertheless, if we blend both models the CV score rises to 0.1556 and the LB score to 0.148.

# KFold Strategy

A lot of folks are talking about CV strategy. Let's say we have the following data points, which are ordered in time:

Train:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Test:
11, 12, 13

We could say that validating with 8, 9, 10 is good because we train on past data and validate on future data, so we should not expect any leak. The downside of this validation is that we are assuming that 8, 9, 10 will be similar to 11, 12, 13, which might not be the case, so early stopping will not be perfect.

In my notebooks I'm validating on all the periods of the train set using GroupKFold. The downside of this validation is that you train on future data and validate on past data, like this:

Train with 1, 2, 3, 4, 8, 9, 10
Validate with 5, 6, 7

We are using big windows to avoid leakage; nevertheless, we can't be certain that no leak occurs because we don't know how the features were built.
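To make the "big windows" idea concrete, here is a minimal sketch that splits the sorted `time_id` values into contiguous blocks and validates on one block at a time. It is an assumption about how such windows can be formed, not necessarily the exact split used in the training notebooks; `contiguous_time_folds` is a hypothetical helper and the data is a toy example.

```python
import numpy as np

def contiguous_time_folds(time_ids, n_splits=5):
    """Yield (train_idx, valid_idx) pairs where each validation fold is one
    contiguous block of time_ids (hypothetical helper)."""
    time_ids = np.asarray(time_ids)
    unique_ids = np.unique(time_ids)  # np.unique returns sorted values
    for block in np.array_split(unique_ids, n_splits):
        valid_mask = np.isin(time_ids, block)
        yield np.where(~valid_mask)[0], np.where(valid_mask)[0]

# Toy data: time_ids 1..10 with three rows per time_id
time_ids = np.repeat(np.arange(1, 11), 3)
folds = list(contiguous_time_folds(time_ids, n_splits=5))
for fold, (train_idx, valid_idx) in enumerate(folds):
    print(f'fold {fold}: validate on time_ids {np.unique(time_ids[valid_idx])}')
```

No `time_id` ever appears on both sides of a fold, but rows right next to a block boundary can still share information if the features were built with look-back windows, which is the uncertainty described above.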

The conclusion is that a correct validation strategy will require more EDA; nevertheless, in my experiments I have seen a nice correlation between CV and LB. This does not mean that we are good to go with a perfect validation strategy, but it is a reasonable starting point.

# Update

I improved the encoder-decoder multilayer perceptron by adding more features and CV got to 0.1512; this boosts the ensemble CV to 0.1582 and reaches LB 0.149. I used a private dataset because I don't have more GPU hours, but it is almost the same model with a few changes. Obviously there is still much room for improvement.

In [None]:
import pandas as pd
import numpy as np
import os
from scipy.stats import pearsonr
import tensorflow as tf
import tensorflow_addons as tfa
import lightgbm as lgb
from tensorflow.keras import backend as K
import joblib
import random
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 100)
import ubiquant

In [None]:
# Calculate pearson correlation coefficient
def pearson_coef(data):
    return data.corr()['target']['prediction']

# Calculate mean pearson correlation coefficient
def comp_metric(valid_df):
    return np.mean(valid_df.groupby(['time_id']).apply(pearson_coef))

# Calculate out of folds score blend
fc_oof = pd.read_csv('../input/ubiquant-external-models/ed_mlp.csv')
lgb_oof = pd.read_csv('../input/ubiquant-lgbm-training-baseline/simple_lgbm.csv')
score_fc = comp_metric(fc_oof)
score_lgb = comp_metric(lgb_oof)
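# Blend 70% fully connected, 30% light gradient boosting (relies on both OOF files sharing the same row order)
fc_oof['prediction'] = fc_oof['prediction'] * 0.7 + lgb_oof['prediction'] * 0.3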
score_blend = comp_metric(fc_oof)
print(f'Fully connected model score {score_fc}')
print(f'Light gradient boosting score {score_lgb}')
print(f'Blend score {score_blend}')

In [None]:
# Function to build our model
def build_model(shape):
    def fc_block(x, units, dropout):
        x = tf.keras.layers.Dropout(dropout)(x)
        x = tf.keras.layers.Dense(units, activation = 'swish')(x)
        return x
    # Input layer
    inp = tf.keras.layers.Input(shape = (shape,))
    # Encoder block
    encoder = tf.keras.layers.GaussianNoise(0.015)(inp)
    encoder = tf.keras.layers.Dense(192)(encoder)
    encoder = tf.keras.layers.Activation('swish')(encoder)
    # Decoder block to predict the input to generate more features
    decoder = tf.keras.layers.Dropout(0.05)(encoder)
    decoder = tf.keras.layers.Dense(shape, activation = 'linear', name = 'decoder')(decoder)
    # Autoencoder
    autoencoder = tf.keras.layers.Dense(256)(decoder)
    autoencoder = tf.keras.layers.Activation('swish')(autoencoder)
    autoencoder = tf.keras.layers.Dropout(0.40)(autoencoder)
    out_autoencoder = tf.keras.layers.Dense(1, activation = 'linear', name = 'autoencoder')(autoencoder)
    # Concatenate input and encoder output for extra features
    x = tf.keras.layers.Concatenate()([inp, encoder])
    x = fc_block(x, units = 1024, dropout = 0.4)
    x = fc_block(x, units = 512, dropout = 0.4)
    x = fc_block(x, units = 256, dropout = 0.4)
    output = tf.keras.layers.Dense(1, activation = 'linear', name = 'mlp')(x)
    model = tf.keras.models.Model(inputs = [inp], outputs = [decoder, out_autoencoder, output])
    return model

# Get our features list
dnn_features = list(np.load('../input/ubiquant-external-models/ed_mlp_features.npy'))
lgb_features = list(np.load('../input/ubiquant-lgbm-training-baseline/features.npy'))
corr_features = list(np.load('../input/ubiquant-external-models/ed_mlp_best_corr.npy'))
# Build 5 models and load 5 fold weights (tensorflow)
fc_models = []
for fold in range(1, 6):
    model = build_model(len(dnn_features))
    model.load_weights(f'../input/ubiquant-external-models/ed_mlp_{fold}.h5')
    fc_models.append(model)
# Load 5 light gradient boosting models
lgb_models = [joblib.load(f'../input/ubiquant-lgbm-training-baseline/lgbm_{fold}.pkl') for fold in range(1, 6)]
# Predict
env = ubiquant.make_env()
iter_test = env.iter_test()
for (test_df, sample_prediction_df) in iter_test:
    # Recover time_id from the first 4 characters of row_id, once per batch
    test_df['time_id'] = test_df['row_id'].str[0:4].astype(np.int64)
    for col in corr_features:
        # Add the per-time_id mean of each selected feature as a new feature
        mapper = test_df.groupby(['time_id'])[col].mean().to_dict()
        test_df[f'time_id_{col}'] = test_df['time_id'].map(mapper)
    fc_predictions = []
    lgb_predictions = []
    for model in fc_models:
        fc_predictions.append(model.predict(test_df[dnn_features])[2].reshape(-1))
    for model in lgb_models:
        lgb_predictions.append(model.predict(test_df[lgb_features]))
    # Blend 70% fc, 30% light gradient boosting (same weights as the OOF blend)
    predictions = np.average(fc_predictions, axis = 0) * 0.7 + np.average(lgb_predictions, axis = 0) * 0.3
    sample_prediction_df['target'] = predictions
    env.predict(sample_prediction_df)