<div style="background-color:rgba(5, 29, 31, 0.5);">
    <h1><center>Blending for Noobs</center></h1>
</div>

Hi, and welcome to a tutorial on how to make the best of the best in any competition.
For the last couple of months, I have been watching how competitors blend their best model submissions to push their scores up the leaderboard.

Though the TPS is strictly for learning, I am making this notebook to introduce myself and everyone else to the interesting world of blending.

I have taken the best submissions from the following models and public notebooks:
1. Single HGBM - https://www.kaggle.com/ankitkalauni/tps-21-oct-single-histgbm-0-85651
2. Single XGBoost - https://www.kaggle.com/mohammadkashifunique/tsp-single-xgboost-model
3. LightGBM - https://www.kaggle.com/mlanhenke/tps-10-lgbm-onemodel-threeseeds-blend
4. Stacking XGB, CB and LGB - https://www.kaggle.com/ankitkalauni/simple-overfitted-stacking-lgbm-xgb-cb

**Please do upvote their beautiful work and follow them for more :)**

<div style="background-color:rgba(5, 29, 31, 0.5);">
    <h1><center>Importing Packages and Submissions</center></h1>
</div>

In [None]:
# Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.express as px

# Outputs taken as inputs and their individual scores

pred_lgb = pd.read_csv('../input/tps-10-lgbm-onemodel-threeseeds-blend/random_seeds_blending_submission.csv') # 0.85644
pred_xgb = pd.read_csv('../input/tsp-single-xgboost-model/xgb.csv') # 0.85649
pred_hgb = pd.read_csv('../input/tps-21-oct-single-histgbm-0-85651/HistGBM.csv') # 0.85651
pred_stack = pd.read_csv('../input/noob-stacking-0-85654/LGBM_overfit.csv') # 0.85654

submission = pd.read_csv('../input/tps-oct-2021-single-lightgbm/submission.csv')

<div style="background-color:rgba(5, 29, 31, 0.5);">
    <h1><center>Make a DataFrame of all submissions</center></h1>
</div>

In [None]:
predictions = [pred_lgb, pred_xgb, pred_hgb, pred_stack]

results = pd.DataFrame()
for i, ds in enumerate(predictions):
    results[f'target_{i+1}'] = ds['target']

In [None]:
results.head()

In [None]:
# Correlation matrix - all four are very highly correlated with each other, as expected

results.corr()

In [None]:
# Plotting these 4 model predictions

hist_data = [pred_lgb.target, pred_xgb.target, pred_hgb.target, pred_stack.target]
group_labels = ['lgb', 'xgb', 'hgb', 'stack']
fig = ff.create_distplot(hist_data, group_labels, bin_size=0.3, show_hist=False, show_rug=False)
fig.show()

# The distribution of the stack predictions is quite different from the other three.

<div style="background-color:rgba(5, 29, 31, 0.5);">
    <h1><center>Weighted Averaging</center></h1>
</div>

I would also like to thank VASILEIOS KONSTANTAKOS for making an illuminating notebook on the same topic - https://www.kaggle.com/vkonstantakos/blending-weighted-average
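
As a rough illustration of the idea, the sketch below computes a weighted average of several prediction columns on made-up numbers. The helper name `weighted_blend`, the column names and the toy values are assumptions for illustration only; they are not taken from the notebooks referenced above.

```python
import numpy as np
import pandas as pd

def weighted_blend(preds: pd.DataFrame, weights) -> pd.Series:
    """Weighted average across columns; weights are rescaled to sum to 1."""
    w = np.asarray(weights, dtype=float)
    if w.shape[0] != preds.shape[1]:
        raise ValueError("need exactly one weight per prediction column")
    w = w / w.sum()
    # Row-wise dot product of the predictions with the normalised weights
    return pd.Series(preds.to_numpy() @ w, index=preds.index, name="target")

# Toy predictions from three hypothetical models (made-up values)
toy = pd.DataFrame({
    "model_a": [0.10, 0.80, 0.55],
    "model_b": [0.20, 0.90, 0.45],
    "model_c": [0.15, 0.70, 0.60],
})

blend = weighted_blend(toy, weights=[1, 2, 3])
print(blend)
```

Rescaling the weights inside the helper means they can be given as ranks (1, 2, 3, ...) without dividing by their sum by hand.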

In [None]:
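# Weights follow the ranking of the individual scores:
# the lowest-scoring submission (lgb) gets 1, the highest-scoring (stack) gets 4.
# Dividing by 10, the sum of the weights, keeps the blend on the original scale.

submission['target'] = (results['target_1']*1 + results['target_2']*2 +
                        results['target_3']*3 + results['target_4']*4)/10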

In [None]:
submission.head()

<div style="background-color:rgba(5, 29, 31, 0.5);">
    <h1><center>Normalising The Predictions</center></h1>
</div>
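
Min-max scaling maps the smallest prediction to 0 and the largest to 1 while leaving the order of the rows unchanged. A minimal sketch on made-up values follows; the helper name `min_max_scale` is hypothetical.

```python
import pandas as pd

def min_max_scale(s: pd.Series) -> pd.Series:
    """Rescale a series linearly to the [0, 1] range."""
    lo, hi = s.min(), s.max()
    if hi == lo:
        raise ValueError("cannot scale a constant series")
    return (s - lo) / (hi - lo)

# Toy predictions (made-up values)
toy = pd.Series([0.30, 0.45, 0.90, 0.60])
scaled = min_max_scale(toy)
print(scaled.tolist())

# The ranking of the rows is the same before and after scaling
print(toy.rank().equals(scaled.rank()))
```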

In [None]:
submission['target'] = (submission['target']-submission['target'].min())/(submission['target'].max()-submission['target'].min())

In [None]:
submission.head()

In [None]:
submission.to_csv('submission_weighted_average.csv',index=False) # gave a score of 0.85655

<div style="background-color:rgba(5, 29, 31, 0.5);">
    <h1><center>Min Max of Predictions</center></h1>
</div>

I first came up with this custom, yet questionable, method of aggregating predictions in the previous month's TPS - https://www.kaggle.com/c/tabular-playground-series-sep-2021/discussion/272825
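
The rule is: for each row, take the largest prediction when the mean of the models is above 0.5, and the smallest prediction otherwise. A small sketch of that rule on made-up values (the column names and numbers are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy predictions from two hypothetical models (made-up values)
toy = pd.DataFrame({
    "model_a": [0.10, 0.80, 0.35, 0.40],
    "model_b": [0.20, 0.90, 0.45, 0.70],
})

row_mean = toy.mean(axis=1)

# Row maximum where the mean is above 0.5, row minimum everywhere else
final = pd.Series(
    np.where(row_mean > 0.5, toy.max(axis=1), toy.min(axis=1)),
    index=toy.index,
    name="final",
)
print(final.tolist())
```

The effect is to push each row towards whichever extreme the models already lean to on average.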

In [None]:
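# Compute the row statistics over the four model columns only
model_cols = ['target_1', 'target_2', 'target_3', 'target_4']
results['mean'] = results[model_cols].mean(axis=1)
results['min'] = results[model_cols].min(axis=1)
results['max'] = results[model_cols].max(axis=1)

# Row maximum when the mean is above 0.5, row minimum otherwise (vectorised)
results['final'] = np.where(results['mean'] > 0.5, results['max'], results['min'])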
        
results.head()

In [None]:
submission['target'] = results['final']
submission.to_csv('submission_minmax.csv',index=False) # gave a score of 0.85639

<div style="background-color:rgba(5, 29, 31, 0.5);">
    <h1><center>Power Averaging</center></h1>
</div>

My immense gratitude to Edrick Kesuma for creating this notebook on the topic of Power Averaging, in the previous month's TPS - https://www.kaggle.com/edrickkesuma/power-averaging-is-your-friend
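
In power averaging, each prediction is raised to a power before the plain mean is taken. Since the predictions lie between 0 and 1, this pulls the blended values towards 0 but keeps their ordering within each model. The sketch below shows this on made-up values; the helper name `power_average` is hypothetical and not taken from the notebook linked above.

```python
import pandas as pd

def power_average(preds: pd.DataFrame, power: float) -> pd.Series:
    """Row-wise mean of the predictions after raising each one to `power`."""
    return (preds ** power).mean(axis=1)

# Toy predictions from two hypothetical models (made-up values)
toy = pd.DataFrame({
    "model_a": [0.10, 0.50, 0.90],
    "model_b": [0.20, 0.50, 0.80],
})

plain = toy.mean(axis=1)
powered = power_average(toy, power=4)
print(pd.DataFrame({"plain_mean": plain, "power_4": powered}))
```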

In [None]:
# It works best on highly correlated models

#power = 3 # gave a score of 0.85651
power = 4 # gave a score of 0.85651

submission['target'] = (results['target_1']**power +results['target_2']**power
                        +results['target_3']**power + results['target_4']**power)/4

In [None]:
submission.head()

In [None]:
submission.to_csv('submission_power_average.csv',index=False)

<div style="background-color:rgba(5, 29, 31, 0.5);">
    <h1><center>Geometric Mean</center></h1>
</div>

kailai brought up this method in the following discussion - https://www.kaggle.com/c/tabular-playground-series-oct-2021/discussion/276069
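
The geometric mean of n predictions is the n-th root of their product. One way to compute it is in log space, as the exponential of the mean of the logs. The sketch below does this on made-up values; the helper name `geometric_mean` is hypothetical, and it assumes every prediction is strictly positive.

```python
import numpy as np
import pandas as pd

def geometric_mean(preds: pd.DataFrame) -> pd.Series:
    """Row-wise geometric mean, computed in log space.

    Assumes every prediction is strictly positive.
    """
    return np.exp(np.log(preds).mean(axis=1))

# Toy predictions from two hypothetical models (made-up values)
toy = pd.DataFrame({
    "model_a": [0.25, 0.81],
    "model_b": [0.64, 0.49],
})

gm = geometric_mean(toy)
print(gm)

# Same quantity written as the n-th root of the product
direct = toy.prod(axis=1) ** (1.0 / toy.shape[1])
print(np.allclose(gm, direct))
```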

In [None]:
# Normalising the predictions

predictions = [pred_lgb, pred_xgb, pred_hgb, pred_stack]

results = pd.DataFrame()
for i, ds in enumerate(predictions):
    ds['target'] = (ds['target'] - ds['target'].min())/(ds['target'].max() - ds['target'].min())
    results[f'target_{i+1}'] = ds['target']

In [None]:
results.head()

In [None]:
submission['target'] = (results['target_1']*results['target_2']*
                        results['target_3']*results['target_4'])**(1.0/4)

In [None]:
submission.head()

In [None]:
submission.to_csv('submission_geometric_mean.csv',index=False)
# gave a score of 0.85650 both with unscaled targets and after normalising the predictions

<div style="background-color:rgba(5, 29, 31, 0.5);">
    <h1><center>Summary and Take-aways</center></h1>
</div>

After trying these different ways of aggregating the predictions, here is a summary of the results and my next steps:

1. Weighted Average - 0.85652 - BEST
2. Power Averaging - 0.85651
3. Geometric Mean - 0.85650
4. MinMax aggregation - 0.85639 - WORST

The next step would be to re-tune some of the models I have mentioned (by changing the ranges in the Optuna study) or to drop one or two of the worst ones, keeping only the best for the next experiment.

**Do share with me any other methods I can try and upvote if you found this useful :)**