# Blend Boosting Study on the Santander Value Prediction Challenge Dataset

Here I share a systematic blend boosting study on the dataset of the Santander Value Prediction Challenge (https://www.kaggle.com/c/santander-value-prediction-challenge). I simply collected 38 submission files from Kaggle, with scores up to 1.5 RMSLE.
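For reference, the scores quoted here are RMSLE values (root mean squared logarithmic error). A minimal sketch of the metric follows; the function name `rmsle` and the sample numbers are illustrative and not part of the original notebook.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    # log1p(x) = log(1 + x), so zero targets are handled without error
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Illustrative values only: a perfect prediction gives an error of zero
print(rmsle([100.0, 2000.0], [100.0, 2000.0]))
```

Because the error is computed on logarithms, it penalizes relative rather than absolute differences, which is why blending on this dataset is sensitive to the overall scale of the predictions.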

I start by analyzing the correlations between the submissions, then sort them by their average correlation with the other files. This lets me divide the 38 submissions into 5 subgroups. Within each subgroup I apply a linear calibration that takes the individual scores into account. Finally, I rebalance the weights between subgroups and resubmit to Kaggle to improve the score. Of course, with more time you can always achieve better scores, but I stop here because this is already the highest score on Kaggle ;-).
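The procedure can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: the data are random, and the cut points, inner weights, and outer weights are hypothetical. It also uses only positive, normalized weights, which is a simplification of the actual blend in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the submission table: 6 prediction columns, 200 rows.
# Column i gets more noise than column i - 1.
base = rng.lognormal(mean=14.0, sigma=1.0, size=200)
preds = pd.DataFrame(
    {str(i): base * rng.lognormal(0.0, 0.05 * (i + 1), size=200) for i in range(6)}
)

# Step 1: rank the files by their mean correlation with all files.
mean_corr = preds.corr().mean().sort_values(ascending=False)

# Step 2: cut the ranking into subgroups (hypothetical cut points).
order = mean_corr.index.tolist()
groups = [order[:2], order[2:5], order[5:]]

# Step 3: weighted average inside each subgroup (hypothetical inner weights).
inner_weights = [[1, 1], [2, 1, 1], [1]]
group_means = [
    sum(w * preds[c] for w, c in zip(ws, cols)) / sum(ws)
    for ws, cols in zip(inner_weights, groups)
]

# Step 4: weighted average across subgroups (hypothetical outer weights).
outer_weights = [3, 2, 1]
blend = sum(w * g for w, g in zip(outer_weights, group_means)) / sum(outer_weights)

print(mean_corr)
print(blend.head())
```

Dividing by the sum of the weights at each level keeps the blend on the same scale as the input predictions, so tuning one weight does not silently rescale the whole submission.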

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# loading dummy submission file
sub_file = pd.read_csv(r'../input/santander-value-prediction-challenge/sample_submission.csv')

# loading data including 38 best scores
df_sub = pd.read_csv('../input/santander-value-pred/best_blend_1.csv')

# a rough correlation-based visualization of the 38 best scores
plt.figure(figsize=(10,10))
sns.heatmap(df_sub.iloc[:,:-1].corr(), cmap='Spectral')
plt.ylabel('file index numbers')
plt.xlabel('file index numbers')
plt.show()

In [None]:
# basic analysis and visualization of the subgroups, each in a different color
plt.figure(figsize=(12, 5))
df_mean_corr = pd.DataFrame({'mean_corr': df_sub.iloc[:,:-1].corr().mean()})
df_mean_corr = df_mean_corr.sort_values('mean_corr', ascending=False)
df_mean_corr = df_mean_corr.reset_index()
plt.plot(df_mean_corr.index[:3], df_mean_corr['mean_corr'].values[:3], 'o', ms=10)
plt.plot(df_mean_corr.index[3:18], df_mean_corr['mean_corr'].values[3:18], 'o', ms=10)
plt.plot(df_mean_corr.index[18:29], df_mean_corr['mean_corr'].values[18:29], 'o', ms=10)
plt.plot(df_mean_corr.index[29:35], df_mean_corr['mean_corr'].values[29:35], 'o', ms=10)
plt.plot(df_mean_corr.index[35:37], df_mean_corr['mean_corr'].values[35:37], 'o', ms=10)
plt.plot(df_mean_corr.index[37:], df_mean_corr['mean_corr'].values[37:], 'o', ms=10)
plt.xticks([*range(len(df_mean_corr))], df_mean_corr['index'].tolist())
plt.title('determination of sub_groups')
plt.ylabel('mean correlation with all files')
plt.xlabel('file index numbers')
plt.show()

In [None]:
# a linear combination to achieve much better scores
df_sub['weighted_avg'] = abs(1 * (
        -10 * (1 * df_sub['7'] + 1 * df_sub['12'] + 1 * df_sub['13']) / 3 +

        225 * (2 * df_sub['4'] + 5 * df_sub['5'] + 5 * df_sub['6'] - 50 * df_sub['10'] + 5 * df_sub['15'] + 5 * df_sub['16'] +
               5 * df_sub['17'] + 5 * df_sub['18'] + 5 * df_sub['19'] + 5 * df_sub['20'] + 200 * df_sub['27'] + 700 * df_sub['33'] + 
               3 * df_sub['35'] + 6 * df_sub['36'] - 300 * df_sub['37']) / 601 +

        25 * (5 * df_sub['0'] + 5 * df_sub['2'] + 7 * df_sub['3'] + 7 * df_sub['8'] +3 * df_sub['9'] + 7 * df_sub['11'] +
             3 * df_sub['14'] + 5 * df_sub['25'] + 400 * df_sub['28'] + 3 * df_sub['31'] + 4 * df_sub['32'] + 7 * df_sub['34']) / 456 +

        -3 * (150 * df_sub['1'] - 1 * df_sub['21'] - 1 * df_sub['22'] - 1 * df_sub['23'] - 1 * df_sub['24']) / 146 +
        -1 * (1 * df_sub['29'] + 1 * df_sub['30']) / 2 +
        -1 * (1 * df_sub['26']) / 1
    ) / 233)

# create the final submission file
submission = pd.DataFrame({'ID': sub_file.ID, 'target': df_sub['weighted_avg'].tolist()})
submission.to_csv(r'submission_file.csv', index=False)

## It gets 0.47264 RMSLE as the public score, which looks like the best score on Kaggle so far ;-)
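As a quick sanity check on the blend formula, the weights can be summed and compared with the divisors used there. In this sketch the weights and divisors are copied from the formula, while the labels `group_1` to `group_6` are hypothetical names introduced only for illustration.

```python
# Inner weights of each subgroup with the divisor applied to that subgroup.
inner = {
    "group_1": ([1, 1, 1], 3),
    "group_2": ([2, 5, 5, -50, 5, 5, 5, 5, 5, 5, 200, 700, 3, 6, -300], 601),
    "group_3": ([5, 5, 7, 7, 3, 7, 3, 5, 400, 3, 4, 7], 456),
    "group_4": ([150, -1, -1, -1, -1], 146),
    "group_5": ([1, 1], 2),
    "group_6": ([1], 1),
}

# Weights applied to the six subgroup averages, and the final divisor.
outer_weights = [-10, 225, 25, -3, -1, -1]
outer_divisor = 233

# Print the weight sum next to the divisor for every subgroup.
for name, (weights, divisor) in inner.items():
    print(name, sum(weights), divisor)

print("outer", sum(outer_weights), outer_divisor)
```

Every inner divisor equals the sum of its weights, so each subgroup term is a normalized weighted average. The outer weights sum to 235 while the formula divides by 233, so the final blend is scaled up by a factor of 235/233, slightly under 1%. The notebook does not say whether this is intentional. Several weights are negative, which is why the formula wraps the result in `abs(...)`.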