# Introduction

In this competition, some participants have reported a significant gap between local CV and LB scores (https://www.kaggle.com/c/trends-assessment-prediction/discussion/153256). We have many features to work with, maybe too many. Reducing variance seems worthwhile here, so I wanted to investigate the BaggingRegressor for that purpose. The idea is to use the BaggingRegressor to build multiple models, each fitted on only a fraction of the features (and of the samples), then combine their outputs. From the scikit-learn docs:

"A Bagging regressor is an ensemble meta-estimator that fits base regressors each on random subsets of the original dataset and then aggregate their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it."

Ridge regression is known to work well on this dataset, so it is used as the base regressor here. The BaggingRegressor is evaluated as one component of a high-performing ensemble that also combines SVM and Ridge regression.
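
To make the idea concrete, here is a minimal NumPy sketch of what a bagged Ridge ensemble does: each member is fitted on a random 30% of the rows and a random 30% of the columns, and the members' predictions are averaged. It is a from-scratch illustration on synthetic data, not the scikit-learn implementation used later in this notebook. The helper names (`fit_ridge`, `bagged_ridge_predict`) and the data are made up for the example. Rows are drawn with replacement and columns without, which I believe matches the BaggingRegressor defaults.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    intercept = y_mean - x_mean @ coef
    return coef, intercept

def bagged_ridge_predict(X_train, y_train, X_new, n_estimators=30,
                         max_samples=0.3, max_features=0.3, alpha=1e-4, seed=42):
    """Average the predictions of ridge models fitted on random row/column subsets."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X_train.shape
    n_sub_rows = max(1, int(max_samples * n_rows))
    n_sub_cols = max(1, int(max_features * n_cols))
    preds = np.zeros((n_estimators, X_new.shape[0]))
    for i in range(n_estimators):
        rows = rng.integers(0, n_rows, size=n_sub_rows)            # rows: with replacement
        cols = rng.choice(n_cols, size=n_sub_cols, replace=False)  # columns: without replacement
        coef, intercept = fit_ridge(X_train[np.ix_(rows, cols)], y_train[rows], alpha)
        preds[i] = X_new[:, cols] @ coef + intercept
    return preds.mean(axis=0)

# Synthetic regression problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))
y = X @ rng.normal(size=40) + rng.normal(scale=0.5, size=500)
X_train, y_train, X_val, y_val = X[:400], y[:400], X[400:], y[400:]

bagged_pred = bagged_ridge_predict(X_train, y_train, X_val)
print("bagged MAE:", np.abs(bagged_pred - y_val).mean())
```

Because every member sees only part of the feature set, no single noisy feature can dominate the averaged prediction, which is the variance-reduction effect being investigated here.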

This notebook is heavily based on @aerdem4's excellent SVM notebook and @tunguz's notebook that adds Ridge regression. Those original notebooks can be found here:
https://www.kaggle.com/aerdem4/rapids-svm-on-trends-neuroimaging
https://www.kaggle.com/tunguz/rapids-ensemble-for-trends-neuroimaging/

## Results

After an offline sweep of blending weights, the weights that give the best local CV barely use the BaggingRegressor for the "age" target (weight 0.05). It contributes more for the domain variables: for "domain1_var2" and "domain2_var2" in particular, the BaggingRegressor almost completely replaces the basic Ridge regression (Ridge weights of 0.0 and 0.05, against BaggingRegressor weights of 0.55 and 0.45).

In terms of local CV, the result is almost identical to Bojan's (@tunguz's) notebook referenced above. On the leaderboard, adding the BaggingRegressor to the ensemble scores 0.1593, a small improvement over Bojan's 0.1595. So the gap between local CV and LB is reduced, albeit only slightly.

I find it particularly interesting that, when each model considers only a small subset of the features, the BaggingRegressor is competitive for the domain variables but not at all for age.
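
The offline sweep of blending weights is not part of this notebook. One plausible way to run such a sweep, sketched here on synthetic out-of-fold predictions, is a simple grid search over weight triples that sum to 1. The function names, the 0.05 step and the fake data are all assumptions made for the illustration, not the procedure actually used.

```python
import itertools
import numpy as np

def nae(y_true, y_pred):
    """Normalized absolute error for a single target."""
    return np.sum(np.abs(y_true - y_pred)) / np.sum(y_true)

def sweep_blend_weights(y_true, oof_preds, step=0.05):
    """Try every weight triple on a grid that sums to 1 and return the best one."""
    n_steps = int(round(1 / step))
    best_weights, best_score = None, np.inf
    for i, j in itertools.product(range(n_steps + 1), repeat=2):
        if i + j > n_steps:
            continue
        weights = np.array([i, j, n_steps - i - j]) / n_steps
        score = nae(y_true, oof_preds @ weights)
        if score < best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score

# Synthetic stand-ins for the out-of-fold predictions of three models.
rng = np.random.default_rng(0)
y_true = rng.uniform(20, 80, size=1000)
oof_preds = np.column_stack(
    [y_true + rng.normal(scale=s, size=1000) for s in (8.0, 10.0, 9.0)]
)

best_weights, best_score = sweep_blend_weights(y_true, oof_preds)
print("best weights:", best_weights, "score:", round(best_score, 6))
```

Since the grid includes the three single-model corners, the best blended score found this way can never be worse than the best individual model on the same out-of-fold predictions. Weights tuned like this are fitted to the local CV, so they can themselves overfit.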


# Load the data

In [1]:
# Install Rapids for faster SVM on GPUs

import sys
!cp ../input/rapids/rapids.0.13.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.6/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.6"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/

In [2]:
import numpy as np
import pandas as pd
import cudf
import cupy as cp
import warnings
from cuml.neighbors import KNeighborsRegressor
from cuml import SVR
from cuml.linear_model import Ridge, Lasso
from sklearn.model_selection import KFold
from sklearn.ensemble import BaggingRegressor
from sklearn.preprocessing import RobustScaler, StandardScaler

def metric(y_true, y_pred):
    # Normalized absolute error: sum of absolute errors divided by the sum of the true values.
    return np.mean(np.sum(np.abs(y_true - y_pred), axis=0)/np.sum(y_true, axis=0))
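
As a quick sanity check of the metric, here is a small worked example with made-up numbers (the three values are purely illustrative):

```python
import numpy as np

def metric(y_true, y_pred):
    # Same definition as in the notebook cell above.
    return np.mean(np.sum(np.abs(y_true - y_pred), axis=0) / np.sum(y_true, axis=0))

y_true = np.array([50.0, 60.0, 40.0])
y_pred = np.array([45.0, 66.0, 40.0])

# Absolute errors are 5, 6 and 0, and the true values sum to 150,
# so the score is 11 / 150.
score = metric(y_true, y_pred)
print(round(score, 6))  # 0.073333
```

Lower is better, and a score of 0 means every prediction is exact.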

In [3]:
fnc_df = cudf.read_csv("../input/trends-assessment-prediction/fnc.csv")
loading_df = cudf.read_csv("../input/trends-assessment-prediction/loading.csv")

fnc_features, loading_features = list(fnc_df.columns[1:]), list(loading_df.columns[1:])
df = fnc_df.merge(loading_df, on="Id")


labels_df = cudf.read_csv("../input/trends-assessment-prediction/train_scores.csv")
labels_df["is_train"] = True

df = df.merge(labels_df, on="Id", how="left")

test_df = df[df["is_train"] != True].copy()
df = df[df["is_train"] == True].copy()
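
The train/test split above relies on the left merge: ids without a row in `train_scores.csv` end up with a missing `is_train` value. The sketch below shows the same pattern in plain pandas (not cudf) on a tiny made-up frame. The column `feat` and the values are hypothetical, and `isna`/`notna` are used here as a more explicit alternative to comparing against `True`.

```python
import pandas as pd

# Four subjects with features, but labels for only two of them.
features = pd.DataFrame({"Id": [1, 2, 3, 4], "feat": [0.1, 0.2, 0.3, 0.4]})
labels = pd.DataFrame({"Id": [1, 3], "age": [50.0, 36.0]})
labels["is_train"] = True

# The left merge keeps every feature row; unlabeled ids get missing values.
merged = features.merge(labels, on="Id", how="left")

train_part = merged[merged["is_train"].notna()].copy()
test_part = merged[merged["is_train"].isna()].copy()
print(len(train_part), len(test_part))  # 2 2
```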



In [4]:
# Giving less importance to FNC features since they are easier to overfit due to high dimensionality.
FNC_SCALE = 1/600

df[fnc_features] *= FNC_SCALE
test_df[fnc_features] *= FNC_SCALE

In [5]:
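# Robust-scale the FNC features. RobustScaler is from sklearn, so the cudf data is converted to pandas.
scaler = RobustScaler()
dffit = scaler.fit_transform(df[fnc_features].to_pandas())
# Note: fit_transform refits the scaler on the test rows, so train and test are each scaled with
# their own statistics. Use scaler.transform(...) here to reuse the training statistics instead.
testfit = scaler.fit_transform(test_df[fnc_features].to_pandas())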
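
One property of this step is worth spelling out. With its default settings, RobustScaler centers each column on its median and divides by the interquartile range, so multiplying a column by a constant beforehand should not change the scaled result. The NumPy sketch below mimics that default behaviour on random data. It is not the scikit-learn code, and `robust_scale` is a name made up here. Under that assumption, the `FNC_SCALE` factor applied earlier is effectively undone by the scaler.

```python
import numpy as np

def robust_scale(X):
    """Center each column on its median and divide by its interquartile range."""
    median = np.median(X, axis=0)
    q25, q75 = np.percentile(X, [25, 75], axis=0)
    return (X - median) / (q75 - q25)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

scaled_raw = robust_scale(X)
scaled_after_shrinking = robust_scale(X * (1 / 600))

# Both versions agree up to floating-point rounding.
print(np.allclose(scaled_raw, scaled_after_shrinking))  # True
```

If the intent is to down-weight the FNC features relative to the loading features, applying the constant factor after the robust scaling instead would be one way to keep its effect.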

In [6]:
dfn = pd.DataFrame(dffit)
dfn = cudf.DataFrame(dfn)


In [7]:
testn = pd.DataFrame(testfit)
testn = cudf.DataFrame(testn)

In [8]:
dfn

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1368,1369,1370,1371,1372,1373,1374,1375,1376,1377
0,-0.115210,-0.564800,-0.346825,-0.652229,0.309619,0.396194,0.321085,0.274943,0.408415,0.153251,...,-0.558840,1.121838,0.320730,0.400996,-0.840185,-0.738112,-0.441600,-0.072529,0.083354,0.041877
1,-0.860023,-0.806715,-1.990393,-0.283466,-0.125322,-0.171817,0.069740,0.702356,1.074101,0.964191,...,0.235684,-1.091728,-0.490821,0.054676,-0.002598,0.223839,-0.204375,-0.606344,-0.087591,-0.095824
2,-0.216902,0.450364,0.410574,0.159720,0.297364,0.384684,-0.331900,-0.214646,-0.072133,-0.304650,...,0.076588,0.707393,-1.650938,1.291996,-0.453432,0.447590,0.072392,-0.305047,0.420707,-0.337515
3,0.162220,0.551207,-0.867616,-0.325059,0.270928,0.901800,0.141748,0.252126,-0.235481,0.590514,...,-0.565377,0.416175,-0.072337,0.397661,-0.078050,-1.188077,-1.155756,-0.345728,-1.446115,0.019472
4,-0.885463,-1.594216,-2.094349,-1.541049,-0.869234,-0.586263,-0.305070,-0.061576,0.340020,-0.098498,...,-0.995042,-0.845467,-1.149858,-1.172092,-0.229493,-0.191816,-0.331939,-0.017552,-0.727600,0.223092
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5872,-0.252459,0.089144,0.466345,-1.239722,0.688019,0.493994,0.229923,-0.151674,-0.062975,0.332829,...,-0.267457,0.477674,-1.392595,1.068518,-0.062064,-0.349212,-0.613537,-0.324463,0.226153,-0.373786
5873,-0.129314,-0.003485,-0.364198,0.036579,0.311234,0.060626,0.000601,-0.281670,-0.388019,-0.230108,...,-0.919615,0.183870,1.123991,0.739125,-1.473121,-1.160874,-1.449639,0.070451,-0.107701,0.350745
5874,-1.659498,-1.032821,-0.748854,-1.728834,0.901963,0.623707,0.768582,0.753376,-0.315562,0.458227,...,0.541791,-0.097608,0.581478,0.208402,0.491851,0.315748,-0.111102,0.354104,-0.542409,0.009353
5875,-0.326964,-0.742584,-0.310322,-1.211476,0.288336,-0.285977,0.935511,-0.138513,0.958717,-0.293915,...,0.867476,-0.721944,0.608446,-0.007352,0.323922,-0.363531,0.057072,-0.392303,0.698606,-0.586296


In [9]:
df[fnc_features] = dfn
test_df[fnc_features] = testn

In [10]:
df

Unnamed: 0,Id,SCN(53)_vs_SCN(69),SCN(98)_vs_SCN(69),SCN(99)_vs_SCN(69),SCN(45)_vs_SCN(69),ADN(21)_vs_SCN(69),ADN(56)_vs_SCN(69),SMN(3)_vs_SCN(69),SMN(9)_vs_SCN(69),SMN(2)_vs_SCN(69),...,IC_30,IC_22,IC_29,IC_14,age,domain1_var1,domain1_var2,domain2_var1,domain2_var2,is_train
0,18240,-0.115210,-0.564800,-0.346825,-0.652229,0.309619,0.396194,0.321085,0.274943,0.408415,...,0.003238,-0.006451,0.030605,0.016639,50.427747,29.90305385,61.05115855,50.06049358,61.24677336,True
1,13820,-0.860023,-0.806715,-1.990393,-0.283466,-0.125322,-0.171817,0.069740,0.702356,1.074101,...,0.001940,-0.011408,0.023409,0.013241,55.456978,49.38457261,60.29300963,60.54383047,37.34441091,True
2,13809,-0.216902,0.450364,0.410574,0.159720,0.297364,0.384684,-0.331900,-0.214646,-0.072133,...,-0.000252,-0.014255,0.030720,0.020297,36.961174,43.71688625,58.09714123,42.14923716,44.51487978,True
3,13825,0.162220,0.551207,-0.867616,-0.325059,0.270928,0.901800,0.141748,0.252126,-0.235481,...,-0.001308,-0.012427,0.021317,0.011156,47.470203,57.40391654,71.73026129,46.26066048,43.9657463,True
4,13810,-0.885463,-1.594216,-2.094349,-1.541049,-0.869234,-0.586263,-0.305070,-0.061576,0.340020,...,0.000825,-0.008199,0.020756,0.014812,48.948756,50.30134784,63.01577288,44.89238229,56.51086817,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11732,16299,-0.252459,0.089144,0.466345,-1.239722,0.688019,0.493994,0.229923,-0.151674,-0.062975,...,0.001949,-0.018767,0.021829,0.016573,64.203107,62.68657086,62.63511667,57.4445276,38.03955027,True
11733,16301,-0.129314,-0.003485,-0.364198,0.036579,0.311234,0.060626,0.000601,-0.281670,-0.388019,...,0.004872,-0.011617,0.023664,0.010137,57.436077,41.56829107,68.55638681,34.59675355,37.58509756,True
11734,16278,-1.659498,-1.032821,-0.748854,-1.728834,0.901963,0.623707,0.768582,0.753376,-0.315562,...,-0.000530,-0.013141,0.019981,0.012371,36.961174,41.84858971,55.78433221,19.69845939,16.32976701,True
11735,20136,-0.326964,-0.742584,-0.310322,-1.211476,0.288336,-0.285977,0.935511,-0.138513,0.958717,...,0.003586,-0.007180,0.034187,0.018442,28.442742,55.73784501,37.3279806,36.88775907,48.46559788,True


# BaggingRegressor + RAPIDS Ensemble

In [11]:
%%time

# Suppress the "Expected column ('F') major order, but got the opposite." warnings from cudf. This should be fixed properly,
# but as the only impact is additional memory usage, I'll suppress it for now.
warnings.filterwarnings("ignore", message="Expected column")

# Take a copy of the main dataframe, to report on per-target scores for each model.
# TODO Copy less to make this more efficient.
df_model1 = df.copy()
df_model2 = df.copy()
df_model3 = df.copy()

NUM_FOLDS = 7
kf = KFold(n_splits=NUM_FOLDS, shuffle=True, random_state=0)

features = loading_features + fnc_features

# Blending weights between the three models are specified separately for the 5 targets. 
#                                 SVR,  Ridge, BaggingRegressor
blend_weights = {"age":          [0.4,  0.55,  0.05],
                 "domain1_var1": [0.55, 0.15,  0.3],
                 "domain1_var2": [0.45, 0.0,   0.55],
                 "domain2_var1": [0.55, 0.15,  0.3],
                 "domain2_var2": [0.5,  0.05,  0.45]}

overall_score = 0
# Each tuple is (target, SVR C parameter, weight of the target in the overall score).
for target, c, w in [("age", 60, 0.3), ("domain1_var1", 12, 0.175), ("domain1_var2", 8, 0.175),
                     ("domain2_var1", 9, 0.175), ("domain2_var2", 12, 0.175)]:
    y_oof = np.zeros(df.shape[0])
    y_oof_model_1 = np.zeros(df.shape[0])
    y_oof_model_2 = np.zeros(df.shape[0])
    y_oof_model_3 = np.zeros(df.shape[0])
    y_test = np.zeros((test_df.shape[0], NUM_FOLDS))
    
    for f, (train_ind, val_ind) in enumerate(kf.split(df, df)):
        train_df, val_df = df.iloc[train_ind], df.iloc[val_ind]
        train_df = train_df[train_df[target].notnull()]

        model_1 = SVR(C=c, cache_size=3000.0)
        model_1.fit(train_df[features].values, train_df[target].values)
        model_2 = Ridge(alpha = 0.0001)
        model_2.fit(train_df[features].values, train_df[target].values)
        
        # The BaggingRegressor, using Ridge regression as the base estimator, is added here. It comes from
        # sklearn, not RAPIDS, so the dataframe is converted to pandas first (once, then reused).
        # Note that Ridge is still the cuml class imported above.
        train_pd = train_df.to_pandas()
        model_3 = BaggingRegressor(Ridge(alpha=0.0001), n_estimators=30, random_state=42, max_samples=0.3, max_features=0.3)
        model_3.fit(train_pd[features].values, train_pd[target].values)

        val_pred_1 = model_1.predict(val_df[features])
        val_pred_2 = model_2.predict(val_df[features])
        val_pred_3 = model_3.predict(val_df.to_pandas()[features])
        val_pred_3 = cudf.from_pandas(pd.Series(val_pred_3))
        
        test_pred_1 = model_1.predict(test_df[features])
        test_pred_2 = model_2.predict(test_df[features])
        test_pred_3 = model_3.predict(test_df.to_pandas()[features])
        test_pred_3 = cudf.from_pandas(pd.Series(test_pred_3))
        
        val_pred = blend_weights[target][0]*val_pred_1+blend_weights[target][1]*val_pred_2+blend_weights[target][2]*val_pred_3
        val_pred = cp.asnumpy(val_pred.values.flatten())
        
        test_pred = blend_weights[target][0]*test_pred_1+blend_weights[target][1]*test_pred_2+blend_weights[target][2]*test_pred_3
        test_pred = cp.asnumpy(test_pred.values.flatten())
        
        y_oof[val_ind] = val_pred
        y_oof_model_1[val_ind] = val_pred_1
        y_oof_model_2[val_ind] = val_pred_2
        y_oof_model_3[val_ind] = val_pred_3
        y_test[:, f] = test_pred
        
    df["pred_{}".format(target)] = y_oof
    df_model1["pred_{}".format(target)] = y_oof_model_1
    df_model2["pred_{}".format(target)] = y_oof_model_2
    df_model3["pred_{}".format(target)] = y_oof_model_3
    test_df[target] = y_test.mean(axis=1)
    
    score = metric(df[df[target].notnull()][target].values, df[df[target].notnull()]["pred_{}".format(target)].values)
    overall_score += w*score
    
    score_model1 = metric(df_model1[df_model1[target].notnull()][target].values, df_model1[df_model1[target].notnull()]["pred_{}".format(target)].values)
    score_model2 = metric(df_model2[df_model2[target].notnull()][target].values, df_model2[df_model2[target].notnull()]["pred_{}".format(target)].values)
    score_model3 = metric(df_model3[df_model3[target].notnull()][target].values, df_model3[df_model3[target].notnull()]["pred_{}".format(target)].values)

    print(f"For {target}:")
    print("SVR:", np.round(score_model1, 6))
    print("Ridge:", np.round(score_model2, 6))
    print("BaggingRegressor:", np.round(score_model3, 6))
    print("Ensemble:", np.round(score, 6))
    print()
    
print("Overall score:", np.round(overall_score, 6))

For age:
SVR: 0.177918
Ridge: 0.161465
BaggingRegressor: 0.152276
Ensemble: 0.15406

For domain1_var1:
SVR: 0.15621
Ridge: 0.177086
BaggingRegressor: 0.154881
Ensemble: 0.154691

For domain1_var2:
SVR: 0.153625
Ridge: 0.178628
BaggingRegressor: 0.155128
Ensemble: 0.153234

For domain2_var1:
SVR: 0.184502
Ridge: 0.210831
BaggingRegressor: 0.185102
Ensemble: 0.183948

For domain2_var2:
SVR: 0.179985
Ridge: 0.206575
BaggingRegressor: 0.179392
Ensemble: 0.178435

Overall score: 0.163522
CPU times: user 8min 39s, sys: 26.4 s, total: 9min 5s
Wall time: 9min 8s
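
The overall score is the weighted sum of the per-target ensemble scores, with age weighted 0.3 and each domain variable 0.175. The arithmetic can be checked by hand from the rounded values printed above (the dictionary names below are just for this check):

```python
# Per-target ensemble scores as printed above, and the target weights from the loop.
ensemble_scores = {
    "age": 0.15406,
    "domain1_var1": 0.154691,
    "domain1_var2": 0.153234,
    "domain2_var1": 0.183948,
    "domain2_var2": 0.178435,
}
target_weights = {
    "age": 0.3,
    "domain1_var1": 0.175,
    "domain1_var2": 0.175,
    "domain2_var1": 0.175,
    "domain2_var2": 0.175,
}

overall = sum(target_weights[t] * ensemble_scores[t] for t in ensemble_scores)
print(round(overall, 6))  # 0.163522
```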


In [12]:
sub_df = cudf.melt(test_df[["Id", "age", "domain1_var1", "domain1_var2", "domain2_var1", "domain2_var2"]], id_vars=["Id"], value_name="Predicted")
sub_df["Id"] = sub_df["Id"].astype("str") + "_" +  sub_df["variable"].astype("str")

sub_df = sub_df.drop("variable", axis=1).sort_values("Id")
assert sub_df.shape[0] == test_df.shape[0]*5
sub_df.head(10)

| | Id | Predicted |
|---|---|---|
| 1586 | 10003_age | 67.524744 |
| 7463 | 10003_domain1_var1 | 52.94782 |
| 13340 | 10003_domain1_var2 | 57.650484 |
| 19217 | 10003_domain2_var1 | 52.470429 |
| 25094 | 10003_domain2_var2 | 56.440817 |
| 1593 | 10006_age | 67.370129 |
| 7470 | 10006_domain1_var1 | 57.678881 |
| 13347 | 10006_domain1_var2 | 59.048261 |
| 19224 | 10006_domain2_var1 | 44.898143 |
| 25101 | 10006_domain2_var2 | 52.379146 |
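
The reshaping into the submission format can be hard to picture, so here is the same melt-and-concatenate pattern in plain pandas on a two-subject, two-target toy frame. The prediction values are made up, and pandas is used instead of cudf only so the sketch runs without a GPU.

```python
import pandas as pd

# Wide format: one row per subject, one column per target (made-up values).
preds = pd.DataFrame({
    "Id": [10003, 10006],
    "age": [67.5, 67.4],
    "domain1_var1": [52.9, 57.7],
})

# Long format: one row per (subject, target) pair.
long_df = preds.melt(id_vars=["Id"], value_name="Predicted")

# Build the "<Id>_<target>" key used as the row id in the submission.
long_df["Id"] = long_df["Id"].astype(str) + "_" + long_df["variable"].astype(str)
long_df = long_df.drop("variable", axis=1).sort_values("Id").reset_index(drop=True)
print(long_df)
```

Each subject contributes one row per target, which is why the real cell asserts that the submission has five times as many rows as `test_df`.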


In [13]:
sub_df.to_csv("submission_rapids_ensemble_with_baggingregressor.csv", index=False)