# A Basic Ensembling Technique with SciPy Optimize

In the [VPP: A Basic Ensembling Technique](https://www.kaggle.com/tolgadincer/vpp-a-basic-ensembling-technique) notebook, I demonstrated how to stack several weak models to improve the CV score, and hence the LB score. As the meta-learner I used ridge regression with built-in cross-validation (scikit-learn's `RidgeCV`), which minimizes an L2-regularized squared error.

In this notebook, I'll instead estimate the ensemble weights by minimizing the mean absolute error directly, subject to constraints on the weights. For this we will use `scipy.optimize.minimize`.

## Problem Formulation
The ensemble prediction (EP) can be written as:

<center>
    $\mathrm{EP} = \sum_{i=1}^{4} w_i \times \mathrm{OOF}_i$
</center>

The cost function is then simply the mean absolute error of the ensemble predictions over the $N$ scored rows, which can be written as follows:

<center>
    $\mathrm{cost} = \frac{1}{N} \sum_{n=1}^{N} |y_n - \mathrm{EP}_n|$
</center>

The task is to minimize the cost function while keeping the sum of the weights at 1 and the weights themselves in the [0, 1] range. Finding the minimum of the cost function will give us the optimum weights.

<center>
    minimize(cost)
</center>

<center>
    subject to
</center>

<center>
    $\sum_{i=1}^{4} w_i = 1$
</center>

<center>
    $0 \leq w_i \leq 1$
</center>
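
To make the formulation concrete, here is a minimal sketch that evaluates the cost for two candidate weight vectors and checks the constraints. The toy arrays and the helper names `ensemble_cost` and `is_feasible` are made up for illustration and are not part of the notebook's pipeline.

```python
import numpy as np

# Toy data (hypothetical): 5 samples, 4 models with symmetric errors.
y_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
oof_toy = np.stack(
    [y_toy + 0.5, y_toy - 0.5, y_toy + 1.0, y_toy - 1.0],
    axis=1,
)  # shape (5, 4): one column per model


def ensemble_cost(w, oof, y):
    """Mean absolute error of the weighted ensemble prediction."""
    ep = oof @ w  # EP_n = sum_i w_i * OOF_{n, i}
    return np.mean(np.abs(y - ep))


def is_feasible(w, tol=1e-9):
    """Check the constraints: weights sum to 1 and lie in [0, 1]."""
    w = np.asarray(w)
    return bool(abs(w.sum() - 1.0) <= tol and np.all(w >= -tol) and np.all(w <= 1 + tol))


w_equal = np.full(4, 0.25)                  # equal weights
w_single = np.array([1.0, 0.0, 0.0, 0.0])   # only the first model

cost_equal = ensemble_cost(w_equal, oof_toy, y_toy)
cost_single = ensemble_cost(w_single, oof_toy, y_toy)
```

In this toy setup the model errors cancel under equal weights, so `cost_equal` is lower than `cost_single`; real OOF predictions will not cancel this neatly.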

## Load Packages

In [None]:
import numpy as np
import pandas as pd
from scipy.optimize import minimize

## Competition Metric

The competition metric is the mean absolute error, calculated only over the inspiration phase (`u_out == 0`).

In [None]:
def mae(ytrue, ypred, uout=None):
    """Mean absolute error; restricted to the inspiration phase when uout is a pandas Series."""
    if isinstance(uout, pd.Series):
        # MAE over the inspiration phase only (u_out == 0).
        return np.mean(np.abs((ytrue - ypred)[uout == 0]))
    else:
        # MAE over all phases.
        return np.mean(np.abs((ytrue - ypred)))
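
As a quick sanity check, the sketch below runs the metric on a four-row toy example. The values are invented for illustration, and the function is repeated only so that the block runs on its own.

```python
import numpy as np
import pandas as pd


def mae(ytrue, ypred, uout=None):
    # Copy of the metric above so this block is self-contained.
    if isinstance(uout, pd.Series):
        return np.mean(np.abs((ytrue - ypred)[uout == 0]))
    else:
        return np.mean(np.abs((ytrue - ypred)))


# Hypothetical toy values: two inspiration rows followed by two expiration rows.
ytrue_toy = pd.Series([10.0, 12.0, 14.0, 16.0])
ypred_toy = pd.Series([11.0, 12.0, 10.0, 16.0])
uout_toy = pd.Series([0, 0, 1, 1])

mae_all = mae(ytrue_toy, ypred_toy)                    # absolute errors 1, 0, 4, 0
mae_insp = mae(ytrue_toy, ypred_toy, uout_toy)         # absolute errors 1, 0 only
mae_array_mask = mae(ytrue_toy, ypred_toy, uout_toy.to_numpy())
```

Note that the mask is applied only when `uout` is a pandas Series. Passing a NumPy array, as in `mae_array_mask`, falls through to the all-phases branch.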

## Load Data

From the training data, we will need the training targets (pressure values). Since the competition metric, mean absolute error, is evaluated only on the inspiration phase (`u_out==0`), we also need the u_out values.

In [None]:
data = pd.read_csv('../input/ventilator-pressure-prediction/train.csv', usecols=['pressure', 'u_out'])
ytrue = data.pressure
uout = data.u_out

## Postprocessing Tools

In [None]:
pressure_sorted = np.sort(data['pressure'].unique())
PRESSURE_MIN = pressure_sorted[0]
PRESSURE_MAX = pressure_sorted[-1]
PRESSURE_STEP = pressure_sorted[1] - pressure_sorted[0]

def post_process(pressure):
    pressure = np.round((pressure - PRESSURE_MIN) / PRESSURE_STEP) * PRESSURE_STEP + PRESSURE_MIN
    pressure = np.clip(pressure, PRESSURE_MIN, PRESSURE_MAX)
    return pressure
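
The effect of this rounding and clipping can be seen on a small grid. The sketch below uses a hypothetical grid (`GRID_MIN`, `GRID_MAX`, `GRID_STEP`) instead of the real pressure values, which come from the training data.

```python
import numpy as np

# Hypothetical grid for illustration; the notebook derives the real one from train.csv.
GRID_MIN, GRID_MAX, GRID_STEP = 0.0, 1.0, 0.25


def snap_to_grid(values):
    """Round each value to the nearest grid point, then clip to the grid range."""
    snapped = np.round((values - GRID_MIN) / GRID_STEP) * GRID_STEP + GRID_MIN
    return np.clip(snapped, GRID_MIN, GRID_MAX)


raw = np.array([-0.3, 0.1, 0.4, 0.9, 1.7])
snapped = snap_to_grid(raw)  # values outside [0, 1] end up on the boundary
```

The idea is that the targets take only discrete values, so moving predictions onto the same grid cannot push them further from the nearest valid target.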

## Load the OOF and the Submission Predictions

In [None]:
oof1 = np.load('../input/vpp-156-oof/oof_preds.npy')
sub1 = pd.read_csv('../input/tfbidirectional156/submission_median_round_tfbidirec.csv')
sub1.pressure = post_process(sub1.pressure)
print(mae(ytrue, oof1, uout))

oof2 = np.load('../input/gb-vpp-why-so-serious/oof_preds.npy')
sub2 = pd.read_csv('../input/gb-vpp-why-so-serious/submission_median_round.csv')
print(mae(ytrue, oof2, uout))

oof3 = np.load('../input/gb-vpp-to-infinity-and-beyond-td/oof_preds.npy')
sub3 = pd.read_csv('../input/gb-vpp-to-infinity-and-beyond-td/submission_median_round.csv')
print(mae(ytrue, oof3, uout))


oof4 = pd.read_csv('../input/ventilator-train-classification/exp080_conti_rc/oof.csv', usecols=['oof'])
oof4 = oof4.to_numpy().ravel()
sub4 = pd.read_csv('../input/ventilator-train-classification/exp080_conti_rc/submission_median_pp.csv')
print(mae(ytrue, oof4, uout))


## Input & Target Values

In [None]:
X = np.stack([oof1, oof2, oof3, oof4], 1)[uout == 0]
y = ytrue[uout == 0]

print(f'X shape: {X.shape}')
print(f'y shape: {y.shape}')

Since we ensemble four models, each inspiration-phase timestep has only four features (one prediction per model), as the shapes above show.
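
The stacking and masking step can be illustrated on toy arrays. This is a sketch with invented values and three models instead of four.

```python
import numpy as np

# Hypothetical OOF predictions from three models over six timesteps.
oof_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
oof_b = oof_a + 0.1
oof_c = oof_a - 0.1
uout_toy = np.array([0, 0, 0, 1, 1, 1])  # first three rows are inspiration

# axis=1 puts each model in its own column: one row per timestep.
stacked = np.stack([oof_a, oof_b, oof_c], axis=1)

# The boolean mask keeps only the inspiration-phase rows.
X_toy = stacked[uout_toy == 0]
```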

## Optimization with SciPy

In [None]:
# The optimizer needs an initial guess for the weights.
# The most straightforward choice is equal weights.
x0 = np.array([0.25, 0.25, 0.25, 0.25])

# The cost function is just the ensemble's mean absolute error.
# X and y are already restricted to the inspiration phase, so no u_out mask is needed.
def cost(w):
    return mae(y, X @ w)

bnds = ((0, 1), (0, 1), (0, 1), (0, 1))  # Weights have to be between 0 and 1.
cons = ({'type': 'eq', 'fun': lambda w: 1 - np.sum(w)})  # Sum of the weights must be 1.
# Pass the constraint explicitly; SLSQP handles both bounds and equality constraints.
res = minimize(cost, x0, method='SLSQP', bounds=bnds, constraints=cons)

In [None]:
print(res)

The MAE printed above (the `fun` field of the result) is the score of the stacked models with the ensemble weights found by the optimization. It is 0.015 better than the best-scoring single model!

Note that this MAE is the same as the one in the [VPP: A Basic Ensembling Technique](https://www.kaggle.com/tolgadincer/vpp-a-basic-ensembling-technique) notebook up to the 4th decimal; however, the weights are slightly different. The difference in the weights does not change the LB score (see Version 1).
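
The call above uses a general-purpose optimizer. Because the cost is a mean absolute error and the constraints are linear, the same weights can in principle also be found with a true linear program: introduce one auxiliary variable $t_n \geq |y_n - \mathrm{EP}_n|$ per row and minimize their mean. The following is a minimal sketch on synthetic data, assuming a SciPy version whose `scipy.optimize.linprog` supports `method="highs"`. The function name `fit_mae_weights_lp` and the data are hypothetical, and this is not what the notebook ran.

```python
import numpy as np
from scipy.optimize import linprog


def fit_mae_weights_lp(X, y):
    """Solve min (1/N) sum_n t_n  s.t.  |y_n - (Xw)_n| <= t_n, sum(w) = 1, 0 <= w <= 1."""
    n_samples, n_models = X.shape
    # Decision vector z = [w_1..w_M, t_1..t_N]; only the t's appear in the objective.
    c = np.concatenate([np.zeros(n_models), np.full(n_samples, 1.0 / n_samples)])
    eye = np.eye(n_samples)
    #  Xw - t <= y   and   -Xw - t <= -y   together encode |y - Xw| <= t.
    A_ub = np.block([[X, -eye], [-X, -eye]])
    b_ub = np.concatenate([y, -y])
    # Equality constraint: the weights sum to 1 (the t's do not take part).
    A_eq = np.concatenate([np.ones(n_models), np.zeros(n_samples)])[None, :]
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * n_models + [(0, None)] * n_samples
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n_models], res.fun


# Synthetic data: four noisy copies of the target with different noise levels.
rng = np.random.default_rng(0)
y_syn = rng.normal(size=200)
X_syn = y_syn[:, None] + rng.normal(size=(200, 4)) * np.array([0.1, 0.2, 0.3, 0.4])

w_lp, cost_lp = fit_mae_weights_lp(X_syn, y_syn)
cost_equal_weights = np.mean(np.abs(y_syn - X_syn @ np.full(4, 0.25)))
```

This dense formulation builds a `2N x (M + N)` constraint matrix, so it is only practical for small `N`. On the full set of inspiration-phase rows it would likely need sparse matrices or subsampling.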

## Ensembling the Predictions

In [None]:
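submission = pd.read_csv('../input/ventilator-pressure-prediction/sample_submission.csv')
# Start from zero explicitly rather than relying on the sample file's placeholder values.
submission['pressure'] = 0.0
for sub, w in zip([sub1, sub2, sub3, sub4], res.x):
    submission['pressure'] += sub['pressure'] * w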

In [None]:
submission.to_csv('submission.csv', index=False)
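
A weighted blend like this silently assumes that all submission files list the rows in the same order. The sketch below shows one way to check that before blending, using toy frames. The helper `blend_submissions` is hypothetical, and the `id` column name is an assumption about the submission layout.

```python
import numpy as np
import pandas as pd


def blend_submissions(subs, weights):
    """Weighted sum of the 'pressure' columns after checking that the ids line up."""
    base_ids = subs[0]["id"]
    for sub in subs[1:]:
        if not base_ids.equals(sub["id"]):
            raise ValueError("submission ids are not aligned")
    blended = pd.DataFrame({"id": base_ids})
    blended["pressure"] = sum(w * sub["pressure"] for sub, w in zip(subs, weights))
    return blended


# Toy submissions (invented values).
sub_a = pd.DataFrame({"id": [1, 2, 3], "pressure": [10.0, 20.0, 30.0]})
sub_b = pd.DataFrame({"id": [1, 2, 3], "pressure": [20.0, 40.0, 60.0]})

blended = blend_submissions([sub_a, sub_b], [0.75, 0.25])
```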

## Post Processing

In [None]:
submission.pressure = post_process(submission.pressure)
submission.to_csv('submission_pp.csv', index=False)