# Model Blending Weights Optimisation

This demo shows how to use [scipy.optimize][1] to optimise your model blending weights using your models' out-of-fold (OOF) predictions.

**UPDATE:** Removed the penalty term by using the 'SLSQP' solver with a relatively small tolerance and an explicit Jacobian.

**UPDATE:** Derived the gradients with paper and pencil to speed up the optimisation.

**UPDATE:** Added a numba-compiled gradient function.

[1]: https://docs.scipy.org/doc/scipy/reference/optimize.html

In [None]:
# !pip install autograd --quiet

In [None]:
import datetime
import pandas as pd
from time import time
# from autograd import grad
# import autograd.numpy as np
import numpy as np
from numba import njit
from scipy.optimize import minimize, fsolve

# Objective Function and Gradients

$$
F = -\frac{1}{NM}\sum_{m=1}^{M}\sum_{i=1}^{N}\left[ y_{i,m}{\rm log}\left( \sum_{k=1}^{K}w_{k}\hat{y}_{i,m,k} \right) + \left( 1 - y_{i,m} \right) {\rm log}\left( 1 - \sum_{k=1}^{K}w_{k}\hat{y}_{i,m,k} \right) \right],
$$

$$
\frac{\partial F}{\partial w_{k}} = -\frac{1}{NM}\sum_{m=1}^{M}\sum_{i=1}^{N}\left[ \frac{-y_{i,m}\hat{y}_{i,m,k}+\hat{y}_{i,m,k}^{2}w_{k}+\hat{y}_{i,m,k}\sum_{j=1, j\neq k}^{K}\left( w_{j}\hat{y}_{i,m,j}\right)}{\hat{y}_{i,m,k}^{2}w_{k}^{2}+2\hat{y}_{i,m,k}\sum_{j=1, j\neq k}^{K}\left( w_{j}\hat{y}_{i,m,j}\right)w_{k}-\hat{y}_{i,m,k}w_{k}+\left(\sum_{j=1, j\neq k}^{K}\left( w_{j}\hat{y}_{i,m,j}\right)\right)^{2}-\sum_{j=1, j\neq k}^{K}\left( w_{j}\hat{y}_{i,m,j}\right)} \right], \quad k = 1, ..., K.
$$
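
As a sanity check on the long expression above, here is a shorter equivalent form (my own rearrangement, using $p_{i,m}$ as shorthand for the blended prediction $\sum_{k=1}^{K}w_{k}\hat{y}_{i,m,k}$, which is not a symbol used elsewhere in this notebook). Differentiating $F$ term by term gives:

$$
\frac{\partial F}{\partial w_{k}} = -\frac{1}{NM}\sum_{m=1}^{M}\sum_{i=1}^{N}\left[ \frac{y_{i,m}\hat{y}_{i,m,k}}{p_{i,m}} - \frac{\left(1 - y_{i,m}\right)\hat{y}_{i,m,k}}{1 - p_{i,m}} \right] = \frac{1}{NM}\sum_{m=1}^{M}\sum_{i=1}^{N}\hat{y}_{i,m,k}\,\frac{p_{i,m} - y_{i,m}}{p_{i,m}\left(1 - p_{i,m}\right)}, \quad k = 1, ..., K.
$$

Substituting $p_{i,m} = w_{k}\hat{y}_{i,m,k} + \sum_{j=1, j\neq k}^{K}w_{j}\hat{y}_{i,m,j}$ and expanding $p_{i,m}^{2} - p_{i,m}$ in the denominator recovers the expanded expression.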

In [None]:
# CPMP's logloss from https://www.kaggle.com/c/lish-moa/discussion/183010
def log_loss_numpy(y_pred):
    # Mean binary log loss against the global y_true (loaded in a later cell).
    y_true_ravel = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # Clip to avoid log(0).
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    loss = np.where(y_true_ravel == 1, - np.log(y_pred), - np.log(1 - y_pred))
    return loss.mean()

def func_numpy_metric(weights):
    # Blend by contracting the model axis of oof (K, N, M) with weights (K,).
    oof_blend = np.tensordot(weights, oof, axes = ((0), (0)))
    return log_loss_numpy(oof_blend)

def grad_func(weights):
    # Analytic gradient of the blended log loss w.r.t. each weight
    # (reads the globals oof and y_true).
    oof_clip = np.clip(oof, 1e-15, 1 - 1e-15)
    gradients = np.zeros(oof.shape[0])
    for i in range(oof.shape[0]):
        # a: targets, b: model i's predictions, c: weighted sum of the other models
        a, b, c = y_true, oof_clip[i], np.zeros((oof.shape[1], oof.shape[2]))
        for j in range(oof.shape[0]):
            if j != i:
                c += weights[j] * oof_clip[j]
        gradients[i] = -np.mean((-a*b+(b**2)*weights[i]+b*c)/((b**2)*(weights[i]**2)+2*b*c*weights[i]-b*weights[i]+(c**2)-c))
    return gradients

@njit
def grad_func_jit(weights):
    # Same gradient as grad_func, compiled with numba.
    # numba treats the globals oof and y_true as compile-time constants,
    # so define them before the first call.
    oof_clip = np.minimum(1 - 1e-15, np.maximum(oof, 1e-15))
    gradients = np.zeros(oof.shape[0])
    for i in range(oof.shape[0]):
        a, b, c = y_true, oof_clip[i], np.zeros((oof.shape[1], oof.shape[2]))
        for j in range(oof.shape[0]):
            if j != i:
                c += weights[j] * oof_clip[j]
        gradients[i] = -np.mean((-a*b+(b**2)*weights[i]+b*c)/((b**2)*(weights[i]**2)+2*b*c*weights[i]-b*weights[i]+(c**2)-c))
    return gradients
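
The double loop in the gradient functions can also be written without loops. The sketch below is my own rearrangement, not part of the original notebook: it rewrites the same fraction as `b * (p - a) / (p * (1 - p))`, where `p` is the blended prediction, and checks it against the loop form. The arrays `oof_demo` and `y_true_demo` are synthetic stand-ins, and the function names are hypothetical.

```python
import numpy as np

# Synthetic stand-ins (assumption): oof has shape (K, N, M), y_true has shape (N, M).
rng = np.random.default_rng(0)
K, N, M = 3, 200, 4
y_true_demo = (rng.random((N, M)) < 0.3).astype(np.float64)
oof_demo = rng.uniform(0.05, 0.95, size=(K, N, M))

def grad_loop(weights, oof, y_true):
    """Loop form, using the same expression as grad_func."""
    oof_clip = np.clip(oof, 1e-15, 1 - 1e-15)
    gradients = np.zeros(oof.shape[0])
    for i in range(oof.shape[0]):
        a, b, c = y_true, oof_clip[i], np.zeros_like(oof_clip[i])
        for j in range(oof.shape[0]):
            if j != i:
                c += weights[j] * oof_clip[j]
        w = weights[i]
        gradients[i] = -np.mean((-a*b + (b**2)*w + b*c)
                                / ((b**2)*(w**2) + 2*b*c*w - b*w + (c**2) - c))
    return gradients

def grad_vectorised(weights, oof, y_true):
    """Loop-free form: mean over (N, M) of b * (p - a) / (p * (1 - p))."""
    oof_clip = np.clip(oof, 1e-15, 1 - 1e-15)
    p = np.tensordot(weights, oof_clip, axes=1)      # blended prediction, shape (N, M)
    ratio = (p - y_true) / (p * (1 - p))             # shape (N, M)
    return (oof_clip * ratio).mean(axis=(1, 2))      # one value per model, shape (K,)

w_demo = np.array([0.5, 0.3, 0.2])
g_loop = grad_loop(w_demo, oof_demo, y_true_demo)
g_vec = grad_vectorised(w_demo, oof_demo, y_true_demo)
print('Loop and vectorised gradients agree:', np.allclose(g_loop, g_vec))
```

Whether this form is faster than the numba version depends on the data size, so time both on your own OOF arrays before switching.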

# Model OOF Scores

Here are my OOF predictions. You may substitute your own OOF prediction files.

In [None]:
y_true = pd.read_csv('../input/lish-moa/train_targets_scored.csv', index_col = 'sig_id').values

oof_dict = {'Model 1': '../input/moa-oof-demo/oof1.npy', 
            'Model 2': '../input/moa-oof-demo/oof2.npy', 
            'Model 3': '../input/moa-oof-demo/oof3.npy'
           }

oof = np.zeros((len(oof_dict), y_true.shape[0], y_true.shape[1]))
for i, path in enumerate(oof_dict.values()):
    oof[i] = np.load(path)

In [None]:
%%time

log_loss_scores = {}
for n, key in enumerate(oof_dict.keys()):
    score_oof = log_loss_numpy(oof[n])
    log_loss_scores[key] = score_oof
    print(f'{key} CV:\t', score_oof)
print('-' * 50)

# Test Numba Gradient Function

In [None]:
test_weights = np.array([1 / oof.shape[0]] * oof.shape[0])

In [None]:
%timeit -r 10 grad_func(test_weights)

In [None]:
%timeit -r 10 grad_func_jit(test_weights)

# Blending Weights Optimisation

Providing `jac` is optional: if it is omitted, scipy estimates the gradient with its own 2-point finite differences, which costs extra function evaluations. Passing the analytic gradient avoids that cost.
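
Before trusting a hand-derived gradient, it is worth comparing it with a finite-difference estimate. The sketch below uses `scipy.optimize.check_grad` on synthetic data. The arrays `oof_demo` and `y_demo` and the functions `loss` and `grad` are stand-ins I made up for this check; `grad` uses a compact form of the gradient rather than the expanded expression in `grad_func`.

```python
import numpy as np
from scipy.optimize import check_grad

# Synthetic stand-ins (assumption): oof has shape (K, N, M), targets have shape (N, M).
rng = np.random.default_rng(42)
K, N, M = 3, 500, 5
y_demo = (rng.random((N, M)) < 0.2).astype(np.float64)
oof_demo = rng.uniform(0.05, 0.95, size=(K, N, M))

def loss(weights):
    # Mean binary log loss of the blended prediction.
    p = np.clip(np.tensordot(weights, oof_demo, axes=1), 1e-15, 1 - 1e-15)
    return -np.mean(y_demo * np.log(p) + (1 - y_demo) * np.log(1 - p))

def grad(weights):
    # Analytic gradient: mean of y_hat_k * (p - y) / (p * (1 - p)) for each model k.
    p = np.clip(np.tensordot(weights, oof_demo, axes=1), 1e-15, 1 - 1e-15)
    return (oof_demo * ((p - y_demo) / (p * (1 - p)))).mean(axis=(1, 2))

w0 = np.full(K, 1 / K)
# check_grad returns the 2-norm of (analytic gradient - finite-difference gradient).
err = check_grad(loss, grad, w0)
print(f'Gradient check error: {err:.2e}')
```

A small error here suggests the analytic gradient and the finite-difference estimate agree at `w0`; it does not prove they agree everywhere.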

In [None]:
tol = 1e-10
init_guess = [1 / oof.shape[0]] * oof.shape[0]
bnds = [(0, 1) for _ in range(oof.shape[0])]
cons = {'type': 'eq', 
        'fun': lambda x: np.sum(x) - 1, 
        'jac': lambda x: [1] * len(x)}

print('Initial Blend OOF:', func_numpy_metric(init_guess))
start_time = time()
res_scipy = minimize(fun = func_numpy_metric, 
                     x0 = init_guess, 
                     method = 'SLSQP', 
                     jac = grad_func_jit, # grad_func 
                     bounds = bnds, 
                     constraints = cons, 
                     tol = tol)
print(f'[{str(datetime.timedelta(seconds = time() - start_time))[2:7]}] Optimised Blend OOF:', res_scipy.fun)
print('Optimised Weights:', res_scipy.x)

In [None]:
print('Check the sum of all weights:', np.sum(res_scipy.x))
# Use the absolute deviation so that sums below 1 are also caught.
if abs(np.sum(res_scipy.x) - 1) <= tol:
    print('Great! The sum of all weights equals 1!')
else:
    print('Manual adjustment is needed to modify the weights.')
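
If the check above fails, one simple manual adjustment is to clip the weights to $[0, 1]$ and divide by their sum. This is only a sketch of one option, not something the solver does for you: `renormalise_weights` is a hypothetical helper and the weights below are made up.

```python
import numpy as np

def renormalise_weights(weights):
    """Clip weights to [0, 1] and rescale them to sum to 1 (hypothetical helper)."""
    w = np.clip(np.asarray(weights, dtype=np.float64), 0.0, 1.0)
    total = w.sum()
    if total == 0:
        raise ValueError('All weights are zero after clipping.')
    return w / total

w_raw = np.array([0.52, 0.31, 0.18])   # made-up weights that sum to 1.01
w_fixed = renormalise_weights(w_raw)
print('Adjusted weights:', w_fixed, 'sum:', w_fixed.sum())
```

Rescaling changes the blend, so recompute the blended OOF log loss with the adjusted weights before using them.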

# Bonus (Lagrange Multiplier)

Congratulations! You have found this bonus. In this section, I optimise the blending weights in a more mathematical way, using the Lagrange multiplier method. We want to solve the following minimisation problem:

$$
\begin{align}
\min_{w_{1}, w_{2},..., w_{K}} \quad &-\frac{1}{NM}\sum_{m=1}^{M}\sum_{i=1}^{N}\left[ y_{i,m}{\rm log}\left( \sum_{k=1}^{K}w_{k}\hat{y}_{i,m,k} \right) + \left( 1 - y_{i,m} \right) {\rm log}\left( 1 - \sum_{k=1}^{K}w_{k}\hat{y}_{i,m,k} \right) \right], \qquad {\rm (1)} \\
s.t. \quad &\sum_{k=1}^{K}w_{k} = 1, \qquad {\rm (1a)} \\
& 0 \leqslant w_{k} \leqslant 1, \quad k = 1, ..., K, \qquad {\rm (1b)}
\end{align}
$$

where $N$ is the number of 'sig_id' observations in the training data $(i = 1, ...,N)$;

$M$ is the number of scored MoA targets $(m = 1, ...,M)$;

$w_{k}$ is the blending weight for the $k$th model's prediction results $(k = 1, ...,K)$;

$\hat{y}_{i,m,k}$ is the $k$th model's predicted probability of the $m$th positive MoA response for the $i$th 'sig_id';

$y_{i,m}$ is the ground truth of the $m$th positive MoA response for the $i$th 'sig_id', 1 for a positive response, 0 otherwise;

${\rm log}(.)$ is the natural (base e) logarithm.

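According to the [Extreme Value Theorem][1], a continuous function on a closed and bounded set attains an absolute maximum and minimum. Constraints (1a) and (1b) define such a set, so Eq. (1) has an absolute minimum on it, provided the blended predictions stay away from 0 and 1. We apply the [Lagrange multiplier][2] method to this optimisation problem. The new optimisation problem is expressed as follows: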

$$
\begin{align}
\min_{w_{1}, w_{2},..., w_{K}} \quad &L = -\frac{1}{NM}\sum_{m=1}^{M}\sum_{i=1}^{N}\left[ y_{i,m}{\rm log}\left( \sum_{k=1}^{K}w_{k}\hat{y}_{i,m,k} \right) + \left( 1 - y_{i,m} \right) {\rm log}\left( 1 - \sum_{k=1}^{K}w_{k}\hat{y}_{i,m,k} \right) \right] - \lambda\left(\sum_{k=1}^{K}w_{k} - 1\right), \qquad {\rm (2)} \\
s.t. \quad &0 \leqslant w_{k} \leqslant 1, \quad k = 1, ..., K, \qquad {\rm (2b)}
\end{align}
$$

where $\lambda$ is the Lagrange multiplier.

Assuming the bounds in (2b) are not active at the optimum, the [Karush–Kuhn–Tucker conditions][3] for the optimal solution reduce to:
$$
\left\{\begin{matrix}
\frac{\partial L}{\partial w_{k}} = 0, & k = 1, ..., K, \\ 
\frac{\partial L}{\partial \lambda} = 0, & 
\end{matrix}\right., \qquad {\rm (3)}
$$

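Eq. (3) gives $K+1$ equations in $K+1$ unknowns, each set equal to zero. We can use [autograd][4] to calculate the partial derivatives and [scipy.optimize.fsolve][5] to solve the system. Note that fsolve does not enforce the bounds in (2b), so check that the resulting weights lie in $[0, 1]$.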

[1]: https://en.wikipedia.org/wiki/Extreme_value_theorem
[2]: https://en.wikipedia.org/wiki/Lagrange_multiplier
[3]: https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions
[4]: https://github.com/HIPS/autograd
[5]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fsolve.html
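
The system in Eq. (3) can also be solved without autograd by plugging in an analytic gradient. The sketch below does this on synthetic data, so everything in it is an assumption rather than the notebook's own method: the arrays `oof_demo` and `y_demo`, the noise levels, and the helper names are made up, and `grad_loss` uses a compact form of $\partial F / \partial w_{k}$.

```python
import numpy as np
from scipy.optimize import fsolve

# Synthetic stand-ins (assumption): three models that are noisy copies of a
# true probability, with different noise levels.
rng = np.random.default_rng(0)
K, N, M = 3, 2000, 3
p_true = rng.uniform(0.2, 0.8, size=(N, M))
y_demo = (rng.random((N, M)) < p_true).astype(np.float64)
noise_scale = np.array([0.05, 0.08, 0.12]).reshape(K, 1, 1)
oof_demo = np.clip(p_true + noise_scale * rng.standard_normal((K, N, M)), 0.02, 0.98)

def loss(weights):
    # Objective F: mean binary log loss of the blended prediction.
    p = np.clip(np.tensordot(weights, oof_demo, axes=1), 1e-15, 1 - 1e-15)
    return -np.mean(y_demo * np.log(p) + (1 - y_demo) * np.log(1 - p))

def grad_loss(weights):
    # dF/dw_k: mean of y_hat_k * (p - y) / (p * (1 - p)).
    p = np.clip(np.tensordot(weights, oof_demo, axes=1), 1e-15, 1 - 1e-15)
    return (oof_demo * ((p - y_demo) / (p * (1 - p)))).mean(axis=(1, 2))

def lagrange_system(params):
    # K stationarity equations dF/dw_k - lambda = 0, plus the sum constraint.
    weights, lam = params[:-1], params[-1]
    return np.append(grad_loss(weights) - lam, weights.sum() - 1)

x0 = np.append(np.full(K, 1 / K), 0.0)   # equal weights, lambda = 0
sol, info, ier, msg = fsolve(lagrange_system, x0, full_output=True)
w_opt, lam_opt = sol[:-1], sol[-1]

print('fsolve reported success:', ier == 1)
print('Weights:', w_opt, 'sum:', w_opt.sum())
print('Equal-weight loss:', loss(x0[:-1]))
print('Solved-weight loss:', loss(w_opt))
```

Because fsolve is an unconstrained root finder, nothing here keeps the weights inside $[0, 1]$. If a weight comes out negative, the bounded 'SLSQP' approach is the safer choice.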

In [None]:
# def Lagrange_func(params):
#     w1, w2, w3, _lambda = params
#     oof_blend = w1 * oof[0] + w2 * oof[1] + w3 * oof[2]
#     return log_loss_numpy(oof_blend) - _lambda * (w1 + w2 + w3 - 1) 

In [None]:
# grad_L = grad(Lagrange_func)

In [None]:
# def Lagrange_obj(params):
#     w1, w2, w3, _lambda = params
#     dLdw1, dLdw2, dLdw3, dLdlam = grad_L(params)
#     return [dLdw1, dLdw2, dLdw3, w1 + w2 + w3 - 1]

In [None]:
# start_time = time()
# w1, w2, w3, _lambda = fsolve(Lagrange_obj, [0.3, 0.3, 0.4, 1.0])
# print(f'[{str(datetime.timedelta(seconds = time() - start_time))[2:7]}] Optimised Weights:', [w1, w2, w3])
# oof_b = w1 * oof[0] + w2 * oof[1] + w3 * oof[2]
# print('Optimised Blend OOF:', log_loss_numpy(oof_b))

In [None]:
# print('Check Condition (1a):', w1 + w2 + w3)
# if abs(w1 + w2 + w3 - 1) <= tol:
#     print('Great! The sum of all weights equals 1!')
# else:
#     print('Manual adjustment is needed to modify the weights.')