## This tutorial shows how to handle NaN targets in multioutput tasks

### Imports

In [1]:
import os
# Optional: select which GPU device to run on
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

os.makedirs('../data', exist_ok=True)
import numpy as np
import joblib
from sklearn.datasets import make_regression

# simple case - just one class is used
from py_boost import GradientBoosting
from py_boost.multioutput.sketching import RandomProjectionSketch

### Generate a dummy multilabel task with NaN values in the target

Sometimes some target values in a multioutput task are missing. For example, in a multilabel task some labels may be unknown for some of the rows, so each target value is one of 0/1/NaN. Most ML algorithms cannot be trained directly on such targets, so you usually have to do one of the following:

- Drop the rows that contain NaNs, but then you lose part of the data
- Train a separate binary model for each label, but the resulting model is more complex and probably more prone to overfitting
- Fill NaNs with 0 or 1, but then some of your labels become wrong
- Use neural networks with a masked loss function
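
To put a rough number on the first option, the sketch below estimates how many rows survive when every row containing a NaN is dropped. The setup is an assumption chosen to mirror the dummy data generated in this tutorial: 10 labels, each missing independently with probability 0.5. The helper name `complete_row_fraction` is hypothetical.

```python
import numpy as np

def complete_row_fraction(y):
    """Share of rows with no NaN in any target column."""
    return float((~np.isnan(y).any(axis=1)).mean())

rng = np.random.default_rng(0)
n_rows, n_labels, p_missing = 100_000, 10, 0.5  # assumed proportions

# Random 0/1 labels with about half of the entries replaced by NaN
y = rng.integers(0, 2, size=(n_rows, n_labels)).astype(np.float32)
y[rng.random((n_rows, n_labels)) < p_missing] = np.nan

expected = (1 - p_missing) ** n_labels       # 0.5 ** 10, roughly 0.1% of rows
observed = complete_row_fraction(y)          # empirical share of fully labeled rows
known_labels = float((~np.isnan(y)).mean())  # share of individual labels that are known
```

Under these assumptions only about one row in a thousand is fully labeled, while roughly half of the individual labels are known, so dropping rows discards almost all of the usable signal.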

In Py-Boost you can write a loss wrapper that handles this scenario and trains the model directly on the known labels, ignoring NaNs. This tutorial shows how.

We will generate the data as a regression task, threshold the targets to binarize them, and then add some random NaNs.

In [2]:
%%time
X, y = make_regression(150000, 100, n_targets=10, random_state=42)
# binarize
y = (y > y.mean(axis=0)).astype(np.float32)
# add some NaNs
y[np.random.rand(150000, 10) > 0.5] = np.nan


X_test, y_test = X[:50000], y[:50000]
X, y = X[-50000:], y[-50000:]

CPU times: user 2.33 s, sys: 1.66 s, total: 3.99 s
Wall time: 876 ms


### NaN loss and metric wrappers

The following cell shows how to write a loss wrapper that ignores NaNs.

In [3]:
import cupy as cp
from py_boost.gpu.losses import BCELoss

class BCEWithNaNLoss(BCELoss):
    
    def base_score(self, y_true):
        # Use nanmean instead of mean so NaNs are ignored when computing the base score
        means = cp.clip(cp.nanmean(y_true, axis=0), self.clip_value, 1 - self.clip_value)
        return cp.log(means / (1 - means))
    
    def get_grad_hess(self, y_true, y_pred):
        # first, get the NaN mask for y_true
        mask = cp.isnan(y_true)
        # then compute grad and hess with a placeholder value (0) at the NaN positions,
        # just to avoid propagating NaNs
        grad, hess = super().get_grad_hess(cp.where(mask, 0, y_true), y_pred)
        # invert the mask: 1 for known labels, 0 for NaNs
        mask = (~mask).astype(cp.float32)
        # multiply grad and hess by the inverted mask
        # now grad and hess are equal to 0 at the NaN positions,
        # so those entries do not contribute to the update
        grad = grad * mask
        hess = hess * mask
        
        return grad, hess

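The masking logic can be checked on the CPU without py_boost. The sketch below is a NumPy re-implementation under an assumption: it uses the textbook logistic-loss derivatives, gradient `p - y` and hessian `p * (1 - p)` with `p = sigmoid(y_pred)`, which may differ in details (such as clipping) from what `BCELoss` computes internally. The function names and the `clip_value` default are hypothetical.

```python
import numpy as np

def nan_base_score(y_true, clip_value=1e-7):
    # Per-column log-odds of the positive class, using known labels only
    means = np.clip(np.nanmean(y_true, axis=0), clip_value, 1 - clip_value)
    return np.log(means / (1 - means))

def nan_bce_grad_hess(y_true, y_pred):
    mask = np.isnan(y_true)
    # Any placeholder works at NaN positions; those entries are zeroed below
    y_filled = np.where(mask, 0.0, y_true)
    p = 1.0 / (1.0 + np.exp(-y_pred))  # assumed: textbook sigmoid, no clipping
    grad = p - y_filled
    hess = p * (1.0 - p)
    known = (~mask).astype(y_pred.dtype)
    return grad * known, hess * known

y_true = np.array([[1.0, np.nan],
                   [0.0, 1.0],
                   [np.nan, 1.0],
                   [1.0, 0.0]])
y_pred = np.zeros_like(y_true)  # raw scores of 0 give p = 0.5

base = nan_base_score(y_true)
grad, hess = nan_bce_grad_hess(y_true, y_pred)
```

In this toy example `grad` and `hess` are exactly 0 at the two NaN positions, and both columns get a base score of `log(2)` because two of their three known labels are positive.
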

And here is a column-wise ROC AUC metric that ignores NaNs.

In [4]:
from py_boost.gpu.losses.metrics import Metric, auc

class NaNAucMetric(Metric):
    
    def __call__(self, y_true, y_pred, sample_weight=None):
        
        aucs = []
        mask = ~cp.isnan(y_true)
        
        for i in range(y_true.shape[1]):
            m = mask[:, i]
            w = None if sample_weight is None else sample_weight[:, 0][m]
            aucs.append(
                auc(y_true[:, i][m], y_pred[:, i][m], w)
            )
            
        return np.mean(aucs)
    
    def compare(self, v0, v1):
        # higher AUC is better
        return v0 > v1

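The same metric can be sketched on the CPU to sanity-check the masking. This NumPy version is an illustration, not py_boost's `auc`: it computes AUC as the share of (positive, negative) pairs that are ranked correctly, counting ties as half. That is quadratic in the number of rows, so it is only suitable for small arrays. `pairwise_auc` and `nan_colwise_auc` are hypothetical names.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    # Share of (positive, negative) pairs where the positive is scored higher;
    # ties count as half. Assumes both classes are present.
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def nan_colwise_auc(y_true, y_pred):
    aucs = []
    for i in range(y_true.shape[1]):
        known = ~np.isnan(y_true[:, i])  # keep only rows with a known label
        aucs.append(pairwise_auc(y_true[known, i], y_pred[known, i]))
    return float(np.mean(aucs))

y_true = np.array([[1.0, np.nan],
                   [0.0, 1.0],
                   [np.nan, 0.0],
                   [1.0, 0.0]])
y_pred = np.array([[0.9, 0.1],
                   [0.2, 0.5],
                   [0.5, 0.3],
                   [0.7, 0.6]])

score = nan_colwise_auc(y_true, y_pred)
```

In this toy example the first column is ranked perfectly (AUC 1.0) and the second ranks one of its two pairs correctly (AUC 0.5), so `score` is 0.75. The predictions at the NaN positions have no effect on the result.
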
In [5]:
%%time
model = GradientBoosting(BCEWithNaNLoss(), NaNAucMetric(), lr=0.01,
                         verbose=100, ntrees=1000, es=200, multioutput_sketch=RandomProjectionSketch(1))

model.fit(X, y, eval_sets=[{'X': X_test, 'y': y_test},])

[08:31:38] Stdout logging level is INFO.
[08:31:38] GDBT train starts. Max iter 1000, early stopping rounds 200
[08:31:39] Iter 0; Sample 0, score = 0.7906884535541213; 
[08:31:41] Iter 100; Sample 0, score = 0.9687261163054176; 
[08:31:44] Iter 200; Sample 0, score = 0.9785187659166686; 
[08:31:46] Iter 300; Sample 0, score = 0.9844858052685057; 
[08:31:49] Iter 400; Sample 0, score = 0.9883780152591723; 
[08:31:51] Iter 500; Sample 0, score = 0.9908004122540589; 
[08:31:54] Iter 600; Sample 0, score = 0.9923353340683694; 
[08:31:57] Iter 700; Sample 0, score = 0.9935137491384962; 
[08:31:59] Iter 800; Sample 0, score = 0.9943018456130359; 
[08:32:02] Iter 900; Sample 0, score = 0.9949417958344802; 
[08:32:04] Iter 999; Sample 0, score = 0.9954331107999328; 
CPU times: user 32.1 s, sys: 1.59 s, total: 33.7 s
Wall time: 31.9 s


<py_boost.gpu.boosting.GradientBoosting at 0x7f5e559de730>