## Py-Boost

In this kernel I introduce my boosting framework, **Py-Boost**, and show how to use it for an imputation task.

The main idea of **Py-Boost** is to give researchers a simple and fast boosting implementation that is easy to customize and convenient for trying out your own ideas. The key feature of **Py-Boost** is that it is written in Python and is nevertheless fast (comparable in speed to state-of-the-art GPU implementations such as **XGBoost** and **CatBoost**), because it relies on GPU computation frameworks such as **CuPy** and **Numba**.

To learn more, visit our [GitHub repo](https://github.com/sb-ai-lab/Py-Boost), where you will find more usage tutorials. If you like this tool, you can also star us :)

There is also an example of training **Py-Boost** on a simple binary task in [this kernel](https://www.kaggle.com/code/btbpanda/fast-metric-and-py-boost-baseline).

My own research currently focuses on applying GBDTs to **multioutput tasks (multiclass, multilabel, and multitask regression)**, so Py-Boost also has a few features that could be helpful for this competition. To learn more, check our [multioutput tutorial](https://github.com/sb-ai-lab/Py-Boost/blob/master/tutorials/Tutorial_2_Advanced_multioutput.ipynb).

## Gradient Boosting for imputation idea

The imputation task is actually very similar to multitask regression, where Py-Boost can be very efficient. In both tasks the model should take an array of features and output an array of target predictions (not a single value). The only difference is that the input and the output for imputation are the same arrays. 

Obviously, if we fit the model as `model.fit(X, X)`, we will definitely overfit, because every column can be predicted from itself. We need to modify the classic boosting scheme to prevent that. Here is the solution:

**At each boosting step, split all columns of X into two parts: randomly decide which columns are features and which are targets**. This way no tree uses a column to predict itself. It is very similar to the common colsample strategy, except that we sample the targets at the same time.

And here I show that Py-Boost is flexible enough to implement such a custom scheme easily :)
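
To make the column split concrete before moving to the real implementation, here is a minimal CPU sketch in plain NumPy. It is only an illustration of the idea: the helper `split_columns` and its `target_rate` parameter are hypothetical names and are not part of Py-Boost.

```python
import numpy as np

def split_columns(n_cols, target_rate, rng):
    """Randomly split column indices into (features, targets) for one boosting step."""
    n_targets = max(1, int(n_cols * target_rate))
    # columns that this step will try to predict
    targets = np.sort(rng.choice(n_cols, size=n_targets, replace=False))
    # all remaining columns are the only inputs the tree may split on
    features = np.setdiff1d(np.arange(n_cols), targets)
    return features, targets

rng = np.random.default_rng(0)
for step in range(3):
    features, targets = split_columns(n_cols=6, target_rate=0.5, rng=rng)
    print(step, "features:", features, "targets:", targets)
```

Since the two index sets are disjoint at every step, a tree built at that step cannot use a column to predict itself, while over many steps every column gets to play both roles.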

## Installation and imports

In [None]:
%%capture
!pip install py-boost

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pandas import Series, DataFrame

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import cudf
import cupy as cp

import joblib

from collections import defaultdict

from py_boost import GradientBoosting # basic GradientBoosting class
from py_boost.gpu.losses import * # utils for the custom loss
from py_boost.multioutput.sketching import * # utils for multioutput
from py_boost.sampling.bagging import BaseSampler # utils for custom sampler
from py_boost.callbacks.callback import Callback # other customization via callback

from IPython.core.display import HTML

## Data processing part

In [None]:
%%time

def get_dtypes(path):
    """
    Get data types
    """
    df = pd.read_csv(path, nrows=2)
    dtypes = {x: np.float32 for x in df.columns}
    dtypes['row_id'] = np.int64
    
    return dtypes

path = '../input/tabular-playground-series-jun-2022/data.csv'
data = cudf.read_csv(path, index_col='row_id', dtype=get_dtypes(path))
# save features names
cols = data.columns.tolist()

As in other public kernels, we will model only the F_4 feature group. Missing values in the other groups will simply be filled with the column means.

In [None]:
%%time
# columns to impute with means
fillna_cols = [x for x in data.columns if int(x.split('_')[1]) in [1, 2, 3]]
fillna_index = [cols.index(x) for x in fillna_cols]
means = data[fillna_cols].mean().to_pandas()

# features to process
features_cols = [x for x in data.columns if int(x.split('_')[1]) in [4]]
features_index = [cols.index(x) for x in features_cols]

# cut dataset
data = data[features_cols]

Here we separate the columns that can contain NULLs from the columns that cannot. In our case all F_4 columns are nullable, so this cell only illustrates the general case, for example if you train the model on both F_2 (a non-nullable group) and F_4.

Nullable columns will be used both as features and as targets; non-nullable columns only as features.

In [None]:
%%time
null_stats = data.isnull().sum()
cols = data.columns.tolist()

# columns that cannot be NULL (empty list in this particular case) - only used as features
not_null_cols = null_stats[null_stats == 0].index.to_pandas().tolist()

# columns that can be NULL (all in our case) - used both as features and targets
null_cols = null_stats[null_stats > 0].index.to_pandas().tolist()

# save numeric indices of targets
target_index = [cols.index(x) for x in null_cols]

# move to NumPy and free GPU memory
X = data.to_pandas().values
index = data.index.to_pandas()

del data

## Separate by NaN count

Previous kernels showed that building separate models for different per-row NaN counts helps training. Although we build a single multioutput model for ALL features, we will still separate the models by NaN count; I checked locally that it helps my approach too. So the final solution consists of 4 models (the groups with 4 and 5 NaNs are merged).

In [None]:
nan_cnt = np.isnan(X).sum(axis=1)
Series(nan_cnt).value_counts()
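
As a small illustration of the grouping described above, the snippet below counts NaNs per row on a toy array and merges the 4 and 5 NaN rows into one group. The names `X_toy`, `N_GROUPS` and `group` are made up for this example and do not appear in the rest of the kernel.

```python
import numpy as np

# toy matrix with 5 columns and a different number of NaNs in each row
X_toy = np.array([
    [1.0, np.nan, 3.0, 4.0, 5.0],
    [np.nan, np.nan, np.nan, np.nan, 5.0],
    [np.nan, np.nan, np.nan, np.nan, np.nan],
    [1.0, 2.0, 3.0, 4.0, 5.0],
])

nan_cnt_toy = np.isnan(X_toy).sum(axis=1)

# group 0 = complete rows (training data), groups 1..4 = one model each;
# rows with 4 or more NaNs share the last group
N_GROUPS = 4
group = np.minimum(nan_cnt_toy, N_GROUPS)
print(nan_cnt_toy, group)
```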

Split into train (rows without NaNs) and test (rows with at least one NaN)

In [None]:
%%time
X_train = X[nan_cnt == 0]

test_sl = nan_cnt > 0
X_test = X[test_sl]

test_index = index[test_sl]
test_nan = nan_cnt[test_sl]

## Customizing the Py-Boost

Here I show how to customize Py-Boost to implement the imputation boosting idea described at the beginning. Custom classes are implemented by overriding the following methods:

- **.before_train**
- **.before_iteration**
- **.after_iteration**
- **.after_train**

All methods take the **build_info** dict as input. It contains the full training and validation data as well as all model parameters. You can also save your own attributes to build_info. The full structure of build_info is presented below. NOTE: all GPU data attributes are stored as CuPy arrays and CPU data as NumPy arrays.

To create a callback, inherit from the Callback class. There are 4 methods that can be redefined:
        - before_train - outputs None
        - before_iteration - outputs None
        - after_train - outputs None
        - after_iteration - outputs bool - whether training should be stopped after the iteration

    Methods receive build_info - the state dict, which can be accessed and modified

    Basic build info structure:

    build_info = {
            'data': {
                'train': {
                    'features_cpu': np.ndarray - raw feature matrix,
                    'features_gpu': cp.ndarray - uint8 quantized feature matrix on GPU,
                    'target': y - cp.ndarray - processed target variable on GPU,
                    'sample_weight': cp.ndarray - processed sample_weight on GPU or None,
                    'ensemble': cp.ndarray - current model prediction (with no postprocessing,
                        ex. before sigmoid for logloss) on GPU,
                    'grad': cp.ndarray of gradients on GPU, before first iteration - None,
                    'hess': cp.ndarray of hessians on GPU, before first iteration - None,

                    'last_tree': {
                        'leaves': cp.ndarray - nodes indices of the last trained tree,
                        'preds': cp.ndarray - predictions of the last trained tree,
                    }

                },
                'valid': {
                    'features_cpu': the same as train, but a list, each element corresponds to one validation sample,
                    'features_gpu': ...,
                    'target': ...,
                    'sample_weight': ...,
                    'ensemble': ...,

                    'last_tree': {
                        'leaves': ...,
                        'preds': ...,
                    }

                }
            },
            'borders': list of np.ndarray - list of quantization borders,
            'model': GradientBoosting - model, that is trained,
            'mempool': cp.cuda.MemoryPool - memory pool used for train, could be used to clean memory to prevent OOM,
            'builder': DepthwiseTreeBuilder - the instance of tree builder, contains training params,

            'num_iter': int, current number of iteration,
            'iter_scores': list of float - list of metric values for all validation sets for the last iteration,
        }
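
To show how these four hooks fit together, here is a library-free mock of the protocol. It is not Py-Boost code: `ToyCallback`, `StopAfter` and `toy_train` are hypothetical stand-ins, and the order in which a real trainer calls the hooks is assumed from the method names above.

```python
class ToyCallback:
    """Stand-in with the four hook names listed above (not the Py-Boost class)."""

    def before_train(self, build_info):
        pass

    def before_iteration(self, build_info):
        pass

    def after_iteration(self, build_info):
        return False  # False means "keep training"

    def after_train(self, build_info):
        pass


class StopAfter(ToyCallback):
    """Logs iteration numbers into build_info and stops after max_iter iterations."""

    def __init__(self, max_iter):
        self.max_iter = max_iter

    def before_train(self, build_info):
        build_info['log'] = []  # custom attribute stored in the shared state

    def before_iteration(self, build_info):
        build_info['log'].append(build_info['num_iter'])

    def after_iteration(self, build_info):
        return build_info['num_iter'] + 1 >= self.max_iter


def toy_train(callbacks, ntrees):
    """Mock training loop that only drives the callbacks."""
    build_info = {'num_iter': 0}
    for cb in callbacks:
        cb.before_train(build_info)
    for i in range(ntrees):
        build_info['num_iter'] = i
        for cb in callbacks:
            cb.before_iteration(build_info)
        # a real trainer would fit one tree here
        stop = any([cb.after_iteration(build_info) for cb in callbacks])
        if stop:
            break
    for cb in callbacks:
        cb.after_train(build_info)
    return build_info


info = toy_train([StopAfter(3)], ntrees=10)
print(info['log'])
```

The point of the mock is that all components communicate only through the shared `build_info` dict, which is how the sampler, the loss callback and the sketch in this kernel pass the current feature/target split to each other.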



The first thing we want to do is customize the **Sampler** class. The **Sampler** in **Py-Boost** defines the strategy for both column and row sampling. Here it is used as a custom column sampler. Remember the idea: at **each boosting step we select a random portion of the columns to be used as features; the others are used as targets**. A small modification: let's select columns with a larger gradient norm as targets more frequently, because they need updating the most.

In [None]:
class ImputeSampler(BaseSampler):
    """
    Class Sampler will sample nullable columns to be target or feature
    at the each boosting step
    """
    def __init__(self, sample=0.8, target_cols=None):
        """
        
        Args:
            sample: float, rate of nullable columns that will be sampled as targets at each step (the rest are used as features)
            target_cols: list of int, indices of columns that could be both features and targets
            
        Returns:

        """
        assert sample < 1, 'Sample should be lower than 1'
        
        self.target_cols = list(target_cols)
        super().__init__(sample, axis=1)
        self._temp = None
        
    def before_train(self, build_info):
        """Here we create indexers for the nullable columns, that could be a target

        Args:
            build_info: dict

        Returns:

        """
        self.length = build_info['data']['train']['features_gpu'].shape[self.axis]
        indexer = np.arange(self.length)
        self.nullable = np.asarray(self.target_cols, dtype=np.uint64)
        self.nout = max(1, int(self.nullable.shape[0] * self.sample))
        
    def before_iteration(self, build_info):
        """Shuffle indexers and save to build info to pass it into the other classes

        Args:
            build_info: dict

        Returns:

        """
        grad = build_info['data']['train']['grad'][:, self.target_cols]
        # use gradient norm as importance measure
        grad_norm = (grad ** 2).sum(axis=0) ** .5
        # normalize to get sampling probabilities
        p = grad_norm / grad_norm.sum()
        # sample target columns
        target = np.sort(np.random.choice(self.nullable, size=self.nout, replace=False, p=p.get()))
        # others are features columns
        features = np.setdiff1d(np.arange(self.length, dtype=np.uint64), target)
        
        build_info['use_as_target'] = cp.asarray(target)
        self._temp = cp.asarray(features)
        build_info['use_as_features'] = self._temp
        
    def __call__(self):
        """Return the indices of features cols

        Returns:

        """

        return self._temp
    
    def after_train(self, build_info):
        
        del self._temp, self.indexer, self.valid_sl
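
The gradient-weighted sampling used in `before_iteration` can be tried on CPU with plain NumPy. The sketch below uses synthetic gradients in which the last column is deliberately much larger than the others; the array sizes and scales are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic gradients: 1000 rows, 5 columns, the last column is far from converged
grad = rng.normal(size=(1000, 5)) * np.array([0.1, 0.1, 0.1, 0.1, 5.0])

# per-column gradient norm, normalized into sampling probabilities
grad_norm = np.sqrt((grad ** 2).sum(axis=0))
p = grad_norm / grad_norm.sum()

# repeat the per-iteration draw many times and count how often each column is a target
counts = np.zeros(5, dtype=int)
for _ in range(2000):
    picked = rng.choice(5, size=2, replace=False, p=p)
    counts[picked] += 1

print(p.round(3), counts)
```

The column with the largest gradient norm is chosen as a target in almost every draw, while the remaining target slots are shared among the other columns. Note that sampling without replacement needs at least `size` non-zero probabilities, otherwise `choice` raises an error.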

The other important thing is to modify the **loss** and the **metric** functions (NOTE: some target values can be NaN). Here we use **MSE** as the loss and **RMSE** as the metric with a small modification: both accept NaN values in the target. The idea is the following: the loss outputs zeros in the gradient and hessian wherever the target is NaN (if we have no information, we should not update the solution). The metric simply ignores NaNs in the target when computing the mean error.

One more thing to implement here is the **LossImputer** class, implemented as a **Callback**. It makes the loss ignore not only NaN targets but also the columns that are selected as features on the current iteration.

In [None]:
class MSEWithNanLoss(MSELoss):
    """
    This is custom MSE Loss that accepts NaN values and ignores features
    """
    def __init__(self, ):
        
        self.feats_cols = None
    
    def get_grad_hess(self, y_true, y_pred):
        """
        
        Args:
            y_true: cp.ndarray of target values
            y_pred: cp.ndarray of predicted values
            
        Returns:

        """
        mask = ~cp.isnan(y_true)
        # replace NaN targets with 0 so the gradient stays finite; they are masked out below
        grad = y_pred - cp.where(mask, y_true, 0)
        hess = mask.astype(cp.float32)
        grad *= hess
        # we will ignore not only NaNs but also columns that are used as features !!!
        if self.feats_cols is not None:
            hess[:, self.feats_cols] = 0
            grad *= hess
        
        return grad, hess

    def base_score(self, y_true):
        """This method defines how to initialize the ensemble
        
        Args:
            y_true: cp.ndarray of target values
            
        Returns:

        """
        return cp.nanmean(y_true, axis=0)
    
    
class LossImputer(Callback):
    """
    This is Callback. It modifies the Loss to ignore features columns
    """
    def before_iteration(self, build_info):
        """
        
        Args:
            build_info: dict
            
        Returns:
        
        """
        build_info['data']['train']['hess'][:, build_info['use_as_features']] = 0
        build_info['data']['train']['grad'] *= build_info['data']['train']['hess']
        
        build_info['model'].loss.feats_cols = build_info['use_as_features']
    
    def after_iteration(self, build_info):
        """
        
        Args:
            build_info: dict
            
        Returns:
        
        """
        build_info['model'].loss.feats_cols = build_info['use_as_features'] = None
    

    
class RMSEWithNaNMetric(RMSEMetric):
    """
    This is a custom RMSE metric that ignores NaN values in the target
    """
    def __init__(self, target_cols):
        """
        
        Args:
            target_cols: list of int, indices of columns that could be both features and targets
            
        Returns:

        """
        self.target_cols = target_cols

    
    def __call__(self, y_true, y_pred, sample_weight=None):
        """
        
        Args:
            y_true: cp.ndarray of target values
            y_pred: cp.ndarray of predicted values
            sample_weight: cp.ndarray of sample weights or None
            
        Returns:

        """
        y_true = y_true[:, self.target_cols]
        y_pred = y_pred[:, self.target_cols]         
        
        mask = ~cp.isnan(y_true)
        
        err = (cp.where(mask, y_true, 0) - y_pred) ** 2
        return err[mask].mean() ** .5
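
The masking logic of the loss and the metric can be checked by hand on a tiny example. The functions below are NumPy re-implementations written only for this check (`masked_grad_hess` and `masked_rmse` are hypothetical names); they mirror the arithmetic of the classes above but do not depend on Py-Boost or a GPU.

```python
import numpy as np

def masked_grad_hess(y_true, y_pred, feature_cols=None):
    """MSE gradient/hessian that is zero for NaN targets and for feature columns."""
    mask = ~np.isnan(y_true)
    hess = mask.astype(np.float32)
    if feature_cols is not None:
        hess[:, feature_cols] = 0
    grad = (y_pred - np.where(mask, y_true, 0)) * hess
    return grad, hess

def masked_rmse(y_true, y_pred):
    """RMSE computed only over the non-NaN target cells."""
    mask = ~np.isnan(y_true)
    err = (np.where(mask, y_true, 0) - y_pred) ** 2
    return float(np.sqrt(err[mask].mean()))

y_true = np.array([[1.0, np.nan], [3.0, 4.0]], dtype=np.float32)
y_pred = np.array([[2.0, 5.0], [3.0, 2.0]], dtype=np.float32)

# NaN cell (row 0, column 1) gets zero gradient and hessian
grad, hess = masked_grad_hess(y_true, y_pred)
# additionally treat column 0 as a feature on this iteration
grad_f, hess_f = masked_grad_hess(y_true, y_pred, feature_cols=[0])

print(grad, hess, grad_f, hess_f, masked_rmse(y_true, y_pred), sep="\n")
```

In the second call only the cell in row 1, column 1 keeps a non-zero gradient, because it is the only cell that is both observed and in a target column.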

The last thing to modify is the **sketching strategy**. I will not explain in depth how it works; in short, it reduces the output dimension at the tree structure search step. To learn how it speeds up training, see the [multioutput tutorial](https://github.com/sb-ai-lab/Py-Boost/blob/master/tutorials/Tutorial_2_Advanced_multioutput.ipynb) on GitHub. The only thing we need to change in the sketch is to tell it which columns are targets on the current iteration.

In [None]:
class ImputeSketch(RandomProjectionSketch):
    """
    Multioutput sketch is actually the main feature of Py-Boost
    that helps it being efficient on multioutput tasks.

    """
    def __init__(self, *args, **kwargs):
        """
        
        Args:
            *args: args of sketch
            **kwargs: kwargs of sketch
            
        Returns:
        
        """
        self.target_cols = None  
        super().__init__(*args, **kwargs)
        
    def before_iteration(self, build_info):
        """Before each iteration it saves target indices
        
        Args:
            build_info: dict
        
        Returns:
        
        """
        self.target_cols = build_info['use_as_target']
        
    def __call__(self, grad, hess):
        """Call method just select the target columns and pass it to the original sketch
        
        Args:
            grad: cp.ndarray, gradient
            hess: cp.ndarray, hessian
            
        Returns:
        
        """
        # select target
        grad = grad[:, self.target_cols]
        
        # dummy hessian of ones is passed to the parent sketch instead of the real one
        hess = cp.ones((grad.shape[0], 1), dtype=cp.float32)
        
        return super().__call__(grad, hess)
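
For intuition about what "reducing the output dimension" means, here is a rough NumPy sketch of a random projection applied to the gradients of the current target columns. It assumes a Gaussian projection matrix; the internals of Py-Boost's `RandomProjectionSketch` may differ, and `random_projection_sketch` is a hypothetical helper written for this example.

```python
import numpy as np

def random_projection_sketch(grad, k, rng):
    """Project an (n_rows, n_outputs) gradient matrix down to (n_rows, k)."""
    n_outputs = grad.shape[1]
    # assumed Gaussian projection, scaled by 1/sqrt(k)
    proj = rng.normal(scale=1.0 / np.sqrt(k), size=(n_outputs, k))
    return grad @ proj

rng = np.random.default_rng(0)
grad = rng.normal(size=(1000, 14))       # gradients for all 14 columns
target_cols = np.array([1, 4, 7, 9])     # targets chosen on this iteration

# keep only the target columns, then compress them to a single output
sketched = random_projection_sketch(grad[:, target_cols], k=1, rng=rng)
print(sketched.shape)
```

The tree structure is then searched using the small sketched matrix instead of the full gradient matrix, which is where the speed-up for multioutput tasks comes from.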

## Training Py-Boost

Here we put it all together and train our models. It takes quite a long time (a few hours).

In [None]:
%%time
test_pred = np.zeros_like(X_test)

N_MOD = 4

for nnull in range(N_MOD):
    
    X_nan = X_train.copy()
    # add random NaNs into the train
    sl = np.random.rand(*X_nan.shape).argsort(axis=1) < nnull
    X_nan[sl] = np.nan
    
    # Py-Boost customization components
    loss = MSEWithNanLoss()
    loss_imp = LossImputer()
    metric = RMSEWithNaNMetric(target_index)
    sketch = ImputeSketch(1)
    sampler = ImputeSampler(0.5, target_index)
    
    # Boosting
    model = GradientBoosting(
            loss, metric,
            ntrees=50000,
            lr=0.05,
            max_depth=6,
            min_data_in_leaf=10,
            lambda_l2=1,
            gd_steps=2,
            quantization='Quantile', 
            colsample=sampler,
            subsample=1, 
            use_hess=True,
            multioutput_sketch=sketch,
            callbacks=[loss_imp, ],
            verbose=10000, 
        )
    
    # fit
    # NOTE: the same data is used as features and targets
    model.fit(X_nan, X_nan)
    
    # predict on current NaN slice
    if (nnull + 1) < N_MOD:
        sl = test_nan == (nnull + 1)
    else:
        sl = test_nan >= (nnull + 1)

    test_pred[sl] += model.predict(X_test[sl], batch_size=1000000)
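
The NaN injection line in the loop above is compact, so here is the same trick on a toy array. Taking `argsort` of a random matrix along each row produces a random permutation of the column indices, and exactly `nnull` entries of a permutation are smaller than `nnull`. The helper name `inject_nans` is hypothetical.

```python
import numpy as np

def inject_nans(X, nnull, rng):
    """Return a copy of X with exactly `nnull` randomly placed NaNs in every row."""
    X_nan = X.copy()
    # each row of `ranks` is a random permutation of 0..n_cols-1
    ranks = rng.random(X_nan.shape).argsort(axis=1)
    X_nan[ranks < nnull] = np.nan
    return X_nan

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 6))

for nnull in range(4):
    X_nan = inject_nans(X_toy, nnull, rng)
    print(nnull, np.isnan(X_nan).sum(axis=1))
```

In the training loop, the model fitted with `nnull` injected NaNs per row is then applied to the test rows with `nnull + 1` missing values, and the last model also covers every row with more missing values than that.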

## Create submission

In [None]:
%%time
sub = pd.read_csv('../input/tabular-playground-series-jun-2022/sample_submission.csv')

f123 = sub['row-col'] \
    .map(lambda x: x.split('-')[1]) \
    .map(means).values

nan_sl = np.isnan(X_test)
rows = np.tile(test_index.values, (X_test.shape[1], 1)).T[nan_sl]
cols = np.tile(features_cols, (X_test.shape[0], 1))[nan_sl]
res = Series(test_pred[nan_sl], index=Series(rows).astype(str) + '-' + Series(cols))

f4 = sub['row-col'].map(res).values

sub['value'] = np.where(np.isnan(f123), f4, f123)

sub.to_csv('submission.csv', index=False)
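
The `np.tile` indexing used to build the `row-col` keys is easier to follow on a 2x2 example. All values below (`row_ids`, `col_names`, `pred_toy`) are made up for the illustration.

```python
import numpy as np
import pandas as pd

X_toy = np.array([[1.0, np.nan], [np.nan, 4.0]])     # data with missing cells
pred_toy = np.array([[1.0, 2.5], [3.5, 4.0]])        # model output, same shape
row_ids = np.array([10, 11])
col_names = ['F_4_0', 'F_4_1']

nan_sl = np.isnan(X_toy)

# matrices of row ids and column names with the same shape as the data,
# indexed by the NaN mask to keep only the missing cells
rows = np.tile(row_ids, (X_toy.shape[1], 1)).T[nan_sl]
cols = np.tile(col_names, (X_toy.shape[0], 1))[nan_sl]

keys = pd.Series(rows).astype(str) + '-' + pd.Series(cols)
res = pd.Series(pred_toy[nan_sl], index=keys)
print(res)
```

The resulting Series maps each `row-col` key of a missing cell to its prediction, so it can be passed directly to `.map` on the submission's `row-col` column.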

Although neural network approaches are more likely to be the better tool in **TPS** competitions, this solution seems to be the best of the boosting family among public kernels and can be competitive on real-world datasets.

Hope you like my approach and the **Py-Boost** framework :) And don't forget to upvote this kernel and star us on GitHub :) Good luck!

In [None]:
s = '<iframe src="https://ghbtns.com/github-btn.html?user=sb-ai-lab&repo=Py-Boost&type=star&count=true&size=large" frameborder="0" scrolling="0" width="170" height="30" title="Py-Boost GitHub"></iframe>'
HTML(s)