Improve HGBT leaf values updates #22173

Open · jjerphan opened this issue Jan 10, 2022 · 13 comments

@jjerphan (Member) commented Jan 10, 2022

Describe the workflow you want to enable

This issue references this TODO comment:

# TODO: Ideally this should be computed in parallel over the leaves using something
# similar to _update_raw_predictions(), but this requires a cython version of
# median().

Describe your proposed solution

Some APIs (like std::nth_element) allow getting the median or a given quantile of a contiguous buffer in O(n), but they need to mutate a data structure to partially sort it (either the buffer itself or another data structure when using a comparator).

Could it be used there?

Moreover, why can't we compute the median in parallel here?
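For illustration, here is a minimal sketch of the first idea, assuming a hypothetical median_via_partition helper and using np.partition, NumPy's analogue of std::nth_element:

import numpy as np

def median_via_partition(values):
    # Median in O(n) average time via a partial sort. np.partition returns
    # a mutated copy; an in-place variant would shuffle the buffer itself.
    n = values.shape[0]
    mid = n // 2
    part = np.partition(values, mid)  # part[mid] is in its sorted position
    if n % 2 == 1:
        return part[mid]
    # Even case: everything in part[:mid] is <= part[mid], so the lower
    # middle value is the max of the left partition (matching np.median).
    return 0.5 * (part[:mid].max() + part[mid])

rng = np.random.default_rng(0)
x = rng.normal(size=101)
assert median_via_partition(x) == np.median(x)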

Additional context

Follows up on this discussion: https://github.com/scikit-learn/scikit-learn/pull/20811/files/4df17828e439ad09a7784e99cc3d8d956eb50fe0#r781036357


/cc @NicolasHug, who might be interested in following this issue, as he initially worked on the update and authored this comment.

@github-actions bot added the Needs Triage label Jan 10, 2022
@lorentzenchr (Member)

If we do this, we should take care to keep backward compatibility and use NumPy's median definition, implemented here:
https://github.com/numpy/numpy/blob/v1.22.0/numpy/lib/function_base.py#L3801-L3853
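For reference, the convention to preserve for an even number of elements, which any nth_element-based replacement would have to reproduce:

import numpy as np

# NumPy averages the two middle values of the sorted data when the
# number of elements is even.
assert np.median([1.0, 2.0, 3.0, 10.0]) == 2.5  # (2.0 + 3.0) / 2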

Using std::nth_element is using C++. What side effect would that have on Cython C code?

@jjerphan (Member, Author)

Using std::nth_element is using C++. What side effect would that have on Cython C code?

std::nth_element is already used within scikit-learn here:

# BinaryTrees rely on partial sorts to partition their nodes during their
# initialisation.
#
# The C++ std library exposes nth_element, an efficient partial sort for this
# situation, which has linear time complexity as well as the best performance.
#
# To use std::nth_element from <algorithm>, a few fixtures are defined using Cython:
# - partition_node_indices, a Cython function used in BinaryTrees, that calls
# - partition_node_indices_inner, a C++ function that wraps nth_element and uses
# - an IndexComparator to state how to compare KDTrees' indices
#
# IndexComparator has been defined so that partial sorts are stable with
# respect to the nodes' initial indices.
#
# See for reference:
# - https://en.cppreference.com/w/cpp/algorithm/nth_element.
# - https://github.com/scikit-learn/scikit-learn/pull/11103
# - https://github.com/scikit-learn/scikit-learn/pull/19473
cdef extern from *:
    """
    #include <algorithm>

    template<class D, class I>
    class IndexComparator {
    private:
        const D *data;
        I split_dim, n_features;
    public:
        IndexComparator(const D *data, const I &split_dim, const I &n_features):
            data(data), split_dim(split_dim), n_features(n_features) {}

        bool operator()(const I &a, const I &b) const {
            D a_value = data[a * n_features + split_dim];
            D b_value = data[b * n_features + split_dim];
            return a_value == b_value ? a < b : a_value < b_value;
        }
    };

    template<class D, class I>
    void partition_node_indices_inner(
        const D *data,
        I *node_indices,
        const I &split_dim,
        const I &split_index,
        const I &n_features,
        const I &n_points) {
        IndexComparator<D, I> index_comparator(data, split_dim, n_features);
        std::nth_element(
            node_indices,
            node_indices + split_index,
            node_indices + n_points,
            index_comparator);
    }
    """
    void partition_node_indices_inner[D, I](
        D *data,
        I *node_indices,
        I split_dim,
        I split_index,
        I n_features,
        I n_points) except +

Cython is used to interface with this C++ algorithm. The side effect is that it has to modify an array somewhere; that can be a temporary array.

@NicolasHug (Member) commented Jan 13, 2022

Some APIs (like std::nth_element) allow getting the median or a given quantile of a contiguous buffer in O(n), but they need to mutate a data structure to partially sort it (either the buffer itself or another data structure when using a comparator).

Could it be used there?

We should avoid mutating y_pred or raw_predictions, but it's fine to mutate/shuffle the leaf.sample_indices view, so I assume a Comparator could do the job.

Moreover, why can't we compute the median in parallel here?

I'm not sure this will answer your question, but the reason I wrote "this requires a cython version of median()" is that to parallelize the median computation over the leaves, we'd need to do something like:

# Cython code:
for leaf in prange(n_leaves):
    median(...)
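As a rough Python model of that structure (the Leaf stand-in and function names here are hypothetical; in the actual Cython code the loop would be a prange over typed data and median() a nogil C/C++ routine):

import numpy as np
from dataclasses import dataclass

@dataclass
class Leaf:  # hypothetical stand-in for the grower's leaf nodes
    sample_indices: np.ndarray
    value: float = 0.0

def update_leaves_values(leaves, y_true, raw_predictions):
    # In Cython this loop would be `for i in prange(n_leaves, nogil=True)`,
    # which is why median() must be callable without the GIL.
    for leaf in leaves:
        residuals = y_true[leaf.sample_indices] - raw_predictions[leaf.sample_indices]
        leaf.value = np.median(residuals)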

@jjerphan (Member, Author)

Thank you for the update, @NicolasHug.

We could call std::nth_element in a prange; we first need to assess whether it makes sense to optimize this.

@lorentzenchr (Member)

My concern is that adding a single C++ function median and using it inside other Cython code will force the whole HGBT Cython code base to be C++. I'm thinking about the following, but I can't make it work due to Python interactions:

_update_leaves_values calls a new function, defined in a .pyx file:

def median_over_leaves(n_leaves, indices, y_true, raw_prediction, sample_weight):
    """
    indices : list of sample_indices (ndarray)
        Note that each node has sample_indices of a different shape.
    """
    # initialise the array median_leaf
    for i in prange(n_leaves):
        tmp = y_true[indices[i]] - raw_prediction[indices[i]]
        median_leaf[i] = median_cpp(tmp)

As long as indices is a list, this won't work.

@NicolasHug (Member) commented Jan 13, 2022

We can probably work around this "list of arrays of different sizes" issue by doing something similar to https://github.com/scikit-learn/scikit-learn/blob/5d7dc4ba327c138cf63be5cd9238200037c1eb13/sklearn/ensemble/_hist_gradient_boosting/_gradient_boosting.pyx (in particular the start/stop logic), as sketched below.
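A rough sketch of that workaround (helper names are hypothetical): pack the variable-length per-leaf index arrays into one contiguous buffer plus start/stop offsets, so the loop over leaves only touches typed arrays and could become a prange:

import numpy as np

def flatten_leaf_indices(indices_per_leaf):
    # Pack a list of variable-length index arrays into one contiguous
    # buffer plus per-leaf start/stop offsets (CSR-style layout).
    stops = np.cumsum([idx.shape[0] for idx in indices_per_leaf])
    starts = np.concatenate(([0], stops[:-1]))
    flat = np.concatenate(indices_per_leaf)
    return flat, starts, stops

def median_over_leaves(indices_per_leaf, y_true, raw_prediction):
    flat, starts, stops = flatten_leaf_indices(indices_per_leaf)
    medians = np.empty(len(indices_per_leaf))
    # In Cython, this loop could become `for i in prange(n_leaves, nogil=True)`
    # since flat, starts and stops are plain typed arrays.
    for i in range(len(indices_per_leaf)):
        idx = flat[starts[i]:stops[i]]
        medians[i] = np.median(y_true[idx] - raw_prediction[idx])
    return medians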

will force the whole HGBT Cython code base to be C++.

I'm not sure what you mean by that; could you clarify? Do you mean that the Cython code will be compiled to C++ instead of C?

@lorentzenchr (Member)

will force the whole HGBT Cython code base to be C++.

I'm not sure what you mean by that; could you clarify? Do you mean that the Cython code will be compiled to C++ instead of C?

Yes, at least that is my understanding. If we implement median_cpp in Cython, e.g. in file1.pyx, and then use this function inside other Cython code, e.g. file2.pyx calls median_cpp inside a prange, then we need to compile file2.pyx as C++ as well. Or am I mistaken?

@jjerphan (Member, Author)

Using:

py-spy record --native -o profile.svg -f speedscope -- python ./hgbt.py

on:

# hgbt.py
import numpy as np
import os
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression

# Cache the synthetic dataset on disk so repeated profiling runs use identical data.
if os.path.exists("X.npy"):
    X = np.load("X.npy")
    y = np.load("y.npy")
else:
    X, y = make_regression(100_000, 100, n_informative=100)
    np.save("X.npy", X)
    np.save("y.npy", y)

clf = HistGradientBoostingRegressor(loss="absolute_error").fit(X, y)

Searching for _update_leaves_values in Speedscope indicates that 5% of the execution time is spent there (see the highlighted boxes in the image below):

[Screenshot: py-spy profile viewed in Speedscope, with _update_leaves_values highlighted]

Hence, is it worth optimizing?

@NicolasHug (Member)

We'll probably see a stronger effect with more leaves, e.g. with max_leaf_nodes=255 and min_samples_leaf=1?

then we need to compile file2.pyx with C++ as well. Or am I mistaken?

I think that's correct. Do you think it would be a problem? We already inline some C++ code in some places IIRC, so I think our tooling can support that.
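For reference, Cython's C++ mode is set per extension module rather than for the whole code base; a minimal, hypothetical setup.py fragment:

from setuptools import setup, Extension
from Cython.Build import cythonize

# Only the module that uses C++ features (e.g. std::nth_element) needs
# language="c++"; other modules can stay plain C.
setup(ext_modules=cythonize([
    Extension("file2", ["file2.pyx"], language="c++"),
]))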

@ogrisel (Member) commented Jan 14, 2022

I think that's correct. Do you think it would be a problem?

It would be worth quickly checking the impact on the overall build time of scikit-learn and the size of the generated binary.

But even if it has an impact on the build time, I suspect it won't be much for our CI if we properly configure ccache everywhere (which might still not be the case, as was recently discovered on the ARM64 builds).

@jjerphan (Member, Author)

@NicolasHug: using max_leaf_nodes=255 and min_samples_leaf=1 gives similar proportions.
[Screenshot, 2022-01-18: py-spy profile in Speedscope showing similar proportions]

@lorentzenchr (Member)

@jjerphan Does your profiling indicate other room for improvement?

We could at least remove the TODO comment hinting at this issue.

@jjerphan (Member, Author) commented Jan 22, 2022

@jjerphan Does your profiling indicate other room for improvement?

We could at least remove the TODO comment hinting at this issue.

I think the best way to tell is to try optimizing it.

From what I recall of my exploration of the OpenMP parallel sections a few months ago (namely GOMP_parallel, shown above as large portions of the profiling report), it seems that they can't really be improved much: those run in nogil contexts, and the instructions there are simple and already optimized.

@thomasjpfan added the module:ensemble and Performance labels and removed the Needs Triage label Jan 29, 2022