Improve HGBT leaf values updates #22173

Open · jjerphan opened this issue Jan 10, 2022 · 13 comments

@jjerphan (Member) commented Jan 10, 2022

Describe the workflow you want to enable

This issue references this TODO comment:

# TODO: Ideally this should be computed in parallel over the leaves using something
# similar to _update_raw_predictions(), but this requires a cython version of
# median().

Describe your proposed solution

Some APIs (like std::nth_element) allow getting the median or a given quantile of a contiguous buffer in O(n), but they need to mutate a data structure to partially sort it (either the buffer itself or another data structure when using a comparator).

Could it be used there?

Moreover, why can't we compute the median in parallel here?
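For illustration, here is a minimal sketch of the first idea, assuming a hypothetical median_via_partition helper and using np.partition, NumPy's analogue of std::nth_element:

import numpy as np

def median_via_partition(values):
    # Median in O(n) average time via a partial sort. np.partition returns
    # a mutated copy; an in-place variant would shuffle the buffer itself.
    n = values.shape[0]
    mid = n // 2
    part = np.partition(values, mid)  # part[mid] is in its sorted position
    if n % 2 == 1:
        return part[mid]
    # Even case: everything in part[:mid] is <= part[mid], so the lower
    # middle value is the max of the left partition (matching np.median).
    return 0.5 * (part[:mid].max() + part[mid])

rng = np.random.default_rng(0)
x = rng.normal(size=101)
assert median_via_partition(x) == np.median(x)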

Additional context

Follows up on this discussion: https://github.com/scikit-learn/scikit-learn/pull/20811/files/4df17828e439ad09a7784e99cc3d8d956eb50fe0#r781036357


/cc @NicolasHug, who might be interested in following this issue, as he initially worked on the update and authored this comment.

@github-actions bot added the Needs Triage label Jan 10, 2022
@lorentzenchr (Member)

If we do this, we should take care to keep backward compatibility and use NumPy's median definition, implemented here:
https://github.com/numpy/numpy/blob/v1.22.0/numpy/lib/function_base.py#L3801-L3853
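For reference, the convention to preserve for an even number of elements, which any nth_element-based replacement would have to reproduce:

import numpy as np

# NumPy averages the two middle values of the sorted data when the
# number of elements is even.
assert np.median([1.0, 2.0, 3.0, 10.0]) == 2.5  # (2.0 + 3.0) / 2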

Using std::nth_element is using C++. What side effect would that have on Cython C code?

@jjerphan (Member, Author)

Using std::nth_element is using C++. What side effect would that have on Cython C code?

std::nth_element is already used within scikit-learn here:

# BinaryTrees rely on partial sorts to partition their nodes during their
# initialisation.
#
# The C++ std library exposes nth_element, an efficient partial sort for this
# situation, which has linear time complexity as well as the best performance.
#
# To use std::nth_element from <algorithm>, a few fixtures are defined using Cython:
# - partition_node_indices, a Cython function used in BinaryTrees, that calls
# - partition_node_indices_inner, a C++ function that wraps nth_element and uses
# - an IndexComparator to state how to compare KDTrees' indices
#
# IndexComparator has been defined so that partial sorts are stable with
# respect to the nodes' initial indices.
#
# See for reference:
# - https://en.cppreference.com/w/cpp/algorithm/nth_element.
# - https://github.com/scikit-learn/scikit-learn/pull/11103
# - https://github.com/scikit-learn/scikit-learn/pull/19473
cdef extern from *:
    """
    #include <algorithm>

    template<class D, class I>
    class IndexComparator {
    private:
        const D *data;
        I split_dim, n_features;
    public:
        IndexComparator(const D *data, const I &split_dim, const I &n_features):
            data(data), split_dim(split_dim), n_features(n_features) {}

        bool operator()(const I &a, const I &b) const {
            D a_value = data[a * n_features + split_dim];
            D b_value = data[b * n_features + split_dim];
            return a_value == b_value ? a < b : a_value < b_value;
        }
    };

    template<class D, class I>
    void partition_node_indices_inner(
        const D *data,
        I *node_indices,
        const I &split_dim,
        const I &split_index,
        const I &n_features,
        const I &n_points) {
        IndexComparator<D, I> index_comparator(data, split_dim, n_features);
        std::nth_element(
            node_indices,
            node_indices + split_index,
            node_indices + n_points,
            index_comparator);
    }
    """
    void partition_node_indices_inner[D, I](
        D *data,
        I *node_indices,
        I split_dim,
        I split_index,
        I n_features,
        I n_points) except +

Cython is used to interface with this C++ algorithm. The side effect is that it has to modify an array somewhere; that can be a temporary array.

@NicolasHug (Member) commented Jan 13, 2022

Some APIs (like std::nth_element) allow getting the median or a given quantile of a contiguous buffer in O(n), but they need to mutate a data structure to partially sort it (either the buffer itself or another data structure when using a comparator).

Could it be used there?

We should avoid mutating y_pred or raw_predictions, but it's fine to mutate/shuffle the leaf.sample_indices view, so I assume a Comparator could do the job.

Moreover, why can't we compute the median in parallel here?

I'm not sure this will answer your question, but the reason I wrote "this requires a cython version of median()" is that to parallelize the median computation over the leaves, we'd need to do something like:

# Cython code:
for leaf in prange(n_leaves):
    median(...)
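As a rough Python model of that structure (the Leaf stand-in and function names here are hypothetical; in the actual Cython code the loop would be a prange over typed data and median() a nogil C/C++ routine):

import numpy as np
from dataclasses import dataclass

@dataclass
class Leaf:  # hypothetical stand-in for the grower's leaf nodes
    sample_indices: np.ndarray
    value: float = 0.0

def update_leaves_values(leaves, y_true, raw_predictions):
    # In Cython this loop would be `for i in prange(n_leaves, nogil=True)`,
    # which is why median() must be callable without the GIL.
    for leaf in leaves:
        residuals = y_true[leaf.sample_indices] - raw_predictions[leaf.sample_indices]
        leaf.value = np.median(residuals)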

@jjerphan (Member, Author)

Thank you for the update, @NicolasHug.

We could call std::nth_element in a prange; we first need to assess whether it makes sense to optimize this.

@lorentzenchr (Member)

My concern is that adding a single C++ function median and using it inside other Cython code will force the whole HGBT Cython code base to be C++. I'm thinking about the following, but I can't make it work due to Python interactions:

_update_leaves_values calls a new function, defined in a .pyx file:

def median_over_leaves(n_leaves, indices, y_true, raw_prediction, sample_weight):
    """
    indices : list of sample_indices (ndarray)
        Note that each node has sample_indices of a different shape.
    """
    # initialise the array median_leaf
    for i in prange(n_leaves):
        tmp = y_true[indices[i]] - raw_prediction[indices[i]]
        median_leaf[i] = median_cpp(tmp)

As long as indices is a list, this won't work.

@NicolasHug (Member) commented Jan 13, 2022

We can probably work around this "list of arrays of different sizes" issue by doing something similar to https://github.com/scikit-learn/scikit-learn/blob/5d7dc4ba327c138cf63be5cd9238200037c1eb13/sklearn/ensemble/_hist_gradient_boosting/_gradient_boosting.pyx (in particular the start/stop logic), as sketched below.
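A rough sketch of that workaround (helper names are hypothetical): pack the variable-length per-leaf index arrays into one contiguous buffer plus start/stop offsets, so the loop over leaves only touches typed arrays and could become a prange:

import numpy as np

def flatten_leaf_indices(indices_per_leaf):
    # Pack a list of variable-length index arrays into one contiguous
    # buffer plus per-leaf start/stop offsets (CSR-style layout).
    stops = np.cumsum([idx.shape[0] for idx in indices_per_leaf])
    starts = np.concatenate(([0], stops[:-1]))
    flat = np.concatenate(indices_per_leaf)
    return flat, starts, stops

def median_over_leaves(indices_per_leaf, y_true, raw_prediction):
    flat, starts, stops = flatten_leaf_indices(indices_per_leaf)
    medians = np.empty(len(indices_per_leaf))
    # In Cython, this loop could become `for i in prange(n_leaves, nogil=True)`
    # since flat, starts and stops are plain typed arrays.
    for i in range(len(indices_per_leaf)):
        idx = flat[starts[i]:stops[i]]
        medians[i] = np.median(y_true[idx] - raw_prediction[idx])
    return medians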

will force the whole HGBT Cython code base to be C++.

I'm not sure what you mean by that; could you clarify? Do you mean that the Cython code will be compiled to C++ instead of C?

@lorentzenchr (Member)

will force the whole HGBT Cython code base to be C++.

I'm not sure what you mean by that; could you clarify? Do you mean that the Cython code will be compiled to C++ instead of C?

Yes, at least that is my understanding. If we implement median_cpp in Cython, e.g. in file1.pyx, and then use this function inside other Cython code, e.g. file2.pyx calls median_cpp inside a prange, then we need to compile file2.pyx as C++ as well. Or am I mistaken?

@jjerphan (Member, Author)

Using:

py-spy record --native -o profile.svg -f speedscope -- python ./hgbt.py

on:

# hgbt.py
import numpy as np
import os
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression

# Cache the synthetic dataset on disk so repeated profiling runs use identical data.
if os.path.exists("X.npy"):
    X = np.load("X.npy")
    y = np.load("y.npy")
else:
    X, y = make_regression(100_000, 100, n_informative=100)
    np.save("X.npy", X)
    np.save("y.npy", y)

clf = HistGradientBoostingRegressor(loss="absolute_error").fit(X, y)

Searching for _update_leaves_values in Speedscope indicates that 5% of the execution time is spent there (see the highlighted boxes in the image below):

[Screenshot: py-spy profile viewed in Speedscope, with _update_leaves_values highlighted]

Hence, is it worth optimizing?

@NicolasHug (Member)

We'll probably see a stronger effect with more leaves, e.g. with max_leaf_nodes=255 and min_samples_leaf=1?

then we need to compile file2.pyx with C++ as well. Or am I mistaken?

I think that's correct. Do you think it would be a problem? We already inline some C++ code in some places IIRC, so I think our tooling can support that.
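For reference, Cython's C++ mode is set per extension module rather than for the whole code base; a minimal, hypothetical setup.py fragment:

from setuptools import setup, Extension
from Cython.Build import cythonize

# Only the module that uses C++ features (e.g. std::nth_element) needs
# language="c++"; other modules can stay plain C.
setup(ext_modules=cythonize([
    Extension("file2", ["file2.pyx"], language="c++"),
]))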

@ogrisel (Member) commented Jan 14, 2022

I think that's correct. Do you think it would be a problem?

It would be worth quickly checking the impact on the overall build time of scikit-learn and the size of the generated binary.

But even if it has an impact on the build time, I suspect it won't be much for our CI if we properly configure ccache everywhere (which might still not be the case, as was recently discovered on the ARM64 builds).

@jjerphan (Member, Author)

@NicolasHug: using max_leaf_nodes=255 and min_samples_leaf=1 gives similar proportions.
[Screenshot, 2022-01-18: py-spy profile in Speedscope showing similar proportions]

@lorentzenchr (Member)

@jjerphan Does your profiling indicate other room for improvement?

We could at least remove the TODO comment hinting at this issue.

@jjerphan (Member, Author) commented Jan 22, 2022

@jjerphan Does your profiling indicate other room for improvement?

We could at least remove the TODO comment hinting at this issue.

I think the best way to tell is to try optimizing it.

From what I recall of my exploration of the OpenMP parallel sections a few months ago (namely GOMP_parallel, shown above as large portions of the profiling report), it seems that they can't really be improved much: those run in nogil contexts, and the instructions there are simple and already optimized.

@thomasjpfan added the module:ensemble and Performance labels and removed the Needs Triage label Jan 29, 2022