ENH make initial binning in HGBT parallel#28064
Conversation
| approximately the same number of samples. | ||
| """ | ||
| # Author: Nicolas Hug | ||
| import concurrent.futures |
There was a problem hiding this comment.
Is there a reason, we never use a ThreadPoolExecutor?
I think loky is preferred and used as a back-end via joblib instead.
There was a problem hiding this comment.
But this here uses ThreadPoolExecutor, not ProcessPoolExecutor.
There was a problem hiding this comment.
I see.
I think @ogrisel is much more knowledgeable than me to make an educated decision.
There was a problem hiding this comment.
This pattern is interesting ; I wonder whether we could abstract it and reuse it in some places.
There was a problem hiding this comment.
I confirm that loky is only useful for process-based parallelism. For Python-level thread-based parallelism, ThreadPoolExecutor is perfectly fine.
We traditionally used Internally, joblib with the "threading" backend uses In cases where we want to hard-code the uses of threads, I think that
Could you please post the results of a quick ad-hoc timeit for that step alone to quantify the speed-up out of curiosity? Note that So when |
|
Note: prior to making the code more complex for parallelization, maybe we should investigate optimizing the single-threaded variant: at the moment we sort each (subsampled) columns data twice: Once in |
|
Edit: Update (long) after merge of #28102, commit 9c5e16d
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper
n_samples, n_features = 200_000, 20
n_bins = 256
X, y = make_classification(n_samples=n_samples, n_features=n_features)
categorical_remapped = np.zeros(n_features, dtype=bool)
bin_mapper = _BinMapper(
n_bins=n_bins,
is_categorical=categorical_remapped,
known_categories=None,
random_state=1,
n_threads=1,
)
%timeit bin_mapper.fit(X) # 378 ms ± 2.57 ms (old 465 ms ± 3.08 ms)
%timeit bin_mapper.fit_transform(X) # 602 ms ± 8.5 ms (old 682 ms ± 4.33 ms)
bin_mapper = _BinMapper(
n_bins=n_bins,
is_categorical=categorical_remapped,
known_categories=None,
random_state=1,
n_threads=4,
)
%timeit bin_mapper.fit(X) # 103 ms ± 1.06 ms (old 137 ms ± 3.04 ms)
%timeit bin_mapper.fit_transform(X) # 188 ms ± 9.76 ms (old 227 ms ± 3.85 ms) |
OmarManzoor
left a comment
There was a problem hiding this comment.
LGTM. Thanks @lorentzenchr
|
So this broke Pyodide, very likely because you can not start a thread in Pyodide, see build log Stack-traceThe work-around is to check whether we are inside Pyodide and not use multi-threading in this case? I guess Pyodide is the only case where you can not create threads. I don't have a strong opinion on using joblib for multi-threading, but using |
Reference Issues/PRs
None
What does this implement/fix? Explain your changes.
This PR make the the finding of thresholds/quantiles in
_BinMapper.fitparallel withwith concurrent.futures.ThreadPoolExecutor, see https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example. This works indeed one of the recommended ways to make numpy parallel in pure Python, see https://numpy.org/doc/stable/reference/random/multithreading.html.Any other comments?
Is there a reason, we never use a ThreadPoolExecutor?
The gain in execution speed is low as only a fraction of the fit time is spend finding the thresholds.