Nearest neighbors with trees perf decreased by debugging stats #13330

rmenuet · 2019-02-28T14:42:28Z

Description

For ball_tree and kd_tree algorithms, some stats about the tree queries highly decrease the parallelization performances increase.

Those stats are:

n_trims: queried points outside node radius
n_leaves: leaves reached while querying
n_splits: non-leaves queried nodes
n_calls: num of computed distances

Those stats only seem useful for debugging, do not look like part of the official API (no documentation) and only 2 (personal) git repos use the method (get_tree_stats) to get them.

Deactivating them highly improves performances of associated algorithms.

Benchmark

Test of kneighbors function with default parameters and:

samples dimension: 100
fit: 10k samples
kneighbors: 10k samples

(also tested openMP prange parallism but it does not improve perf)

=============
=== brute ===
=============
Joblib (loky) :
- n_jobs = 1 (MKL mono threaded) -> 2.6s
- n_jobs = 1 (MKL multi threaded, 40 threads) -> 1.9s
- n_jobs = 4  -> 4.0s
- n_jobs = 10 -> 3.5s
- n_jobs = 40 -> 3.5s

=================
=== ball_tree ===
=================
Joblib (loky) :
- n_jobs = 1  -> 10.9s
- n_jobs = 4  ->  7.7s
- n_jobs = 10 ->  6.8s
- n_jobs = 40 ->  3.8s

Joblib (loky) no stats:
- n_jobs = 1  -> 12.0s
- n_jobs = 4  ->  3.2s
- n_jobs = 10 ->  1.4s
- n_jobs = 40 ->  0.6s

OpenMP no stats:
- n_jobs = 4  ->  3.2s
- n_jobs = 10 ->  1.4s

===============
=== kd_tree ===
===============
Joblib (loky) :
- n_jobs = 1  -> 19.1s
- n_jobs = 4  ->  9.0s
- n_jobs = 10 ->  10.9s
- n_jobs = 40 ->  8.5s

Joblib (loky) no stats:
- n_jobs = 1  -> 19.0s
- n_jobs = 4  ->  5.1s
- n_jobs = 10 ->  2.2s
- n_jobs = 40 ->  1.0s

OpenMP no stats:
- n_jobs = 4  ->  5.1s
- n_jobs = 10 ->  2.2s

The text was updated successfully, but these errors were encountered:

rth · 2019-02-28T20:01:21Z

The performance scaling without stats seems indeed almost perfect (as x 1/n_jobs) while it's much worse when they are enabled.

Do you think it might be possible to optionally allow computing them with some parameter that defaults to False?

rmenuet · 2019-03-01T13:42:34Z

The performance scaling without stats seems indeed almost perfect (as x 1/n_jobs) while it's much worse when they are enabled.

Do you think it might be possible to optionally allow computing them with some parameter that defaults to False?

From what I saw:

those stats are computed very frequently during tree queries: checking a parameter each time would remove most of the performance gain (we would check a boolean at many places instead of incrementing an int)
the solution would be compiling a tree query with or without stats, but it cannot be handled with directives like #ifdef and compilation arguments in Cython (without an additional preprocessor like this one https://github.com/interpreters/pypreprocessor): therefore I would need to duplicate some tree query functions and use them depending on the this parameter (to make the test outside those for loops instead of inside)

rth · 2019-03-01T13:51:15Z

but it cannot be handled with directives like #ifdef and compilation arguments in Cython

Could you elaborate on that? So you are saying something like http://docs.cython.org/en/latest/src/userguide/language_basics.html#conditional-statements won't work?

rmenuet · 2019-03-01T14:14:35Z

Could you elaborate on that? So you are saying something like http://docs.cython.org/en/latest/src/userguide/language_basics.html#conditional-statements won't work?

whoops... I did not know about that, I will delve a bit more and test something based on this

ogrisel · 2021-04-13T20:27:57Z

The performance scaling without stats seems indeed almost perfect (as x 1/n_jobs) while it's much worse when they are enabled.

One possible explanation would be a typical case of False Sharing: CPU cache invalidation by concurrent write access in contiguously allocated data structure fields that live in the same cache line.

ogrisel · 2021-04-13T20:32:10Z

One way to check this hypothesis would be to use linux perf or cachegrind to collect cache invalidation statistics with and without #19884.

This was referenced Feb 28, 2019

[WIP] ENH : Nearest-neighbors removal of unused stats computations on fit #13331

Closed

PERF: Replace range by prange when applicable #13213

Open

cmarmo added help wanted Performance module:neighbors Needs Benchmarks A tag for the issues and PRs which require some benchmarks labels Feb 23, 2021

jjerphan mentioned this issue Apr 13, 2021

MAINT Remove statistics from neighbors.BinaryTree #19884

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nearest neighbors with trees perf decreased by debugging stats #13330

Nearest neighbors with trees perf decreased by debugging stats #13330

rmenuet commented Feb 28, 2019

rth commented Feb 28, 2019

rmenuet commented Mar 1, 2019

rth commented Mar 1, 2019

rmenuet commented Mar 1, 2019

ogrisel commented Apr 13, 2021 •

edited

ogrisel commented Apr 13, 2021 •

edited

Nearest neighbors with trees perf decreased by debugging stats #13330

Nearest neighbors with trees perf decreased by debugging stats #13330

Comments

rmenuet commented Feb 28, 2019

Description

Benchmark

rth commented Feb 28, 2019

rmenuet commented Mar 1, 2019

rth commented Mar 1, 2019

rmenuet commented Mar 1, 2019

ogrisel commented Apr 13, 2021 • edited

ogrisel commented Apr 13, 2021 • edited

ogrisel commented Apr 13, 2021 •

edited

ogrisel commented Apr 13, 2021 •

edited