Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nearest neighbors with trees perf decreased by debugging stats #13330

Open
rmenuet opened this issue Feb 28, 2019 · 6 comments
Open

Nearest neighbors with trees perf decreased by debugging stats #13330

rmenuet opened this issue Feb 28, 2019 · 6 comments
Labels
help wanted module:neighbors Needs Benchmarks A tag for the issues and PRs which require some benchmarks Performance

Comments

@rmenuet
Copy link
Contributor

rmenuet commented Feb 28, 2019

Description

For ball_tree and kd_tree algorithms, some stats about the tree queries highly decrease the parallelization performances increase.

Those stats are:

  • n_trims: queried points outside node radius
  • n_leaves: leaves reached while querying
  • n_splits: non-leaves queried nodes
  • n_calls: num of computed distances

Those stats only seem useful for debugging, do not look like part of the official API (no documentation) and only 2 (personal) git repos use the method (get_tree_stats) to get them.

Deactivating them highly improves performances of associated algorithms.

Benchmark

Test of kneighbors function with default parameters and:

  • samples dimension: 100
  • fit: 10k samples
  • kneighbors: 10k samples

(also tested openMP prange parallism but it does not improve perf)

=============
=== brute ===
=============
Joblib (loky) :
- n_jobs = 1 (MKL mono threaded) -> 2.6s
- n_jobs = 1 (MKL multi threaded, 40 threads) -> 1.9s
- n_jobs = 4  -> 4.0s
- n_jobs = 10 -> 3.5s
- n_jobs = 40 -> 3.5s

=================
=== ball_tree ===
=================
Joblib (loky) :
- n_jobs = 1  -> 10.9s
- n_jobs = 4  ->  7.7s
- n_jobs = 10 ->  6.8s
- n_jobs = 40 ->  3.8s

Joblib (loky) no stats:
- n_jobs = 1  -> 12.0s
- n_jobs = 4  ->  3.2s
- n_jobs = 10 ->  1.4s
- n_jobs = 40 ->  0.6s

OpenMP no stats:
- n_jobs = 4  ->  3.2s
- n_jobs = 10 ->  1.4s

===============
=== kd_tree ===
===============
Joblib (loky) :
- n_jobs = 1  -> 19.1s
- n_jobs = 4  ->  9.0s
- n_jobs = 10 ->  10.9s
- n_jobs = 40 ->  8.5s

Joblib (loky) no stats:
- n_jobs = 1  -> 19.0s
- n_jobs = 4  ->  5.1s
- n_jobs = 10 ->  2.2s
- n_jobs = 40 ->  1.0s

OpenMP no stats:
- n_jobs = 4  ->  5.1s
- n_jobs = 10 ->  2.2s
@rth
Copy link
Member

rth commented Feb 28, 2019

The performance scaling without stats seems indeed almost perfect (as x 1/n_jobs) while it's much worse when they are enabled.

Do you think it might be possible to optionally allow computing them with some parameter that defaults to False?

@rmenuet
Copy link
Contributor Author

rmenuet commented Mar 1, 2019

The performance scaling without stats seems indeed almost perfect (as x 1/n_jobs) while it's much worse when they are enabled.

Do you think it might be possible to optionally allow computing them with some parameter that defaults to False?

From what I saw:

  • those stats are computed very frequently during tree queries: checking a parameter each time would remove most of the performance gain (we would check a boolean at many places instead of incrementing an int)
  • the solution would be compiling a tree query with or without stats, but it cannot be handled with directives like #ifdef and compilation arguments in Cython (without an additional preprocessor like this one https://github.com/interpreters/pypreprocessor): therefore I would need to duplicate some tree query functions and use them depending on the this parameter (to make the test outside those for loops instead of inside)

@rth
Copy link
Member

rth commented Mar 1, 2019

  • but it cannot be handled with directives like #ifdef and compilation arguments in Cython

Could you elaborate on that? So you are saying something like http://docs.cython.org/en/latest/src/userguide/language_basics.html#conditional-statements won't work?

@rmenuet
Copy link
Contributor Author

rmenuet commented Mar 1, 2019

Could you elaborate on that? So you are saying something like http://docs.cython.org/en/latest/src/userguide/language_basics.html#conditional-statements won't work?

whoops... I did not know about that, I will delve a bit more and test something based on this

@ogrisel
Copy link
Member

ogrisel commented Apr 13, 2021

The performance scaling without stats seems indeed almost perfect (as x 1/n_jobs) while it's much worse when they are enabled.

One possible explanation would be a typical case of False Sharing: CPU cache invalidation by concurrent write access in contiguously allocated data structure fields that live in the same cache line.

@ogrisel
Copy link
Member

ogrisel commented Apr 13, 2021

One way to check this hypothesis would be to use linux perf or cachegrind to collect cache invalidation statistics with and without #19884.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted module:neighbors Needs Benchmarks A tag for the issues and PRs which require some benchmarks Performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants