Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix KD Tree node construction #11556

Merged
merged 2 commits into from Jul 19, 2018

Conversation

Projects
None yet
6 participants
@jakevdp
Copy link
Member

jakevdp commented Jul 16, 2018

Fixes #11313

The issue is in pre-computation of KD Tree node radii, which are used during querying to rank the queue of nodes to be split & explored. As such, I think the main effect of this bug will be related to performance – this is why our tests didn't catch it.

@agramfort

This comment has been minimized.

Copy link
Member

agramfort commented Jul 16, 2018

do you have a tiny bench script to share?

@jakevdp

This comment has been minimized.

Copy link
Member Author

jakevdp commented Jul 16, 2018

do you have a tiny bench script to share?

No, I could create one if it would be useful

@agramfort

This comment has been minimized.

Copy link
Member

agramfort commented Jul 16, 2018

@jakevdp

This comment has been minimized.

Copy link
Member Author

jakevdp commented Jul 16, 2018

Here's a quick test:

from sklearn.datasets import make_blobs
from sklearn.neighbors import KDTree

import sklearn
print(sklearn.__version__)

X, y = make_blobs(n_samples=100000, n_features=3)

X1 = X[:50000]
X2 = X[50000:]

%timeit KDTree(X1).query(X2, 1)

Current release:

0.19.1
151 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

On this branch:

0.20.dev0
136 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@TomDLT TomDLT added the Performance label Jul 16, 2018

@jakevdp

This comment has been minimized.

Copy link
Member Author

jakevdp commented Jul 16, 2018

Exploring a bit more... the interesting thing is that the use of the correct radius does not seem to appreciably impact the query time... as currently written, it ends up being essentially a proxy for number of points in the node.

We could probably explicitly rank query nodes by number of points rather than by radius and marginally improve both memory use (from not having to store each node's radius) as well as tree construction time (from not having to compute each node's radius), without an appreciable impact on query time in most cases... but that's probably for another PR.

from sklearn.datasets import make_blobs
from sklearn.neighbors import KDTree

import sklearn
print(sklearn.__version__)

X, y = make_blobs(n_samples=100000, n_features=3)

X1 = X[:50000]
X2 = X[50000:]

%timeit KDTree(X1)

tree = KDTree(X1)
%timeit tree.query(X2, 5)
0.19.1
32.4 ms ± 920 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
214 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
0.20.dev0
14.9 ms ± 550 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
214 ms ± 3.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Jul 17, 2018

Appveyor had failed. I've restarted.

@agramfort

This comment has been minimized.

Copy link
Member

agramfort commented Jul 17, 2018

thx @jakevdp

@jnothman can you reastart appveyor here?

@agramfort

This comment has been minimized.

Copy link
Member

agramfort commented Jul 17, 2018

@jakevdp can you just add a what's new entry? thx

jakevdp added a commit to jakevdp/scikit-learn that referenced this pull request Jul 17, 2018

@rth

rth approved these changes Jul 17, 2018

Copy link
Member

rth left a comment

LGTM, thanks a lot for this fix @jakevdp !

I'm not sure if it's worth waiting for Appveyor or merge as it is, given the issues we have been having lately with it.

@ogrisel

This comment has been minimized.

Copy link
Member

ogrisel commented Jul 17, 2018

This PR is soon to be processed by appveyor, let's wait a bit.

@agramfort

This comment has been minimized.

Copy link
Member

agramfort commented Jul 18, 2018

can someone restart appveyor here?

@agramfort agramfort force-pushed the jakevdp:kd-tree-fix branch from e9b2c5f to 976acb3 Jul 18, 2018

@agramfort

This comment has been minimized.

Copy link
Member

agramfort commented Jul 18, 2018

rebased to restart CIs

@rth

This comment has been minimized.

Copy link
Member

rth commented Jul 19, 2018

Thanks a lot for this fix @jakevdp ! Appveyor passed, merging.

BTW, if you have an opinion on #11103 and #597 (comment) it would be very much appreciated.

@rth rth merged commit 73fc2e4 into scikit-learn:master Jul 19, 2018

7 checks passed

ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: python2 Your tests passed on CircleCI!
Details
ci/circleci: python3 Your tests passed on CircleCI!
Details
codecov/patch Coverage not affected when comparing bf11d44...976acb3
Details
codecov/project 95.3% (+<.01%) compared to bf11d44
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.