New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Parallel radius neighbors #10887

Merged
merged 13 commits into from Apr 16, 2018

Conversation

Projects
None yet
3 participants
@recamshak
Contributor

recamshak commented Mar 28, 2018

Reference Issues/PRs

Fixes #8003

What does this implement/fix? Explain your changes.

This makes RadiusNeighborsMixin.radius_neighbors honor the n_jobs argument and split the queries among processors. This also makes query_radius GIL-free so that it is actually faster than single thread.

TODO:

  • fix memory leak by making the indices and distances array own the data
  • fix Windows 32bits issue
  • write tests
  • run some benchmark

Any other comments?

@jnothman

This comment has been minimized.

Member

jnothman commented Mar 28, 2018

@recamshak

This comment has been minimized.

Contributor

recamshak commented Mar 28, 2018

I haven't done a proper benchmark yet but I saw much better runtime on my laptop. I'll do a benchmark today and post the results here.

@recamshak

This comment has been minimized.

Contributor

recamshak commented Mar 29, 2018

I ran 10 times the following benchmark on a Google Cloud instance with 64 vCPUs:

from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.externals.joblib import cpu_count
import time

d = make_blobs(100000, 100)[0]
nn = NearestNeighbors().fit(d)

for n_jobs in range(1, cpu_count() + 1):
    nn.n_jobs = n_jobs
    start = time.time()
    nn.radius_neighbors()
    end = time.time()
    print('{},{}'.format(n_jobs, end - start))

Although the scaling is not linear it definitely runs faster:

image

@jnothman

This comment has been minimized.

Member

jnothman commented Mar 29, 2018

@recamshak

This comment has been minimized.

Contributor

recamshak commented Mar 30, 2018

One thing that was not obvious to me but made a huge difference was that in _query_radius_single doing this:

-                        raise ValueError("Fatal: count out of range. "
-                                         "This should never happen.")
+                        with gil:
+                            raise ValueError("Fatal: count out of range. "
+                                             "This should never happen.")

instead of this:

-                        raise ValueError("Fatal: count out of range. "
-                                         "This should never happen.")
+                        return -1

have a completely different scaling behavior.

In both case the function is nogil and I expected that in the first case the GIL would be acquired only if the with gil: statement is reached. But as explained here any function with a with gil: statement will have to acquire the GIL on return, regardless of whether with gil: was reached or not.

@recamshak recamshak changed the title from [WIP] Parallel radius neighbors to [MRG] Parallel radius neighbors Apr 2, 2018

@recamshak

This comment has been minimized.

Contributor

recamshak commented Apr 2, 2018

@jnothman Thank you for your comments. I think this is ready for review.

@jnothman

This comment has been minimized.

Member

jnothman commented Apr 2, 2018

@TomDLT

This comment has been minimized.

Member

TomDLT commented Apr 5, 2018

Nice work ! The code seems reasonable, though I am a novice in C memory allocation.

As a general comment, your code could benefit from more comments, especially near non-standard function like np.PyArray_SimpleNewFromData, np.PyArray_UpdateFlags, or memcpy.

For instance, you could add comments with the equivalent python expressions:
# equivalent to: distances[i] = np_dist_arr[:counts[i]].copy()

@recamshak

This comment has been minimized.

Contributor

recamshak commented Apr 9, 2018

@TomDLT Thank you for your comment. I added some comments as you suggested.

@TomDLT

TomDLT approved these changes Apr 13, 2018 edited

LGTM
Disclaimer: I am a novice in C memory allocation.

Could you add an entry in doc/whats_new/v0.20.rst?

@TomDLT TomDLT changed the title from [MRG] Parallel radius neighbors to [MRG+1] Parallel radius neighbors Apr 13, 2018

Joel Billaud
@recamshak

This comment has been minimized.

Contributor

recamshak commented Apr 15, 2018

@TomDLT thank you for the review. I updated the release notes.

@jnothman

LGTM! Nice work, thanks

@jnothman

This comment has been minimized.

Member

jnothman commented Apr 16, 2018

Merge when green

@jnothman jnothman merged commit 4335199 into scikit-learn:master Apr 16, 2018

8 checks passed

ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: python2 Your tests passed on CircleCI!
Details
ci/circleci: python3 Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 95.06%)
Details
codecov/project 95.09% (+0.02%) compared to 4989a95
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
lgtm analysis: Python No alert changes
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment