-
-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SubsampledNeighborsTransformer: Subsampled nearest neighbors for faster and more space efficient estimators that accept precomputed distance matrices #17843
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR @jenniferjang !
You can run make flake8-diff
locally to find the flake8 errors.
Are there references of this approach being used before https://arxiv.org/abs/2006.06743 ?
Hi @thomasjpfan, thanks for reviewing this. I initially implemented it as a transformer, but do you think it would work better as a function instead? I can't get the test check_methods_subset_invariance to pass for the transformer because in our case, transform generates the distance matrix for the data, which shouldn't satisfy the subset invariance property. To my knowledge I haven't seen this approach being used before in the literature, at least in the context of DBSCAN. |
It's not because of the distance matrix that it shouldn't satisfy the subset invariance property, only because of the random nature of the transformation. See other estimators that disable 'check_methods_subset_invariance', such as |
Sorry for the late response! @jnothman, @thomasjpfan, @cmarmo, it appears that the class I’ve also added a new example, One thing I noticed is that In order to speed up subsampled DBSCAN, I sort the output of |
May you see if |
Hi @thomasjpfan, at your suggestion I looked into pairwise_distances_chunked. It calculates the distance matrix for all pairs of points, right? In that case our performance would be n^2 -- the same as DBSCAN -- and wouldn't work for large inputs. Is there a way for |
Interesting. The pairwise_distances implementation of Euclidean distance is
taking advantage of having a constant norm to use in calculating each row
and column of the distance matrix.
I don't think you can use chunking here in any straightforward way. You
would need to implement a version of paired distances that accepted norms
for each X,Y, as with the fast Euclidean pairwise distances implementation.
|
Hi @jenniferjang, any update on this? |
Reference Issues/PRs
See #17650
What does this implement/fix? Explain your changes.
Instead of calculating the pairwise distances for all pairs of points to obtain nearest-neighbor graphs for estimators like DBSCAN, SubsampledNeighborsTransformer only calculates distances for a fraction s of the pairs (selected uniformly at random). This would make estimators that accept precomputed distance matrices feasible for larger datasets. In a very recent work I did with Google Research [1], we found that you can get over 200x speedup and 250x savings in memory this way without hurting the clustering quality (in some cases s = 0.001 suffices).
[1] https://arxiv.org/abs/2006.06743
Any other comments?