
SubsampledNeighborsTransformer: Subsampled nearest neighbors for faster and more space efficient estimators that accept precomputed distance matrices #17843

Open · wants to merge 60 commits into base: main
Conversation

@jenniferjang jenniferjang commented Jul 5, 2020

Reference Issues/PRs

See #17650

What does this implement/fix? Explain your changes.

Instead of calculating the pairwise distances between all pairs of points to obtain nearest-neighbor graphs for estimators like DBSCAN, SubsampledNeighborsTransformer only calculates distances for a fraction s of the pairs, selected uniformly at random. This makes estimators that accept precomputed distance matrices feasible for larger datasets. In recent work with Google Research [1], we found that this gives over 200x speedup and 250x memory savings without hurting clustering quality (in some cases s = 0.001 suffices).

[1] https://arxiv.org/abs/2006.06743
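The sampling scheme described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name and the uniform pair sampling follow the description above, while the deduplication step and the plain NumPy Euclidean distance are my assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix


def subsampled_neighbors(X, s=0.1, random_state=0):
    """Distance graph over a random fraction s of all pairs (illustrative)."""
    rng = np.random.RandomState(random_state)
    n = X.shape[0]
    n_pairs = int(s * n * (n - 1) / 2)
    # draw pairs uniformly at random, then drop self-pairs and duplicates
    i = rng.randint(0, n, n_pairs)
    j = rng.randint(0, n, n_pairs)
    keep = i != j
    pairs = np.unique(
        np.stack([np.minimum(i[keep], j[keep]),
                  np.maximum(i[keep], j[keep])], axis=1),
        axis=0)
    i, j = pairs[:, 0], pairs[:, 1]
    # Euclidean distances for the sampled pairs only: O(s * n^2) work
    # instead of O(n^2) for the full pairwise matrix
    d = np.sqrt(((X[i] - X[j]) ** 2).sum(axis=1))
    # symmetric sparse distance graph usable with metric='precomputed'
    return csr_matrix((np.r_[d, d], (np.r_[i, j], np.r_[j, i])),
                      shape=(n, n))
```

The returned CSR matrix stores only the sampled entries, which is where the memory savings come from.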

Any other comments?


@thomasjpfan thomasjpfan left a comment


Thank you for the PR @jenniferjang !

You can run make flake8-diff locally to find the flake8 errors.

Are there references of this approach being used before https://arxiv.org/abs/2006.06743 ?

Resolved review threads: sklearn/neighbors/_subsampled.py (4 threads), sklearn/neighbors/__init__.py (1 thread).
@jenniferjang (Author) replied:

> Thank you for the PR @jenniferjang !
>
> You can run make flake8-diff locally to find the flake8 errors.
>
> Are there references of this approach being used before https://arxiv.org/abs/2006.06743 ?

Hi @thomasjpfan, thanks for reviewing this. I initially implemented it as a transformer, but do you think it would work better as a function instead? I can't get the test check_methods_subset_invariance to pass for the transformer because in our case, transform generates the distance matrix for the data, which shouldn't satisfy the subset invariance property.

To my knowledge I haven't seen this approach being used before in the literature, at least in the context of DBSCAN.


jnothman commented Jul 6, 2020

> I can't get the test check_methods_subset_invariance to pass for the transformer because in our case, transform generates the distance matrix for the data, which shouldn't satisfy the subset invariance property.

It's not the distance matrix that makes it fail the subset invariance property, only the random nature of the transformation. See other estimators that disable check_methods_subset_invariance, such as DummyClassifier and BernoulliRBM.
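The opt-out mechanism referred to here is scikit-learn's estimator tags. A minimal sketch, assuming the `_xfail_checks` tag convention that estimators such as DummyClassifier use (the reason string is illustrative, and fit/transform are omitted):

```python
class SubsampledNeighborsTransformer:
    # fit/transform omitted; only the tag override is sketched here

    def _more_tags(self):
        # Opt out of the subset-invariance common check: transform()
        # draws pairs at random, so a subset of rows cannot reproduce
        # the corresponding rows of the full transform.
        # ("_xfail_checks" is assumed from scikit-learn's estimator-tags
        # convention; verify against the targeted scikit-learn version.)
        return {
            "_xfail_checks": {
                "check_methods_subset_invariance": (
                    "transform samples pairs at random, so results on a "
                    "subset of rows need not match the full transform"
                )
            }
        }
```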


jenniferjang commented Sep 20, 2020

Sorry for the late response! @jnothman, @thomasjpfan, @cmarmo, it appears that the class UnsupervisedMixin, which was one of the parent classes of SubsampledNeighborsTransformer, was removed from scikit-learn, which caused the errors. I've removed dependencies on UnsupervisedMixin, and I believe it builds now.

I’ve also added a new example, plot_subsampled_neighbors_transformer_dbscan.py, which plots DBSCAN and subsampled DBSCAN results side by side. However, this example seems to have build issues, and I'm not sure how to fix them.

One thing I noticed is that paired_distances is awfully inefficient: on 30,000 points, paired_distances alone on 10% of edges took more than twice as much time as it took to run the entirety of DBSCAN without sampling. Depending on the size of the dataset, paired_distances takes between 50% and 80% of the total runtime of subsampled DBSCAN. I’d like to look into improving paired_distances, either before or after finishing this pull request. Otherwise it would be infeasible to use SubsampledNeighborsTransformer to speed up DBSCAN unless the dataset is extremely large.

In order to speed up subsampled DBSCAN, I sort the output of SubsampledNeighborsTransformer’s fit_transform. If you see other ways to make the subsampled_neighbors function faster, please let me know.
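For context, a sparse precomputed distance graph of this kind can be handed to DBSCAN roughly as below. This is a toy sketch: the graph is built here with a full pairwise_distances call as a stand-in for the transformer's output (which is exactly the computation the PR avoids), and using CSR `sort_indices` is my guess at what "sort the output" means.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# two well-separated toy blobs of 5 points each
X = np.r_[np.zeros((5, 2)), np.full((5, 2), 10.0)]
X = X + 0.01 * np.arange(10)[:, None]

# stand-in for the transformer's fit_transform output
graph = csr_matrix(pairwise_distances(X))
graph.sort_indices()  # sort CSR column indices in place

# missing entries in a sparse precomputed matrix are treated as
# "farther than eps", so the subsampled graph drops in directly
labels = DBSCAN(eps=1.0, min_samples=2,
                metric='precomputed').fit_predict(graph)
```

With these blobs, DBSCAN recovers two clusters and no noise points.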

@thomasjpfan (Member) commented:

> One thing I noticed is that paired_distances is awfully inefficient: on 30,000 points, paired_distances alone on 10% of edges took more than twice as much time as it took to run the entirety of DBSCAN without sampling.

Could you see whether pairwise_distances_chunked would improve the performance?


jenniferjang commented Oct 5, 2020

>> One thing I noticed is that paired_distances is awfully inefficient: on 30,000 points, paired_distances alone on 10% of edges took more than twice as much time as it took to run the entirety of DBSCAN without sampling.
>
> Could you see whether pairwise_distances_chunked would improve the performance?

Hi @thomasjpfan, at your suggestion I looked into pairwise_distances_chunked. It calculates the distance matrix for all pairs of points, right? In that case our cost would still be O(n^2), the same as full DBSCAN, and wouldn't work for large inputs. Is there a way for pairwise_distances_chunked to calculate distances for a different (random) subset of neighbors per point? I looked into reduce_func, but as far as I can tell it is applied only after each chunk of distances has already been computed.
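For illustration, pairwise_distances_chunked yields horizontal slabs of the full distance matrix, which is why it bounds peak memory but not the number of distance computations (a small sketch; the data and working_memory value are arbitrary):

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.RandomState(0).rand(500, 3)

n_rows = 0
n_distances = 0
# working_memory is a soft cap in MiB on each chunk's size
for chunk in pairwise_distances_chunked(X, working_memory=1):
    # each chunk is a (chunk_rows, n_samples) slab of the full matrix:
    # peak memory stays bounded, but every one of the n * n distances
    # is still computed, so total work remains O(n^2)
    n_rows += chunk.shape[0]
    n_distances += chunk.size
```

Summing over all chunks reproduces the full n-by-n matrix, one row block at a time.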


jnothman commented Oct 7, 2020 via email

Base automatically changed from master to main January 22, 2021 10:52
@iqbalfarz commented:
Hi @jenniferjang, any update on this?

5 participants