
SubsampledNeighborsTransformer: Subsampled nearest neighbors for faster and more space efficient estimators that accept precomputed distance matrices #17843

Open · wants to merge 60 commits into base: main
Conversation

@jenniferjang jenniferjang commented Jul 5, 2020

Reference Issues/PRs

See #17650

What does this implement/fix? Explain your changes.

Instead of calculating the pairwise distances between all pairs of points to obtain nearest-neighbor graphs for estimators like DBSCAN, SubsampledNeighborsTransformer only calculates distances for a fraction s of the pairs, selected uniformly at random. This makes estimators that accept precomputed distance matrices feasible for larger datasets. In recent work with Google Research [1], we found that this gives over 200x speedup and 250x memory savings without hurting clustering quality (in some cases s = 0.001 suffices).

[1] https://arxiv.org/abs/2006.06743
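The sampling scheme described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name and the uniform pair sampling follow the description above, while the deduplication step and the plain NumPy Euclidean distance are my assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix


def subsampled_neighbors(X, s=0.1, random_state=0):
    """Distance graph over a random fraction s of all pairs (illustrative)."""
    rng = np.random.RandomState(random_state)
    n = X.shape[0]
    n_pairs = int(s * n * (n - 1) / 2)
    # draw pairs uniformly at random, then drop self-pairs and duplicates
    i = rng.randint(0, n, n_pairs)
    j = rng.randint(0, n, n_pairs)
    keep = i != j
    pairs = np.unique(
        np.stack([np.minimum(i[keep], j[keep]),
                  np.maximum(i[keep], j[keep])], axis=1),
        axis=0)
    i, j = pairs[:, 0], pairs[:, 1]
    # Euclidean distances for the sampled pairs only: O(s * n^2) work
    # instead of O(n^2) for the full pairwise matrix
    d = np.sqrt(((X[i] - X[j]) ** 2).sum(axis=1))
    # symmetric sparse distance graph usable with metric='precomputed'
    return csr_matrix((np.r_[d, d], (np.r_[i, j], np.r_[j, i])),
                      shape=(n, n))
```

The returned CSR matrix stores only the sampled entries, which is where the memory savings come from.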

Any other comments?


@thomasjpfan thomasjpfan left a comment


Thank you for the PR @jenniferjang !

You can run make flake8-diff locally to find the flake8 errors.

Are there references of this approach being used before https://arxiv.org/abs/2006.06743 ?

Resolved review threads: sklearn/neighbors/_subsampled.py (4 threads), sklearn/neighbors/__init__.py (1 thread).
@jenniferjang (Author) replied:

> Thank you for the PR @jenniferjang !
>
> You can run make flake8-diff locally to find the flake8 errors.
>
> Are there references of this approach being used before https://arxiv.org/abs/2006.06743 ?

Hi @thomasjpfan, thanks for reviewing this. I initially implemented it as a transformer, but do you think it would work better as a function instead? I can't get the test check_methods_subset_invariance to pass for the transformer because in our case, transform generates the distance matrix for the data, which shouldn't satisfy the subset invariance property.

To my knowledge I haven't seen this approach being used before in the literature, at least in the context of DBSCAN.


jnothman commented Jul 6, 2020

> I can't get the test check_methods_subset_invariance to pass for the transformer because in our case, transform generates the distance matrix for the data, which shouldn't satisfy the subset invariance property.

It's not the distance matrix that makes it fail the subset invariance property, only the random nature of the transformation. See other estimators that disable check_methods_subset_invariance, such as DummyClassifier and BernoulliRBM.
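The opt-out mechanism referred to here is scikit-learn's estimator tags. A minimal sketch, assuming the `_xfail_checks` tag convention that estimators such as DummyClassifier use (the reason string is illustrative, and fit/transform are omitted):

```python
class SubsampledNeighborsTransformer:
    # fit/transform omitted; only the tag override is sketched here

    def _more_tags(self):
        # Opt out of the subset-invariance common check: transform()
        # draws pairs at random, so a subset of rows cannot reproduce
        # the corresponding rows of the full transform.
        # ("_xfail_checks" is assumed from scikit-learn's estimator-tags
        # convention; verify against the targeted scikit-learn version.)
        return {
            "_xfail_checks": {
                "check_methods_subset_invariance": (
                    "transform samples pairs at random, so results on a "
                    "subset of rows need not match the full transform"
                )
            }
        }
```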


jenniferjang commented Sep 20, 2020

Sorry for the late response! @jnothman, @thomasjpfan, @cmarmo, it appears that the class UnsupervisedMixin, which was one of the parent classes of SubsampledNeighborsTransformer, was removed from scikit-learn, which caused the errors. I've removed dependencies on UnsupervisedMixin, and I believe it builds now.

I’ve also added a new example, plot_subsampled_neighbors_transformer_dbscan.py, which plots DBSCAN and subsampled DBSCAN results side by side. However, this example seems to have build issues, and I'm not sure how to fix them.

One thing I noticed is that paired_distances is awfully inefficient: on 30,000 points, paired_distances alone on 10% of edges took more than twice as much time as it took to run the entirety of DBSCAN without sampling. Depending on the size of the dataset, paired_distances takes between 50% and 80% of the total runtime of subsampled DBSCAN. I’d like to look into improving paired_distances, either before or after finishing this pull request. Otherwise it would be infeasible to use SubsampledNeighborsTransformer to speed up DBSCAN unless the dataset is extremely large.

In order to speed up subsampled DBSCAN, I sort the output of SubsampledNeighborsTransformer’s fit_transform. If you see other ways to make the subsampled_neighbors function faster, please let me know.
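For context, a sparse precomputed distance graph of this kind can be handed to DBSCAN roughly as below. This is a toy sketch: the graph is built here with a full pairwise_distances call as a stand-in for the transformer's output (which is exactly the computation the PR avoids), and using CSR `sort_indices` is my guess at what "sort the output" means.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# two well-separated toy blobs of 5 points each
X = np.r_[np.zeros((5, 2)), np.full((5, 2), 10.0)]
X = X + 0.01 * np.arange(10)[:, None]

# stand-in for the transformer's fit_transform output
graph = csr_matrix(pairwise_distances(X))
graph.sort_indices()  # sort CSR column indices in place

# missing entries in a sparse precomputed matrix are treated as
# "farther than eps", so the subsampled graph drops in directly
labels = DBSCAN(eps=1.0, min_samples=2,
                metric='precomputed').fit_predict(graph)
```

With these blobs, DBSCAN recovers two clusters and no noise points.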

@thomasjpfan (Member) commented:

> One thing I noticed is that paired_distances is awfully inefficient: on 30,000 points, paired_distances alone on 10% of edges took more than twice as much time as it took to run the entirety of DBSCAN without sampling.

Could you see whether pairwise_distances_chunked would improve the performance?


jenniferjang commented Oct 5, 2020

>> One thing I noticed is that paired_distances is awfully inefficient: on 30,000 points, paired_distances alone on 10% of edges took more than twice as much time as it took to run the entirety of DBSCAN without sampling.
>
> Could you see whether pairwise_distances_chunked would improve the performance?

Hi @thomasjpfan, at your suggestion I looked into pairwise_distances_chunked. It calculates the distance matrix for all pairs of points, right? In that case our cost would still be O(n^2), the same as full DBSCAN, and wouldn't work for large inputs. Is there a way for pairwise_distances_chunked to calculate distances for a different (random) subset of neighbors per point? I looked into reduce_func, but as far as I can tell it is applied only after each chunk of distances has already been computed.
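For illustration, pairwise_distances_chunked yields horizontal slabs of the full distance matrix, which is why it bounds peak memory but not the number of distance computations (a small sketch; the data and working_memory value are arbitrary):

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.RandomState(0).rand(500, 3)

n_rows = 0
n_distances = 0
# working_memory is a soft cap in MiB on each chunk's size
for chunk in pairwise_distances_chunked(X, working_memory=1):
    # each chunk is a (chunk_rows, n_samples) slab of the full matrix:
    # peak memory stays bounded, but every one of the n * n distances
    # is still computed, so total work remains O(n^2)
    n_rows += chunk.shape[0]
    n_distances += chunk.size
```

Summing over all chunks reproduces the full n-by-n matrix, one row block at a time.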


jnothman commented Oct 7, 2020 via email

Base automatically changed from master to main January 22, 2021 10:52
@iqbalfarz commented:
Hi @jenniferjang, any update on this?

5 participants