
FEA Generalize the use of precomputed sparse distance matr… #10482

Merged

132 commits merged into scikit-learn:master from TomDLT:nn_api on Sep 18, 2019

Conversation

TomDLT (Member) commented Jan 16, 2018

This PR implements solution (A) proposed in #10463.

Fixes #2792, #7426, #9691, #9994, #10463, #13596
Closes #3922, #7876, #8999, #10206
Gives a workaround for #8611

It generalizes the use of precomputed sparse distance matrices and proposes to use pipelines to chain nearest neighbors with other estimators.
More precisely, it introduces new estimators KNeighborsTransformer and RadiusNeighborsTransformer, which transform an input X into a (weighted) graph of its nearest neighbors. The output is a sparse distance/affinity matrix.

The proposed use case is the following:

from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsTransformer
from sklearn.manifold import TSNE

n_neighbors, metric = 30, 'euclidean'  # example values

est_chain = make_pipeline(
    KNeighborsTransformer(n_neighbors=n_neighbors, mode='distance',
                          metric=metric, include_self=False),
    TSNE(metric='precomputed', method='barnes_hut'),
    memory='path/to/cache')

est_compact = TSNE(metric=metric, method='barnes_hut')

In this example est_chain and est_compact are equivalent, but est_chain gives more control over the nearest neighbors estimator. Moreover, it can exploit the pipeline's caching to reuse the same nearest neighbors matrix for different TSNE parameters (a sketch of this pattern follows). Finally, it is possible to replace KNeighborsTransformer with a custom nearest neighbors estimator that returns the graph of connections to the nearest neighbors.
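Since the transformer's parameters stay fixed while the TSNE parameters vary, pipeline caching fits the neighbors only once. A minimal sketch of this pattern, assuming an illustrative random dataset and parameter values:

from tempfile import mkdtemp

import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsTransformer
from sklearn.pipeline import make_pipeline

X = np.random.RandomState(0).rand(1000, 10)
cache_dir = mkdtemp()

for perplexity in [10, 30, 50]:
    est = make_pipeline(
        # 200 neighbors covers 3 * perplexity + 1 for every value below
        KNeighborsTransformer(n_neighbors=200, mode='distance'),
        TSNE(metric='precomputed', method='barnes_hut',
             perplexity=perplexity),
        memory=cache_dir)
    embedding = est.fit_transform(X)  # neighbors are fitted once, then cached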


This pipeline works with:

  • TSNE, Isomap, SpectralEmbedding
  • DBSCAN, SpectralClustering
  • KNeighborsClassifier, RadiusNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsRegressor, LocalOutlierFactor

This pipeline cannot work with:

  • LocallyLinearEmbedding (which needs the original X)

Possible improvements:

  • Add a kernel parameter to transform distances into affinities (a sketch of this transformation follows the list)
  • Add symmetric option for affinities (to remove warning in SpectralEmbedding, SpectralClustering)
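
For illustration, such a kernel would map the stored distances of the sparse graph through, e.g., an RBF kernel. A minimal sketch of the transformation, assuming an arbitrary median-based bandwidth (no kernel parameter exists yet):

import numpy as np
from sklearn.neighbors import KNeighborsTransformer

X = np.random.RandomState(0).rand(100, 2)
graph = KNeighborsTransformer(n_neighbors=10, mode='distance').fit_transform(X)

# Map distances to RBF affinities in place; sigma is an illustrative
# bandwidth (median of the stored distances), not a proposed parameter.
sigma = np.median(graph.data)
graph.data = np.exp(-graph.data ** 2 / (2 * sigma ** 2))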
jnothman (Member) left a comment

The changes to neighbors/base.py more-or-less duplicate a limited version of the work in #10206. Please consider adopting at least the tests from neighbors/tests/test_neighbors.py there, if not the implementation (but that is more WIP).

jnothman (Member) left a comment

I think this is a good design. It makes using precomputed sparse matrices in DBSCAN much clearer, and it avoids complicated new API design proposed in #8999. It should be easy for existing ANN implementations to be adapted to this design.

The key caveat is that it requires all required neighborhoods to be computed in advance. This may still be a memory burden for something like DBSCAN (although our current implementation is already memory heavy). It also doesn't handle the case where an estimator repeatedly fits a nearest neighbors object, but I think such cases are rare.

It would be really good if we could illustrate plugging in an approximate nearest neighbors estimator, but that would likely require adding a pyann or similar dependency, unless we attempt to write a very short LSH just to prove the point.

But I think for consistency we need to have RadiusNeighborsTransformer and KNeighborsTransformer separately.

I also think we should adopt or adapt as much of the existing work at #10206 as possible. We've long iterated on it already.

rth (Member) commented Jan 23, 2018

Nice work @TomDLT and thanks for the summary in #10463.

> The key caveat is that it requires all required neighborhoods to be computed in advance. This may be a memory burden still for something like DBSCAN (although our current implementation is already memory heavy).

> It would be really good if we could illustrate plugging in an approximate nearest neighbors estimator, but it would likely include adding a pyann or similar dependency, unless we attempt to write a very short LSH just to prove the point.

The API with a transformer looks quite nice, and caching the NN can certainly help for repeated estimator fits with different parameters. However, as far as ANN is concerned, I don't find it immediately obvious that pre-computing a sparse distance matrix with ANN and, e.g., using it in DBSCAN performs equivalently to using the ANN directly in the estimator.

For instance, I would be curious to know the results of the following benchmark. Take some low-dimensional data where we know that kd_tree will perform well (e.g. n_features=2). Use it to pre-compute the sparse distance matrix and fit DBSCAN. Compare the performance to using kd_tree directly in DBSCAN. This would allow estimating the overhead of this approach, without necessarily having to add an ANN implementation. (A sketch of such a benchmark follows.)
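
A sketch of that benchmark under the proposed API (the dataset, eps, and timing scaffolding are illustrative; the equality assertion assumes DBSCAN is deterministic given identical neighborhoods):

import time

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import RadiusNeighborsTransformer

X = np.random.RandomState(42).rand(20000, 2)  # low-dim: kd_tree performs well
eps = 0.01

tic = time.time()
labels_direct = DBSCAN(eps=eps, algorithm='kd_tree').fit_predict(X)
t_direct = time.time() - tic

tic = time.time()
graph = RadiusNeighborsTransformer(radius=eps, mode='distance',
                                   algorithm='kd_tree').fit_transform(X)
labels_chained = DBSCAN(eps=eps, metric='precomputed').fit_predict(graph)
t_chained = time.time() - tic

assert (labels_direct == labels_chained).all()
print('direct: {:.2f}s, precomputed: {:.2f}s'.format(t_direct, t_chained))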

jnothman (Member) commented Jan 23, 2018

I'm not sure what you're trying to show about ANN with that, @rth. Currently DBSCAN calculates the nearest neighbors of all X once and computes the clustering from that. There is some overhead when passing in sparse D rather than just passing in features X: unpacking, thresholding on radius and repacking.

This is not how it was originally implemented, mind you, and was optimised for speed rather than memory (see e.g. #5275). Perhaps with a *NeighborsTransformer, we could go back to that original low memory implementation, but get most of the speed benefits as long as the user provides the precomputed sparse matrix.

One disadvantage of this design as opposed to #8999 is that there the host algorithm can pass in a radius or an n_neighbors, which would save time relative to precomputing and then filtering/checking the precomputed input. We also do not currently require that the precomputed input is in sorted order, so there is an O(n_train_samples log n_train_samples) cost per query to sort it. If we'd rather, we could make some constraints on precomputed sparse matrix input, corresponding to *neighbors_graph(None, mode='distance') output (or check it and raise a warning but accept it): it needs to be in CSR format with each row's data increasing, and at train time the diagonal needs to be implicit, not stored zeros.

One advantage of this design as opposed to #8999 is that it may allow you to precompute more neighbors than you need, then estimate downstream with various parameters (e.g. DBSCAN.eps); thus it works well with caching pipelines. Of course you could do that with #8999 by implementing your own precomputed neighborhood wrapper.

Regardless of all this, I think it would be good to give an example of, or at least describe, the use of this for exploiting approximate nearest neighbor approaches. Writing up a simple (but not super-efficient) gaussian projection hash, or a minhash, might not be excessive.
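
For concreteness, here is a rough sketch of such a gaussian projection hash: a single hash table that buckets points by the sign pattern of random projections and runs brute force within each bucket (lsh_knn_graph is a hypothetical helper; rows in small buckets may end up with fewer than n_neighbors entries, and rows are not guaranteed to be sorted by distance, which a consumer can detect and fix as discussed above):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics import pairwise_distances


def lsh_knn_graph(X, n_neighbors, n_bits=8, random_state=0):
    rng = np.random.RandomState(random_state)
    planes = rng.randn(X.shape[1], n_bits)
    # Bucket key: the sign pattern of the projections, read as an integer.
    keys = (X @ planes > 0) @ (1 << np.arange(n_bits))
    rows, cols, vals = [], [], []
    for key in np.unique(keys):
        idx = np.flatnonzero(keys == key)
        D = pairwise_distances(X[idx])
        k = min(n_neighbors + 1, len(idx))  # +1: each point finds itself
        nearest = np.argsort(D, axis=1)[:, :k]
        for r, neigh in enumerate(nearest):
            rows.extend([idx[r]] * len(neigh))
            cols.extend(idx[neigh])
            vals.extend(D[r, neigh])
    return csr_matrix((vals, (rows, cols)), shape=(X.shape[0], X.shape[0]))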

TomDLT (Member, Author) commented Jan 23, 2018

Thanks for the feedback.

additional TODO:

  • Separate into RadiusNeighborsTransformer and KNeighborsTransformer
  • Consider adding a constraint on graph sorting order
  • Adapt as much of #10206 as possible, in particular unpacking with different n_neighbors/radius
  • Add small speed benchmark to quantify unpacking overhead
  • Add example with ANN
jnothman (Member) commented Jan 23, 2018

I suggest we do something like:

import warnings

import numpy as np
from scipy.sparse import issparse

from sklearn.exceptions import EfficiencyWarning
from sklearn.utils import check_array


def _check_precomputed(D, n_neighbors=None):
    if not issparse(D):
        return check_array(D, ensure_min_features=n_neighbors)
    if D.format not in ('csr', 'csc', 'coo', 'lil'):
        raise TypeError('Sparse matrix in {!r} format is not supported due '
                        'to its handling of explicit zeros'.format(D.format))
    copied = D.format != 'csr'
    D = D.tocsr()
    if n_neighbors is not None and np.diff(D.indptr).min() < n_neighbors:
        raise ValueError('Provide at least {} precomputed sparse neighbors. '
                         'Got {}'.format(n_neighbors, np.diff(D.indptr).min()))
    # Check that each row's data is sorted: flag adjacent out-of-order
    # pairs, then discount the pairs that straddle a row boundary.
    out_of_order = D.data[:-1] > D.data[1:]
    if out_of_order.sum() != out_of_order.take(D.indptr[1:-1] - 1,
                                               mode='clip').sum():
        warnings.warn('Precomputed sparse input was not sorted by data.',
                      EfficiencyWarning)
        if not copied:
            D = D.copy()
        # Stable row-wise sort of data (and matching indices) by distance.
        for start, stop in zip(D.indptr, D.indptr[1:]):
            order = np.argsort(D.data[start:stop], kind='mergesort')
            D.data[start:stop] = D.data[start:stop][order]
            D.indices[start:stop] = D.indices[start:stop][order]
    return D

It's a pity the sorting check takes O(nnz) memory and time.

jnothman (Member) left a comment

Nice example and benchmark! Seeing as it's too slow to run with visible benefits in the example gallery anyway, keeping it there without requiring the dependency looks like a reasonable idea.

TomDLT and others added 4 commits Jan 26, 2018, resolving merge conflicts in sklearn/manifold/t_sne.py and sklearn/manifold/tests/test_t_sne.py
@TomDLT TomDLT force-pushed the TomDLT:nn_api branch from ee1303f to e75ea28 Jan 26, 2018
@TomDLT TomDLT force-pushed the TomDLT:nn_api branch from e75ea28 to c573bc2 Jan 26, 2018
jnothman (Member) commented Jan 27, 2018

Everything is red (and that might be my fault).

thomasjpfan (Member) left a comment

Minor comment

thomasjpfan (Member) commented Sep 18, 2019

One of the selling points of this PR is to allow external libraries to provide their own sparse graphs with their own transformers. We should make it clear what makes a custom KNeighborsTransformer or RadiusNeighborsTransformer compatible with our API.

At the moment, it is unclear when to set n_neighbors = self.n_neighbors + 1 in the implementation of a custom *NeighborsTransformer. Does this only need to happen for k-neighbors transformers that output distance-based sparse graphs?

thomasjpfan (Member) commented Sep 18, 2019

We kind of hide this implementation detail in our KNeighborsTransformer, but a custom *NeighborsTransformer would need to know when to do this (which is currently only shown in the example).

thomasjpfan (Member) left a comment

After the additional commits that add details about KNeighborsTransformer, I think this PR is ready to merge.

thomasjpfan (Member) left a comment

Minor versionadded in docstrings

TomDLT (Member, Author) commented Sep 18, 2019

> I think this PR is ready to merge.

Awesome!

> it is unclear when to set n_neighbors = self.n_neighbors + 1 in the implementation of a custom *NeighborsTransformer

  • This happens only for KNeighborsTransformer, since RadiusNeighborsTransformer does not need an n_neighbors parameter.
  • This is not really a strong API specification, since passing too many neighbors will always work (the following estimators will just filter the extra neighbors). Passing too few neighbors will raise an error.
  • The mode == 'distance' distinction is only for internal consistency of n_neighbors definitions. A custom estimator does not need to make this distinction.

I added a bit more detail in the documentation; tell me if you think anything else should be added.
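
To make these conventions concrete, here is a minimal sketch of a custom transformer, wrapping scikit-learn's exact NearestNeighbors for simplicity (BruteKNNTransformer is a hypothetical name; a real ANN library would replace the internal estimator):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import NearestNeighbors


class BruteKNNTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        # Ask for one extra neighbor: downstream estimators that count each
        # query point as its own neighbor will still get enough entries.
        self.nn_ = NearestNeighbors(n_neighbors=self.n_neighbors + 1).fit(X)
        return self

    def transform(self, X):
        # kneighbors_graph returns a CSR matrix whose rows are sorted by
        # increasing distance, as downstream estimators expect.
        return self.nn_.kneighbors_graph(X, mode='distance')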

thomasjpfan (Member) commented Sep 18, 2019

> To maximise compatibility with all estimators, a safe choice is to always include one extra neighbor in a custom nearest neighbors estimator, since unnecessary neighbors will be filtered by following estimators.

I like this. As a user, I would always include one more just to keep things simple. (Or a few more, to have the flexibility to explore parameters downstream.)

@thomasjpfan thomasjpfan changed the title [MRG+1] Generalize the use of precomputed sparse distance matrices, for estimators which use nearest neighbors FEA Generalize the use of precomputed sparse distance matr… Sep 18, 2019
@thomasjpfan thomasjpfan merged commit e52e9c8 into scikit-learn:master Sep 18, 2019
18 checks passed (LGTM analyses, CircleCI doc and lint builds, codecov patch 99.36% / project 96.73%, and Azure builds on Linux, Windows and macOS).
thomasjpfan (Member) commented Sep 18, 2019

Thank you @TomDLT !

ogrisel (Member) commented Sep 18, 2019

This opens up a lot of opportunities to precompute neighbors with state-of-the-art approximate NN implementations such as annoy or https://github.com/kakao/n2.
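
A rough sketch of what wrapping annoy in this API could look like (AnnoyTransformer is hypothetical; it assumes annoy's euclidean metric and that the index holds at least n_neighbors + 1 items, so every query returns exactly that many results):

import numpy as np
from annoy import AnnoyIndex
from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator, TransformerMixin


class AnnoyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, n_neighbors=5, n_trees=10):
        self.n_neighbors = n_neighbors
        self.n_trees = n_trees

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=np.float32)
        self.n_samples_fit_ = X.shape[0]
        self.index_ = AnnoyIndex(X.shape[1], 'euclidean')
        for i, row in enumerate(X):
            self.index_.add_item(i, row)
        self.index_.build(self.n_trees)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=np.float32)
        n = self.n_neighbors + 1  # +1: a training query returns itself
        indices, distances = [], []
        for row in X:
            ind, dist = self.index_.get_nns_by_vector(
                row, n, include_distances=True)
            indices.extend(ind)
            distances.extend(dist)
        # Each row has exactly n entries, already sorted by distance.
        indptr = np.arange(0, len(X) * n + 1, n)
        return csr_matrix((distances, indices, indptr),
                          shape=(len(X), self.n_samples_fit_))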

agramfort (Member) commented Sep 18, 2019

Oh boy! Big congrats @TomDLT for never giving up on this one! 🍻

TomDLT (Member, Author) commented Sep 18, 2019

Awesome!
Thanks a lot for all the comments and reviews!
