
Toward a consistent API for NearestNeighbors & co #10463

Closed · TomDLT opened this issue Jan 12, 2018 · 8 comments

TomDLT (Member) commented Jan 12, 2018

Estimators relying on NearestNeighbors (NN), and their related params:

params = (algorithm, leaf_size, metric, p, metric_params, n_jobs)
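For reference, a minimal instantiation spelling out this shared parameter set on NearestNeighbors itself (defaults as of the versions discussed here):

from sklearn.neighbors import NearestNeighbors

# The shared NN parameters, with their usual defaults:
nn = NearestNeighbors(
    n_neighbors=5, radius=1.0,       # query sizes
    algorithm='auto', leaf_size=30,  # search structure
    metric='minkowski', p=2, metric_params=None,
    n_jobs=1,
)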

sklearn.neighbors:

  • NearestNeighbors(n_neighbors, radius, *params)
  • KNeighborsClassifier(n_neighbors, *params)
  • KNeighborsRegressor(n_neighbors, *params)
  • RadiusNeighborsClassifier(radius, *params)
  • RadiusNeighborsRegressor(radius, *params)
  • LocalOutlierFactor(n_neighbors, *params)
  • ~KernelDensity(algorithm, metric, leaf_size, metric_params)

sklearn.manifold:

  • TSNE(method="barnes_hut", metric)
  • Isomap(n_neighbors, neighbors_algorithm, n_jobs)
  • LocallyLinearEmbedding(n_neighbors, neighbors_algorithm, n_jobs)
  • SpectralEmbedding(affinity='nearest_neighbors', n_neighbors, n_jobs)

sklearn.cluster:

  • SpectralClustering(affinity='nearest_neighbors', n_neighbors, n_jobs)
  • DBSCAN(eps, *params)

How do they call NearestNeighbors?

  • Inherit from NeighborsBase._fit: NearestNeighbors, KNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsClassifier, RadiusNeighborsRegressor, LocalOutlierFactor
  • Call BallTree/KDTree(X): KernelDensity
  • Call kneighbors_graph(X): SpectralClustering, SpectralEmbedding
  • Call NearestNeighbors().fit(X): TSNE, DBSCAN, Isomap, kneighbors_graph

Do they handle other forms of input X?

  • Handle a precomputed distance matrix (with metric/affinity='precomputed'): TSNE, DBSCAN, SpectralEmbedding, SpectralClustering
  • Handle KNeighborsMixin object: kneighbors_graph
  • Handle NeighborsBase object: all estimators inheriting NeighborsBase + UnsupervisedMixin
  • Handle BallTree/KDTree object: all estimators inheriting NeighborsBase + SupervisedFloatMixin/SupervisedIntegerMixin

Issues:

  1. We don't have all NN parameters in all classes (e.g. n_jobs in TSNE).
  2. We can't pass a custom NN estimator to these classes (see PRs #3922 and #8999, "[WIP] allow nearest neighbors algorithm to be an estimator").
  3. The handling of input X as a fitted NearestNeighbors/BallTree/KDTree object is inconsistent and poorly documented. Sometimes it is documented but does not work (e.g. Isomap); sometimes it works but is not well documented (e.g. LocalOutlierFactor). Most classes almost handle it, since NearestNeighbors().fit(NearestNeighbors().fit(X)) works, but a call to check_array(X) rules it out (e.g. Isomap, DBSCAN, SpectralEmbedding, SpectralClustering, LocallyLinearEmbedding, TSNE). A short sketch follows this list.
  4. The handling of X as a precomputed distance matrix is inconsistent, and sometimes fails with sparse matrices (as returned by kneighbors_graph); see e.g. #9691 ("T-SNE fails for CSR matrix").
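A minimal sketch of the inconsistency in issue 3 (the failing Isomap call reflects the behavior described above):

import numpy as np
from sklearn.neighbors import NearestNeighbors, BallTree
from sklearn.manifold import Isomap

X = np.random.RandomState(0).randn(50, 3)
tree = BallTree(X)

# NeighborsBase._fit accepts an already-built tree, so this works:
nn = NearestNeighbors(n_neighbors=3).fit(tree)
dist, ind = nn.kneighbors(X[:5])

# Isomap documents tree-like inputs but calls check_array(X) first,
# so the equivalent call fails:
# Isomap(n_neighbors=3).fit(tree)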

Proposed solutions:

A. We could generalize the use of a precomputed distance matrix, and use pipelines to chain NearestNeighbors with other estimators. Yet it might not be possible or efficient for some estimators. In this case one would have to adapt the estimators to allow for the following: Estimator(neighbors='precomputed').fit(distance_matrix, y)

B. We could improve the checking of X to more widely accept X as a fitted NearestNeighbors/BallTree/KDTree object. The changes would probably be limited; however, this solution is not pipeline-friendly.

C. To be pipeline-friendly, a custom NearestNeighbors object could be passed as a parameter, unfitted. We could then put all NN-related parameters in this estimator parameter, and allow custom estimators with a clear API. This is essentially what is proposed in #8999.
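As a hypothetical sketch of solution C (the neighbors= parameter does not exist; it is the API being proposed):

from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# All NN-related parameters live on an unfitted neighbors estimator,
# which DBSCAN would clone and fit internally (neighbors= is hypothetical):
nn = NearestNeighbors(algorithm='ball_tree', leaf_size=50,
                      metric='manhattan', n_jobs=4)
clusterer = DBSCAN(eps=0.5, min_samples=10, neighbors=nn)
# clusterer.fit(X)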

ogrisel (Member) commented Jan 12, 2018

Solution A could be combined with a new transformer estimator to precompute neighbors (NearestNeighborsTransformer below being that proposed new class), for instance:

from sklearn.pipeline import make_pipeline

make_pipeline(
    NearestNeighborsTransformer(radius=42.0, mode='distances'),
    DBSCAN(min_samples=30, neighbors='precomputed'),
    memory='/path/to/cache',
)

make_pipeline(
    NearestNeighborsTransformer(n_neighbors=5, mode='connectivity'),
    SpectralClustering(n_clusters=5, neighbors='precomputed'),
    memory='/path/to/cache',
)

make_pipeline(
    NearestNeighborsTransformer(radius=42.0, mode='distances'),
    TSNE(method='barnes_hut', learning_rate=10, angle=0.3, neighbors='precomputed'),
    memory='/path/to/cache',
)

This way it would be easy to tweak the value of the last stage of the pipeline (e.g. min_samples or n_clusters) without recomputing the neighbors thanks to the pipeline cache.

jnothman (Member) commented Jan 13, 2018

I've not looked at this in detail yet. But a related idea is that algorithms which can apply NN to precomputed distance matrices could also accept a precomputed sparse-graph style of matrix (e.g. kneighbors_graph(X, mode='distance')). DBSCAN currently supports this. #10206 proposes to support it in tSNE and in NearestNeighbors in general.
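For concreteness, a minimal sketch of the sparse-graph route DBSCAN already supports:

import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).randn(200, 2)
# CSR matrix whose stored entries are distances to the 10 nearest neighbors.
graph = kneighbors_graph(X, n_neighbors=10, mode='distance')
# Entries absent from the sparse graph are treated as non-neighbors.
labels = DBSCAN(eps=0.3, metric='precomputed').fit_predict(graph)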

jnothman (Member) commented

I think you're talking about such sparse precomputed matrices above. Yes, a transformer would be an interesting way to make that easier. Are precomputed (approximate) nearest neighborhoods sufficient for most/all of these algorithms that use NN?

ogrisel (Member) commented Jan 13, 2018

Yes exactly, I started work to adapt TSNE to use kneighbors_graph(X, mode='distance') internally: TomDLT@aeb956b

This is part of an experimental branch by @TomDLT to explore what kinds of changes are required:

master...TomDLT:nn_api

ogrisel (Member) commented Jan 13, 2018

Note: DBSCAN already supports metric='precomputed' and SpectralClustering already supports affinity='precomputed'.
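For instance, both already work with a dense precomputed matrix (a minimal sketch; note that DBSCAN expects distances while SpectralClustering expects affinities):

import numpy as np
from sklearn.cluster import DBSCAN, SpectralClustering
from sklearn.metrics.pairwise import euclidean_distances, rbf_kernel

X = np.random.RandomState(0).randn(100, 2)

# DBSCAN takes a precomputed *distance* matrix...
db = DBSCAN(eps=0.5, metric='precomputed').fit(euclidean_distances(X))

# ...while SpectralClustering takes a precomputed *affinity* (similarity) matrix.
sc = SpectralClustering(n_clusters=2, affinity='precomputed').fit(rbf_kernel(X))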

I am not sure whether or not we should introduce an additional neighbors constructor param. This additional parameter could also be used to do things like:

  • TSNE(neighbors='ball_tree').fit(features_data) or
  • TSNE(neighbors=CustomApproximateNeighbors()).fit(features_data)

The latter use case is related to #8999.

We should probably do 2 separate PRs:

  • one for homogeneous support of precomputed distance neighbors graphs,
  • one for making it easy and homogeneous to select the neighbors method with a string ('ball_tree', 'kd_tree'...) or even plug custom nearest neighbors implementations as an estimator-like object.

We just have to keep in mind that those 2 PRs could share the use of a standard neighbors= constructor param for the different use cases.
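For instance, a sketch of the estimator-like interface such a pluggable object might expose (the exact method set is to be defined; here it simply delegates to the exact NearestNeighbors):

from sklearn.neighbors import NearestNeighbors

class CustomNeighbors:
    """Hypothetical duck-typed backend; an approximate-NN library could be
    wrapped behind the same three methods."""

    def fit(self, X, y=None):
        self._nn = NearestNeighbors().fit(X)
        return self

    def kneighbors(self, X=None, n_neighbors=None, return_distance=True):
        return self._nn.kneighbors(X, n_neighbors, return_distance)

    def radius_neighbors(self, X=None, radius=None, return_distance=True):
        return self._nn.radius_neighbors(X, radius, return_distance)

# Hypothetical usage with the proposed parameter:
# TSNE(neighbors=CustomNeighbors()).fit(features_data)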

rth (Member) commented Jan 23, 2018

To follow up on #10482 (comment) (as it's not really a comment about the PR), just a few general questions:

  • I understand the advantage of using standardized parameters and caching with NearestNeighborsTransformer. But other than that, I was wondering about the rationale for generalizing sparse precomputed distances everywhere. Is it mostly about API consistency, or is it also expected to have some memory or performance impact?
  • Does it make sense to use precomputed distances in all NN-based estimators, as proposed in #10482 ("FEA Generalize the use of precomputed sparse distance matr…")? e.g. what would be the use case for KNeighborsClassifier? Essentially I'm wondering whether there is any documentation on when to use or not to use precomputed distances, unless that's estimator-specific.

jnothman (Member) commented

This is applicable anywhere all computations are based only on the sparse neighborhoods of each point in relation to the training data, KNeighborsClassifier included. @TomDLT lists which estimators apply in the PR.

Beyond caching, other specialised handling of particular data could be applied to produce sparse precomputed neighborhoods: batching, parallel/distributed computation of nearest neighbors, specialised handling for some data type, approximate nearest neighbors. Any of these could result in performance gains.
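As a sketch, any transformer whose transform emits a sparse distance graph could feed an estimator accepting sparse precomputed distances, as DBSCAN does; the SparseNeighborsTransformer below is hypothetical, and its internals could be swapped for batched, distributed, or approximate search:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import make_pipeline

class SparseNeighborsTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, n_neighbors=10):
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        self.nn_ = NearestNeighbors(n_neighbors=self.n_neighbors).fit(X)
        return self

    def transform(self, X):
        # Sparse matrix of distances to the n_neighbors nearest points.
        return self.nn_.kneighbors_graph(X, mode='distance')

X = np.random.RandomState(0).randn(300, 5)
pipe = make_pipeline(
    SparseNeighborsTransformer(n_neighbors=20),
    DBSCAN(eps=1.5, metric='precomputed'),
)
labels = pipe.fit_predict(X)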

TomDLT (Member, Author) commented Sep 18, 2019

Solution A was implemented and merged in #10482
