
Toward a consistent API for NearestNeighbors & co #10463

Closed
TomDLT opened this issue Jan 12, 2018 · 8 comments
Labels: API


@TomDLT TomDLT (Member) commented Jan 12, 2018

Estimators relying on NearestNeighbors (NN), and their related params:

params = (algorithm, leaf_size, metric, p, metric_params, n_jobs)

sklearn.neighbors:

  • NearestNeighbors(n_neighbors, radius, *params)
  • KNeighborsClassifier(n_neighbors, *params)
  • KNeighborsRegressor(n_neighbors, *params)
  • RadiusNeighborsClassifier(radius, *params)
  • RadiusNeighborsRegressor(radius, *params)
  • LocalOutlierFactor(n_neighbors, *params)
  • ~KernelDensity(algorithm, metric, leaf_size, metric_params)

sklearn.manifold:

  • TSNE(method="barnes_hut", metric)
  • Isomap(n_neighbors, neighbors_algorithm, n_jobs)
  • LocallyLinearEmbedding(n_neighbors, neighbors_algorithm, n_jobs)
  • SpectralEmbedding(affinity='nearest_neighbors', n_neighbors, n_jobs)

sklearn.cluster:

  • SpectralClustering(affinity='nearest_neighbors', n_neighbors, n_jobs)
  • DBSCAN(eps, *params)

How do they call NearestNeighbors?

  • Inherit from NeighborsBase._fit: NearestNeighbors, KNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsClassifier, RadiusNeighborsRegressor, LocalOutlierFactor
  • Call BallTree/KDTree(X): KernelDensity
  • Call kneighbors_graph(X): SpectralClustering, SpectralEmbedding
  • Call NearestNeighbors().fit(X): TSNE, DBSCAN, Isomap, kneighbors_graph

Do they handle other forms of input X?

  • Handle a precomputed distance matrix, with metric/affinity='precomputed': TSNE, DBSCAN, SpectralEmbedding, SpectralClustering
  • Handle KNeighborsMixin object: kneighbors_graph
  • Handle NeighborsBase object: all estimators inheriting NeighborsBase + UnsupervisedMixin
  • Handle BallTree/KDTree object: all estimators inheriting NeighborsBase + SupervisedFloatMixin/SupervisedIntegerMixin
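
A minimal sketch of two of the input forms listed above (the precomputed distance matrix path and the fitted BallTree path); this follows the behaviour described in this issue, and exact support may differ between estimators and versions:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import BallTree, NearestNeighbors

X = np.random.RandomState(0).rand(100, 3)

# 1) Dense precomputed distance matrix (metric='precomputed').
D = pairwise_distances(X)
labels = DBSCAN(eps=0.3, metric='precomputed').fit_predict(D)

# 2) A fitted BallTree passed as X to an estimator inheriting NeighborsBase.
nn = NearestNeighbors(n_neighbors=5).fit(BallTree(X))
dist, ind = nn.kneighbors(X)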

Issues:

  1. We don't have all NN parameters in all classes (e.g. TSNE has no n_jobs).
  2. We can't pass custom NN estimators to these classes (see PR #3922 and #8999).
  3. The handling of input X as a fitted NearestNeighbors/BallTree/KDTree object is inconsistent and not well documented. Sometimes it is documented but does not work (e.g. Isomap); sometimes it is not well documented but does work (e.g. LocalOutlierFactor). Most classes almost handle it, since NearestNeighbors().fit(NearestNeighbors().fit(X)) works, but a call to check_array(X) prevents it (e.g. Isomap, DBSCAN, SpectralEmbedding, SpectralClustering, LocallyLinearEmbedding, TSNE); a short sketch of this behaviour follows the list.
  4. The handling of X as a precomputed distance matrix is inconsistent, and sometimes does not work with sparse matrices (as produced by kneighbors_graph), e.g. TSNE (#9691).
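
A short sketch of the inconsistency described in point 3, based on the behaviour stated above:

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).rand(50, 4)

nn = NearestNeighbors(n_neighbors=3).fit(X)
# Re-using the already fitted object works through NeighborsBase._fit:
nn2 = NearestNeighbors(n_neighbors=3).fit(nn)
dist, ind = nn2.kneighbors(X)

# By contrast, most other estimators call check_array(X) first, so e.g.
# Isomap().fit(nn) raises before the fitted neighbors object can be used.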

Proposed solutions:

A. We could generalize the use of precomputed distance matrices, and use pipelines to chain NearestNeighbors with other estimators. Yet it might not be possible or efficient for some estimators. In this case one would have to adapt the estimators to allow for the following: Estimator(neighbors='precomputed').fit(distance_matrix, y)

B. We could improve the checking of X to enable more widely accepting X as a fitted NearestNeighbors/BallTree/KDTree object. The changes would probably be limited; however, this solution is not pipeline-friendly.

C. To be pipeline-friendly, a custom NearestNeighbors object could be passed in the params, unfitted. We could then put all NN-related parameters in this estimator parameter, and allow custom estimators with a clear API. This is essentially what is proposed in #8999.
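
A rough sketch of what a pluggable neighbors estimator could look like under solution C. The BruteForceNeighbors class and the neighbors= parameter are hypothetical illustrations; the point is the small fit/kneighbors surface such a plugin would need to expose:

import numpy as np

class BruteForceNeighbors:
    # Hypothetical minimal plugin; an approximate-NN wrapper would expose
    # the same fit/kneighbors surface.
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        self._fit_X = np.asarray(X)
        return self

    def kneighbors(self, X, n_neighbors=None, return_distance=True):
        n_neighbors = n_neighbors or self.n_neighbors
        X = np.asarray(X)
        # Pairwise Euclidean distances to the training points.
        dist = np.sqrt(((X[:, None, :] - self._fit_X[None, :, :]) ** 2).sum(-1))
        ind = np.argsort(dist, axis=1)[:, :n_neighbors]
        if return_distance:
            return np.take_along_axis(dist, ind, axis=1), ind
        return ind

# Proposed (not yet functional) usage:
#   TSNE(neighbors=BruteForceNeighbors(n_neighbors=30)).fit(X)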

@ogrisel ogrisel (Member) commented Jan 12, 2018

Solution A could be combined with a new transformer estimator to precompute neighbors, for instance:

make_pipeline(
    NearestNeighborsTransformer(radius=42.0, mode='distances'),
    DBSCAN(min_samples=30, neighbors='precomputed'),
    memory='/path/to/cache'
)

make_pipeline(
    NearestNeighborsTransformer(n_neighbors=5, mode='connectivity'),
    SpectralClustering(n_clusters=5, neighbors='precomputed'),
    memory='/path/to/cache'
)


make_pipeline(
    NearestNeighborsTransformer(radius=42.0, mode='distances'),
    TSNE(method='barnes_hut', learning_rate=10, angle=0.3, neighbors='precomputed'),
    memory='/path/to/cache'
)

This way it would be easy to tweak the value of the last stage of the pipeline (e.g. min_samples or n_clusters) without recomputing the neighbors thanks to the pipeline cache.

@jnothman jnothman (Member) commented Jan 13, 2018

I've not looked at this in detail yet. But a related idea is that algorithms which can apply NN to precomputed distance matrices could also accept a precomputed sparse graph style of matrix (e.g. kneighbors_graph(X, mode='distance')). DBSCAN currently supports this. #10206 proposes to support it in tSNE and in NearestNeighbors in general.
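
A minimal sketch of the sparse precomputed-graph input that DBSCAN already accepts, as mentioned above:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(200, 2)
# Sparse CSR matrix holding distances to the 10 nearest neighbors only.
graph = kneighbors_graph(X, n_neighbors=10, mode='distance')
labels = DBSCAN(eps=0.2, metric='precomputed').fit_predict(graph)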

@jnothman jnothman (Member) commented Jan 13, 2018

I think you're talking about such sparse precomputed matrices above. Yes, a transformer would be an interesting way to make that easier. Are precomputed (approximate) nearest neighborhoods sufficient for most/all of these algorithms that use NN?

@ogrisel ogrisel (Member) commented Jan 13, 2018

Yes, exactly. I have started work to adapt TSNE to use kneighbors_graph(X, mode='distance') internally: TomDLT@aeb956b

This is part of an experimental branch by @TomDLT to explore what kinds of changes are required:

master...TomDLT:nn_api

@ogrisel ogrisel (Member) commented Jan 13, 2018

Note: DBSCAN already supports metric='precomputed' and SpectralClustering already supports affinity='precomputed'.
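
For reference, a minimal sketch of the existing SpectralClustering path with affinity='precomputed' (note it expects an affinity/similarity matrix rather than distances):

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.RandomState(0).rand(100, 2)
A = rbf_kernel(X)  # dense affinity (similarity) matrix
labels = SpectralClustering(n_clusters=3, affinity='precomputed').fit_predict(A)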

I am not sure whether or not we should introduce an additional neighbors constructor param. This additional parameter could also be used to do things like:

  • TSNE(neighbors='ball_tree').fit(features_data) or
  • TSNE(neighbors=CustomApproximateNeighbors()).fit(features_data)

The latter use case is related to #8999.

We should probably do 2 separate PRs:

  • one for homogeneous support of precomputed neighbor-distance graphs,
  • one for making it easy and homogeneous to select the neighbors method with a string ('ball_tree', 'kd_tree'...) or even plug custom nearest neighbors implementations as an estimator-like object.

We just have to keep in mind that those 2 PRs could share the use of a standard neighbors= constructor param for the different use cases.

@rth rth (Member) commented Jan 23, 2018

To follow up on #10482 (comment) (as it's not really a comment about the PR), just a few general questions:

  • I understand the advantage of using standardized parameters and caching with NearestNeighborsTransformer. But other than that, I was wondering about the rationale for generalizing sparse precomputed distances everywhere. Is it mostly about API consistency, or is it also expected to have some memory or performance impact?
  • Does it make sense to use precomputed distances in all NN-based estimators, as proposed in #10482? E.g. what would be the use case for KNeighborsClassifier? Essentially I'm wondering whether there is any documentation on when to use (or not use) precomputed distances, unless that's estimator-specific?

@jnothman jnothman (Member) commented Jan 23, 2018

Any estimator whose computations are based only on the sparse neighborhoods of each point in relation to the training data is applicable, KNeighborsClassifier included. @TomDLT lists which estimators apply in the PR.

Beyond caching, other specialised handling could be applied to produce sparse precomputed neighborhoods: batching, parallel/distributed computation of nearest neighbors, specialised handling of particular data types, approximate nearest neighbors. Any of these could result in performance gains.
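
For instance, KNeighborsClassifier can already consume dense precomputed distances with metric='precomputed' (a minimal sketch; the sparse-graph input discussed in this issue is what the PR would generalize):

from sklearn.datasets import load_iris
from sklearn.metrics import pairwise_distances
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

D_train = pairwise_distances(X_train)          # shape (n_train, n_train)
D_test = pairwise_distances(X_test, X_train)   # shape (n_test, n_train)

clf = KNeighborsClassifier(n_neighbors=5, metric='precomputed').fit(D_train, y_train)
print(clf.score(D_test, y_test))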

@TomDLT TomDLT (Member, Author) commented Sep 18, 2019

Solution A was implemented and merged in #10482.
