
Toward a consistent API for NearestNeighbors & co #10463

Closed · TomDLT opened this issue Jan 12, 2018 · 8 comments

TomDLT (Member) commented Jan 12, 2018

Estimators relying on NearestNeighbors (NN), and their related params:

params = (algorithm, leaf_size, metric, p, metric_params, n_jobs)
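For reference, a minimal instantiation spelling out this shared parameter set on NearestNeighbors itself (defaults as of the versions discussed here):

from sklearn.neighbors import NearestNeighbors

# The shared NN parameters, with their usual defaults:
nn = NearestNeighbors(
    n_neighbors=5, radius=1.0,       # query sizes
    algorithm='auto', leaf_size=30,  # search structure
    metric='minkowski', p=2, metric_params=None,
    n_jobs=1,
)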

sklearn.neighbors:

  • NearestNeighbors(n_neighbors, radius, *params)
  • KNeighborsClassifier(n_neighbors, *params)
  • KNeighborsRegressor(n_neighbors, *params)
  • RadiusNeighborsClassifier(radius, *params)
  • RadiusNeighborsRegressor(radius, *params)
  • LocalOutlierFactor(n_neighbors, *params)
  • ~KernelDensity(algorithm, metric, leaf_size, metric_params)

sklearn.manifold:

  • TSNE(method="barnes_hut", metric)
  • Isomap(n_neighbors, neighbors_algorithm, n_jobs)
  • LocallyLinearEmbedding(n_neighbors, neighbors_algorithm, n_jobs)
  • SpectralEmbedding(affinity='nearest_neighbors', n_neighbors, n_jobs)

sklearn.cluster:

  • SpectralClustering(affinity='nearest_neighbors', n_neighbors, n_jobs)
  • DBSCAN(eps, *params)

How do they call NearestNeighbors?

  • Inherit from NeighborsBase._fit: NearestNeighbors, KNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsClassifier, RadiusNeighborsRegressor, LocalOutlierFactor
  • Call BallTree/KDTree(X): KernelDensity
  • Call kneighbors_graph(X): SpectralClustering, SpectralEmbedding
  • Call NearestNeighbors().fit(X): TSNE, DBSCAN, Isomap, kneighbors_graph

Do they handle other forms of input X?

  • Handle a precomputed distance matrix (with metric/affinity='precomputed'): TSNE, DBSCAN, SpectralEmbedding, SpectralClustering
  • Handle KNeighborsMixin object: kneighbors_graph
  • Handle NeighborsBase object: all estimators inheriting NeighborsBase + UnsupervisedMixin
  • Handle BallTree/KDTree object: all estimators inheriting NeighborsBase + SupervisedFloatMixin/SupervisedIntegerMixin

Issues:

  1. We don't have all NN parameters in all classes (e.g. n_jobs in TSNE).
  2. We can't pass a custom NN estimator to these classes (see PRs #3922 and #8999, "[WIP] allow nearest neighbors algorithm to be an estimator").
  3. The handling of input X as a fitted NearestNeighbors/BallTree/KDTree object is inconsistent and poorly documented. Sometimes it is documented but does not work (e.g. Isomap); sometimes it works but is not well documented (e.g. LocalOutlierFactor). Most classes almost handle it, since NearestNeighbors().fit(NearestNeighbors().fit(X)) works, but a call to check_array(X) rules it out (e.g. Isomap, DBSCAN, SpectralEmbedding, SpectralClustering, LocallyLinearEmbedding, TSNE). A short sketch follows this list.
  4. The handling of X as a precomputed distance matrix is inconsistent, and sometimes fails with sparse matrices (as returned by kneighbors_graph); see e.g. #9691 ("T-SNE fails for CSR matrix").
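A minimal sketch of the inconsistency in issue 3 (the failing Isomap call reflects the behavior described above):

import numpy as np
from sklearn.neighbors import NearestNeighbors, BallTree
from sklearn.manifold import Isomap

X = np.random.RandomState(0).randn(50, 3)
tree = BallTree(X)

# NeighborsBase._fit accepts an already-built tree, so this works:
nn = NearestNeighbors(n_neighbors=3).fit(tree)
dist, ind = nn.kneighbors(X[:5])

# Isomap documents tree-like inputs but calls check_array(X) first,
# so the equivalent call fails:
# Isomap(n_neighbors=3).fit(tree)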

Proposed solutions:

A. We could generalize the use of a precomputed distance matrix, and use pipelines to chain NearestNeighbors with other estimators. Yet it might not be possible or efficient for some estimators. In this case one would have to adapt the estimators to allow for the following: Estimator(neighbors='precomputed').fit(distance_matrix, y)

B. We could improve the checking of X to more widely accept X as a fitted NearestNeighbors/BallTree/KDTree object. The changes would probably be limited; however, this solution is not pipeline-friendly.

C. To be pipeline-friendly, a custom NearestNeighbors object could be passed as a parameter, unfitted. We could then put all NN-related parameters in this estimator parameter, and allow custom estimators with a clear API. This is essentially what is proposed in #8999.
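As a hypothetical sketch of solution C (the neighbors= parameter does not exist; it is the API being proposed):

from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# All NN-related parameters live on an unfitted neighbors estimator,
# which DBSCAN would clone and fit internally (neighbors= is hypothetical):
nn = NearestNeighbors(algorithm='ball_tree', leaf_size=50,
                      metric='manhattan', n_jobs=4)
clusterer = DBSCAN(eps=0.5, min_samples=10, neighbors=nn)
# clusterer.fit(X)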

ogrisel (Member) commented Jan 12, 2018

Solution A could be combined with a new transformer estimator to precompute neighbors (NearestNeighborsTransformer below being that proposed new class), for instance:

from sklearn.pipeline import make_pipeline

make_pipeline(
    NearestNeighborsTransformer(radius=42.0, mode='distances'),
    DBSCAN(min_samples=30, neighbors='precomputed'),
    memory='/path/to/cache',
)

make_pipeline(
    NearestNeighborsTransformer(n_neighbors=5, mode='connectivity'),
    SpectralClustering(n_clusters=5, neighbors='precomputed'),
    memory='/path/to/cache',
)

make_pipeline(
    NearestNeighborsTransformer(radius=42.0, mode='distances'),
    TSNE(method='barnes_hut', learning_rate=10, angle=0.3, neighbors='precomputed'),
    memory='/path/to/cache',
)

This way it would be easy to tweak the value of the last stage of the pipeline (e.g. min_samples or n_clusters) without recomputing the neighbors thanks to the pipeline cache.

jnothman (Member) commented Jan 13, 2018

I've not looked at this in detail yet. But a related idea is that algorithms which can apply NN to precomputed distance matrices could also accept a precomputed sparse-graph style of matrix (e.g. kneighbors_graph(X, mode='distance')). DBSCAN currently supports this. #10206 proposes to support it in tSNE and in NearestNeighbors in general.
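For concreteness, a minimal sketch of the sparse-graph route DBSCAN already supports:

import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).randn(200, 2)
# CSR matrix whose stored entries are distances to the 10 nearest neighbors.
graph = kneighbors_graph(X, n_neighbors=10, mode='distance')
# Entries absent from the sparse graph are treated as non-neighbors.
labels = DBSCAN(eps=0.3, metric='precomputed').fit_predict(graph)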

jnothman (Member) commented

I think you're talking about such sparse precomputed matrices above. Yes, a transformer would be an interesting way to make that easier. Are precomputed (approximate) nearest neighborhoods sufficient for most/all of these algorithms that use NN?

ogrisel (Member) commented Jan 13, 2018

Yes exactly, I started work to adapt TSNE to use kneighbors_graph(X, mode='distance') internally: TomDLT@aeb956b

This is part of an experimental branch by @TomDLT to explore what kinds of changes are required:

master...TomDLT:nn_api

ogrisel (Member) commented Jan 13, 2018

Note: DBSCAN already supports metric='precomputed' and SpectralClustering already supports affinity='precomputed'.
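For instance, both already work with a dense precomputed matrix (a minimal sketch; note that DBSCAN expects distances while SpectralClustering expects affinities):

import numpy as np
from sklearn.cluster import DBSCAN, SpectralClustering
from sklearn.metrics.pairwise import euclidean_distances, rbf_kernel

X = np.random.RandomState(0).randn(100, 2)

# DBSCAN takes a precomputed *distance* matrix...
db = DBSCAN(eps=0.5, metric='precomputed').fit(euclidean_distances(X))

# ...while SpectralClustering takes a precomputed *affinity* (similarity) matrix.
sc = SpectralClustering(n_clusters=2, affinity='precomputed').fit(rbf_kernel(X))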

I am not sure whether or not we should introduce an additional neighbors constructor param. This additional parameter could also be used to do things like:

  • TSNE(neighbors='ball_tree').fit(features_data) or
  • TSNE(neighbors=CustomApproximateNeighbors()).fit(features_data)

The latter use case is related to #8999.

We should probably do 2 separate PRs:

  • one for homogeneous support of precomputed distance neighbors graphs,
  • one for making it easy and homogeneous to select the neighbors method with a string ('ball_tree', 'kd_tree'...) or even plug custom nearest neighbors implementations as an estimator-like object.

We just have to keep in mind that those 2 PRs could share the use of a standard neighbors= constructor param for the different use cases.
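For instance, a sketch of the estimator-like interface such a pluggable object might expose (the exact method set is to be defined; here it simply delegates to the exact NearestNeighbors):

from sklearn.neighbors import NearestNeighbors

class CustomNeighbors:
    """Hypothetical duck-typed backend; an approximate-NN library could be
    wrapped behind the same three methods."""

    def fit(self, X, y=None):
        self._nn = NearestNeighbors().fit(X)
        return self

    def kneighbors(self, X=None, n_neighbors=None, return_distance=True):
        return self._nn.kneighbors(X, n_neighbors, return_distance)

    def radius_neighbors(self, X=None, radius=None, return_distance=True):
        return self._nn.radius_neighbors(X, radius, return_distance)

# Hypothetical usage with the proposed parameter:
# TSNE(neighbors=CustomNeighbors()).fit(features_data)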

rth (Member) commented Jan 23, 2018

To follow up on #10482 (comment) (as it's not really a comment about the PR), just a few general questions:

  • I understand the advantage of using standardized parameters and caching with NearestNeighborsTransformer. But other than that, I was wondering about the rationale for generalizing sparse precomputed distances everywhere. Is it mostly about API consistency, or is it also expected to have some memory or performance impact?
  • Does it make sense to use precomputed distances in all NN-based estimators, as proposed in #10482 ("FEA Generalize the use of precomputed sparse distance matr…")? e.g. what would be the use case for KNeighborsClassifier? Essentially I'm wondering whether there is any documentation on when to use or not to use precomputed distances, unless that's estimator-specific.

jnothman (Member) commented

This is applicable anywhere all computations are based only on the sparse neighborhoods of each point in relation to the training data, KNeighborsClassifier included. @TomDLT lists which estimators apply in the PR.

Beyond caching, other specialised handling of particular data could be applied to produce sparse precomputed neighborhoods: batching, parallel/distributed computation of nearest neighbors, specialised handling for some data type, approximate nearest neighbors. Any of these could result in performance gains.
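As a sketch, any transformer whose transform emits a sparse distance graph could feed an estimator accepting sparse precomputed distances, as DBSCAN does; the SparseNeighborsTransformer below is hypothetical, and its internals could be swapped for batched, distributed, or approximate search:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import make_pipeline

class SparseNeighborsTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, n_neighbors=10):
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        self.nn_ = NearestNeighbors(n_neighbors=self.n_neighbors).fit(X)
        return self

    def transform(self, X):
        # Sparse matrix of distances to the n_neighbors nearest points.
        return self.nn_.kneighbors_graph(X, mode='distance')

X = np.random.RandomState(0).randn(300, 5)
pipe = make_pipeline(
    SparseNeighborsTransformer(n_neighbors=20),
    DBSCAN(eps=1.5, metric='precomputed'),
)
labels = pipe.fit_predict(X)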

TomDLT (Member, Author) commented Sep 18, 2019

Solution A was implemented and merged in #10482
