Toward a consistent API for NearestNeighbors & co #10463
Solution A could be combined with a new transformer estimator to precompute neighbors, for instance:

```python
make_pipeline(
    NearestNeighborsTransformer(radius=42.0, mode='distances'),
    DBSCAN(min_samples=30, neighbors='precomputed'),
    memory='/path/to/cache'
)

make_pipeline(
    NearestNeighborsTransformer(n_neighbors=5, mode='connectivity'),
    SpectralClustering(n_clusters=5, neighbors='precomputed'),
    memory='/path/to/cache'
)

make_pipeline(
    NearestNeighborsTransformer(radius=42.0, mode='distances'),
    TSNE(method='barnes_hut', learning_rate=10, angle=0.3, neighbors='precomputed'),
    memory='/path/to/cache'
)
```

This way it would be easy to tweak the value of the last stage of the pipeline (e.g. …).
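For comparison, the transformer API that scikit-learn later adopted (version 0.22+) spells this pattern with `KNeighborsTransformer`, which emits a sparse distance graph, and a final estimator taking `metric='precomputed'`. A minimal runnable sketch (dataset and parameter values are illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsTransformer
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(100, 2)

# The transformer outputs a sparse (n_samples, n_samples) distance graph,
# which DBSCAN consumes directly when metric='precomputed'.
pipe = make_pipeline(
    KNeighborsTransformer(n_neighbors=10, mode='distance'),
    DBSCAN(eps=0.3, min_samples=5, metric='precomputed'),
)
labels = pipe.fit_predict(X)
```

As in the proposal above, adding `memory='/path/to/cache'` to `make_pipeline` would cache the transformer output, so only the last stage is recomputed when its parameters change.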
I've not looked at this in detail yet. But a related idea is that algorithms which can apply NN to precomputed distance matrices could also do so with a precomputed sparse-graph style of matrix (e.g. …).
I think you're talking about such sparse precomputed matrices above. Yes, a transformer would be an interesting way to make that easier. Are precomputed (approximate) nearest neighborhoods sufficient for most/all of these algorithms that use NN?
Yes, exactly; I started work to adapt TSNE to use this. It is part of an experimental branch by @TomDLT to explore what kinds of changes are required.
Note: I am not sure whether or not we should introduce an additional …
The latter use case is related to #8999. We should probably do two separate PRs.
We just have to keep in mind that those two PRs could share the use of a standard …
To follow up on #10482 (comment) (as it's not really a comment about the PR), just a few general questions:
Anywhere that all computations are based only on the sparse neighborhoods of each training point in relation to the training data is applicable, KNeighborsClassifier included. @TomDLT lists which apply in the PR. Beyond caching, other specialised handling of particular data could be applied to produce sparse precomputed neighborhoods: batching, parallel/distributed computation of nearest neighbors, specialised handling for some data type, approximate nearest neighbors. Any of these could result in performance gains.
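To illustrate the plug-in point described above, here is a hypothetical minimal transformer (the class name and internals are illustrative, not taken from the PR) that emits a sparse k-NN distance graph. Any batched, distributed, or approximate backend could be swapped in, as long as it produces the same CSR output:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import NearestNeighbors

class SimpleKNNGraphTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: precomputes a sparse k-NN distance graph.

    Replacing NearestNeighbors with an approximate or distributed backend
    only requires emitting the same CSR matrix from transform().
    """

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        self.n_samples_fit_ = X.shape[0]
        self.nn_ = NearestNeighbors(n_neighbors=self.n_neighbors).fit(X)
        return self

    def transform(self, X):
        dist, ind = self.nn_.kneighbors(X)
        n_queries = X.shape[0]
        # Build a CSR matrix with exactly n_neighbors entries per row.
        indptr = np.arange(0, (n_queries + 1) * self.n_neighbors,
                           self.n_neighbors)
        return csr_matrix((dist.ravel(), ind.ravel(), indptr),
                          shape=(n_queries, self.n_samples_fit_))

graph = SimpleKNNGraphTransformer(n_neighbors=3).fit_transform(
    np.random.RandomState(0).rand(50, 4))
```

A downstream estimator accepting a precomputed sparse neighborhood graph never needs to know which backend produced it.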
Solution A was implemented and merged in #10482 |
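As merged, the parameter spelling is `metric='precomputed'` rather than `neighbors='precomputed'`, and estimators such as DBSCAN accept a sparse neighborhood graph directly (scikit-learn >= 0.22). A minimal sketch with illustrative data and parameters:

```python
import numpy as np
from sklearn.neighbors import radius_neighbors_graph
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(100, 2)

# Precompute the sparse radius-neighbors distance graph once...
graph = radius_neighbors_graph(X, radius=0.3, mode='distance')

# ...then cluster on it; entries absent from the sparse graph are
# treated as distances larger than eps.
labels = DBSCAN(eps=0.3, min_samples=5, metric='precomputed').fit_predict(graph)
```

The graph can be cached or produced by any specialised neighbor computation before being handed to the estimator.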
Estimators relying on `NearestNeighbors` (NN), and their related params:

```
params = (algorithm, leaf_size, metric, p, metric_params, n_jobs)
```

sklearn.neighbors:

```
NearestNeighbors(n_neighbors, radius, *params)
KNeighborsClassifier(n_neighbors, *params)
KNeighborsRegressor(n_neighbors, *params)
RadiusNeighborsClassifier(radius, *params)
RadiusNeighborsRegressor(radius, *params)
LocalOutlierFactor(n_neighbors, *params)
KernelDensity(algorithm, metric, leaf_size, metric_params)
```

sklearn.manifold:

```
TSNE(method="barnes_hut", metric)
Isomap(n_neighbors, neighbors_algorithm, n_jobs)
LocallyLinearEmbedding(n_neighbors, neighbors_algorithm, n_jobs)
SpectralEmbedding(affinity='nearest_neighbors', n_neighbors, n_jobs)
```

sklearn.cluster:

```
SpectralClustering(affinity='nearest_neighbors', n_neighbors, n_jobs)
DBSCAN(eps, *params)
```
How do they call `NearestNeighbors`?

- `NeighborsBase._fit`: NearestNeighbors, KNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsClassifier, RadiusNeighborsRegressor, LocalOutlierFactor
- `BallTree/KDTree(X)`: KernelDensity
- `kneighbors_graph(X)`: SpectralClustering, SpectralEmbedding
- `NearestNeighbors().fit(X)`: TSNE, DBSCAN, Isomap, kneighbors_graph

Do they handle other forms of input X?

- `KNeighborsMixin` object: kneighbors_graph
- `NeighborsBase` object: all estimators inheriting NeighborsBase + UnsupervisedMixin
- `BallTree/KDTree` object: all estimators inheriting NeighborsBase + SupervisedFloatMixin/SupervisedIntegerMixin

Issues:
- Some NN parameters are not exposed in some estimators (e.g. `n_jobs` in TSNE).
- Handling of a fitted `NearestNeighbors/BallTree/KDTree` object as input is not consistent, and not well documented. Sometimes it is documented but does not work (e.g. Isomap), or not well documented but it does work (e.g. LocalOutlierFactor). Most classes almost handle it, since `NearestNeighbors().fit(NearestNeighbors().fit(X))` works, but a call to `check_array(X)` prevents it (e.g. Isomap, DBSCAN, SpectralEmbedding, SpectralClustering, LocallyLinearEmbedding, TSNE).
- Some estimators do not accept a precomputed sparse neighbors graph (e.g. one built with `kneighbors_graph`) (e.g. T-SNE fails for CSR matrix #9691).

Proposed solutions:
A. We could generalize the use of precomputed distance matrices, and use pipelines to chain `NearestNeighbors` with other estimators. Yet it might not be possible/efficient for some estimators. In this case one would have to adapt the estimators to allow for the following:

```
Estimator(neighbors='precomputed').fit(distance_matrix, y)
```

B. We could improve the checking of X to enable more widely having X as a fitted `NearestNeighbors/BallTree/KDTree` object. The changes would probably be limited; however, this solution is not pipeline-friendly.

C. To be pipeline-friendly, a custom `NearestNeighbors` object could be passed in the params, unfitted. We could then put all NN-related parameters in this estimator parameter, and allow custom estimators with a clear API. This is essentially what is proposed in #8999.
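Regarding solution B, the `NeighborsBase._fit` path already accepts an already fitted `NearestNeighbors` object (or a `BallTree`/`KDTree`) as X, which can be checked directly. A small sketch of the behaviour as of recent scikit-learn versions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).rand(20, 3)

# Fit one index, then pass the fitted object as X to a second estimator;
# the second estimator reuses the first one's tree instead of rebuilding it.
nn = NearestNeighbors(n_neighbors=3).fit(X)
nn2 = NearestNeighbors(n_neighbors=3).fit(nn)
dist, ind = nn2.kneighbors(X)
```

As noted in the issues above, the limitation is that most other estimators call `check_array(X)` before reaching this code path, which rejects such inputs.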