
API Allow users to pass DistanceMetric objects to metric keyword argument in neighbors.KNeighborsRegressor #26267

Merged
merged 22 commits into scikit-learn:main from metric_obj on Aug 1, 2023

Conversation

Micky774
Contributor

@Micky774 Micky774 commented Apr 23, 2023

Reference Issues/PRs

Addresses #26010
Addresses #26329

What does this implement/fix? Explain your changes.

Allow users to pass DistanceMetric objects to metric keyword argument in neighbors.KNeighborsRegressor.

Any other comments?

I have discussed with @betatim the idea of developing an Engine for DistanceMetric, however that is probably not feasible in the near future without adding to the complexity of the already-open work on the plugin system. It is probably the best strategy once that system is a bit more mature, but until then, directly passing DistanceMetric objects provides a fast and simple route for the development and use of third-party DistanceMetric implementations.

@jjerphan jjerphan self-requested a review May 19, 2023 15:13
Member

@jjerphan jjerphan left a comment


Thank you, @Micky774.

As discussed directly with you, I think having efficient vectorized distance metric implementations is of value for scikit-learn generally. In this regard I am +5 on merging this PR.

As you mention, the use of third-party packages (such as yours, https://github.com/Micky774/distmetric-xsimd/) is limited by the absence of the plugin system.

Yet, I am not sure we want to relax our API, change scikit-learn's UX, and advertise it publicly if this is meant to be experimental or temporary: what this PR allows (i.e. passing DistanceMetric{32,} instances to the metric keyword) loosens that contract. I think there are notable UX benefits in constraining users to a single, package-uniform specification (mainly clarity and uniformity across all APIs using metric). In this regard, I am -1 on merging this PR.

Still, we might want to have a way for advanced users (or developers of third-party packages like yours) to use those packages in the meantime. An alternative temporary approach would be to have third-party packages monkey-patch scikit-learn internals when they are imported at runtime. This is the approach https://github.com/intel/scikit-learn-intelex uses, and it has the benefit of keeping scikit-learn's UX unchanged. Using this approach, you could change those mappings to point to your implementations instead:

######################################################################
# metric mappings
# These map from metric id strings to class names
METRIC_MAPPING{{name_suffix}} = {
    'euclidean': EuclideanDistance{{name_suffix}},
    'l2': EuclideanDistance{{name_suffix}},
    'minkowski': MinkowskiDistance{{name_suffix}},
    'p': MinkowskiDistance{{name_suffix}},
    'manhattan': ManhattanDistance{{name_suffix}},
    'cityblock': ManhattanDistance{{name_suffix}},
    'l1': ManhattanDistance{{name_suffix}},
    'chebyshev': ChebyshevDistance{{name_suffix}},
    'infinity': ChebyshevDistance{{name_suffix}},
    'seuclidean': SEuclideanDistance{{name_suffix}},
    'mahalanobis': MahalanobisDistance{{name_suffix}},
    'wminkowski': WMinkowskiDistance{{name_suffix}},
    'hamming': HammingDistance{{name_suffix}},
    'canberra': CanberraDistance{{name_suffix}},
    'braycurtis': BrayCurtisDistance{{name_suffix}},
    'matching': MatchingDistance{{name_suffix}},
    'jaccard': JaccardDistance{{name_suffix}},
    'dice': DiceDistance{{name_suffix}},
    'kulsinski': KulsinskiDistance{{name_suffix}},
    'rogerstanimoto': RogersTanimotoDistance{{name_suffix}},
    'russellrao': RussellRaoDistance{{name_suffix}},
    'sokalmichener': SokalMichenerDistance{{name_suffix}},
    'sokalsneath': SokalSneathDistance{{name_suffix}},
    'haversine': HaversineDistance{{name_suffix}},
    'pyfunc': PyFuncDistance{{name_suffix}},
}

This also has the benefit of making other methods of scikit-learn use your implementations.
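The import-time patching pattern described here can be sketched with a stand-in registry; the real METRIC_MAPPING lives in a private Cython module whose layout varies across versions, so the names below are purely illustrative:

```python
# Stand-in for a library-internal metric registry (illustrative only; not
# scikit-learn's actual private mapping).
METRIC_MAPPING = {
    "euclidean": "BuiltinEuclideanDistance",
    "manhattan": "BuiltinManhattanDistance",
}

def patch_metric(registry, name, implementation):
    """Swap a registry entry in place, returning the original so it can be restored."""
    original = registry.get(name)
    registry[name] = implementation
    return original

# A third-party package would run this from its own __init__ when imported.
original = patch_metric(METRIC_MAPPING, "euclidean", "SimdEuclideanDistance")

assert METRIC_MAPPING["euclidean"] == "SimdEuclideanDistance"
assert original == "BuiltinEuclideanDistance"
```

Because every code path that consults the registry then resolves to the patched entry, the substitution applies codebase-wide without any user-visible API change.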

What do you think?

@Micky774
Contributor Author

Micky774 commented May 24, 2023

As you mention, the use of third-party packages (such as yours, https://github.com/Micky774/distmetric-xsimd/) is limited by the absence of the plugin system.

Yet, I am not sure we want to relax our API, change scikit-learn's UX, and advertise it publicly if this is meant to be experimental or temporary: what this PR allows (i.e. passing DistanceMetric{32,} instances to the metric keyword) loosens that contract. I think there are notable UX benefits in constraining users to a single, package-uniform specification (mainly clarity and uniformity across all APIs using metric). In this regard, I am -1 on merging this PR.

We actually discussed this a bit during the most recent monthly meeting, and from my conversation with @ogrisel my current understanding is that the plugin system is intended to expose estimator-specific computational routines rather than wide-used computational backends (e.g. PairwiseDistanceReduction or DistanceMetric{32} implementations). In this way, the expansion of the metric keyword to accept DistanceMetric{32} objects is independent of that and would not need deprecation to favor plugins.

Still, we might want to have a way for advanced users (or developers of third-party packages like yours) to use those packages in the meantime. An alternative temporary approach would be to have third-party packages monkey-patch scikit-learn internals when they are imported at runtime. This is the approach https://github.com/intel/scikit-learn-intelex uses, and it has the benefit of keeping scikit-learn's UX unchanged. Using this approach, you could change those mappings to point to your implementations instead:

[METRIC_MAPPING{{name_suffix}} snippet quoted above omitted]

This also has the benefit of making other methods of scikit-learn use your implementations.

What do you think?

This is definitely an approach worth considering, but truth be told I also think it would be good for scikit-learn to more publicly expose opportunities for third-party acceleration of computational backends even if not through the plugin system. We have recently been discussing what the role of scikit-learn in the community is, and what our obligations are. One of the more critical questions, I think, is whether or not maximizing performance optimization is our obligation.

It's a bit tricky, imo. We should aim for performance wherever possible -- and indeed work like your PairwiseDistanceReduction backend is a massive effort in this regard -- but at the same time there are so many niche tools and system-specific boons that we could use but that would be incredibly costly to implement/maintain. I favor trying to make it as easy and canonical as possible for third-party programs to accelerate scikit-learn, in parallel to our own internal efforts.

In this way, I see the relaxation of the API for the metric keyword as a feature, not a bug. It is an invitation and advertisement for interested users/developers to devise their own solutions for DistanceMetric{32} and contribute them to the community -- it makes it more obvious to users that this is something they could write their own implementation of. Plus, we can still ensure the stability and uniformity of metric semantics across scikit-learn. Otherwise, I worry that if our go-to solution for third-party accelerators is just "monkey-patch us", then it poses a higher barrier to entry.

I will admit my inexperience with this though. If other core-devs believe that the cost to relaxing the API outweighs the benefits of offering a more obvious avenue of contribution for third-party acceleration, then I am okay with the monkey-patching solution. I am not too attached to the specific path taken, just to the results of a faster scikit-learn and -- if at all possible -- an invitation to the community to try out their own acceleration strategies on DistanceMetric{32}.

cc: @ogrisel @adrinjalali since we had actively discussed the role of the plugins system, and the burden of this proposed API change during the monthly meeting

@betatim
Member

betatim commented May 24, 2023

Could you post an example snippet of what user code with KNeighborsRegressor would look like with this PR merged? I looked at the tests to get a feeling for it, but the tests are quite tricky because they test a lot of things at once. Hence I think having a snippet of user code would be nice.

I also looked at https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html where you can already pass a DistanceMetric to the metric keyword. Is this a precedent for this new API?


I'm not sure I have a firm opinion yet, but my first reaction is that we should preferably use one way to allow users to extend/modify scikit-learn. I'm not sure this is the case here, but imagine we'd end up with a world where there were "estimator plugins" and "distance metric plugins" (using "plugin" as a generic word for "extending scikit-learn"). I think that would be quite confusing to explain to users. I think having one system would be better.

@Micky774
Contributor Author

Could you post an example snippet of what user code with KNeighborsRegressor would look like with this PR merged? I looked at the tests to get a feeling for it, but the tests are quite tricky because they test a lot of things at once. Hence I think having a snippet of user code would be nice.

Sure! For the sake of example I'll assume the user has installed my sample accelerated implementations with an installation of scikit-learn from this branch.

Then it would look like:

from sklearn.neighbors import KNeighborsRegressor
import numpy as np
# Step 1
from distance_metrics import get_distance_metric # package name is a WIP and will soon be slsdm :)

random_state = 0
n_samples_X = 100
n_features = 10
n_classes = 3

rng = np.random.RandomState(random_state)
X = rng.randn(n_samples_X, n_features)
y = rng.randint(n_classes, size=n_samples_X)

# Step 2
dst = get_distance_metric(X, "manhattan")

# Step 3
neigh = KNeighborsRegressor(n_neighbors=2, metric=dst)
neigh.fit(X, y)

print(neigh.predict(X[:10]))

I also looked at https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html where you can already pass a DistanceMetric to the metric keyword. Is this a precedent for this new API?

Indeed! This actually makes the implementation very straightforward for many estimators, since the functionality is largely already available (we just need to relax the estimators' own validation).

I'm not sure I have a firm opinion yet, but my first reaction is that we should preferably use one way to allow users to extend/modify scikit-learn. I'm not sure this is the case here, but imagine we'd end up with a world where there were "estimator plugins" and "distance metric plugins" (using "plugin" as a generic word for "extending scikit-learn"). I think that would be quite confusing to explain to users. I think having one system would be better.

I definitely agree that having one system would be better. I personally would be okay with multiple paths if it made sense and reduced barriers to entry. I don't know if this API change I propose actually accomplishes that or not, but that is the underlying hope.

I've attached a list of potential paths forward. What do you think of option four:

Partially accept the API change as an experimental feature and remove without deprecation in favor of the Engine API later


Some Potential Paths Forward

I just want to go ahead and summarize some of the paths forward we have discussed.

1. Accept the API change and do not extend the Engine API to cover DistanceMetric{32}

If you want to create a new distance metric then you probably either:

  1. Want to use a new (perhaps exotic) metric, in which case you may extend DistanceMetric{32} and flesh it out.
  2. Want to use a fancy implementation (e.g. SIMD/GPU) of an existing metric, in which case you may extend the default implementation for the metric and overwrite whatever you need.

And in either case instantiate the object and pass it directly via metric.

There is a clear, single path to accelerating DistanceMetric{32}.

2. Accept the API change and extend the Engine API to cover DistanceMetric{32}

The prior options still apply, however now you can also develop a new distance metric using its corresponding Engine, a stricter API that includes usability checks and allows for use through entry-point registration. From a user-experience perspective, Engine-API-compliant implementations are almost always preferred, except perhaps for cases where a user wants to write a new distance metric as quickly as possible, in which case simply extending the classes as mentioned above may be preferred (even if not recommended).

This is where the confusion you mentioned would set in, making it unclear when one should make an implementation of the Engine or DistanceMetric{32} directly.

3. Reject the API and instead extend the Engine API to cover DistanceMetric{32}

Developers must write Engine API compliant implementations, but get all the benefits of doing so.

There is a clear, single path to accelerating DistanceMetric{32}.

4. Partially accept the API change as an experimental feature and remove without deprecation in favor of the Engine API later

Although the feature will be more obscure, it will still be available. Users with a strong need may cross the extra hurdle to enable the feature and use a third-party implementation while average users see no change. Later, when the Engine API is more mature and is extended to cover DistanceMetric{32} the experimental feature can be disabled and we can advocate for the Engine API as the sole path for extending DistanceMetric{32}.

At any given time, there is exactly one path for extending DistanceMetric{32}, though initially it is less discoverable. However, the less-discoverable path is likely available much sooner than the full Engine API.
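Option four's "extra hurdle" could mirror scikit-learn's existing enable-by-explicit-import convention (as with sklearn.experimental.enable_halving_search_cv). The toy sketch below shows only the gating pattern; none of it is real scikit-learn code:

```python
# Toy feature-flag gate in the spirit of sklearn.experimental's enable_* modules.
_EXPERIMENTAL = {"distance_metric_objects": False}

def enable_distance_metric_objects():
    """Stand-in for what importing a hypothetical enable_* module would do."""
    _EXPERIMENTAL["distance_metric_objects"] = True

def validate_metric(metric):
    """Accept strings always; accept arbitrary metric objects only once enabled."""
    if not isinstance(metric, str) and not _EXPERIMENTAL["distance_metric_objects"]:
        raise TypeError("pass a metric name, or enable the experimental feature")
    return metric

# Before enabling: non-string metrics are rejected.
try:
    validate_metric(object())
    raised = False
except TypeError:
    raised = True

# After enabling: they pass validation.
enable_distance_metric_objects()
enabled_ok = validate_metric(object()) is not None
assert raised and enabled_ok
```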

@adrinjalali
Member

I'm in favor of accepting this API enhancement no matter what we decide on the engine API side (I slightly prefer this to the engine API acceleration path I think, but no strong feelings there).

It would be nice to unify metrics across the code base, rather than having different sets of accepted metrics for different classes.

Member

@jjerphan jjerphan left a comment


It would be nice to unify metrics across the code base, rather than having different sets of accepted metrics for different classes.

I also agree. Since some classes like KDTree and BallTree support DistanceMetric for metric as indicated by @betatim in #26267 (comment), the natural solution would be to have metric support DistanceMetric for all other interfaces exposing it.

Hence I am in favor of this PR, especially given that @Micky774 has come up with compelling performance improvement in the meantime. 🌟

I think some preliminary refactoring might be needed before approving this PR.

@Micky774
Contributor Author

Micky774 commented Jun 2, 2023

As an update regarding the motivation of this API, it turns out for KNeighborsRegressor on data with many features (e.g. n_features=500) we see some drastic speedups of the predict method as a whole. We see anywhere from 5.5-8x speedup on float32 data, and a 3-4x speedup on float64 data. Note that this is when n_jobs=None for algorithm='brute' which parallelizes using all available threads (I believe it is equivalent to n_jobs=-1).

Note that the following benchmarks were generated with this gist

Plots: two benchmark figures attached (float32 and float64 speedup comparisons; images not reproduced here).

@betatim
Member

betatim commented Jun 2, 2023

With this PR merged I'd be able to use a DistanceMetric object (or instance? but this is a detail) from anywhere, it doesn't have to be part of scikit-learn, right? This in turn means I could use an implementation that uses SIMD instructions or other dark magic.

If yes, can we spell this out in the docs somewhere and maybe even link to an alternative source of DistanceMetrics? Given the speedups shown I think it is worth investing in some docs to make sure the average person on the internet highway understands the potential gains this PR brings. Put differently, I worry that without docs this is too obscure/too many steps for people to understand and as a result won't get used that much.

@Micky774
Contributor Author

Micky774 commented Jun 2, 2023

With this PR merged I'd be able to use a DistanceMetric object (or instance? but this is a detail) from anywhere, it doesn't have to be part of scikit-learn, right? This in turn means I could use an implementation that uses SIMD instructions or other dark magic.

Precisely 😄

If yes, can we spell this out in the docs somewhere and maybe even link to an alternative source of DistanceMetrics? Given the speedups shown I think it is worth investing in some docs to make sure the average person on the internet highway understands the potential gains this PR brings. Put differently, I worry that without docs this is too obscure/too many steps for people to understand and as a result won't get used that much.

Fully agreed. I originally intended to include it in documentation once the functionality was exposed through the major pairwise distance reduction backends mentioned in the issue, but I suppose there's not really any reason to wait that long. We can have a maintained list of estimators for which the functionality is available, and then remove the list once the use of the metric keyword is made consistent across the codebase.

I'll start on that, and can provide my library as an example implementation (this'll kick me into gear to actually improve the library's documentation as well 😅).

@github-actions

github-actions bot commented Jul 3, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: e4d8413.

@Micky774
Contributor Author

Micky774 commented Jul 19, 2023

For those interested in trying this, there are now published wheels for Linux/macOS x86-64 systems for my SIMD-accelerated DistanceMetric implementation. You can also build from source if you want anything more than the SSE3 support that the wheels ship with.

If you want to quickly sanity check the speedups you can:

  1. Build scikit-learn from source on my metric_obj branch (this PR)
  2. pip install slsdm or follow the instructions to build from source (requires gcc)
  3. Follow along the quick demo in the repo

With that aside, I wanted to go ahead and ping @scikit-learn/core-devs to see if anyone has the capacity/interest in reviewing this PR. I think the API proposed here is a reasonable extension of what we already have for {KD, Ball}Tree, and in concert with #25914 and #25561 would bring us closer to a unified API for the metric keyword argument while inviting third-party acceleration.

Member

@jjerphan jjerphan left a comment


LGTM.

Could we treat other estimators in other dedicated PRs?

@jjerphan jjerphan added the Waiting for Second Reviewer First reviewer is done, need a second one! label Jul 20, 2023
@Micky774 Micky774 changed the title API Allow users to pass DistanceMetric objects to metric keyword argument API Allow users to pass DistanceMetric objects to metric keyword argument in neighbors.KNeighborsRegressor Jul 20, 2023
@Micky774
Contributor Author

@thomasjpfan @ogrisel In case either of you were interested in reviewing

elif (
callable(self.metric)
or self.metric in VALID_METRICS["ball_tree"]
or isinstance(self.metric, DistanceMetric)
Member


API-wise, self.algorithm selects ball_tree all the time if self.metric is a DistanceMetric. Is this the preferred default for self.algorithm="auto"?

Contributor Author


This preserves the standard behavior, e.g. metric="euclidean" and metric=DistanceMetric.get_metric("euclidean") will both result in a BallTree with an EuclideanDistance object.

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@thomasjpfan thomasjpfan enabled auto-merge (squash) August 1, 2023 17:39
@thomasjpfan thomasjpfan merged commit a16d367 into scikit-learn:main Aug 1, 2023
26 checks passed
@Micky774 Micky774 deleted the metric_obj branch August 1, 2023 19:06
9Y5 pushed a commit to 9Y5/scikit-learn that referenced this pull request Aug 2, 2023
…argument in `neighbors.KNeighborsRegressor` (scikit-learn#26267)

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
…argument in `neighbors.KNeighborsRegressor` (scikit-learn#26267)

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>