
ENH Adapt to latest commits of the feature/engines branch #74

Merged
merged 7 commits into main from adapt_to_wip_engines_latest on Feb 9, 2023

Conversation

fcharras
Collaborator

No description provided.

jjerphan
jjerphan previously approved these changes Dec 15, 2022
Member

@jjerphan jjerphan left a comment


LGTM modulo a few changes.

@@ -156,6 +156,7 @@ def test_euclidean_distance(dtype):
    estimator = KMeans(n_clusters=len(b))
    estimator.cluster_centers_ = b
    engine = KMeansEngine(estimator)
    assert engine.accepts(a, y=None, sample_weight=None)
Member


I think we should add assertions when y or sample_weight are not None.

This applies to the new assertions as well.
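
For example (hypothetical test lines, not from the PR, assuming the engine declines a non-None y and accepts an explicit weight array, with numpy imported as np in the test module):

assert not engine.accepts(a, y=a, sample_weight=None)
assert engine.accepts(a, y=None, sample_weight=np.ones(len(a)))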

Comment on lines 349 to 354
# When sample_weight is None, the call to `_check_sample_weight` is delayed
# until now because, because the array of `ones` that is created is only
# necessary for engine methods that actually make use of `sample_weight` and
# call `_check_is_accepted_sample_weight`.
# Methods that don't use `sample_weight` still pass `sample_weight=None` to
# `accepts` but doesn't need to create the corresponding array.
Member

@jjerphan jjerphan Dec 15, 2022


Suggested change
# When sample_weight is None, the call to `_check_sample_weight` is delayed
# until now because, because the array of `ones` that is created is only
# necessary for engine methods that actually make use of `sample_weight` and
# call `_check_is_accepted_sample_weight`.
# Methods that don't use `sample_weight` still pass `sample_weight=None` to
# `accepts` but doesn't need to create the corresponding array.
# When sample_weight is None, the call to `_check_sample_weight` is
# delayed until now because the array of `ones` that is created is
# only necessary for engine methods that actually make use of
# `sample_weight` and call `_check_is_accepted_sample_weight`.
# Methods that don't use `sample_weight` still pass
# `sample_weight=None` to `accepts` but doesn't need to create
# the corresponding array.

Comment on lines 335 to 347
def _check_is_accepted_X(self, X):
    if X is not self._X_accepted:
        raise RuntimeError(
            "The object that was passed to the engine to query its compatibility "
            "is different from the object that was given in downstream methods."
        )

def _check_is_accepted_sample_weight(self, sample_weight):
    if sample_weight is not self._sample_weight_accepted:
        raise RuntimeError(
            "The object that was passed to the engine to query its compatibility "
            "is different from the object that was given in downstream methods."
        )
Member


I think we must add tests asserting that those error cases are covered.

Collaborator

@ogrisel ogrisel left a comment


I think we could simplify things a lot by:

  • not calling _validate_data in accepts but instead just performing shallow type checking in accepts, that is something like:
  accepted_types = (np.ndarray, dpnp.ndarray, dpt.usm_ndarray)
  return (
      isinstance(X, accepted_types)
      and y is None
      and isinstance(sample_weight, accepted_types)
  )
  • keeping the calls to _validate_data as in the current main (we could probably simplify it a bit further but we can do that later).

This means the engine will refuse to activate when the user passes data as a list of lists of Python scalar numbers, but we don't care. This will make the code much simpler by not having to store _X_validated or _X_accepted on the engine instance.

What do you think?

        if sample_weight is not None:
            self._sample_weight_validated = self._check_sample_weight(sample_weight)
        return True
    except Exception:
Collaborator


We could be more specific here, no?

Suggested change
except Exception:
except NotSupportedByEngineError:

Collaborator


Maybe let's also add a TODO comment to explain that this condition on _is_in_testing_mode would be better handled in a mixin class for the estimator, which should be in charge of raising an exception when the fallback to the default engine is explicitly disabled by a dedicated entry in sklearn._config.set_config.

Collaborator Author


The _validate_data call can also raise ValueError or TypeError, or any other error that sklearn.BaseEstimator._validate_data or dpt.asarray would want to throw there. In those cases, we also want the engine to decline the computation. So a generic Exception is good here.

@ogrisel
Collaborator

ogrisel commented Dec 15, 2022

Also it would be great to have sklearn_numba_dpex specific tests to check that passing dpnp or dpctl arrays / tensors works as expected without raising an exception and returns the same results as when passing a numpy array with the same data values inside.

@fcharras
Collaborator Author

Also it would be great to have sklearn_numba_dpex specific tests to check that passing dpnp or dpctl arrays / tensors works as expected without raising an exception and returns the same results as when passing a numpy array with the same data values inside.

Already exists 😁

@fcharras
Collaborator Author

I think we could simplify things a lot by:

* not calling `_validate_data` in accepts but instead just performing shallow type checking in accepts, that is something like:

[...]

What do you think?

It's true that the code in this PR is slightly more complicated, but I think it's worth it, and that it's actually the laziest path rather than the opposite. Mimicking the behavior of scikit-learn regarding what types of inputs are accepted ensures that the tests from the test suite will be compatible. If we restrict the types that are accepted, we might lose compatibility with tests that would be perfectly valid otherwise, but happen to be designed with lists of lists as inputs.

So with this strategy we don't have to worry about that.

About what is good for the user, do you suggest that a UX that restricts the accepted types is better, or only simpler to maintain? (Personally I would say that it's better, because implicit casting is bad, but I think the best is to mimic the sklearn UX as much as possible, whatever its choices.)

@ogrisel
Collaborator

ogrisel commented Dec 15, 2022

If we restrict the types that are accepted, we might lose compatibility with tests that would be perfectly valid otherwise, but happen to be designed with lists of lists as inputs.

I don't see any valid cases that are not covered by the simple solution I propose.

We honestly don't care about supporting lists of lists in GPU k-means. Lists of lists are only useful for the occasional quick one-liner in an educational context. I am pretty sure all the scikit-learn tests for k-means use either numpy arrays or scipy sparse matrices.

If there are tests in scikit-learn for k-means that use list of lists, we can quickly update them to use numpy arrays.

@ogrisel
Collaborator

ogrisel commented Dec 15, 2022

About what is good for the user, do you suggest that a UX that restricts the accepted types is better, or only simpler to maintain? (Personally I would say that it's better, because implicit casting is bad, but I think the best is to mimic the sklearn UX as much as possible, whatever its choices.)

I think it's better to be explicit about what kind of container types an engine accepts. It makes things easier to reason about, both for the users and for the maintainers.

@ogrisel
Collaborator

ogrisel commented Dec 15, 2022

I spoke too quickly, we might also want to accept pandas dataframes...

@ogrisel
Collaborator

ogrisel commented Dec 15, 2022

So maybe we can accept any container that has the __array__ method but is not scipy.sparse.issparse() (because, unfortunately, sparse matrices expose an __array__ attribute but we really don't want to call it on a large sparse matrix).
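
A minimal sketch of such a check (the helper name is illustrative, not code from the PR):

import scipy.sparse as sp

def has_dense_array_interface(X):
    # Accept any container exposing __array__, except scipy sparse matrices,
    # whose __array__ would densify a potentially huge matrix.
    return hasattr(X, "__array__") and not sp.issparse(X)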

@ogrisel
Collaborator

ogrisel commented Dec 15, 2022

Something that is not clear with the new accepts API is how it interacts with the engine instance lifecycle.

Do we really want to re-run the engine negotiation prior to calling predict? I would have thought that we would compute the negotiated engine provider name by calling accepts on all the active providers at fit time, store that provider name as a private attribute of the estimator, and reuse it at prediction time without calling a chain of accepts again.

@betatim @fcharras do you agree with this lifecycle? Or do you see problems?

EDIT: what I wrote above is what is currently implemented in the wip-engines branch in the KMeans._get_engine method:

So, as of now, the accepts method is never called at prediction time. I don't understand why the scikit-learn tests pass. I would need to update my dev environment on my intel laptop to inspect what's going on.

EDIT: I misread the code: accepts is called again, even if the _engine_provider attribute is set. I did not expect this.

Hum. I find it a bit weird to re-negotiate engines at prediction time. I did not expect that.

@ogrisel
Collaborator

ogrisel commented Dec 15, 2022

Thinking about this I have the feeling that we should change the way _get_engine works.

Instead I think _get_engine should be in charge of calling engine._validate_data and returning both the engine instance and the validated data for the first engine that accepts the data.

We can keep the accepts method to make it possible for engines to do an early rejection (e.g. based on the algorithm attribute) prior to calling engine._validate_data.

WDYT @betatim, @jjerphan and @fcharras? I think this would make this PR much cleaner while still making it possible to use the outcome of _validate_data to drive the engine negotiation, and without having to store _X_validated and check for physical equality later (X is self._X_validated), which I find very ugly.
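
A rough sketch of what this could look like (names, signatures and the exception type are illustrative, not the actual feature/engines API):

class NotSupportedByEngineError(Exception):
    """Raised by an engine that cannot handle the provided inputs."""

def get_engine(estimator, engine_classes, X, y=None, sample_weight=None):
    # Illustrative only: iterate over candidate engine classes in priority
    # order, let each one validate the data, and return the first engine
    # that accepts it together with the validated arrays.
    for engine_class in engine_classes:
        engine = engine_class(estimator)
        # Optional early rejection, e.g. based on estimator hyperparameters.
        if not engine.accepts(X, y, sample_weight):
            continue
        try:
            validated = engine._validate_data(X, y, sample_weight)
        except NotSupportedByEngineError:
            continue
        return engine, validated
    raise RuntimeError("No installed engine accepted the provided data.")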

@fcharras
Collaborator Author

fcharras commented Dec 16, 2022

I don't think it would be simpler. I've found input validation to be surprisingly difficult and verbose, and I deleted several drafts before the current state. I'm particularly happy with reproducing scikit-learn choices when we can so we don't have to make committal choices ourselves (and maintain them). For the user, I can see benefits too, because ultimately what validates an input or not is _asarray_with_order, which, in the current state of things, relies on the asarray method of the underlying array library, so that the inputs that are accepted by the engine are the same as the inputs that are accepted by the underlying array library to create a new array, so we have a consistent behavior. For this reason also, it makes sense to fuse accepts with the conversion of the output, very much like scikit-learn already does with numpy.asarray.

@fcharras
Collaborator Author

fcharras commented Dec 16, 2022

I agree that renegotiation at prediction time doesn't fit well and probably should be changed, unless falling back to other backends at prediction time can make sense; but that would mean implicitly converting the fitted attributes, and less implicit is better.

@betatim

betatim commented Jan 11, 2023

On the topic of "calling accept again at predict time": I agree that it is weird. The reason that _get_engine doesn't immediately return the engine corresponding to the value recorded during fit is that I want to detect the case where the user has changed the "environment" between calling fit and predict. For example, one is in a config context but the other isn't. The idea of _get_engine was to re-run the "provider resolution" at predict time and then raise an error when the result doesn't match what was recorded during fit. The assumption is that calling accept is cheap. It has to be cheap because it can get called a lot of times if you have a lot of plugins installed.

Re-reading the code now I had to read it twice to realise this is what happens. So I think changing it is a good idea. Not sure how to change it yet.

I think accept should be allowed to look at the type, shape and such of the input data as well as the configuration of the estimator to make its decision. However, this means its answer can change between fit and predict, because in one case the caller passed a cupy array and in the other not, or some other thing leads to the input not being something the engine can/wants to handle. You could argue that a wrong input type will lead to an exception during predict anyway, so it is no big deal. That is true, but is it really that simple for all the various reasons the fit-time and predict-time provider resolution might differ?

Based on the assumption that accept is cheap to call and that the main reason for results changing is "user error", it feels like an easy-to-understand, easy-to-reason-about solution is to just re-run the resolution and raise an error when the results differ.

Is there a reason for accept to not be cheap to call?
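
For concreteness, the "re-run and compare" behavior described here could look roughly like this (an illustrative sketch; only the _engine_provider attribute name comes from the branch, the rest is hypothetical):

def check_engine_unchanged(estimator, provider_at_predict):
    # Compare the provider resolved at predict time with the one recorded
    # during fit, and fail loudly if they differ (e.g. because the config
    # context or the input container type changed between the two calls).
    if provider_at_predict != estimator._engine_provider:
        raise RuntimeError(
            f"Engine provider changed between fit ({estimator._engine_provider!r}) "
            f"and predict ({provider_at_predict!r})."
        )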

@fcharras
Collaborator Author

I don't think the idea of calling accept at predict time on other engines is good here. Engines are not exclusive to a particular type of input, and the ranges of accepted inputs of different engines can have a non-empty intersection. At predict time, we want to re-use the engine that was used at fit time, even if the type of input is different. Otherwise, cases where the engine selection priority is not the same depending on the input type could create weird, inconsistent behavior, where another engine is suggested even though the input type would still be accepted by the engine used during training.

On the other hand, it's reliable if the input is only tested for acceptance by the engine that was used during fit; if the input is bad, let's raise an error and suggest that the user either convert the estimator for the appropriate engine, or pass another input type.

Aside from this, overall, I'm -1 on the accept API at this point, but for another reason. I think I agree that decoupling acceptance and validation is good, but the current state of input validation in scikit-learn and in array libraries does not fit well with this approach; let me try to explain why.

Currently, in scikit-learn, what ultimately validates an input or not is whether the input is accepted by the asarray method. And after trying to implement input validation I can understand why it is this way: it is hard to perform input validation on an object that is completely unknown (does it have a shape attribute? else, does it implement len? does it support slicing? etc.), and asarray embeds all those validation rules already. So, it's easier to just try to asarray any input object and see what happens. If it succeeds, the input is (almost) validated, the remaining checks are easy, and, since asarray has already output an array with the correct type, it might have triggered memory allocation and copies, and we don't want to waste that, so we keep it. That's why, even if decoupling acceptance and validation makes sense, in practice it's easier to have only one validation method that relies on asarray, because asarray itself does not come with an asarray_accept helper.

In this PR, the accept API is worked around in a way such that the validated input is kept as an attribute in accepts, so we can keep relying on dpt.asarray (NB: asarray is included in the Array API). It works, but it's kind of complicated.
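
Roughly, the pattern described here boils down to something like the following (a simplified sketch, not the actual PR code):

import dpctl.tensor as dpt

def accepts_and_convert(X):
    # Let dpt.asarray decide whether the input is valid, and keep the
    # converted array so the allocation/copy is not wasted if this engine
    # ends up being selected.
    try:
        return True, dpt.asarray(X)
    except Exception:
        return False, None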

@fcharras
Collaborator Author

Closing the PR following the last round of live discussions: https://hackmd.io/--MJTgQzSFSYaaAgcJWQZg?both#2023-01-12 . I will open a new PR to adapt to the new changes when it's ready.

@ogrisel
Collaborator

ogrisel commented Feb 1, 2023

We have a problem with the new default implementation of count_distinct_clusters from the default engine in the new version of wip-engines:

For instance when running the benchmark script with this PR:

Traceback (most recent call last):
  File "/home/ogrisel/code/sklearn-numba-dpex/benchmark/kmeans.py", line 274, in <module>
    kmeans_timer.timeit(
  File "/home/ogrisel/code/sklearn-numba-dpex/benchmark/kmeans.py", line 108, in timeit
    KMeans(**est_kwargs).set_params(max_iter=1).fit(
  File "/home/ogrisel/mambaforge/envs/sklearn-numba-dpex/lib/python3.9/site-packages/sklearn/_engine/base.py", line 154, in wrapper
    r = method(self, *args, **kwargs)
  File "/home/ogrisel/mambaforge/envs/sklearn-numba-dpex/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py", line 1650, in fit
    distinct_clusters = engine.count_distinct_clusters(best_labels)
  File "/home/ogrisel/mambaforge/envs/sklearn-numba-dpex/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py", line 414, in count_distinct_clusters
    return len(set(cluster_labels))
TypeError: unhashable type: 'dpctl.tensor._usmarray.usm_ndarray'

Observed with dpctl==0.14.0+17.gfd263733b.
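
A possible workaround (a sketch only, not necessarily the fix that was eventually adopted) is to avoid hashing device array elements and count unique labels after moving them back to NumPy:

import dpctl.tensor as dpt
import numpy as np

def count_distinct_clusters(cluster_labels):
    # usm_ndarray elements are not hashable (hence the TypeError above), so
    # convert device labels back to a NumPy array before counting.
    if isinstance(cluster_labels, dpt.usm_ndarray):
        cluster_labels = dpt.asnumpy(cluster_labels)
    return len(np.unique(np.asarray(cluster_labels)))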

@ogrisel
Collaborator

ogrisel commented Feb 1, 2023

Actually this problem also appears in some failing tests. Furthermore, other tests need to be adapted to the new API.

@jjerphan jjerphan dismissed their stale review February 2, 2023 07:53

Dismissing approval due to the recent failures caused by the merge. I think they must be resolved first.

@fcharras fcharras marked this pull request as draft February 6, 2023 17:29
@fcharras fcharras marked this pull request as ready for review February 8, 2023 14:40
@fcharras
Collaborator Author

fcharras commented Feb 8, 2023

This is the minimal set of changes so that the bump works and we keep the same level of testing.

The next TODO I'm working on is implementing and testing the new method convert_to_sklearn_types to enable auto-conversion. But maybe it can be done in a separate PR?

I'm not sure whether there are other changes to consider. I'd like to get rid entirely of the environment variable SKLEARN_NUMBA_DPEX_TESTING_MODE on our side, but I think this requires an additional keyword on the sklearn config side? I'll open the discussion on the feature/wip-engine side.

Member

@jjerphan jjerphan left a comment


LGTM.

I think the title of the PR can now be changed to:

ENH Adapt to latest commits of the `feature/engines` branch

What do you think?

@fcharras fcharras changed the title from "ENH Adapt to latest commits of wip-engines" to "ENH Adapt to latest commits of the feature/engines branch" Feb 9, 2023
def accepts(self, X, y, sample_weight):

    if (algorithm := self.estimator.algorithm) not in ("lloyd", "auto", "full"):
        if self._is_in_testing_mode:
Collaborator


As discussed in the bi-weekly plugin meeting, we can remove this condition and make the accepts method behave the same in tests as in regular execution environments.

Collaborator

@ogrisel ogrisel left a comment


LGTM. Feel free to merge as is and do a follow-up PR to adapt the behavior of the test fixture w.r.t. accepting in test mode.

@ogrisel ogrisel merged commit ba93efc into main Feb 9, 2023
@ogrisel ogrisel deleted the adapt_to_wip_engines_latest branch February 9, 2023 13:08
@ogrisel
Collaborator

ogrisel commented Feb 9, 2023

Since CI was green and @jjerphan already gave his +1, I just merged.

@fcharras
Collaborator Author

fcharras commented Feb 9, 2023

TY for the merge, the follow-up will be in #89 (not ready to review yet)
