
ENH Adapt to latest commits of the feature/engines branch #74

Merged
merged 7 commits into main from adapt_to_wip_engines_latest on Feb 9, 2023

Conversation

fcharras
Collaborator

No description provided.

jjerphan
jjerphan previously approved these changes Dec 15, 2022
Member

@jjerphan jjerphan left a comment


LGTM modulo a few changes.

@@ -156,6 +156,7 @@ def test_euclidean_distance(dtype):
    estimator = KMeans(n_clusters=len(b))
    estimator.cluster_centers_ = b
    engine = KMeansEngine(estimator)
    assert engine.accepts(a, y=None, sample_weight=None)
Member


I think we should add assertions when y or sample_weight are not None.

This applies to the new assertions as well.
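
For example (hypothetical test lines, not from the PR, assuming the engine declines a non-None y and accepts an explicit weight array, with numpy imported as np in the test module):

assert not engine.accepts(a, y=a, sample_weight=None)
assert engine.accepts(a, y=None, sample_weight=np.ones(len(a)))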

Comment on lines 349 to 354
# When sample_weight is None, the call to `_check_sample_weight` is delayed
# until now because, because the array of `ones` that is created is only
# necessary for engine methods that actually make use of `sample_weight` and
# call `_check_is_accepted_sample_weight`.
# Methods that don't use `sample_weight` still pass `sample_weight=None` to
# `accepts` but doesn't need to create the corresponding array.
Member

@jjerphan jjerphan Dec 15, 2022


Suggested change
# When sample_weight is None, the call to `_check_sample_weight` is delayed
# until now because, because the array of `ones` that is created is only
# necessary for engine methods that actually make use of `sample_weight` and
# call `_check_is_accepted_sample_weight`.
# Methods that don't use `sample_weight` still pass `sample_weight=None` to
# `accepts` but doesn't need to create the corresponding array.
# When sample_weight is None, the call to `_check_sample_weight` is
# delayed until now because the array of `ones` that is created is
# only necessary for engine methods that actually make use of
# `sample_weight` and call `_check_is_accepted_sample_weight`.
# Methods that don't use `sample_weight` still pass
# `sample_weight=None` to `accepts` but doesn't need to create
# the corresponding array.

Comment on lines 335 to 347
def _check_is_accepted_X(self, X):
    if X is not self._X_accepted:
        raise RuntimeError(
            "The object that was passed to the engine to query its compatibility "
            "is different from the object that was given in downstream methods."
        )

def _check_is_accepted_sample_weight(self, sample_weight):
    if sample_weight is not self._sample_weight_accepted:
        raise RuntimeError(
            "The object that was passed to the engine to query its compatibility "
            "is different from the object that was given in downstream methods."
        )
Member


I think we must add tests asserting that those error cases are covered.

Collaborator

@ogrisel ogrisel left a comment


I think we could simplify things a lot by:

  • not calling _validate_data in accepts but instead just performing shallow type checking in accepts, that is something like:
  accepted_types = (np.ndarray, dpnp.ndarray, dpt.usm_ndarray)
  return (
      isinstance(X, accepted_types)
      and y is None
      and isinstance(sample_weight, accepted_types)
  )
  • keeping the calls to _validate_data as in the current main (we could probably simplify it a bit further but we can do that later).

This means the engine will refuse to activate when the user passes data as a list of lists of Python scalar numbers, but we don't care. This will make the code much simpler by not having to store _X_validated or _X_accepted on the engine instance.

What do you think?

        if sample_weight is not None:
            self._sample_weight_validated = self._check_sample_weight(sample_weight)
        return True
    except Exception:
Collaborator


We could be more specific here, no?

Suggested change
except Exception:
except NotSupportedByEngineError:

Collaborator


Maybe let's also add a TODO comment to explain that this condition on _is_in_testing_mode would be better handled in a mixin class for the estimator, which should be in charge of raising an exception when the fallback to the default engine is explicitly disabled by a dedicated entry in sklearn._config.set_config.

Collaborator Author


The _validate_data call can also raise ValueError or TypeError, or any other error that sklearn.BaseEstimator._validate_data or dpt.asarray would want to throw there. In those cases, we also want the engine to decline the computation. So a generic Exception is good here.

@ogrisel
Collaborator

ogrisel commented Dec 15, 2022

Also it would be great to have sklearn_numba_dpex specific tests to check that passing dpnp or dpctl arrays / tensors works as expected without raising an exception and returns the same results as when passing a numpy array with the same data values inside.

@fcharras
Collaborator Author

Also it would be great to have sklearn_numba_dpex specific tests to check that passing dpnp or dpctl arrays / tensors works as expected without raising an exception and returns the same results as when passing a numpy array with the same data values inside.

Already exists 😁

@fcharras
Collaborator Author

I think we could simplify things a lot by:

* not calling `_validate_data` in accepts but instead just performing shallow type checking in accepts, that is something like:

[...]

What do you think?

It's true that the code in this PR is slightly more complicated, but I think it's worth it, and that it's actually the laziest path rather than the opposite. Mimicking the behavior of scikit-learn regarding what types of inputs are accepted ensures that the tests from the test suite will be compatible. If we restrict the types that are accepted, we might lose compatibility with tests that would be perfectly valid otherwise, but happen to be designed with lists of lists as inputs.

So with this strategy we don't have to worry about that.

About what is good for the user, do you suggest that a UX that restricts the accepted types is better, or only simpler to maintain? (Personally I would say that it's better, because implicit casting is bad, but I think the best is to mimic the sklearn UX as much as possible, whatever its choices.)

@ogrisel
Collaborator

ogrisel commented Dec 15, 2022

If we restrict the types that are accepted, we might lose compatibility with tests that would be perfectly valid otherwise, but happen to be designed with lists of lists as inputs.

I don't see any valid cases that are not covered by the simple solution I propose.

We honestly don't care about supporting lists of lists in GPU k-means. Lists of lists are only useful for the occasional quick one-liner in an educational context. I am pretty sure all the scikit-learn tests for k-means use either numpy arrays or scipy sparse matrices.

If there are tests in scikit-learn for k-means that use list of lists, we can quickly update them to use numpy arrays.

@ogrisel
Collaborator

ogrisel commented Dec 15, 2022

About what is good for the user, do you suggest that a UX that restricts the accepted types is better, or only simpler to maintain? (Personally I would say that it's better, because implicit casting is bad, but I think the best is to mimic the sklearn UX as much as possible, whatever its choices.)

I think it's better to be explicit about what kind of container types an engine accepts. It makes things easier to reason about, both for the users and for the maintainers.

@ogrisel
Collaborator

ogrisel commented Dec 15, 2022

I spoke too quickly, we might also want to accept pandas dataframes...

@ogrisel
Collaborator

ogrisel commented Dec 15, 2022

So maybe we can accept any container that has the __array__ method but is not scipy.sparse.issparse() (because, unfortunately, sparse matrices expose an __array__ attribute but we really don't want to call it on a large sparse matrix).
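
A minimal sketch of such a check (the helper name is illustrative, not code from the PR):

import scipy.sparse as sp

def has_dense_array_interface(X):
    # Accept any container exposing __array__, except scipy sparse matrices,
    # whose __array__ would densify a potentially huge matrix.
    return hasattr(X, "__array__") and not sp.issparse(X)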

@ogrisel
Collaborator

ogrisel commented Dec 15, 2022

Something that is not clear with the new accepts API is how it interacts with the engine instance lifecycle.

Do we really want to re-run the engine negotiation prior to calling predict? I would have thought that we would compute the negotiated engine provider name by calling accepts on all the active providers at fit time, store that provider name as a private attribute of the estimator, and reuse it at prediction time without calling a chain of accepts again.

@betatim @fcharras do you agree with this lifecycle? Or do you see problems?

EDIT: what I wrote above is what is currently implemented in the wip-engines branch in the KMeans._get_engine method:

So, as of now, the accepts method is never called at prediction time. I don't understand why the scikit-learn tests pass. I would need to update my dev environment on my intel laptop to inspect what's going on.

EDIT: I misread the code: accepts is called again, even if the _engine_provider attribute is set. I did not expect this.

Hum. I find it a bit weird to re-negotiate engines at prediction time. I did not expect that.

@ogrisel
Collaborator

ogrisel commented Dec 15, 2022

Thinking about this I have the feeling that we should change the way _get_engine works.

Instead I think _get_engine should be in charge of calling engine._validate_data and returning both the engine instance and the validated data for the first engine that accepts the data.

We can keep the accepts method to make it possible for engines to do an early rejection (e.g. based on the algorithm attribute) prior to calling engine._validate_data.

WDYT @betatim, @jjerphan and @fcharras? I think this would make this PR much cleaner while still making it possible to use the outcome of _validate_data to drive the engine negotiation, and without having to store _X_validated and check for physical equality later (X is self._X_validated), which I find very ugly.
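
A rough sketch of what this could look like (names, signatures and the exception type are illustrative, not the actual feature/engines API):

class NotSupportedByEngineError(Exception):
    """Raised by an engine that cannot handle the provided inputs."""

def get_engine(estimator, engine_classes, X, y=None, sample_weight=None):
    # Illustrative only: iterate over candidate engine classes in priority
    # order, let each one validate the data, and return the first engine
    # that accepts it together with the validated arrays.
    for engine_class in engine_classes:
        engine = engine_class(estimator)
        # Optional early rejection, e.g. based on estimator hyperparameters.
        if not engine.accepts(X, y, sample_weight):
            continue
        try:
            validated = engine._validate_data(X, y, sample_weight)
        except NotSupportedByEngineError:
            continue
        return engine, validated
    raise RuntimeError("No installed engine accepted the provided data.")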

@fcharras
Collaborator Author

fcharras commented Dec 16, 2022

I don't think it would be simpler. I've found input validation to be surprisingly difficult and verbose, and I deleted several drafts before the current state. I'm particularly happy with reproducing scikit-learn choices when we can so we don't have to make committal choices ourselves (and maintain them). For the user, I can see benefits too, because ultimately what validates an input or not is _asarray_with_order, which, in the current state of things, relies on the asarray method of the underlying array library, so that the inputs that are accepted by the engine are the same as the inputs that are accepted by the underlying array library to create a new array, so we have a consistent behavior. For this reason also, it makes sense to fuse accepts with the conversion of the output, very much like scikit-learn already does with numpy.asarray.

@fcharras
Collaborator Author

fcharras commented Dec 16, 2022

I agree that renegotiation at prediction time doesn't fit well and probably should be changed, unless falling back to other backends at prediction time can make sense; but that would mean implicitly converting the fitted attributes, and less implicit is better.

@betatim

betatim commented Jan 11, 2023

On the topic of "calling accept again at predict time": I agree that it is weird. The reason that _get_engine doesn't immediately return the engine corresponding to the value recorded during fit is that I want to detect the case where the user has changed the "environment" between calling fit and predict. For example, one is in a config context but the other isn't. The idea of _get_engine was to re-run the "provider resolution" at predict time and then raise an error when the result doesn't match what was recorded during fit. The assumption is that calling accept is cheap. It has to be cheap because it can get called a lot of times if you have a lot of plugins installed.

Re-reading the code now I had to read it twice to realise this is what happens. So I think changing it is a good idea. Not sure how to change it yet.

I think accept should be allowed to look at the type, shape and such of the input data as well as the configuration of the estimator to make its decision. However, this means its answer can change between fit and predict, because in one case the caller passed a cupy array and in the other not, or some other thing leads to the input not being something the engine can/wants to handle. You could argue that a wrong input type will lead to an exception during predict anyway, so it is no big deal. That is true, but is it really that simple for all the various reasons the fit-time and predict-time provider resolution might differ?

Based on the assumption that accept is cheap to call and that the main reason for results changing is "user error", it feels like an easy-to-understand, easy-to-reason-about solution is to just re-run the resolution and raise an error when the results differ.

Is there a reason for accept to not be cheap to call?
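
For concreteness, the "re-run and compare" behavior described here could look roughly like this (an illustrative sketch; only the _engine_provider attribute name comes from the branch, the rest is hypothetical):

def check_engine_unchanged(estimator, provider_at_predict):
    # Compare the provider resolved at predict time with the one recorded
    # during fit, and fail loudly if they differ (e.g. because the config
    # context or the input container type changed between the two calls).
    if provider_at_predict != estimator._engine_provider:
        raise RuntimeError(
            f"Engine provider changed between fit ({estimator._engine_provider!r}) "
            f"and predict ({provider_at_predict!r})."
        )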

@fcharras
Collaborator Author

I don't think the idea of calling accept at predict time on other engines is good here. Engines are not exclusive to a particular type of input, and the ranges of accepted inputs of different engines can have a non-empty intersection. At predict time, we want to re-use the engine that was used at fit time, even if the type of input is different. Otherwise, cases where the engine selection priority is not the same depending on the input type could create weird, inconsistent behavior, where another engine is suggested even though the input type would still be accepted by the engine used during training.

On the other hand, it's reliable if the input is only tested for acceptance by the engine that was used during fit; if the input is bad, let's raise an error and suggest that the user either convert the estimator for the appropriate engine, or pass another input type.

Aside from this, overall, I'm -1 on the accept API at this point, but for another reason. I think I agree that decoupling acceptance and validation is good, but the current state of input validation in scikit-learn and in array libraries does not fit well with this approach; let me try to explain why.

Currently, in scikit-learn, what ultimately validates an input or not is whether the input is accepted by the asarray method. And after trying to implement input validation I can understand why it is this way: it is hard to perform input validation on an object that is completely unknown (does it have a shape attribute? else, does it implement len? does it support slicing? etc.), and asarray embeds all those validation rules already. So, it's easier to just try to asarray any input object and see what happens. If it succeeds, the input is (almost) validated, the remaining checks are easy, and, since asarray has already output an array with the correct type, it might have triggered memory allocation and copies, and we don't want to waste that, so we keep it. That's why, even if decoupling acceptance and validation makes sense, in practice it's easier to have only one validation method that relies on asarray, because asarray itself does not come with an asarray_accept helper.

In this PR, the accept API is worked around in a way such that the validated input is kept as an attribute in accepts, so we can keep relying on dpt.asarray (NB: asarray is included in the Array API). It works, but it's kind of complicated.
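
Roughly, the pattern described here boils down to something like the following (a simplified sketch, not the actual PR code):

import dpctl.tensor as dpt

def accepts_and_convert(X):
    # Let dpt.asarray decide whether the input is valid, and keep the
    # converted array so the allocation/copy is not wasted if this engine
    # ends up being selected.
    try:
        return True, dpt.asarray(X)
    except Exception:
        return False, None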

@fcharras
Collaborator Author

Closing the PR following the last round of live discussions: https://hackmd.io/--MJTgQzSFSYaaAgcJWQZg?both#2023-01-12 . I will open a new PR to adapt to the new changes when it's ready.

@ogrisel
Collaborator

ogrisel commented Feb 1, 2023

We have a problem with the new default implementation of count_distinct_clusters from the default engine in the new version of wip-engines:

For instance when running the benchmark script with this PR:

Traceback (most recent call last):
  File "/home/ogrisel/code/sklearn-numba-dpex/benchmark/kmeans.py", line 274, in <module>
    kmeans_timer.timeit(
  File "/home/ogrisel/code/sklearn-numba-dpex/benchmark/kmeans.py", line 108, in timeit
    KMeans(**est_kwargs).set_params(max_iter=1).fit(
  File "/home/ogrisel/mambaforge/envs/sklearn-numba-dpex/lib/python3.9/site-packages/sklearn/_engine/base.py", line 154, in wrapper
    r = method(self, *args, **kwargs)
  File "/home/ogrisel/mambaforge/envs/sklearn-numba-dpex/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py", line 1650, in fit
    distinct_clusters = engine.count_distinct_clusters(best_labels)
  File "/home/ogrisel/mambaforge/envs/sklearn-numba-dpex/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py", line 414, in count_distinct_clusters
    return len(set(cluster_labels))
TypeError: unhashable type: 'dpctl.tensor._usmarray.usm_ndarray'

Observed with dpctl==0.14.0+17.gfd263733b.
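
A possible workaround (a sketch only, not necessarily the fix that was eventually adopted) is to avoid hashing device array elements and count unique labels after moving them back to NumPy:

import dpctl.tensor as dpt
import numpy as np

def count_distinct_clusters(cluster_labels):
    # usm_ndarray elements are not hashable (hence the TypeError above), so
    # convert device labels back to a NumPy array before counting.
    if isinstance(cluster_labels, dpt.usm_ndarray):
        cluster_labels = dpt.asnumpy(cluster_labels)
    return len(np.unique(np.asarray(cluster_labels)))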

@ogrisel
Collaborator

ogrisel commented Feb 1, 2023

Actually this problem also appears in some failing tests. Furthermore, other tests need to be adapted to the new API.

@jjerphan jjerphan dismissed their stale review February 2, 2023 07:53

Dismissing approval due to the recent failures caused by the merge. I think they must be resolved first.

@fcharras fcharras marked this pull request as draft February 6, 2023 17:29
@fcharras fcharras marked this pull request as ready for review February 8, 2023 14:40
@fcharras
Collaborator Author

fcharras commented Feb 8, 2023

This is the minimal set of changes so that the bump works and we keep the same level of testing.

The next TODO I'm working on is implementing and testing the new method convert_to_sklearn_types to enable auto-conversion. But maybe it can be done in a separate PR?

I'm not sure whether there are other changes to consider. I'd like to get rid entirely of the environment variable SKLEARN_NUMBA_DPEX_TESTING_MODE on our side, but I think this requires an additional keyword on the sklearn config side? I'll open the discussion on the feature/wip-engine side.

Member

@jjerphan jjerphan left a comment


LGTM.

I think the title of the PR can now be changed to:

ENH Adapt to latest commits of the `feature/engines` branch

What do you think?

@fcharras fcharras changed the title from "ENH Adapt to latest commits of wip-engines" to "ENH Adapt to latest commits of the feature/engines branch" Feb 9, 2023
def accepts(self, X, y, sample_weight):

    if (algorithm := self.estimator.algorithm) not in ("lloyd", "auto", "full"):
        if self._is_in_testing_mode:
Collaborator


As discussed in the bi-weekly plugin meeting, we can remove this condition and make the accepts method behave the same in tests as in regular execution environments.

Collaborator

@ogrisel ogrisel left a comment


LGTM. Feel free to merge as is and do a follow-up PR to adapt the behavior of the test fixture w.r.t. accepting in test mode.

@ogrisel ogrisel merged commit ba93efc into main Feb 9, 2023
@ogrisel ogrisel deleted the adapt_to_wip_engines_latest branch February 9, 2023 13:08
@ogrisel
Collaborator

ogrisel commented Feb 9, 2023

Since CI was green and @jjerphan already gave his +1, I just merged.

@fcharras
Collaborator Author

fcharras commented Feb 9, 2023

TY for the merge, the follow-up will be in #89 (not ready to review yet)
