ENH Adapt to latest commits of the `feature/engines` branch #74
Conversation
LGTM modulo a few changes.
```diff
@@ -156,6 +156,7 @@ def test_euclidean_distance(dtype):
     estimator = KMeans(n_clusters=len(b))
     estimator.cluster_centers_ = b
     engine = KMeansEngine(estimator)
+    assert engine.accepts(a, y=None, sample_weight=None)
```
I think we should add assertions for the cases when `y` or `sample_weight` are not `None`. This also applies to the new assertions.
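For illustration, such assertions could look like the sketch below (hypothetical: `a` and `engine` are taken from the diff above, and the expected outcomes are assumptions about the intended `accepts` contract, namely that the engine declines a non-`None` `y`):

```python
import numpy as np

# Hypothetical sketch, not code from the PR: the expected return values
# depend on the `accepts` contract this review comment is asking to pin down.
assert not engine.accepts(a, y=np.zeros(len(a)), sample_weight=None)
assert engine.accepts(a, y=None, sample_weight=np.ones(len(a)))
```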
sklearn_numba_dpex/kmeans/engine.py
Outdated
```python
# When sample_weight is None, the call to `_check_sample_weight` is delayed
# until now because, because the array of `ones` that is created is only
# necessary for engine methods that actually make use of `sample_weight` and
# call `_check_is_accepted_sample_weight`.
# Methods that don't use `sample_weight` still pass `sample_weight=None` to
# `accepts` but doesn't need to create the corresponding array.
```
```suggestion
# When sample_weight is None, the call to `_check_sample_weight` is
# delayed until now because the array of `ones` that is created is
# only necessary for engine methods that actually make use of
# `sample_weight` and call `_check_is_accepted_sample_weight`.
# Methods that don't use `sample_weight` still pass
# `sample_weight=None` to `accepts` but don't need to create
# the corresponding array.
```
sklearn_numba_dpex/kmeans/engine.py
Outdated
```python
def _check_is_accepted_X(self, X):
    if X is not self._X_accepted:
        raise RuntimeError(
            "The object that was passed to the engine to query its compatibility "
            "is different from the object that was given in downstream methods."
        )

def _check_is_accepted_sample_weight(self, sample_weight):
    if sample_weight is not self._sample_weight_accepted:
        raise RuntimeError(
            "The object that was passed to the engine to query its compatibility "
            "is different from the object that was given in downstream methods."
        )
```
I think we should add tests asserting that those cases are handled.
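A hedged sketch of such a test (assuming the engine API shown in the diff above, and the same `a` and `engine` as in the earlier test):

```python
import pytest

# Sketch: once `accepts` has validated `a`, passing a *different* object
# (even one with equal values) to a downstream check should raise.
engine.accepts(a, y=None, sample_weight=None)
with pytest.raises(RuntimeError):
    engine._check_is_accepted_X(a.copy())
```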
I think we could simplify things a lot by:

- not calling `_validate_data` in `accepts` but instead just performing shallow type checking in `accepts`, that is, something like the snippet below;
- keeping the calls to `_validate_data` as in the current `main` (we could probably simplify it a bit further but we can do that later).

```python
accepted_types = (np.ndarray, dpnp.ndarray, dpt.usm_ndarray)
return (
    isinstance(X, accepted_types)
    and y is None
    and isinstance(sample_weight, accepted_types)
)
```
This means the engine will refuse to activate when the user passes data as a list of lists of Python scalar numbers, but we don't care. This will make the code much simpler by not having to store `_X_validated` or `_X_accepted` on the engine instance.
What do you think?
sklearn_numba_dpex/kmeans/engine.py
Outdated
```python
            if sample_weight is not None:
                self._sample_weight_validated = self._check_sample_weight(sample_weight)
            return True
        except Exception:
```
We could be more specific here, no?
```suggestion
        except NotSupportedByEngineError:
```
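In context, the narrowed handler would look roughly like this (a sketch: the `try` body is assumed from the diff above; only `NotSupportedByEngineError` is named in the suggestion):

```python
try:
    X = self._validate_data(X)  # assumed from the surrounding diff
    if sample_weight is not None:
        self._sample_weight_validated = self._check_sample_weight(sample_weight)
    return True
except NotSupportedByEngineError:
    # Only decline the compute for inputs this engine explicitly rejects;
    # any other exception now propagates to the caller.
    return False
```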
Maybe let's also add a TODO comment to explain that this condition on `_is_in_testing_mode` would be better handled in a mixin class for the estimator, which should be in charge of raising an exception when the fallback to the default engine is explicitly disabled by a dedicated entry in `sklearn._config.set_config`.
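The suggested TODO might read along these lines (a wording sketch, not text from the PR):

```python
# TODO: this condition on `_is_in_testing_mode` would be better handled in
# a mixin class for the estimator, in charge of raising an exception when
# the fallback to the default engine is explicitly disabled by a dedicated
# entry in `sklearn._config.set_config`.
```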
`_validate_data` can also raise `ValueError` or `TypeError`, or any other error that `sklearn.BaseEstimator._validate_data` or `dpt.asarray` would want to throw there. In those cases, we also want the engine to decline the compute. So a generic `Exception` is good here.
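To illustrate the point: input validation can raise several unrelated exception types, so catching only `NotSupportedByEngineError` would let the others propagate instead of declining the compute. A small standalone example (made-up input; behavior as in recent NumPy versions):

```python
import numpy as np

try:
    # A ragged nested list is rejected by NumPy itself, with a ValueError
    # rather than any engine-specific exception type.
    np.asarray([[1.0, 2.0], [3.0]], dtype=np.float64)
except Exception as exc:
    print(type(exc).__name__)  # ValueError
```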
Also, it would be great to have …
It's true that the code in this PR is slightly more complicated, but I think that it's worth it and that it's actually the laziest path rather than the opposite. Mimicking the behavior of scikit-learn regarding what type of inputs are accepted ensures that the tests from the test suite will be compatible. If we restrict the types that are accepted, we might lose compatibility with tests that would be perfectly valid otherwise but happen to be designed with lists of lists as inputs. With this strategy we don't have to worry about that.

About what is good for the user: do you suggest that a UX that restricts the accepted types is better, or only simpler to maintain? (Personally I would say that it's better, because implicit casting is bad, but I think the best is to mimic the scikit-learn UX as much as possible, whatever its choices.)
I don't see any valid cases that are not covered by the simple solution I propose. We honestly don't care about supporting lists of lists in GPU k-means: lists of lists are only useful for the occasional quick one-liner in an educational context. I am pretty sure all the scikit-learn tests for k-means use either numpy arrays or scipy sparse matrices. If there are tests in scikit-learn for k-means that use lists of lists, we can quickly update them to use numpy arrays.
I think it's better to be explicit about what kind of container types an engine accepts. It makes things easier to reason about, both for the users and for the maintainers.
I spoke too quickly, we might also want to accept pandas dataframes...
So maybe we can accept any container that has the …
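The comment is truncated, but a duck-typed check along these lines is presumably what was meant (a sketch; the protocol attribute names are assumptions — e.g. `__array__`, which pandas DataFrames also expose):

```python
def _is_array_like(container):
    # Accept any container exposing a NumPy-style or DLPack conversion
    # protocol. The attribute names here are assumptions, since the
    # original comment is cut off.
    return hasattr(container, "__array__") or hasattr(container, "__dlpack__")
```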
Something that is not clear with the new …: do we really want to re-run the engine negotiation prior to calling `predict`?

@betatim @fcharras do you agree with this lifecycle? Or do you see problems?

EDIT: what I wrote above is what is currently implemented in the wip-engines branch in the … So, as of now, the …

EDIT: I misread the code: …

Hum. I find it a bit weird to re-negotiate engines at prediction time. I did not expect that.
Thinking about this, I have the feeling that we should change the way … Instead I think … We can keep the …

WDYT @betatim, @jjerphan and @fcharras? I think this would make this PR much cleaner while still making it possible to use the outcome of `_validate_data` to drive the engine negotiation, and without having to store …
I don't think it would be simpler. I've found input validation to be surprisingly difficult and verbose, and I deleted several drafts before the current state. I'm particularly happy with reproducing scikit-learn's choices when we can, so we don't have to make committal choices ourselves (and maintain them). For the user, I can see benefits too, because ultimately what validates an input or not is the …
I agree that renegotiation at prediction time doesn't fit well and probably should be changed, unless falling back to other backends at prediction time can make sense; but that would mean implicitly converting the fitted attributes, and less implicit is better.
On the topic of "calling …": re-reading the code now, I had to read it twice to realise this is what happens. So I think changing it is a good idea. Not sure how to change it yet.

I think …

Based on the assumption that …

Is there a reason for …?
I don't think the idea of calling … On the other hand, it's reliable if the input is only tested for acceptance by the engine that was used during fit; and if the input is bad, let's raise an error and suggest that the user either convert the estimator for the appropriate engine or pass another input type.

Aside from this, overall, I'm …

Currently, in … In this PR, the …
Closing the PR following the last round of live discussions: https://hackmd.io/--MJTgQzSFSYaaAgcJWQZg?both#2023-01-12 . I will open a new PR to adapt to the new changes when it's ready.
We have a problem with the new default implementation of `count_distinct_clusters`. For instance, when running the benchmark script with this PR:

```
Traceback (most recent call last):
  File "/home/ogrisel/code/sklearn-numba-dpex/benchmark/kmeans.py", line 274, in <module>
    kmeans_timer.timeit(
  File "/home/ogrisel/code/sklearn-numba-dpex/benchmark/kmeans.py", line 108, in timeit
    KMeans(**est_kwargs).set_params(max_iter=1).fit(
  File "/home/ogrisel/mambaforge/envs/sklearn-numba-dpex/lib/python3.9/site-packages/sklearn/_engine/base.py", line 154, in wrapper
    r = method(self, *args, **kwargs)
  File "/home/ogrisel/mambaforge/envs/sklearn-numba-dpex/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py", line 1650, in fit
    distinct_clusters = engine.count_distinct_clusters(best_labels)
  File "/home/ogrisel/mambaforge/envs/sklearn-numba-dpex/lib/python3.9/site-packages/sklearn/cluster/_kmeans.py", line 414, in count_distinct_clusters
    return len(set(cluster_labels))
TypeError: unhashable type: 'dpctl.tensor._usmarray.usm_ndarray'
```

Observed with …
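A possible device-friendly workaround (a sketch, assuming a recent `dpctl` that implements the Array API `unique_values`; not necessarily how it was actually fixed):

```python
import dpctl.tensor as dpt

def count_distinct_clusters(cluster_labels):
    # `set()` hashes individual elements, which `usm_ndarray` does not
    # support; `unique_values` computes the distinct labels on the device.
    return int(dpt.unique_values(dpt.asarray(cluster_labels)).size)
```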
Actually this problem also appears in some failing tests. Furthermore, other tests need to be adapted to the new API.
Dismissing approval due to the recent failures caused by the merge. I think they must be resolved first.
This is the minimal set of changes so that the bump works and we keep the same level of testing. The next TODO I'm working on is implementing and testing the new … method. I'm not sure whether there are other changes to consider?

I'd like to get rid entirely of the environment variable …
LGTM.

I think the title of the PR can now be changed to:

> ENH Adapt to latest commits of the `feature/engines` branch

What do you think?
```python
def accepts(self, X, y, sample_weight):

    if (algorithm := self.estimator.algorithm) not in ("lloyd", "auto", "full"):
        if self._is_in_testing_mode:
```
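The excerpt ends mid-branch; a hypothetical continuation consistent with the surrounding discussion (names and message are assumptions, not the actual PR code) could look like:

```python
# Hypothetical sketch of the complete check: decline non-Lloyd algorithms,
# raising instead of silently declining when in testing mode.
def accepts(self, X, y, sample_weight):
    if (algorithm := self.estimator.algorithm) not in ("lloyd", "auto", "full"):
        if self._is_in_testing_mode:
            raise NotSupportedByEngineError(
                f"This engine does not support algorithm={algorithm!r}."
            )
        return False
    ...
```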
As discussed in the bi-weekly plugin meeting, we can remove this condition and make the `accepts` method behave the same way in tests as in regular execution environments.
LGTM. Feel free to merge as is and do a follow-up PR to adapt the behavior of the test fixture w.r.t. accepting in test mode.
Since CI was green and @jjerphan already gave his +1, I just merged.
TY for the merge, the follow-up will be in #89 (not ready to review yet).