Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MNT Add estimator check for not calling __array_function__ #14702

Merged
merged 31 commits into from Oct 24, 2019

Conversation

@amueller
Copy link
Member

amueller commented Aug 20, 2019

test for #14687.

I don't think we test against numpy 1.17 so this will only fail on cron not here?

@amueller

This comment has been minimized.

Copy link
Member Author

amueller commented Aug 20, 2019

this is not a complete fix and is also breaking some things.
I feel like we might want to allow may_share_memory? If someone implements that wrongly and doesn't provide True or False they can figure that out themselves ;)

@amueller

This comment has been minimized.

Copy link
Member Author

amueller commented Aug 20, 2019

hm this is actually nearly ok but I gotta run ;)

amueller added 2 commits Aug 21, 2019
@amueller

This comment has been minimized.

Copy link
Member Author

amueller commented Aug 21, 2019

now added a test for sample_weights as well. That wasn't tested with NotAnArray before (so didn't work) and I needed to add some logic to the slicing in GridSearchCV to make this work with NotAnArray fit_params

amueller added 4 commits Aug 21, 2019
@amueller

This comment has been minimized.

Copy link
Member Author

amueller commented Aug 21, 2019

Should be good now?

@amueller

This comment has been minimized.

Copy link
Member Author

amueller commented Aug 21, 2019

oh coverage fails because we don't test on new numpy I guess. Should we? I think waiting till it's on conda might be ok?


def __array__(self, dtype=None):
return self.data

def __array_function__(self, func, types, args, kwargs):

This comment has been minimized.

Copy link
@thomasjpfan

thomasjpfan Aug 21, 2019

Member

Returning True and raising TypeError needs to be tested?

This comment has been minimized.

Copy link
@amueller

amueller Aug 21, 2019

Author Member

you mean explicitly tested? Sure, i could do that. Thought that was a bit overkill for a test helper. They are obviously used in the tests, but on CI there's no __array_function__ protocol.

This comment has been minimized.

Copy link
@thomasjpfan

thomasjpfan Aug 21, 2019

Member

I would be more onboard if NotAnArray was private, which goes back to "public vs private utils" #6616 (comment)

On the other hand, we are depending on this raising when something is wrong. When everything is working, our tests do not run __array_function__.

This comment has been minimized.

Copy link
@amueller

amueller Aug 21, 2019

Author Member

That's a good point. I'll add a test (that won't be run)

@adrinjalali

This comment has been minimized.

Copy link
Member

adrinjalali commented Aug 22, 2019

oh coverage fails because we don't test on new numpy I guess. Should we? I think waiting till it's on conda might be ok?

I'm not sure how many people install numpy from pypi, but my guess is "many". I'd rather have that, or conda-forge as the "test against the latest" in the CI, rather than conda. WDYT?

@@ -118,6 +118,7 @@ def fit(self, X, y, sample_weight=None):
self.sparse_output_ = sp.issparse(y)

if not self.sparse_output_:
y = np.asarray(y)

This comment has been minimized.

Copy link
@adrinjalali

adrinjalali Aug 22, 2019

Member

do we not want to have a more complex atleast_1d instead?

This comment has been minimized.

Copy link
@amueller

amueller Aug 22, 2019

Author Member

you mean column_or_1d? I honestly don't know why we would need atleast_1d. We usually want scalars to be treated differently.

This comment has been minimized.

Copy link
@amueller

amueller Aug 22, 2019

Author Member

there's also check_array(ensure_2d=False) which might be more suitable here?

This comment has been minimized.

Copy link
@adrinjalali

adrinjalali Aug 22, 2019

Member

yep, I tend to forget that we have both column_or_1d and atleast_1d. Ideally I think I'd rather have ensure_ndims=n in check_array maybe.

@@ -433,6 +434,8 @@ def fit(self, X, y, sample_weight=None):
self.n_outputs_ = y.shape[1]

check_consistent_length(X, y, sample_weight)
if sample_weight is not None:

This comment has been minimized.

Copy link
@adrinjalali

adrinjalali Aug 22, 2019

Member

I feel like this belongs to check_array too, doesn't it?

This comment has been minimized.

Copy link
@amueller

amueller Aug 22, 2019

Author Member

not sure what you mean. Using check_array(sample_weights, ensure_2d=True) instead of asarray? We don't have any tests for NaN in sample weights, do we? I'm not sure how much I want to make this PR about adding way more checks to sample weights

This comment has been minimized.

Copy link
@glemaitre

glemaitre Sep 10, 2019

Contributor

we have _check_sample_weight in validation

This comment has been minimized.

Copy link
@amueller

amueller Sep 24, 2019

Author Member

that does a lot more though, right?

@amueller

This comment has been minimized.

Copy link
Member Author

amueller commented Aug 22, 2019

The test failure is due to me using pytest in test_estimator_checks :-/ That will be fixed in #14381 I think.

I'm ok with moving the "latest" CI to conda-forge

amueller added 2 commits Aug 26, 2019
# Conflicts:
#	sklearn/model_selection/tests/test_search.py
@amueller

This comment has been minimized.

Copy link
Member Author

amueller commented Aug 26, 2019

Never mind, #14381 doesn't fix that, I just won't use pytest in that file.

@amueller

This comment has been minimized.

Copy link
Member Author

amueller commented Sep 3, 2019

any takers? Should I include the change to CI in this PR?

@adrinjalali

This comment has been minimized.

Copy link
Member

adrinjalali commented Sep 3, 2019

I was gonna do the CI change, if you don't, I'll do in a separate PR.

@amueller amueller added this to the 0.22 milestone Sep 27, 2019
@amueller

This comment has been minimized.

Copy link
Member Author

amueller commented Sep 27, 2019

this is quite hard to keep in sync with master :-/

@amueller

This comment has been minimized.

Copy link
Member Author

amueller commented Sep 27, 2019

greeeeen

@glemaitre glemaitre added this to TO REVIEW in Guillaume's pet Oct 2, 2019
@ogrisel
ogrisel approved these changes Oct 7, 2019
Copy link
Member

ogrisel left a comment

LGTM.

We need a changelog entry (and maybe a section in the doc) to explain that sklearn estimators will not rely on __array_function__ by default and instead materialize fit parameters as numpy arrays internally.

@amueller

This comment has been minimized.

Copy link
Member Author

amueller commented Oct 7, 2019

@ogrisel like this?

Copy link
Member

ogrisel left a comment

Maybe mention NEP 18 explicitly to improve googl-ability.

doc/whats_new/v0.22.rst Outdated Show resolved Hide resolved
doc/whats_new/v0.22.rst Outdated Show resolved Hide resolved
@@ -623,6 +627,10 @@ These changes mostly affect library developers.
Such classifiers need to have the `binary_only=True` estimator tag.
:pr:`13875` by `Trevor Stephens`_.

- Estimators are expected to convert input data (``X``, ``y``,
``sample_weights``) to numpy ``ndarray`` and never call
``__array_function__`` on the original datatype that is passed.

This comment has been minimized.

Copy link
@ogrisel

ogrisel Oct 8, 2019

Member
Suggested change
``__array_function__`` on the original datatype that is passed.
``__array_function__`` on the original datatype that is passed. This means that,
by default, estimators are not expected to support the array function dispatching
mechanism of `NEP 18`_.

This comment has been minimized.

Copy link
@amueller

amueller Oct 8, 2019

Author Member

I don't think this wording is particularly clear. Estimators are forbidden from supporting the mechanism of NEP 18 with this PR.

This comment has been minimized.

Copy link
@ogrisel

ogrisel Oct 9, 2019

Member

I did not want to phrase it in a way that this will always be the case in the future. But phrase it as you wish as long as NEP 18 is mentioned :)

amueller and others added 3 commits Oct 8, 2019
Co-Authored-By: Olivier Grisel <olivier.grisel@ensta.org>
Co-Authored-By: Olivier Grisel <olivier.grisel@ensta.org>
Copy link
Member

jnothman left a comment

otherwise lgtm

@@ -623,6 +628,11 @@ These changes mostly affect library developers.
Such classifiers need to have the `binary_only=True` estimator tag.
:pr:`13875` by `Trevor Stephens`_.

- Estimators are expected to convert input data (``X``, ``y``,
``sample_weights``) to numpy ``ndarray`` and never call

This comment has been minimized.

Copy link
@jnothman

jnothman Oct 23, 2019

Member

why not use :class:`numpy.ndarray` to get the cross-reference?

doc/whats_new/v0.22.rst Show resolved Hide resolved
doc/whats_new/v0.22.rst Show resolved Hide resolved
@@ -601,6 +601,11 @@ Changelog
Miscellaneous
.............

- |API| Scikit-learn currently converts any input data structure implementing a duck array

This comment has been minimized.

Copy link
@jnothman

jnothman Oct 23, 2019

Member

Do you mean "now" instead of "currently"?

@NicolasHug

This comment has been minimized.

Copy link
Contributor

NicolasHug commented Oct 24, 2019

The only test is that estimators don't complain when you pass _NotAnArray, but it doesn't make sure __array_function__ isn't used, which is what we ultimately want right? At least that's what the what'snew says.

I think we should have _NotAnArray.__array_function__ raise an exception to properly cover that?

@amueller I addressed Joel's comments and merged with master. I'm happy to address future comments if you need


Regarding all the changes to sample_weight, I agree with @glemaitre that we could/should be using _check_sample_weights.
But I'm fine with doing that in another PR.

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Oct 24, 2019

@NicolasHug

This comment has been minimized.

Copy link
Contributor

NicolasHug commented Oct 24, 2019

Ah, indeed. I wasn't reading that well for some reason

@NicolasHug NicolasHug changed the title [MRG] estimator check for not calling __array_function__ MNT Add estimator check for not calling __array_function__ Oct 24, 2019
@NicolasHug NicolasHug merged commit a91bae7 into scikit-learn:master Oct 24, 2019
19 checks passed
19 checks passed
LGTM analysis: C/C++ No code changes detected
Details
LGTM analysis: JavaScript No code changes detected
Details
LGTM analysis: Python No new or fixed alerts
Details
ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: doc Your tests passed on CircleCI!
Details
ci/circleci: doc artifact Link to 0/doc/_changed.html
Details
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
codecov/patch 97.95% of diff hit (target 97.21%)
Details
codecov/project 97.21% (+<.01%) compared to 7281083
Details
scikit-learn.scikit-learn Build #20191024.36 succeeded
Details
scikit-learn.scikit-learn (Linux py35_conda_openblas) Linux py35_conda_openblas succeeded
Details
scikit-learn.scikit-learn (Linux py35_ubuntu_atlas) Linux py35_ubuntu_atlas succeeded
Details
scikit-learn.scikit-learn (Linux pylatest_conda_mkl) Linux pylatest_conda_mkl succeeded
Details
scikit-learn.scikit-learn (Linux pylatest_pip_openblas_pandas) Linux pylatest_pip_openblas_pandas succeeded
Details
scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit) Linux32 py35_ubuntu_atlas_32bit succeeded
Details
scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit) Windows py35_pip_openblas_32bit succeeded
Details
scikit-learn.scikit-learn (Windows py37_conda_mkl) Windows py37_conda_mkl succeeded
Details
scikit-learn.scikit-learn (macOS pylatest_conda_mkl) macOS pylatest_conda_mkl succeeded
Details
@glemaitre glemaitre moved this from TO REVIEW to REVIEWED AND WAITING FOR CHANGES in Guillaume's pet Oct 25, 2019
@glemaitre glemaitre moved this from REVIEWED AND WAITING FOR CHANGES to MERGED in Guillaume's pet Oct 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.