MNT Introduction of n_features_in_ attr with _validate_data mtd #16112

Merged
merged 78 commits into from Feb 29, 2020

Member

NicolasHug commented Jan 13, 2020

Implements SLEP010

Supersedes #13603

The _validate_data method is only called in fit and partial_fit. I will open other PRs later for predict, transform, etc.

Please note that while the SLEP was under review, #15557 was merged which allows the Gaussian Processes to support sequences of variable length. That use-case isn't covered by the SLEP. For now, the n_features_in_ doesn't exist if a GP is passed a non-2d array.
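
As a rough illustration of the behaviour described above, the sketch below records the feature count during fit and leaves the attribute unset for non-2D input. It is a pure-Python toy, not scikit-learn's implementation: `ToyEstimator` and `toy_validate_data` are hypothetical names, and plain lists stand in for arrays.

```python
class ToyEstimator:
    """Toy estimator that records the number of input features in fit."""

    def toy_validate_data(self, X):
        # Hypothetical stand-in for the private helper discussed in this PR.
        rows = list(X)
        is_2d = bool(rows) and all(isinstance(r, (list, tuple)) for r in rows)
        if not is_2d:
            # Mirrors the GP caveat: the attribute is not set for non-2D input
            # such as a list of variable-length sequences.
            return rows
        widths = {len(r) for r in rows}
        if len(widths) != 1:
            raise ValueError("rows have inconsistent lengths")
        self.n_features_in_ = widths.pop()
        return rows

    def fit(self, X, y=None):
        # Validation (and the attribute) only happens in fit in this sketch.
        self.toy_validate_data(X)
        return self
```

Fitting on `[[1, 2, 3], [4, 5, 6]]` sets `n_features_in_` to 3, while fitting on a list of strings leaves the attribute undefined.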

NicolasHug added 30 commits Apr 9, 2019
…ranch 'master' of github.com:scikit-learn/scikit-learn into n_features_in

Contributor

glemaitre commented Feb 11, 2020

You need to update this section: https://93096-843222-gh.circle-artifacts.com/0/doc/developers/develop.html#rolling-your-own-estimator

because it still calls check_array instead of the new _validate_data.

Member Author

NicolasHug commented Feb 11, 2020

You need to update this section...

The utility is currently private and we haven't decided yet whether it should be public (it probably will)

Contributor

glemaitre commented Feb 11, 2020

The utility is currently private and we haven't decided yet whether it should be public (it probably will)

OK so not a blocker for this PR.

On a side note, I find it weird that we will soon make some third-party estimators fail without providing _validate_data as a developer tool.

Contributor

glemaitre left a comment

I'm approving 99% of this PR; the only blocker is the one regarding y_required.

Contributor

glemaitre commented Feb 11, 2020

I also think it would be great to open a follow-up issue to address the remaining issues:

  • validation on transform and predict (and friends) methods;
  • include sample_weight;
  • maybe modify check_X_y when y is required or not.
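
The first follow-up item could look roughly like the sketch below: remember the feature count in fit, then compare against it in transform. This is a pure-Python toy under stated assumptions; `ToyTransformer`, `toy_check_n_features` and the `reset` flag are hypothetical names chosen for illustration, not the library's API.

```python
class ToyTransformer:
    """Toy transformer that checks the feature count outside of fit."""

    def toy_check_n_features(self, X, reset):
        n_features = len(X[0])
        if reset:
            # fit: remember how many features were seen.
            self.n_features_in_ = n_features
        elif n_features != self.n_features_in_:
            # transform/predict: reject input with a different width.
            raise ValueError(
                f"X has {n_features} features, but this estimator "
                f"was fitted with {self.n_features_in_}"
            )

    def fit(self, X, y=None):
        self.toy_check_n_features(X, reset=True)
        return self

    def transform(self, X):
        self.toy_check_n_features(X, reset=False)
        # Identity transform: return a copy of each row.
        return [list(row) for row in X]
```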

Member Author

NicolasHug commented Feb 11, 2020

Thanks for the reviews!

On a side note, I find it weird that we will soon make some third-party estimators fail without providing _validate_data as a developer tool

Just to clarify, the common check only raises a warning for now. Also, our plan is to decide whether to make _validate_data public before 0.23, so that should be a non-issue. I'll make sure to open the discussion about this once the PR adding _validate_data to predict/transform is merged (if it hasn't been decided before then).

Contributor

glemaitre commented Feb 12, 2020

It looks good. We could merge I think @NicolasHug

Contributor

glemaitre commented Feb 12, 2020

Oops, wrong PR, sorry (too many tabs open :))

Member

ogrisel left a comment

LGTM. Thanks!

Member

ogrisel commented Feb 12, 2020

Maybe let's wait for @jnothman's final review before merging.

Member Author

NicolasHug commented Feb 12, 2020

Member Author

NicolasHug commented Feb 28, 2020

During the meeting we decided to merge this PR and follow up with the introduction of an 'is_supervised' tag, and to raise a proper error message in _validate_data() when y is None and the tag is True.

I gave it a try but I don't think it will completely work: some estimators like ElasticNetCV validate y and X separately, e.g. to fail early if y isn't of the right shape. More concretely, they first call y = check_array(y, ...), and then X = _validate_data(X, ...), but they don't pass y to _validate_data(). So now, with the addition of the tag and the check, _validate_data() fails with "This estimator is supervised but y is None". That might happen for third-party estimators as well.
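
The failure mode can be reproduced with a small pure-Python sketch. Everything here is a hypothetical stand-in: `toy_validate_data` and `ToyCVRegressor` are illustrative names, the class attribute `is_supervised` plays the role of the proposed tag, and a length check replaces the separate check_array call on y.

```python
def toy_validate_data(estimator, X, y=None):
    # Hypothetical check: the tag alone decides whether y is mandatory.
    if getattr(estimator, "is_supervised", False) and y is None:
        raise ValueError("This estimator is supervised but y is None")
    estimator.n_features_in_ = len(X[0])
    return X if y is None else (X, y)


class ToyCVRegressor:
    is_supervised = True  # stand-in for the proposed tag

    def fit(self, X, y):
        # Fail early on a badly shaped y, before touching X.
        if len(y) != len(X):
            raise ValueError("X and y have inconsistent lengths")
        # y was already validated separately, so it is not forwarded here,
        # which trips the tag-based check even though y was supplied to fit.
        toy_validate_data(self, X)
        return self
```

Calling `ToyCVRegressor().fit([[1, 2], [3, 4]], [0, 1])` raises the "supervised but y is None" error despite a valid y being passed to fit.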

Member

amueller commented Feb 28, 2020

Do you want to fix conflicts so we can merge?
I don't like the bad error message but I think we decided in the meeting that we can merge now and fix that before the release. I think we should certainly fix it before the release, but the amount of merge conflicts suggests to me that it might be good to merge this now.

Member Author

NicolasHug commented Feb 28, 2020

As much as I want this to be merged, we don't have a viable fix at the moment. It seems like the tag won't work out (see message above). Do you think we should still merge?

Member

amueller commented Feb 28, 2020

I feel pretty confident we can find a solution, though.

So either we say relying on a tag is bad because it doesn't allow us to be flexible, and we pass it explicitly every time, or we refactor elasticnet to pass y.

There's a bunch of possible solutions, I think:

  • add a require_y argument everywhere
  • refactor ElasticNet and use the tag
  • use the tag but allow overwriting by an explicit require_y argument
  • always return y and raise an error outside of _validate_data
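
A minimal sketch of the third option (use the tag, but let an explicit require_y argument override it) might look as follows. This is pure Python with hypothetical names (`toy_validate_data`, `ToyCVRegressor`, an `is_supervised` class attribute standing in for the tag); it is one possible shape for the idea, not a decided design.

```python
def toy_validate_data(estimator, X, y=None, require_y=None):
    # require_y=None means "defer to the tag"; True/False overrides it.
    if require_y is None:
        require_y = getattr(estimator, "is_supervised", False)
    if require_y and y is None:
        raise ValueError("This estimator is supervised but y is None")
    estimator.n_features_in_ = len(X[0])
    return X if y is None else (X, y)


class ToyCVRegressor:
    is_supervised = True  # stand-in for the tag

    def fit(self, X, y):
        # y is validated separately, so tell the helper not to insist on it.
        if len(y) != len(X):
            raise ValueError("X and y have inconsistent lengths")
        toy_validate_data(self, X, require_y=False)
        return self
```

Estimators that pass X and y together keep the default and get the tag-driven error for free; only those that validate y on their own need the explicit override.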

Member Author

NicolasHug commented Feb 28, 2020

OK, I'll fix the conflicts and merge when green.

Thanks, everyone, for the reviews on this.

Member Author

NicolasHug commented Feb 29, 2020

It's green! Merging. Thanks again for the reviews.

NicolasHug changed the title from "[MRG] MNT n_features_in_ attribute with _validate_data method" to "MNT Introduction of n_features_in_ attr with _validate_data mtd" on Feb 29, 2020
NicolasHug merged commit d205638 into scikit-learn:master on Feb 29, 2020
21 checks passed
@NicolasHug NicolasHug deleted the NicolasHug:n_features_in branch Feb 29, 2020
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020