Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MNT Introduction of n_features_in_ attr with _validate_data mtd #16112

Merged
merged 78 commits into from
Feb 29, 2020

Conversation

NicolasHug
Copy link
Member

@NicolasHug NicolasHug commented Jan 13, 2020

Implements SLEP010

Superseds #13603

The _validate_data method is only called in fit and partial_fit. I will open other PRs later for predict, transform, etc.

Please note that while the SLEP was under review, #15557 was merged which allows the Gaussian Processes to support sequences of variable length. That use-case isn't covered by the SLEP. For now, the n_features_in_ doesn't exist if a GP is passed a non-2d array.

…ranch 'master' of github.com:scikit-learn/scikit-learn into n_features_in
@glemaitre
Copy link
Member

I am also thinking that it could be great to open a follow-up issue to address the remaining issue:

  • validation on transform and predict (and friends) methods;
  • include sample_weight;
  • maybe modify check_X_y when y is required or not.

@NicolasHug
Copy link
Member Author

Thanks for the reviews !

On a side note, I find weird that we will soon make fail some third-party estimators without providing _validate_data as a developer tool

Just to clarify, the common check only raises a warning for now. Also, our plan is to decide whether we make _validate_data public before 0.23, so that should be a no-issue. I'll make sure to open the discussion about this once the PR addressing _validate_data in predict/transformed is merged (if it hasn't been decided before).

@glemaitre
Copy link
Member

It looks good. We could merge I think @NicolasHug

@glemaitre
Copy link
Member

glemaitre commented Feb 12, 2020

Ups wrong PR sorry (too many tabs opened :))

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@ogrisel
Copy link
Member

ogrisel commented Feb 12, 2020

Maybe let's wait for @jnothman's final review before merging.

@NicolasHug
Copy link
Member Author

I think we need to decide on https://github.com/scikit-learn/scikit-learn/pull/16112/files#r378281885 before merging?

@NicolasHug
Copy link
Member Author

During the meeting we decided to merge this PR and follow up with the introduction of a 'is_supervised' tag, and to raise a proper error message in validate_data() when y is None and the tag is True.

I gave it a try but I don't think it will completely work: some estimators like ElasticNetCV will validate y and X separetely, e.g. to fail early if y isn't of the right shape. More concretely, they will first call y = check_array(y, ...), and then X = _validate_data(X, ...), but they don't pass y to _validate_data(). So now, with the addition of the tag and the addition of the check, _validate_data() fails with "This estimator is supervised but y is None". That might happen for third-party estimators as well.

@amueller
Copy link
Member

Do you want to fix conflicts so we can merge?
I don't like the bad error message but I think we decided in the meeting that we can merge now and fix that before the release. I think we should certainly fix it before the release, but the amount of merge conflicts suggests to me that it might be good to merge this now.

@NicolasHug
Copy link
Member Author

As much as I want this to be merged, we don't have a viable fix at the moment. It seems like the tag won't work out (see message above). Do you think we should still merge?

@amueller
Copy link
Member

I feel pretty confident we can find a solution, though.

So either we say relying on a tag is bad because it doesn't allow us to be flexible, and we pass it explicitly every time, or we refactor elasticnet to pass y.

There's a bunch of possible solutions, I think:

  • add a require_y argument everywhere
  • refactor ElasticNet and use the tag
  • use the tag but allow overwriting by an explicit require_y argument
  • always return y and raise an error outside of _validate_data

@NicolasHug
Copy link
Member Author

OK, i'll fix the conflicts and merge when green

thanks everyone for the reviews on this

@NicolasHug
Copy link
Member Author

It's green! merging. Thanks again for the reviews

@NicolasHug NicolasHug changed the title [MRG] MNT n_features_in_ attribute with _validate_data method MNT Introduction of n_features_in_ attr with _validate_data mtd Feb 29, 2020
@NicolasHug NicolasHug merged commit d205638 into scikit-learn:master Feb 29, 2020
@NicolasHug NicolasHug deleted the n_features_in branch February 29, 2020 14:05
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants