
ENH Support pipelines in CalibratedClassifierCV #17546

Merged: 27 commits into scikit-learn:master, Jun 30, 2020

Conversation

@lucyleeow (Member)

Reference Issues/PRs

Towards #8710 (this is the first/main issue but other issues are mentioned in the comments)

What does this implement/fix? Explain your changes.

  • Don't call _validate_data when a prefit estimator is used in CalibratedClassifierCV
  • Includes a test

Any other comments?
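
A minimal sketch of the usage this aims to enable (a text pipeline as in #8710; the exact estimators here are illustrative only, not taken from this PR):

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    X = ["spam spam spam", "ham ham", "spam", "ham ham ham"]
    y = [1, 0, 1, 0]

    # Fit the pipeline first, then wrap it as a prefit base_estimator.
    # Previously, the fit() below failed because CalibratedClassifierCV
    # tried to validate the raw text X as a numeric array.
    pipe = make_pipeline(CountVectorizer(), MultinomialNB()).fit(X, y)
    calibrated = CalibratedClassifierCV(pipe, cv="prefit").fit(X, y)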

@lucyleeow changed the title from "ENH Don't _validate_data if prefit estimator used in CalibratedClassifierCV" to "ENH Support pipelines in CalibratedClassifierCV" on Jun 9, 2020
@glemaitre (Contributor) left a comment:

A couple of thoughts to investigate.

    @@ -176,23 +176,10 @@ def fit(self, X, y, sample_weight=None):
            self : object
                Returns an instance of self.
            """
            X, y = self._validate_data(X, y, accept_sparse=['csc', 'csr', 'coo'],
                                       force_all_finite=False, allow_nd=True)
            X, y = indexable(X, y)
            le = LabelBinarizer().fit(y)
            self.classes_ = le.classes_
@glemaitre (Contributor):

I am a bit curious about these lines. In the case where the model has been prefitted, wouldn't it be best to use prefitted_model.classes_ instead? Meaning that we would keep this part only when the model is not prefitted.
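
For example, roughly (a sketch only; assuming the prefit model exposes classes_):

    # Reuse the prefit model's classes_; only re-derive them from y when
    # we are actually going to fit the base estimator ourselves.
    if self.cv == "prefit":
        self.classes_ = self.base_estimator.classes_
    else:
        le = LabelBinarizer().fit(y)
        self.classes_ = le.classes_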

@cmarmo (Member)

cmarmo commented Jun 15, 2020

Hi @glemaitre and @lucyleeow, there is a PR related to #8710 that has been waiting for review for a year and is still alive. It isn't clear to me how those two PRs interact, but may I suggest having a look at #13060 before moving on with this one? Thanks!

@glemaitre (Contributor)

They will be dissociated (even if we are going to have some git conflicts). This one is about internal sklearn interoperability, while #13060 is about supporting a new type of y (i.e. multilabel). I will try to find the time to review #13060, but I am not the most knowledgeable person on multilabel.

@glemaitre (Contributor)

So most probably the assert on n_features_in_ will not be valid with DictVectorizer. I don't recall the choice we made there.
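
For instance (a sketch; this is my understanding of DictVectorizer's behavior, worth double-checking):

    # DictVectorizer consumes a list of dicts rather than a 2-D array,
    # so it does not expose n_features_in_ for its input side.
    from sklearn.feature_extraction import DictVectorizer

    vec = DictVectorizer().fit([{"a": 1, "b": 2}, {"a": 3}])
    print(hasattr(vec, "n_features_in_"))  # expected: False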

@lucyleeow (Member, Author)

lucyleeow commented Jun 16, 2020

@glemaitre I'm not sure about how I addressed this:

For instance, I think that calib_clf.n_features_in_ will not be defined, will it?

  • For a NOT prefit estimator, we have to _validate_data before we fit and append to the list self.calibrated_classifiers_, so we can't use _validate_data(reset=False). I've changed back to reset=True. (We also can't use self.calibrated_classifiers_[0].n_features_in_, as self.calibrated_classifiers_ can be an empty list.)
  • I used your suggested code to set the attribute for the prefit estimator inside the if self.cv == "prefit": branch.

Not sure if this is a good approach so happy to change.

@lucyleeow (Member, Author)

ping @glemaitre (the red check is codecov)

@glemaitre (Contributor) left a comment:

We will need an entry in what's new as well.

@lucyleeow (Member, Author)

lucyleeow commented Jun 25, 2020

@glemaitre I amended it such that the attributes n_features_in_ and classes_ are derived from base_estimator[-1] when base_estimator is a pipeline.

I think we could/should add a test that checks the case where cv is a CV splitter, but maybe in another PR since it isn't related to this one?

@lucyleeow (Member, Author)

I realise why the lines:

    elif hasattr(self.cv, "n_folds"):
        n_folds = self.cv.n_folds

were red. I think it should be n_splits; I don't think CV iterators have an n_folds attribute.

I have fixed it here, but let me know if you want me to separate it into its own PR.
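
A quick check of that claim with a standard splitter:

    from sklearn.model_selection import StratifiedKFold

    cv = StratifiedKFold(n_splits=3)
    print(hasattr(cv, "n_folds"))   # False; that was the old cross_validation-era name
    print(hasattr(cv, "n_splits"))  # True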

@lucyleeow (Member, Author)

I've also added a test for the cv iterator case and a test for the default base_estimator, as codecov was red.

@thomasjpfan (Member) left a comment:

Thank you for the PR @lucyleeow!

    if isinstance(self.base_estimator, Pipeline):
        estimator = self.base_estimator[-1]
    else:
        estimator = self.base_estimator
    check_is_fitted(estimator)
    self.n_features_in_ = estimator.n_features_in_
    self.classes_ = estimator.classes_
@thomasjpfan (Member):

I can't find the discussion about this.

It seems weird that CalibratedClassifierCV takes the n_features_in_ from the final step. If the first step of the Pipeline were a feature selector, then CalibratedClassifierCV would not have the correct n_features_in_, would it?

Also, for third-party estimators in a pipeline, if they do not have n_features_in_ or classes_ this would fail. I would prefer being slightly more lenient:

    from contextlib import suppress

    with suppress(AttributeError):
        self.n_features_in_ = base_estimator.n_features_in_

    with suppress(AttributeError):
        self.classes_ = base_estimator.classes_
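
To illustrate the first point, a small sketch (the estimator choices here are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    X, y = make_classification(n_features=20, random_state=0)
    pipe = make_pipeline(SelectKBest(k=5), LogisticRegression()).fit(X, y)

    print(pipe[0].n_features_in_)   # 20: what the pipeline itself receives
    print(pipe[-1].n_features_in_)  # 5: only the selected features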

@glemaitre (Contributor):

You are right about n_features_in_: it should be the value from the first step, not the last one.

Regarding classes_, I would be a bit more conservative. Our estimator checks explicitly test for classes_, so it is a kind of contract: if you write a scikit-learn classifier, you should have this attribute.

And for n_features_in_, we can be lenient in this case because we don't impose anything there yet.
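
In other words, something like this (the helper name is hypothetical, just to show the shape):

    from contextlib import suppress

    def _set_attributes(calibrated_clf, estimator):
        # classes_ is part of the classifier contract enforced by the
        # estimator checks, so let any AttributeError propagate loudly.
        calibrated_clf.classes_ = estimator.classes_
        # n_features_in_ is not imposed yet, so tolerate its absence.
        with suppress(AttributeError):
            calibrated_clf.n_features_in_ = estimator.n_features_in_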

@glemaitre (Contributor):

WDYT @thomasjpfan?

@lucyleeow (Member, Author):

Yes, you're right. I got mixed up.

@lucyleeow (Member, Author):

Regarding classes_, I would be a bit more conservative. Our estimator checks explicitly test for classes_, so it is a kind of contract: if you write a scikit-learn classifier, you should have this attribute.

As in, don't suppress the AttributeError when assigning this attribute?

@lucyleeow (Member, Author):

Also, out of interest, which of:

    if hasattr(base_estimator, "n_features_in_"):
        self.n_features_in_ = base_estimator.n_features_in_

and

    with suppress(AttributeError):
        self.n_features_in_ = base_estimator.n_features_in_

would be preferred here?

@lucyleeow (Member, Author):

I used the first so I don't need to import suppress, but happy to change.

@glemaitre (Contributor):

I am actually not sure which is more explicit. I am not used to suppress, but I find it elegant and it should be more Pythonic. Go ahead with it.

@thomasjpfan (Member):

WDYT @thomasjpfan?

I agree with being more strict with classes_ and lenient with n_features_in_.

@lucyleeow (Member, Author)

Thanks @thomasjpfan and @glemaitre, I think I've made the suggested changes.

@thomasjpfan (Member) left a comment:

LGTM. Thank you @lucyleeow!

@glemaitre (Contributor) left a comment:

LGTM

@glemaitre merged commit 6b29bc6 into scikit-learn:master, Jun 30, 2020
7 checks passed
@glemaitre (Contributor)

Thanks @lucyleeow

@lucyleeow deleted the IS/8710 branch June 30, 2020 13:42
@lucyleeow (Member, Author)

Thank you!

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jul 17, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
@odedbd

odedbd commented Jan 26, 2021

@lucyleeow I've tried using this (pipeline base estimator, cv='prefit') and the .fit() call now works, but I ran into a similar issue when I tried to predict, since there is also a data validation step in predict_proba. So I can fit an estimator, but I don't see how to use it. What am I missing here?

@glemaitre (Contributor)

@odedbd Please open a thread with your question on GitHub discussion: https://github.com/scikit-learn/scikit-learn/discussions

Commenting on a merged PR will not attract enough traffic to get an answer, and your question will be useful for the entire community since it is a usage question. If it leads to a missing feature, we can create an associated issue and PR.

scikit-learn locked as resolved and limited conversation to collaborators, Jan 26, 2021