
Test and doc for n_features_in_ for sklearn.calibration #19555

Merged
merged 15 commits into scikit-learn:main from ogrisel:n_features_in-calibration on Mar 30, 2021

Conversation

@ogrisel (Member) commented Feb 25, 2021

Towards #19333.
Fixes: #8710.

The work was already done. I just had to enable the tests and update the docstring.

@thomasjpfan (Member) left a comment

CalibratedClassifierCV.predict_proba calls check_array but then delegates the responsibility for checking n_features_in_ to the calibrated classifier, which triggers the error message. This behavior makes sense for meta-estimators that pass all the features through to their inner estimator.
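
A minimal sketch of the delegation pattern described above (hypothetical code, not the actual scikit-learn implementation): the meta-estimator only performs light input checks and lets the already fitted inner estimator raise the n_features_in_ mismatch error.

```python
from sklearn.utils import check_array


class MetaEstimatorSketch:
    """Hypothetical meta-estimator wrapping an already fitted classifier."""

    def __init__(self, fitted_estimator):
        self.fitted_estimator = fitted_estimator

    def predict_proba(self, X):
        # Light validation only: no explicit n_features_in_ check here.
        X = check_array(X, accept_sparse=True, force_all_finite=False)
        # If X has the wrong number of columns, the inner estimator's own
        # validation raises the informative error message.
        return self.fitted_estimator.predict_proba(X)
```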

sklearn/calibration.py (review thread resolved; outdated)
@thomasjpfan (Member)

Should we add versionadded for n_features_in_ everywhere?

@ogrisel (Member, Author) commented Feb 25, 2021

> Should we add versionadded for n_features_in_ everywhere?

I think so. Should we mark it retrospectively as 0.24?

@ogrisel (Member, Author) commented Feb 25, 2021

> CalibratedClassifierCV.predict_proba calls check_array but then delegates the responsibility for checking n_features_in_ to the calibrated classifier, which triggers the error message. This behavior makes sense for meta-estimators that pass all the features through to their inner estimator.

The code is simpler this way but I wonder if it's right. If the nested base estimator does not have n_features_in_, we could record it at fit time (even when cv="prefit") at the meta-estimator level and check it in predict_proba. The code would be more complex and I am not sure about the added value.
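
A rough sketch of the alternative floated here (hypothetical code, not what the PR implements): record the number of features on the meta-estimator at fit time, even with cv="prefit", and re-check it in predict_proba. It assumes the private helper sklearn.utils.validation._num_features discussed later in this thread.

```python
from sklearn.utils.validation import _num_features


class PrefitCalibrationSketch:
    """Hypothetical wrapper around an already fitted base estimator."""

    def __init__(self, base_estimator):
        self.base_estimator = base_estimator

    def fit(self, X, y):
        # Record the number of features at the meta-estimator level,
        # independently of whether the prefit base estimator exposes it.
        self.n_features_in_ = _num_features(X)
        return self

    def predict_proba(self, X):
        n_features = _num_features(X)
        if n_features != self.n_features_in_:
            raise ValueError(
                f"X has {n_features} features, but this estimator was "
                f"fitted with {self.n_features_in_} features."
            )
        return self.base_estimator.predict_proba(X)
```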

@ogrisel (Member, Author) commented Feb 25, 2021

Actually I found a problem with non-array inputs (e.g. list of str): those are accepted at fit time but rejected at prediction time because the test was incomplete. This problem was previously reported in #8710 (comment).

I will push a small refactoring fixing this simultaneously.
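
A small reproduction of the kind of failure described above (hypothetical example; names and data are made up): a text pipeline happily fits on a list of strings, and with the refactoring in this PR the same input type is also accepted at prediction time.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = ["spam spam", "ham ham", "spam eggs", "ham eggs"] * 3
y = [1, 0, 1, 0] * 3

clf = CalibratedClassifierCV(
    make_pipeline(CountVectorizer(), LogisticRegression()), cv=2
)
clf.fit(X, y)          # a list of str is accepted at fit time
clf.predict_proba(X)   # used to fail before this fix (see #8710)
```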

sklearn/base.py (review thread resolved; outdated)
@ogrisel (Member, Author) commented Feb 25, 2021

On top of fixing the extended version of test_calibration_pipeline, the refactoring also enables the behavior checked in the new test test_calibration_inconsistent_prefit_n_features_in, which I think is more user-friendly.
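
Roughly what the new test exercises, simplified and with made-up data (not the actual test code): a prefit base estimator trained on 5 features should make CalibratedClassifierCV raise a clear error when asked to calibrate on data with only 3 features.

```python
import numpy as np
import pytest

from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X, y = rng.randn(30, 5), np.arange(30) % 2

clf = LogisticRegression().fit(X, y)          # prefit on 5 features
calib = CalibratedClassifierCV(clf, cv="prefit")

with pytest.raises(ValueError, match="X has 3 features"):
    calib.fit(X[:, :3], y)                    # only 3 features: inconsistent
```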

@thomasjpfan (Member) left a comment

Most of my concerns have to do with calling _validate_data in the meta-estimator. This PR essentially turns off most of the validation in check_array when calling _validate_data so that n_features_in_ can be set by the meta-estimator. I am concerned about copying when X is a list of strings, which may be an edge case.

I have a proposal on having the meta-estimator call _check_n_features directly with an array-like in one of the comments.
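
A sketch of that proposal (hypothetical, simplified): instead of routing X through _validate_data with most checks disabled, the meta-estimator could call the private _check_n_features helper from BaseEstimator directly on the array-like input to set and later verify n_features_in_, without converting or copying X.

```python
from sklearn.base import BaseEstimator


class MetaSketch(BaseEstimator):
    """Hypothetical meta-estimator using _check_n_features directly."""

    def fit(self, X, y):
        # reset=True records n_features_in_ on self without modifying X.
        self._check_n_features(X, reset=True)
        # ... fit the inner estimator on the untouched X ...
        return self

    def predict_proba(self, X):
        # reset=False checks that X is consistent with what was seen in fit.
        self._check_n_features(X, reset=False)
        # ... delegate the actual prediction to the inner estimator ...
```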

sklearn/base.py (review thread resolved; outdated)
sklearn/calibration.py (review thread resolved; outdated)
sklearn/calibration.py (review thread resolved; outdated)
@ogrisel (Member, Author) commented Mar 9, 2021

Once #19633 is merged, I will update this PR accordingly.

@ogrisel ogrisel removed the Bug label Mar 17, 2021
X, y = text_data
clf = text_data_pipeline
X, y = dict_data
clf = dict_data_pipeline
@ogrisel (Member, Author):

@thomasjpfan note that while working on this, I discovered that _num_features(X) == 2 while DictVectorizer will probably never set n_features_in_ because it is flexible enough to accept a variable number of dict entries.

This is not a problem for this PR because I made CalibratedClassifierCV check _num_features(X) only if the base_estimator defines the n_features_in_ attribute, but I wonder whether it reveals that our _num_features utility is trying to be too smart.
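
An illustration of that observation (behaviour as described in this thread; the exact output may differ once #19740 changes _num_features for dicts). A list of dicts with two keys is reported as having 2 features, while DictVectorizer itself is not expected to set n_features_in_ since the number of keys can vary between samples.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.utils.validation import _num_features

X = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]

vec = DictVectorizer().fit(X)
print(hasattr(vec, "n_features_in_"))  # expected: False

try:
    # Reported 2 at the time of this discussion; #19740 proposes raising instead.
    print(_num_features(X))
except TypeError as exc:
    print(exc)
```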

Member:

What do you suggest as the alternative?

Member:

I opened #19740 to have _num_features raise an error for a collection of dicts.

X, y = self._validate_data(
    X, y, accept_sparse=['csc', 'csr', 'coo'],
    force_all_finite=False, allow_nd=True
)
@ogrisel (Member, Author) commented Mar 17, 2021

For consistency, I prefer to never validate the data in the meta-estimator and to use _safe_indexing in _fit_classifier_calibrator_pair instead. I am not sure if this is right or not.
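
A rough sketch of that approach (hypothetical, simplified): the meta-estimator never validates X itself and relies on _safe_indexing to slice the CV folds, so lists of strings, DataFrames, and sparse matrices are passed through to the inner pipeline untouched.

```python
from sklearn.utils import _safe_indexing


def fit_classifier_calibrator_pair_sketch(estimator, X, y, train, test):
    """Hypothetical helper mirroring the idea discussed above."""
    X_train, y_train = _safe_indexing(X, train), _safe_indexing(y, train)
    X_test, y_test = _safe_indexing(X, test), _safe_indexing(y, test)
    estimator.fit(X_train, y_train)
    # ... fit a calibrator on the held-out (X_test, y_test) fold (omitted) ...
    return estimator
```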

@ogrisel (Member, Author) commented Mar 17, 2021

@adrinjalali @lorentzenchr @thomasjpfan I think this is ready for a second review pass.

@adrinjalali (Member) left a comment

LGTM overall. You can ignore most comments if you don't agree with them :)

@@ -257,7 +265,19 @@ def fit(self, X, y, sample_weight=None):
        else:
            check_is_fitted(self.base_estimator)
Member:

I wonder why this if/else is needed; it also checks the actual type, which is not what some of us like to do (looking at you @ogrisel :P).

@ogrisel (Member, Author):

Good remark, let me check.

@ogrisel (Member, Author):

I need to pass the classes_ attribute explicitly, otherwise it fails, but I find the code cleaner this way.

sklearn/calibration.py (review thread resolved; outdated)
sklearn/calibration.py (review thread resolved; outdated)

sklearn/tests/test_calibration.py (review thread resolved; outdated)
sklearn/tests/test_calibration.py (review thread resolved; outdated)
@@ -959,10 +959,13 @@ def check_dtype_object(name, estimator_orig):


def check_complex_data(name, estimator_orig):
    rng = np.random.RandomState(42)
Member:

Is this change related to this PR? (I'm happy to keep it here anyway.)

@ogrisel (Member, Author) commented Mar 23, 2021

Yes, otherwise it would fail in this PR: the validation is now delegated to the underlying estimator, but only after the CV split done by CalibratedClassifierCV. This test was failing because the default StratifiedKFold split failed on the unique values in y, which made it impossible to have balanced "classes" in the validation folds.
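
A small illustration of that failure mode (made-up data): when every value of y is unique, each "class" has a single member, so a stratified split cannot place it in more than one fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(10).reshape(-1, 1)
y = np.arange(10)  # 10 samples, 10 distinct "classes"

try:
    list(StratifiedKFold(n_splits=3).split(X, y))
except ValueError as exc:
    print(exc)  # n_splits cannot be greater than the number of members in each class
```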

Number of features seen during :term:`fit`. Only defined if the
underlying base_estimator exposes such an attribute when fit.

.. versionadded:: 0.24
Member:

Suggested change:
- .. versionadded:: 0.24
+ .. versionadded:: 1.0

Member:

n_features_in_ was set in 0.24 (but we did not document it).

@ogrisel (Member, Author) commented Mar 23, 2021

Since it was there, I think it's more accurate to document that it was introduced in 0.24, even if it was not documented at the time.

This can be considered a fix for missing documentation if you wish.

sklearn/calibration.py (review thread resolved; outdated)

@ogrisel (Member, Author) commented Mar 29, 2021

@thomasjpfan I addressed or answered the remaining remarks; let me know what you think.

@thomasjpfan (Member) left a comment

Minor comment, otherwise LGTM

sklearn/calibration.py (review thread resolved; outdated)
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@ogrisel ogrisel merged commit 54ff7b7 into scikit-learn:main Mar 30, 2021
@ogrisel ogrisel deleted the n_features_in-calibration branch March 30, 2021 16:20
@glemaitre glemaitre mentioned this pull request Apr 22, 2021
Development

Successfully merging this pull request may close these issues.

CalibratedClassifierCV doesn't interact properly with Pipeline estimators
4 participants