Test and doc for n_features_in_ for sklearn.calibration #19555
Conversation
`CalibrationClassifier.predict_proba` calls `check_array` but then delegates the responsibility for checking `n_features_in_` to the calibrated classifier, which triggers the error message. This behavior makes sense for meta-estimators that pass all of the features through to their inner estimator.
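The delegation pattern described above can be sketched as follows (toy classes with hypothetical names, not scikit-learn's actual implementation): the meta-estimator does only light work itself, and any `n_features_in_` mismatch surfaces from the inner estimator's own error message.

```python
# Toy sketch of n_features_in_ delegation in a meta-estimator.
# Class names are illustrative, not scikit-learn's real classes.

class InnerEstimator:
    def fit(self, X):
        # The inner estimator records the number of features at fit time...
        self.n_features_in_ = len(X[0])
        return self

    def predict_proba(self, X):
        # ...and is the one that enforces consistency at prediction time.
        if len(X[0]) != self.n_features_in_:
            raise ValueError(
                f"X has {len(X[0])} features, but this estimator was "
                f"fitted with {self.n_features_in_} features."
            )
        return [[0.5, 0.5] for _ in X]


class MetaEstimator:
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X):
        self.estimator.fit(X)
        return self

    def predict_proba(self, X):
        # The meta-estimator passes all features through unchanged, so it
        # can delegate the n_features_in_ check to the inner estimator.
        return self.estimator.predict_proba(X)


meta = MetaEstimator(InnerEstimator()).fit([[0.0, 1.0], [1.0, 0.0]])
proba = meta.predict_proba([[0.5, 0.5]])
```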
Should we add
I think so. Is this a retrospective fix for 0.24?
The code is simpler this way but I wonder if it's right. If the nested base estimator does not have
Actually I found a problem with non-array inputs (e.g. a list of str): those are accepted at fit time but rejected at prediction time because the check was incomplete. This problem was previously reported in #8710 (comment). I will push a small refactoring fixing this at the same time.
On top of fixing the extended version of
Most of my concerns have to do with calling `_validate_data` in the meta-estimator. This PR essentially turns off most of the validation in `check_array` when calling `_validate_data`, so that `n_features_in_` can be set by the meta-estimator. I am concerned about copying when `X` is a list of strings, which may be an edge case. In one of the comments I have a proposal for having the meta-estimator call `_check_n_features` directly with an array-like.
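The idea behind that proposal can be sketched with a standalone stand-in for the private helper (scikit-learn's real `_check_n_features` is a method on `BaseEstimator` and its exact signature may differ): set `n_features_in_` from the array-like at fit time, and only verify consistency at prediction time, without running full `check_array` validation.

```python
def check_n_features(estimator, X, reset):
    """Stand-in for scikit-learn's private _check_n_features helper.

    When reset=True (fit time), record the number of features; when
    reset=False (predict time), verify it matches what was seen in fit.
    """
    n_features = len(X[0])  # works for any 2D array-like
    if reset:
        estimator.n_features_in_ = n_features
    elif n_features != getattr(estimator, "n_features_in_", n_features):
        raise ValueError(
            f"X has {n_features} features, but {type(estimator).__name__} "
            f"was fitted with {estimator.n_features_in_} features."
        )


class Meta:
    """Hypothetical meta-estimator using the helper directly."""

    def fit(self, X):
        # Set n_features_in_ without full check_array validation.
        check_n_features(self, X, reset=True)
        return self

    def predict(self, X):
        # Only verify consistency with what was seen during fit.
        check_n_features(self, X, reset=False)
        return [0 for _ in X]


meta = Meta().fit([[1, 2, 3], [4, 5, 6]])
```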
Once #19633 is merged, I will update this PR accordingly.
Force-pushed from 57742b6 to 24d8e90.
```python
X, y = text_data
clf = text_data_pipeline
X, y = dict_data
clf = dict_data_pipeline
```
@thomasjpfan note that while working on this, I discovered that `_num_features(X) == 2` here, while `DictVectorizer` will probably never set `n_features_in_` because it is flexible enough to accept a variable number of dict entries. This is not a problem for this PR, because I made `CalibratedClassifierCV` check `_num_features(X)` only if `base_estimator` defines the `n_features_in_` attribute, but I wonder whether it reveals that our `_num_features` utility is trying to be too smart.
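The pitfall can be illustrated with a simplified stand-in for the private `_num_features` utility (not its actual code): a duck-typed "length of the first sample" heuristic happily reports a feature count for a list of dicts, even though the number of dict entries per sample is not actually fixed.

```python
def num_features(X):
    # Simplified stand-in for sklearn.utils._num_features: infer the
    # feature count from the length of the first sample.
    first = X[0]
    if isinstance(first, str):
        raise TypeError("strings are not a collection of features")
    return len(first)


# A list of dicts "has" 2 features by this heuristic...
X_dicts = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
n = num_features(X_dicts)

# ...even though a later sample may carry a different number of entries,
# which DictVectorizer would accept just fine. The heuristic still only
# looks at the first sample and reports 2 here as well.
X_ragged = [{"a": 1, "b": 2}, {"a": 3}]
n_ragged = num_features(X_ragged)
```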
What do you suggest as the alternative?
I opened #19740 to have `_num_features` raise an error for a collection of dicts.
```python
X, y = self._validate_data(
    X, y, accept_sparse=['csc', 'csr', 'coo'],
    force_all_finite=False, allow_nd=True
)
```
For consistency, I would prefer to never validate the data in the meta-estimator and to use `_safe_indexing` in `_fit_classifier_calibrator_pair` instead. I am not sure whether this is right, though.
@adrinjalali @lorentzenchr @thomasjpfan I think this is ready for a second review pass.
LGTM overall. You can ignore most comments if you don't agree with them :)
sklearn/calibration.py (Outdated)

```diff
@@ -257,7 +265,19 @@ def fit(self, X, y, sample_weight=None):
         else:
             check_is_fitted(self.base_estimator)
```
I wonder why this `if/else` is needed; it also checks the actual type, which is not what some of us like to do (looking at you @ogrisel :P).
Good remark, let me check.
I need to pass the `classes_` attribute explicitly, otherwise it fails, but I find the code cleaner this way.
```diff
@@ -959,10 +959,13 @@ def check_dtype_object(name, estimator_orig):


 def check_complex_data(name, estimator_orig):
     rng = np.random.RandomState(42)
```
Is this change related to this PR? (I'm happy to keep it here anyway.)
Yes, otherwise it would fail in this PR: validation is now delegated to the underlying estimator, but only after the CV split performed by `CalibratedClassifierCV`. The test was failing because the default `StratifiedKFold` split broke on the all-unique values in `y`, which made it impossible to have balanced "classes" in the validation folds.
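The failure mode can be sketched in pure Python (illustrative helper and data, not the test in question): stratified K-fold splitting needs every class to appear at least `n_splits` times, which cannot hold when every target value is unique.

```python
from collections import Counter


def can_stratify(y, n_splits):
    # Stratified K-fold needs every class to appear at least n_splits
    # times; otherwise some folds would be missing that class entirely.
    return min(Counter(y).values()) >= n_splits


# Continuous-looking targets: every value is unique, so every "class"
# has a single member and a 5-fold stratified split is impossible.
y_unique = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
ok_unique = can_stratify(y_unique, n_splits=5)

# Repeated class labels make stratification feasible again.
y_classes = [0, 1] * 5
ok_classes = can_stratify(y_classes, n_splits=5)
```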
```rst
Number of features seen during :term:`fit`. Only defined if the
underlying base_estimator exposes such an attribute when fit.

.. versionadded:: 0.24
```
Suggested change:

```diff
-.. versionadded:: 0.24
+.. versionadded:: 1.0
```
`n_features_in_` was set in 0.24 (but we did not document it).
Since it was there, I think it's more accurate to document that it was introduced in 0.24, even if it was not documented at the time. This can be considered a fix for bad documentation, if you wish.
@thomasjpfan I addressed or answered the remaining remarks; let me know what you think.
Minor comment, otherwise LGTM
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Towards #19333.
Fixes: #8710.
The work was already done. I just had to enable the tests and update the docstring.