FIX min_value and max_value not indexed when features are removed #29451
Conversation
Force-pushed from 95f3150 to 2c01aaa (compare)
Thanks a lot for the PR!
You will need to add a changelog entry in `doc/whats_new/v1.6.rst`.
I need to spend a bit more time to make sure I understand the actual fix, but I already have a few comments on the test.
sklearn/impute/tests/test_impute.py
Outdated
```python
    missing_column, check_column, min_value, max_value
):
    """Check that we properly apply the empty feature mask to
    `min_value` and `max_value`.
```
Maybe mention the original GitHub issue? https://github.com/scikit-learn/scikit-learn/issues/29355
Sure, added in 1816d0e.
sklearn/impute/tests/test_impute.py
Outdated
```python
    """Check that we properly apply the empty feature mask to
    `min_value` and `max_value`.
    """
    X = np.array([[1, 2, -1, -1], [4, 5, 6, 6], [7, 8, -1, -1], [10, 11, 12, 12]])
```
Is there a reason you used -1? If not, I would use the default `np.nan` missing value rather than -1; I find it slightly easier to parse visually.
There is no specific reason; I was following the previous test case. I agree with your suggestion and fixed it in 1816d0e.
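As a small illustration of the point (a hedged sketch with hypothetical values, not the committed test data), using `np.nan` lets the missing entries be detected directly, without comparing against a sentinel:

```python
import numpy as np

# Illustrative fixture only: columns 2 and 3 use np.nan as the missing
# marker instead of a sentinel value like -1.
X = np.array(
    [
        [1.0, 2.0, np.nan, np.nan],
        [4.0, 5.0, 6.0, 6.0],
        [7.0, 8.0, np.nan, np.nan],
        [10.0, 11.0, 12.0, 12.0],
    ]
)

# The missing-value mask falls out of np.isnan; no sentinel comparison needed.
print(np.isnan(X).sum(axis=0))  # per-column count of missing values
```

With a sentinel like -1, the reader has to remember which value means "missing"; with `np.nan` the mask is unambiguous.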
sklearn/impute/tests/test_impute.py
Outdated
```python
    min_value_array = [-np.inf] * 4
    max_value_array = [np.inf] * 4
    min_value_array[check_column] = min_value
```
Just curious: this is not needed to trigger the original issue, right? Is there a good reason you added it?
Probably related, but what are your two different parametrized tests checking?
You're right. With `keep_empty_features=False` and array-like `min_value`/`max_value`, we can already reproduce the error mentioned in the original issue. I added the two parametrized tests to check that the corresponding elements of `min_value`/`max_value` are dropped correctly.
OK, so I guess what you are saying is that you fixed an additional bug compared to the one that was reported. Thanks for this 🙏!
Force-pushed from 2c01aaa to 46a7c7f (compare)
@lesteve Thanks for the review! I have addressed your comments and updated the changelog.
```python
@@ -756,8 +756,17 @@ def fit_transform(self, X, y=None, **params):
            self.n_iter_ = 0
            return super()._concatenate_indicator(Xt, X_indicator)

        self._min_value = self._validate_limit(self.min_value, "min", X.shape[1])
        self._max_value = self._validate_limit(self.max_value, "max", X.shape[1])
        self._min_value = self._validate_limit(
```
I would keep the two original lines with `X.shape[1]` and move them up before the `X, Xt, mask_missing_values, complete_mask = ...` line, i.e. when `X` is still the original input data.
I feel it would make the code easier to understand.
This suggestion won't work as-is, since `X` may not be a numpy array at this stage (for example, a list, as in the test you added).
I need to spend more time looking at the code for a better suggestion ...
Hi @gunsodo,
I went through your PR and just wanted to leave some comments.
Thanks for your work!
sklearn/impute/tests/test_impute.py
Outdated
```python
        keep_empty_features=False,
    )

    X_no_missing = X[:, [i for i in range(X.shape[1]) if i != missing_column]]
```
```diff
-    X_no_missing = X[:, [i for i in range(X.shape[1]) if i != missing_column]]
+    X_no_missing = np.delete(X, missing_column, axis=1)
```
I think this would be easier to read.
And maybe also rename `X_no_missing` to `X_without_missing_column`, so we don't wrongly suggest there are no missing values at all in it.
Thanks! Renamed and improved in 11964ea.
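For reference, a minimal sketch (with an illustrative array, not the test's data) of why the two spellings are equivalent and why `np.delete` reads better:

```python
import numpy as np

X = np.arange(12).reshape(3, 4)  # hypothetical 3x4 data
missing_column = 2               # index of the feature to drop

# np.delete returns a copy of X with the given column removed,
# equivalent to the list-comprehension indexing but easier to read.
X_without_missing_column = np.delete(X, missing_column, axis=1)
assert X_without_missing_column.shape == (3, 3)
assert np.array_equal(
    X_without_missing_column,
    X[:, [i for i in range(X.shape[1]) if i != missing_column]],
)
```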
sklearn/impute/tests/test_impute.py
Outdated
```python
    assert_allclose(np.min(X_imputed[np.isnan(X_no_missing)]), min_value)
    assert_allclose(np.max(X_imputed[np.isnan(X_no_missing)]), max_value)
```
If I understand this correctly, then we rely on the carefully designed data (`X`) we pass into this test to actually return entries equal to `min_value` and `max_value`, which is alright for a test.
If that is the case, then these two assert statements suggest an exactness that is not really tested for; I think we would have more transparency if we used a plain `==` instead of `assert_allclose()`.
And I feel that we should add a comment above, saying that we expect `min_value` and `max_value` to actually be present in the imputed data, because we chose the data passed to the test accordingly.
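A quick sketch of why plain `==` is justified here (assumed toy values, not the test's data): clipping writes the bound values into the output verbatim, so exact equality holds.

```python
import numpy as np

min_value, max_value = 4.0, 5.0
imputed = np.array([3.2, 4.7, 6.1])  # hypothetical raw imputed values

# Values below/above the bounds are replaced by the bound itself,
# so the exact bound appears in the output and no float tolerance is needed.
clipped = np.clip(imputed, min_value, max_value)
assert clipped.min() == min_value  # exact, not approximate
assert clipped.max() == max_value
```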
Thank you for your suggestion. I pushed the fix in 11964ea.
sklearn/impute/_iterative.py
Outdated
```python
        self._min_value = self._validate_limit(
            self.min_value, "min", complete_mask.shape[1]
        )
        self._max_value = self._validate_limit(
            self.max_value, "max", complete_mask.shape[1]
        )
```
I was wondering if, instead of checking the shape of the limit in lines 681-686, we could do the validation at the beginning of `fit()` and `fit_transform()`, like we usually do. Something like `if len(self.min_value) != X.shape[1]: raise ValueError(...)` ... It feels a bit cleaner to me.
This would mean that the code in li. 759-764 could stay as it was before.
But I think your take on dealing with this is also okay.
Actually, it should be
`if isinstance(self.min_value, np.ndarray) and len(self.min_value) != X.shape[1]: raise ValueError(...)`
or similar, since we only need this check when the input is an array.
Thanks for the comments! As @lesteve suggested above, it could be problematic when any of `X`, `self.min_value`, or `self.max_value` is not an `np.ndarray`. We might have to safely convert all of them to `np.ndarray`s first before we can check the lengths. However, this seems redundant, because we already call `_validate_data(X)` inside `self._initial_imputation` and `check_array(self.min_value)` inside `self._validate_limit`. Which approach do you think would work best?
I think both are alright, but I agree that earlier validation is more readable.
I tried this locally:

- revert the changes in li. 759-764 (starting with `self._min_value = self._validate_limit(`)
- insert into li. 747 (right before `super()._fit_indicator(complete_mask)`):

```python
if len(self.min_value) != complete_mask.shape[1]:
    raise ValueError(
        f"'min_value' should be of shape ({complete_mask.shape[1]},) "
        f"when an array-like is provided. Got {len(self.min_value)}, instead."
    )
if len(self.max_value) != complete_mask.shape[1]:
    raise ValueError(
        f"'max_value' should be of shape ({complete_mask.shape[1]},) "
        f"when an array-like is provided. Got {len(self.max_value)}, instead."
    )
```

- comment out / delete li. 681-686 (starting with `if not limit.shape[0] == n_features:`)

This seems to work well and looks a bit neater (especially if we straighten the two `raise ValueError` blocks into one check).
What do you think?
Thank you for your suggestions! I found two other issues with this solution, but managed to resolve them in my recent push.

1. `self.min_value`/`self.max_value` can be `None` or a scalar. We may need to replace the condition with `self.min_value is not None and not np.isscalar(self.min_value) and len(self.min_value) != complete_mask.shape[1]` instead. Personally, I prefer letting `_validate_limit` handle these checks implicitly, but I also agree that passing `complete_mask` to `_validate_limit` is somewhat unclear.
2. When `self.min_value` is a scalar, `self._min_value` already has the same length as `X` after `_validate_limit`. In this case, we need to conditionally index only if `self.min_value` was an array-like from the beginning.

Also, I decided to check and raise the errors separately to give users a clearer indication of which bound caused the error (`min_value` or `max_value`). Please let me know if we can further improve the code readability.
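The condition described in point 1 can be sketched as a small standalone helper (hypothetical name, not the code in the PR):

```python
import numpy as np

def check_limit_length(limit, name, n_features):
    """Hypothetical guard: only array-like limits need a length check;
    None and scalars are expanded to full length by _validate_limit later."""
    if limit is not None and not np.isscalar(limit) and len(limit) != n_features:
        raise ValueError(
            f"'{name}' should be of shape ({n_features},) when an "
            f"array-like is provided. Got {len(limit)}, instead."
        )

check_limit_length(None, "min_value", 4)          # None: skipped
check_limit_length(-np.inf, "min_value", 4)       # scalar: skipped
check_limit_length([0, 0, 0, 0], "min_value", 4)  # matching length: passes
try:
    check_limit_length([0, 0, 0], "max_value", 4)
except ValueError as exc:
    print(exc)
```

The short-circuiting order matters: `limit is not None` must come first so `len()` is never called on `None`, and `np.isscalar` filters out plain floats such as `-np.inf`.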
doc/whats_new/v1.6.rst
Outdated
```rst
@@ -174,6 +174,9 @@ Changelog
- |Fix| :class:`impute.KNNImputer` excludes samples with nan distances when
  computing the mean value for uniform weights.
  :pr:`29135` by :user:`Xuefeng Xu <xuefeng-xu>`.
- |Fix| :class:`impute.IterativeImputer` no longer raises an error when `min_value` and `max_value`
  are array-like and some features are dropped due to `keep_empty_features = False`.
  :pr:`29451` by :user:`Guntitat Sawadwuthikul <gunsodo>`.
```
You could also add here that you fixed a bug where `max_value` and `min_value` would not index correctly when they are arrays and full-nan columns are present.
Added in 9a42943, thanks!
Force-pushed from 68eab68 to ad10964 (compare)
Thanks for your work @gunsodo. I think it looks great, and I'll approve it since I've manually checked everything and it all seems good to me. However, please note that I'm not a maintainer, and you'll still need two approvals from maintainers to merge.
sklearn/impute/tests/test_impute.py
Outdated
```python
@@ -1546,6 +1547,46 @@ def test_iterative_imputer_constant_fill_value():
    assert_array_equal(imputer.initial_imputer_.statistics_, fill_value)


@pytest.mark.parametrize(
    "missing_column,check_column,min_value,max_value",
```
```diff
-    "missing_column,check_column,min_value,max_value",
+    "missing_column, check_column, min_value, max_value",
```
Nit, but I would be glad if we added some whitespace here.
Sure, I added the changes in 2ece665.
Force-pushed from ad10964 to 2ece665 (compare)
@StefanieSenger @lesteve Thank you very much for all your suggestions. May I ask if there is any action I need to take for the two required approvals?
Sorry, I have not had time to take a proper look at how to simplify the checks. I have a vague feeling this is a bit too complicated right now, but I need more time to figure out whether something can be done about it.
Try to ping me in a week if you haven't got a better review from me ...
In the meantime, I have some more superficial comments.
sklearn/impute/tests/test_impute.py
Outdated
```python
@pytest.mark.parametrize(
    "missing_column, check_column, min_value, max_value",
    [
        (2, 3, 4, 5),
```
Can you simplify the test, i.e. not use a parametrization unless there is a good reason? I find it makes it a bit harder to follow what is going on.
Also, if you can add a comment on why the result should be the expected one, that would be great. This is probably super easy for you right now because you have it fresh in your memory, but imagine someone (maybe even you) in 6 months who hasn't looked at this code for a while and needs to figure out what this test is trying to check.
Thank you for pointing this out. I have improved the test readability by removing the parametrization as you suggested. I also added comments to help understand the test case better.
Force-pushed from 2ece665 to 88537a8 (compare)
Force-pushed from 88537a8 to 06b286a (compare)
Force-pushed from 06b286a to f72a9a8 (compare)
Sorry, I have been busy with other things and likely won't have time to look at this before September 😓. I am setting the "Waiting for reviewer" label, hoping someone else may have time to take a look ...
Let me have a look at this PR, since I checked the codebase recently for another fix.
@glemaitre Do I need to rebase once again before you review it?
I will quickly answer that: we've got a bit of a jam concerning reviews. Sorry that you need to wait this long; that's not the rule, and you are having bad luck here. There is nothing you need to do (unless there are conflicts to be resolved).
It looks almost good. I just think that we need to centralize the check, since this is all related to "limit" or "bound".
We will need to pass additional parameters, such as the mask.
doc/whats_new/v1.6.rst
Outdated
```rst
- |Fix| When `min_value` and `max_value` are array-like and some features are dropped due to
  `keep_empty_features = False`, :class:`impute.IterativeImputer` no longer raises an error and
  now indexes correctly.
```
Adding a new line and correcting a typo:

```diff
-- |Fix| When `min_value` and `max_value` are array-like and some features are dropped due to
-  `keep_empty_features = False`, :class:`impute.IterativeImputer` no longer raises an error and
-  now indexes correctly.
+- |Fix| When `min_value` and `max_value` are array-like and some features are dropped due to
+  `keep_empty_features=False`, :class:`impute.IterativeImputer` no longer raises an error and
+  now indexes correctly.
```
Thanks for the correction. I also moved this to `29451.fix.rst`.
sklearn/impute/_iterative.py
Outdated
```python
@@ -747,6 +741,28 @@ def fit_transform(self, X, y=None, **params):
            X, in_fit=True
        )

        n_features_in = complete_mask.shape[1]
        err_msg = (
            f"should be of shape ({n_features_in},) when an array-like is provided. "
```
This is injected afterwards; I think the punctuation is wrong right now.
```diff
-            f"should be of shape ({n_features_in},) when an array-like is provided. "
+            f"should be of shape ({n_features_in},) when an array-like is provided"
```
Thanks for the catch. I resolved this along with your suggestion below. The corrected error message is now in `_validate_limit`.
sklearn/impute/_iterative.py
Outdated
```python
        n_features_in = complete_mask.shape[1]
        err_msg = (
            f"should be of shape ({n_features_in},) when an array-like is provided. "
        )

        if (
            self.min_value is not None
            and not np.isscalar(self.min_value)
            and len(self.min_value) != n_features_in
        ):
            raise ValueError(
                f"'min_value' {err_msg}. Got {len(self.min_value)}, instead."
            )
        if (
            self.max_value is not None
            and not np.isscalar(self.max_value)
            and len(self.max_value) != n_features_in
        ):
            raise ValueError(
                f"'max_value' {err_msg}. Got {len(self.max_value)}, instead."
            )
```
I think we can factorize the code slightly to avoid duplicating the if block.
```diff
-        n_features_in = complete_mask.shape[1]
-        err_msg = (
-            f"should be of shape ({n_features_in},) when an array-like is provided. "
-        )
-        if (
-            self.min_value is not None
-            and not np.isscalar(self.min_value)
-            and len(self.min_value) != n_features_in
-        ):
-            raise ValueError(
-                f"'min_value' {err_msg}. Got {len(self.min_value)}, instead."
-            )
-        if (
-            self.max_value is not None
-            and not np.isscalar(self.max_value)
-            and len(self.max_value) != n_features_in
-        ):
-            raise ValueError(
-                f"'max_value' {err_msg}. Got {len(self.max_value)}, instead."
-            )
+        def check_bound_values(value, name, n_features_in):
+            if (
+                value is not None
+                and not np.isscalar(value)
+                and len(value) != n_features_in
+            ):
+                raise ValueError(
+                    f"'{name}' should be of shape ({n_features_in},) when an array-like"
+                    f" is provided. Got {len(value)}, instead."
+                )
+
+        check_bound_values(self.min_value, "min_value", complete_mask.shape[1])
+        check_bound_values(self.max_value, "max_value", complete_mask.shape[1])
```
Thinking about it, I think it would make more sense to include this check in `self._validate_limit` as well.
We might need to pass more parameters to `_validate_limit`, but it will be the central place to validate the bounds.
sklearn/impute/_iterative.py
Outdated
```python
        # Make sure to remove the empty feature elements from the bounds
        nonempty_feature_mask = np.logical_not(np.all(complete_mask, axis=0))
        if len(self._min_value) == len(nonempty_feature_mask):
            self._min_value = self._min_value[nonempty_feature_mask]
        if len(self._max_value) == len(nonempty_feature_mask):
            self._max_value = self._max_value[nonempty_feature_mask]
```
Same here: I think we should move this code into the `_validate_limit` function.
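To illustrate the indexing under discussion with a toy mask (not sklearn's actual code): `complete_mask` is True where a value is missing, so an all-True column marks an empty feature whose bound entry must be dropped.

```python
import numpy as np

# Toy example: feature 1 is entirely missing.
complete_mask = np.array(
    [[False, True, False],
     [False, True, True],
     [False, True, False]]
)
min_value = np.array([0.0, 1.0, 2.0])  # hypothetical per-feature lower bounds

# A column that is all True means the feature is empty and will be dropped,
# so the corresponding bound entry must be removed as well.
nonempty_feature_mask = np.logical_not(np.all(complete_mask, axis=0))
print(nonempty_feature_mask)             # [ True False  True]
print(min_value[nonempty_feature_mask])  # [0. 2.]
```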
@glemaitre Thanks for your comments. I have made several fixes, mainly by moving the validation into `_validate_limit` as suggested.
A couple of nitpicks. Otherwise it looks good.
Force-pushed from a253448 to 1f58af9 (compare)
@glemaitre Thank you very much! Just committed all your nitpicks :)
@adrinjalali do you want to have a quick look after the approval of @StefanieSenger and myself?
Otherwise LGTM.
sklearn/impute/_iterative.py
Outdated
```python
        Returns
        -------
        limit : ndarray, shape (n_features,)
            Array of limits, one for each feature.
        """
        n_features_in = len(is_empty_feature)
        if limit is not None and not np.isscalar(limit) and len(limit) != n_features_in:
```
Since we're not validating them, I think `_num_samples(limit)` makes more sense here.
I pushed the change in b638d30 but am unsure if this addresses your concern. Please let me know if it should be done differently. Thanks for your suggestion!
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>
Force-pushed from 1f58af9 to b638d30 (compare)
Please avoid force-pushing here; it makes it hard to review changes.
Force-pushing is necessary though, once you have done it. There is no way, or at least no easy way(?), to avoid it once the branch has a crude history, I think.
Force-pushing is almost never necessary. You can always pull from the branch, merge the remote into your local branch, and that'll be an extra commit in the history (which we don't care about).
@adrinjalali @StefanieSenger Well noted; I was personally unsure about having merge commits. Thanks for the advice!
Reference Issues/PRs
Fixes #29355.
What does this implement/fix? Explain your changes.
- Validate `min_value` and `max_value` against the original number of features (before dropping the empty ones)
- Index `min_value` and `max_value` to remove the empty features

@lesteve Please let me know if this does not align with your suggestion.
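The two steps above can be sketched together in plain numpy (an illustrative helper with assumed names, not the actual sklearn implementation):

```python
import numpy as np

def validate_and_trim_limit(limit, complete_mask):
    """Illustrative: validate an array-like limit against the original
    feature count, then drop the entries that belong to empty features."""
    n_features = complete_mask.shape[1]
    limit = np.asarray(limit, dtype=float)
    if limit.shape[0] != n_features:
        raise ValueError(
            f"limit should be of shape ({n_features},) when an "
            f"array-like is provided. Got {limit.shape[0]}, instead."
        )
    # Empty features have an all-True missing mask; drop their bounds.
    nonempty = np.logical_not(np.all(complete_mask, axis=0))
    return limit[nonempty]

mask = np.array([[False, True], [False, True]])   # feature 1 fully missing
trimmed = validate_and_trim_limit([0.0, 5.0], mask)
print(trimmed)  # [0.]
```

Validation happens against the original width, indexing against the missing-value mask, matching the two bullet points of the PR summary.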