FIX min_value and max_value not indexed when features are removed #29451


Merged

Conversation

gunsodo
Contributor

@gunsodo gunsodo commented Jul 10, 2024

Reference Issues/PRs

Fixes #29355.

What does this implement/fix? Explain your changes.

  • Check the shape of the min_value and max_value with the original number of features (before dropping those empty ones)
  • Apply the feature mask to min_value and max_value to remove the empty features

@lesteve Please let me know if this does not align with your suggestion.
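The two steps above can be sketched with plain NumPy (a standalone illustration of the idea, not the scikit-learn code; the arrays are hypothetical):

```python
import numpy as np

# X has one all-missing column (index 2); array-like bounds were supplied
# for all four original features.
X = np.array([[1.0, 2.0, np.nan, 4.0],
              [5.0, 6.0, np.nan, 8.0]])
min_value = np.array([-np.inf, 0.0, 0.0, 3.0])

# Step 1: validate the bounds against the original number of features.
assert len(min_value) == X.shape[1]

# Step 2: with keep_empty_features=False the empty column is dropped from X,
# so the same mask must be applied to the bounds to keep shapes aligned.
nonempty_feature_mask = ~np.all(np.isnan(X), axis=0)
min_value = min_value[nonempty_feature_mask]
print(min_value.shape)  # (3,)
```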


github-actions bot commented Jul 10, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b638d30. Link to the linter CI: here

@gunsodo gunsodo force-pushed the fix-iterative-imputer-drop-empty branch from 95f3150 to 2c01aaa Compare July 10, 2024 23:58
Member

@lesteve lesteve left a comment

Thanks a lot for the PR!

You will need to add a changelog in doc/whats_new/v1.6.rst.

I need to spend a bit more time to make sure I understand the actual fix, but I already have a few comments on the test.

missing_column, check_column, min_value, max_value
):
"""Check that we properly apply the empty feature mask to
`min_value` and `max_value`.
Member

Mention the original GitHub issue maybe? https://github.com/scikit-learn/scikit-learn/issues/29355

Contributor Author

Sure, added in 1816d0e.

"""Check that we properly apply the empty feature mask to
`min_value` and `max_value`.
"""
X = np.array([[1, 2, -1, -1], [4, 5, 6, 6], [7, 8, -1, -1], [10, 11, 12, 12]])
Member

Is there a reason you used -1? If not, I would use the default np.nan missing value (rather than -1); I find it slightly easier to parse visually.

Contributor Author

There is no specific reason. I was referring to the previous test case. I agreed with your suggestion and fixed it in 1816d0e.


min_value_array = [-np.inf] * 4
max_value_array = [np.inf] * 4
min_value_array[check_column] = min_value
Member

Just curious: this is not needed to trigger the original issue, right? Is there a good reason you added this?

Probably related, but what are your two different parametrized tests checking?

Contributor Author

You're right. With keep_empty_features=False and array-like min_value/max_value, we can already reproduce the error mentioned in the original issue. I added the two parametrized tests to check that the corresponding elements of min_value/max_value are dropped correctly.
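A minimal NumPy-only sketch of the shape mismatch behind the original issue (this mimics the failure mode rather than calling IterativeImputer itself; the arrays are made up):

```python
import numpy as np

# After an all-missing column is dropped, the data has 3 features,
# but the user-supplied bound still has the original length of 4.
X_after_drop = np.ones((2, 3))
max_value = np.full(4, 10.0)

try:
    # Clipping element-wise fails: shapes (2, 3) and (4,) do not broadcast.
    np.clip(X_after_drop, None, max_value)
except ValueError as err:
    print("broadcast failure:", err)
```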

Member

@lesteve lesteve Jul 11, 2024

OK, so I guess what you are saying is that you fixed an additional bug beyond the one that was reported; thanks for this 🙏!

@gunsodo gunsodo force-pushed the fix-iterative-imputer-drop-empty branch from 2c01aaa to 46a7c7f Compare July 11, 2024 11:42
@gunsodo
Contributor Author

gunsodo commented Jul 11, 2024

@lesteve Thanks for the review! I have addressed your comments and updated the changelog.

@@ -756,8 +756,17 @@ def fit_transform(self, X, y=None, **params):
self.n_iter_ = 0
return super()._concatenate_indicator(Xt, X_indicator)

self._min_value = self._validate_limit(self.min_value, "min", X.shape[1])
self._max_value = self._validate_limit(self.max_value, "max", X.shape[1])
self._min_value = self._validate_limit(
Member

@lesteve lesteve Jul 11, 2024

I would keep the two original lines with X.shape[1] and move them up before the X, X_t, mask_missing_values, complete_mask = ... line, i.e. when X is still the original input data.

I feel it would make the code easier to understand.

Member

This suggestion won't work as it is, since X may not be a NumPy array at this stage (for example a list, as in the test you added).

I need to spend more time looking at the code for a better suggestion ...

Contributor

@StefanieSenger StefanieSenger left a comment

Hi @gunsodo,
I went through your PR and just wanted to leave some comments.
Thanks for your work!

keep_empty_features=False,
)

X_no_missing = X[:, [i for i in range(X.shape[1]) if i != missing_column]]
Contributor

Suggested change
X_no_missing = X[:, [i for i in range(X.shape[1]) if i != missing_column]]
X_no_missing = np.delete(X, missing_column, axis=1)

I think this would be easier to read.

And maybe also renaming X_no_missing into X_without_missing_column, so we don't wrongly suggest no missing values at all in it.
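The two forms select the same columns; a quick standalone check (with a made-up array, outside the actual test):

```python
import numpy as np

X = np.arange(12).reshape(3, 4)
missing_column = 2

# Original list-comprehension indexing vs. the suggested np.delete.
kept = X[:, [i for i in range(X.shape[1]) if i != missing_column]]
deleted = np.delete(X, missing_column, axis=1)

print(np.array_equal(kept, deleted))  # True
```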

Contributor Author

@gunsodo gunsodo Jul 14, 2024

Thanks! Renamed and improved in 11964ea.

Comment on lines 1585 to 1586
assert_allclose(np.min(X_imputed[np.isnan(X_no_missing)]), min_value)
assert_allclose(np.max(X_imputed[np.isnan(X_no_missing)]), max_value)
Contributor

If I understand this correctly, then we rely on the carefully designed input data (X) to actually return entries equal to min_value and max_value, which is alright for a test.

And if this is the case, then these two assert statements suggest some exactness that is not really tested for and I think we could have more transparency if we use a plain == instead of assert_allclose().

And I feel that we should add a comment above, saying that we expect min_value and max_value to be actually present in the imputed data, because we chose the data passed to the test accordingly.
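The distinction the comment draws can be seen directly (a standalone sketch; the value 4.0 + 1e-12 is a made-up number, not taken from the test):

```python
import numpy as np
from numpy.testing import assert_allclose

imputed_value = 4.0 + 1e-12  # off from the expected bound by a tiny float error

assert_allclose(imputed_value, 4.0)  # passes: within the default rtol
print(imputed_value == 4.0)          # False: plain == demands exactness
```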

Contributor Author

Thank you for your suggestion. I pushed the fix in 11964ea.

Comment on lines 759 to 807
self._min_value = self._validate_limit(
self.min_value, "min", complete_mask.shape[1]
)
self._max_value = self._validate_limit(
self.max_value, "max", complete_mask.shape[1]
)
Contributor

I was wondering if instead of checking the shape of the limit in lines 681-686 we could do the validation in the beginning of fit() and fit_transform(), like we usually do. Something like if len(self.min_value) != X.shape[1]: raise ValueError(...) ... It feels a bit cleaner to me.

This would mean that the code in li. 759-764 could stay as it was before.

But I think your take on dealing with this is also okay.

Contributor

Actually, it should be

if isinstance(self.min_value, np.ndarray) and len(self.min_value) != X.shape[1]: raise ValueError(...)

Or similar, since we need this only if the input is an array.

Contributor Author

@gunsodo gunsodo Jul 12, 2024

Thanks for the comments! As @lesteve suggested above, it could be problematic when either X, self.min_value, or self.max_value are not np.ndarray. We might have to safely convert all of them to np.ndarrays first then we can check the lengths. However, this seems redundant because we call _validate_data(X) inside self._initial_imputation and check_array(self.min_value) inside self._validate_limit already. Which approach do you think would work best?

Contributor

I think both are alright, but I agree that earlier validation is more readable.

I tried this locally:

  1. revert changes in li. 759-764 (starting with self._min_value = self._validate_limit()

  2. insert into li. 747 (right before super()._fit_indicator(complete_mask)):

        if len(self.min_value) != complete_mask.shape[1]: 
            raise ValueError(
                f"'min_value' should be of shape ({complete_mask.shape[1]},)"
                f"when an array-like is provided. Got {len(self.min_value)}, instead."
            )
        
        if len(self.max_value) != complete_mask.shape[1]: 
            raise ValueError(
                f"'max_value' should be of shape ({complete_mask.shape[1]},)"
                f"when an array-like is provided. Got {len(self.max_value)}, instead."
            )
  3. comment out / delete li. 681-686 (starting with if not limit.shape[0] == n_features:)

This seems to work well and looks a bit neater (if we consolidate the two raise ValueError checks into one).

What do you think?

Copy link
Contributor Author

@gunsodo gunsodo Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your suggestions! I found two other issues considering this solution but managed to resolve them in my recent push.

  • self.min_value/self.max_value can be None or a scalar. We may need to replace the condition with self.min_value is not None and not np.isscalar(self.min_value) and len(self.min_value) != complete_mask.shape[1] instead. Personally, I prefer letting _validate_limit handle these checks implicitly, but I also agree that passing complete_mask to _validate_limit is somewhat unclear.
  • When self.min_value is a scalar, self._min_value already has the same length as X after _validate_limit. In this case, we need to index conditionally, only when self.min_value was array-like from the beginning.

Also, I decided to check and raise the errors separately to provide the users with a better explanation of the error (whether it is min_value or max_value). Please let me know if we could further enhance the code readability.
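The check described above can be sketched as a small standalone helper (an illustration of the logic being discussed, not the code merged in the PR; the function name is made up):

```python
import numpy as np

def check_limit_length(limit, name, n_features):
    """Only array-like limits need a length check; None and scalars
    are broadcast to n_features later."""
    if limit is not None and not np.isscalar(limit) and len(limit) != n_features:
        raise ValueError(
            f"'{name}' should be of shape ({n_features},) when an "
            f"array-like is provided. Got {len(limit)}, instead."
        )

check_limit_length(None, "min_value", 4)          # OK: broadcast later
check_limit_length(0.5, "min_value", 4)           # OK: broadcast later
check_limit_length([0, 1, 2, 3], "min_value", 4)  # OK: length matches
try:
    check_limit_length([0, 1], "max_value", 4)
except ValueError as err:
    print(err)
```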

@@ -174,6 +174,9 @@ Changelog
- |Fix| :class:`impute.KNNImputer` excludes samples with nan distances when
computing the mean value for uniform weights.
:pr:`29135` by :user:`Xuefeng Xu <xuefeng-xu>`.
- |Fix| :class:`impute.IterativeImputer` no longer raises an error when `min_value` and `max_value`
are array-like and some features are dropped due to `keep_empty_features = False`.
:pr:`29451` by :user:`Guntitat Sawadwuthikul <gunsodo>`.
Contributor

You could also mention here that you fixed a bug where max_value and min_value would not be indexed correctly when they are arrays and all-nan columns are present.

Contributor Author

Added in 9a42943, thanks!

@gunsodo gunsodo force-pushed the fix-iterative-imputer-drop-empty branch 4 times, most recently from 68eab68 to ad10964 Compare July 18, 2024 23:50
@gunsodo gunsodo requested a review from StefanieSenger July 18, 2024 23:50
Contributor

@StefanieSenger StefanieSenger left a comment

Thanks for your work @gunsodo. I think it looks great, and I'll approve it since I've manually checked everything and it all seems good to me. However, please note that I'm not a maintainer, and you'll still need two approvals from maintainers to merge.

@@ -1546,6 +1547,46 @@ def test_iterative_imputer_constant_fill_value():
assert_array_equal(imputer.initial_imputer_.statistics_, fill_value)


@pytest.mark.parametrize(
"missing_column,check_column,min_value,max_value",
Contributor

Suggested change
"missing_column,check_column,min_value,max_value",
"missing_column, check_column, min_value, max_value",

Nit, but I would be glad if we added some whitespace here.

Contributor Author

Sure, I added the changes in 2ece665.

@gunsodo gunsodo force-pushed the fix-iterative-imputer-drop-empty branch from ad10964 to 2ece665 Compare July 23, 2024 11:31
@gunsodo
Contributor Author

gunsodo commented Jul 23, 2024

@StefanieSenger @lesteve Thank you very much for all your suggestions. May I ask if there is any action I need to take for the two required approvals?

Member

@lesteve lesteve left a comment

Sorry, I have not had time to take a proper look at how to simplify the checks. I have a vague feeling this is a bit too complicated right now, but I need more time to figure out whether something can be done about it.

Try to ping me in a week if you haven't got a better review from me ...

In the meantime, I have some more superficial comments.

@pytest.mark.parametrize(
"missing_column, check_column, min_value, max_value",
[
(2, 3, 4, 5),
Member

Can you simplify the test, i.e. not use parametrization unless there is a good reason? I find it makes it a bit harder to follow what is going on.

Member

Also, if you can add a comment on why the result should be the expected one, it would be great. This is probably super easy for you right now because you have it fresh in your memory, but imagine someone (maybe even you) in 6 months who hasn't looked at this code for a while and needs to figure out what this test is trying to check.

Contributor Author

Thank you for pointing this out. I have improved the test readability by removing the parametrization as you suggested. I also added comments to help understand the test case better.

@gunsodo gunsodo force-pushed the fix-iterative-imputer-drop-empty branch from 2ece665 to 88537a8 Compare July 24, 2024 23:57
@gunsodo gunsodo requested a review from lesteve July 25, 2024 23:48
@gunsodo gunsodo force-pushed the fix-iterative-imputer-drop-empty branch from 88537a8 to 06b286a Compare July 26, 2024 09:48
@gunsodo gunsodo force-pushed the fix-iterative-imputer-drop-empty branch from 06b286a to f72a9a8 Compare August 3, 2024 05:20
@lesteve
Member

lesteve commented Aug 3, 2024

Sorry I have been busy with other things and likely won't have time to look at this before September 😓. I am setting the "Waiting for reviewer" label hoping someone else may have time to take a look ...

@glemaitre
Member

Let me have a look at this PR since I checked this part of the codebase recently for another fix.

@glemaitre glemaitre self-requested a review September 11, 2024 14:56
@gunsodo
Contributor Author

gunsodo commented Sep 22, 2024

@glemaitre Do I need to rebase once again before you review it?

@StefanieSenger
Contributor

@glemaitre Do I need to rebase once again before you review it?

I will quickly answer that: we've got a bit of a jam concerning reviews. Sorry that you need to wait this long; that's not the norm, and you are just having bad luck here. There is nothing you need to do (unless there are conflicts to resolve).

Member

@glemaitre glemaitre left a comment

It looks almost good. I just think that we need to centralize the check since this is all related to "limit" or "bound".

We will need to pass additional parameters such as the mask.

Comment on lines 217 to 219
- |Fix| When `min_value` and `max_value` are array-like and some features are dropped due to
`keep_empty_features = False`, :class:`impute.IterativeImputer` no longer raises an error and
now indexes correctly.
Member

Adding a new line and correcting a typo.

Suggested change
- |Fix| When `min_value` and `max_value` are array-like and some features are dropped due to
`keep_empty_features = False`, :class:`impute.IterativeImputer` no longer raises an error and
now indexes correctly.
- |Fix| When `min_value` and `max_value` are array-like and some features are dropped due to
`keep_empty_features=False`, :class:`impute.IterativeImputer` no longer raises an error and
now indexes correctly.

Contributor Author

Thanks for the correction. I also moved this to 29451.fix.rst.

@@ -747,6 +741,28 @@ def fit_transform(self, X, y=None, **params):
X, in_fit=True
)

n_features_in = complete_mask.shape[1]
err_msg = (
f"should be of shape ({n_features_in},) when an array-like is provided. "
Member

This is injected afterwards, I think the punctuation is wrong right now.

Suggested change
f"should be of shape ({n_features_in},) when an array-like is provided. "
f"should be of shape ({n_features_in},) when an array-like is provided"

Contributor Author

Thanks for the catch. I resolved this along with your suggestion below. The corrected error message is now in _validate_limit.

Comment on lines 744 to 779
n_features_in = complete_mask.shape[1]
err_msg = (
f"should be of shape ({n_features_in},) when an array-like is provided. "
)

if (
self.min_value is not None
and not np.isscalar(self.min_value)
and len(self.min_value) != n_features_in
):
raise ValueError(
f"'min_value' {err_msg}. Got {len(self.min_value)}, instead."
)
if (
self.max_value is not None
and not np.isscalar(self.max_value)
and len(self.max_value) != n_features_in
):
raise ValueError(
f"'max_value' {err_msg}. Got {len(self.max_value)}, instead."
)
Member

I think we can factorize the code slightly to avoid having the if block twice.

Suggested change
n_features_in = complete_mask.shape[1]
err_msg = (
f"should be of shape ({n_features_in},) when an array-like is provided. "
)
if (
self.min_value is not None
and not np.isscalar(self.min_value)
and len(self.min_value) != n_features_in
):
raise ValueError(
f"'min_value' {err_msg}. Got {len(self.min_value)}, instead."
)
if (
self.max_value is not None
and not np.isscalar(self.max_value)
and len(self.max_value) != n_features_in
):
raise ValueError(
f"'max_value' {err_msg}. Got {len(self.max_value)}, instead."
)
def check_bound_values(value, name, n_features_in):
if (
value is not None
and not np.isscalar(value)
and len(value) != n_features_in
):
raise ValueError(
f"'{name}' should be of shape ({n_features_in},) when an array-like"
f" is provided. Got {len(value)}, instead."
)
check_bound_values(self.min_value, "min_value", complete_mask.shape[1])
check_bound_values(self.max_value, "max_value", complete_mask.shape[1])

Member

Thinking about it, I think it would make more sense to have this check included in self._validate_limit as well.
We might need to pass more parameters to _validate_limit, but it will be the central place to validate the bounds.

Comment on lines 781 to 808
# Make sure to remove the empty feature elements from the bounds
nonempty_feature_mask = np.logical_not(np.all(complete_mask, axis=0))
if len(self._min_value) == len(nonempty_feature_mask):
self._min_value = self._min_value[nonempty_feature_mask]
if len(self._max_value) == len(nonempty_feature_mask):
self._max_value = self._max_value[nonempty_feature_mask]

Member

The same here, I think that we should move this code in the _validate_limit function.
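A standalone demo of the masking step in the snippet above (complete_mask marks missing entries; the names follow the quoted code, but the arrays are made up):

```python
import numpy as np

# complete_mask is True where a value is missing; a column of all True
# is an empty feature that keep_empty_features=False will drop.
complete_mask = np.array([[False, True, False],
                          [False, True, False]])
nonempty_feature_mask = np.logical_not(np.all(complete_mask, axis=0))

min_value = np.array([-np.inf, 0.0, 1.0])
# Index only when the bound still has the original per-feature length.
if len(min_value) == len(nonempty_feature_mask):
    min_value = min_value[nonempty_feature_mask]

print(min_value)  # the bound for the empty middle column is removed
```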

@gunsodo
Contributor Author

gunsodo commented Oct 19, 2024

@glemaitre Thanks for your comments. I have made several fixes, mainly by moving the limit handling logic to inside _validate_limit. This introduces two extra parameters since it is a staticmethod. Please let me know if I need to be more careful with this change of argument passing or if you have any other suggestions.

@gunsodo gunsodo requested a review from glemaitre October 19, 2024 07:13
Member

@glemaitre glemaitre left a comment

A couple of nitpicks. Otherwise it looks good.

@gunsodo gunsodo force-pushed the fix-iterative-imputer-drop-empty branch from a253448 to 1f58af9 Compare October 28, 2024 11:58
@gunsodo
Contributor Author

gunsodo commented Oct 28, 2024

@glemaitre Thank you very much! Just committed all your nitpicks :)

@glemaitre glemaitre added this to the 1.6 milestone Oct 28, 2024
@glemaitre
Member

@adrinjalali do you want to have a quick look after the approval of @StefanieSenger and myself?

Member

@adrinjalali adrinjalali left a comment

Otherwise LGTM.


Returns
-------
limit: ndarray, shape(n_features,)
Array of limits, one for each feature.
"""
n_features_in = len(is_empty_feature)
if limit is not None and not np.isscalar(limit) and len(limit) != n_features_in:
Member

since we're not validating them, I think _num_samples(limit) makes more sense here.

Contributor Author

I pushed the change in b638d30 but am unsure if this addresses your concern. Please let me know if it should be done differently. Thanks for your suggestion!

@gunsodo gunsodo force-pushed the fix-iterative-imputer-drop-empty branch from 1f58af9 to b638d30 Compare October 29, 2024 11:45
@adrinjalali
Member

Please avoid force pushing here. Makes it hard to review changes.

@adrinjalali adrinjalali enabled auto-merge (squash) October 29, 2024 12:19
@StefanieSenger
Contributor

Force-pushing is sometimes necessary though, once you have had to do it. There is no way, or at least no easy way(?), to avoid it once the branch has a crude history, I think.

@adrinjalali
Member

adrinjalali commented Oct 29, 2024

Force pushing is almost never necessary. You can always pull from the branch, merge the remote with local branch, and that'll be an extra commit in the history (which we don't care about).

@gunsodo
Contributor Author

gunsodo commented Oct 29, 2024

@adrinjalali @StefanieSenger Well noted, I was personally unsure about having merge commits. Thanks for the advice!

@adrinjalali adrinjalali merged commit e5075ae into scikit-learn:main Oct 29, 2024
29 checks passed
Successfully merging this pull request may close these issues.

Shape may mismatch in IterativeImputer if set min_value with array-like input and some features contain all missing values
5 participants