FIX min_value and max_value not indexed when features are removed #29451
Conversation
Force-pushed from 95f3150 to 2c01aaa (compare)
Thanks a lot for the PR!
You will need to add a changelog entry in `doc/whats_new/v1.6.rst`.
I need to spend a bit more time to make sure I understand the actual fix, but I already have a few comments on the test.
sklearn/impute/tests/test_impute.py
Outdated
```python
    missing_column, check_column, min_value, max_value
):
    """Check that we properly apply the empty feature mask to
    `min_value` and `max_value`.
```
Maybe mention the original GitHub issue? https://github.com/scikit-learn/scikit-learn/issues/29355
Sure, added in 1816d0e.
sklearn/impute/tests/test_impute.py
Outdated
```python
    """Check that we properly apply the empty feature mask to
    `min_value` and `max_value`.
    """
    X = np.array([[1, 2, -1, -1], [4, 5, 6, 6], [7, 8, -1, -1], [10, 11, 12, 12]])
```
Is there a reason you used -1? If not, I would use the default `np.nan` missing value rather than -1; I find it slightly easier to parse visually.
There is no specific reason; I was following the previous test case. I agree with your suggestion and fixed it in 1816d0e.
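As a small illustration of the point (a hedged sketch with hypothetical values, not the committed test data), using `np.nan` lets the missing entries be detected directly, without comparing against a sentinel:

```python
import numpy as np

# Illustrative fixture only: columns 2 and 3 use np.nan as the missing
# marker instead of a sentinel value like -1.
X = np.array(
    [
        [1.0, 2.0, np.nan, np.nan],
        [4.0, 5.0, 6.0, 6.0],
        [7.0, 8.0, np.nan, np.nan],
        [10.0, 11.0, 12.0, 12.0],
    ]
)

# The missing-value mask falls out of np.isnan; no sentinel comparison needed.
print(np.isnan(X).sum(axis=0))  # per-column count of missing values
```

With a sentinel like -1, the reader has to remember which value means "missing"; with `np.nan` the mask is unambiguous.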
sklearn/impute/tests/test_impute.py
Outdated
```python
    min_value_array = [-np.inf] * 4
    max_value_array = [np.inf] * 4
    min_value_array[check_column] = min_value
```
Just curious: this is not needed to trigger the original issue, right? Is there a good reason you added it?
Probably related, but what are your two different parametrized tests checking?
You're right. With `keep_empty_features=False` and array-like `min_value`/`max_value`, we can already reproduce the error mentioned in the original issue. I added the two parametrized tests to check that the corresponding elements of `min_value`/`max_value` are dropped correctly.
OK, so I guess what you are saying is that you fixed an additional bug compared to the one that was reported. Thanks for this 🙏!
Force-pushed from 2c01aaa to 46a7c7f (compare)
@lesteve Thanks for the review! I have addressed your comments and updated the changelog.
```python
@@ -756,8 +756,17 @@ def fit_transform(self, X, y=None, **params):
            self.n_iter_ = 0
            return super()._concatenate_indicator(Xt, X_indicator)

        self._min_value = self._validate_limit(self.min_value, "min", X.shape[1])
        self._max_value = self._validate_limit(self.max_value, "max", X.shape[1])
        self._min_value = self._validate_limit(
```
I would keep the two original lines with `X.shape[1]` and move them up before the `X, Xt, mask_missing_values, complete_mask = ...` line, i.e. when `X` is still the original input data.
I feel it would make the code easier to understand.
This suggestion won't work as-is, since `X` may not be a numpy array at this stage (for example, a list, as in the test you added).
I need to spend more time looking at the code for a better suggestion ...
Hi @gunsodo,
I went through your PR and just wanted to leave some comments.
Thanks for your work!
sklearn/impute/tests/test_impute.py
Outdated
```python
        keep_empty_features=False,
    )

    X_no_missing = X[:, [i for i in range(X.shape[1]) if i != missing_column]]
```
```diff
-    X_no_missing = X[:, [i for i in range(X.shape[1]) if i != missing_column]]
+    X_no_missing = np.delete(X, missing_column, axis=1)
```
I think this would be easier to read.
And maybe also rename `X_no_missing` to `X_without_missing_column`, so we don't wrongly suggest there are no missing values at all in it.
Thanks! Renamed and improved in 11964ea.
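For reference, a minimal sketch (with an illustrative array, not the test's data) of why the two spellings are equivalent and why `np.delete` reads better:

```python
import numpy as np

X = np.arange(12).reshape(3, 4)  # hypothetical 3x4 data
missing_column = 2               # index of the feature to drop

# np.delete returns a copy of X with the given column removed,
# equivalent to the list-comprehension indexing but easier to read.
X_without_missing_column = np.delete(X, missing_column, axis=1)
assert X_without_missing_column.shape == (3, 3)
assert np.array_equal(
    X_without_missing_column,
    X[:, [i for i in range(X.shape[1]) if i != missing_column]],
)
```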
sklearn/impute/tests/test_impute.py
Outdated
```python
    assert_allclose(np.min(X_imputed[np.isnan(X_no_missing)]), min_value)
    assert_allclose(np.max(X_imputed[np.isnan(X_no_missing)]), max_value)
```
If I understand this correctly, then we rely on the carefully designed data (`X`) we pass into this test to actually return entries equal to `min_value` and `max_value`, which is alright for a test.
If that is the case, then these two assert statements suggest an exactness that is not really tested for; I think we would have more transparency if we used a plain `==` instead of `assert_allclose()`.
And I feel that we should add a comment above, saying that we expect `min_value` and `max_value` to actually be present in the imputed data, because we chose the data passed to the test accordingly.
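A quick sketch of why plain `==` is justified here (assumed toy values, not the test's data): clipping writes the bound values into the output verbatim, so exact equality holds.

```python
import numpy as np

min_value, max_value = 4.0, 5.0
imputed = np.array([3.2, 4.7, 6.1])  # hypothetical raw imputed values

# Values below/above the bounds are replaced by the bound itself,
# so the exact bound appears in the output and no float tolerance is needed.
clipped = np.clip(imputed, min_value, max_value)
assert clipped.min() == min_value  # exact, not approximate
assert clipped.max() == max_value
```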
Thank you for your suggestion. I pushed the fix in 11964ea.
sklearn/impute/_iterative.py
Outdated
```python
        self._min_value = self._validate_limit(
            self.min_value, "min", complete_mask.shape[1]
        )
        self._max_value = self._validate_limit(
            self.max_value, "max", complete_mask.shape[1]
        )
```
I was wondering if, instead of checking the shape of the limit in lines 681-686, we could do the validation at the beginning of `fit()` and `fit_transform()`, like we usually do. Something like `if len(self.min_value) != X.shape[1]: raise ValueError(...)` ... It feels a bit cleaner to me.
This would mean that the code in li. 759-764 could stay as it was before.
But I think your take on dealing with this is also okay.
Actually, it should be
`if isinstance(self.min_value, np.ndarray) and len(self.min_value) != X.shape[1]: raise ValueError(...)`
or similar, since we only need this check when the input is an array.
Thanks for the comments! As @lesteve suggested above, it could be problematic when any of `X`, `self.min_value`, or `self.max_value` is not an `np.ndarray`. We might have to safely convert all of them to `np.ndarray`s first before we can check the lengths. However, this seems redundant, because we already call `_validate_data(X)` inside `self._initial_imputation` and `check_array(self.min_value)` inside `self._validate_limit`. Which approach do you think would work best?
I think both are alright, but I agree that earlier validation is more readable.
I tried this locally:

- revert the changes in li. 759-764 (starting with `self._min_value = self._validate_limit(`)
- insert into li. 747 (right before `super()._fit_indicator(complete_mask)`):

```python
if len(self.min_value) != complete_mask.shape[1]:
    raise ValueError(
        f"'min_value' should be of shape ({complete_mask.shape[1]},) "
        f"when an array-like is provided. Got {len(self.min_value)}, instead."
    )
if len(self.max_value) != complete_mask.shape[1]:
    raise ValueError(
        f"'max_value' should be of shape ({complete_mask.shape[1]},) "
        f"when an array-like is provided. Got {len(self.max_value)}, instead."
    )
```

- comment out / delete li. 681-686 (starting with `if not limit.shape[0] == n_features:`)

This seems to work well and looks a bit neater (especially if we straighten the two `raise ValueError` blocks into one check).
What do you think?
Thank you for your suggestions! I found two other issues with this solution, but managed to resolve them in my recent push.

1. `self.min_value`/`self.max_value` can be `None` or a scalar. We may need to replace the condition with `self.min_value is not None and not np.isscalar(self.min_value) and len(self.min_value) != complete_mask.shape[1]` instead. Personally, I prefer letting `_validate_limit` handle these checks implicitly, but I also agree that passing `complete_mask` to `_validate_limit` is somewhat unclear.
2. When `self.min_value` is a scalar, `self._min_value` already has the same length as `X` after `_validate_limit`. In this case, we need to conditionally index only if `self.min_value` was an array-like from the beginning.

Also, I decided to check and raise the errors separately to give users a clearer indication of which bound caused the error (`min_value` or `max_value`). Please let me know if we can further improve the code readability.
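The condition described in point 1 can be sketched as a small standalone helper (hypothetical name, not the code in the PR):

```python
import numpy as np

def check_limit_length(limit, name, n_features):
    """Hypothetical guard: only array-like limits need a length check;
    None and scalars are expanded to full length by _validate_limit later."""
    if limit is not None and not np.isscalar(limit) and len(limit) != n_features:
        raise ValueError(
            f"'{name}' should be of shape ({n_features},) when an "
            f"array-like is provided. Got {len(limit)}, instead."
        )

check_limit_length(None, "min_value", 4)          # None: skipped
check_limit_length(-np.inf, "min_value", 4)       # scalar: skipped
check_limit_length([0, 0, 0, 0], "min_value", 4)  # matching length: passes
try:
    check_limit_length([0, 0, 0], "max_value", 4)
except ValueError as exc:
    print(exc)
```

The short-circuiting order matters: `limit is not None` must come first so `len()` is never called on `None`, and `np.isscalar` filters out plain floats such as `-np.inf`.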
doc/whats_new/v1.6.rst
Outdated
```rst
@@ -174,6 +174,9 @@ Changelog
- |Fix| :class:`impute.KNNImputer` excludes samples with nan distances when
  computing the mean value for uniform weights.
  :pr:`29135` by :user:`Xuefeng Xu <xuefeng-xu>`.
- |Fix| :class:`impute.IterativeImputer` no longer raises an error when `min_value` and `max_value`
  are array-like and some features are dropped due to `keep_empty_features = False`.
  :pr:`29451` by :user:`Guntitat Sawadwuthikul <gunsodo>`.
```
You could also add here that you fixed a bug where `max_value` and `min_value` would not index correctly when they are arrays and full-nan columns are present.
Added in 9a42943, thanks!
Force-pushed from 68eab68 to ad10964 (compare)
Thanks for your work @gunsodo. I think it looks great, and I'll approve it since I've manually checked everything and it all seems good to me. However, please note that I'm not a maintainer, and you'll still need two approvals from maintainers to merge.
sklearn/impute/tests/test_impute.py
Outdated
```python
@@ -1546,6 +1547,46 @@ def test_iterative_imputer_constant_fill_value():
    assert_array_equal(imputer.initial_imputer_.statistics_, fill_value)


@pytest.mark.parametrize(
    "missing_column,check_column,min_value,max_value",
```
```diff
-    "missing_column,check_column,min_value,max_value",
+    "missing_column, check_column, min_value, max_value",
```
Nit, but I would be glad if we added some whitespace here.
Sure, I added the changes in 2ece665.
Force-pushed from ad10964 to 2ece665 (compare)
@StefanieSenger @lesteve Thank you very much for all your suggestions. May I ask if there is any action I need to take for the two required approvals?
Sorry, I have not had time to take a proper look at how to simplify the checks. I have a vague feeling this is a bit too complicated right now, but I need more time to figure out whether something can be done about it.
Try to ping me in a week if you haven't got a better review from me ...
In the meantime, I have some more superficial comments.
sklearn/impute/tests/test_impute.py
Outdated
```python
@pytest.mark.parametrize(
    "missing_column, check_column, min_value, max_value",
    [
        (2, 3, 4, 5),
```
Can you simplify the test, i.e. not use a parametrization unless there is a good reason? I find it makes it a bit harder to follow what is going on.
Also, if you can add a comment on why the result should be the expected one, that would be great. This is probably super easy for you right now because you have it fresh in your memory, but imagine someone (maybe even you) in 6 months who hasn't looked at this code for a while and needs to figure out what this test is trying to check.
Thank you for pointing this out. I have improved the test readability by removing the parametrization as you suggested. I also added comments to help understand the test case better.
Force-pushed from 2ece665 to 88537a8 (compare)
Force-pushed from 88537a8 to 06b286a (compare)
Force-pushed from 06b286a to f72a9a8 (compare)
Sorry, I have been busy with other things and likely won't have time to look at this before September 😓. I am setting the "Waiting for reviewer" label, hoping someone else may have time to take a look ...
Let me have a look at this PR, since I checked the codebase recently for another fix.
@glemaitre Do I need to rebase once again before you review it?
I will quickly answer that: we've got a bit of a jam concerning reviews. Sorry that you need to wait this long; that's not the rule, and you are having bad luck here. There is nothing you need to do (unless there are conflicts to be resolved).
It looks almost good. I just think that we need to centralize the check, since this is all related to "limit" or "bound".
We will need to pass additional parameters, such as the mask.
doc/whats_new/v1.6.rst
Outdated
```rst
- |Fix| When `min_value` and `max_value` are array-like and some features are dropped due to
  `keep_empty_features = False`, :class:`impute.IterativeImputer` no longer raises an error and
  now indexes correctly.
```
Adding a new line and correcting a typo:

```diff
-- |Fix| When `min_value` and `max_value` are array-like and some features are dropped due to
-  `keep_empty_features = False`, :class:`impute.IterativeImputer` no longer raises an error and
-  now indexes correctly.
+- |Fix| When `min_value` and `max_value` are array-like and some features are dropped due to
+  `keep_empty_features=False`, :class:`impute.IterativeImputer` no longer raises an error and
+  now indexes correctly.
```
Thanks for the correction. I also moved this to `29451.fix.rst`.
sklearn/impute/_iterative.py
Outdated
```python
@@ -747,6 +741,28 @@ def fit_transform(self, X, y=None, **params):
            X, in_fit=True
        )

        n_features_in = complete_mask.shape[1]
        err_msg = (
            f"should be of shape ({n_features_in},) when an array-like is provided. "
```
This is injected afterwards; I think the punctuation is wrong right now.
```diff
-            f"should be of shape ({n_features_in},) when an array-like is provided. "
+            f"should be of shape ({n_features_in},) when an array-like is provided"
```
Thanks for the catch. I resolved this along with your suggestion below. The corrected error message is now in `_validate_limit`.
sklearn/impute/_iterative.py
Outdated
```python
        n_features_in = complete_mask.shape[1]
        err_msg = (
            f"should be of shape ({n_features_in},) when an array-like is provided. "
        )

        if (
            self.min_value is not None
            and not np.isscalar(self.min_value)
            and len(self.min_value) != n_features_in
        ):
            raise ValueError(
                f"'min_value' {err_msg}. Got {len(self.min_value)}, instead."
            )
        if (
            self.max_value is not None
            and not np.isscalar(self.max_value)
            and len(self.max_value) != n_features_in
        ):
            raise ValueError(
                f"'max_value' {err_msg}. Got {len(self.max_value)}, instead."
            )
```
I think we can factorize the code slightly to avoid duplicating the if block.
```diff
-        n_features_in = complete_mask.shape[1]
-        err_msg = (
-            f"should be of shape ({n_features_in},) when an array-like is provided. "
-        )
-        if (
-            self.min_value is not None
-            and not np.isscalar(self.min_value)
-            and len(self.min_value) != n_features_in
-        ):
-            raise ValueError(
-                f"'min_value' {err_msg}. Got {len(self.min_value)}, instead."
-            )
-        if (
-            self.max_value is not None
-            and not np.isscalar(self.max_value)
-            and len(self.max_value) != n_features_in
-        ):
-            raise ValueError(
-                f"'max_value' {err_msg}. Got {len(self.max_value)}, instead."
-            )
+        def check_bound_values(value, name, n_features_in):
+            if (
+                value is not None
+                and not np.isscalar(value)
+                and len(value) != n_features_in
+            ):
+                raise ValueError(
+                    f"'{name}' should be of shape ({n_features_in},) when an array-like"
+                    f" is provided. Got {len(value)}, instead."
+                )
+
+        check_bound_values(self.min_value, "min_value", complete_mask.shape[1])
+        check_bound_values(self.max_value, "max_value", complete_mask.shape[1])
```
Thinking about it, I think it would make more sense to include this check in `self._validate_limit` as well.
We might need to pass more parameters to `_validate_limit`, but it will be the central place to validate the bounds.
sklearn/impute/_iterative.py
Outdated
```python
        # Make sure to remove the empty feature elements from the bounds
        nonempty_feature_mask = np.logical_not(np.all(complete_mask, axis=0))
        if len(self._min_value) == len(nonempty_feature_mask):
            self._min_value = self._min_value[nonempty_feature_mask]
        if len(self._max_value) == len(nonempty_feature_mask):
            self._max_value = self._max_value[nonempty_feature_mask]
```
Same here: I think we should move this code into the `_validate_limit` function.
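To illustrate the indexing under discussion with a toy mask (not sklearn's actual code): `complete_mask` is True where a value is missing, so an all-True column marks an empty feature whose bound entry must be dropped.

```python
import numpy as np

# Toy example: feature 1 is entirely missing.
complete_mask = np.array(
    [[False, True, False],
     [False, True, True],
     [False, True, False]]
)
min_value = np.array([0.0, 1.0, 2.0])  # hypothetical per-feature lower bounds

# A column that is all True means the feature is empty and will be dropped,
# so the corresponding bound entry must be removed as well.
nonempty_feature_mask = np.logical_not(np.all(complete_mask, axis=0))
print(nonempty_feature_mask)             # [ True False  True]
print(min_value[nonempty_feature_mask])  # [0. 2.]
```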
@glemaitre Thanks for your comments. I have made several fixes, mainly by moving the validation into `_validate_limit` as suggested.
A couple of nitpicks. Otherwise it looks good.
Force-pushed from a253448 to 1f58af9 (compare)
@glemaitre Thank you very much! Just committed all your nitpicks :)
@adrinjalali do you want to have a quick look after the approval of @StefanieSenger and myself?
Otherwise LGTM.
sklearn/impute/_iterative.py
Outdated
```python
        Returns
        -------
        limit : ndarray, shape (n_features,)
            Array of limits, one for each feature.
        """
        n_features_in = len(is_empty_feature)
        if limit is not None and not np.isscalar(limit) and len(limit) != n_features_in:
```
Since we're not validating them, I think `_num_samples(limit)` makes more sense here.
I pushed the change in b638d30 but am unsure if this addresses your concern. Please let me know if it should be done differently. Thanks for your suggestion!
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>
Force-pushed from 1f58af9 to b638d30 (compare)
Please avoid force-pushing here; it makes it hard to review changes.
Force-pushing is necessary though, once you have done it. There is no way, or at least no easy way(?), to avoid it once the branch has a crude history, I think.
Force-pushing is almost never necessary. You can always pull from the branch, merge the remote into your local branch, and that'll be an extra commit in the history (which we don't care about).
@adrinjalali @StefanieSenger Well noted; I was personally unsure about having merge commits. Thanks for the advice!
Reference Issues/PRs
Fixes #29355.
What does this implement/fix? Explain your changes.
- Validate `min_value` and `max_value` against the original number of features (before dropping the empty ones)
- Index `min_value` and `max_value` to remove the empty features

@lesteve Please let me know if this does not align with your suggestion.
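The two steps above can be sketched together in plain numpy (an illustrative helper with assumed names, not the actual sklearn implementation):

```python
import numpy as np

def validate_and_trim_limit(limit, complete_mask):
    """Illustrative: validate an array-like limit against the original
    feature count, then drop the entries that belong to empty features."""
    n_features = complete_mask.shape[1]
    limit = np.asarray(limit, dtype=float)
    if limit.shape[0] != n_features:
        raise ValueError(
            f"limit should be of shape ({n_features},) when an "
            f"array-like is provided. Got {limit.shape[0]}, instead."
        )
    # Empty features have an all-True missing mask; drop their bounds.
    nonempty = np.logical_not(np.all(complete_mask, axis=0))
    return limit[nonempty]

mask = np.array([[False, True], [False, True]])   # feature 1 fully missing
trimmed = validate_and_trim_limit([0.0, 5.0], mask)
print(trimmed)  # [0.]
```

Validation happens against the original width, indexing against the missing-value mask, matching the two bullet points of the PR summary.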