
IterativeImputer behaviour on missing nan's in fit data #13773

Closed
Pacman1984 opened this issue May 3, 2019 · 21 comments · Fixed by #14806

@Pacman1984 Pacman1984 commented May 3, 2019

Why is this behaviour forced:

Features with missing values during transform which did not have any missing values during fit will be imputed with the initial imputation method only.

https://scikit-learn.org/dev/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer

This means that, by default, it will return the mean of that feature. I would prefer to just fit one iteration of the chosen estimator and use that fitted estimator to impute missing values.

Actual behaviour:
Example - the second feature has no np.nan in the fit data --> mean imputation at transform

import numpy as np
# needed in released scikit-learn (>= 0.21), where IterativeImputer is experimental:
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

imp = IterativeImputer(max_iter=10, verbose=0)
# the second feature is fully observed at fit time
imp.fit([[1, 2], [3, 6], [4, 8], [10, 20], [np.nan, 22], [7, 14]])

X_test = [[np.nan, 4], [6, np.nan], [np.nan, 6], [4, np.nan], [33, np.nan]]
print(np.round(imp.transform(X_test)))
Return:
[[ 2.  4.]
 [ 6. 12.]
 [ 3.  6.]
 [ 4. 12.]
 [33. 12.]]

Example adjusted - the second feature now has an np.nan in the fit data --> iterative imputation with the estimator (note how the output tracks the learned 2:1 relation, e.g. 33 --> 66, instead of the mean 12)

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa (see note above)
from sklearn.impute import IterativeImputer

imp = IterativeImputer(max_iter=10, verbose=0)
# the second feature now also has an np.nan at fit time
imp.fit([[1, 2], [3, 6], [4, 8], [10, 20], [np.nan, 22], [7, np.nan]])

X_test = [[np.nan, 4], [6, np.nan], [np.nan, 6], [4, np.nan], [33, np.nan]]
print(np.round(imp.transform(X_test)))
Return:
[[ 2.  4.]
 [ 6. 12.]
 [ 3.  6.]
 [ 4.  8.]
 [33. 66.]]

Maybe sklearn/impute.py lines 679 to 683 should be made optional with a parameter like force-iterimpute.

@Pacman1984 Pacman1984 changed the title from "IterativeImputer behaviour on no nan's in fit data" to "IterativeImputer behaviour on missing nan's in fit data" May 3, 2019
@jnothman jnothman commented May 3, 2019

@Pacman1984 Pacman1984 commented May 3, 2019

OK, I'll do a pull request. Sorry, I'm a newbie at GitHub participation.

@Pacman1984 Pacman1984 closed this May 3, 2019
@jnothman jnothman commented May 3, 2019

We keep the issue open until it is solved :)

Let us know if you need help.

@jnothman jnothman reopened this May 3, 2019
@sergeyf sergeyf commented May 3, 2019

Just making sure I understand this...

Would it work like this?

(1) Apply initial imputation to every single feature, including feature i.
(2) Run the entire sequence of stored regressors while keeping feature i fixed (transform only, no fits).
(3) NEW: run a single fit/transform imputation for feature i.

Is that correct? If not, at what point would we fit/transform a single imputation of feature i?
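
To make this concrete, here is a minimal self-contained sketch of the behaviour being proposed: fit a regressor for every feature, including fully observed ones, and replay the stored regressors at transform time. This is toy code with made-up names, not scikit-learn internals; BayesianRidge stands in for the default estimator, and only a single round is run for brevity.

import numpy as np
from sklearn.linear_model import BayesianRidge

def toy_fit(X):
    # (1) initial (mean) imputation for every feature
    X = np.asarray(X, dtype=float)
    means = np.nanmean(X, axis=0)
    X_filled = np.where(np.isnan(X), means, X)
    # fit one regressor per feature -- including features with no missing
    # values, which is exactly the proposed change
    regressors = []
    for i in range(X.shape[1]):
        obs = ~np.isnan(X[:, i])
        other = np.delete(np.arange(X.shape[1]), i)
        regressors.append(BayesianRidge().fit(X_filled[obs][:, other],
                                              X_filled[obs][:, i]))
    return means, regressors

def toy_transform(X, means, regressors):
    # (1) initial imputation, then (2)/(3) each stored regressor
    # re-predicts the originally missing entries of its feature
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    X_filled = np.where(mask, means, X)
    for i, reg in enumerate(regressors):
        rows = mask[:, i]
        if rows.any():
            other = np.delete(np.arange(X.shape[1]), i)
            X_filled[rows, i] = reg.predict(X_filled[rows][:, other])
    return X_filled

With the fit data from the first example above, toy_transform fills [33, np.nan] with roughly 66 (the learned 2:1 relation) rather than the column mean 12.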

@Pacman1984 Pacman1984 commented May 3, 2019

Yes, exactly, that would be correct and the clean way to do it.

A quick fix could be (I have not tested it) to make this part of https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/impute.py optional:

Lines 679 to 683:

        # if nothing is missing, just return the default
        # (should not happen at fit time because feat_ids would be excluded)
        missing_row_mask = mask_missing_values[:, feat_idx]
        if not np.any(missing_row_mask):
            return X_filled, estimator

Because the iterative process would not affect feature i with respect to updated imputes, not special-casing it should end up with the same result as the clean version you proposed, @sergeyf.
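
A hedged sketch of what that could look like inside _impute_one_feature (a fragment, not runnable on its own: fit_mode, X_train, and y_train stand for the surrounding function's locals):

        missing_row_mask = mask_missing_values[:, feat_idx]

        if fit_mode:
            # proposed: fit unconditionally, even when this feature has no
            # missing values, so that transform() can reuse the estimator
            estimator.fit(X_train, y_train)

        # skip only the predict step when there is nothing to impute
        if not np.any(missing_row_mask):
            return X_filled, estimator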

@sergeyf sergeyf commented May 3, 2019

Ah, I see what you're saying. Just keep fitting & storing the models but not doing any prediction. I don't see any obvious downsides here. Good idea, thanks.

@Pacman1984 Pacman1984 commented May 3, 2019

Yes, exactly: keep fitting, but don't do predictions for features with no missing values.
Should I close this and open a pull request, or what's the process?

@sergeyf sergeyf commented May 3, 2019

Leave this open, start a PR. Once a PR is merged, this ticket can be closed.

Thanks.


@glemaitre glemaitre commented May 4, 2019

Or maybe we should consider making IterativeImputer experimental for this release?

One possible reason is the default estimator, which I find slow to use. Maybe one cycle as experimental would allow us to quickly change these choices if they turn out to be problematic in practice.

@jnothman jnothman commented May 5, 2019

Let's do it. We have the mechanism, and we're sure we've made design and implementation choices here that are not universal, so I'll open an issue.

@jnothman jnothman commented May 29, 2019

Pull request welcome to change the behaviour for features which are fully observed at training.

@amueller amueller commented Jul 12, 2019

@jnothman do you want that by default or as a parameter? I feel like doing it by default might increase training time a lot if only a few features actually have missing values in the data.

@jnothman jnothman commented Jul 25, 2019

> @jnothman do you want that by default or as a parameter? I feel like doing it by default might increase training time a lot if only a few features actually have missing values in the data.

I think it would be sensible to enable by default, but have the ability to disable it.
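
For the record, the parameter along these lines that eventually shipped via the linked #14806 appears to be skip_complete on IterativeImputer: the new behaviour is on by default (skip_complete=False), and skip_complete=True restores the old time-saving shortcut. A usage sketch:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

X_fit = [[1, 2], [3, 6], [4, 8], [10, 20], [np.nan, 22], [7, 14]]
X_test = [[33, np.nan]]

# default: a regressor is fit even for the fully observed second feature,
# so transform imputes roughly 66 (the learned 2:1 relation)
print(IterativeImputer(random_state=0).fit(X_fit).transform(X_test))

# opt out to save training time: the second feature falls back to its
# initial (mean) imputation, roughly 12, as in the old behaviour
print(IterativeImputer(skip_complete=True,
                       random_state=0).fit(X_fit).transform(X_test))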

@JackMiranda JackMiranda commented Jul 25, 2019

As someone who was confused enough by the current behavior to file a bug report, I too am in favor of making the new behavior a toggleable default!

@jnothman jnothman commented Jul 26, 2019

Would you like to submit a fix as a pull request, @JackMiranda?

@jnothman jnothman commented Jul 26, 2019

Or is @Pacman1984 still working on it?

@jnothman jnothman commented Aug 25, 2019

It'd be nice to make progress on this.

@sergeyf sergeyf commented Aug 25, 2019

Maybe I can take a crack at this.

To review: the change would be (optionally, and by default) to fit regressors even on those features that have no missing values at train time.

At transform time, we can then impute these features if they are missing for any sample.

We will need a new test and to update the docstring. Maybe the test can come directly from #14383?

Am I missing anything?
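
A hedged sketch of what such a test might look like; the data, tolerance, and test name are illustrative, not the final test:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

def test_fully_observed_feature_uses_estimator():
    rng = np.random.RandomState(0)
    x = rng.uniform(1, 10, size=100)
    X_train = np.column_stack([x, 2 * x])  # second feature fully observed
    X_train[::10, 0] = np.nan              # first feature has some NaNs

    imp = IterativeImputer(max_iter=10, random_state=0).fit(X_train)

    imputed = imp.transform(np.array([[5.0, np.nan], [8.0, np.nan]]))
    # with regressors stored for every feature, imputations should track
    # y = 2 * x instead of collapsing to the training mean of column 2
    np.testing.assert_allclose(imputed[:, 1], [10.0, 16.0], rtol=0.3)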

