IterativeImputer behaviour on missing nan's in fit data #13773
Why is this behaviour forced:
Features with missing values during transform which did not have any missing values during fit will be imputed with the initial imputation method only.
This means that by default it returns the mean of that feature. I would prefer to fit just one iteration of the chosen estimator and use that fitted estimator to impute the missing values.
    import numpy as np
    from sklearn.impute import IterativeImputer

    imp = IterativeImputer(max_iter=10, verbose=0)
    imp.fit([[1, 2], [3, 6], [4, 8], [10, 20], [np.nan, 22], [7, 14]])

    X_test = [[np.nan, 4], [6, np.nan], [np.nan, 6], [4, np.nan], [33, np.nan]]
    print(np.round(imp.transform(X_test)))
Adjusted example: the second feature now also has np.nan values during fit, so it is imputed iteratively with the estimator.
    import numpy as np
    from sklearn.impute import IterativeImputer

    imp = IterativeImputer(max_iter=10, verbose=0)
    imp.fit([[1, 2], [3, 6], [4, 8], [10, 20], [np.nan, 22], [7, np.nan]])

    X_test = [[np.nan, 4], [6, np.nan], [np.nan, 6], [4, np.nan], [33, np.nan]]
    print(np.round(imp.transform(X_test)))
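To make the difference between the two behaviours concrete, here is a minimal standard-library sketch using the data from the first example. It is not what IterativeImputer does internally: plain least squares stands in for the default estimator, only rows with both features observed are used for the fit, and the helper names `column_mean` and `fit_line` are made up for this illustration.

```python
import math

NAN = float("nan")

# Data from the first example: feature 1 has no missing values during fit.
X_fit = [[1, 2], [3, 6], [4, 8], [10, 20], [NAN, 22], [7, 14]]
X_test = [[NAN, 4], [6, NAN], [NAN, 6], [4, NAN], [33, NAN]]


def column_mean(rows, col):
    """Mean of the observed (non-NaN) values in one column."""
    observed = [r[col] for r in rows if not math.isnan(r[col])]
    return sum(observed) / len(observed)


def fit_line(rows, x_col, y_col):
    """Least-squares line y = slope * x + intercept on fully observed rows."""
    pairs = [(r[x_col], r[y_col]) for r in rows
             if not math.isnan(r[x_col]) and not math.isnan(r[y_col])]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Test rows where feature 1 is missing but feature 0 is observed.
rows_missing_f1 = [r for r in X_test if math.isnan(r[1])]

# Behaviour described in the docs: fall back to the initial (mean) imputation.
mean_f1 = column_mean(X_fit, 1)
mean_imputed = [mean_f1 for _ in rows_missing_f1]

# Behaviour asked for here: one fitted model predicting feature 1 from feature 0.
slope, intercept = fit_line(X_fit, x_col=0, y_col=1)
regression_imputed = [slope * r[0] + intercept for r in rows_missing_f1]

print("mean fallback:   ", mean_imputed)
print("regression-based:", regression_imputed)
```

With the mean fallback every missing value of the second feature gets the same number, whereas the fitted line follows the first feature.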
Maybe lines 679 to 683 of sklearn/impute.py should be made optional via a parameter like force-iterimpute.
Just making sure I understand this...
Would it work like this?
(1) Apply initial imputation to every single feature including i.
Is that correct? If not, at what point would we fit/transform a single imputation of feature i?
Yes, exactly, that would be correct and the clean way.
A quick fix (I have not tested it) could be to make this part of https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/impute.py optional:
Lines 679 to 683:

    # if nothing is missing, just return the default
    # (should not happen at fit time because feat_ids would be excluded)
    missing_row_mask = mask_missing_values[:, feat_idx]
    if not np.any(missing_row_mask):
        return X_filled, estimator
The iterative process would not affect feature i with respect to updated imputations, so not treating it as a special case should give the same result as the clean version you proposed, @sergeyf.
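A rough sketch of what making that early return optional could look like, reduced to two features and plain Python. Everything here is an assumption made for illustration: the flag name `skip_complete`, the helpers `fit_feature` and `transform_feature`, and the least-squares line standing in for the real estimator. None of it is taken from sklearn/impute.py.

```python
def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_feature(X_filled, mask, feat_idx, other_idx, skip_complete):
    """Fit a model for one feature; return None if the feature is skipped."""
    nothing_missing = not any(row[feat_idx] for row in mask)
    if skip_complete and nothing_missing:
        # The special case under discussion: no estimator is fitted.
        return None
    observed = [i for i, row in enumerate(mask) if not row[feat_idx]]
    xs = [X_filled[i][other_idx] for i in observed]
    ys = [X_filled[i][feat_idx] for i in observed]
    return fit_line(xs, ys)


def transform_feature(X_filled, mask, feat_idx, other_idx, model):
    """Overwrite initially imputed entries with predictions, if a model exists."""
    if model is None:
        return [row[:] for row in X_filled]  # initial imputation stays
    slope, intercept = model
    out = [row[:] for row in X_filled]
    for i, row in enumerate(mask):
        if row[feat_idx]:
            out[i][feat_idx] = slope * out[i][other_idx] + intercept
    return out


# Fit data with no missing values in feature 1.
X_fit = [[1, 2], [3, 6], [4, 8], [10, 20], [7, 14]]
mask_fit = [[False, False] for _ in X_fit]

# Transform data: feature 1 is missing and already mean-imputed with 10.0.
X_new = [[6, 10.0], [4, 10.0], [33, 10.0]]
mask_new = [[False, True] for _ in X_new]

model_skipped = fit_feature(X_fit, mask_fit, 1, 0, skip_complete=True)
model_fitted = fit_feature(X_fit, mask_fit, 1, 0, skip_complete=False)

kept = [r[1] for r in transform_feature(X_new, mask_new, 1, 0, model_skipped)]
refined = [r[1] for r in transform_feature(X_new, mask_new, 1, 0, model_fitted)]

print("special case kept:   ", kept)
print("special case dropped:", refined)
```

Dropping the special case costs one extra model fit per complete feature at fit time, which is the trade-off a flag like this would let users choose.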
I don't really see this as a hugely important or common use case. It's good to get right but it currently is reasonable if not perfect. What other concerns do you have?
On Sat, May 4, 2019, 2:24 AM Joel Nothman wrote: Or maybe we should consider making IterativeImputer experimental for this release??
One possible reason might be linked to the default estimator, which I find slow to use. Maybe one release cycle as experimental would allow us to quickly change such things if they are shown to be problematic in practice.
I think it would be sensible to enable by default, but have the ability to disable it.
Maybe I can take a crack at this.
To review: the change would be to (optionally and by default) to
We will need a new test and an updated docstring. Maybe the test can come directly from #14383?
Am I missing anything?