
[MRG] Merge IterativeImputer into master branch #11977

Merged
merged 25 commits into master from iterativeimputer on Feb 15, 2019
Conversation

jnothman
Member

@jnothman jnothman commented Sep 3, 2018

This is a placeholder to track work on a new IterativeImputer (formerly MICE, ChainedImputer) that should accommodate the best features of R's MICE and missForest, especially as applicable to the settings of predictive modelling, clustering and decomposition.

Closes #11259

TODO:

Ping @sergeyf, @RianneSchouten

@sergeyf
Contributor

sergeyf commented Sep 17, 2018

Here's a good recent paper about imputation methods: http://www.jmlr.org/papers/volume18/17-073/17-073.pdf

And a useful table from it:

[image: table from the paper categorizing imputation methods]

As discussed, IterativeImputer can perform any of the Sequential X algorithms, of which there are several. I'm not sure what the difference is between Sequential KNN and Iterative KNN.

An example could perhaps compare RidgeCV, KNN, CART, RandomForest as model inputs?
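A sketch of such a comparison, using the API as it was eventually released (the fitting model is passed as `estimator`; the toy data and the MSE-against-ground-truth metric here are invented for illustration):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_full = rng.rand(100, 4)
X_missing = X_full.copy()
X_missing[rng.rand(*X_full.shape) < 0.2] = np.nan  # knock out ~20% of entries

estimators = {
    "BayesianRidge": BayesianRidge(),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "CART": DecisionTreeRegressor(max_depth=3, random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=10, random_state=0),
}
for name, est in estimators.items():
    imputer = IterativeImputer(estimator=est, max_iter=10, random_state=0)
    X_imputed = imputer.fit_transform(X_missing)
    mse = np.mean((X_imputed - X_full) ** 2)
    print(f"{name}: MSE against ground truth = {mse:.4f}")
```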


@sergeyf
Copy link
Contributor

sergeyf commented Sep 18, 2018

I don't have any reasons to think differences are substantial, but I also don't have enough experience using various sequential imputers.

@amueller
Member

amueller commented Oct 5, 2018

What's left to do here? Does this need reviews?

@sergeyf
Contributor

sergeyf commented Oct 5, 2018

@amueller not yet. We have to merge a few more examples and modifications in here before it's ready for more reviews.

@sergeyf
Contributor

sergeyf commented Oct 5, 2018

@jnothman looks like the most recent merge caused some branch conflicts, but I don't have access to resolve them.


@sergeyf
Contributor

sergeyf commented Oct 7, 2018

Great, thanks.

@jnothman
Member Author

Test failures on some (but not the latest) version of scipy, around the usage of truncnorm.
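For context, truncnorm is used when sampling from a posterior truncated to the observed feature range; roughly like this (a sketch with made-up mu, sigma and bounds, not this PR's actual code):

```python
import numpy as np
from scipy.stats import truncnorm

# Posterior mean/std predicted for one missing entry (made-up values),
# truncated to the observed min/max of that feature.
mu, sigma = 0.3, 0.1
lo, hi = 0.0, 1.0

# truncnorm takes standardized bounds a, b relative to loc/scale.
a, b = (lo - mu) / sigma, (hi - mu) / sigma
sample = truncnorm.rvs(a, b, loc=mu, scale=sigma, random_state=0)
```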

@jnothman
Member Author

Some thoughts of things we might want here to improve usability:

  • reorder the parameters so that predictor comes higher up, and perhaps other reordering to emphasise importance to the user (a pity that SimpleImputer puts missing_values first)
  • stopping criterion when sample_posterior=False: change n_iter to max_iter and stop once some measure of change < tol
  • with or without this, we might want some measure of convergence when sample_posterior=False, i.e. report the change that would otherwise be used for stopping. A reasonable definition of change might be max(|X_t - X_{t-1}|)
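The stopping criterion in the last two bullets could be sketched like this (`converged` is a hypothetical helper illustrating one plausible definition of change, not code from this PR):

```python
import numpy as np

def converged(X_prev, X_curr, tol=1e-3):
    """One plausible tol-based stopping rule for sample_posterior=False:
    stop when the largest absolute change between successive imputation
    rounds falls below tol, scaled by the data's magnitude."""
    change = np.max(np.abs(X_curr - X_prev))
    scale = np.max(np.abs(X_curr))
    return change < tol * scale
```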

@sergeyf
Contributor

sergeyf commented Jan 17, 2019

I still don't think max_iter is a good idea: in MICE mode, sampling means that it doesn't actually converge, but jumps around, and this is the expected/desired behavior.

Also, looks like you're having the same weird issue with the doctest that I'm having in the other PR.

@jnothman
Member Author

I still don't think max_iter is a good idea: in MICE mode, sampling means that it doesn't actually converge, but jumps around, and this is the expected/desired behavior.

I am not proposing to apply it with sample_posterior=True. Without sample_posterior it would be good to have a way to identify how well it converged.

I can only get 26 in the doctest locally (haven't tried installing different dependencies), including with repeated transform, or repeated fit_transform, with different random_state.... Hmmm...

@sergeyf
Contributor

sergeyf commented Jan 17, 2019

OK, that makes sense to me. I'll make a PR with the changes you suggested once work dies down a little and once we get the missforest example PR in.

Do you still want to wait on the amputer and MICE example before merging this? We may not get anyone to finish those up...


@sergeyf
Contributor

sergeyf commented Feb 13, 2019

I agree. It would be good to get this into the hands of users (i.e. merged before the next major release), as I assume there is plenty more we'll learn from them, which will in turn inform how to make the examples better.

@jnothman
Member Author

I'm not sure what you mean by improving training examples. Do you mean improving examples to use real-world missing data or data removed not completely at random?

When you say we should go beyond mean and std of predictions, do you mean mean and std of scores? Why should we go beyond it in this case?


@jnothman
Member Author

I'm going to merge this and get it road tested. Can you set the base for the multiple imputation PR to master?

Thanks and congratulations!

@jnothman jnothman merged commit b8d1226 into master Feb 15, 2019
@jnothman
Member Author

Unfortunately GitHub has recorded me as the author of that commit :( I should have done the merge manually.


@amueller
Member

AWESOME! Amazing work y'all!

@glemaitre
Member

@sergeyf Thanks for the hard work.

@reckoner

reckoner commented Feb 20, 2019

Great work! Reading the documentation, it seems there is no variance adjustment of downstream estimators for the imputed values. Is that something that is out of scope?

@sergeyf
Contributor

sergeyf commented Feb 20, 2019

@reckoner Do you have a pointer to what variance adjustment means in this case? What purpose does it serve?

@reckoner

@sergeyf In Chapter 5 ("Estimation of imputation uncertainty") of

Little, Roderick JA, and Donald B. Rubin. Statistical analysis with missing data. Vol. 333. John Wiley & Sons, 2014.

The issue is that estimators computed from a dataset that has been filled in via imputation have to be adjusted for the variance in the imputed values. For example, the mean of the so-imputed data has a different estimator variance (e.g., confidence interval) than that of fully observed data. Imputing multiple times means the variability across the multiple imputations can be used to estimate this variance and flow it down to the ultimate estimator.

Outside of predict_proba, which only a few sklearn objects implement, I don't know how such a variance adjustment would flow down even to those objects, and I don't see how it would percolate down to objects that do not implement predict_proba.

I hope that helps.
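For concreteness, the pooling that chapter describes (often called Rubin's rules) can be sketched as follows (`rubins_rules` is a hypothetical helper, not part of scikit-learn):

```python
import numpy as np

def rubins_rules(estimates, variances):
    """Pool m point estimates (one per imputed dataset) and their
    within-imputation variances per Rubin's rules (Little & Rubin, ch. 5)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()           # pooled point estimate
    u_bar = variances.mean()           # average within-imputation variance
    b = estimates.var(ddof=1)          # between-imputation variance
    t = u_bar + (1 + 1 / m) * b        # total variance of the pooled estimate
    return q_bar, t
```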

@sergeyf
Contributor

sergeyf commented Feb 20, 2019

Ah ok. Well, IterativeImputer by default doesn't do multiple imputation. You have to do it yourself by calling it with sample_posterior=True and a different random_state each time. We have an open PR where this is demonstrated (possibly including the variance adjustments you're describing): #13025

Please take a look and comment there if you think it's appropriate. We're also happy to take contributions to that PR because the original person (who knows MICE-related topics better than I do) was unable to continue working on it.
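That loop might look like the following (a sketch using the released API; the toy data is invented, and pooling downstream estimates per Rubin's rules is left to the user):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(0)
X = rng.rand(50, 3)
X[rng.rand(*X.shape) < 0.2] = np.nan  # ~20% missing

# m imputed datasets: sample_posterior=True plus a different seed each round
m = 5
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(m)
]
stacked = np.stack(imputations)        # shape (m, n_samples, n_features)
pooled_mean = stacked.mean(axis=0)
between_var = stacked.var(axis=0)      # per-entry between-imputation spread
```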

@reckoner

You are correct. Seems like this issue was raised here by @jnothman

#13025 (comment)

Let me study this carefully.


@qinhanmin2014 qinhanmin2014 deleted the iterativeimputer branch March 3, 2019 12:08
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
@adrinjalali adrinjalali moved this from To do to Reviewer approved in Missing Values/Imputation Oct 21, 2019
@adrinjalali adrinjalali moved this from Reviewer approved to Done in Missing Values/Imputation Oct 21, 2019