
[MRG] Merge IterativeImputer into master branch #11977

Merged: 25 commits merged into master from iterativeimputer on Feb 15, 2019

Conversation

@jnothman (Member) commented Sep 3, 2018

This is a placeholder to track work on a new IterativeImputer (formerly MICE, ChainedImputer) that should accommodate the best features of R's MICE and missForest, especially as applicable to the settings of predictive modelling, clustering and decomposition.

Closes #11259

TODO:

  • complete the rename and some redesign (#11350)
  • an example of using this as missforest (#12100)
  • ?an example of using this for multiple imputation (#13025)
  • avoid RidgeCV as default predictor due to non-smooth effect of regularisation parameter (#13038)
  • consider reordering parameters: predictor should move up the list of parameters (#13061)
  • n_iter -> max_iter with a convergence check max(abs(X - X_prev)) < tol (#13061)
  • rename predictor to estimator (#13061)
  • ?missing indicator

Ping @sergeyf, @RianneSchouten

jnothman and others added 4 commits Sep 3, 2018
This reverts commit f819704.
…11350)

Towards making this more generic than MICE
@sergeyf (Contributor) commented Sep 17, 2018

Here's a good recent paper about imputation methods: http://www.jmlr.org/papers/volume18/17-073/17-073.pdf

And a useful table from it: [table image from the paper, not reproduced here]

As discussed, IterativeImputer can perform any of the Sequential X algorithms, of which there are a bunch. I'm not sure what the difference is between Sequential KNN and Iterative KNN.

An example could perhaps compare RidgeCV, KNN, CART, RandomForest as model inputs?
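
A minimal sketch of such a comparison, assuming the API as it was eventually merged (parameter named estimator, experimental enable flag required); the toy data, the 20% MCAR mask and the hyperparameters are illustrative choices only:

```python
# Compare candidate models as the estimator inside IterativeImputer.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import RidgeCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
# Correlated toy features, so imputation has signal to exploit.
latent = rng.normal(size=(200, 2))
X_true = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))
X_missing = X_true.copy()
X_missing[rng.uniform(size=X_true.shape) < 0.2] = np.nan  # 20% MCAR holes

candidates = {
    "RidgeCV": RidgeCV(),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "CART": DecisionTreeRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}
mask = np.isnan(X_missing)
for name, est in candidates.items():
    X_imputed = IterativeImputer(estimator=est, random_state=0).fit_transform(X_missing)
    # Score only the entries that were actually missing.
    rmse = np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2))
    print(f"{name}: imputation RMSE = {rmse:.3f}")
```

Scoring only the originally missing entries isolates imputation quality from the values that were observed all along.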

@jnothman (Member, Author) commented Sep 17, 2018

@sergeyf (Contributor) commented Sep 18, 2018

I don't have any reason to think the differences are substantial, but I also don't have enough experience using various sequential imputers.

@sergeyf (Contributor) commented on doc/whats_new/v0.21.rst in 09a9a21 on Oct 5, 2018

The entire IterativeImputer is new, so this enhancement note is not really necessary.

@amueller (Member) commented Oct 5, 2018

What's left to do here? Does this need reviews?

@sergeyf (Contributor) commented Oct 5, 2018

@amueller not yet. We have to merge a few more examples and modifications in here before it's ready for more reviews.

@sergeyf (Contributor) commented Oct 5, 2018

@jnothman looks like the most recent merge caused some branch conflicts, but I don't have access to resolve them.

@jnothman (Member, Author) commented Oct 7, 2018

@sergeyf (Contributor) commented Oct 7, 2018

Great, thanks.

jnothman added 4 commits Jan 16, 2019
@jnothman (Member, Author) commented Jan 16, 2019

Test failures on some (but not the latest) versions of scipy, around the usage of truncnorm.

@jnothman (Member, Author) commented Jan 17, 2019

Some thoughts on things we might want here to improve usability:

  • reorder the parameters so that predictor comes higher up, and perhaps other reordering to emphasise importance to the user (pity that SimpleImputer puts missing_values first)
  • a stopping criterion where sample_posterior=False: change n_iter to max_iter and stop once some measure of change < tol
  • with or without this, we might want some measure of convergence where sample_posterior=False, i.e. report the change that would otherwise be used for stopping. A reasonable definition of change might be max(abs(X_t - X_{t-1})); see the sketch below.
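
A minimal sketch of that stopping rule (not the merged code); impute_once is a hypothetical callable standing in for one full round of feature-by-feature imputation:

```python
import numpy as np

def iterative_impute(X, impute_once, max_iter=10, tol=1e-3):
    # Initial fill with column means so every entry is defined.
    X_filled = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
    change = np.inf
    for _ in range(max_iter):
        X_prev = X_filled.copy()
        X_filled = impute_once(X_filled)
        # The convergence measure discussed above: max(abs(X_t - X_{t-1})).
        change = np.max(np.abs(X_filled - X_prev))
        if change < tol:
            break
    # `change` is also worth reporting when sample_posterior=False,
    # even if the loop ran to max_iter without converging.
    return X_filled, change
```
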
@sergeyf (Contributor) commented Jan 17, 2019

I still don't think max_iter is a good idea: in MICE mode, sampling means that it doesn't actually converge, but jumps around, and this is the expected/desired behavior.

Also, looks like you're having the same weird issue with the doctest that I'm having in the other PR.

@jnothman (Member, Author) commented Jan 17, 2019

> I still don't think max_iter is a good idea: in MICE mode, sampling means that it doesn't actually converge, but jumps around, and this is the expected/desired behavior.

I am not proposing to apply it with sample_posterior=True. Without sample_posterior it would be good to have a way to identify how well it converged.

I can only get 26 in the doctest locally (haven't tried installing different dependencies), including with repeated transform, or repeated fit_transform, with different random_state.... Hmmm...

@sergeyf (Contributor) commented Jan 17, 2019

OK, that makes sense to me. I'll make a PR with the changes you suggested once work dies down a little and once we get the missforest example PR in.

Do you still want to wait on the amputer and MICE example before merging this? We may not get anyone to finish those up...

@jnothman (Member, Author) commented Feb 13, 2019

I'm not sure what you mean by improving training examples. Do you mean improving examples to use real-world missing data or data removed not completely at random?

When you say we should go beyond mean and std of predictions, do you mean mean and std of scores? Why should we go beyond it in this case?

@glemaitre (Contributor) commented Feb 14, 2019

@jnothman (Member, Author) commented Feb 14, 2019

@sergeyf (Contributor) commented Feb 14, 2019

@jnothman (Member, Author) commented Feb 14, 2019

@jnothman (Member, Author) commented Feb 14, 2019

@sergeyf (Contributor) commented Feb 14, 2019

@jnothman (Member, Author) commented Feb 15, 2019

@jnothman (Member, Author) commented Feb 15, 2019

I'm going to merge this and get it road tested. Can you set the base for the multiple imputation PR to master?

Thanks and congratulations!

jnothman merged commit b8d1226 into master Feb 15, 2019
13 checks passed:
  • LGTM analysis: C/C++ (No code changes detected)
  • LGTM analysis: JavaScript (No code changes detected)
  • LGTM analysis: Python (1 new alert)
  • ci/circleci: deploy (Your tests passed on CircleCI!)
  • ci/circleci: doc (Your tests passed on CircleCI!)
  • ci/circleci: doc-min-dependencies (Your tests passed on CircleCI!)
  • ci/circleci: lint (Your tests passed on CircleCI!)
  • codecov/patch (98.77% of diff hit, target 92.43%)
  • codecov/project (92.5%, +0.07% compared to fc65d9f)
  • continuous-integration/appveyor/branch (AppVeyor build succeeded)
  • continuous-integration/appveyor/pr (AppVeyor build succeeded)
  • continuous-integration/travis-ci/pr (The Travis CI build passed)
  • continuous-integration/travis-ci/push (The Travis CI build passed)
@jnothman (Member, Author) commented Feb 15, 2019

Unfortunately GitHub has recorded me as the author of that commit :( I should have done the merge manually.

@sergeyf (Contributor) commented Feb 15, 2019

@amueller (Member) commented Feb 15, 2019

AWESOME! Amazing work y'all!

@glemaitre (Contributor) commented Feb 15, 2019

@sergeyf Thanks for the hard work.

@reckoner commented Feb 20, 2019

Great work! In reading the documentation, it seems there is no variance adjustment for the so-imputed values in downstream estimators. Is that something that is out of scope?

@sergeyf (Contributor) commented Feb 20, 2019

@reckoner Do you have a pointer to what variance adjustment means in this case? What purpose does it serve?

@reckoner commented Feb 20, 2019

@sergeyf In Chapter 5 ("Estimation of imputation uncertainty") of

Little, Roderick JA, and Donald B. Rubin. Statistical analysis with missing data. Vol. 333. John Wiley & Sons, 2014.

The issue is that estimators computed from a dataset that has been filled in via imputation have to be adjusted for the variance in the imputed values. For example, the mean based on the so-imputed data has a different estimator variance (e.g., confidence interval) than that of the non-imputed data. Imputing multiple times means that the uncertainty (i.e., variance) across the multiple imputations is used to estimate this variance, which then flows down to the ultimate estimator.

I don't know how such a variance adjustment would flow down beyond predict_proba, which is implemented for a few sklearn objects; I don't see how it would percolate down to objects that do not implement predict_proba.

I hope that helps.
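
For concreteness, the combining step from Little & Rubin (often called Rubin's rules) pools a statistic across m imputed datasets; a minimal NumPy sketch, with the per-dataset estimates and their variances assumed to be computed separately on each completed dataset:

```python
import numpy as np

def rubins_rules(estimates, variances):
    # estimates: length-m array of the point estimate (e.g. a mean)
    #   computed on each imputed dataset.
    # variances: length-m array of the within-dataset variance of that
    #   estimate.
    estimates = np.asarray(estimates, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()        # pooled point estimate
    w = np.mean(variances)          # within-imputation variance
    b = estimates.var(ddof=1)       # between-imputation variance
    t = w + (1 + 1 / m) * b         # total variance of q_bar
    return q_bar, t
```
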

@sergeyf (Contributor) commented Feb 20, 2019

Ah OK. Well, IterativeImputer by default doesn't do multiple imputation. You have to do it yourself by calling it with sample_posterior=True and a different random_state each time (see the sketch below). We have an open PR where this is demonstrated (possibly including the variance adjustments you're describing): #13025

Please take a look and comment there if you think it's appropriate. We're also happy to take contributions to that PR, because the original author (who knows MICE-related topics better than I do) was unable to continue working on it.
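
A minimal sketch of that pattern, assuming the merged API; m=5 and the seed scheme are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def multiple_impute(X, m=5):
    # Each run samples imputed values from the predictive posterior,
    # so different seeds yield genuinely different completed datasets.
    return [
        IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
        for seed in range(m)
    ]

# Downstream: fit the analysis on each completed dataset and pool the
# results, e.g. with Rubin's rules as sketched earlier in the thread.
```
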

@reckoner commented Feb 20, 2019

You are correct. Seems like this issue was raised here by @jnothman

#13025 (comment)

Let me study this carefully.

@jnothman (Member, Author) commented Feb 20, 2019

qinhanmin2014 deleted the iterativeimputer branch Mar 3, 2019
xhlulu added 3 commits to xhlulu/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde added a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
adrinjalali moved this from To do to Reviewer approved in Missing Values/Imputation Oct 21, 2019
adrinjalali moved this from Reviewer approved to Done in Missing Values/Imputation Oct 21, 2019