Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+2] Basic version of MICE Imputation #8478

Merged
merged 170 commits into from Apr 16, 2018

Conversation

Projects
None yet
@sergeyf
Copy link
Contributor

sergeyf commented Feb 28, 2017

Reference Issue

This is in reference to #7840, and builds on #7838.

Fixes #7840.

This code provides basic MICE imputation functionality. It currently only uses Bayesian linear regression as the prediction model. Once this is merged, I will add predictive mean matching (slower but sometimes better). See here for a reference: https://stat.ethz.ch/education/semesters/ss2012/ams/paper/mice.pdf

@sergeyf sergeyf referenced this pull request Mar 1, 2017

Closed

Applying to new data #26

@codecov

This comment has been minimized.

Copy link

codecov bot commented Mar 2, 2017

Codecov Report

Merging #8478 into master will increase coverage by 0.03%.
The diff coverage is 98.2%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8478      +/-   ##
==========================================
+ Coverage    96.1%   96.13%   +0.03%     
==========================================
  Files         337      337              
  Lines       63295    63179     -116     
==========================================
- Hits        60828    60740      -88     
+ Misses       2467     2439      -28
Impacted Files Coverage Δ
sklearn/dummy.py 97.74% <100%> (+0.05%) ⬆️
sklearn/utils/estimator_checks.py 93.29% <100%> (ø) ⬆️
sklearn/tests/test_dummy.py 100% <100%> (ø) ⬆️
sklearn/preprocessing/tests/test_imputation.py 100% <100%> (ø) ⬆️
sklearn/preprocessing/__init__.py 100% <100%> (ø) ⬆️
sklearn/preprocessing/imputation.py 95.25% <96.57%> (+1.13%) ⬆️
sklearn/linear_model/tests/test_bayes.py 90.42% <0%> (-0.76%) ⬇️
sklearn/linear_model/bayes.py 95.45% <0%> (-0.23%) ⬇️
sklearn/preprocessing/data.py 98.81% <0%> (-0.17%) ⬇️
sklearn/feature_selection/tests/test_rfe.py 98.57% <0%> (-0.08%) ⬇️
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9808810...93d93cf. Read the comment docs.

@raghavrv raghavrv self-requested a review Mar 2, 2017

@raghavrv raghavrv added this to Would be very nice to have (MID) in Raghav's *personal* review-priority listing Mar 2, 2017

@sergeyf

This comment has been minimized.

Copy link
Contributor Author

sergeyf commented Mar 6, 2017

@raghavrv Do you have some suggestions for a minimal suite of tests that would cover MICE reasonably?

@raghavrv raghavrv changed the title [WIP] Basic version of MICE Imputation (3rd try) [WIP] Basic version of MICE Imputation Mar 7, 2017

@raghavrv

This comment has been minimized.

Copy link
Member

raghavrv commented Mar 7, 2017

You could try to replicate all the tests that are currently there for other imputation techniques as a basic starting point... Further, a hand computed data could be tested against to avoid unintended regressions when modifying this code...

@raghavrv

This comment has been minimized.

Copy link
Member

raghavrv commented Mar 7, 2017

Also you could try and write tests that would improve the coverage...

@gerkovink

This comment has been minimized.

Copy link

gerkovink commented Jun 14, 2018

Yes, if doing multiple imputation in a for loop with different random seeds
is identical to the proposal, then I think an example on imputation
uncertainty is the right way to go about it.​

Yes. Let randomness influence the parameter draws and imputations, but make sure that the imputation models remain constant over the loops.

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Jun 14, 2018

@sergeyf

This comment has been minimized.

Copy link
Contributor Author

sergeyf commented Jun 14, 2018

@RianneSchouten @stefvanbuuren just to confirm: can instances of MICE be run totally independently? That is, can the for loop over m imputation be the outer-most for loop? Or does it have to be the way Rianne's code is written: where the for loop over m is the inner-most for loop?

In both cases, I think we can either make a MICEPipeline or add functionality to the current Pipeline so that it knows how to properly do multiple imputations when MICEImputer is one of the pipeline's steps.

For example, let's say the pipeline is: A -> B -> MICEImputer -> C -> D.

A MICEPipeline would be given this pipeline and m. It would run steps A -> B as per usual, and then run the MICEImputer portion m times. The m outputs would all go through C -> D and the m results are then returned.

I think this is pretty much what @glemaitre suggested, right?

@glemaitre

This comment has been minimized.

Copy link
Contributor

glemaitre commented Jun 14, 2018

imputation models remain constant over the loops.

I am a bit confused. What do you mean by a constant model.

Couple of remarks which I would like to be clarified:

  1. What is the usefulness of n_imputations of the current implementation. Somehow, n_burn_in will allow to converge to a "stable" predictor. mu and sigma should more less converge to the same solution. Taking the average over n_imputations seem then not really useful.

  2. If my previous intuition is not wrong, we could:

    • Rename MICEImputer to something else (MultivariateImputer). Having MICE in the naming will bring us issue when users will expect algorithm to perform the same way than in R for instance.
    • We could remove n_imputations.
  3. In case we want to get a multiple imputation, we can indeed combine the MICEImputer with another estimator and put it inside a BaggingClassifier/VotingClassifier. However, this implementation will be inefficient since it will make n_burn_in iterations for each pipeline (imputer, estimator) instead of once.

  4. We could even think to have a generic object that could make multiple imputation using any imputer.

I would think that for the release it could be good to actually tackle (2). We can see afterwords how to better handle the multiple imputation.

NB: be aware that I am not familiar with the stats behind MICE, pardon me if my statements are wrong. They are intended to have constructive discussions :)

@glemaitre

This comment has been minimized.

Copy link
Contributor

glemaitre commented Jun 14, 2018

I think this is pretty much what @glemaitre suggested, right?

I think so. In addition, I would expect the MICEPipeline to "pool" the results.

@sergeyf

This comment has been minimized.

Copy link
Contributor Author

sergeyf commented Jun 14, 2018

Re: n_imputations, I found that taking the average over n_imputations reduced MSE of the imputed values. As @stefvanbuuren pointed out, doing so actually artificially inflates the correlation between the imputed entries, but it's still better (empirically) in a mean-squared error sense, which is how ML people think about imputation's effectiveness for continuous features. Just to provide more intuition here: a single imputation out of n_imputations is high-variance. As such, averaging them helps provide single stable imputations.

As such. we should keep this parameter in my opinion. Do any of the more statistically knowledgeable folks have an opinion about this point? There is always the option to do something like n_burn_in = 100, n_imputations=1, m=100 to replicate the way MICE was designed to be used. But for ML purposes, the most common use case will be n_burn_in=10, n_imputations=100, m=1.

What's the difference between my suggestion of a MICEPipeline and the idea in (3) that @glemaitre posted?

By the way, I don't think there is anything that can be done about the inefficiency that is pointed out in (3). It's the consequence of having multiple imputations - you have to multiply all your computation starting with the MICE step by m.

@stefvanbuuren

This comment has been minimized.

Copy link

stefvanbuuren commented Jun 14, 2018

@sergeyf Yes, for a given imputation model, you may create m completed datasets fully independently. It is entirely parallel. If you want to do generation in this way, just initialize each instance with its own seed, and store the resulting sets in a convenient form for repeated analyses and pooling.

@glemaitre

This comment has been minimized.

Copy link
Contributor

glemaitre commented Jun 14, 2018

What's the difference between my suggestion of a MICEPipeline and the idea in (3) that @glemaitre posted?

It is the same :)

By the way, I don't think there is anything that can be done about the inefficiency that is pointed out in (3). It's the consequence of having multiple imputations - you have to multiply all your computation starting with the MICE step by m.

We might have a work around sharing the imputer with a warm_start.

@sergeyf

This comment has been minimized.

Copy link
Contributor Author

sergeyf commented Jun 14, 2018

Thanks @stefvanbuuren for the confirmation.

@glemaitre I think based on Stef's comments, we shouldn't be sharing any of the computation. It's m totally separate/parallel imputers identical in every way except with different seeds.

@stefvanbuuren

This comment has been minimized.

Copy link

stefvanbuuren commented Jun 14, 2018

Averaging the data gives inflated correlations, too short confidence intervals and too low P-values.
I generally advise against it, unless you have reason to do otherwise. (https://stefvanbuuren.name/fimd/workflow.html).

The MSE and multiple imputation is not a happy marriage. If you try to select imputation methods that minimise MSE, then you end up imputing the same "best" value over and over again. But those single imputation methods do not appropriately incorporate the uncertainty of the imputation, and treat any imputed value as an observed value. See https://stefvanbuuren.name/fimd/sec-true.html for a short discussion.

@sergeyf

This comment has been minimized.

Copy link
Contributor Author

sergeyf commented Jun 14, 2018

@stefvanbuuren thanks for the additional warnings and readings. My take is that this "unhappy marriage" must last. No divorce =)

To clarify my point, here is how the current MICEImputer works:

(1) Run for n_burn_in iterations.
(2) Run for n_imputations iterations.
(3) Return the average of the last n_imputations iterations <- this the "wrong" part.

The "correct" thing to do is to set n_imputations = 1 and run steps (1) to (3) m times with different seeds, thus preserving uncertainty and avoiding the problems of averaging that Stef described.

But we also need to support the "wrong" MSE-minimizing case where n_imputations > 1 and m = 1. Good single imputations are very useful for real-life work, despite ignoring inherent uncertainty in missing values and inflating correlations, etc.

We CAN support both with only minor additional coding and/or some carefully written examples/documentation. That is, we should issue proper warnings about the problems with the ML use-case, and point out the theoretically correct ways to use MICE.

@gerkovink

This comment has been minimized.

Copy link

gerkovink commented Jun 14, 2018

@glemaitre With

Let randomness influence the parameter draws and imputations, but make sure that the imputation models remain constant over the loops.

I meant exactly what @sergeyf said:

totally separate/parallel imputers identical in every way except with different seeds.

It seems that we are speaking different languages :)

@glemaitre

This comment has been minimized.

Copy link
Contributor

glemaitre commented Jun 14, 2018

It seems that we are speaking different languages :)

Yep but don't worry, this is clear now.

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Jun 14, 2018

@stefvanbuuren

This comment has been minimized.

Copy link

stefvanbuuren commented Jun 14, 2018

@sergeyf I agree that the n_imputations > 1 and m = 1 could be useful in specific cases. In particular, it is valid for the case where the ONLY objective is to find the most likely value for each missing cell (e.g. image restoration where you want to show the most likely value for each missing pixel).

My hesitation is that - because of its apparent easiness - it will invite users to apply it in cases for which it is not designed, and hence give incorrect statistical inferences without the user being aware of it. It is perhaps better not to sell it is multiple imputation for python, but rather as a way to obtain predictions for the missing cells, and be very clear that the method is no replacement for multiple imputation.

@sergeyf

This comment has been minimized.

Copy link
Contributor Author

sergeyf commented Jun 14, 2018

Sounds like we're all agreed that the minimum that needs to be done is to change the name of MICEImputer (and alter some documentation) before the next release.

These options were mentioned: ChainedImputer, MultivariateImputer

I vote for ChainedImputer to preserve some connection to MICE.

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Jun 14, 2018

We have the following things to do:

  1. Determine the most appropriate way to use individual imputation samples in predictive modelling, clustering, etc, which are Scikit-learn's focus.
    a. is using a single draw acceptable?
    b. is averaging over multiple draws from the final fit appropriate?
    c. is ensembling multiple predictive estimators each trained on a different imputation most appropriate?
  2. Perhaps determine if, in a predictive modelling context, it is necessary to have the sophistication of MICE in sampling each imputation value rather than just using point predictions.
  3. Provide an example illustrating the inferential capabilities due to multiple imputation. I don't think there's anything limiting about our current interface, but it deserves an example.
  4. Rename MICEImputer to de-emphasise multiple imputation because it only performs a single one at a time.
@glemaitre

This comment has been minimized.

Copy link
Contributor

glemaitre commented Jun 14, 2018

I will open an issue and add couple of things missing but it seems like a plan.

@glemaitre glemaitre referenced this pull request Jun 14, 2018

Closed

MICE/Multiple Imputation branch #11259

0 of 4 tasks complete
@RianneSchouten

This comment has been minimized.

Copy link

RianneSchouten commented Jun 14, 2018

A minor other thing to think about:
X_filled is currently a mean imputed dataset. In the original idea of MICE, the first round of imputation happens with random imputation. By using different seeds, this ensures that each parallel running imputation sequence starts with a different dataset. I don't know to what extent it matters, though.

@sergeyf : i placed the loop over m as the most inner loop because it was the easiest way to implement it in the current code (X_filled is the same here for every m anyways). However, the result is not different from when you would make it the outer loop, as long as you continue with the right dataset. For example: impute var 1 in dataset 1, var 1 in dataset 2, var 1 in dataset 3, var 2 in dataset 1, var 2 in dataset 2, etc. is similar to impute var 1 in dataset 1, var 2 in dataset 1, var 1 in dataset 2, var 2 in dataset 2, var 1 in dataset 3, etc.

Another minor thing, it might be good to have a possibility to check whether the algorithm converges. For example by calculating and storing the means of every feature for every i_rnd.

@sergeyf

This comment has been minimized.

Copy link
Contributor Author

sergeyf commented Jun 14, 2018

Ah thanks @RianneSchouten. Currently, there is no way to start in with a random imputation. Might be good to add this support as well. Is the randomness in the MICE R package dependent on the non-missing values?

@RianneSchouten

This comment has been minimized.

Copy link

RianneSchouten commented Jun 14, 2018

@gerkovink

This comment has been minimized.

Copy link

gerkovink commented Jun 14, 2018

it takes the observed values of each feature and impute randomly from those observed
values.

That is indeed the default in mice, although specific starting values can also be provided. There is no strict need for a better form of initialization. There is randomness induced in the mice algorithm: therefore the autocorrelation between the iterations is usually low and the algorithm converges rapidly.

it might be good to have a possibility to check whether the algorithm converges. For example by calculating and storing the means of every feature for every i_rnd.

In mice the means and sd's for the chains at every iteration are calculated on the imputed values only, because the observed values remain constant. This may be more efficient.

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Jun 15, 2018

@RianneSchouten

This comment has been minimized.

Copy link

RianneSchouten commented Jun 15, 2018

@jnothman

This comment has been minimized.

Copy link
Member

jnothman commented Jun 16, 2018

@sergeyf

This comment has been minimized.

Copy link
Contributor Author

sergeyf commented Jun 18, 2018

The rename PR is here: #11314

It's probably also a good place to contribute additional documentation about the difference between the original MICE use-case and sklearn's main use-case. Feel free to contribute or make suggestions in the PR.

jorisvandenbossche added a commit to jorisvandenbossche/scikit-learn that referenced this pull request Jul 17, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.