Documenting missing-values practices #21967

Open · 7 tasks · aperezlebel opened this issue Dec 13, 2021 · 12 comments

@aperezlebel
Contributor

aperezlebel commented Dec 13, 2021

Describe the issue linked to the documentation

Context

Together with @glemaitre and @GaelVaroquaux, we discussed documenting missing-values practices for prediction in scikit-learn as part of my PhD work at Inria (discussion here).

Indeed, the current documentation gives no recommendations on this point. The understanding of missing values in the context of supervised learning has improved since the current documentation was written, so we now have clearer theoretical and practical messages to offer users. We think it would be useful to restructure the documentation and examples to convey these messages.

Messages to convey

Main messages:

  1. Missing values in inference and in supervised learning are different problems with different tradeoffs. Define the terms and highlight the differences.
  2. Don't impute the training and test sets jointly: it causes data leakage and cannot be applied in production (see the pipeline sketch after this list).
  3. Simple learners need powerful imputation (e.g. conditional imputation with IterativeImputer). Define conditional imputation (theoretical arguments can be found in Le Morvan 2020).
  4. Conditional imputation is guaranteed to work only for "ignorable missingness" (the Missing At Random mechanism, to be defined). Otherwise, the mask is needed (in practice missingness is seldom ignorable: the data are missing for a reason). The Wikipedia page on missing data can justify this.
  5. Powerful learners + simple imputation or no imputation works best (robustness to missingness mechanisms and flexibility), e.g. HistGradientBoosting (this comes from experience, including systematic benchmarks; see the sketch after this list).
  6. For categorical features, impute missing values as a new category (imputing to an existing category destroys information that is important to the learner).
  7. The computational cost of imputation can quickly get large, and even intractable for the most costly methods (e.g. IterativeImputer, KNNImputer).
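For messages 2 and 5 above, a minimal sketch of what the narrative doc could show (toy simulated data, so the exact scores are not the point): the imputer is kept inside the pipeline so cross-validation never fits it on a test fold, and HistGradientBoosting is used directly on data containing NaN.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data with 20% of the entries missing completely at random.
X, y = make_regression(n_samples=1_000, n_features=5, random_state=0)
rng = np.random.RandomState(0)
X[rng.uniform(size=X.shape) < 0.2] = np.nan

# The imputer lives inside the pipeline, so each CV split fits it on the
# training fold only: no information leaks from the test fold.
simple_model = make_pipeline(SimpleImputer(strategy="mean"), Ridge())
print(cross_val_score(simple_model, X, y).mean())

# HistGradientBoosting supports NaN natively, so no imputer is needed.
hgbt = HistGradientBoostingRegressor(random_state=0)
print(cross_val_score(hgbt, X, y).mean())
```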

Side messages:

  1. The optimal predictor on partially-observed values is not always "good" imputation followed by the optimal predictor on the fully-observed values (Le Morvan et al. 2021). You need to account for missingness in some way.
  2. For multiple imputation, the training and test behaviors need to be separated (cf. main message 1 above).
  3. As a consequence, ensemble methods such as bagging are a good way to implement multiple imputation in practice; a single supervised learner applied to many imputations is likely severely suboptimal (a sketch follows below).
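One possible way to realize side message 3, sketched with BaggingRegressor around an IterativeImputer drawing from its posterior (this illustrates the idea and is not the only implementation):

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import BaggingRegressor
from sklearn.impute import IterativeImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# With sample_posterior=True and distinct random states, each ensemble
# member sees a different plausible imputation of the data, so the
# ensemble averages over imputations (multiple imputation) instead of
# committing to a single completed dataset.
member = make_pipeline(IterativeImputer(sample_posterior=True), Ridge())
multiple_imputation_model = BaggingRegressor(
    member, n_estimators=10, random_state=0
)
```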

Take-home messages:

  1. If you have little data: use conditional imputation and simple learners.
  2. If you have a lot of data (n > 1000): use HistGradientBoosting.
  3. Don't impute categorical variables.

Resources

  1. Wikipedia: Missing data
  2. Josse et al. 2019, "On the consistency of supervised learning with missing values"
  3. Le Morvan et al. 2020, "Linear predictor on linearly-generated data with missing values: non consistency and solutions"
  4. Le Morvan et al. 2021, "What's a good imputation to predict with missing values?"
  5. Perez-Lebel et al. 2022, "Benchmarking missing-values approaches for predictive models on health databases"

Suggest a potential alternative/fix

After discussing with @glemaitre and @GaelVaroquaux, the following changes were suggested.

Big picture

The goal is to give the recommendations above, and have simple examples that convey the right intuitions (even simple simulated data can be didactic by showing the basic mechanisms).

  1. Write a narrative doc page that gives the big picture messages listed above and some figures.
  2. Replace the current examples about imputation that don't give a clear message.
  3. An example to generate simple figures explaining the difference between Missing At Random and Missing Not At Random (as in https://www.slideshare.net/GaelVaroquaux/dirty-data-science-machine-learning-on-noncurated-data/27).
    Didactic purpose: intuitions on the fact that missing values may distort distributions. Short example.
  4. An example to develop intuitions on the interplay between imputation and learning, adapted from http://dirtydata.science/python/gen_notes/01_missing_values.html (only the first two sections). Didactic purpose: how the mechanism + imputation modifies the link between X and y.
  5. Adapt the docstrings to give local recommendations:
    a. IterativeImputer: give time complexity (algorithmic scalability) and say it is not a magic bullet in the face of structured missingness.
    b. KNNImputer: terrible computational scalability.
    c. SimpleImputer: does not work well with simple models.

Proposed roadmap

(refers to the items above, and can be detailed in a project board)

@ogrisel
Member

ogrisel commented Dec 13, 2021

I like the general plan. Thanks @A-pl!

Don't impute categorical variables.

I would add that using SimpleImputer(strategy="constant", fill_value="missing", missing_values=None) might still be helpful to convert missing-value markers for downstream models that accept string categorical labels but would raise an exception when encountering a None (or np.nan, or pd.NA) value.
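For illustration, such a step could then feed an encoder; a minimal sketch (not necessarily the exact example to put in the docs):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], [None], ["blue"], [None]], dtype=object)

# Missing entries become their own "missing" category, so the encoder
# (and the downstream model) keeps the missingness information instead
# of failing on None.
categorical_prep = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing", missing_values=None),
    OneHotEncoder(),
)
print(categorical_prep.fit_transform(X).toarray())
```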

@glemaitre
Member

I would add that using SimpleImputer(strategy="constant", fill_value="missing", missing_values=None) might still be helpful

You are right. "No imputation" referred to "no advanced imputation". I think we even discussed briefly whether it would make sense for OneHotEncoder and OrdinalEncoder to allow encoding missing values (instead of only being lenient with them) to simplify the pipeline. However, this is out of scope for the current plan :)

@GaelVaroquaux
Member

GaelVaroquaux commented Dec 14, 2021 via email

@ogrisel
Member

ogrisel commented Dec 16, 2021

If there is a consensus that the above is a good suggestion (which I really think it is), we could open an issue, but focus on it later.

#21988 still lets np.nan pass through by default, but at least it now gives an option to replace np.nan with something else (e.g. -2, while -1 is the default for unknown values at predict time).

We could change the passthrough behavior in a follow-up PR, but that would require a deprecation cycle.
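Assuming the option from #21988 is exposed as a parameter on OrdinalEncoder, usage could look something like the sketch below (parameter names taken from that PR, with -2 for missing and -1 for unknown as in the example above):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X_train = np.array([["cat"], ["dog"], [np.nan]], dtype=object)

# Encode missing values as -2 at fit/transform time, and map categories
# unseen during fit to -1 at predict time.
encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
    encoded_missing_value=-2,
)
print(encoder.fit_transform(X_train))                          # [[0.] [1.] [-2.]]
print(encoder.transform(np.array([["mouse"]], dtype=object)))  # [[-1.]]
```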

@richierocks

I'm excited that this is being worked on. I recently tried to figure out what I thought was a simple thing: how should I decide which of SimpleImputer, IterativeImputer or KNNImputer to use? The documentation comes up short on this, and asking the internet yielded little useful response.

@aperezlebel
Contributor Author

aperezlebel commented Jun 21, 2022

KNNImputer: terrible computational scalability.

I am not sure how much this claim holds now that #16397 reduced the memory footprint of KNN using chunked distance computation (see also #15604). Looking at the code, the memory footprint looks O(np) to me now, whereas it was originally O(n^2 + np).
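For reference, the chunking pattern that #16397 relies on (not KNNImputer's exact internals, just the idea):

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.RandomState(0).rand(5_000, 20)

# Distances are produced one row-chunk at a time: peak memory stays
# around O(chunk_size * n) instead of O(n^2) for a full distance matrix.
for chunk in pairwise_distances_chunked(X, working_memory=64):
    # chunk has shape (chunk_size, n_samples); reduce it, then discard it.
    nearest = np.argpartition(chunk, 5, axis=1)[:, :6]
```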

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 21, 2022 via email

@aperezlebel
Contributor Author

In terms of computational time, yes, I agree. But in terms of memory footprint (which was the bottleneck of KNNImputer before #16397), the current implementation seems to need less than O(n^2) memory, since the distance matrix is never stored in full but processed in chunks.

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 21, 2022 via email

@aperezlebel
Contributor Author

aperezlebel commented Jun 23, 2022

For task 3 (simple figures explaining the missing values mechanisms), I made this figure:
[figure: missing_value_mechanisms]

  • MCAR: Missing Completely At Random; the missingness does not depend on X.
  • MAR: Missing At Random; the missingness does not depend on the underlying missing values, but can depend on the observed ones.
  • MNAR: Missing Not At Random; the missingness depends on the underlying missing values.
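To make the three mechanisms concrete, a minimal NumPy sketch of how such masks could be simulated on a two-feature toy dataset, with x2 the feature that gets missing entries:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=1_000)
x1, x2 = X[:, 0], X[:, 1]

# MCAR: the mask is independent of both features.
mcar = rng.uniform(size=len(x2)) < 0.3
# MAR: the mask depends only on the observed feature x1.
mar = x1 > np.quantile(x1, 0.7)
# MNAR: the mask depends on the unobserved value of x2 itself.
mnar = x2 > np.quantile(x2, 0.7)

x2_mcar = np.where(mcar, np.nan, x2)
x2_mar = np.where(mar, np.nan, x2)
x2_mnar = np.where(mnar, np.nan, x2)
```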

Any comments?

@ArturoAmorQ
Member

Any comments?

I think the images are quite clear, but I find the concept of "underlying missing values" a bit ambiguous. How can we clarify this point for users who, like myself, have no background in the terminology?

@aperezlebel
Contributor Author

aperezlebel commented Jun 24, 2022

Yes, I agree it could be clearer. I think we can add a paragraph before the figure that explains the setting: every sample has a true value for every feature, but some of these values are masked. We can also say a word about the fact that in practice some features are missing but have no true value behind them (e.g. a "death date" feature for a person who is alive).
