Documenting missing-values practices #21967

Open · 7 tasks · aperezlebel opened this issue Dec 13, 2021 · 12 comments

@aperezlebel
Contributor

aperezlebel commented Dec 13, 2021

Describe the issue linked to the documentation

Context

Together with @glemaitre and @GaelVaroquaux, we discussed documenting missing-values practices for prediction in scikit-learn as part of my PhD work at Inria (discussion here).

Indeed, the current documentation gives no recommendations on this point. The understanding of missing values in the context of supervised learning has improved since the current documentation was written, so we now have clearer theoretical and practical messages to offer users. We think it would be useful to restructure the documentation and examples to convey these messages.

Messages to convey

Main messages:

  1. Missing values in inference and in supervised learning are different problems with different tradeoffs. Define the terms and highlight the differences.
  2. Don't impute the training and test sets jointly: it causes data leakage and cannot be applied in production (see the pipeline sketch after this list).
  3. Simple learners need powerful imputation (e.g. conditional imputation with IterativeImputer). Define conditional imputation (theoretical arguments can be found in Le Morvan 2020).
  4. Conditional imputation is guaranteed to work only for "ignorable missingness" (the Missing At Random mechanism, to be defined). Otherwise, the mask is needed (in practice missingness is seldom ignorable: the data are missing for a reason). The Wikipedia page on missing data can justify this.
  5. Powerful learners + simple imputation or no imputation works best (robustness to missingness mechanisms and flexibility), e.g. HistGradientBoosting (this comes from experience, including systematic benchmarks; see the sketch after this list).
  6. For categorical features, impute missing values as a new category (imputing to an existing category destroys information that is important to the learner).
  7. The computational cost of imputation can quickly get large, and even intractable for the most costly methods (e.g. IterativeImputer, KNNImputer).
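For messages 2 and 5 above, a minimal sketch of what the narrative doc could show (toy simulated data, so the exact scores are not the point): the imputer is kept inside the pipeline so cross-validation never fits it on a test fold, and HistGradientBoosting is used directly on data containing NaN.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data with 20% of the entries missing completely at random.
X, y = make_regression(n_samples=1_000, n_features=5, random_state=0)
rng = np.random.RandomState(0)
X[rng.uniform(size=X.shape) < 0.2] = np.nan

# The imputer lives inside the pipeline, so each CV split fits it on the
# training fold only: no information leaks from the test fold.
simple_model = make_pipeline(SimpleImputer(strategy="mean"), Ridge())
print(cross_val_score(simple_model, X, y).mean())

# HistGradientBoosting supports NaN natively, so no imputer is needed.
hgbt = HistGradientBoostingRegressor(random_state=0)
print(cross_val_score(hgbt, X, y).mean())
```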

Side messages:

  1. The optimal predictor on partially-observed values is not always "good" imputation followed by the optimal predictor on the fully-observed values (Le Morvan et al. 2021). You need to account for missingness in some way.
  2. For multiple imputation, the training and test behaviors need to be separated (cf. main message 1 above).
  3. As a consequence, ensemble methods such as bagging are a good way to implement multiple imputation in practice; a single supervised learner applied to many imputations is likely severely suboptimal (a sketch follows below).
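One possible way to realize side message 3, sketched with BaggingRegressor around an IterativeImputer drawing from its posterior (this illustrates the idea and is not the only implementation):

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import BaggingRegressor
from sklearn.impute import IterativeImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# With sample_posterior=True and distinct random states, each ensemble
# member sees a different plausible imputation of the data, so the
# ensemble averages over imputations (multiple imputation) instead of
# committing to a single completed dataset.
member = make_pipeline(IterativeImputer(sample_posterior=True), Ridge())
multiple_imputation_model = BaggingRegressor(
    member, n_estimators=10, random_state=0
)
```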

Take-home messages:

  1. If you have little data: use conditional imputation and simple learners.
  2. If you have a lot of data (n > 1000): use HistGradientBoosting.
  3. Don't impute categorical variables.

Resources

  1. Wikipedia: Missing data
  2. Josse et al. 2019, "On the consistency of supervised learning with missing values"
  3. Le Morvan et al. 2020, "Linear predictor on linearly-generated data with missing values: non consistency and solutions"
  4. Le Morvan et al. 2021, "What's a good imputation to predict with missing values?"
  5. Perez-Lebel et al. 2022, "Benchmarking missing-values approaches for predictive models on health databases"

Suggest a potential alternative/fix

After discussing with @glemaitre and @GaelVaroquaux, the following changes were suggested.

Big picture

The goal is to give the recommendations above, and have simple examples that convey the right intuitions (even simple simulated data can be didactic by showing the basic mechanisms).

  1. Write a narrative doc page that gives the big picture messages listed above and some figures.
  2. Replace the current examples about imputation that don't give a clear message.
  3. An example to generate simple figures explaining the difference between Missing At Random and Missing Not At Random (as in https://www.slideshare.net/GaelVaroquaux/dirty-data-science-machine-learning-on-noncurated-data/27).
    Didactic purpose: intuitions on the fact that missing values may distort distributions. Short example.
  4. An example to develop intuitions on the interplay between imputation and learning, adapted from http://dirtydata.science/python/gen_notes/01_missing_values.html (only the first two sections). Didactic purpose: how the mechanism + imputation modifies the link between X and y.
  5. Adapt the docstrings to give local recommendations:
    a. IterativeImputer: give time complexity (algorithmic scalability) and say it is not a magic bullet in the face of structured missingness.
    b. KNNImputer: terrible computational scalability.
    c. SimpleImputer: does not work well with simple models.

Proposed roadmap

(refers to the items above, and can be detailed in a project board)

@ogrisel
Member

ogrisel commented Dec 13, 2021

I like the general plan. Thanks @A-pl!

Don't impute categorical variables.

I would add that using SimpleImputer(strategy="constant", fill_value="missing", missing_values=None) might still be helpful to convert missing-value markers for downstream models that accept string categorical labels but would raise an exception when encountering a None (or np.nan, or pd.NA) value.
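For illustration, such a step could then feed an encoder; a minimal sketch (not necessarily the exact example to put in the docs):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], [None], ["blue"], [None]], dtype=object)

# Missing entries become their own "missing" category, so the encoder
# (and the downstream model) keeps the missingness information instead
# of failing on None.
categorical_prep = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing", missing_values=None),
    OneHotEncoder(),
)
print(categorical_prep.fit_transform(X).toarray())
```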

@glemaitre
Member

I would add that using SimpleImputer(strategy="constant", fill_value="missing", missing_values=None) might still be helpful

You are right. "No imputation" referred to "no advanced imputation". I think we even discussed briefly whether it would make sense for OneHotEncoder and OrdinalEncoder to allow encoding missing values (instead of only being lenient with them) to simplify the pipeline. However, this is out of scope for the current plan :)

@GaelVaroquaux
Member

GaelVaroquaux commented Dec 14, 2021 via email

@ogrisel
Member

ogrisel commented Dec 16, 2021

If there is a consensus that the above is a good suggestion (which I really think it is), we could open an issue, but focus on it later.

#21988 still lets np.nan pass through by default, but at least it now gives an option to replace np.nan with something else (e.g. -2, while -1 is the default for unknown values at predict time).

We could change the passthrough behavior in a follow-up PR, but that would require a deprecation cycle.
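Assuming the option from #21988 is exposed as a parameter on OrdinalEncoder, usage could look something like the sketch below (parameter names taken from that PR, with -2 for missing and -1 for unknown as in the example above):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X_train = np.array([["cat"], ["dog"], [np.nan]], dtype=object)

# Encode missing values as -2 at fit/transform time, and map categories
# unseen during fit to -1 at predict time.
encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
    encoded_missing_value=-2,
)
print(encoder.fit_transform(X_train))                          # [[0.] [1.] [-2.]]
print(encoder.transform(np.array([["mouse"]], dtype=object)))  # [[-1.]]
```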

@richierocks

I'm excited that this is being worked on. I recently tried to figure out what I thought was a simple thing: how should I decide which of SimpleImputer, IterativeImputer or KNNImputer to use? The documentation comes up short on this, and asking the internet yielded little useful response.

@aperezlebel
Contributor Author

aperezlebel commented Jun 21, 2022

KNNImputer: terrible computational scalability.

I am not sure how much this claim holds now that #16397 reduced the memory footprint of KNN using chunked distance computation (see also #15604). Looking at the code, the memory footprint looks O(np) to me now, whereas it was originally O(n^2 + np).
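For reference, the chunking pattern that #16397 relies on (not KNNImputer's exact internals, just the idea):

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.RandomState(0).rand(5_000, 20)

# Distances are produced one row-chunk at a time: peak memory stays
# around O(chunk_size * n) instead of O(n^2) for a full distance matrix.
for chunk in pairwise_distances_chunked(X, working_memory=64):
    # chunk has shape (chunk_size, n_samples); reduce it, then discard it.
    nearest = np.argpartition(chunk, 5, axis=1)[:, :6]
```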

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 21, 2022 via email

@aperezlebel
Contributor Author

In terms of computational time, yes, I agree. But in terms of memory footprint (which was the bottleneck of KNNImputer before #16397), the current implementation seems to need less than O(n^2) memory, since the distance matrix is never stored in full but processed in chunks.

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 21, 2022 via email

@aperezlebel
Contributor Author

aperezlebel commented Jun 23, 2022

For task 3 (simple figures explaining the missing values mechanisms), I made this figure:
[figure: missing_value_mechanisms]

  • MCAR: Missing Completely At Random; the missingness does not depend on X.
  • MAR: Missing At Random; the missingness does not depend on the underlying missing values, but can depend on the observed ones.
  • MNAR: Missing Not At Random; the missingness depends on the underlying missing values.
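To make the three mechanisms concrete, a minimal NumPy sketch of how such masks could be simulated on a two-feature toy dataset, with x2 the feature that gets missing entries:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=1_000)
x1, x2 = X[:, 0], X[:, 1]

# MCAR: the mask is independent of both features.
mcar = rng.uniform(size=len(x2)) < 0.3
# MAR: the mask depends only on the observed feature x1.
mar = x1 > np.quantile(x1, 0.7)
# MNAR: the mask depends on the unobserved value of x2 itself.
mnar = x2 > np.quantile(x2, 0.7)

x2_mcar = np.where(mcar, np.nan, x2)
x2_mar = np.where(mar, np.nan, x2)
x2_mnar = np.where(mnar, np.nan, x2)
```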

Any comments?

@ArturoAmorQ
Member

Any comments?

I think the images are quite clear, but I find the concept of "underlying missing values" a bit ambiguous. How can we clarify this point for users who, like myself, have no background in the terminology?

@aperezlebel
Contributor Author

aperezlebel commented Jun 24, 2022

Yes, I agree it could be clearer. I think we can add a paragraph before the figure that explains the setting: every sample has a true value for every feature, but some of these values are masked. We can also say a word about the fact that in practice some features are missing but have no true value behind them (e.g. a "death date" feature for a person who is alive).
