SimpleImputer with the rule strategies median-1/median+1 #25642

Open
gykovacs opened this issue Feb 18, 2023 · 2 comments
Labels
module:impute, Needs Decision - Include Feature, New Feature

Comments

gykovacs commented Feb 18, 2023

Describe the workflow you want to enable

I have run into an issue with SimpleImputer. For a feature of, say, integer type, imputing the median for missing values is perfectly reasonable. However, when the number of observed records is even, there is a decent chance that the median falls between two integers, following the well-known rule (Sorted[N/2-1] + Sorted[N/2])/2. Technically, this kind of imputation breaks the domain of the feature: it used to be integer, but now it contains conspicuous .5 values, which can behave unexpectedly in further processing.

Long story short, when a sequence like 4, 3, ?, 2, 4, 5, 1 is imputed with 3.5, it is no longer an integer sequence.

Describe your proposed solution

My recommendation is to introduce something like an "adjusted median", which would ensure that the imputed value belongs to the domain of the feature. Specifically, pick Sorted[N/2-1] or Sorted[N/2], whichever has the higher number of occurrences in the data. If the counts are equal, take the smaller value.

Basically the "most_frequent" strategy applied to Sorted[N/2-1] and Sorted[N/2] only.
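
The rule above can be sketched in plain Python. This is only an illustration under stated assumptions: `adjusted_median` is a hypothetical helper name, not an existing scikit-learn strategy, and missing values are represented by `None`.

```python
from collections import Counter


def adjusted_median(values):
    """Hypothetical helper: a median that is always one of the observed values."""
    observed = sorted(v for v in values if v is not None)
    n = len(observed)
    if n == 0:
        raise ValueError("no observed values to compute a median from")
    if n % 2 == 1:
        # Odd count: the ordinary median is already an observed value.
        return observed[n // 2]
    lower, upper = observed[n // 2 - 1], observed[n // 2]
    counts = Counter(observed)
    # "most_frequent" restricted to the two middle values; ties go to the smaller.
    return upper if counts[upper] > counts[lower] else lower


sequence = [4, 3, None, 2, 4, 5, 1]
fill = adjusted_median(sequence)
imputed = [fill if v is None else v for v in sequence]
print(fill, imputed)
```

For the example sequence, the two middle values are 3 and 4; since 4 occurs twice and 3 once, the missing entry is filled with 4 and the feature stays integer-valued.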

Describe alternatives you've considered, if relevant

Alternative solutions and strategy names could work as well. In the problem described above, the issue is that the median calculation is limited to its mathematical definition. np.percentile, like the percentile functions in R, offers more flexibility: what happens in SimpleImputer with strategy="median" is effectively that the 50th percentile is taken with linear interpolation, whereas np.percentile could also compute it with nearest interpolation. I think offering this control would improve the flexibility of the imputer with very little effort.
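
A small sketch of the difference, assuming NumPy 1.22 or later (where the keyword is called `method`; older releases used `interpolation`). It only uses the observed values of the example sequence and does not touch SimpleImputer itself.

```python
import numpy as np

# Observed values of the example sequence 4, 3, ?, 2, 4, 5, 1
observed = np.array([4, 3, 2, 4, 5, 1])

# Default behaviour: linear interpolation between the two middle values.
linear = np.percentile(observed, 50, method="linear")

# "nearest" returns one of the actual data points instead of interpolating.
nearest = np.percentile(observed, 50, method="nearest")

print(linear, nearest)
```

With an even number of observations the linear result is 3.5, while the nearest result is always one of the two middle observations, so it stays inside the integer domain.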

Additional context

No response

gykovacs added the Needs Triage and New Feature labels on Feb 18, 2023

thomasjpfan (Member) commented

Thank you for opening the issue! I can see the motivation behind median-1 and median+1. As for scikit-learn's inclusion criteria, could you share a reference that showcases or studies imputation using median-1/median+1?

thomasjpfan added the module:impute and Needs Decision - Include Feature labels and removed the Needs Triage label on Feb 24, 2023

glevv (Contributor) commented Apr 5, 2023

Do you mean the lower median and the higher median? I think this could be done with np.quantile(x, q=0.5, method='lower') and np.quantile(x, q=0.5, method='higher') respectively. I don't know how useful it would be, though.
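
As a rough sketch of that suggestion, assuming NumPy 1.22 or later and missing values encoded as NaN, `np.nanquantile` accepts the same `method` values as `np.quantile` and skips the missing entries. Filling with the lower median is an arbitrary choice made here for illustration only.

```python
import numpy as np

# Example sequence with the missing entry encoded as NaN
x = np.array([4, 3, np.nan, 2, 4, 5, 1])

# nanquantile ignores NaN; "lower"/"higher" pick an actual data point.
lower = np.nanquantile(x, 0.5, method="lower")
higher = np.nanquantile(x, 0.5, method="higher")

# Fill the gap with the lower median and cast back to integers.
filled = np.where(np.isnan(x), lower, x).astype(int)

print(lower, higher, filled)
```

Here the lower and higher medians are 3 and 4, and either one can be imputed without leaving the integer domain.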
