SimpleImputer with the rule strategies median-1/median+1 #25642

gykovacs · 2023-02-18T19:27:48Z

Describe the workflow you want to enable

I have run into an issue with SimpleImputer. For a feature of, say, integer type, it is entirely reasonable to impute missing values with the median. However, when the number of observed records is even, there is a decent chance that the median falls between two integers, according to the well-known rule (Sorted[N/2-1] + Sorted[N/2])/2. Technically, this kind of imputation breaks the domain of the feature: it used to be integer-valued, but now it contains .5 values, which can behave unexpectedly in further processing.

Long story short, when the missing value in a sequence like 4, 3, ?, 2, 4, 5, 1 is imputed with 3.5, it is no longer an integer sequence.

Describe your proposed solution

My recommendation is to introduce something like an "adjusted median", which would ensure that the imputed value belongs to the domain of the feature: pick Sorted[N/2-1] or Sorted[N/2], whichever has the higher number of occurrences in the data. If they are equal, take the smaller.

Basically the "most_frequent" strategy applied to Sorted[N/2-1] and Sorted[N/2] only.
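The rule above can be sketched in plain Python. This is only an illustration of the proposal, not existing scikit-learn behaviour: the helper name `adjusted_median` is hypothetical, and missing values are assumed to be encoded as `None`.

```python
from collections import Counter


def adjusted_median(values):
    """Hypothetical helper: a median that is always an observed value.

    Assumes missing entries are encoded as None (an illustration choice).
    """
    observed = sorted(v for v in values if v is not None)
    n = len(observed)
    if n == 0:
        raise ValueError("no observed values to compute a median from")
    if n % 2 == 1:
        # Odd count: the ordinary median is already an observed value.
        return observed[n // 2]
    # Even count: choose between the two middle values.
    lower, upper = observed[n // 2 - 1], observed[n // 2]
    counts = Counter(observed)
    # Prefer the more frequent candidate; on a tie, take the smaller one.
    return upper if counts[upper] > counts[lower] else lower


sequence = [4, 3, None, 2, 4, 5, 1]
fill = adjusted_median(sequence)
imputed = [fill if v is None else v for v in sequence]
```

For the sequence from the description, the two middle values are 3 and 4; 4 occurs twice, so the missing entry is filled with 4 and the sequence stays integer-valued.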

Describe alternatives you've considered, if relevant

Alternative solutions and strategy names could work as well. The issue in the problem described above is that the median calculation is limited to its mathematical definition. np.percentile, like the percentile functions in R, offers more flexibility: what SimpleImputer does with strategy="median" amounts to taking the 50th percentile with linear interpolation, whereas np.percentile could also compute it with nearest interpolation. I think offering this control would improve the flexibility of the imputer with very little effort.
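As a rough illustration of the difference, the sketch below compares the two interpolation modes on the observed values of the example sequence. It assumes NumPy 1.22 or later, where the keyword is `method` (older releases used `interpolation`).

```python
import numpy as np

# Observed (non-missing) values of the example sequence.
observed = np.array([4, 3, 2, 4, 5, 1])

# Default linear interpolation: the ordinary median, which may fall between integers.
linear = np.percentile(observed, 50)

# Nearest interpolation: always returns one of the observed values.
nearest = np.percentile(observed, 50, method="nearest")
```

Here `linear` is 3.5, while `nearest` is one of the two middle values (3 or 4); which of the two is returned when the median lies exactly halfway depends on NumPy's rounding rule.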

Additional context

No response

thomasjpfan · 2023-02-24T14:43:05Z

Thank you for opening the issue! I can see the motivation behind median-1 and median+1. Regarding scikit-learn's inclusion criteria, could you share a reference that showcases or studies imputation using median-1/median+1?

glevv · 2023-04-05T08:51:02Z

Do you mean lower median and higher median? I think this could be done with np.quantile(x, q=0.5, method='lower') and np.quantile(x, q=0.5, method='higher') respectively. I don't know how useful it would be, though.
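A minimal sketch of that suggestion on the example sequence from the issue, assuming NumPy 1.22 or later for the `method` keyword and assuming the missing entry has already been dropped:

```python
import numpy as np

# Observed (non-missing) values of 4, 3, ?, 2, 4, 5, 1.
observed = np.array([4, 3, 2, 4, 5, 1])

# Lower median: the smaller of the two middle values.
lower_median = np.quantile(observed, q=0.5, method="lower")

# Higher median: the larger of the two middle values.
higher_median = np.quantile(observed, q=0.5, method="higher")
```

With the sorted values 1, 2, 3, 4, 4, 5, this gives 3 for the lower median and 4 for the higher median, both of which are values from the data.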

gykovacs added the Needs Triage and New Feature labels on Feb 18, 2023

thomasjpfan added the module:impute and Needs Decision - Include Feature labels and removed the Needs Triage label on Feb 24, 2023
