SimpleImputer with the rule strategies median-1/median+1 #25642
Labels
module:impute
Needs Decision - Include Feature
New Feature
Describe the workflow you want to enable
I have run into an issue with `SimpleImputer`. Given a feature of, say, integer type, it is completely reasonable to impute missing values with the median. However, when the number of observed records is even, there is a decent chance that the median falls between two integers, according to the well-known rule (Sorted[N/2-1] + Sorted[N/2])/2. The issue is that, technically, this kind of imputation breaks the domain of the feature: it used to be integer, but now it contains .5 values, which can act weirdly in further processing.
Long story short, when the missing value in a sequence like 4, 3, ?, 2, 4, 5, 1 is imputed with 3.5, it is not an integer sequence anymore.
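To make the arithmetic concrete, here is a minimal standard-library sketch of the median rule applied to the example sequence. It does not call scikit-learn; `values`, `observed` and `imputed` are illustrative names only, and `None` stands in for the missing entry.

```python
import statistics

# The example sequence from the issue; None marks the missing entry.
values = [4, 3, None, 2, 4, 5, 1]

# Sort the observed (non-missing) values: [1, 2, 3, 4, 4, 5].
observed = sorted(v for v in values if v is not None)
n = len(observed)

# Even N: the median is the mean of the two middle elements,
# i.e. (Sorted[N/2-1] + Sorted[N/2]) / 2.
manual_median = (observed[n // 2 - 1] + observed[n // 2]) / 2
assert manual_median == statistics.median(observed)

# Filling the gap with that value introduces a non-integer.
imputed = [manual_median if v is None else v for v in values]
print(imputed)  # [4, 3, 3.5, 2, 4, 5, 1]
```

The two middle elements are 3 and 4, so the fill value is 3.5, which is not a member of the original integer domain.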
Describe your proposed solution
My recommendation is to introduce something like an "adjusted median", which would ensure that the imputed value belongs to the domain of the feature: pick Sorted[N/2-1] or Sorted[N/2], whichever has the higher number of occurrences in the data. If they are equally frequent, take the smaller.
Basically the "most_frequent" strategy applied to Sorted[N/2-1] and Sorted[N/2] only.
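A rough sketch of how this rule could behave, written in plain Python. `adjusted_median` is a hypothetical helper name used only for illustration, not an existing scikit-learn function or strategy, and the handling of odd N (returning the single middle element) is my assumption, since in that case the ordinary median is already a value from the data.

```python
from collections import Counter


def adjusted_median(values):
    """Hypothetical helper: a median that is always an observed value.

    For even N, choose between the two middle elements by frequency,
    falling back to the smaller one on a tie. None marks missing entries.
    """
    observed = sorted(v for v in values if v is not None)
    if not observed:
        raise ValueError("no observed values to compute a median from")

    n = len(observed)
    if n % 2 == 1:
        # Odd N: the ordinary median is already an element of the data.
        return observed[n // 2]

    low, high = observed[n // 2 - 1], observed[n // 2]
    counts = Counter(observed)
    # Prefer the more frequent candidate; on a tie, take the smaller one.
    return high if counts[high] > counts[low] else low


values = [4, 3, None, 2, 4, 5, 1]
fill = adjusted_median(values)
imputed = [fill if v is None else v for v in values]
print(fill, imputed)  # 4 [4, 3, 4, 2, 4, 5, 1]
```

For the example sequence the candidates are 3 and 4; 4 occurs twice and 3 once, so 4 is imputed and the feature stays integer-valued.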
Describe alternatives you've considered, if relevant
Alternative solutions and strategy names could work as well. In the problem described above, the issue is that the median calculation is limited to its mathematical definition. `np.percentile`, just like the percentile functions in R, offers more flexibility: what happens in `SimpleImputer` with `strategy="median"` is that the 50th percentile is taken with linear interpolation, whereas `np.percentile` could do it with `nearest` interpolation. I think offering this control would improve the flexibility of the imputer with very little effort.
Additional context
No response