impute.KNNImputer always uses just the mean value #17140
You are only passing that single column to fit_transform, so the imputer has no other information by which to judge neighborhood.
We should probably raise a warning when only one column is passed to KNNImputer (and IterativeImputer), since there is no basis for doing anything but SimpleImputer approaches in that case.
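A minimal sketch of the behaviour being described (toy data, not the reporter's CSV): with a single column, the row containing the NaN has no other observed features, so no distances can be computed and KNNImputer falls back to the column mean, just like SimpleImputer would.

```python
import numpy as np
from sklearn.impute import KNNImputer

# One column only: the NaN row has no observed coordinates to measure
# distance on, so KNNImputer falls back to the mean of the column.
X = np.array([[1.0], [2.0], [np.nan], [100.0]])
imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(imputed[2, 0])  # equals the mean of the observed values
```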
|
You don't need the whole table to do things like a weighted average of nearest neighbors. On the same data, imputeTS does the right thing in R (a moving average over 4 neighbors with exponential decay) even when you feed it only that column:
library(imputeTS)
df$Incoming.Examinations <- na_ma(df$Incoming.Examinations, k = 4, weighting = "exponential")
In fact, the rest of the table has data that is coincidental to that column (same timestamp) but is causally unrelated. This is typical for a lot of real world data that is timestamped. Taking the other columns into account would produce nonsense results. The whole phenomenon is captured in that column. I'm actually quite surprised the other columns are even mentioned in this context. |
Ahh. Neighbourhood here means neighbourhood in a metric space, not in a time series (or index order). If you want that effect with KNNImputer (which is not designed specifically for time series), you will need to pass another feature representing time, such that each point's nearest neighbours in that feature are the points nearest in time.
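A small sketch of what this suggestion looks like, using made-up regularly spaced data: encoding time as a plain numeric column makes nearest neighbours in feature space coincide with the temporally closest points, so the imputed value interpolates locally instead of using the column mean.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: regular time steps plus the series to impute.
t = np.arange(8, dtype=float)
y = np.array([10.0, 12.0, 14.0, np.nan, 18.0, 20.0, 22.0, 24.0])
X = np.column_stack([t, y])

# With a time feature present, the two nearest neighbours of the NaN
# row are its temporal neighbours (y=14 and y=18), not the whole column.
imputed = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)
print(imputed[3, 1])
```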
|
You mean like this? exguess = imputer.fit_transform(dfnew[['Incoming Examinations', 'Date']]) It's picky. The Date column is type datetime64[ns] and I get this error:
Why can't it use the index? It's ordered already, and it's passed along if I use the double brackets [[]]. |
Or how about this: I want to tell the library - assume things are in the right order already, and the intervals are regular. Just start with the simplest scenario. This is a very common occurrence in data analysis, and it's the default assumption with the equivalent R library. Even an optional flag to that effect would be great here. |
imputer.fit_transform(dfnew['Incoming Examinations'].reset_index())
?
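A runnable sketch of this suggestion on a toy frame (the column name is taken from the thread, the data is invented): reset_index() turns the Series into a two-column frame of (index, values), which gives the imputer a positional feature to measure neighbourhood on.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy stand-in for the reporter's data.
dfnew = pd.DataFrame({"Incoming Examinations": [10.0, 12.0, np.nan, 16.0]})

# reset_index() produces columns (index, values); the imputer then
# treats row position as a feature, so neighbours are adjacent rows.
imputer = KNNImputer(n_neighbors=2, weights="distance")
out = imputer.fit_transform(dfnew["Incoming Examinations"].reset_index())
print(out[:, 1])  # second column is the imputed series
```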
|
That worked! Thank you. Now, the result is an array of arrays, and only the second element in each small array contains the imputed series, the first element seems to be an index. So I still have to do this:
I know it may sound like nitpicking (especially since it works), but it still feels like the library tries to outsmart me. I only care about one column, I clearly gave it only one column, and it should return a single column with imputed values in a format similar to the input. In the R world you can do something equivalent to this...
...and it just works. It would also be hard to retrace this whole thing just based on the library documentation. Not sure how to put it better than this - it's a matter of expectations. |
It's not trying to outsmart you, so much as to follow its own API principles. Scikit-learn is designed mostly for multivariate data, and for datasets where the order of the inputs should be disregarded, under the assumption that each sample is more-or-less IID. Your expectations explicitly go against that assumption.
But if the documentation can be clarified to emphasise that neighborhood is *not* about sample order, that might be a valuable improvement. What would you change in the user guide?
|
Yeah, literally what you said in the last comment is very illuminating. If you expect this to work...
df['column'] = imputer_magic(df['column'], params)
...then you're in for a hard time. The library expects multivariate input, and the order is not implied by the row structure of the input. What you need to do instead is take that single column, give it an artificial index, and then extract the single column you're looking for from the imputer output:
output = imputer_magic(dfnew['column'].reset_index())
df['column'] = [row[1] for row in output]
Two lines instead of one, and slightly more verbose, not too bad. I can live with that. The workflow is about the same. |
You should be able to do the indexation on the second line with [:,1]
applied to the returned array.
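A tiny illustration with made-up numbers: the `[:, 1]` slice selects the same column as the per-row list comprehension, but in idiomatic NumPy.

```python
import numpy as np

# Toy stand-in for the imputer output: column 0 is the index added by
# reset_index(), column 1 is the imputed series.
output = np.array([[0, 10.0], [1, 12.0], [2, 14.0]])

# Slicing the second column directly replaces the list comprehension.
print(output[:, 1])
```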
|
Right. The stumbling block for me was the data order thing. With that out of the way, the rest is doable. |
Following up on this, is there a way to pass the whole dataset to the KNNImputer but only return the imputed values for a specific column? I'd like to use a different imputer for different features while keeping them all inside a pipeline. |
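One common pattern for the "different imputer per feature, inside a pipeline" part of this question (hedged sketch with invented data, and not a direct answer to returning a single column): ColumnTransformer routes each imputer to a subset of columns within one pipeline step. Note each imputer then only sees the columns routed to it, so KNNImputer below computes neighbourhoods on columns 0 and 1 only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer

# Route KNN imputation to columns 0-1 and mean imputation to column 2.
ct = ColumnTransformer([
    ("knn", KNNImputer(n_neighbors=2), [0, 1]),
    ("mean", SimpleImputer(strategy="mean"), [2]),
])
X = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, 5.0],
    [3.0, 4.0, 7.0],
])
out = ct.fit_transform(X)
print(out.shape)  # all three columns come back imputed
```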
Describe the bug
Trying to use KNNImputer with a Pandas dataframe to impute missing values. It always just uses the mean value of the column instead of actually trying to impute smoother values.
Steps/Code to Reproduce
This is pydf.csv:
Actual Results
This is exguess:
2060.13636364 is the mean value of that column. I expected something much closer to its neighbors.
Versions
Running in a Jupyter notebook on Windows, all installed with Anaconda.