
impute.KNNImputer always uses just the mean value #17140

Open · FlorinAndrei opened this issue May 6, 2020 · 12 comments

@FlorinAndrei commented May 6, 2020

Describe the bug

I am trying to use KNNImputer with a pandas DataFrame to impute missing values. It always just fills in the mean of the column instead of actually estimating values from the neighbors.

Steps/Code to Reproduce

import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv('pydf.csv', index_col=0)
imputer = KNNImputer()
dfnew = df.copy(deep=True)
# Impute the one column that has missing values
exguess = imputer.fit_transform(dfnew[['Incoming Examinations']])
exguess

This is pydf.csv:

,Incoming Examinations,Year,Month,Date
19,362.0,2006,1,2006-01-01
10,436.0,2006,2,2006-02-01
57,,2006,3,2006-03-01
23,490.0,2006,4,2006-04-01
12,508.0,2006,5,2006-05-01
70,,2006,6,2006-06-01
68,393.0,2006,7,2006-07-01
63,596.0,2006,8,2006-08-01
36,634.0,2006,9,2006-09-01
29,613.0,2006,10,2006-10-01
46,545.0,2006,11,2006-11-01
50,411.0,2006,12,2006-12-01
72,398.0,2007,1,2007-01-01
43,311.0,2007,2,2007-02-01
91,664.0,2007,3,2007-03-01
92,680.0,2007,4,2007-04-01
87,507.0,2007,5,2007-05-01
11,467.0,2007,6,2007-06-01
77,566.0,2007,7,2007-07-01
76,806.0,2007,8,2007-08-01
95,732.0,2007,9,2007-09-01
40,886.0,2007,10,2007-10-01
64,776.0,2007,11,2007-11-01
0,698.0,2007,12,2007-12-01
88,875.0,2008,1,2008-01-01
26,840.0,2008,2,2008-02-01
5,724.0,2008,3,2008-03-01
53,1115.0,2008,4,2008-04-01
90,997.0,2008,5,2008-05-01
35,775.0,2008,6,2008-06-01
75,886.0,2008,7,2008-07-01
49,1041.0,2008,8,2008-08-01
59,1011.0,2008,9,2008-09-01
86,,2008,10,2008-10-01
45,939.0,2008,11,2008-11-01
7,,2008,12,2008-12-01
28,1004.0,2009,1,2009-01-01
60,1065.0,2009,2,2009-02-01
22,1263.0,2009,3,2009-03-01
4,962.0,2009,4,2009-04-01
52,,2009,5,2009-05-01
81,1429.0,2009,6,2009-06-01
94,1205.0,2009,7,2009-07-01
24,890.0,2009,8,2009-08-01
38,1320.0,2009,9,2009-09-01
37,1276.0,2009,10,2009-10-01
84,1757.0,2009,11,2009-11-01
58,1710.0,2009,12,2009-12-01
54,1710.0,2010,1,2010-01-01
34,1709.0,2010,2,2010-02-01
67,1578.0,2010,3,2010-03-01
30,1604.0,2010,4,2010-04-01
74,1758.0,2010,5,2010-05-01
1,,2010,6,2010-06-01
71,1457.0,2010,7,2010-07-01
85,1607.0,2010,8,2010-08-01
16,1808.0,2010,9,2010-09-01
65,1866.0,2010,10,2010-10-01
2,1934.0,2010,11,2010-11-01
3,2294.0,2010,12,2010-12-01
44,,2011,1,2011-01-01
27,2334.0,2011,2,2011-02-01
41,1973.0,2011,3,2011-03-01
61,2262.0,2011,4,2011-04-01
82,2259.0,2011,5,2011-05-01
15,2217.0,2011,6,2011-06-01
42,2739.0,2011,7,2011-07-01
89,2772.0,2011,8,2011-08-01
8,3383.0,2011,9,2011-09-01
14,2869.0,2011,10,2011-10-01
48,2239.0,2011,11,2011-11-01
33,,2011,12,2011-12-01
78,2789.0,2012,1,2012-01-01
39,3455.0,2012,2,2012-02-01
17,2940.0,2012,3,2012-03-01
79,2968.0,2012,4,2012-04-01
25,3466.0,2012,5,2012-05-01
73,3037.0,2012,6,2012-06-01
47,3946.0,2012,7,2012-07-01
62,3459.0,2012,8,2012-08-01
20,3446.0,2012,9,2012-09-01
51,3258.0,2012,10,2012-10-01
55,4729.0,2012,11,2012-11-01
21,3694.0,2012,12,2012-12-01
9,4610.0,2013,1,2013-01-01
32,4841.0,2013,2,2013-02-01
80,5172.0,2013,3,2013-03-01
93,4351.0,2013,4,2013-04-01
18,4730.0,2013,5,2013-05-01
31,4706.0,2013,6,2013-06-01
66,5000.0,2013,7,2013-07-01
6,4978.0,2013,8,2013-08-01
56,5008.0,2013,9,2013-09-01
83,6094.0,2013,10,2013-10-01
13,4874.0,2013,11,2013-11-01
69,5806.0,2013,12,2013-12-01

Actual Results

This is exguess:

array([[ 362.        ],
       [ 436.        ],
       [2060.13636364],
       [ 490.        ],
       [ 508.        ],
       [2060.13636364],
       [ 393.        ],
       [ 596.        ],
       [ 634.        ],
       [ 613.        ],
       [ 545.        ],
       [ 411.        ],
       [ 398.        ],
       [ 311.        ],
       [ 664.        ],
       [ 680.        ],
       [ 507.        ],
       [ 467.        ],
       [ 566.        ],
       [ 806.        ],
       [ 732.        ],
       [ 886.        ],
       [ 776.        ],
       [ 698.        ],
       [ 875.        ],
       [ 840.        ],
       [ 724.        ],
       [1115.        ],
       [ 997.        ],
       [ 775.        ],
       [ 886.        ],
       [1041.        ],
       [1011.        ],
       [2060.13636364],
       [ 939.        ],
       [2060.13636364],
       [1004.        ],
       [1065.        ],
       [1263.        ],
       [ 962.        ],
       [2060.13636364],
       [1429.        ],
       [1205.        ],
       [ 890.        ],
       [1320.        ],
       [1276.        ],
       [1757.        ],
       [1710.        ],
       [1710.        ],
       [1709.        ],
       [1578.        ],
       [1604.        ],
       [1758.        ],
       [2060.13636364],
       [1457.        ],
       [1607.        ],
       [1808.        ],
       [1866.        ],
       [1934.        ],
       [2294.        ],
       [2060.13636364],
       [2334.        ],
       [1973.        ],
       [2262.        ],
       [2259.        ],
       [2217.        ],
       [2739.        ],
       [2772.        ],
       [3383.        ],
       [2869.        ],
       [2239.        ],
       [2060.13636364],
       [2789.        ],
       [3455.        ],
       [2940.        ],
       [2968.        ],
       [3466.        ],
       [3037.        ],
       [3946.        ],
       [3459.        ],
       [3446.        ],
       [3258.        ],
       [4729.        ],
       [3694.        ],
       [4610.        ],
       [4841.        ],
       [5172.        ],
       [4351.        ],
       [4730.        ],
       [4706.        ],
       [5000.        ],
       [4978.        ],
       [5008.        ],
       [6094.        ],
       [4874.        ],
       [5806.        ]])

2060.13636364 is the mean value of that column. I expected something much closer to its neighbors.
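A minimal sketch of what seems to be happening (my reading, not confirmed by the maintainers): with a single-feature input, a row missing that feature has no observed values at all, so no neighbor distances can be computed and KNNImputer falls back to the column mean. Adding row position as a second feature changes the result:

import numpy as np
from sklearn.impute import KNNImputer

# Single feature: the NaN row has nothing to measure distances with,
# so the fill value is the mean of the observed values.
X = np.array([[1.0], [2.0], [np.nan], [10.0], [20.0]])
print(KNNImputer(n_neighbors=2).fit_transform(X).ravel())
# [ 1.    2.    8.25 10.   20.  ]   <- 8.25 == mean(1, 2, 10, 20)

# With row position added as a feature, distances become computable
# and the fill comes from the 2 nearest rows by position.
X2 = np.column_stack([np.arange(5.0), X.ravel()])
print(KNNImputer(n_neighbors=2).fit_transform(X2)[:, 1])
# [ 1.  2.  6. 10. 20.]   <- 6.0 == mean(2, 10)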

Versions

>>> import sklearn; sklearn.show_versions()

System:
    python: 3.7.7 (default, Apr 15 2020, 05:09:04) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\flori\Anaconda3\python.exe
   machine: Windows-10-10.0.18362-SP0

Python dependencies:
       pip: 20.0.2
setuptools: 46.1.3.post20200330
   sklearn: 0.22.1
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: 0.29.17
    pandas: 1.0.3
matplotlib: 3.1.3
    joblib: 0.14.1

Built with OpenMP: True

Running in a Jupyter notebook on Windows, all installed with Anaconda.

@jnothman (Member) commented May 6, 2020 via email

@FlorinAndrei (Author) commented May 6, 2020

You don't need the whole table to do things like weighted average of nearest neighbors. On the same data, imputeTS does the right thing in R (moving average over 4 neighbors with exponential decay) even when you feed it only that column:

library(imputeTS)
df$Incoming.Examinations <- na_ma(df$Incoming.Examinations, k = 4, weighting = "exponential")
> df$Incoming.Examinations
 [1]  362.0000  436.0000  451.2400  490.0000  508.0000  495.3214  393.0000  596.0000  634.0000  613.0000  545.0000  411.0000
[13]  398.0000  311.0000  664.0000  680.0000  433.0000  467.0000  566.0000  806.0000  732.0000  886.0000  776.0000  698.0000
[25]  875.0000  840.0000  724.0000 1115.0000  997.0000  775.0000  886.0000 1041.0000 1011.0000  976.3077  939.0000 1013.6538
[37] 1004.0000 1065.0000 1263.0000  962.0000 1174.4667 1429.0000 1205.0000  890.0000 1320.0000 1276.0000 1757.0000 1710.0000
[49] 1710.0000 1709.0000 1578.0000 1604.0000 1758.0000 1630.3667 1457.0000 1607.0000 1808.0000 1866.0000 1934.0000 2294.0000
[61] 2165.8333 2334.0000 1973.0000 2262.0000 2259.0000 2217.0000 2739.0000 2772.0000 3383.0000 2869.0000 2239.0000 2796.8667
[73] 2789.0000 3455.0000 2940.0000 2968.0000 3466.0000 3037.0000 3946.0000 3459.0000 3446.0000 3258.0000 4729.0000 3694.0000
[85] 4610.0000 4841.0000 5172.0000 4351.0000 4730.0000 4706.0000 5000.0000 4978.0000 5008.0000 6094.0000 4874.0000 5806.0000

In fact, the rest of the table holds data that is merely coincident with that column (same timestamps) but causally unrelated. This is typical of a lot of timestamped real-world data. Taking the other columns into account would produce nonsense results.

The whole phenomenon is captured in that column. I'm actually quite surprised the other columns are even mentioned in this context.

@jnothman (Member) commented May 6, 2020 via email

@FlorinAndrei (Author)

You mean like this?

exguess = imputer.fit_transform(dfnew[['Incoming Examinations', 'Date']])

It's picky. The Date column is type datetime64[ns] and I get this error:

TypeError: invalid type promotion

Why can't it use the index? It's already ordered, and it's passed along if I use the double brackets [[]].
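One way around the type error (a sketch under my own assumptions, using the column names from the CSV above): KNNImputer only accepts numeric input, so the datetime column can be converted to an integer representation before imputing.

import pandas as pd
from sklearn.impute import KNNImputer

# Turn the datetime column into nanoseconds since the epoch so the
# imputer has a numeric ordering feature to compute distances with.
dfnew = df.copy(deep=True)
dfnew['DateNum'] = pd.to_datetime(dfnew['Date']).astype('int64')
exguess = KNNImputer().fit_transform(dfnew[['Incoming Examinations', 'DateNum']])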

@FlorinAndrei (Author) commented May 6, 2020

Or how about this: let me tell the library to assume things are already in the right order and that the intervals are regular. Just start with the simplest scenario. This is a very common situation in data analysis, and it's the default assumption of the equivalent R library. Even an optional flag to that effect would be great here.

@jnothman (Member) commented May 7, 2020 via email

@FlorinAndrei (Author) commented May 11, 2020

That worked! Thank you.

Now the result is a 2-D array where the first column seems to be an index and only the second column contains the imputed series. So I still have to do this:

# Turn the index into an artificial feature, impute, then keep
# only the imputed column from the output
exguess = imputer.fit_transform(dfnew['Incoming Examinations'].reset_index())
dfnew['Incoming Examinations'] = [v[1] for v in exguess]
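(A NumPy slice, exguess[:, 1], does the same as the list comprehension.)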

I know it may sound like nitpicking (especially since it works), but it still feels like the library is trying to outsmart me. I only care about one column, I clearly gave it only one column, and it should return a single column of imputed values in a format similar to the input.

In the R world you can do something equivalent to this...

df['column'] = imputer_magic(df['column'], params)

...and it just works.

It would also be hard to piece this whole thing together from the library documentation alone.

Not sure how to put it better than this - it's a matter of expectations.

@jnothman (Member) commented May 11, 2020 via email

@FlorinAndrei (Author)

> What would you change in the user guide?

Yeah, literally what you said in the last comment is very illuminating.

If you expect this to work...

df['column'] = imputer_magic(df['column'], params)

...then you're in for a hard time. The library expects multivariate input, and the order is not implied by the row structure of the input.

What you need to do instead is take that single column, give it an artificial index feature, and then extract the single column you're looking for from the imputer output:

output = imputer_magic(df['column'].reset_index(), params)
df['column'] = [row[1] for row in output]

Two lines instead of one, and slightly more verbose, not too bad. I can live with that. The workflow is about the same.

@jnothman (Member) commented May 11, 2020 via email

@FlorinAndrei (Author)

Right.

The stumbling block for me was the data order thing. With that out of the way, the rest is doable.

@amrhitch commented Jul 7, 2022

Following up on this: is there a way to pass the whole dataset to the KNNImputer but only return the imputed values for a specific column? I want to use a different imputer for different features, but at the same time I want to keep them all inside a pipeline.
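One possible approach (a sketch under my own assumptions, not a maintainer answer; column names are the ones from the CSV above): fit the imputer on all numeric features and slice out just the column you care about. sklearn.compose.ColumnTransformer can also hold several imputers inside one Pipeline, though each imputer then only sees the columns routed to it rather than the whole dataset.

import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv('pydf.csv', index_col=0)

# Fit on every numeric feature, then keep only the column of interest.
numeric_cols = ['Incoming Examinations', 'Year', 'Month']
imputed = KNNImputer().fit_transform(df[numeric_cols])
df['Incoming Examinations'] = imputed[:, numeric_cols.index('Incoming Examinations')]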
