
impute.KNNImputer always uses just the mean value #17140

Open · FlorinAndrei opened this issue May 6, 2020 · 12 comments

@FlorinAndrei commented May 6, 2020

Describe the bug

I am trying to use KNNImputer with a pandas DataFrame to impute missing values. It always just fills in the mean of the column instead of actually estimating values from the neighbors.

Steps/Code to Reproduce

import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv('pydf.csv', index_col=0)
imputer = KNNImputer()
dfnew = df.copy(deep=True)
# Impute the one column that has missing values
exguess = imputer.fit_transform(dfnew[['Incoming Examinations']])
exguess

This is pydf.csv:

,Incoming Examinations,Year,Month,Date
19,362.0,2006,1,2006-01-01
10,436.0,2006,2,2006-02-01
57,,2006,3,2006-03-01
23,490.0,2006,4,2006-04-01
12,508.0,2006,5,2006-05-01
70,,2006,6,2006-06-01
68,393.0,2006,7,2006-07-01
63,596.0,2006,8,2006-08-01
36,634.0,2006,9,2006-09-01
29,613.0,2006,10,2006-10-01
46,545.0,2006,11,2006-11-01
50,411.0,2006,12,2006-12-01
72,398.0,2007,1,2007-01-01
43,311.0,2007,2,2007-02-01
91,664.0,2007,3,2007-03-01
92,680.0,2007,4,2007-04-01
87,507.0,2007,5,2007-05-01
11,467.0,2007,6,2007-06-01
77,566.0,2007,7,2007-07-01
76,806.0,2007,8,2007-08-01
95,732.0,2007,9,2007-09-01
40,886.0,2007,10,2007-10-01
64,776.0,2007,11,2007-11-01
0,698.0,2007,12,2007-12-01
88,875.0,2008,1,2008-01-01
26,840.0,2008,2,2008-02-01
5,724.0,2008,3,2008-03-01
53,1115.0,2008,4,2008-04-01
90,997.0,2008,5,2008-05-01
35,775.0,2008,6,2008-06-01
75,886.0,2008,7,2008-07-01
49,1041.0,2008,8,2008-08-01
59,1011.0,2008,9,2008-09-01
86,,2008,10,2008-10-01
45,939.0,2008,11,2008-11-01
7,,2008,12,2008-12-01
28,1004.0,2009,1,2009-01-01
60,1065.0,2009,2,2009-02-01
22,1263.0,2009,3,2009-03-01
4,962.0,2009,4,2009-04-01
52,,2009,5,2009-05-01
81,1429.0,2009,6,2009-06-01
94,1205.0,2009,7,2009-07-01
24,890.0,2009,8,2009-08-01
38,1320.0,2009,9,2009-09-01
37,1276.0,2009,10,2009-10-01
84,1757.0,2009,11,2009-11-01
58,1710.0,2009,12,2009-12-01
54,1710.0,2010,1,2010-01-01
34,1709.0,2010,2,2010-02-01
67,1578.0,2010,3,2010-03-01
30,1604.0,2010,4,2010-04-01
74,1758.0,2010,5,2010-05-01
1,,2010,6,2010-06-01
71,1457.0,2010,7,2010-07-01
85,1607.0,2010,8,2010-08-01
16,1808.0,2010,9,2010-09-01
65,1866.0,2010,10,2010-10-01
2,1934.0,2010,11,2010-11-01
3,2294.0,2010,12,2010-12-01
44,,2011,1,2011-01-01
27,2334.0,2011,2,2011-02-01
41,1973.0,2011,3,2011-03-01
61,2262.0,2011,4,2011-04-01
82,2259.0,2011,5,2011-05-01
15,2217.0,2011,6,2011-06-01
42,2739.0,2011,7,2011-07-01
89,2772.0,2011,8,2011-08-01
8,3383.0,2011,9,2011-09-01
14,2869.0,2011,10,2011-10-01
48,2239.0,2011,11,2011-11-01
33,,2011,12,2011-12-01
78,2789.0,2012,1,2012-01-01
39,3455.0,2012,2,2012-02-01
17,2940.0,2012,3,2012-03-01
79,2968.0,2012,4,2012-04-01
25,3466.0,2012,5,2012-05-01
73,3037.0,2012,6,2012-06-01
47,3946.0,2012,7,2012-07-01
62,3459.0,2012,8,2012-08-01
20,3446.0,2012,9,2012-09-01
51,3258.0,2012,10,2012-10-01
55,4729.0,2012,11,2012-11-01
21,3694.0,2012,12,2012-12-01
9,4610.0,2013,1,2013-01-01
32,4841.0,2013,2,2013-02-01
80,5172.0,2013,3,2013-03-01
93,4351.0,2013,4,2013-04-01
18,4730.0,2013,5,2013-05-01
31,4706.0,2013,6,2013-06-01
66,5000.0,2013,7,2013-07-01
6,4978.0,2013,8,2013-08-01
56,5008.0,2013,9,2013-09-01
83,6094.0,2013,10,2013-10-01
13,4874.0,2013,11,2013-11-01
69,5806.0,2013,12,2013-12-01

Actual Results

This is exguess:

array([[ 362.        ],
       [ 436.        ],
       [2060.13636364],
       [ 490.        ],
       [ 508.        ],
       [2060.13636364],
       [ 393.        ],
       [ 596.        ],
       [ 634.        ],
       [ 613.        ],
       [ 545.        ],
       [ 411.        ],
       [ 398.        ],
       [ 311.        ],
       [ 664.        ],
       [ 680.        ],
       [ 507.        ],
       [ 467.        ],
       [ 566.        ],
       [ 806.        ],
       [ 732.        ],
       [ 886.        ],
       [ 776.        ],
       [ 698.        ],
       [ 875.        ],
       [ 840.        ],
       [ 724.        ],
       [1115.        ],
       [ 997.        ],
       [ 775.        ],
       [ 886.        ],
       [1041.        ],
       [1011.        ],
       [2060.13636364],
       [ 939.        ],
       [2060.13636364],
       [1004.        ],
       [1065.        ],
       [1263.        ],
       [ 962.        ],
       [2060.13636364],
       [1429.        ],
       [1205.        ],
       [ 890.        ],
       [1320.        ],
       [1276.        ],
       [1757.        ],
       [1710.        ],
       [1710.        ],
       [1709.        ],
       [1578.        ],
       [1604.        ],
       [1758.        ],
       [2060.13636364],
       [1457.        ],
       [1607.        ],
       [1808.        ],
       [1866.        ],
       [1934.        ],
       [2294.        ],
       [2060.13636364],
       [2334.        ],
       [1973.        ],
       [2262.        ],
       [2259.        ],
       [2217.        ],
       [2739.        ],
       [2772.        ],
       [3383.        ],
       [2869.        ],
       [2239.        ],
       [2060.13636364],
       [2789.        ],
       [3455.        ],
       [2940.        ],
       [2968.        ],
       [3466.        ],
       [3037.        ],
       [3946.        ],
       [3459.        ],
       [3446.        ],
       [3258.        ],
       [4729.        ],
       [3694.        ],
       [4610.        ],
       [4841.        ],
       [5172.        ],
       [4351.        ],
       [4730.        ],
       [4706.        ],
       [5000.        ],
       [4978.        ],
       [5008.        ],
       [6094.        ],
       [4874.        ],
       [5806.        ]])

2060.13636364 is the mean value of that column. I expected something much closer to its neighbors.
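A minimal sketch of what seems to be happening (my reading, not confirmed by the maintainers): with a single-feature input, a row missing that feature has no observed values at all, so no neighbor distances can be computed and KNNImputer falls back to the column mean. Adding row position as a second feature changes the result:

import numpy as np
from sklearn.impute import KNNImputer

# Single feature: the NaN row has nothing to measure distances with,
# so the fill value is the mean of the observed values.
X = np.array([[1.0], [2.0], [np.nan], [10.0], [20.0]])
print(KNNImputer(n_neighbors=2).fit_transform(X).ravel())
# [ 1.    2.    8.25 10.   20.  ]   <- 8.25 == mean(1, 2, 10, 20)

# With row position added as a feature, distances become computable
# and the fill comes from the 2 nearest rows by position.
X2 = np.column_stack([np.arange(5.0), X.ravel()])
print(KNNImputer(n_neighbors=2).fit_transform(X2)[:, 1])
# [ 1.  2.  6. 10. 20.]   <- 6.0 == mean(2, 10)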

Versions

>>> import sklearn; sklearn.show_versions()

System:
    python: 3.7.7 (default, Apr 15 2020, 05:09:04) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\flori\Anaconda3\python.exe
   machine: Windows-10-10.0.18362-SP0

Python dependencies:
       pip: 20.0.2
setuptools: 46.1.3.post20200330
   sklearn: 0.22.1
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: 0.29.17
    pandas: 1.0.3
matplotlib: 3.1.3
    joblib: 0.14.1

Built with OpenMP: True

Running in a Jupyter notebook on Windows, all installed with Anaconda.

@jnothman (Member) commented May 6, 2020 via email

@FlorinAndrei (Author) commented May 6, 2020

You don't need the whole table to do things like weighted average of nearest neighbors. On the same data, imputeTS does the right thing in R (moving average over 4 neighbors with exponential decay) even when you feed it only that column:

library(imputeTS)
df$Incoming.Examinations <- na_ma(df$Incoming.Examinations, k = 4, weighting = "exponential")
> df$Incoming.Examinations
 [1]  362.0000  436.0000  451.2400  490.0000  508.0000  495.3214  393.0000  596.0000  634.0000  613.0000  545.0000  411.0000
[13]  398.0000  311.0000  664.0000  680.0000  433.0000  467.0000  566.0000  806.0000  732.0000  886.0000  776.0000  698.0000
[25]  875.0000  840.0000  724.0000 1115.0000  997.0000  775.0000  886.0000 1041.0000 1011.0000  976.3077  939.0000 1013.6538
[37] 1004.0000 1065.0000 1263.0000  962.0000 1174.4667 1429.0000 1205.0000  890.0000 1320.0000 1276.0000 1757.0000 1710.0000
[49] 1710.0000 1709.0000 1578.0000 1604.0000 1758.0000 1630.3667 1457.0000 1607.0000 1808.0000 1866.0000 1934.0000 2294.0000
[61] 2165.8333 2334.0000 1973.0000 2262.0000 2259.0000 2217.0000 2739.0000 2772.0000 3383.0000 2869.0000 2239.0000 2796.8667
[73] 2789.0000 3455.0000 2940.0000 2968.0000 3466.0000 3037.0000 3946.0000 3459.0000 3446.0000 3258.0000 4729.0000 3694.0000
[85] 4610.0000 4841.0000 5172.0000 4351.0000 4730.0000 4706.0000 5000.0000 4978.0000 5008.0000 6094.0000 4874.0000 5806.0000

In fact, the rest of the table holds data that is merely coincident with that column (same timestamps) but causally unrelated. This is typical of a lot of timestamped real-world data. Taking the other columns into account would produce nonsense results.

The whole phenomenon is captured in that column. I'm actually quite surprised the other columns are even mentioned in this context.

@jnothman (Member) commented May 6, 2020 via email

@FlorinAndrei (Author)

You mean like this?

exguess = imputer.fit_transform(dfnew[['Incoming Examinations', 'Date']])

It's picky. The Date column is type datetime64[ns] and I get this error:

TypeError: invalid type promotion

Why can't it use the index? It's already ordered, and it's passed along if I use the double brackets [[]].
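One way around the type error (a sketch under my own assumptions, using the column names from the CSV above): KNNImputer only accepts numeric input, so the datetime column can be converted to an integer representation before imputing.

import pandas as pd
from sklearn.impute import KNNImputer

# Turn the datetime column into nanoseconds since the epoch so the
# imputer has a numeric ordering feature to compute distances with.
dfnew = df.copy(deep=True)
dfnew['DateNum'] = pd.to_datetime(dfnew['Date']).astype('int64')
exguess = KNNImputer().fit_transform(dfnew[['Incoming Examinations', 'DateNum']])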

@FlorinAndrei (Author) commented May 6, 2020

Or how about this: let me tell the library to assume things are already in the right order and that the intervals are regular. Just start with the simplest scenario. This is a very common situation in data analysis, and it's the default assumption of the equivalent R library. Even an optional flag to that effect would be great here.

@jnothman (Member) commented May 7, 2020 via email

@FlorinAndrei (Author) commented May 11, 2020

That worked! Thank you.

Now the result is a 2-D array where the first column seems to be an index and only the second column contains the imputed series. So I still have to do this:

# Turn the index into an artificial feature, impute, then keep
# only the imputed column from the output
exguess = imputer.fit_transform(dfnew['Incoming Examinations'].reset_index())
dfnew['Incoming Examinations'] = [v[1] for v in exguess]
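(A NumPy slice, exguess[:, 1], does the same as the list comprehension.)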

I know it may sound like nitpicking (especially since it works), but it still feels like the library is trying to outsmart me. I only care about one column, I clearly gave it only one column, and it should return a single column of imputed values in a format similar to the input.

In the R world you can do something equivalent to this...

df['column'] = imputer_magic(df['column'], params)

...and it just works.

It would also be hard to piece this whole thing together from the library documentation alone.

Not sure how to put it better than this - it's a matter of expectations.

@jnothman (Member) commented May 11, 2020 via email

@FlorinAndrei (Author)

> What would you change in the user guide?

Yeah, literally what you said in the last comment is very illuminating.

If you expect this to work...

df['column'] = imputer_magic(df['column'], params)

...then you're in for a hard time. The library expects multivariate input, and the order is not implied by the row structure of the input.

What you need to do instead is take that single column, give it an artificial index feature, and then extract the single column you're looking for from the imputer output:

output = imputer_magic(df['column'].reset_index(), params)
df['column'] = [row[1] for row in output]

Two lines instead of one, and slightly more verbose, not too bad. I can live with that. The workflow is about the same.

@jnothman (Member) commented May 11, 2020 via email

@FlorinAndrei (Author)

Right.

The stumbling block for me was the data order thing. With that out of the way, the rest is doable.

@amrhitch commented Jul 7, 2022

Following up on this: is there a way to pass the whole dataset to the KNNImputer but only return the imputed values for a specific column? I want to use a different imputer for different features, but at the same time I want to keep them all inside a pipeline.
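One possible approach (a sketch under my own assumptions, not a maintainer answer; column names are the ones from the CSV above): fit the imputer on all numeric features and slice out just the column you care about. sklearn.compose.ColumnTransformer can also hold several imputers inside one Pipeline, though each imputer then only sees the columns routed to it rather than the whole dataset.

import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv('pydf.csv', index_col=0)

# Fit on every numeric feature, then keep only the column of interest.
numeric_cols = ['Incoming Examinations', 'Year', 'Month']
imputed = KNNImputer().fit_transform(df[numeric_cols])
df['Incoming Examinations'] = imputed[:, numeric_cols.index('Incoming Examinations')]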
