Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Adds KNNImputer #12852
[MRG] Adds KNNImputer #12852
Changes from 218 commits
a31c43a
eacb19d
d4049e2
cfb7c97
aa8547a
009efa9
70f294a
a54c162
c412e3b
ffe6774
c2d6a6c
9a19677
a6a0a2f
4fbbe40
29bdccb
6bb5471
e393cb0
6e5ec30
dd027f9
f33bff4
2e1ea48
1098499
a698120
95e0f56
215c8c9
fab313b
b2d5640
cd90614
2c9993a
473b191
de587b3
3d58616
5873d17
fd11002
2f41aa2
135056c
8c7190e
9616c2b
7e8f900
df9dba7
d26724a
2b327da
1704672
a1cc41d
1417f3e
34f68a5
508270c
0562054
24943ec
a449c5b
a485db9
6058548
1abbce8
0b67233
3e08209
851ab3c
7a0647f
b17906f
bd6eb69
2ea131b
b1d9397
d7cbdfb
1c9d858
01722f1
36d1d72
95f15ff
8e82d0d
f463b15
8a16e28
a93827c
5de5b60
eddf18f
2058186
69f2b7f
202cd37
6414081
2825fcc
745fa2d
44f0210
82d5d20
d8b23e6
c682361
1912611
607ff7f
87677e7
39e1da8
e1afa12
1ded8c0
367c115
eb702ef
ec16839
ac07331
b13714b
01e2a09
eba229b
73068eb
0b66143
7d7960e
dda6028
d4e4d91
315ea9f
1f5334a
6e47533
0c8215f
0768e04
e3b2a74
6cc237f
c10a171
df4ce74
bd396bf
a4fce96
2336f9d
31cf06c
114dcc5
f450104
fa3cbc8
73393c6
7527c25
9b724a6
f8067d0
48e4563
2626cb7
eadc983
8ebf67d
c702658
e335067
e449cda
6375202
824e75e
8270c8d
73200d4
45c1841
530a184
999d8fa
bd8a252
e7c82a3
fdd457f
16a07bf
5014599
14fe318
24c8c92
5727c74
0e1ccc5
ee3246d
92e93d7
447d5a9
f8db656
9cff097
d131a8d
1b95ea6
3a6a33b
c71493e
d35f131
a7bfc47
513d09d
cbff0a9
867bb54
31c3df0
cb9207a
c72c7eb
161bdce
5a032e7
3c1b608
811b6f8
59e6d30
1873a87
a739923
984e019
5f89240
5387fcf
5a318b1
dcd8098
20c8576
126fe56
7d16493
81663c6
69d2947
306e98d
b0f2e63
e9e97c9
1454d8e
3e8db3d
29405cc
25ae4ba
29fc09c
4c8cc7d
1ef0bf2
0246a52
4e101fa
5e32b5d
c838da5
a90ee5e
bbc774b
f79f24c
ff2c697
8f33d4a
3b0803c
0f5f436
4f4d4ae
e1f622d
324223a
8d8f17a
91fa0f1
6d877b5
bbf75f5
2467c07
e4da293
be0b1b4
5786f30
cf600d2
65a59b4
922bc70
272bc47
c5e0e75
52d48e7
fa51ae6
b3b3a53
7d9e2ce
58e8037
2d6beff
08409e0
93223ab
3766e2b
c4bf0ec
7e8d6ca
0cee529
4a934fe
6dcef51
2be7ac9
b789b05
ae5bcbe
2607235
e1783ac
83bca4a
68441a1
118eef2
f808d20
655f51c
606bb48
62bd37b
8201cfb
a9eefbd
1ea4456
795adbd
9a3a01a
b75c376
e533575
6672bb2
30972ae
d9dc8b9
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 'composite estimator' really necesary? ...or can the sentence be usefully simplified ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds fine to me. How would you phrase it? "Imputer transformers can be used to create pipelines that support data with missing values?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be a duplication of the preceding heading??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'completing missing values' appears to be somewhat of a mis-nomer... is this referring to Rubin's terminology of 'missing at random' vs 'missing completely at random' etc. If yes, then let's call a cat a cat.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, it just means "for replacing missing values" or something. It's unnecessarily verbose, I suppose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I misread it as "completely missing", maybe @banilo did as well? Maybe "filling in" instead of completing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"in the training set, also called donors" and remove the next sentence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this last sentence correct? I thought that's not what we're doing any more?
Shouldn't it be "one of the n_neighbors nearest neighbors with the feature present"? A logical "and" would mean that you compute the n_neighbors nearest neighbors and compute the samples with the feature present and then intersect the sets, which is not what's happening, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Not that the set of donors can be different for different features of the same sample"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the sentence about there being less donor neighbors than n_neighbors. Isn't "the training set average" the same as the average of these donors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When there are not enough donors, currently this imputer considers the whole training set as its donors.
Thinking about it now, this seems a little strange.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well we don't need to handle the insufficient donors case specially if there's no weighting. But whether we are right to disregard weighting when there are insufficient neighbours, I don't know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amueller To clarify, what I meant by "everyone is a donor" is that this implementation uses the feature mean for the entire training set (ignoring the missing values) for imputation. This is done when there are not enough donors for a given sample.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's clear, but if there aren't enough donors then at most n_neighbors available donors = everyone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jnothman summarizes my point well, i.e. without weighting there's no special case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment on the whole example: Who abouting indicating missingness as NaN in the entire example, to make it more educational. In some place, having 0 to indicate missingness and replacing by 0 may be confusing to some people.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that missing_values=0 in this example is a bit obscure... but it is a separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bigger conceptual-statistical point: I would personally not call the 5th analysis (=iterative imputation) "multivariate", because the 4th analysis (=kNN) uses the same supervised estimator as the 5th analysis, just in a nested/repeated fashion. As such either analysis 4 and 5 are multivariate or neither is. As an alternative, how about: "Iterative imputation" which is a common term in the statistical and machine literature from my experience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that should be changed in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not addressed