[MRG] Adds KNNImputer #12852

Merged on Sep 3, 2019 (269 commits; changes shown from 218 commits)
a31c43a
Addressed review comments #5
ashimb9 Jul 31, 2017
eacb19d
Edited comments
ashimb9 Jul 31, 2017
d4049e2
Merge branch 'naneuclid' into knnimpute
ashimb9 Jul 31, 2017
cfb7c97
KNN Imputation with masked_euclidean and sklearn.neighbors
ashimb9 Aug 3, 2017
aa8547a
fixed array base check
ashimb9 Aug 3, 2017
009efa9
Fix column mean to nanmean
ashimb9 Aug 3, 2017
70f294a
Added weight support and cleaned the code
ashimb9 Aug 6, 2017
a54c162
Added inf check
ashimb9 Aug 6, 2017
c412e3b
Changed error message
ashimb9 Aug 6, 2017
ffe6774
Added test suite and example. Expanded docstring description
ashimb9 Aug 8, 2017
c2d6a6c
Changes to preprocessing __init__
ashimb9 Aug 8, 2017
9a19677
Added KNNImputer exception for NaN and inf in estimator_checks
ashimb9 Aug 8, 2017
a6a0a2f
Moved _check_weights() to fit()
ashimb9 Aug 9, 2017
4fbbe40
Addressed review comments - 1
ashimb9 Aug 18, 2017
29bdccb
Make NearestNeighbor import local to fit
ashimb9 Aug 18, 2017
6bb5471
Updated doc/modules/preprocessing.rst
ashimb9 Aug 18, 2017
e393cb0
More circular import fixes
ashimb9 Aug 18, 2017
6e5ec30
pep8 fixes
ashimb9 Aug 18, 2017
dd027f9
Minor comment updates
ashimb9 Aug 18, 2017
f33bff4
Addressed review comments (part 2)
ashimb9 Aug 20, 2017
2e1ea48
Fixed pyflex issues
ashimb9 Aug 20, 2017
1098499
Added test for callable weights and updated comments.
ashimb9 Sep 3, 2017
a698120
Pep8 fixes
ashimb9 Sep 3, 2017
95e0f56
Comment, doc, and pep8 fixes
ashimb9 Sep 15, 2017
215c8c9
Docstring changes
ashimb9 Sep 15, 2017
fab313b
Changes to unit tests as per review comments
ashimb9 Sep 15, 2017
b2d5640
Tests moved to test_imputation
ashimb9 Sep 15, 2017
cd90614
Addressed review comments
ashimb9 Sep 19, 2017
2c9993a
test changes
ashimb9 Sep 19, 2017
473b191
Test changes part 2
ashimb9 Sep 19, 2017
de587b3
Fixed weight matrix shape issue
ashimb9 Sep 21, 2017
3d58616
Minor changes
ashimb9 Sep 21, 2017
5873d17
Fixed degenerate donor issue. Added tests
ashimb9 Sep 22, 2017
fd11002
Further test updates
ashimb9 Sep 22, 2017
2f41aa2
minor test fix
ashimb9 Sep 23, 2017
135056c
more minor changes
ashimb9 Sep 24, 2017
8c7190e
Moved weight_matrix inside if-weighted block
ashimb9 Sep 24, 2017
9616c2b
Addressed Review Comments
ashimb9 Dec 12, 2017
7e8f900
Fixed plot_missing example
ashimb9 Dec 12, 2017
df9dba7
Fixed Error Msg
ashimb9 Dec 12, 2017
d26724a
Modified missing check for sparse matrix
ashimb9 Dec 12, 2017
2b327da
Test update
ashimb9 Dec 12, 2017
1704672
Fixed nan check on sparse
ashimb9 Dec 17, 2017
a1cc41d
Review Comments Addressed (partial)
ashimb9 Dec 17, 2017
1417f3e
Fix merge conflit
ashimb9 Dec 19, 2017
34f68a5
Updated doc module
ashimb9 Dec 19, 2017
508270c
Added support for using only neighbors with non-missing features
ashimb9 Jan 26, 2018
0562054
Test update
ashimb9 Jan 26, 2018
24943ec
Import Numpy code for np.unique for older versions
ashimb9 Jan 26, 2018
a449c5b
Remove version check
ashimb9 Jan 26, 2018
a485db9
Minor fix
ashimb9 Jan 26, 2018
6058548
Added strategy to only use neighbors with non-nan value
ashimb9 Mar 28, 2018
1abbce8
Sync with upstream and merge with master
ashimb9 Mar 31, 2018
0b67233
Edit import path in test file
ashimb9 Mar 31, 2018
3e08209
Error fixes with imports and examples
ashimb9 Mar 31, 2018
851ab3c
Added use_complete docstring
ashimb9 Mar 31, 2018
7a0647f
Changed comments and fixed docstring
ashimb9 Mar 31, 2018
b17906f
Added more doctest fix and min neighbor check
ashimb9 Mar 31, 2018
bd6eb69
fix docs
ashimb9 Mar 31, 2018
2ea131b
Increase col_max_missing threshold for example plot
ashimb9 Mar 31, 2018
b1d9397
Lower missing rate in demo since tests are failing
ashimb9 Mar 31, 2018
d7cbdfb
Remove redundant check and changes in plot
ashimb9 Mar 31, 2018
1c9d858
Handling insufficient neighbors scenario
ashimb9 Mar 31, 2018
01722f1
Removed k actual neighbors algo
ashimb9 Apr 7, 2018
36d1d72
Addressed Comments
ashimb9 Apr 22, 2018
95f15ff
Merge branch 'master' into knnimpute
ashimb9 Apr 22, 2018
8e82d0d
Sync with upstream and merge
ashimb9 Apr 22, 2018
f463b15
Sync and merge
ashimb9 Apr 22, 2018
8a16e28
Minor bug fixes
ashimb9 Apr 28, 2018
a93827c
Removing flotsam
ashimb9 Apr 28, 2018
5de5b60
Minor bug fixes
ashimb9 Apr 29, 2018
eddf18f
Merge to upstream
ashimb9 May 26, 2018
2058186
Revert changes to sklearn/neighbors
jnothman Sep 30, 2018
69f2b7f
Merge branch 'master' into knnimpute
jnothman Sep 30, 2018
202cd37
Revert changes to deprecated file
jnothman Sep 30, 2018
6414081
COSMIT _MASKED_METRICS -> _NAN_METRICS
jnothman Sep 30, 2018
2825fcc
'NaN' no longer stands for NaN
jnothman Sep 30, 2018
745fa2d
Fix missing_values validation
jnothman Oct 3, 2018
44f0210
Attempt to reinstate neighbors changes
jnothman Oct 3, 2018
82d5d20
Fix up test failures
jnothman Oct 3, 2018
d8b23e6
Fix flake8 issues in example
jnothman Oct 3, 2018
c682361
Default force_all_finite to True rather than False
jnothman Oct 4, 2018
1912611
Fix example usage
jnothman Oct 4, 2018
607ff7f
Fix masked_euclidean testing in nearest neighbors
jnothman Oct 4, 2018
87677e7
Fix missing_values in masked_euclidean_distances
jnothman Oct 4, 2018
39e1da8
Can't subtract list and set in Py2
jnothman Oct 4, 2018
e1afa12
Merge branch 'master' into knnimpute
jnothman Jan 17, 2019
1ded8c0
RFC: Reduce diffs
thomasjpfan Dec 20, 2018
367c115
WIP
thomasjpfan Dec 20, 2018
eb702ef
MRG: Reduce diffs
thomasjpfan Dec 20, 2018
ec16839
TST: Fix
thomasjpfan Dec 20, 2018
ac07331
DOC: Refactor
thomasjpfan Dec 20, 2018
b13714b
WIP
thomasjpfan Dec 20, 2018
01e2a09
RFC: Adjustments
thomasjpfan Dec 21, 2018
eba229b
ENH: Completes implementation
thomasjpfan Dec 21, 2018
73068eb
ENH: Adds to __init__
thomasjpfan Dec 21, 2018
0b66143
DOC: Adds whats_new
thomasjpfan Dec 21, 2018
7d7960e
DOC: Adds autosummary
thomasjpfan Dec 21, 2018
dda6028
RFC: Minor
thomasjpfan Dec 21, 2018
d4e4d91
DOC: Grammer
thomasjpfan Dec 21, 2018
315ea9f
TST: Increases coverage
thomasjpfan Dec 21, 2018
1f5334a
TST: Increases coverage
thomasjpfan Dec 21, 2018
6e47533
RFC: Minor
thomasjpfan Dec 21, 2018
0c8215f
DOC: Adjust order
thomasjpfan Dec 21, 2018
0768e04
RFC: Removes euclidean metric from neighbors
thomasjpfan Jan 4, 2019
e3b2a74
RFC
thomasjpfan Jan 4, 2019
6cc237f
ENH: Adds support for sparse
thomasjpfan Jan 10, 2019
c10a171
RFC: Removes unused
thomasjpfan Jan 10, 2019
df4ce74
DOC Fix warnings in examples (#12654)
adrinjalali Jan 17, 2019
bd396bf
DOC Add an example of inductive clustering (#10852)
chkoar Jan 17, 2019
a4fce96
DOC credit multiple authors of new example
jnothman Jan 17, 2019
2336f9d
DOC fix plot_iris references after files renamed
jnothman Jan 17, 2019
31cf06c
MNT Remove accidentally added example
jnothman Jan 17, 2019
114dcc5
docstring fix X in predict/predict_proba (#13004)
agamemnonc Jan 17, 2019
f450104
MNT more informative warning in estimator_checks (#13002)
jnothman Jan 17, 2019
fa3cbc8
MAINT Pin numpy version 1.5.* for pypy (#13011)
rth Jan 18, 2019
73393c6
DOC Label Spreading clumping factor must be in (0, 1) (#13015)
zjpoh Jan 19, 2019
7527c25
FIX Parallelisation of decomposition/sparse_encode (#13005)
nixphix Jan 20, 2019
9b724a6
FIX Convert the negative indices to positive ones in ColumnTransforme…
pierretallotte Jan 22, 2019
f8067d0
MAINT Fix PyPy CI with numpy 1.15 (#13018)
rth Jan 22, 2019
48e4563
DOC Move datasets.mldata_filename to deprecated section in classes.rst
qinhanmin2014 Jan 25, 2019
2626cb7
FIX Fix shuffle not passed in MLP (#12582)
samwaterbury Jan 25, 2019
eadc983
[MRG] Configure lgtm.yml for CPP (#13044)
thomasjpfan Jan 26, 2019
8ebf67d
FIX float16 overflow on accumulator operations in StandardScaler (#13…
baluyotraf Jan 26, 2019
c702658
TST Use random state to initialize MLPClassifier. (#12892)
xhan7279 Jan 27, 2019
e335067
API Deprecate externals.six (#12916)
qinhanmin2014 Jan 27, 2019
e449cda
DOC Remove outdated doc in KBinsDiscretizer (#13047)
qinhanmin2014 Jan 27, 2019
6375202
DOC Remove outdated doc in KBinsDiscretizer
qinhanmin2014 Jan 27, 2019
824e75e
EXA Improve example plot_svm_anova.py (#11731)
qinhanmin2014 Jan 28, 2019
8270c8d
DOC Correct TF-IDF formula in TfidfTransformer comments. (#13054)
vishaalkapoor Jan 29, 2019
73200d4
FIX an issue w/ large sparse matrix indices in CountVectorizer (#11295)
gvacaliuc Jan 30, 2019
45c1841
DOC More details about the attributes in MinMaxScaler (#13029)
qinhanmin2014 Jan 30, 2019
530a184
DOC Clean up the advanced installation doc to remove python < 3.5 par…
jeremiedbb Jan 30, 2019
999d8fa
API NMF and non_negative_factorization have inconsistent default init…
zjpoh Jan 30, 2019
bd8a252
MAINT: pin flake8 to stable version (#13066)
glemaitre Jan 30, 2019
e7c82a3
EXA: fix xlabel and ylabel in plot_cv_digits.py (#13067)
qinhanmin2014 Jan 30, 2019
fdd457f
MAINT: remove flake8 pinning in circle ci (#13071)
glemaitre Jan 30, 2019
16a07bf
DOC Adds an example to PatchExtractor (#12819)
CatChenal Jan 31, 2019
5014599
[MRG] Use Scipy cython BLAS API instead of bundled CBLAS (#12732)
jeremiedbb Feb 1, 2019
14fe318
MNT Ignore PendingDeprecationWarnings of matrix subclass with pytest …
NicolasHug Feb 2, 2019
24c8c92
MNT More clean up after we remove python < 3.5 (#13078)
qinhanmin2014 Feb 2, 2019
5727c74
MNT remove __future__ imports (#12791)
surgan12 Feb 2, 2019
0e1ccc5
MNT do not call fit twice in TransformedTargetetRegressor (#11641)
glemaitre Feb 3, 2019
ee3246d
FIX add support for non numeric values in MissingIndicator (#13046)
glemaitre Feb 3, 2019
92e93d7
MNT redundant from __future__ import (#13079)
qinhanmin2014 Feb 3, 2019
447d5a9
CI install pillow in pypy job (#13081)
qinhanmin2014 Feb 4, 2019
f8db656
MNT Remove utils.validation._shape_repr (#13083)
qinhanmin2014 Feb 4, 2019
9cff097
CI install pillow in Travis cron job (#13080)
qinhanmin2014 Feb 4, 2019
d131a8d
MNT Remove utils.fixes.euler_gamma (#13082)
qinhanmin2014 Feb 4, 2019
1b95ea6
MNT Update setup and travis to support OpenMP (#13053)
jeremiedbb Feb 4, 2019
3a6a33b
DOC change from 'means_prior' to 'mean_prior' in BayesianGaussianMix…
walk-to-work Feb 4, 2019
c71493e
FIX added assertion for ValueError when cv iterator is empty (#12961)
esvhd Feb 5, 2019
d35f131
[MRG] DOC Adds explicit reference to clang (#13093)
thomasjpfan Feb 5, 2019
a7bfc47
DOC Refer to ONNX (#13095)
jnothman Feb 6, 2019
513d09d
[MRG] DOC Adds _pairwise property to dev docs (#13094)
thomasjpfan Feb 7, 2019
cbff0a9
DOC Correct code example in doc/developers/contributing.rst (#13098)
bharatr21 Feb 7, 2019
867bb54
RFC Better variable names
thomasjpfan Feb 7, 2019
31c3df0
DOC Update version to 0.21
thomasjpfan Feb 7, 2019
cb9207a
RFC Address comments
thomasjpfan Feb 7, 2019
c72c7eb
RFC: Updates metric name to nan_euclidean
thomasjpfan Feb 7, 2019
161bdce
RFC Address comments
thomasjpfan Feb 7, 2019
5a032e7
RFC Lowers the number of variables
thomasjpfan Feb 7, 2019
3c1b608
Merge remote-tracking branch 'upstream/master' into masked_euclidean_…
thomasjpfan Feb 7, 2019
811b6f8
RFC Address comments
thomasjpfan Feb 7, 2019
59e6d30
RFC Uses pytest
thomasjpfan Feb 7, 2019
1873a87
STY Flake8
thomasjpfan Feb 7, 2019
a739923
RFC Lowers LOC
thomasjpfan Feb 8, 2019
984e019
RFC
thomasjpfan Feb 8, 2019
5f89240
RFC Rename to statistic
thomasjpfan Feb 8, 2019
5387fcf
RFC Uses less memory for distance matrix
thomasjpfan Feb 8, 2019
5a318b1
STY Spelling
thomasjpfan Feb 8, 2019
dcd8098
REV take_along_axis not in numpy 1.11.0
thomasjpfan Feb 8, 2019
20c8576
TST Checks nan and normal euclidean distance
thomasjpfan Feb 8, 2019
126fe56
ENH Updates whats_new
thomasjpfan Feb 8, 2019
7d16493
DOC English
thomasjpfan Feb 8, 2019
81663c6
DOC Rewords _compute_impute doc
thomasjpfan Feb 11, 2019
69d2947
RFC _impute directly imputes X
thomasjpfan Feb 11, 2019
306e98d
RF
thomasjpfan Feb 18, 2019
b0f2e63
Merge remote-tracking branch 'upstream/master' into masked_euclidean_…
thomasjpfan Feb 18, 2019
e9e97c9
RF Address comments
thomasjpfan Feb 19, 2019
1454d8e
RFC Uses to matrix
thomasjpfan Feb 19, 2019
3e8db3d
DOC Changes variable name to reference samples and features
thomasjpfan Feb 19, 2019
29405cc
DOC Grammer
thomasjpfan Feb 19, 2019
25ae4ba
RFC Moves _get_mask to utils
thomasjpfan Feb 19, 2019
29fc09c
BUG Fix
thomasjpfan Feb 19, 2019
4c8cc7d
RFC Minimizes diff
thomasjpfan Feb 19, 2019
1ef0bf2
RFC Moves _get_mask to utils
thomasjpfan Feb 19, 2019
0246a52
DOC Updates whats new
thomasjpfan Feb 25, 2019
4e101fa
DOC Updates name
thomasjpfan Feb 25, 2019
5e32b5d
CLN Address comments
thomasjpfan Feb 25, 2019
c838da5
Merge remote-tracking branch 'upstream/master' into masked_euclidean
thomasjpfan Feb 27, 2019
a90ee5e
STY flake8
thomasjpfan Feb 27, 2019
bbc774b
BUG Fix
thomasjpfan Feb 27, 2019
f79f24c
BUG Add allow_nan tag
thomasjpfan Feb 28, 2019
ff2c697
TST: Improves test coverage
thomasjpfan Feb 28, 2019
8f33d4a
CLN Address comments
thomasjpfan May 8, 2019
3b0803c
Merge remote-tracking branch 'upstream/master' into masked_euclidean
thomasjpfan May 8, 2019
0f5f436
DOC Moves whats new
thomasjpfan May 8, 2019
4f4d4ae
CLN Address diffs
thomasjpfan May 8, 2019
e1f622d
DOC Update version
thomasjpfan May 8, 2019
324223a
DOC Rewords
thomasjpfan Jun 17, 2019
8d8f17a
Merge remote-tracking branch 'upstream/master' into masked_euclidean
thomasjpfan Jun 17, 2019
91fa0f1
CLN Move to _knn
thomasjpfan Jun 17, 2019
6d877b5
DOC Moves whats_new
thomasjpfan Jun 17, 2019
bbf75f5
CLN Less diffs
thomasjpfan Jun 17, 2019
2467c07
DOC Rewords user guide
thomasjpfan Jun 17, 2019
e4da293
ENH Simplifies knn imputer
thomasjpfan Jun 17, 2019
be0b1b4
ENH Address comments
thomasjpfan Jun 18, 2019
5786f30
CLN Rename variable
thomasjpfan Jun 18, 2019
cf600d2
ENH Improves float32 handling
thomasjpfan Jun 19, 2019
65a59b4
DOC Adds comments
thomasjpfan Jun 19, 2019
922bc70
Merge remote-tracking branch 'upstream/master' into masked_euclidean
thomasjpfan Jun 19, 2019
272bc47
BUG Makes fit_X private
thomasjpfan Jun 19, 2019
c5e0e75
Merge remote-tracking branch 'upstream/master' into masked_euclidean
thomasjpfan Jul 17, 2019
52d48e7
Merge remote-tracking branch 'upstream/master' into masked_euclidean
thomasjpfan Jul 18, 2019
fa51ae6
Merge remote-tracking branch 'upstream/master' into masked_euclidean
thomasjpfan Jul 26, 2019
b3b3a53
CLN Moves force_all_finite up
thomasjpfan Jul 26, 2019
7d9e2ce
CLN Address comments
thomasjpfan Jul 26, 2019
58e8037
REV Remove missing test
thomasjpfan Jul 26, 2019
2d6beff
ENH Removes sparse support
thomasjpfan Jul 26, 2019
08409e0
STY Flake8
thomasjpfan Jul 26, 2019
93223ab
ENH Moves all _get_mask to utils
thomasjpfan Jul 29, 2019
3766e2b
Merge branch 'master' into masked_euclidean
amueller Jul 29, 2019
c4bf0ec
CLN Completely remove sparse support
thomasjpfan Jul 29, 2019
7e8d6ca
Merge remote-tracking branch 'upstream/master' into masked_euclidean
thomasjpfan Jul 30, 2019
0cee529
Merge remote-tracking branch 'upstream/master' into masked_euclidean
thomasjpfan Jul 31, 2019
4a934fe
CLN Address comments
thomasjpfan Jul 31, 2019
6dcef51
Remove statistics_ from KNNImputer (#8)
jnothman Aug 1, 2019
2be7ac9
Revert "Remove statistics_ from KNNImputer (#8)"
thomasjpfan Aug 1, 2019
b789b05
TST Adds more tests
thomasjpfan Aug 1, 2019
ae5bcbe
Merge remote-tracking branch 'upstream/master' into masked_euclidean
thomasjpfan Aug 1, 2019
2607235
TST Adds test on dropping features
thomasjpfan Aug 1, 2019
e1783ac
Merge remote-tracking branch 'upstream/master' into masked_euclidean
thomasjpfan Aug 6, 2019
83bca4a
WIP
thomasjpfan Aug 7, 2019
68441a1
ENH Adjusts handling when there is not enough neighbors
thomasjpfan Aug 7, 2019
118eef2
CLN Address comments
thomasjpfan Aug 14, 2019
f808d20
CLN Address comments
thomasjpfan Aug 14, 2019
655f51c
Merge remote-tracking branch 'upstream/master' into masked_euclidean
thomasjpfan Aug 14, 2019
606bb48
ENH Updates check_is_fitted
thomasjpfan Aug 14, 2019
62bd37b
CLN Removes squared
thomasjpfan Aug 14, 2019
8201cfb
CLN Address jnothman's comment
thomasjpfan Aug 15, 2019
a9eefbd
CLN Refactor and improve docstring
thomasjpfan Aug 18, 2019
1ea4456
CLN Address joels comments
thomasjpfan Aug 18, 2019
795adbd
Merge remote-tracking branch 'upstream/master' into masked_euclidean
thomasjpfan Aug 22, 2019
9a3a01a
Merge remote-tracking branch 'upstream/master' into masked_euclidean
thomasjpfan Aug 31, 2019
b75c376
STY Fix
thomasjpfan Aug 31, 2019
e533575
CLN Combines weighted tests
thomasjpfan Sep 3, 2019
6672bb2
BUG Fixes bug with test without missing values
thomasjpfan Sep 3, 2019
30972ae
STY Fix
thomasjpfan Sep 3, 2019
d9dc8b9
ENH Stores fit_X mask during fit
thomasjpfan Sep 3, 2019
2 changes: 2 additions & 0 deletions doc/modules/classes.rst
@@ -662,6 +662,7 @@ Kernels:
impute.SimpleImputer
impute.IterativeImputer
impute.MissingIndicator
impute.KNNImputer

.. _kernel_approximation_ref:

@@ -964,6 +965,7 @@ See the :ref:`metrics` section of the user guide for further details.
metrics.pairwise.laplacian_kernel
metrics.pairwise.linear_kernel
metrics.pairwise.manhattan_distances
metrics.pairwise.nan_euclidean_distances
metrics.pairwise.pairwise_kernels
metrics.pairwise.polynomial_kernel
metrics.pairwise.rbf_kernel
51 changes: 51 additions & 0 deletions doc/modules/impute.rst
@@ -32,6 +32,14 @@ missing values (e.g. :class:`impute.IterativeImputer`).
Univariate feature imputation
=============================

Imputer transformers can be used in a Pipeline as a way to build a composite
estimator that supports imputation. See
:ref:`sphx_glr_auto_examples_plot_missing_values.py`.

Reviewer comments:
- Contributor: Is 'composite estimator' really necessary? ...or can the sentence be usefully simplified?
- Member: Sounds fine to me. How would you phrase it? "Imputer transformers can be used to create pipelines that support data with missing values"?


Simple univariate imputation
============================
Reviewer comment:
- Member: This seems to be a duplication of the preceding heading??

The :class:`SimpleImputer` class provides basic strategies for imputing missing
values. Missing values can be imputed with a provided constant value, or using
the statistics (mean, median or most frequent) of each column in which the
@@ -178,6 +186,49 @@ References
.. [2] Roderick J A Little and Donald B Rubin (1986). "Statistical Analysis
with Missing Data". John Wiley & Sons, Inc., New York, NY, USA.

.. _knnimpute:

Nearest neighbors imputation
============================

The :class:`KNNImputer` class provides imputation for completing missing
values using the k-Nearest Neighbors approach. Each sample's missing values
are imputed using values from ``n_neighbors`` nearest neighbors found in the
training set. In this context, a donor is defined to be a neighbor that
contributes to the imputation of a given sample. For each missing feature in a
sample, the donors are selected such that they have the feature present and
they are one of the ``n_neighbors`` nearest neighbors.

Reviewer comments:
- Contributor: 'completing missing values' appears to be somewhat of a misnomer... is this referring to Rubin's terminology of 'missing at random' vs 'missing completely at random'? If yes, then let's call a cat a cat.
- Member: No, it just means "for replacing missing values" or something. It's unnecessarily verbose, I suppose.
- Member: I misread it as "completely missing"; maybe @banilo did as well? Maybe "filling in" instead of "completing"?
- Member (on the donor definition): "in the training set, also called donors" and remove the next sentence?
- Member (on donor selection): Is this last sentence correct? I thought that's not what we're doing any more? Shouldn't it be "one of the n_neighbors nearest neighbors with the feature present"? A logical "and" would mean that you compute the n_neighbors nearest neighbors and compute the samples with the feature present and then intersect the sets, which is not what's happening, right?

Each sample can potentially have multiple sets of ``n_neighbors`` donors
depending on the particular feature being imputed.

Reviewer comment:
- Member: "Note that the set of donors can be different for different features of the same sample"?

Each missing feature is then imputed as the average, either weighted or
unweighted, of these neighbors. When the number of donor neighbors is less
than ``n_neighbors``, the training set average for that feature is
used for imputation. When a sample has more than a ``feature_max_missing``
fraction of its features missing, then it is excluded from being a donor for
imputation. For more information on the methodology, see ref. [OLGA]_.

Reviewer comments:
- Member: I don't understand the sentence about there being fewer donor neighbors than n_neighbors. Isn't "the training set average" the same as the average of these donors?
- Member Author: When there are not enough donors, currently this imputer considers the whole training set as its donors. Thinking about it now, this seems a little strange.
- Member: Well, we don't need to handle the insufficient-donors case specially if there's no weighting. But whether we are right to disregard weighting when there are insufficient neighbours, I don't know.
- Member Author: @amueller To clarify, what I meant by "everyone is a donor" is that this implementation uses the feature mean for the entire training set (ignoring the missing values) for imputation. This is done when there are not enough donors for a given sample.
- Member: I think it's clear, but if there aren't enough donors then at most n_neighbors available donors = everyone.
- Member: @jnothman summarizes my point well, i.e. without weighting there's no special case.

The following snippet demonstrates how to replace missing values,
encoded as ``np.nan``, using the mean feature value of the two nearest
neighbors of samples with missing values::

>>> import numpy as np
>>> from sklearn.impute import KNNImputer
>>> nan = np.nan
>>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
>>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
>>> imputer.fit_transform(X)
array([[1. , 2. , 4. ],
[3. , 4. , 3. ],
[5.5, 6. , 5. ],
[8. , 8. , 7. ]])

.. [OLGA] Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor
Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, Missing value
estimation methods for DNA microarrays, BIOINFORMATICS Vol. 17 no. 6, 2001
Pages 520-525.
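The snippet in the documentation above uses uniform weights. As a sketch of how the ``weights`` option interacts with the nan-aware distances (assuming the ``weights="distance"`` option of the released ``KNNImputer`` API, which applies inverse-distance weighting as in ``sklearn.neighbors``), the same data can be imputed with closer donors counting more:

```python
import numpy as np
from sklearn.impute import KNNImputer

nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]

# Inverse-distance weighting: closer donors contribute more to the average.
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_imputed = imputer.fit_transform(X)

# For X[0], the two donors for feature 2 are rows 1 and 2, at
# nan-Euclidean distances sqrt(12) and sqrt(48). The inverse-distance
# weighted mean of their feature-2 values (3 and 5) is
# (3/sqrt(12) + 5/sqrt(48)) / (1/sqrt(12) + 1/sqrt(48)) = 11/3.
print(X_imputed)
```

With uniform weights the same cell would be the plain mean, 4, as in the doctest above; with equidistant donors (row 2, feature 0) the two weightings agree.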

.. _missing_indicator:

Marking imputed values
5 changes: 5 additions & 0 deletions doc/whats_new/v0.21.rst
@@ -592,6 +592,11 @@ Support for Python 3.4 and below has been officially dropped.
``fit.predict`` were not equivalent. :pr:`13142` by
:user:`Jérémie du Boisberranger <jeremiedbb>`.

- |Feature| Added the :func:`metrics.nan_euclidean_distances` metric, which
calculates euclidean distances in the presence of missing values.
:issue:`12852` by :user:`Ashim Bhattarai <ashimb9>` and
:user:`Thomas Fan <thomasjpfan>`.
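
The metric named in this entry skips coordinates where either sample is missing and rescales the remaining squared distance by ``n_features / n_present_coordinates``. A minimal sketch (assuming the public name ``sklearn.metrics.pairwise.nan_euclidean_distances``, as the entry states):

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[0.0, np.nan], [3.0, 4.0], [0.0, 0.0]])

# Pairwise distances; missing coordinates are excluded from each pair's
# squared distance, which is then scaled up by n_features / n_present.
D = nan_euclidean_distances(X)

# d(X[0], X[1]): only feature 0 is shared -> sqrt(2/1 * 3**2) = sqrt(18)
# d(X[1], X[2]): both features present  -> sqrt(3**2 + 4**2) = 5
print(D)
```

Note that d(X[0], X[2]) is 0 even though the rows differ in the missing coordinate, since only shared present coordinates enter the distance.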

:mod:`sklearn.model_selection`
..............................

7 changes: 7 additions & 0 deletions doc/whats_new/v0.22.rst
@@ -39,6 +39,13 @@ Changelog
:pr:`123456` by :user:`Joe Bloggs <joeongithub>`.
where 123456 is the *pull request* number, not the issue number.

:mod:`sklearn.impute`
.....................

- |MajorFeature| Added :class:`impute.KNNImputer`, to impute missing values using
k-Nearest Neighbors. :issue:`12852` by :user:`Ashim Bhattarai <ashimb9>` and
:user:`Thomas Fan <thomasjpfan>`.

:mod:`sklearn.svm`
..................

17 changes: 15 additions & 2 deletions examples/impute/plot_missing_values.py
@@ -8,6 +8,9 @@
The median is a more robust estimator for data with high magnitude variables
which could dominate results (otherwise known as a 'long tail').

With ``KNNImputer``, missing values can be imputed using the weighted
or unweighted mean of the desired number of nearest neighbors.

Another option is the :class:`sklearn.impute.IterativeImputer`. This uses
round-robin linear regression, treating every variable as an output in
turn. The version implemented assumes Gaussian (output) variables. If your
@@ -27,7 +30,8 @@
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline, make_union
from sklearn.impute import SimpleImputer, IterativeImputer, MissingIndicator
from sklearn.impute import (
SimpleImputer, KNNImputer, IterativeImputer, MissingIndicator)
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
amueller marked this conversation as resolved.
Show resolved Hide resolved
Expand Down Expand Up @@ -79,6 +83,13 @@ def get_results(dataset):
imputer = SimpleImputer(missing_values=0, strategy="mean")
mean_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)

# Estimate the score after kNN-imputation of the missing values
knn_rf_estimator = make_pipeline(
KNNImputer(missing_values=0, sample_max_missing=0.99),
Reviewer comments:
- Contributor: General comment on the whole example: How about indicating missingness as NaN in the entire example, to make it more educational? In some places, having 0 to indicate missingness and replacing by 0 may be confusing to some people.
- Member: Agreed that missing_values=0 in this example is a bit obscure... but it is a separate issue.

RandomForestRegressor(random_state=0, n_estimators=100))
knn_impute_scores = cross_val_score(knn_rf_estimator, X_missing, y_missing,
scoring='neg_mean_squared_error')

# Estimate the score after iterative imputation of the missing values
imputer = IterativeImputer(missing_values=0,
random_state=0,
@@ -90,6 +101,7 @@ def get_results(dataset):
return ((full_scores.mean(), full_scores.std()),
(zero_impute_scores.mean(), zero_impute_scores.std()),
(mean_impute_scores.mean(), mean_impute_scores.std()),
(knn_impute_scores.mean(), knn_impute_scores.std()),
(iterative_impute_scores.mean(), iterative_impute_scores.std()))


@@ -107,8 +119,9 @@ def get_results(dataset):
x_labels = ['Full data',
'Zero imputation',
'Mean Imputation',
'KNN Imputation',
'Multivariate Imputation']
Reviewer comments:
- Contributor: Bigger conceptual-statistical point: I would personally not call the 5th analysis (= iterative imputation) "multivariate", because the 4th analysis (= kNN) uses the same supervised estimator as the 5th analysis, just in a nested/repeated fashion. As such, either analyses 4 and 5 are both multivariate or neither is. As an alternative, how about "Iterative imputation", which is a common term in the statistical and machine learning literature in my experience?
- Member: I agree that should be changed in this PR.
- Member: Not addressed

colors = ['r', 'g', 'b', 'orange']
colors = ['r', 'g', 'b', 'orange', 'black']

# plot diabetes results
plt.figure(figsize=(12, 6))
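The pattern this example adds, imputation as a pipeline step scored with cross-validation, can be sketched on synthetic data. The dataset and downstream estimator here are placeholders, not the ones from the example:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.randn(100)

# Knock out roughly 10% of the entries so the imputer has work to do.
X[rng.rand(*X.shape) < 0.1] = np.nan

# Putting the imputer inside the pipeline keeps the fit/transform split
# honest: donor neighbors come only from each training fold.
model = make_pipeline(KNNImputer(n_neighbors=3), Ridge())
scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
print(scores.mean())
```

Fitting the imputer on the full data before cross-validating would leak information from the test folds into the imputed values, which is exactly what the pipeline form avoids.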