
[MRG] Adds KNNImputer #12852

Merged: 269 commits into scikit-learn:master, Sep 3, 2019

Conversation

@thomasjpfan (Member) commented Dec 21, 2018

Reference Issues/PRs

Continues #9348, and a part of #9212

Closes #2989
Closes #9348
Closes #9212

What does this implement/fix? Explain your changes.

This PR cleans up the work done on #9348 and #9212:

  1. Adds more tests.
  2. Cleans up warnings/errors.
  3. Currently, the distance between two all-NaN vectors is NaN, so the diagonal of the distance matrix can itself be NaN. This behavior can be changed depending on our preferences.

Edit: KNNImputer has been added to this PR with the following updates:

  1. Addresses comments in #9212.
  2. Reduces the number of variables.
  3. Adds memory optimizations.
  4. Raises errors as early as possible (before any calculations).
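For context, the estimator added in this PR can be used roughly as follows (a minimal sketch; the data below is illustrative, not from the PR's test suite):

```python
import numpy as np
from sklearn.impute import KNNImputer

# One missing entry in the first column; with n_neighbors=2 it is
# filled with the mean of the two nearest rows' first-column values.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0],
              [8.0, 8.0]])

imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
# X_imputed[2, 0] is the average of 3.0 and 8.0, i.e. 5.5
```

Neighbors are found with the NaN-aware Euclidean distance, so the missing coordinate is simply skipped when comparing rows.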
ashimb9 added 30 commits Jul 31, 2017
thomasjpfan added 3 commits Aug 14, 2019
@jnothman (Member) left a comment

This is another comment I had pending. Not sure when I'll give it another full pass.

sklearn/impute/tests/test_knn.py (comment thread resolved)
@jnothman (Member) left a comment

@amueller, this one doesn't store any attributes ending in _.

Sorry, still not really able to give this a full review.

    return X[:, valid_idx]

    def _more_tags(self):
        return {'allow_nan': True}

@jnothman (Member) commented Aug 18, 2019

Suggested change:
-    return {'allow_nan': True}
+    return {'allow_nan': is_scalar_nan(self.missing_values)}
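The suggested helper makes the `allow_nan` tag depend on whether `missing_values` is actually NaN (only then can NaN appear in valid input). Roughly, it behaves like this sketch (not the actual sklearn source):

```python
import math
import numbers

def is_scalar_nan(x):
    # True only for a real-valued scalar NaN; False for other
    # missing-value sentinels such as -1 or None (sketch of the helper).
    return isinstance(x, numbers.Real) and math.isnan(float(x))

# With missing_values=float("nan") the 'allow_nan' tag would be True;
# with a sentinel like -1 it would be False.
```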
    Parameters
    ----------
    dist_pot_donors : array-like, shape=(n_receivers, n_train_samples)

@jnothman (Member) commented Aug 18, 2019

Should reformat according to the recent edicts on these matters.

@thomasjpfan (Member, Author) commented Aug 18, 2019

@jnothman check_is_fitted also checks for attributes starting with _:

    if (v.endswith("_") or v.startswith("_"))
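Under that condition, any non-dunder attribute with a leading or trailing underscore marks the estimator as fitted. A sketch of the check (based on the quoted condition, not the exact sklearn source; ToyImputer is a hypothetical estimator):

```python
def _has_fit_attributes(estimator):
    # Mirrors the quoted condition: attributes ending OR starting with an
    # underscore (but not dunders) count as evidence of fitting.
    return any(
        (v.endswith("_") or v.startswith("_")) and not v.startswith("__")
        for v in vars(estimator)
    )

class ToyImputer:  # hypothetical estimator for illustration
    def fit(self, X):
        self._fit_X = X  # leading underscore, no trailing underscore
        return self

est = ToyImputer()
before = _has_fit_attributes(est)   # False: nothing set yet
est.fit([[0.0]])
after = _has_fit_attributes(est)    # True: _fit_X counts as fitted state
```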

thomasjpfan added 2 commits Aug 18, 2019
@amueller (Member) commented Aug 22, 2019

tests are failing...

@thomasjpfan (Member, Author) commented Aug 22, 2019

The test failure was unrelated; now that #14689 is merged, merging master into this branch should fix it.

least one neighbor with a defined distance, the weighted or unweighted average
of the remaining neighbors will be used during imputation. If a feature is
always missing in training, it is removed during `transform`. For more
information on the methodology, see ref. [OL2001]_.
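A quick sketch of the behavior described in this docstring excerpt: a feature that is entirely missing at fit time is dropped by `transform` (illustrative values, not from the PR):

```python
import numpy as np
from sklearn.impute import KNNImputer

# The second column is all-missing during fit, so transform drops it.
X = np.array([[1.0, np.nan],
              [2.0, np.nan],
              [3.0, np.nan]])
Xt = KNNImputer(n_neighbors=2).fit(X).transform(X)
# Xt has shape (3, 1): only the first column survives.
```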

@jnothman (Member) commented Aug 28, 2019

Do they consider distance weighting? It might be worth noting the differences...

@thomasjpfan (Author) commented Sep 3, 2019

It looks like they use distance weighting by default.

@thomasjpfan (Author) commented Sep 3, 2019

Gerhard Tutz and Shahla Ramzan, "Improved Methods for the Imputation of Missing Data by Nearest Neighbor Methods", show in their experiments that weighted kNN performs slightly better than unweighted.

@thomasjpfan (Author) commented Sep 3, 2019

Lorenzo Beretta and Alessandro Santaniello, "Nearest neighbor imputation algorithms: a critical evaluation", likewise find that weighted is slightly better than unweighted.
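The difference under discussion can be seen directly: distance weighting pulls the imputed value toward the closer neighbor (illustrative data, not from either paper):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [10.0, np.nan],
              [11.0, 11.0],
              [12.0, 12.0]])

# The two nearest donors for the missing value are the rows starting
# with 11 (distance 1 in the first feature) and 12 (distance 2).
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)
weighted = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)
# uniform[2, 1]  == (11 + 12) / 2 = 11.5
# weighted[2, 1] == (11/1 + 12/2) / (1/1 + 1/2) = 34/3, about 11.33
```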

    ])
    knn = KNNImputer(missing_values=na, n_neighbors=2).fit(X)

    X_transform = knn.transform(X)

@jnothman (Member) commented Aug 28, 2019

Please check with test data where the feature has values.

@thomasjpfan (Author) commented Sep 3, 2019

PR updated with this suggestion.

    dist = nan_euclidean_distances(X)
    r1c1_nbor_dists = dist[1, [0, 2, 3, 4, 5]]
    r1c3_nbor_dists = dist[1, [0, 3, 4, 5, 6]]
    r1c1_nbor_wt = (1 / r1c1_nbor_dists)

@jnothman (Member) commented Aug 28, 2019

Suggested change:
-    r1c1_nbor_wt = (1 / r1c1_nbor_dists)
+    r1c1_nbor_wt = 1 / r1c1_nbor_dists
    r1c1_nbor_dists = dist[1, [0, 2, 3, 4, 5]]
    r1c3_nbor_dists = dist[1, [0, 3, 4, 5, 6]]
    r1c1_nbor_wt = (1 / r1c1_nbor_dists)
    r1c3_nbor_wt = (1 / r1c3_nbor_dists)

@jnothman (Member) commented Aug 28, 2019

Suggested change:
-    r1c3_nbor_wt = (1 / r1c3_nbor_dists)
+    r1c3_nbor_wt = 1 / r1c3_nbor_dists
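For reference, the distances in this test come from nan_euclidean_distances, which skips missing coordinates and rescales by the fraction of coordinates observed (a small illustrative example, not the test's data):

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[0.0, 1.0],
              [np.nan, 2.0],
              [4.0, 5.0]])
D = nan_euclidean_distances(X)
# Rows 0 and 1 share only the second coordinate, so the squared
# difference is rescaled by n_features / n_present = 2 / 1:
#   D[0, 1] = sqrt(2 * (1 - 2)**2) = sqrt(2)
# Rows 0 and 2 share both coordinates:
#   D[0, 2] = sqrt(4**2 + 4**2) = sqrt(32)
```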


    @pytest.mark.parametrize("na", [-1, np.nan])
    def test_knn_imputer_with_weighted_features(na):

@jnothman (Member) commented Aug 28, 2019

"weighted features" seems to be a strange name. This also appears to overlap in purpose with test_knn_imputer_weight_distance. What's the distinction?

@thomasjpfan (Author) commented Sep 3, 2019

There was no distinction. test_knn_imputer_with_weighted_features was removed and its tests were moved into test_knn_imputer_weight_distance.

@jnothman (Member) left a comment

Otherwise LGTM!!!

thomasjpfan added 6 commits Aug 31, 2019
@thomasjpfan (Member, Author) commented Sep 3, 2019

The mask of fit_X is now stored as a private attribute during fit.

@jnothman (Member) left a comment

I think this is good to hit the road.

Congrats @ashimb9 and thanks @thomasjpfan for finishing it off!

@jnothman jnothman merged commit 2d1b9e3 into scikit-learn:master Sep 3, 2019
18 checks passed
@amueller (Member) commented Sep 4, 2019

yayyyy finally ;)

@chitcode commented Sep 15, 2019

Very happy to see this.
