Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] ENH add missing indicator in KNNImputer #15010

Merged

Conversation

@glemaitre
Copy link
Contributor

glemaitre commented Sep 18, 2019

Add the add_indicator to KNNImputer for consistency with other imputers.
Since all imputers would share this feature, we can create a private base class.

TODO:

  • add the missing indicator to KNNImputer
  • potentially create a base class for all future imputer
  • add either common test / test for KNNImputer

For another PR

  • update the example to use the feature + notebook style
@thomasjpfan

This comment has been minimized.

Copy link
Member

thomasjpfan commented Sep 18, 2019

You got this to faster than I did. :)

potentially create a base class for all future imputer

Since add_indicator does something fairly small, I am okay with this.

glemaitre added 2 commits Sep 18, 2019
@glemaitre glemaitre changed the title [WIP] EHN add missing indicator in KNNImputer [MRG] EHN add missing indicator in KNNImputer Sep 18, 2019
@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Sep 18, 2019

I think this is ready for the first round of review. The example will come in another PR. It will keep the diff quite manageable.

@jnothman @thomasjpfan Do you want to have a look at it?

@thomasjpfan thomasjpfan self-assigned this Sep 18, 2019
@glemaitre glemaitre added this to WAITING FOR REVIEW in Guillaume's pet Sep 18, 2019
Copy link
Member

thomasjpfan left a comment

This type of abstraction usually makes code harder to read for people unfamiliar with the code because one needs to go to the parent class to know information such as "when is _fit called", "when is _transform" called.

An alternative would be something like this:

class _BaseImputer:
    def __init__(self, missing_values=np.nan, add_indicator=False):
        self.missing_values = missing_values
        self.add_indicator = add_indicator

    def _fit_indicator(self, X):
        if self.add_indicator:
            self.indicator_ = MissingIndicator(
                missing_values=self.missing_values, error_on_new=False)
            self.indicator_.fit(X)
        else:
            self.indicator_ = None

    def _concatenate_indicator(self, X):
        if self.add_indicator is None:
            return X
        X_trans_indicator = self.indicator_.transform(X)
        hstack = sparse.hstack if sparse.issparse(X) else np.hstack
        return hstack((X_trans_imputer, X_trans_indicator))

class SimpleImputer(_BaseImputer):

    def fit(self, X):
        # fit things
        super()._fit_indicator(X)
        return self

    def transform(self, X):
        # do transform things
        return super()._concatenate_indicator(self, X)

Personally, I can read both abstraction styles about the same. But I think the above implementation is a little more explicit.

@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Sep 18, 2019

The + of the current implementation is that you will not be able to code an imputer without defining _fit and _transform

@thomasjpfan

This comment has been minimized.

Copy link
Member

thomasjpfan commented Sep 18, 2019

We need to decide which design is easier to maintain.

The current design, has the parent call functions that a child will define. This API exposed this contract through the abstractness of _fit and _transform.

My proposal has the child calls methods in the parent class, which I consider more explicit. The parent class provides methods that the child class calls to do common operations. (The operations may change state, or use that state to do transform)

@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Sep 18, 2019

My proposal has the child calls methods in the parent class

So I still need to go to the parents to know what it does.

Where I find your design a bit dangerous is that nothing will avoid me to create a new imputer without calling _concatenate_indicator and _fit_indicator (in some way it makes things more flexible as well). However, using abstractmethod will raise an error for a newcomer implementing an imputer telling that you need to implement both _fit_*** and _transform_*** (and makes stuff really rigid). In this case, I think that we always gonna need the indicator the having no flexibility would not be that bad, isn't it?

NB: _fit could be renamed to _fit_imputer and _transform to _transform_imputer to have explicit name.

This said, if @jnothman could give some advice, I'll be happy to go with one design or the other (I am far to be good with those).

@thomasjpfan

This comment has been minimized.

Copy link
Member

thomasjpfan commented Sep 18, 2019

So I still need to go to the parents to know what it does.

You would know to go to the parent class to see what it does, since it explicitly calls the parent method. Imagine a new contributor with this PR:

  1. Goes to SimpleImputer and tries to find fit.
  2. Notice it is not there, so we navigate to the parent class and find fit.
  3. Reads fit and sees there is indicator logic.
  4. Then notice it calls _fit_imputer.
  5. Navigate to _fit_imputer and sees its an abstract method.
  6. Navigate back to SimpleImputer._fit_imputer, now we can see how the imputer works.

With the my proposal:

  1. Goes to SimpueImputer and finds fit.
  2. Learns how the imputer works.
  3. Notice super()._fit_indicator, so we go to the super class and see how that works.

I am also happy with either design.

@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Sep 18, 2019

You convinced me then. I did not pay attention enough to the super() call which makes thing obvious.

glemaitre added 3 commits Sep 18, 2019
@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Sep 18, 2019

@thomasjpfan With the refactoring, I think this is fine. I just find a bit complex the handling for copying X but I did not come with something easier right now.

sklearn/impute/_base.py Outdated Show resolved Hide resolved
sklearn/impute/_base.py Outdated Show resolved Hide resolved
sklearn/impute/_base.py Outdated Show resolved Hide resolved
glemaitre added 3 commits Oct 2, 2019
@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Oct 22, 2019

@thomasjpfan Do you have any other comments. This might be good to push this one forward for the release to have consistent imputer.

sklearn/impute/_base.py Show resolved Hide resolved
sklearn/impute/_base.py Outdated Show resolved Hide resolved
sklearn/impute/_base.py Show resolved Hide resolved
sklearn/impute/_base.py Outdated Show resolved Hide resolved
sklearn/impute/tests/test_common.py Show resolved Hide resolved
sklearn/impute/tests/test_base.py Outdated Show resolved Hide resolved
sklearn/impute/tests/test_base.py Outdated Show resolved Hide resolved
sklearn/impute/_base.py Show resolved Hide resolved
Copy link
Contributor

NicolasHug left a comment

Made a few comments but looks good overall.

sklearn/impute/tests/test_base.py Outdated Show resolved Hide resolved
sklearn/impute/_knn.py Outdated Show resolved Hide resolved
sklearn/impute/_base.py Outdated Show resolved Hide resolved
sklearn/impute/_base.py Outdated Show resolved Hide resolved
sklearn/impute/_base.py Show resolved Hide resolved
sklearn/impute/_base.py Show resolved Hide resolved
sklearn/impute/_iterative.py Outdated Show resolved Hide resolved
sklearn/impute/_base.py Outdated Show resolved Hide resolved
sklearn/impute/_base.py Show resolved Hide resolved
glemaitre and others added 8 commits Oct 24, 2019
Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>
Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>
… into missing_indicator_knnimputer
@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Oct 25, 2019

I fixed PEP8

Copy link
Contributor

NicolasHug left a comment

LGTM, Thanks @glemaitre

sklearn/impute/_base.py Outdated Show resolved Hide resolved
@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Oct 25, 2019

The failure is due to an update of Python 3.8.

Copy link
Member

thomasjpfan left a comment

LGTM

doc/whats_new/v0.22.rst Outdated Show resolved Hide resolved
Co-Authored-By: Thomas J Fan <thomasjpfan@gmail.com>
@glemaitre

This comment has been minimized.

Copy link
Contributor Author

glemaitre commented Oct 25, 2019

Done :)

return X_imputed

hstack = sparse.hstack if sparse.issparse(X_imputed) else np.hstack
if X_indicator is not None:

This comment has been minimized.

Copy link
@thomasjpfan

thomasjpfan Oct 25, 2019

Member

Last nit:

Can we change the order of this logic to error first:

if X_indicator is None:
    raise ValueError(...)

hstack = ...
return hstack(...)
@jnothman jnothman changed the title [MRG] EHN add missing indicator in KNNImputer [MRG] ENH add missing indicator in KNNImputer Oct 27, 2019
glemaitre added 2 commits Oct 28, 2019
nit
@thomasjpfan thomasjpfan merged commit 332759f into scikit-learn:master Oct 28, 2019
17 of 19 checks passed
17 of 19 checks passed
scikit-learn.scikit-learn Build #20191028.24 failed
Details
scikit-learn.scikit-learn (Linux pylatest_pip_openblas_pandas) Linux pylatest_pip_openblas_pandas failed
Details
LGTM analysis: C/C++ No code changes detected
Details
LGTM analysis: JavaScript No code changes detected
Details
LGTM analysis: Python No new or fixed alerts
Details
ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: doc Your tests passed on CircleCI!
Details
ci/circleci: doc artifact Link to 0/doc/_changed.html
Details
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 97.21%)
Details
codecov/project Absolute coverage decreased by -<.01% but relative coverage increased by +2.78% compared to a91bae7
Details
scikit-learn.scikit-learn (Linux py35_conda_openblas) Linux py35_conda_openblas succeeded
Details
scikit-learn.scikit-learn (Linux py35_ubuntu_atlas) Linux py35_ubuntu_atlas succeeded
Details
scikit-learn.scikit-learn (Linux pylatest_conda_mkl) Linux pylatest_conda_mkl succeeded
Details
scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit) Linux32 py35_ubuntu_atlas_32bit succeeded
Details
scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit) Windows py35_pip_openblas_32bit succeeded
Details
scikit-learn.scikit-learn (Windows py37_conda_mkl) Windows py37_conda_mkl succeeded
Details
scikit-learn.scikit-learn (macOS pylatest_conda_mkl) macOS pylatest_conda_mkl succeeded
Details
@glemaitre glemaitre moved this from WAITING FOR REVIEW to REVIEWED AND WAITING FOR CHANGES in Guillaume's pet Oct 28, 2019
@glemaitre glemaitre moved this from REVIEWED AND WAITING FOR CHANGES to MERGED in Guillaume's pet Oct 28, 2019
@adrinjalali adrinjalali added this to In progress in Meeting Issues via automation Oct 29, 2019
@adrinjalali adrinjalali moved this from In progress to Done in Meeting Issues Oct 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.