[MRG+1] SimpleImputer(strategy="constant") #11211

Merged
merged 33 commits into scikit-learn:master from jeremiedbb:constant-imputer on Jun 20, 2018

Conversation

7 participants
@jeremiedbb
Contributor

jeremiedbb commented Jun 6, 2018

This is a WIP implementing the constant strategy for SimpleImputer, as described in #11208.
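For illustration, a minimal sketch of the intended behavior (assuming the final API with strategy="constant" and a fill_value parameter, as discussed later in this thread):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [np.nan, 3.0]])

# Replace every missing entry with a user-chosen constant instead of a
# statistic computed from the column.
imp = SimpleImputer(strategy="constant", fill_value=0)
print(imp.fit_transform(X))
# [[1. 0.]
#  [0. 3.]]
```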

jeremiedbb added some commits Jun 6, 2018

@sklearn-lgtm

sklearn-lgtm commented Jun 6, 2018

This pull request introduces 1 alert when merging f300fe2 into 646ceb1 - view on LGTM.com

new alerts:

  • 1 for Comparison of identical values

Comment posted by LGTM.com

@sklearn-lgtm

sklearn-lgtm commented Jun 7, 2018

This pull request introduces 1 alert when merging 35e30ac into 646ceb1 - view on LGTM.com

new alerts:

  • 1 for Comparison of identical values

Comment posted by LGTM.com

@sklearn-lgtm

sklearn-lgtm commented Jun 7, 2018

This pull request introduces 2 alerts when merging ea4a929 into 646ceb1 - view on LGTM.com

new alerts:

  • 2 for Comparison of identical values

Comment posted by LGTM.com

@ogrisel

This is starting to look really good. As discussed in person, the main thing missing is an update to the narrative docs.

Here are some further comments in the code:

@ogrisel

One more comment.

if X.dtype.kind == "O":
    most_frequent = np.empty(X.shape[0], dtype=object)
else:
    most_frequent = np.empty(X.shape[0])


@ogrisel

ogrisel Jun 7, 2018

Member

Maybe this should be:

most_frequent = np.empty(X.shape[0], dtype=X.dtype)

@jeremiedbb

jeremiedbb Jun 14, 2018

Contributor

@ogrisel actually, we don't always want most_frequent to have the same dtype as X. For example, if there is a column entirely made of missing values in an integer array, X has an integer dtype, whereas most_frequent (which stores the statistics column-wise) will contain np.nan for that column.


@jorisvandenbossche

jorisvandenbossche Jun 14, 2018

Contributor

As discussed here: if you have an integer column where every value is -1 and you specify missing_values=-1, then the most_frequent statistic will be NaN (an implementation detail used to signal this case), and then X.dtype would clash with NaN.
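A small sketch of the dtype clash being described (the -1 marker and array shapes are illustrative):

```python
import numpy as np

# A column where every value is the missing marker (-1) in an integer array.
X = np.full((3, 1), -1, dtype=np.int64)
stat = np.nan  # used internally to signal "no statistic for this column"

# Storing the statistics with X's own dtype would fail:
out = np.empty(X.shape[1], dtype=X.dtype)
try:
    out[0] = stat
except ValueError as exc:
    print(exc)  # NaN cannot be stored in an integer array
```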


@ogrisel

ogrisel Jun 14, 2018

Member

ok


@ogrisel

Member

ogrisel commented Jun 8, 2018

BTW, some tests are still broken because of the switch from "NaN" to np.nan as the default missing value marker.

We decided not to use a string as the default marker to avoid weird behavior when a user has the literal "NaN" string in a CSV file, for instance. I think it's less surprising not to have a "magic" string in SimpleImputer, and now is the time to change this because the new SimpleImputer class is still unreleased.

@glemaitre glemaitre added this to the 0.20 milestone Jun 8, 2018

@jeremiedbb

Contributor

jeremiedbb commented Jun 11, 2018

The switch from "NaN" to np.nan for the default value of missing_values does not seem to be a good idea after all. Setting np.nan as a parameter default is not possible right now: there is a general test for all estimators that checks the default parameters.

    # We get the default parameters from init and then
    # compare these against the actual values of the attributes.

(from estimator_checks.py, in check_parameters_default_constructible)

It results in a nan != nan error.

I see 3 solutions:

  • Modify check_parameters_default_constructible to allow np.nan. I don't think that one is a good idea.
  • Go back to "NaN" for the default value.
  • Set the default value to None and call fit with missing_values=np.nan if X is numerical, leaving None otherwise.

Which solution should we apply?
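The nan != nan failure comes from IEEE 754 semantics: NaN never compares equal to itself, so a naive `param == default` check fails whenever the default is np.nan. A minimal sketch (the params_equal helper is hypothetical, just to illustrate what the check would need to do):

```python
import math

import numpy as np

# NaN is never equal to itself.
print(np.nan == np.nan)  # False

def params_equal(a, b):
    # Hypothetical helper: treat two NaNs as equal, compare normally otherwise.
    if isinstance(a, float) and isinstance(b, float):
        return a == b or (math.isnan(a) and math.isnan(b))
    return a == b

print(params_equal(np.nan, np.nan))  # True
```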

@ogrisel

Member

ogrisel commented Jun 11, 2018

Hmm, pandas.read_csv uses object dtype columns with np.nan for missing values when parsing CSV files with string content in a column. So using None as the default missing value for non-numerical columns will not work out of the box for this kind of pipeline.

Maybe we should go back to the "NaN" special magic string. But this means we would not be able to properly handle columns that have this token as a legitimate value alongside missing values in the same column.

Another option would be to introduce a module-level MISSING singleton marker:

class MissingValueMarker:
    pass


MISSING = MissingValueMarker()


class SimpleImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy='mean', missing_values=MISSING):
        ...

This marker could by default match np.nan both for floating point and object dtype data. For integer-dtyped data it would not match anything, though, and the user would be required to pass a specific integer value explicitly.
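A hypothetical sketch of how such a sentinel could be resolved at fit time (resolve_missing_values is illustrative, not the actual scikit-learn API):

```python
import numpy as np

class MissingValueMarker:
    """Module-level sentinel; compared by identity, so no NaN pitfalls."""

MISSING = MissingValueMarker()

def resolve_missing_values(missing_values, dtype):
    # Hypothetical helper: map the sentinel to np.nan for float and object
    # dtypes; integer data would require an explicit value from the user.
    if missing_values is MISSING:
        if np.dtype(dtype).kind in ("f", "O"):
            return np.nan
        raise ValueError("integer data requires an explicit missing_values")
    return missing_values

print(resolve_missing_values(MISSING, np.float64))  # nan
print(resolve_missing_values(-1, np.int64))         # -1
```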

@jeremiedbb

Contributor

jeremiedbb commented Jun 11, 2018

I said "and leave None otherwise", but we could call fit with missing_values=np.nan for all dtypes.
So instead of

if sparse.issparse(X):
    self.statistics_ = self._sparse_fit(X,
                                        self.strategy,
                                        self.missing_values,
                                        fill_value)

we'd have something like

missing_values = np.nan if self.missing_values is None else self.missing_values

if sparse.issparse(X):
    self.statistics_ = self._sparse_fit(X,
                                        self.strategy,
                                        missing_values,
                                        fill_value)

Is that fine?

@ogrisel

Member

ogrisel commented Jun 11, 2018

The problem with this approach is that nobody could use None as the missing value marker in that case. There might be pipelines where None is used upstream as the missing value marker for object-dtyped data, even if that is not the default behavior of pandas dataframes.

@jeremiedbb

Contributor

jeremiedbb commented Jun 11, 2018

That's a problem indeed.

"Another option would be to introduce a module level MISSING singleton marker:"
It seems that's the only way to avoid the problem of having a default value different from the np.nan missing_values. And it's more elegant.
I'm going with this solution if it's OK with you.

@jnothman

Member

jnothman commented Jun 11, 2018

@jeremiedbb

Contributor

jeremiedbb commented Jun 11, 2018

Yes, it's the only remaining issue regarding np.nan as the default value. It only happens when checking the constructor; afterwards, the imputer works fine with NaNs.

@jorisvandenbossche

Contributor

jorisvandenbossche commented Jun 11, 2018

Yes, I agree with @jnothman. np.nan is the logical default, so it seems a bit silly to introduce a singleton marker that will be replaced with np.nan afterwards anyway, just to satisfy the current estimator checks.

@sklearn-lgtm

sklearn-lgtm commented Jun 11, 2018

This pull request introduces 2 alerts when merging a6c33b1 into 646ceb1 - view on LGTM.com

new alerts:

  • 2 for Comparison of identical values

Comment posted by LGTM.com

@jnothman

Please also avoid @-referencing me in commit messages... if it gets merged, it sends me spam notifications every time someone does something silly (like rebasing) with public forks of scikit-learn.

def _validate_input(self, X):
    allowed_strategies = ["mean", "median", "most_frequent", "constant"]
    if self.strategy not in allowed_strategies:
        raise ValueError("Can only use these strategies: {0} "


@jnothman

jnothman Jun 16, 2018

Member

No test coverage

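A hypothetical test along the lines being requested (the test name is illustrative; it exercises the invalid-strategy branch shown above):

```python
import pytest
from sklearn.impute import SimpleImputer

def test_invalid_strategy_raises():
    # An unknown strategy should be rejected with a ValueError at fit time.
    imputer = SimpleImputer(strategy="not-a-strategy")
    with pytest.raises(ValueError):
        imputer.fit([[1.0], [2.0]])

test_invalid_strategy_raises()
```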

if invalid_mask.any():
    missing = np.arange(X.shape[1])[invalid_mask]
    if self.verbose:
        warnings.warn("Deleting features without "


@jnothman

jnothman Jun 16, 2018

Member

Not tested


@sklearn-lgtm

sklearn-lgtm commented Jun 17, 2018

This pull request introduces 3 alerts when merging 20456f4 into bb38539 - view on LGTM.com

new alerts:

  • 3 for Comparison of identical values

Comment posted by LGTM.com

@jnothman

Member

jnothman commented Jun 17, 2018

Please add an entry to the change log at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors, if applicable) with :user:.

Or perhaps modify the existing entry introducing SimpleImputer. Note the differences:

  • strategy='constant'
  • strategy='most_frequent' with string columns
  • missing_values="NaN" should now be missing_values=np.nan
@jeremiedbb

Contributor

jeremiedbb commented Jun 18, 2018

outdated

@sklearn-lgtm

sklearn-lgtm commented Jun 18, 2018

This pull request introduces 3 alerts when merging 7d3d1b5 into 4143356 - view on LGTM.com

new alerts:

  • 3 for Comparison of identical values

Comment posted by LGTM.com

@sklearn-lgtm

sklearn-lgtm commented Jun 18, 2018

This pull request introduces 3 alerts when merging 972668b into 4143356 - view on LGTM.com

new alerts:

  • 3 for Comparison of identical values

Comment posted by LGTM.com


jeremiedbb added some commits Jun 20, 2018

@sklearn-lgtm

sklearn-lgtm commented Jun 20, 2018

This pull request introduces 3 alerts when merging fb1a4e9 into 4143356 - view on LGTM.com

new alerts:

  • 3 for Comparison of identical values

Comment posted by LGTM.com

@sklearn-lgtm

sklearn-lgtm commented Jun 20, 2018

This pull request introduces 3 alerts when merging c8246f2 into 4143356 - view on LGTM.com

new alerts:

  • 3 for Comparison of identical values

Comment posted by LGTM.com

@ogrisel ogrisel merged commit 0d8a04b into scikit-learn:master Jun 20, 2018

5 of 6 checks passed

continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
LGTM analysis: Python 3 new alerts
Details
ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: python2 Your tests passed on CircleCI!
Details
ci/circleci: python3 Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@ogrisel

Member

ogrisel commented Jun 20, 2018

Thank you very much @jeremiedbb!

@jeremiedbb jeremiedbb deleted the jeremiedbb:constant-imputer branch Jun 26, 2018

@amueller

Member

amueller commented Jun 29, 2018

I like the new example, but I feel it could be simplified. Can we not remove the fill_value and handle_unknown parameters?
Also, I forgot why remainder='passthrough' is the default. What was the reason for that again?
And what was the reason for going from make_pipeline to Pipeline? It seems to make things quite a bit longer...

@glemaitre

Contributor

glemaitre commented Jun 29, 2018

"What was the reason for going from make_pipeline to Pipeline? It seems to make things quite a bit longer..."

It makes the example a bit longer, but the parameter names to be set are more user-friendly for an example:

  • Initial name: 'columntransformer__pipeline-0__simpleimputer__strategy'
  • Now: 'preprocessor__num__imputer__strategy'

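A small sketch of the difference (the step names 'imputer' and 'scaler' are illustrative, not the names used in the example being discussed):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline derives step names from the class names, which leads to
# long parameter paths...
auto = make_pipeline(SimpleImputer(), StandardScaler())
print("simpleimputer__strategy" in auto.get_params())  # True

# ...while Pipeline lets you pick short, readable names for each step.
named = Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())])
named.set_params(imputer__strategy="median")
print(named.get_params()["imputer__strategy"])  # median
```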
@jorisvandenbossche

Contributor

jorisvandenbossche commented Jul 3, 2018

"Also, I forgot why remainder='passthrough' is the default. What was the reason for that again?"

Should we reconsider this? If we want to have 'drop' in most of our examples, that might be an indication.
I also had a colleague who had not dropped 'y' from his data, and because it was passed through by the ColumnTransformer, he observed very good results ...
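A minimal sketch of what is at stake with the remainder setting (the data and column indices are illustrative):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
transformers = [("scale", StandardScaler(), [0])]  # only transform column 0

# remainder='drop' discards the untouched columns; 'passthrough' keeps them,
# which can silently leak columns (such as a target left in the data).
dropped = ColumnTransformer(transformers, remainder="drop").fit_transform(X)
kept = ColumnTransformer(transformers, remainder="passthrough").fit_transform(X)
print(dropped.shape)  # (3, 1)
print(kept.shape)     # (3, 2)
```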

@jnothman

Member

jnothman commented Jul 3, 2018
