[MRG] New feature: SamplingImputer #11368

Open · wants to merge 46 commits into base: main

Conversation

jeremiedbb
Member

@jeremiedbb jeremiedbb commented Jun 27, 2018

This is a WIP to add the sampling imputation strategy discussed in #11209.

I'm not sure whether we want to make it a new strategy for the SimpleImputer, or a new class, SamplingImputer, dedicated solely to this transformation.

I made a commit for both so you can see them and tell me which you think is best.
To me, just making it a new strategy for the SimpleImputer is fine, since it's a simple strategy and it doesn't change the code at all (apart from adding the strategy).

There are still TODOs in there:

  • support sparse matrices
  • add tests
  • update the docs

@jnothman
Member

Cool!

There are lots of reasons it's different from the SimpleImputer:

  • it needs to store different model attributes
  • it adds a new random_state parameter
  • it can't be covered by the same tests

But I suppose it would be convenient for users to just have it as a different "strategy". Potentially any of these strategies could be combined with KNN imputation...

Do you have an opinion?

@sklearn-lgtm

This pull request introduces 3 alerts when merging cbb0b19 into 3b5abf7 - view on LGTM.com

new alerts:

  • 3 for Unused local variable

Comment posted by LGTM.com

else:
    generators[i] = np.nan

return generators
Member

I don't like this overloading of statistics_ with something of a very different meaning.

Member Author

I guess since we can't store a closure as an attribute, we have to directly store the array of non-missing values and their probas. Then storing it in statistics_ would make sense, since we would be storing the distribution itself.

Member

Well, you can store a functools.partial
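For illustration (a sketch, not the PR's actual code): a functools.partial over the bound RandomState.choice method pickles fine, whereas a lambda closing over the same state would not, because both the partial and the RandomState are themselves picklable.

```python
import pickle
from functools import partial

import numpy as np

# Illustrative values standing in for a fitted column's distribution
rng = np.random.RandomState(0)
uniques = np.array([1.0, 2.0, 3.0])
probas = np.array([0.5, 0.3, 0.2])

# Store a partial instead of `lambda k: rng.choice(uniques, k, p=probas)`
g = partial(rng.choice, uniques, p=probas)
restored = pickle.loads(pickle.dumps(g))  # round-trips through pickle

sample = restored(5)  # equivalent to rng.choice(uniques, 5, p=probas)
assert sample.shape == (5,)
assert set(sample) <= set(uniques)
```

The partial keeps the estimator picklable while still bundling the values, their probabilities, and the random state into one callable.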

row_mask = np.logical_not(row_mask).astype(np.bool)
row = row[row_mask]
if row.size > 0:
    uniques, counts = np.unique(row, return_counts=True)
Member

If all(counts == 1), which is likely for float-valued features, we could just set proba = None.

Member

I'm also not entirely sure we should be doing unique, rather than just storing all the values from X, including repetitions.

Member Author

I agree that for inputs with float dtypes, it's very likely that all(counts == 1). However, for integer dtypes or categorical inputs (which this strategy allows), storing uniques and counts would very likely consume far less memory.
I think the best option is to not use uniques for floats and to use uniques + counts otherwise.
What do you think?

Member

I'd be happy with always using unique + counts, but setting p=None when len(uniques) == n_non_missing.
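A minimal sketch of that rule (illustrative names, not the PR's code): store uniques plus probabilities per column, dropping the probabilities when every non-missing value is distinct, since p=None makes np.random.choice sample uniformly anyway.

```python
import numpy as np

def fit_column(column):
    # Hypothetical per-column fit step: returns (uniques, probas), with
    # probas=None when all non-missing values are distinct (uniform case).
    values = column[~np.isnan(column)]
    if values.size == 0:
        return None, None
    uniques, counts = np.unique(values, return_counts=True)
    if len(uniques) == values.size:
        return uniques, None
    return uniques, counts / counts.sum()

uniques, probas = fit_column(np.array([1.0, 2.0, 2.0, np.nan]))
assert np.allclose(uniques, [1.0, 2.0])
assert np.allclose(probas, [1 / 3, 2 / 3])

uniques2, probas2 = fit_column(np.array([0.5, 1.5, np.nan]))
assert probas2 is None  # all distinct: sample uniformly
```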

if row.size > 0:
    uniques, counts = np.unique(row, return_counts=True)
    probas = counts / counts.sum()
    g = lambda k: random_state.choice(uniques, k, p=probas)
Member

We can't store a closure as an attribute. The estimator will not be picklable.

Member

I would rather just store the values and (if we go that way) their probabilities, as separate attributes.

Member Author

You mean two attributes for the values and their probas, or attributes separate from statistics_?
I think adding new attributes, distinct from statistics_, would be confusing, since they would only be used for that strategy.

Member

Yes. That's what I mean. Already, the attributes are fundamentally quite different from what's already there.

@jeremiedbb
Member Author

Considering our discussion, I'm starting to lean toward the SamplingImputer. First, all the strategies in SimpleImputer are deterministic. Second, storing the values and probas in different attributes would imply different parameters and attributes depending on the strategy, which is not very user-friendly (although I still think statistics_ is not a bad name for storing the values and their probas).

@sklearn-lgtm

This pull request introduces 3 alerts and fixes 2 when merging f93e525 into 4e51f29 - view on LGTM.com

new alerts:

  • 2 for Unused local variable
  • 1 for Unused import

fixed alerts:

  • 1 for Unused import
  • 1 for Module is imported with 'import' and 'import from'

Comment posted by LGTM.com

@jeremiedbb jeremiedbb force-pushed the samplingImputer-vs-strat=sample branch from c9ebc0e to 457ba91 Compare June 29, 2018 11:43
@sklearn-lgtm

This pull request introduces 2 alerts when merging 1f72947 into 526aede - view on LGTM.com

new alerts:

  • 2 for Unused local variable

Comment posted by LGTM.com

@jnothman
Member

Test failures.

@jeremiedbb
Member Author

@jnothman are you OK with going for the new class SamplingImputer rather than a new strategy? Or should we wait for the opinion of more people?

@jnothman
Member

jnothman commented Jul 2, 2018 via email

@jnothman
Member

jnothman commented Jul 2, 2018

I think SamplingImputer has fundamentally different properties, and should be documented as a separate beast.

Member

@jnothman jnothman left a comment

And I'm not entirely convinced we should have ever supported completing sparse matrices with missing_values=0

You are currently missing a test for .transform.

I also think we should consider some parametrised common tests for imputers, which check that the only values modified are those which were missing, and that all values which were missing are now not missing.

I suppose those test improvements could come separately.
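Such a parametrised common check could look roughly like this (a sketch against the current SimpleImputer; the helper name is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

def check_imputer_fills_only_missing(imputer):
    # Hypothetical common check: every missing entry is filled, and no
    # non-missing entry is modified by transform.
    X = np.array([[1.0, np.nan], [np.nan, 3.0], [2.0, 4.0]])
    missing = np.isnan(X)
    Xt = imputer.fit_transform(X)
    assert not np.isnan(Xt).any()                     # all gaps filled
    assert np.array_equal(Xt[~missing], X[~missing])  # rest untouched

for strategy in ("mean", "median", "most_frequent"):
    check_imputer_fills_only_missing(SimpleImputer(strategy=strategy))
```

The same check would apply verbatim to a SamplingImputer, which is what makes it a good candidate for a shared, parametrised test.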

tests[i] = np.allclose(Xts[i], Xts[i-1])
assert not np.all(tests)

assert np.mean(np.concatenate(Xts)) == pytest.approx(np.nanmean(X),
Member

also check equality of variance
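For instance (a standalone sketch of the check, not the PR's test code): filling missing entries by resampling from the observed values should preserve the variance as well as the mean, up to an illustrative tolerance.

```python
import numpy as np

# Build data with ~10% missing values, impute by sampling from the
# observed values, then compare mean and variance with the observed stats.
rng = np.random.RandomState(0)
X = rng.normal(loc=2.0, scale=3.0, size=10000)
X[rng.rand(X.size) < 0.1] = np.nan

observed = X[~np.isnan(X)]
Xt = X.copy()
Xt[np.isnan(X)] = rng.choice(observed, size=np.isnan(X).sum())

assert np.isclose(np.mean(Xt), np.nanmean(X), rtol=0.05)
assert np.isclose(np.var(Xt), np.nanvar(X), rtol=0.05)  # variance too
```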

@jeremiedbb
Member Author

Test failures.

The failing test comes from a common test. I opened an issue (#11401) on how we should handle it because it also impacts the SimpleImputer.

@jeremiedbb
Member Author

And I'm not entirely convinced we should have ever supported completing sparse matrices with missing_values=0

I agree that

if issparse(X):
    X = X.toarray()

does not feel right :) but isn't it the user's responsibility to handle their data properly?
Maybe we could at least raise a warning in that case, so that the user is informed that they are probably doing something wrong?

@jnothman
Member

jnothman commented Jul 3, 2018

but isn't it the user's responsibility to handle their data properly?

Yes, and no. The problem for us is that it's a nuisance to maintain!

@jnothman
Member

jnothman commented Jul 3, 2018

and "0" is sort of ambiguous in sparse matrices: did they really want us to impute explicit zeros?

@jeremiedbb
Member Author

Since the whole module has not been released yet, it's not too late to make important changes.
We could still remove the support of imputing the implicit zeros of sparse matrices.

and "0" is sort of ambiguous in sparse matrices: did they really want us to impute explicit zeros?

It would be really weird to have a matrix with both implicit and explicit zeros where only the explicit zeros are missing values. But this situation (not handled right now) could easily be handled: just remove self.missing_values != 0 in

if sparse.issparse(X) and self.missing_values != 0:
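The ambiguity is easy to demonstrate (illustration only): a scipy sparse matrix can store explicit zeros alongside implicit ones, and both read back as 0, so missing_values=0 doesn't say which of the two the user meant.

```python
import numpy as np
from scipy import sparse

# data holds an explicit zero at (0, 1); the zero at (1, 0) is implicit
data = np.array([1.0, 0.0, 2.0])
indices = np.array([0, 1, 1])
indptr = np.array([0, 2, 3])
X = sparse.csr_matrix((data, indices, indptr), shape=(2, 2))

assert X.nnz == 3  # the explicit zero counts as a stored entry
assert X.toarray()[0, 1] == X.toarray()[1, 0] == 0.0  # both read as 0
```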

@jnothman
Member

jnothman commented Jul 3, 2018 via email

@amueller
Member

I think we should go ahead with this without multiple imputations. I think in the other imputers we decided to bail on multiple imputations, and I think this would be useful without it.

from .utils.validation import FLOAT_DTYPES
from .utils.fixes import _object_dtype_isnan
from .utils.fixes import _uniques_counts
Member

we bumped the numpy dependencies, right?

for i in range(X.shape[1]):
    column = X[~mask[:, i], i]
    if column.size > 0:
        uniques[i], counts = _uniques_counts(column)
Member

I think we should default to binning to like 256 bins.
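One way that binning could look (a sketch of the idea, not an agreed design): replace the exact empirical distribution of a continuous feature with a 256-bin histogram and sample bin centers, capping the memory cost per column.

```python
import numpy as np

def binned_distribution(values, n_bins=256):
    # Hypothetical helper: compress a continuous feature's empirical
    # distribution into at most n_bins (center, probability) pairs.
    counts, edges = np.histogram(values, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2
    keep = counts > 0
    return centers[keep], counts[keep] / counts.sum()

rng = np.random.RandomState(0)
values = rng.normal(size=10000)
centers, probas = binned_distribution(values)
samples = rng.choice(centers, size=1000, p=probas)

assert len(centers) <= 256
assert abs(samples.mean() - values.mean()) < 0.1  # distribution preserved
```

This trades a small discretisation error for a bounded number of stored values per feature, which matters for float columns where nearly every value is unique.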

@amueller
Member

Maybe try on this one for an example? https://www.openml.org/d/40966

@amueller
Member

Or baseball: https://www.openml.org/d/185

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 10, 2019 via email

@jnothman
Member

jnothman commented Jun 10, 2019 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 10, 2019 via email

@amueller
Member

The intuition being that for prediction, the supervised learner can learn
to have a specific behavior for the missing values if the imputed value
is consistent (th 3).

I would assume to have an indicator feature as well, so this argument doesn't really hold.

@amueller
Member

I would have put this in as "common practice", not necessarily because it's the best thing to do. I think it would be interesting to run an experiment on CC-18, but wouldn't require it.
What would you compare? Mean + indicator vs. sampled + indicator, on logistic regression and random forest? How do you tune parameters?

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 11, 2019 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 11, 2019 via email

@amueller
Member

also see #5745.

I would say cross validation.

I meant how do you define the search space ;)

@amueller
Member

If anyone wants to help with this, it would be good to come up with / run some benchmarks.

@amueller amueller added this to PR phase in Andy's pets Jul 17, 2019
@kacper-w

Any updates on this?

@jnothman
Member

Fair question, @kacper-w. What do you think @jeremiedbb?

Base automatically changed from master to main January 22, 2021 10:50
@ggoulet

ggoulet commented Mar 22, 2021

However, to include this in scikit-learn, I would like a credible
example that demonstrates its value.

I think it could be relevant in the context of clustering, where we want to avoid learning clusters based on missing variables. In certain contexts, two observations missing the same feature doesn't imply the observations are more similar/closer. Imputing missing values with a single fill_value (e.g. the mean) can have undesired consequences on the resulting clusters. Sadly, I don't have any research to back this up.

@GaelVaroquaux
Member

In terms of example, I suspect that we will find what is needed in section 4.1.2 of https://arxiv.org/abs/1902.06931

@cmarmo cmarmo added Needs Decision - Include Feature Requires decision regarding including feature and removed help wanted labels Jan 15, 2022
@thomasjpfan thomasjpfan added the Needs Benchmarks A tag for the issues and PRs which require some benchmarks label Nov 4, 2022
Labels: module:utils, Needs Benchmarks, Needs Decision - Include Feature, New Feature
Projects: Andy's pets (PR phase)
Development: successfully merging this pull request may close these issues: none yet.