[MRG] New feature: SamplingImputer #11368

Open · wants to merge 46 commits into base: main

Conversation

jeremiedbb
Member

@jeremiedbb jeremiedbb commented Jun 27, 2018

This is a WIP to add the sampling imputation strategy discussed in #11209.

I'm not sure whether we want to make it a new strategy for the SimpleImputer, or a new class, SamplingImputer, dedicated solely to this transformation.

I made a commit for both so you can see them and tell me which you think is best.
To me, just making it a new strategy for the SimpleImputer is fine, since it's a simple strategy and it doesn't change the code at all (apart from adding the strategy).

There are still TODOs in there:

  • support sparse matrices
  • add tests
  • update the docs

@jnothman
Member

Cool!

There are lots of reasons it's different from the SimpleImputer:

  • it needs to store different model attributes
  • it adds a new random_state parameter
  • it can't be covered by the same tests

But I suppose it would be convenient for users to just have it as a different "strategy". Potentially any of these strategies could be combined with KNN imputation...

Do you have an opinion?

@sklearn-lgtm

This pull request introduces 3 alerts when merging cbb0b19 into 3b5abf7 - view on LGTM.com

new alerts:

  • 3 for Unused local variable

Comment posted by LGTM.com

else:
    generators[i] = np.nan

return generators
Member

I don't like this overloading of statistics_ with something of a very different meaning.

Member Author

I guess since we can't store a closure as an attribute, we have to directly store the array of non-missing values and their probas. Then storing it in statistics_ would make sense, since we would be storing the distribution itself.

Member

Well, you can store a functools.partial
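For illustration (a sketch, not the PR's actual code): a functools.partial over the bound RandomState.choice method pickles fine, whereas a lambda closing over the same state would not, because both the partial and the RandomState are themselves picklable.

```python
import pickle
from functools import partial

import numpy as np

# Illustrative values standing in for a fitted column's distribution
rng = np.random.RandomState(0)
uniques = np.array([1.0, 2.0, 3.0])
probas = np.array([0.5, 0.3, 0.2])

# Store a partial instead of `lambda k: rng.choice(uniques, k, p=probas)`
g = partial(rng.choice, uniques, p=probas)
restored = pickle.loads(pickle.dumps(g))  # round-trips through pickle

sample = restored(5)  # equivalent to rng.choice(uniques, 5, p=probas)
assert sample.shape == (5,)
assert set(sample) <= set(uniques)
```

The partial keeps the estimator picklable while still bundling the values, their probabilities, and the random state into one callable.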

row_mask = np.logical_not(row_mask).astype(np.bool)
row = row[row_mask]
if row.size > 0:
    uniques, counts = np.unique(row, return_counts=True)
Member

If all(counts == 1), which is likely for float-valued features, we could just set proba = None.

Member

I'm also not entirely sure we should be doing unique, rather than just storing all the values from X, including repetitions.

Member Author

I agree that for inputs with float dtypes, it's very likely that all(counts == 1). However, for integer dtypes or categorical inputs (which this strategy allows), storing uniques and counts would very likely consume far less memory.
I think the best option is to not use uniques for floats and to use uniques + counts otherwise.
What do you think?

Member

I'd be happy with always using unique + counts, but setting p=None when len(uniques) == n_non_missing.
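A minimal sketch of that rule (illustrative names, not the PR's code): store uniques plus probabilities per column, dropping the probabilities when every non-missing value is distinct, since p=None makes np.random.choice sample uniformly anyway.

```python
import numpy as np

def fit_column(column):
    # Hypothetical per-column fit step: returns (uniques, probas), with
    # probas=None when all non-missing values are distinct (uniform case).
    values = column[~np.isnan(column)]
    if values.size == 0:
        return None, None
    uniques, counts = np.unique(values, return_counts=True)
    if len(uniques) == values.size:
        return uniques, None
    return uniques, counts / counts.sum()

uniques, probas = fit_column(np.array([1.0, 2.0, 2.0, np.nan]))
assert np.allclose(uniques, [1.0, 2.0])
assert np.allclose(probas, [1 / 3, 2 / 3])

uniques2, probas2 = fit_column(np.array([0.5, 1.5, np.nan]))
assert probas2 is None  # all distinct: sample uniformly
```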

if row.size > 0:
    uniques, counts = np.unique(row, return_counts=True)
    probas = counts / counts.sum()
    g = lambda k: random_state.choice(uniques, k, p=probas)
Member

We can't store a closure as an attribute. The estimator will not be picklable.

Member

I would rather just store the values and (if we go that way) their probabilities, as separate attributes.

Member Author

You mean two attributes for the values and their probas, or attributes separate from statistics_?
I think adding new attributes, distinct from statistics_, would be confusing, since they would only be used for that strategy.

Member

Yes. That's what I mean. Already, the attributes are fundamentally quite different from what's already there.

@jeremiedbb
Member Author

Considering our discussion, I'm starting to lean toward the SamplingImputer. First, all the strategies in SimpleImputer are deterministic. Second, storing the values and probas in different attributes would imply different parameters and attributes depending on the strategy, which is not very user-friendly (although I still think statistics_ is not a bad name for storing the values and their probas).

@sklearn-lgtm

This pull request introduces 3 alerts and fixes 2 when merging f93e525 into 4e51f29 - view on LGTM.com

new alerts:

  • 2 for Unused local variable
  • 1 for Unused import

fixed alerts:

  • 1 for Unused import
  • 1 for Module is imported with 'import' and 'import from'

Comment posted by LGTM.com

@jeremiedbb jeremiedbb force-pushed the samplingImputer-vs-strat=sample branch from c9ebc0e to 457ba91 Compare June 29, 2018 11:43
@sklearn-lgtm

This pull request introduces 2 alerts when merging 1f72947 into 526aede - view on LGTM.com

new alerts:

  • 2 for Unused local variable

Comment posted by LGTM.com

@jnothman
Member

Test failures.

@jeremiedbb
Member Author

@jnothman are you OK with going for the new class SamplingImputer rather than a new strategy? Or should we wait for the opinion of more people?

@jnothman
Member

jnothman commented Jul 2, 2018 via email

@jnothman
Member

jnothman commented Jul 2, 2018

I think SamplingImputer has fundamentally different properties, and should be documented as a separate beast.

Member

@jnothman jnothman left a comment

And I'm not entirely convinced we should have ever supported completing sparse matrices with missing_values=0

You are currently missing a test for .transform.

I also think we should consider some parametrised common tests for imputers, which check that the only values modified are those which were missing, and that all values which were missing are now not missing.

I suppose those test improvements could come separately.
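Such a parametrised common check could look roughly like this (a sketch against the current SimpleImputer; the helper name is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

def check_imputer_fills_only_missing(imputer):
    # Hypothetical common check: every missing entry is filled, and no
    # non-missing entry is modified by transform.
    X = np.array([[1.0, np.nan], [np.nan, 3.0], [2.0, 4.0]])
    missing = np.isnan(X)
    Xt = imputer.fit_transform(X)
    assert not np.isnan(Xt).any()                     # all gaps filled
    assert np.array_equal(Xt[~missing], X[~missing])  # rest untouched

for strategy in ("mean", "median", "most_frequent"):
    check_imputer_fills_only_missing(SimpleImputer(strategy=strategy))
```

The same check would apply verbatim to a SamplingImputer, which is what makes it a good candidate for a shared, parametrised test.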

tests[i] = np.allclose(Xts[i], Xts[i-1])
assert not np.all(tests)

assert np.mean(np.concatenate(Xts)) == pytest.approx(np.nanmean(X),
Member

also check equality of variance
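For instance (a standalone sketch of the check, not the PR's test code): filling missing entries by resampling from the observed values should preserve the variance as well as the mean, up to an illustrative tolerance.

```python
import numpy as np

# Build data with ~10% missing values, impute by sampling from the
# observed values, then compare mean and variance with the observed stats.
rng = np.random.RandomState(0)
X = rng.normal(loc=2.0, scale=3.0, size=10000)
X[rng.rand(X.size) < 0.1] = np.nan

observed = X[~np.isnan(X)]
Xt = X.copy()
Xt[np.isnan(X)] = rng.choice(observed, size=np.isnan(X).sum())

assert np.isclose(np.mean(Xt), np.nanmean(X), rtol=0.05)
assert np.isclose(np.var(Xt), np.nanvar(X), rtol=0.05)  # variance too
```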

@jeremiedbb
Member Author

Test failures.

The failing test comes from a common test. I opened an issue (#11401) on how we should handle it because it also impacts the SimpleImputer.

@jeremiedbb
Member Author

And I'm not entirely convinced we should have ever supported completing sparse matrices with missing_values=0

I agree that

if issparse(X):
    X = X.toarray()

does not feel right :) but isn't it the user's responsibility to handle their data properly?
Maybe we could at least raise a warning in that case, so that the user is informed that they are probably doing something wrong?

@jnothman
Member

jnothman commented Jul 3, 2018

but isn't it the user's responsibility to handle their data properly?

Yes, and no. The problem for us is that it's a nuisance to maintain!

@jnothman
Member

jnothman commented Jul 3, 2018

and "0" is sort of ambiguous in sparse matrices: did they really want us to impute explicit zeros?

@jeremiedbb
Member Author

Since the whole module has not been released yet, it's not too late to make important changes.
We could still remove the support of imputing the implicit zeros of sparse matrices.

and "0" is sort of ambiguous in sparse matrices: did they really want us to impute explicit zeros?

It would be really weird to have a matrix with both implicit and explicit zeros where only the explicit zeros are missing values. But this situation (not handled right now) could easily be handled: just remove self.missing_values != 0 in

if sparse.issparse(X) and self.missing_values != 0:
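The ambiguity is easy to demonstrate (illustration only): a scipy sparse matrix can store explicit zeros alongside implicit ones, and both read back as 0, so missing_values=0 doesn't say which of the two the user meant.

```python
import numpy as np
from scipy import sparse

# data holds an explicit zero at (0, 1); the zero at (1, 0) is implicit
data = np.array([1.0, 0.0, 2.0])
indices = np.array([0, 1, 1])
indptr = np.array([0, 2, 3])
X = sparse.csr_matrix((data, indices, indptr), shape=(2, 2))

assert X.nnz == 3  # the explicit zero counts as a stored entry
assert X.toarray()[0, 1] == X.toarray()[1, 0] == 0.0  # both read as 0
```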

@jnothman
Member

jnothman commented Jul 3, 2018 via email

@amueller
Member

I think we should go ahead with this without multiple imputations. I think in the other imputers we decided to bail on multiple imputations, and I think this would be useful without it.

from .utils.validation import FLOAT_DTYPES
from .utils.fixes import _object_dtype_isnan
from .utils.fixes import _uniques_counts
Member

we bumped the numpy dependencies, right?

for i in range(X.shape[1]):
    column = X[~mask[:, i], i]
    if column.size > 0:
        uniques[i], counts = _uniques_counts(column)
Member

I think we should default to binning to like 256 bins.
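One way that binning could look (a sketch of the idea, not an agreed design): replace the exact empirical distribution of a continuous feature with a 256-bin histogram and sample bin centers, capping the memory cost per column.

```python
import numpy as np

def binned_distribution(values, n_bins=256):
    # Hypothetical helper: compress a continuous feature's empirical
    # distribution into at most n_bins (center, probability) pairs.
    counts, edges = np.histogram(values, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2
    keep = counts > 0
    return centers[keep], counts[keep] / counts.sum()

rng = np.random.RandomState(0)
values = rng.normal(size=10000)
centers, probas = binned_distribution(values)
samples = rng.choice(centers, size=1000, p=probas)

assert len(centers) <= 256
assert abs(samples.mean() - values.mean()) < 0.1  # distribution preserved
```

This trades a small discretisation error for a bounded number of stored values per feature, which matters for float columns where nearly every value is unique.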

@amueller
Member

Maybe try on this one for an example? https://www.openml.org/d/40966

@amueller
Member

Or baseball: https://www.openml.org/d/185

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 10, 2019 via email

@jnothman
Member

jnothman commented Jun 10, 2019 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 10, 2019 via email

@amueller
Member

The intuition being that for prediction, the supervised learner can learn
to have a specific behavior for the missing values if the imputed value
is consistent (th 3).

I would assume to have an indicator feature as well, so this argument doesn't really hold.

@amueller
Member

I would have put this in as "common practice", not necessarily because it's the best thing to do. I think it would be interesting to run an experiment on CC-18, but wouldn't require it.
What would you compare? Mean + indicator vs. sampled + indicator, on logistic regression and random forest? How do you tune parameters?

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 11, 2019 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 11, 2019 via email

@amueller
Member

also see #5745.

I would say cross validation.

I meant how do you define the search space ;)

@amueller
Member

If anyone wants to help with this, it would be good to come up with / run some benchmarks.

@amueller amueller added this to PR phase in Andy's pets Jul 17, 2019
@kacper-w

Any updates on this?

@jnothman
Member

Fair question, @kacper-w. What do you think @jeremiedbb?

Base automatically changed from master to main January 22, 2021 10:50
@ggoulet

ggoulet commented Mar 22, 2021

However, to include this in scikit-learn, I would like a credible
example that demonstrates its value.

I think it could be relevant in the context of clustering, where we want to avoid learning clusters based on missing variables. In certain contexts, two observations missing the same feature doesn't imply the observations are more similar/closer. Imputing missing values with a single fill_value (e.g. the mean) can have undesired consequences on the resulting clusters. Sadly, I don't have any research to back this up.

@GaelVaroquaux
Member

In terms of example, I suspect that we will find what is needed in section 4.1.2 of https://arxiv.org/abs/1902.06931

@cmarmo cmarmo added Needs Decision - Include Feature Requires decision regarding including feature and removed help wanted labels Jan 15, 2022
@thomasjpfan thomasjpfan added the Needs Benchmarks A tag for the issues and PRs which require some benchmarks label Nov 4, 2022
Labels: module:utils, Needs Benchmarks, Needs Decision - Include Feature, New Feature
Projects: Andy's pets (PR phase)
Development: successfully merging this pull request may close these issues: none yet.