[MRG] New feature: SamplingImputer #11368
Conversation
Cool! There are lots of reasons it's different from the SimpleImputer:
But I suppose it would be convenient for users to just have it as a different "strategy". Potentially any of these strategies could be combined with KNN imputation... Do you have an opinion? |
This pull request introduces 3 alerts when merging cbb0b19 into 3b5abf7 - view on LGTM.com.
Comment posted by LGTM.com
sklearn/impute.py (outdated):

    else:
        generators[i] = np.nan

    return generators
I don't like this overloading of statistics_ with something of a very different meaning.
I guess since we can't store a closure as an attribute, we have to directly store the non-missing values array and their probas. Then storing it in statistics_ would make sense, since we would store the distribution itself.
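For illustration, a minimal sketch of what storing the distribution itself (values plus probabilities, rather than a closure) could look like; the helper names are hypothetical, not code from this PR:

```python
import numpy as np

def fit_feature(column):
    """Record the empirical distribution of one feature's non-missing values."""
    observed = column[~np.isnan(column)]
    uniques, counts = np.unique(observed, return_counts=True)
    return uniques, counts / counts.sum()   # plain arrays: picklable

def sample_feature(uniques, probas, n, rng):
    """Draw n imputed values from the stored distribution."""
    return rng.choice(uniques, size=n, p=probas)

col = np.array([1.0, np.nan, 2.0, 2.0, np.nan])
uniques, probas = fit_feature(col)          # values [1., 2.], probas [1/3, 2/3]
rng = np.random.RandomState(0)
imputed = sample_feature(uniques, probas, 2, rng)
```

Storing `(uniques, probas)` in `statistics_` keeps the fitted state as plain arrays, so the estimator remains picklable.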
Well, you can store a functools.partial.
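Indeed: unlike a lambda, a functools.partial built from picklable objects survives pickling. A small sketch (the arrays are illustrative):

```python
import functools
import pickle
import numpy as np

rng = np.random.RandomState(0)
uniques = np.array([1.0, 2.0])
probas = np.array([0.25, 0.75])

# A lambda closure cannot be pickled, but a partial over a bound method
# of a picklable object (RandomState) and plain arrays can.
sampler = functools.partial(rng.choice, uniques, p=probas)

restored = pickle.loads(pickle.dumps(sampler))
draws = restored(5)   # equivalent to rng.choice(uniques, 5, p=probas)
```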
sklearn/impute.py (outdated):

    row_mask = np.logical_not(row_mask).astype(np.bool)
    row = row[row_mask]
    if row.size > 0:
        uniques, counts = np.unique(row, return_counts=True)
If all(counts == 1), which is likely for float-valued features, we could just set proba = None.
I'm also not entirely sure we should be doing unique, rather than just storing all the values from X, including repetitions.
I agree that for inputs with float dtypes, it's very likely that all(counts == 1). However, for integer dtypes or categorical inputs (which this strategy allows), storing uniques and counts would very likely consume far less memory. I think the best approach is to skip uniques for floats and use unique + counts otherwise. What do you think?
I'd be happy with always using unique + counts, but setting p=None when len(uniques) == n_non_missing.
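A sketch of that suggestion: fall back to p=None when every value is distinct, since np.random.choice treats p=None as uniform sampling (the helper name is hypothetical):

```python
import numpy as np

def feature_distribution(observed):
    """Return (values, probas) for sampling; probas is None when all
    observed values are distinct, i.e. len(uniques) == n_non_missing."""
    uniques, counts = np.unique(observed, return_counts=True)
    if len(uniques) == len(observed):   # equivalent to all(counts == 1)
        return uniques, None            # uniform sampling, no probas stored
    return uniques, counts / counts.sum()

vals, p = feature_distribution(np.array([0.1, 0.7, 0.3]))   # all distinct
vals2, p2 = feature_distribution(np.array([1, 1, 2]))       # repeated values
```

This avoids storing a probability array for the common float case while keeping weighted sampling for integer and categorical features.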
sklearn/impute.py (outdated):

    if row.size > 0:
        uniques, counts = np.unique(row, return_counts=True)
        probas = counts / counts.sum()
        g = lambda k: random_state.choice(uniques, k, p=probas)
We can't store a closure as an attribute. The estimator will not be picklable.
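To see the problem concretely, here is a small sketch (class name hypothetical) showing that an estimator holding a lambda attribute fails to pickle:

```python
import pickle
import numpy as np

class BadImputer:
    def fit(self, uniques, probas):
        rng = np.random.RandomState(0)
        # Closure over local state: works in-process, but breaks pickling.
        self.sampler_ = lambda k: rng.choice(uniques, k, p=probas)
        return self

imp = BadImputer().fit(np.array([1.0, 2.0]), np.array([0.5, 0.5]))
try:
    pickle.dumps(imp)
    picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    picklable = False   # a local lambda cannot be pickled
```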
I would rather just store the values and (if we go that way) their probabilities, as separate attributes.
You mean two attributes for the values and their probas, or separate from statistics_?
I think making new attributes, different from statistics_, would be confusing, since they would only be used for that strategy.
Yes. That's what I mean. Already, the attributes are fundamentally quite different from what's already there.
Considering our discussion, I'm starting to lean toward the SamplingImputer side. First, all the strategies in SimpleImputer are deterministic. Second, storing the values and probas in different attributes would imply different parameters and attributes depending on the strategy, which is not very user-friendly (although I still think that …
This pull request introduces 3 alerts and fixes 2 when merging f93e525 into 4e51f29 - view on LGTM.com.
Comment posted by LGTM.com
Force-pushed from c9ebc0e to 457ba91 (compare).
This pull request introduces 2 alerts when merging 1f72947 into 526aede - view on LGTM.com.
Comment posted by LGTM.com
Test failures.
@jnothman are you OK to go for the new class SamplingImputer?
I'm okay with SamplingImputer.
I think SamplingImputer has fundamentally different properties, and should be documented as a separate beast.
And I'm not entirely convinced we should have ever supported completing sparse matrices with missing_values=0.
You are currently missing a test for .transform.
I also think we should consider some parametrised common tests for imputers, which check that the only values modified are those which were missing, and that all values which were missing are now not missing.
I suppose those test improvements could come separately.
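Such a parametrised common test might look like the following sketch (test name and parametrisation are hypothetical, not part of this PR):

```python
import numpy as np
import pytest
from sklearn.impute import SimpleImputer

@pytest.mark.parametrize("imputer", [SimpleImputer(strategy="mean"),
                                     SimpleImputer(strategy="median")])
def test_imputer_only_fills_missing(imputer):
    X = np.array([[1.0, np.nan],
                  [2.0, 3.0],
                  [np.nan, 4.0]])
    missing = np.isnan(X)
    Xt = imputer.fit_transform(X)
    # Every value that was missing is now filled...
    assert not np.isnan(Xt).any()
    # ...and every value that was observed is unchanged.
    assert np.array_equal(Xt[~missing], X[~missing])
```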
    tests[i] = np.allclose(Xts[i], Xts[i-1])
    assert not np.all(tests)

    assert np.mean(np.concatenate(Xts)) == pytest.approx(np.nanmean(X),
Also check equality of variance.
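Self-contained, the mean check from the snippet above extended with the suggested variance check might look like this (plain tolerances instead of pytest.approx; the sampling step stands in for the imputer):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 1))
X[rng.rand(1000) < 0.2, 0] = np.nan        # knock out ~20% of the values

observed = X[~np.isnan(X)]
# Sampling-based imputation: draw replacements from the observed values.
n_missing = int(np.isnan(X).sum())
Xt = X.copy()
Xt[np.isnan(X)] = rng.choice(observed, size=n_missing)

# Mean AND variance of the completed data should match the
# non-missing statistics up to sampling noise.
assert abs(np.mean(Xt) - np.nanmean(X)) < 0.1
assert abs(np.var(Xt) - np.nanvar(X)) < 0.1
```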
The failing test comes from a common test. I opened an issue (#11401) on how we should handle it, because it also impacts the
I agree that `if issparse(X): X = X.toarray()` does not feel right :) but isn't it the user's responsibility to handle their data properly?
Yes, and no. The problem for us is that it's a nuisance to maintain!
and "0" is sort of ambiguous in sparse matrices: did they really want us to impute explicit zeros? |
Since the whole module has not been released yet, it's not too late to make important changes.
It would be really weird to have a matrix with both implicit and explicit zeros where only the explicit zeros are treated as missing. But this situation (not handled right now) could be easily handled. Just remove `if sparse.issparse(X) and self.missing_values != 0:`
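To make the explicit/implicit distinction concrete, here is a small scipy sketch (not code from this PR):

```python
import numpy as np
from scipy import sparse

dense = np.array([[0.0, 1.0],
                  [2.0, 0.0]])

# Built from a dense array: zeros are implicit (not stored at all).
X_implicit = sparse.csr_matrix(dense)

# Built from raw CSR arrays: the first stored entry is an explicit zero.
data = np.array([0.0, 1.0, 2.0])
indices = np.array([0, 1, 0])
indptr = np.array([0, 2, 3])
X_explicit = sparse.csr_matrix((data, indices, indptr), shape=(2, 2))

# Element-wise the two matrices are identical, but code that walks
# .data with missing_values=0 would only "see" the explicit zero.
assert X_implicit.nnz == 2 and X_explicit.nnz == 3
```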
missing_values=0 is rarely a good idea, especially with sparse matrices. I would be happy to see sparse matrix support with missing_values=0 disappear, as I think it would usually be a result of misuse.
I think we should go ahead with this without multiple imputations. I think in the other imputers we decided to bail on multiple imputations, and I think this would be useful without it.
    from .utils.validation import FLOAT_DTYPES
    from .utils.fixes import _object_dtype_isnan
    from .utils.fixes import _uniques_counts
We bumped the numpy dependency, right?
    for i in range(X.shape[1]):
        column = X[~mask[:, i], i]
        if column.size > 0:
            uniques[i], counts = _uniques_counts(column)
I think we should default to binning into something like 256 bins.
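A sketch of that binning idea: compress a continuous feature's empirical distribution into at most 256 weighted bin centers before sampling (the helper name is hypothetical):

```python
import numpy as np

def binned_distribution(observed, n_bins=256):
    """Reduce a continuous feature to at most n_bins (value, probability)
    pairs, bounding the memory cost of the stored distribution."""
    counts, edges = np.histogram(observed, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2
    mask = counts > 0
    return centers[mask], counts[mask] / counts.sum()

rng = np.random.RandomState(0)
observed = rng.normal(size=10_000)
values, probas = binned_distribution(observed)
imputed = rng.choice(values, size=100, p=probas)
```

The trade-off is a small discretisation error on float features in exchange for a fixed upper bound on stored values, instead of one entry per distinct float.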
Maybe try on this one for an example? https://www.openml.org/d/40966
Or baseball: https://www.openml.org/d/185
Actually, without multiple imputation, based on our study (Josse, Pennec, Prost, Varoquaux, On the consistency of supervised learning with missing values, arXiv 2019), I would believe that this is actually detrimental to prediction. The intuition being that, for prediction, the supervised learner can learn a specific behavior for the missing values if the imputed value is consistent (th. 3). With multiple imputation at test time, there is a potential benefit (th. 2).

I do not want to be too hung up on the results of our paper: they rely on assumptions (MAR, and consistent predictors). However, to include this in scikit-learn, I would like a credible example that demonstrates its value, because theory seems to be saying that it does not bring value.
Presumably cross validation will effectively score multiple imputation. Is your result saying that the right way to use the final estimator is with multiple imputations too?

> Presumably cross validation will effectively score multiple imputation. Is your result saying that the right way to use the final estimator is with multiple imputations too?

Yes, the result is saying that multiple imputation is important at test time.
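To make the test-time point concrete, multiple imputation at predict time can be sketched as averaging predictions over several sampled completions (the toy model and helpers are hypothetical, not API from this PR):

```python
import numpy as np

rng = np.random.RandomState(0)

def sample_impute(x, observed, rng):
    """Complete one test row by sampling each missing entry from the
    training distribution of that feature."""
    out = x.copy()
    for j in np.where(np.isnan(x))[0]:
        out[j] = rng.choice(observed[j])
    return out

# Toy model and per-feature training values (assumed given).
model = lambda x: x.sum()
observed = {0: np.array([1.0, 2.0]), 1: np.array([10.0, 20.0])}

x = np.array([np.nan, 15.0])
# Multiple imputation at test time: average the prediction over m draws.
m = 100
preds = [model(sample_impute(x, observed, rng)) for _ in range(m)]
y_hat = np.mean(preds)
```

A single imputation would commit to one of the sampled completions; averaging over many draws marginalises over the uncertainty in the missing entry.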
I would assume to have an indicator feature as well, so this argument doesn't really hold.
I would have put this in as "common practice", not necessarily because it's the best thing to do. I think it would be interesting to run an experiment on CC-18, but wouldn't require it.
> I would assume to have an indicator feature as well, so this argument doesn't really hold.

Possibly. But that's why we need an experiment demonstrating some value.
> I would have put this in as "common practice", not necessarily because it's the best thing to do.

Well, the more I go, the more I see common practice that is really bad.

What would you compare? Mean + indicator versus sampled + indicator, on logistic regression and random forest?

Sounds right. And only one of the two learners should display a benefit. Note that our theoretical results do not apply to log-reg, as it is not universally consistent.

How do you tune parameters?

I would say cross validation.
Also see #5745.

I meant: how do you define the search space ;)
If anyone wants to help with this, it would be good to come up with / run some benchmarks.
Any updates on this?
Fair question, @kacper-w. What do you think, @jeremiedbb?
I think it could be relevant in the context of clustering, where we want to avoid learning clusters based on missing variables. In certain contexts, two observations missing the same feature doesn't imply the observations are more similar/closer. Imputing missing values with a single …
In terms of example, I suspect that we will find what is needed in section 4.1.2 of https://arxiv.org/abs/1902.06931 |
This is WIP to add the sampling imputation strategy discussed in #11209.

I'm not sure whether we want to make it a new strategy for the SimpleImputer, or a new class, SamplingImputer, dedicated only to this transformation. I made a commit for both so you can see and tell me what you think is best.

To me, just making it a new strategy for the SimpleImputer is fine, since it's a simple strategy and it doesn't change the code at all (except adding the strategy). There are still TODOs in there: