[MRG] Ignore and pass-through NaN values in MaxAbsScaler and maxabs_scale #11011

Merged
merged 10 commits into scikit-learn:master on Jun 23, 2018

@LucijaGregov
Contributor

LucijaGregov commented Apr 22, 2018

Reference Issues/PRs

Related to #10404

What does this implement/fix? Explain your changes.

Any other comments?

@LucijaGregov

Contributor

LucijaGregov commented Apr 22, 2018

@jnothman

Travis is reporting test failures. You might need to use .A or, equivalently, .toarray() to assert things about sparse matrices in your tests

assert np.any(np.isnan(X_test), axis=0).all()
X = sparse_format_func(X)
X_test[:, 0] = np.nan # make sure this boundary case is tested

@jnothman

jnothman Apr 22, 2018

Member

You should do this before converting to sparse

assert np.any(np.isnan(X_train), axis=0).all()
assert np.any(np.isnan(X_test), axis=0).all()
X = sparse_format_func(X)

@jnothman

jnothman Apr 22, 2018

Member

You're only converting X to sparse, not X_train or X_test
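
Putting the two review points together, a rough sketch (reusing the names already present in the test; the final assert_allclose line is only a hypothetical example) of the suggested ordering:

# inject the NaN boundary case and assert on the dense arrays first
X_test[:, 0] = np.nan  # make sure this boundary case is tested
assert np.any(np.isnan(X_train), axis=0).all()
assert np.any(np.isnan(X_test), axis=0).all()

# only then convert every array the test uses, not just X
X = sparse_format_func(X)
X_train = sparse_format_func(X_train)
X_test = sparse_format_func(X_test)

# later assertions on sparse results can go through .toarray() (or .A), e.g.
# assert_allclose(Xt_sparse.toarray(), Xt_dense)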

@rth

Member

rth commented Apr 23, 2018

Thanks for this PR @LucijaGregov !

@ reviewers: there is an updated version of the test_missing_value_handling_sparse test in #11012, so it would probably make sense to merge that first and then sync with master here.

@LucijaGregov

Contributor

LucijaGregov commented Apr 23, 2018

@rth I will then wait for the merge of the test_missing_value_handling_sparse test in #11012 before continuing to work on this.

@rth

Member

rth commented Apr 23, 2018

@LucijaGregov yes, I think that would be simpler.

Also, please add a link to the general issue about this under the "Reference Issues/PRs" section of the issue description.

@LucijaGregov

Contributor

LucijaGregov commented Apr 23, 2018

@rth I am not sure if I understood you correctly. This is not mentioned on the 'Issues' list. Do you mean that I should add this as an issue?

@rth

Member

rth commented Apr 23, 2018

I meant to link to #10404 in the description to provide some context for reviewers as was done in your previous PR. No worries, I added it here.

@LucijaGregov

Contributor

LucijaGregov commented Apr 23, 2018

@rth Right, I missed that bit. Thank you.

@rth

Member

rth commented Apr 27, 2018

#11012 was merged; could you please click on the "Resolve conflicts" button to merge with master, and also remove any redundant changes to sklearn/preprocessing/tests/test_common.py from this PR, as that should already be addressed? Thanks.

@LucijaGregov

Contributor

LucijaGregov commented Apr 27, 2018

@rth Done. I will continue working on this.

@glemaitre

Contributor

glemaitre commented Apr 27, 2018

I was checking how we could implement the min/max detection for sparse matrices and it does not seem that easy. We rely on the scipy implementation, which uses the maximum and minimum ufuncs. We have a backport in fixes here.

The problem with those ufuncs is that they let the NaN values pass through, while we would like to skip them. I think we need to implement our own sparse_nanmin_nanmax which uses nanmin and nanmax instead of np.minimum.reduce and np.maximum.reduce.
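
For illustration, a rough sketch (not scikit-learn code; only the helper name comes from the comment above, everything else is an assumption) of a per-column NaN-ignoring min/max for CSC input:

import numpy as np
import scipy.sparse as sp

def sparse_nanmin_nanmax(X):
    # Per-column (nanmin, nanmax) of a sparse matrix, ignoring NaN entries.
    # Implicit zeros are counted, since they are real values of the matrix.
    X = sp.csc_matrix(X)
    n_samples, n_features = X.shape
    mins = np.empty(n_features)
    maxs = np.empty(n_features)
    for j in range(n_features):
        col = X.data[X.indptr[j]:X.indptr[j + 1]]
        n_stored = col.size
        col = col[~np.isnan(col)]
        if n_stored < n_samples:  # there is at least one implicit zero
            col = np.append(col, 0.0)
        mins[j] = np.min(col) if col.size else np.nan
        maxs[j] = np.max(col) if col.size else np.nan
    return mins, maxs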

@glemaitre

Contributor

glemaitre commented Apr 27, 2018

@jnothman did I miss a functionality in scikit-learn which already provides such functions?

@LucijaGregov

Contributor

LucijaGregov commented Apr 30, 2018

@glemaitre @jnothman Is it safe to continue working on this as it is, or do you have something else in mind to do first?

@jnothman

Member

jnothman commented Apr 30, 2018

Why would it be unsafe?

@LucijaGregov

Contributor

LucijaGregov commented Apr 30, 2018

@jnothman I meant whether I needed to wait further for things to be merged because of the comments above, but I guess it is good to go.

@glemaitre

Contributor

glemaitre commented Apr 30, 2018

I meant whether I needed to wait further for things to be merged because of the comments above, but I guess it is good to go.

Basically, #11011 (comment) is something that you might need to go through to make things work :) Then, whenever you think you have a potential solution, you can implement it or we can discuss it here.

@amueller

Member

amueller commented Apr 30, 2018

pep8 failing

@LucijaGregov

Contributor

LucijaGregov commented May 7, 2018

@amueller I know, it is work in progress.

@glemaitre

Contributor

glemaitre commented Jun 16, 2018

Note that you will need to change n_samples_seen_ as in #11206 to have a consistent API.

@jnothman

Member

jnothman commented Jun 16, 2018

Why do we have n_samples_seen_ in scalers other than StandardScaler anyway? It is not used in the statistics as far as I can tell, and RobustScaler lacks it. Should we consider deprecating it?

@glemaitre

Contributor

glemaitre commented Jun 16, 2018

glemaitre changed the title from [WIP] Passing NaN values through MaxAbsScaler to [MRG] Ignore and pass-through NaN values in MaxAbsScaler and maxabs_scale on Jun 16, 2018

@glemaitre

Contributor

glemaitre commented Jun 16, 2018

Why do we have n_samples_seen_ in scalers other than StandardScaler anyway? It is not used in the statistics as far as I can tell, and RobustScaler lacks it. Should we consider deprecating it?

I opened the issue #11300 to have a consensus on what to do. I just want to be sure that partial_fit does not imply to have this attribute (even if we don't really use it).

@glemaitre

Contributor

glemaitre commented Jun 16, 2018

@LucijaGregov I made a quick push of what was missing to make the PR work.

@glemaitre

Contributor

glemaitre commented Jun 16, 2018

@rth @jnothman You can have a look at this one.

@jnothman

LGTM

@@ -27,7 +29,8 @@ def _get_valid_samples_by_column(X, col):
 @pytest.mark.parametrize(
     "est, func, support_sparse",
-    [(MinMaxScaler(), minmax_scale, False),
+    [(MaxAbsScaler(), maxabs_scale, True),
+     (MinMaxScaler(), minmax_scale, False),
      (QuantileTransformer(n_quantiles=10), quantile_transform, True)]
 )
 def test_missing_value_handling(est, func, support_sparse):

@jnothman

jnothman Jun 16, 2018

Member

I've realised we should have assert_no_warnings in here when fitting and transforming

@glemaitre

glemaitre Jun 17, 2018

Contributor

It is a good point. I just caught that the QuantileTransformer was still raising a warning at inverse_transform. Since it is a single line, I would include it here.
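
For reference, the kind of single-line check being discussed, using the pytest.warns(None) pattern already used in this test (est and Xt as in the surrounding test code):

with pytest.warns(None) as records:
    inverse = est.inverse_transform(Xt)
assert len(records) == 0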

@glemaitre

Contributor

glemaitre commented Jun 18, 2018

@TomDLT

TomDLT approved these changes Jun 19, 2018

LGTM

@@ -44,12 +47,17 @@ def test_missing_value_handling(est, func, support_sparse):
     assert np.any(np.isnan(X_test), axis=0).all()
     X_test[:, 0] = np.nan  # make sure this boundary case is tested
-    Xt = est.fit(X_train).transform(X_test)
+    with pytest.warns(None) as records:

@TomDLT

TomDLT Jun 19, 2018

Member

Why not use sklearn.utils.testing.assert_no_warnings?

@glemaitre

glemaitre Jun 21, 2018

Contributor

I wanted to stick to pytest while waiting for this feature: pytest-dev/pytest#1830
The second point is that I find

with pytest.warns(None):
    X_t = est.whatever(X)

more readable than

X_t = assert_no_warnings(est.whatever, X)

@ogrisel are you also in favor of assert_no_warnings? If yes, it is 2 vs 1 and I will make the change :)

@ogrisel

ogrisel Jun 22, 2018

Member

No ok, this is fine as it is.

@ogrisel

Member

ogrisel commented Jun 21, 2018

@glemaitre This PR needs a merge from the current master.

@jnothman

Member

jnothman commented Jun 21, 2018

@ogrisel

Member

ogrisel commented Jun 22, 2018

@glemaitre can you please merge master to check that the tests still pass?

@ogrisel

Member

ogrisel commented Jun 22, 2018

@glemaitre can you please merge master to check that the tests still pass?

Actually wrong mention. I meant @LucijaGregov instead of @glemaitre, sorry.

@jorisvandenbossche

Contributor

jorisvandenbossche commented Jun 22, 2018

@glemaitre can you please merge master to check that the tests still pass?

I think Guillaume will not be online most of the day, so if you want to merge this, it might be easier to quickly do the merge of master yourself.

@ogrisel

Member

ogrisel commented Jun 22, 2018

Actually, let me push the merge commit myself.

@glemaitre

Contributor

glemaitre commented Jun 22, 2018

-        Xt_col = est.transform(X_test[:, [i]])
+        with pytest.warns(None) as records:
+            Xt_col = est.transform(X_test[:, [i]])
+        assert len(records) == 0

@ogrisel

ogrisel Jun 22, 2018

Member

This new test_common.py assertion breaks with the PowerTransformer, which complains about all-NaN columns.

I am not sure whether we should raise this warning or not (maybe not at transform time), but this should be consistent across all the transformers.

@glemaitre

glemaitre Jun 22, 2018

Contributor

I made a push. The cause was the np.nanmin(X) call used to check that the matrix is strictly positive. This case will return a full-NaN matrix as well, so everything will be fine (or at least the problem is forwarded to the next step in the pipeline).
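
As a small standalone illustration (not the PR's code) of the NumPy behaviour involved: np.nanmin emits a RuntimeWarning and returns NaN when a reduced slice contains only NaN values.

import warnings
import numpy as np

X = np.full((3, 2), np.nan)
X[:, 1] = [1.0, 2.0, 3.0]  # only the first column stays all-NaN

with warnings.catch_warnings(record=True) as records:
    warnings.simplefilter("always")
    np.nanmin(X, axis=0)  # column 0 is an all-NaN slice
assert any(issubclass(r.category, RuntimeWarning) for r in records)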

@glemaitre

Contributor

glemaitre commented Jun 22, 2018

jnothman merged commit f43dd0e into scikit-learn:master on Jun 23, 2018

8 checks passed

LGTM analysis: Python No alert changes
ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: python2 Your tests passed on CircleCI!
ci/circleci: python3 Your tests passed on CircleCI!
codecov/patch 100% of diff hit (target 95.31%)
codecov/project 95.32% (+<.01%) compared to 93382cc
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed