
[MRG+2] Add a test for sample weights for estimators #11558

Merged

merged 13 commits into scikit-learn:master from sergulaydore:refactor_sample_weights on Jul 17, 2018

Conversation

6 participants
sergulaydore (Contributor) commented Jul 16, 2018:

Reference Issues/PRs

Adds the invariance test for sample weights mentioned in issue #11316 (Refactor tests for sample weights).

What is new?

This adds a generic test for estimators that makes sure sample weights yield consistent results.

What does this implement/fix? Explain your changes.

The test checks that an estimator's output is the same for sample_weight=None and sample_weight=np.ones(...).
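
For illustration only, a minimal sketch of the invariance being checked; the estimator, data and tolerance below are placeholders rather than the exact code added in this PR:

import numpy as np
from numpy.testing import assert_allclose
from sklearn.linear_model import LogisticRegression

X = np.array([[1, 3], [2, 1], [3, 3], [4, 1]] * 4, dtype=float)
y = np.array([1, 2, 1, 2] * 4)

# Fitting with unit weights should give the same model as fitting with
# sample_weight=None.
est_none = LogisticRegression(random_state=0).fit(X, y, sample_weight=None)
est_ones = LogisticRegression(random_state=0).fit(
    X, y, sample_weight=np.ones(len(y)))

assert_allclose(est_none.predict(X), est_ones.predict(X))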

Any other comments?

Pairwise methods are skipped as they require pairwise data.

GaelVaroquaux (Member): Looks good, aside from two minor comments.

X = np.array([[1, 3], [1, 3], [1, 3], [1, 3],
              [2, 1], [2, 1], [2, 1], [2, 1],
              [3, 3], [3, 3], [3, 3], [3, 3],
              [4, 1], [4, 1], [4, 1], [4, 1]])

GaelVaroquaux (Member), Jul 16, 2018: I would force the dtype to be float here.

sergulaydore (Contributor), Jul 16, 2018: Done!

"sample_weight=ones" % name)
if hasattr(estimator_orig, "transform"):
X_pred1 = estimator1.transform(X)
X_pred2 = estimator2.transform(X)

GaelVaroquaux (Member), Jul 16, 2018: nitpick: I would call these X_trans1 and X_trans2.

sergulaydore (Contributor), Jul 16, 2018: Done!

sergulaydore (Contributor), Jul 17, 2018: Changed this back to X_pred1 and X_pred2 because I moved both methods inside a for loop.

GaelVaroquaux changed the title from "Add a test for sample weights for estimators" to "[MRG+1] Add a test for sample weights for estimators" on Jul 16, 2018

GaelVaroquaux added this to the 0.20 milestone on Jul 16, 2018

              [3, 3], [3, 3], [3, 3], [3, 3],
              [4, 1], [4, 1], [4, 1], [4, 1]], dtype=np.dtype('float'))
y = np.array([1, 1, 1, 1, 2, 2, 2, 2,
              1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))

GaelVaroquaux (Member), Jul 16, 2018: Actually, I would have kept y as an integer. Only X as float.

sergulaydore (Contributor), Jul 17, 2018: Done!

glemaitre (Contributor): I think it would be great to add an entry in the what's new as well.

@@ -553,6 +554,45 @@ def check_sample_weights_list(name, estimator_orig):
    estimator.fit(X, y, sample_weight=sample_weight)


@ignore_warnings(category=(DeprecationWarning, FutureWarning))
def check_sample_weight_invariance(name, estimator_orig):

glemaitre (Contributor), Jul 16, 2018: It could be worth adding a one-line comment on the purpose of the test.

sergulaydore (Contributor), Jul 17, 2018: Done!

y = np.array([1, 1, 1, 1, 2, 2, 2, 2,
              1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))
if has_fit_parameter(estimator_orig, "random_state"):

glemaitre (Contributor), Jul 16, 2018: You can replace those using set_random_state from sklearn.utils.testing; it will check whether the estimator already has a random_state.

sergulaydore (Contributor), Jul 17, 2018: Done!

              1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))
if has_fit_parameter(estimator_orig, "random_state"):
    estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)), random_state=0)

glemaitre (Contributor), Jul 16, 2018: so basically:

from sklearn.utils.testing import set_random_state

set_random_state(estimator1, random_state=42)
set_random_state(estimator2, random_state=42)

estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
estimator2.fit(X, y=y, sample_weight=None)

sergulaydore (Contributor), Jul 17, 2018: Thanks! Used it as is.

estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
estimator2.fit(X, y=y, sample_weight=None)
if hasattr(estimator_orig, "predict"):

glemaitre (Contributor), Jul 16, 2018: make a loop here:

for method in ('predict', 'transform'):
    if hasattr(estimator_orig, method):
        ...

sergulaydore (Contributor), Jul 17, 2018: Done!

X_pred1 = estimator1.predict(X)
X_pred2 = estimator2.predict(X)
try:
    assert_allclose(X_pred1, X_pred2, rtol=0.5)

glemaitre (Contributor), Jul 16, 2018: You might want to use assert_allclose_dense_sparse if the output could be sparse.

glemaitre (Contributor), Jul 16, 2018: Pass the err_msg to assert_allclose_dense_sparse, avoiding the try ... except ... to raise the proper error.
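
As a hedged illustration of that suggestion (the estimator name and outputs below are made up, and the snippet assumes sklearn.utils.testing.assert_allclose_dense_sparse as it existed at the time; only the err_msg pattern is the point):

import numpy as np
from scipy import sparse
from sklearn.utils.testing import assert_allclose_dense_sparse

name = "SomeEstimator"  # hypothetical name, used only in the error message
X_pred1 = sparse.csr_matrix(np.eye(3))
X_pred2 = sparse.csr_matrix(np.eye(3))

# err_msg becomes part of the AssertionError raised on mismatch, so no
# try/except wrapper is needed to attach a readable failure message.
assert_allclose_dense_sparse(
    X_pred1, X_pred2,
    err_msg="For %s sample_weight=None is not equivalent"
            " to sample_weight=ones" % name)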

sergulaydore (Contributor), Jul 17, 2018: I don't think we expect the output to be sparse. err_msg is done!

TomDLT (Member) commented Jul 16, 2018:

I did not look into your code, but you might get inspired by the excellent test in #10803 (see here).

sergulaydore and others added some commits Jul 17, 2018

GaelVaroquaux (Member) commented Jul 17, 2018:

Tests are passing! Congratulations.

But you had a PEP8 failure. I fixed it on your branch. We need to wait for the tests to run again.

GaelVaroquaux (Member) commented Jul 17, 2018:

@glemaitre: can you reassess your review? We think that this is good to go.

qinhanmin2014 (Member) commented Jul 17, 2018:

I don't understand why we need to skip "KMeans" and "MiniBatchKMeans"; the test passes locally on my PC with these two classes. Also, at the very least we have things like v_measure_score to use here?
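
For what it's worth, a rough sketch of how v_measure_score could be used for the clustering case; the data is a placeholder and the snippet assumes KMeans.fit accepts sample_weight:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

X = np.array([[1, 3], [2, 1], [3, 3], [4, 1]] * 4, dtype=float)

km_ones = KMeans(n_clusters=2, random_state=0).fit(
    X, sample_weight=np.ones(len(X)))
km_none = KMeans(n_clusters=2, random_state=0).fit(X, sample_weight=None)

# v_measure_score is invariant to a permutation of the cluster labels, so a
# score of 1.0 means the two fits group the samples identically.
score = v_measure_score(km_ones.labels_, km_none.labels_)
assert score > 0.999, score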

# unit weights and no weights
if (has_fit_parameter(estimator_orig, "sample_weight") and
        not (hasattr(estimator_orig, "_pairwise")
             and estimator_orig._pairwise) and

qinhanmin2014 (Member), Jul 17, 2018: Also, I'd prefer some comments here to show us what you're doing (why do you skip the test?).

sergulaydore (Contributor), Jul 17, 2018: Done! Basically, the data we are testing is not pairwise. Tests for estimators with _pairwise fail otherwise.

GaelVaroquaux (Member) commented Jul 17, 2018:

Indeed, I think that it's good to add comments.

I canceled the corresponding Travis builds to save time in the queue.

             and estimator_orig._pairwise) and
        name not in ["KMeans", "MiniBatchKMeans"]):
    # We skip pairwise because the data is not pairwise
    # KMeans and MiniBatchKMeans were unstable; hence skipped.

qinhanmin2014 (Member), Jul 17, 2018: I'm confused about this. Are they still unstable under a fixed random_state? If so, is there a bug? (Sorry if I made a mistake :) Go ahead if you have enough confidence.)

sergulaydore (Contributor), Jul 17, 2018:

This is strange. When I initially ran this without using set_random_state, they were failing. After using set_random_state, they seem to be stable. I wonder if estimator.fit(..., random_state=0) misses something that set_random_state handles. I can include those estimators in the test. @GaelVaroquaux WDYT?

glemaitre (Contributor), Jul 17, 2018:

It is weird to pass random_state in fit. The state should be passed when instantiating the object, or by setting the parameter, which is the aim of set_random_state.

Regarding KMeans and MiniBatchKMeans, they rely on a random initialization, and it is possible that the random state is not actually used when it is passed to fit.

sergulaydore (Contributor) commented Jul 17, 2018:

@qinhanmin2014: "KMeans" and "MiniBatchKMeans" are skipped because they were unstable.

qinhanmin2014 (Member) commented Jul 17, 2018:

I would expect estimators in scikit-learn to be deterministic with the same input and a fixed random_state, or am I wrong?


glemaitre (Contributor) commented Jul 17, 2018:

Oh, you added KMeans back; cool, so this is fine then.
Waiting for the CI to be green.

qinhanmin2014 (Member): LGTM if CIs are green. Thanks @sergulaydore.

qinhanmin2014 changed the title from "[MRG+1] Add a test for sample weights for estimators" to "[MRG+2] Add a test for sample weights for estimators" on Jul 17, 2018

GaelVaroquaux (Member) commented Jul 17, 2018:

OK, Travis is green. Merging. Hurray!

GaelVaroquaux merged commit bcd6ff3 into scikit-learn:master on Jul 17, 2018

3 of 5 checks passed

ci/circleci: deploy - Your tests are queued behind your running builds
continuous-integration/appveyor/pr - Waiting for AppVeyor build to complete
ci/circleci: python2 - Your tests passed on CircleCI!
ci/circleci: python3 - Your tests passed on CircleCI!
continuous-integration/travis-ci/pr - The Travis CI build passed

GaelVaroquaux added this to Done in scikit-learn 0.20 via automation on Jul 17, 2018

lesteve moved this from Done to PRs tagged in scikit-learn 0.20 on Jul 17, 2018

jnothman (Member) commented Jul 18, 2018:

Why do we need an rtol as high as 0.5?

jnothman (Member) commented Jul 18, 2018:

This doesn't assert that weighting makes any difference ever, does it?

sergulaydore (Contributor) commented Jul 18, 2018:

@jnothman Actually, we don't need rtol=0.5; I just tested with the default value and the tests still passed. No, it only makes sure that no weighting is equivalent to unit weighting. If you recommend, I can submit another PR with random weights.

qinhanmin2014 (Member) commented Jul 18, 2018:

+1 to use the default rtol. @sergulaydore please submit a PR, thanks.

sergulaydore (Contributor) commented Jul 18, 2018:

Done! Please see #11621.

sergulaydore deleted the sergulaydore:refactor_sample_weights branch on Jul 18, 2018
