
[MRG+2] Add a test for sample weights for estimators #11558

Merged

Conversation

@sergulaydore (Contributor) commented Jul 16, 2018

Reference Issues/PRs

Fixes the invariance test for sample weights, as mentioned in issue #11316 (Refactor tests for sample weights).

What is new?

This is a generic test for estimators that makes sure sample weights yield consistent results.

What does this implement/fix? Explain your changes.

The test checks that the output of an estimator is the same for sample_weight=None and sample_weight=np.ones(shape=len(y)).

Any other comments?

Pairwise methods are skipped as they require pairwise data.
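In outline, the check fits two clones of the estimator, one with unit weights and one with no weights, and compares their outputs. A minimal sketch (the function name and structure here are illustrative, not the exact helper merged in this PR):

import numpy as np
from numpy.testing import assert_allclose
from sklearn.base import clone

def check_unit_weight_invariance(estimator_orig, X, y):
    # Fit one clone with unit weights and one with no weights.
    estimator1 = clone(estimator_orig)
    estimator2 = clone(estimator_orig)
    estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
    estimator2.fit(X, y=y, sample_weight=None)
    # Unit weights must reproduce the unweighted predictions.
    if hasattr(estimator_orig, "predict"):
        assert_allclose(estimator1.predict(X), estimator2.predict(X))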

@GaelVaroquaux (Member) left a comment

Looks good, aside from 2 minor comments.

X = np.array([[1, 3], [1, 3], [1, 3], [1, 3],
              [2, 1], [2, 1], [2, 1], [2, 1],
              [3, 3], [3, 3], [3, 3], [3, 3],
              [4, 1], [4, 1], [4, 1], [4, 1]])

@GaelVaroquaux (Member) Jul 16, 2018

I would force the dtype to be float here.

@sergulaydore (Author) Jul 16, 2018

Done!

"sample_weight=ones" % name)
if hasattr(estimator_orig, "transform"):
X_pred1 = estimator1.transform(X)
X_pred2 = estimator2.transform(X)

@GaelVaroquaux (Member) Jul 16, 2018

nitpick: I would call these X_trans1 and X_trans2

@sergulaydore (Author) Jul 16, 2018

Done!

@sergulaydore (Author) Jul 17, 2018

Changed this back to X_pred1 and X_pred2 because I moved both methods inside a for loop.

@GaelVaroquaux GaelVaroquaux changed the title Add a test for sample weights for estimators [MRG+1] Add a test for sample weights for estimators Jul 16, 2018
@GaelVaroquaux GaelVaroquaux added this to the 0.20 milestone Jul 16, 2018
[3, 3], [3, 3], [3, 3], [3, 3],
[4, 1], [4, 1], [4, 1], [4, 1]], dtype=np.dtype('float'))
y = np.array([1, 1, 1, 1, 2, 2, 2, 2,
              1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))

@GaelVaroquaux (Member) Jul 16, 2018

Actually, I would have kept y as an integer. Only X as float.

@sergulaydore (Author) Jul 17, 2018

Done!

@glemaitre (Contributor) left a comment

I think it would be great to add an entry in the what's new as well.

@@ -553,6 +554,45 @@ def check_sample_weights_list(name, estimator_orig):
estimator.fit(X, y, sample_weight=sample_weight)


@ignore_warnings(category=(DeprecationWarning, FutureWarning))
def check_sample_weight_invariance(name, estimator_orig):

@glemaitre (Contributor) Jul 16, 2018

It might be worth adding a one-line comment regarding the purpose of the test.

@sergulaydore (Author) Jul 17, 2018

Done!

y = np.array([1, 1, 1, 1, 2, 2, 2, 2,
              1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))

if has_fit_parameter(estimator_orig, "random_state"):

@glemaitre (Contributor) Jul 16, 2018

You can replace those using set_random_state from sklearn.utils.testing; it will check whether the estimator already has a random_state parameter.

@sergulaydore (Author) Jul 17, 2018

Done!

1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))

if has_fit_parameter(estimator_orig, "random_state"):
estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)), random_state=0)

@glemaitre (Contributor) Jul 16, 2018

so basically:

from sklearn.utils.testing import set_random_state

set_random_state(estimator1, random_state=42)
set_random_state(estimator2, random_state=42)

estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
estimator2.fit(X, y=y, sample_weight=None)

@sergulaydore (Author) Jul 17, 2018

Thanks! Used it as it is.

estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
estimator2.fit(X, y=y, sample_weight=None)

if hasattr(estimator_orig, "predict"):

@glemaitre (Contributor) Jul 16, 2018

make a loop here

for method in ('predict', 'transform'):
    if hasattr(estimator_orig, method):
        ...

@sergulaydore (Author) Jul 17, 2018

Done!
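Combined with the set_random_state suggestion above, the loop might look like this (a sketch following the review discussion, not necessarily the merged code; assert_allclose is numpy.testing's):

for method in ("predict", "transform"):
    if hasattr(estimator_orig, method):
        # Compare the weighted and unweighted clones method by method.
        X_pred1 = getattr(estimator1, method)(X)
        X_pred2 = getattr(estimator2, method)(X)
        assert_allclose(X_pred1, X_pred2)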

X_pred1 = estimator1.predict(X)
X_pred2 = estimator2.predict(X)
try:
    assert_allclose(X_pred1, X_pred2, rtol=0.5)

@glemaitre (Contributor) Jul 16, 2018

Maybe you want to use assert_allclose_dense_sparse if the output could be sparse.

@glemaitre (Contributor) Jul 16, 2018

Pass the err_msg to assert_allclose_dense_sparse, avoiding the try ... except ... needed to raise the proper error.

@sergulaydore (Author) Jul 17, 2018

I don't think we expect the output to be sparse.
err_msg is done!
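For instance, instead of wrapping the assertion in try ... except to build the message, the message can be passed directly (a sketch; the exact wording of the message is illustrative):

assert_allclose(X_pred1, X_pred2,
                err_msg="For %s sample_weight=None is not"
                        " equivalent to sample_weight=ones" % name)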

@TomDLT (Member) commented Jul 16, 2018

I did not look into your code, but you might get inspired by the excellent test in #10803; see here.

sergulaydore and others added 5 commits Jul 17, 2018
@GaelVaroquaux (Member) commented Jul 17, 2018

Tests are passing! Congratulations.

But you had a PEP8 failure; I fixed it on your branch. We need to wait for the tests to run again.

@GaelVaroquaux (Member) commented Jul 17, 2018

@glemaitre: can you reassess your review? We think that this is good to go.

@qinhanmin2014 (Member) commented Jul 17, 2018

I can't understand why we need to skip "KMeans" and "MiniBatchKMeans". The test passes locally on my PC with these two classes. Also, at the very least we have things like v_measure_score to use here.
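For clustering estimators, a label-permutation-invariant comparison along those lines might look like this (a sketch reusing the estimator1/estimator2 clones from the test; it assumes the estimator's fit_predict accepts sample_weight, as KMeans does in 0.20):

from sklearn.metrics import v_measure_score

labels1 = estimator1.fit_predict(X, sample_weight=np.ones(shape=len(X)))
labels2 = estimator2.fit_predict(X, sample_weight=None)
# v_measure_score is 1.0 exactly when the two clusterings
# agree up to a relabeling of the clusters.
assert v_measure_score(labels1, labels2) == 1.0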

# unit weights and no weights
if (has_fit_parameter(estimator_orig, "sample_weight") and
        not (hasattr(estimator_orig, "_pairwise")
             and estimator_orig._pairwise) and

@qinhanmin2014 (Member) Jul 17, 2018

Also, I'd prefer some comments here to show us what you're doing (why do you skip the test?).

@sergulaydore (Author) Jul 17, 2018

Done! Basically, the data we are testing is not pairwise. Estimator tests with _pairwise fail otherwise.

@GaelVaroquaux (Member) commented Jul 17, 2018

Indeed, I think that it's good to add comments.

I canceled the corresponding Travis builds to save time in the queue.

        and estimator_orig._pairwise) and
        name not in ["KMeans", "MiniBatchKMeans"]):
    # We skip pairwise because the data is not pairwise
    # KMeans and MiniBatchKMeans were unstable; hence skipped.

@qinhanmin2014 (Member) Jul 17, 2018

I'm confused about it. Are they still unstable under a fixed random_state? If so, is there a bug?
(Sorry if I made a mistake :) Go ahead if you have enough confidence.)

@sergulaydore (Author) Jul 17, 2018

This is strange. When I initially ran this without using set_random_state, they were failing. After using set_random_state, they seem to be stable. I wonder if estimator.fit(..., random_state=0) misses something that set_random_state does not. I can include those estimators in the test. @GaelVaroquaux WDYT?

@glemaitre (Contributor) Jul 17, 2018

It is weird to pass random_state in fit. The state should be passed when instantiating the object, or by setting the parameter, which is the aim of set_random_state.

Regarding KMeans and MiniBatchKMeans, they rely on a random initialization, and it is possible that the random state is not actually used when it is passed to fit.

@sergulaydore (Author) commented Jul 17, 2018

@qinhanmin2014 : "KMeans" and "MiniBatchKMeans" are skipped because they were unstable.

@qinhanmin2014 (Member) commented Jul 17, 2018

I would expect estimators in scikit-learn to be deterministic given the same input and a fixed random_state, or am I wrong?

@glemaitre (Contributor) commented Jul 17, 2018

Oh, you added KMeans back; cool, so this is fine then.
Waiting for the CI to be green.

@qinhanmin2014 (Member) left a comment

LGTM if CIs are green. Thanks @sergulaydore

@qinhanmin2014 qinhanmin2014 changed the title [MRG+1] Add a test for sample weights for estimators [MRG+2] Add a test for sample weights for estimators Jul 17, 2018
@GaelVaroquaux (Member) commented Jul 17, 2018

OK, Travis is green. Merging. Hurray!

@GaelVaroquaux GaelVaroquaux merged commit bcd6ff3 into scikit-learn:master Jul 17, 2018
3 of 5 checks passed
ci/circleci: deploy Your tests are queued behind your running builds
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
ci/circleci: python2 Your tests passed on CircleCI!
ci/circleci: python3 Your tests passed on CircleCI!
continuous-integration/travis-ci/pr The Travis CI build passed
@GaelVaroquaux GaelVaroquaux added this to Done in scikit-learn 0.20 via automation Jul 17, 2018
@lesteve lesteve moved this from Done to PRs tagged in scikit-learn 0.20 Jul 17, 2018
@jnothman (Member) commented Jul 18, 2018

Why do we need an rtol as high as 0.5?

@jnothman (Member) commented Jul 18, 2018

This doesn't assert that weighting makes any difference ever, does it?

@sergulaydore (Author) commented Jul 18, 2018

@jnothman Actually, we don't need rtol=0.5; I just tested with the default value and the tests still passed. No, it only makes sure that no weighting is equivalent to unit weighting. If you recommend it, I can submit another PR with random weights.
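For reference, numpy.testing.assert_allclose defaults to rtol=1e-07 and atol=0, so dropping the explicit rtol=0.5 makes the comparison far stricter:

from numpy.testing import assert_allclose

# Defaults: rtol=1e-07, atol=0
assert_allclose(X_pred1, X_pred2)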

@qinhanmin2014 (Member) commented Jul 18, 2018

+1 to use the default rtol. @sergulaydore please submit a PR, thanks.

@sergulaydore (Author) commented Jul 18, 2018

Done! Please see #11621.

@sergulaydore sergulaydore deleted the sergulaydore:refactor_sample_weights branch Jul 18, 2018