[MRG+2] Add a test for sample weights for estimators #11558
Conversation
Looks good, aside from 2 minor comments.
X = np.array([[1, 3], [1, 3], [1, 3], [1, 3],
              [2, 1], [2, 1], [2, 1], [2, 1],
              [3, 3], [3, 3], [3, 3], [3, 3],
              [4, 1], [4, 1], [4, 1], [4, 1]])
GaelVaroquaux
Jul 16, 2018
Member
I would force the dtype to be float here.
sergulaydore
Jul 16, 2018
Author
Contributor
Done!
"sample_weight=ones" % name) | ||
if hasattr(estimator_orig, "transform"): | ||
X_pred1 = estimator1.transform(X) | ||
X_pred2 = estimator2.transform(X) |
GaelVaroquaux
Jul 16, 2018
Member
nitpick: I would call these X_trans1 and X_trans2
sergulaydore
Jul 16, 2018
Author
Contributor
Done!
sergulaydore
Jul 17, 2018
Author
Contributor
Changed this back to X_pred1 and X_pred2 because I moved both methods inside a for loop.
              [3, 3], [3, 3], [3, 3], [3, 3],
              [4, 1], [4, 1], [4, 1], [4, 1]], dtype=np.dtype('float'))
y = np.array([1, 1, 1, 1, 2, 2, 2, 2,
              1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))
GaelVaroquaux
Jul 16, 2018
Member
Actually, I would have kept y as an integer. Only X as float.
sergulaydore
Jul 17, 2018
Author
Contributor
Done!
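Putting the two dtype comments together, a sketch of the resulting data (the values mirror the quoted diff; only the dtypes change, X forced to float and y kept integer):

import numpy as np

X = np.array([[1, 3], [1, 3], [1, 3], [1, 3],
              [2, 1], [2, 1], [2, 1], [2, 1],
              [3, 3], [3, 3], [3, 3], [3, 3],
              [4, 1], [4, 1], [4, 1], [4, 1]], dtype=float)
y = np.array([1, 1, 1, 1, 2, 2, 2, 2,
              1, 1, 1, 1, 2, 2, 2, 2], dtype=int)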
I think that it could be great to add an entry in the what's new as well.
@@ -553,6 +554,45 @@ def check_sample_weights_list(name, estimator_orig):
        estimator.fit(X, y, sample_weight=sample_weight)


@ignore_warnings(category=(DeprecationWarning, FutureWarning))
def check_sample_weight_invariance(name, estimator_orig):
glemaitre
Jul 16, 2018
Contributor
It could be worth adding a one-line comment explaining the purpose of the test.
sergulaydore
Jul 17, 2018
Author
Contributor
Done!
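A sketch of what addressing this might look like; the decorator and signature come from the quoted diff, while the comment wording is an assumption rather than the merged text:

from sklearn.utils.testing import ignore_warnings


@ignore_warnings(category=(DeprecationWarning, FutureWarning))
def check_sample_weight_invariance(name, estimator_orig):
    # Purpose: check that fitting with sample_weight=np.ones(n_samples)
    # gives the same results as fitting with sample_weight=None.
    ...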
y = np.array([1, 1, 1, 1, 2, 2, 2, 2,
              1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))

if has_fit_parameter(estimator_orig, "random_state"):
glemaitre
Jul 16, 2018
Contributor
you can replace those using set_random_state from sklearn.utils.testing; it will check whether the estimator already has a random_state.
sergulaydore
Jul 17, 2018
Author
Contributor
Done!
              1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('float'))

if has_fit_parameter(estimator_orig, "random_state"):
    estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)), random_state=0)
glemaitre
Jul 16, 2018
Contributor
so basically:

from sklearn.utils.testing import set_random_state

set_random_state(estimator1, random_state=42)
set_random_state(estimator2, random_state=42)
estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
estimator2.fit(X, y=y, sample_weight=None)
sergulaydore
Jul 17, 2018
Author
Contributor
Thanks! Used it as it is.
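For completeness, a self-contained version of the snippet above; LogisticRegression is only a hypothetical stand-in for the estimator clones used by the common test:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.testing import set_random_state

X = np.array([[1, 3], [2, 1], [3, 3], [4, 1]] * 4, dtype=float)
y = np.array([1, 2, 1, 2] * 4)

estimator1 = LogisticRegression()
estimator2 = LogisticRegression()

# set_random_state only touches estimators that expose a random_state
# parameter, so it is safe to call unconditionally.
set_random_state(estimator1, random_state=42)
set_random_state(estimator2, random_state=42)

estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
estimator2.fit(X, y=y, sample_weight=None)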
estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
estimator2.fit(X, y=y, sample_weight=None)

if hasattr(estimator_orig, "predict"):
glemaitre
Jul 16, 2018
Contributor
make a loop here

for method in ('predict', 'transform'):
    if hasattr(estimator_orig, method):
        ...
sergulaydore
Jul 17, 2018
Author
Contributor
Done!
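The rename back to X_pred1/X_pred2 is consistent with looping over both methods; a sketch of the suggested loop (estimator1 and estimator2 assumed fitted as above):

for method in ("predict", "transform"):
    if hasattr(estimator_orig, method):
        # X_pred holds either predictions or transformed data,
        # depending on the method.
        X_pred1 = getattr(estimator1, method)(X)
        X_pred2 = getattr(estimator2, method)(X)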
X_pred1 = estimator1.predict(X)
X_pred2 = estimator2.predict(X)
try:
    assert_allclose(X_pred1, X_pred2, rtol=0.5)
glemaitre
Jul 16, 2018
Contributor
maybe you might want to use assert_allclose_dense_sparse if the output could be sparse.
glemaitre
Jul 16, 2018
Contributor
pass the err_msg to assert_allclose_dense_sparse avoiding the try .. except ... to raise the proper error.
sergulaydore
Jul 17, 2018
Author
Contributor
I don't think we expect the output to be sparse.
err_msg is done!
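A sketch of the err_msg variant, which lets assert_allclose raise a descriptive failure directly instead of wrapping it in try/except; the message text follows the fragment quoted earlier, and X_pred1, X_pred2, and name are assumed to be in scope:

from numpy.testing import assert_allclose

assert_allclose(X_pred1, X_pred2,
                err_msg="For %s sample_weight=None is not"
                        " equivalent to sample_weight=ones" % name)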
Tests are passing! Congratulations. But you had a PEP8 failure. I fixed it on your branch. We need to wait for the tests to run again.
@glemaitre: can you reassess your review? We think that this is good to go.
I can't understand why we need to skip "KMeans" and "MiniBatchKMeans". It seems that the test passes locally on my PC with these two classes. Also, at least we have things like
# unit weights and no weights
if (has_fit_parameter(estimator_orig, "sample_weight") and
        not (hasattr(estimator_orig, "_pairwise")
             and estimator_orig._pairwise) and
qinhanmin2014
Jul 17, 2018
Member
Also, I'd prefer some comments here to show us what you're doing (why do you skip the test?).
sergulaydore
Jul 17, 2018
Author
Contributor
Done! Basically, the data we are testing is not pairwise. Estimator tests with _pairwise fail otherwise.
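A hypothetical reconstruction of the guard being discussed (the surrounding generator and the yield are assumptions, not the merged code): pairwise estimators expect a precomputed matrix, while the test data here is plain feature vectors.

from sklearn.utils.validation import has_fit_parameter

if (has_fit_parameter(estimator_orig, "sample_weight") and
        not (hasattr(estimator_orig, "_pairwise")
             and estimator_orig._pairwise)):
    # skip pairwise estimators: the test data is not a pairwise matrix
    yield check_sample_weight_invariance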
Indeed, I think that it's good to add comments. I canceled the corresponding travis builds, to save time in the queue.
             and estimator_orig._pairwise) and
        name not in ["KMeans", "MiniBatchKMeans"]):
    # We skip pairwise because the data is not pairwise
    # KMeans and MiniBatchKMeans were unstable; hence skipped.
qinhanmin2014
Jul 17, 2018
Member
I'm confused about it. Are they still unstable under a fixed random_state? If so, is there a bug?
(Sorry if I made a mistake :) Go ahead if you have enough confidence)
sergulaydore
Jul 17, 2018
Author
Contributor
This is strange. When I initially ran this without using set_random_state, they were failing. After using set_random_state, they seem to be stable. I wonder if estimator.fit(..., random_state=0) is missing something that set_random_state does not. I can include those estimators in the test. @GaelVaroquaux WDYT?
glemaitre
Jul 17, 2018
Contributor
It is weird to pass random_state in fit. The state should be passed when instantiating the object, or by setting the parameter afterwards, which is the aim of set_random_state.
Regarding KMeans and MiniBatchKMeans, they rely on a random initialization, and it is possible that we do not actually use the random state when passing random_state to fit.
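To illustrate the distinction: set_random_state sets the random_state constructor parameter via set_params, whereas only a few estimators accept random_state as a fit parameter, so the original has_fit_parameter guard never fixed the seed for KMeans or MiniBatchKMeans at all, which would explain the instability. A minimal sketch, using KMeans as the example:

from sklearn.cluster import KMeans
from sklearn.utils.testing import set_random_state

est = KMeans(n_clusters=2)
# KMeans exposes random_state in its constructor, so set_params applies it.
set_random_state(est, random_state=42)
print(est.random_state)  # 42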
@qinhanmin2014: "KMeans" and "MiniBatchKMeans" are skipped because they were unstable.
I would expect estimators in scikit-learn to be deterministic with the same input and a fixed random_state, or am I wrong?
Oh you added back kmeans, cool so this is fine then.
LGTM if CIs are green. Thanks @sergulaydore
OK, travis is green. Merging. Hurray!
Merged commit bcd6ff3 into scikit-learn:master
Why do we need an rtol as high as 0.5?
This doesn't assert that weighting makes any difference ever, does it?
@jnothman Actually we don't need rtol as 0.5. I just tested with the default value and the tests still passed. No, it only makes sure that no weighting is equivalent to unit weighting. If you recommend, I can submit another PR with random weights.
+1 to use the default rtol, @sergulaydore please submit a PR, thanks.
Done! Please see #11621.
Reference Issues/PRs
Fixes the invariant test for sample weights as mentioned in issue #11316 (Refactor tests for sample weights).
What is new?
This is a generic test for estimators that makes sure the sample weights yield consistent results.
What does this implement/fix? Explain your changes.
The test checks whether the output of the estimators is the same for sample_weight=None and sample_weight=np.ones().
Any other comments?
Pairwise methods are skipped as they require pairwise data.
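Pulling the review threads together, a condensed sketch of what the invariance check amounts to; DecisionTreeClassifier and the hard-coded error message are illustrative stand-ins, not the merged code verbatim:

import numpy as np
from numpy.testing import assert_allclose
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.testing import set_random_state

X = np.array([[1, 3], [1, 3], [1, 3], [1, 3],
              [2, 1], [2, 1], [2, 1], [2, 1],
              [3, 3], [3, 3], [3, 3], [3, 3],
              [4, 1], [4, 1], [4, 1], [4, 1]], dtype=float)
y = np.array([1, 1, 1, 1, 2, 2, 2, 2,
              1, 1, 1, 1, 2, 2, 2, 2])

estimator_orig = DecisionTreeClassifier()
estimator1 = clone(estimator_orig)
estimator2 = clone(estimator_orig)
set_random_state(estimator1, random_state=0)
set_random_state(estimator2, random_state=0)

# unit weights vs. no weights should give identical fits
estimator1.fit(X, y=y, sample_weight=np.ones(shape=len(y)))
estimator2.fit(X, y=y, sample_weight=None)

for method in ("predict", "transform"):
    if hasattr(estimator_orig, method):
        X_pred1 = getattr(estimator1, method)(X)
        X_pred2 = getattr(estimator2, method)(X)
        assert_allclose(X_pred1, X_pred2,
                        err_msg="For DecisionTreeClassifier, sample_weight=None"
                                " is not equivalent to sample_weight=ones")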