[MRG+1] test method subset invariance (Fixes #10420) #10428
Conversation
cool!
For now, make exceptions for these cases. Your test also needs an
instructive error message when the assertion fails, and an explicit test.
SparsePCA uses a ridge regression where each sample is a target. I don't
think this should affect things, but it might. It then uses a global
normalisation step which is problematic as in OvO/SVC. Please open a new
issue for this.
RBM I don't think we can easily fix. It randomly corrupts one feature in
each sample. For a fixed random seed, the first sample will always have the
same feature corrupted... The only alternative I can think of is to make
the random choices a function of a cryptographic hash of the data, which is
overkill.
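For what it's worth, the hash idea could look like the sketch below. Everything here is illustrative, not sklearn code: `corrupt_one_feature` is a hypothetical stand-in for the RBM's corruption step, seeding a per-sample RNG from a SHA-256 hash of the sample's own bytes so the corrupted feature depends only on the sample's contents, not on its position in the batch.

```python
import hashlib
import numpy as np

def corrupt_one_feature(X):
    """Zero out one feature per sample, chosen by an RNG seeded from a
    SHA-256 hash of the sample itself (so the choice is order-independent)."""
    X = np.asarray(X, dtype=np.float64)
    X_corrupted = X.copy()
    n_features = X.shape[1]
    for i, row in enumerate(X):
        # Derive a 32-bit seed from the sample's bytes.
        seed = int.from_bytes(hashlib.sha256(row.tobytes()).digest()[:4],
                              "little")
        rng = np.random.RandomState(seed)
        X_corrupted[i, rng.randint(n_features)] = 0.0
    return X_corrupted

X = np.arange(1.0, 13.0).reshape(4, 3)
# Same rows -> same corruption, regardless of which subset they arrive in.
assert np.array_equal(corrupt_one_feature(X)[1:3],
                      corrupt_one_feature(X[1:3]))
```

This would make `score_samples` subset-invariant, at the cost the comment above notes: hashing every sample is overkill for a corruption step.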
I have created the issue for
Check if an increased tolerance works. It's still a bit surprising that that should be necessary.
sklearn/utils/estimator_checks.py
res_all = res_all[0]
res_one = list(map(lambda x: x[0], res_one))
# TODO remove cases when corrected
if [name, method] in [['SVC', 'decision_function'],
Use tuples, like if (name, method) in [('SVC', 'decision_function'), ...].
Tuples are intended for struct-like objects where each field means a different thing; lists are usually for homogeneous semantics.
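To illustrate the suggestion (the skip list below is a hypothetical excerpt; note the parentheses around the pair, which make the membership test valid Python):

```python
# Known-failure cases as (estimator name, method) tuples: each position
# has a distinct meaning, which is what tuples are for.
skip_cases = [('SVC', 'decision_function'),
              ('SparsePCA', 'transform')]

name, method = 'SVC', 'decision_function'
if (name, method) in skip_cases:
    print("skipping known failure:", name, method)
```

A tuple is also hashable, so the same pairs could later be stored in a set for O(1) lookup.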
This should probably not be marked WIP anymore :P
Please add an entry to the change log at
sklearn/utils/estimator_checks.py
@ignore_warnings(category=(DeprecationWarning, FutureWarning))
def check_methods_subset_invariance(name, estimator_orig):
    # check that method gives invariant results if applied
    # one by one or on all elements together.
I would slightly rephrase the comment, replacing "one by one" with something like "if applied on mini-batches or on the whole set".
sklearn/utils/estimator_checks.py
if hasattr(estimator, method):
    msg = ("{method} of {name} is not invariant when applied "
           "to a subset.").format(method=method, name=name)
    func = getattr(estimator, method)
I would personally make a small private function to compute and unpack the data.
def _apply_func(func, X):
    result_full = func(X)
    n_features = X.shape[1]
    result_by_batch = [func(batch.reshape(1, n_features))
                       for batch in X]
    # func can output a tuple (e.g. score_samples)
    if isinstance(result_full, tuple):
        result_full = result_full[0]
        result_by_batch = list(map(lambda x: x[0], result_by_batch))
    return np.ravel(result_full), np.ravel(result_by_batch)

def check_methods_subset_invariance(name, estimator_orig):
    ...
    result_full, result_by_batch = _apply_func(getattr(estimator, method), X)
    ...
    assert_allclose(result_full, result_by_batch,
                    atol=1e-7, err_msg=msg)
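To sanity-check the suggested helper, here is a self-contained version run against a toy function standing in for an estimator method (`row_sum` is purely illustrative; it is trivially subset-invariant, so the comparison passes):

```python
import numpy as np

def _apply_func(func, X):
    # Apply func to the whole set, then to each sample individually.
    result_full = func(X)
    n_features = X.shape[1]
    result_by_batch = [func(batch.reshape(1, n_features))
                       for batch in X]
    # func can output a tuple (e.g. score_samples)
    if isinstance(result_full, tuple):
        result_full = result_full[0]
        result_by_batch = [r[0] for r in result_by_batch]
    return np.ravel(result_full), np.ravel(result_by_batch)

# Toy stand-in for an estimator method: per-sample row sums.
def row_sum(X):
    return X.sum(axis=1)

X = np.arange(6.0).reshape(3, 2)
result_full, result_by_batch = _apply_func(row_sum, X)
np.testing.assert_allclose(result_full, result_by_batch, atol=1e-7)
```

The np.ravel calls flatten both outputs into the same sample order, so a single assert_allclose covers methods returning 1-D and 2-D results alike.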
LGTM apart from a change in a comment and some coding style.
for method in ["predict", "transform", "decision_function",
               "score_samples", "predict_proba"]:

    msg = ("{method} of {name} is not invariant when applied "
We can actually move this message before the assert_allclose, in fact.
@glemaitre do you want to put it outside the for block and format it inside the assert_allclose and SkipTest?
Oh, my bad, I did not see the occurrence in SkipTest. Good as it is.
sklearn/utils/estimator_checks.py
    raise SkipTest(msg)

if hasattr(estimator, method):
    result_full, result_by_batch = _apply_func(getattr(estimator,
It will be easier to read as:

result_full, result_by_batch = _apply_func(
    getattr(estimator, method), X)
@Johayon 2 small nitpicks and this is good to be merged once it is green :)
@Johayon Thanks!!!
Fixes #10420

Added a test for all estimators to check whether any of the methods
predict, predict_proba, decision_function, score_samples, transform
produces a different result if applied on all the data or on a subset
(in this case, each element one by one).

The test currently fails in 4 cases:

- SVC with decision_function (SVC and OneVsOneClassifier decision_function inconsistent on sub-sample #9174)
- SparsePCA with transform (SparsePCA inconsistent on sub-sample #10431)
- MiniBatchSparsePCA with transform (SparsePCA inconsistent on sub-sample #10431)
- BernoulliRBM with score_samples (due to stochasticity; can't easily fix)
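The overall shape of the check the PR adds can be sketched as below. This is a standalone approximation, not the PR's exact code: DecisionTreeClassifier is just a convenient estimator whose relevant methods are subset-invariant, so the loop passes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([0, 0, 1, 1])
est = DecisionTreeClassifier(random_state=0).fit(X, y)

for method in ["predict", "transform", "decision_function",
               "score_samples", "predict_proba"]:
    if not hasattr(est, method):
        continue  # skip methods this estimator does not implement
    func = getattr(est, method)
    # Compare whole-set output with one-sample-at-a-time output.
    result_full = np.ravel(func(X))
    result_by_batch = np.ravel([func(row.reshape(1, -1)) for row in X])
    msg = ("{} of {} is not invariant when applied to a subset."
           .format(method, type(est).__name__))
    np.testing.assert_allclose(result_full, result_by_batch,
                               atol=1e-7, err_msg=msg)
```

Run against the four estimators listed above, the same loop would trip the assertion, which is exactly what the skip list in the PR accounts for.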