New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+1] Read-only input data in common tests #4807
[MRG+1] Read-only input data in common tests #4807
Conversation
sklearn/utils/estimator_checks.py
Outdated
yield check_estimators_partial_fit_n_features | ||
|
||
|
||
def _yield_all_checks(name, Estimator): | ||
#yield check_parameters_default_constructible, name, Estimator | ||
# yield check_parameters_default_constructible, name, Estimator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that wasn't you, but why is this commented? that seems... odd.. whoops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
never mind, it is fine. just remove the line please.
a6894e4
to
b492b44
Compare
Thanks to the range of We observe no failure using files |
@@ -74,7 +73,7 @@ | |||
|
|||
print('Learning the dictionary...') | |||
t0 = time() | |||
dico = MiniBatchDictionaryLearning(n_components=100, alpha=1, n_iter=500) | |||
dico = MiniBatchDictionaryLearning(n_components=100, alpha=1, n_iter=500, batch_size=100, n_jobs=4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that on purpose? Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the noise
👍 for merge on my side. I just want you to add the comment on the copy aspect. Thanks, this is super useful. It may seem to be a detail, but it actually is an important step for scalability. |
Oops, other TODO: the renaming 'Y' to lowercase :). |
2e4776c
to
31269f8
Compare
I suspect a glitch in CI... (same problem as nilearn/nilearn#623) |
Restarted CI => glitch fixed. |
sklearn/utils/validation.py
Outdated
if not copy: | ||
array = np.asarray(array, dtype=dtype, order=order) | ||
else: | ||
array = np.array(array, dtype=dtype, order=order, copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if instead we should not do:
1- always use np.asarray
here.
2- set array_orig = array
at the beginning of the function
3- then at the end of this function, just before returning:
if copy and array is array_orig:
array = array.copy()
Change in |
sklearn/utils/estimator_checks.py
Outdated
def check_classifiers_train(name, Classifier): | ||
X_m, y_m = make_blobs(random_state=0) | ||
def check_classifiers_train(name, Classifier, readonly=False): | ||
X_m, y_m = _make_blobs_with_mode(random_state=0, readonly=readonly) | ||
X_m, y_m = shuffle(X_m, y_m, random_state=7) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blobs are shuffled, right?
I just resuscitated this PR, work needs to be done to fix numerous failures |
|
e057735
to
a664269
Compare
610c6a0
to
80ec85a
Compare
My bad. check_array replaced memory map by array , which is an unwanted behavior that prevent estimator to run out-of-core. This is now fixed with regression test provided. I use np.asarray which transform memory map into arrays but use the provided memory map as the To sum up these estimators fail on common tests as they have side effects on input
Transformers :
SkewedChi2Sampler fails on assertion, for a non trivial reason. I am filling an issue which should be tagged as Easy. |
sklearn/utils/validation.py
Outdated
@@ -399,7 +401,8 @@ def check_array(array, accept_sparse=None, dtype="numeric", order=None, | |||
# To ensure that array flags are maintained | |||
array = np.array(array, dtype=dtype, order=order, copy=copy) | |||
|
|||
# make sure we acually converted to numeric: | |||
array = np.asarray(array, dtype=dtype, order=order) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not trivial, I think it might benefits from a bit of explanation.
sklearn/utils/validation.py
Outdated
@@ -429,6 +432,10 @@ def check_array(array, accept_sparse=None, dtype="numeric", order=None, | |||
msg = ("Data with input dtype %s was converted to %s%s." | |||
% (dtype_orig, array.dtype, context)) | |||
warnings.warn(msg, DataConversionWarning_) | |||
|
|||
if copy and array is array_orig: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure but if array is not array_orig but array.base is array_orig and copy = true, shouldn't we create a real copy ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good remark. I think this should be:
if copy and np.may_share_memory(array, array_orig):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I think that might fix the transformer part for the #5481
Should this be labelled "need contributor", @arthurmensch? |
What needs to be done here? |
Do we consider this to be blocked by #5481? |
Could we wrap this up by bypassing the test on estimators where we expect it to fail, and make it an issue to fix each? @arthurmensch, are you willing to finish it off, or should we find someone else? |
FYI I have a (for now) WIP attempt of reviving this PR at #10663. |
Following PR #4775, I added checks in estimator checks in order to verify
Estimator
behavior on read only mem-mapped data.A few issues there :
sklearn/linear_model/cd_fast.pyx
still raise errors on some use cases. This is related to the fact that we cannot make Lasso fails using simple read-only memmap as input.