[MRG+1] Read-only input data in common tests #4807

arthurmensch · 2015-06-03T08:02:48Z

Following PR #4775, I added checks to the estimator checks in order to verify estimator behavior on read-only memory-mapped data.

A few issues:

Registering _clean_temp_memory with atexit yields failures, as it is called at the end of every test and deletes _TEMP_MEMORY while the next test has already begun.
I overloaded make_blobs into _make_blobs in order to easily yield a read-only memmap X that contains only positive values. This looks quite messy though.
Cf. PR [MRG+1] Read-only data compatibility for Lasso #4775: this PR does not introduce tests that fail on current master, even though sklearn/linear_model/cd_fast.pyx still raises errors in some use cases. This is because we cannot make Lasso fail using a simple read-only memmap as input.
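As an illustration of the first two points, here is a minimal sketch of one way to hand a test a read-only memmap whose backing file is cleaned up when that test finishes, rather than through a process-wide atexit hook. The helper name readonly_memmap and the file layout are hypothetical and are not taken from the PR.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager

import numpy as np


@contextmanager
def readonly_memmap(data):
    """Yield a read-only np.memmap holding a copy of `data` (hypothetical helper)."""
    folder = tempfile.mkdtemp(prefix="readonly_memmap_")
    path = os.path.join(folder, "data.dat")
    try:
        # Write the data through a writable map, then reopen it read-only.
        writable = np.memmap(path, dtype=data.dtype, mode="w+", shape=data.shape)
        writable[:] = data
        writable.flush()
        del writable
        yield np.memmap(path, dtype=data.dtype, mode="r", shape=data.shape)
    finally:
        # Cleanup is scoped to this block, so it cannot race with the next test.
        shutil.rmtree(folder, ignore_errors=True)


rng = np.random.RandomState(0)
X = np.abs(rng.randn(20, 3))  # non-negative data, as some estimators require

with readonly_memmap(X) as X_mm:
    is_writeable = X_mm.flags.writeable
    try:
        X_mm[0, 0] = 1.0
        write_failed = False
    except ValueError:
        # NumPy refuses to assign into a read-only array.
        write_failed = True
    matches = np.array_equal(X_mm, X)
```

Scoping the temporary folder to a context manager is only one possible design; it trades a shared cache for isolation between tests.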

amueller · 2015-06-03T14:07:41Z

sklearn/utils/estimator_checks.py

        yield check_estimators_partial_fit_n_features


 def _yield_all_checks(name, Estimator):
-    #yield check_parameters_default_constructible, name, Estimator
+    # yield check_parameters_default_constructible, name, Estimator


I know that wasn't you, but why is this commented out? That seems... odd. Whoops.

Never mind, it is fine. Just remove the line please.

arthurmensch · 2015-07-01T09:54:09Z

Thanks to the reach of check_array in validation.py, Transformer, Estimator, Classifier and Regressor are now tested on read-only memmap input (I have checked it).

We observe no failure using the cd_fast.c file from master, even though with that file an error is raised in test_dict_learning.test_dict_learning_lassocd_readonly_data (see PR #4775).

amueller · 2015-07-01T20:25:23Z

examples/decomposition/plot_image_denoising.py

@@ -74,7 +73,7 @@

 print('Learning the dictionary...')
 t0 = time()
-dico = MiniBatchDictionaryLearning(n_components=100, alpha=1, n_iter=500)
+dico = MiniBatchDictionaryLearning(n_components=100, alpha=1, n_iter=500, batch_size=100, n_jobs=4)


Is that on purpose? Why?

Sorry for the noise

GaelVaroquaux · 2015-07-02T09:17:30Z

👍 for merge on my side. I just want you to add the comment on the copy aspect.

Thanks, this is super useful. It may seem to be a detail, but it actually is an important step for scalability.

GaelVaroquaux · 2015-07-02T09:18:12Z

Oops, other TODO: the renaming 'Y' to lowercase :).

arthurmensch · 2015-07-02T09:31:44Z

I rebased this PR so that it does not include changes from PR #4775. Therefore the Y case is now a problem for PR #4775 (sorry, I missed your comment on being pragmatic :) )

arthurmensch · 2015-07-03T09:20:28Z

I suspect a glitch in CI... (same problem as nilearn/nilearn#623)

GaelVaroquaux · 2015-07-03T11:39:08Z

Restarted CI => glitch fixed.

ogrisel · 2015-07-03T14:16:56Z

sklearn/utils/validation.py

+        if not copy:
+            array = np.asarray(array, dtype=dtype, order=order)
+        else:
+            array = np.array(array, dtype=dtype, order=order, copy=copy)


I wonder if instead we should not do:

1- always use np.asarray here.
2- set array_orig = array at the beginning of the function
3- then at the end of this function, just before returning:

if copy and array is array_orig: array = array.copy()
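The three steps above can be sketched as a small standalone function. This is an illustration of the suggestion only, not the actual check_array code; the function name as_array_maybe_copy is hypothetical.

```python
import numpy as np


def as_array_maybe_copy(array, dtype=None, order=None, copy=False):
    """Hypothetical reduction of the proposed logic to its three steps."""
    # Step 2: remember the object that was passed in.
    array_orig = array
    # Step 1: always convert with np.asarray, which avoids copying when it can.
    array = np.asarray(array, dtype=dtype, order=order)
    # Step 3: copy only if a copy was requested and the conversion made none.
    if copy and array is array_orig:
        array = array.copy()
    return array


x = np.arange(6, dtype=float)
same = as_array_maybe_copy(x)              # no copy requested: x itself comes back
fresh = as_array_maybe_copy(x, copy=True)  # explicit copy of an untouched input
```

Note that the identity test only catches the case where np.asarray returns the very same object; a converted view of the input is a different object and would not be copied.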

arthurmensch · 2015-07-06T07:45:51Z

The change in sklearn.utils.validation triggers errors in test_common, mostly due to in-place modification of the input in fit methods.

amueller · 2015-07-11T22:28:10Z

sklearn/utils/estimator_checks.py

-def check_classifiers_train(name, Classifier):
-    X_m, y_m = make_blobs(random_state=0)
+def check_classifiers_train(name, Classifier, readonly=False):
+    X_m, y_m = _make_blobs_with_mode(random_state=0, readonly=readonly)
    X_m, y_m = shuffle(X_m, y_m, random_state=7)


blobs are shuffled, right?

arthurmensch · 2015-10-19T12:51:17Z

I just resuscitated this PR; work remains to be done to fix numerous failures.

arthurmensch · 2015-10-19T13:10:00Z

assert_greater(accuracy_score(y, y_pred), 0.83) fails on AdaBoostClassifier. Any idea, @ogrisel?

arthurmensch · 2015-10-20T09:46:43Z

My bad. check_array replaced the memory map with an in-memory array, which is unwanted behavior that prevents estimators from running out-of-core. This is now fixed, with a regression test provided. I use np.asarray, which turns memory maps into arrays but keeps the provided memory map as the underlying memory. @waterponey, is it clearer?
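A small sketch of the NumPy behavior described here, using a throwaway file created only for the illustration: np.asarray on a read-only memmap gives a plain ndarray that still points at the mapped memory, whereas np.array with copy=True loads the data into a fresh in-memory array.

```python
import os
import tempfile

import numpy as np

# Throwaway backing file for the illustration (left in the temp directory).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "X.dat")
np.arange(12, dtype=np.float64).tofile(path)

X_mm = np.memmap(path, dtype=np.float64, mode="r", shape=(4, 3))

# np.asarray drops the memmap subclass but keeps the mapped buffer.
X_arr = np.asarray(X_mm)
is_plain_ndarray = type(X_arr) is np.ndarray
shares_mapping = np.may_share_memory(X_arr, X_mm)
still_readonly = not X_arr.flags.writeable

# np.array(..., copy=True) materializes the data in RAM instead.
X_copy = np.array(X_mm, copy=True)
copy_is_detached = not np.may_share_memory(X_copy, X_mm)
```

Because the converted array remains read-only, any estimator that writes into its input will still fail on it, which is what the common tests are meant to reveal.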

To sum up, these estimators fail the common tests because they have side effects on the input X; this should be trivial to fix:

the whole PLS family
Factor analysis
Incremental PCA
NuSVC

Transformers:

KernelCenterer
MaxAbsScaler
MinMaxScaler
RobustScaler
StandardScaler

SkewedChi2Sampler fails on an assertion, for a non-trivial reason.

I am filing an issue, which should be tagged as Easy.
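To illustrate the kind of side effect involved, here is a minimal sketch with two toy centering functions. Neither is scikit-learn code, and the names center_inplace and center_copy are hypothetical. The in-place version fails on read-only input, while the out-of-place version leaves the input untouched.

```python
import numpy as np


def center_inplace(X):
    """Toy transform that writes into its input (the problematic pattern)."""
    X -= X.mean(axis=0)
    return X


def center_copy(X):
    """Toy transform that allocates a new array for the result."""
    return X - X.mean(axis=0)


X_readonly = np.arange(12, dtype=float).reshape(4, 3)
X_readonly.setflags(write=False)  # stand-in for a read-only memmap

try:
    center_inplace(X_readonly)
    inplace_failed = False
except ValueError:
    # NumPy raises ValueError when the output array is read-only.
    inplace_failed = True

centered = center_copy(X_readonly)
```

In this toy case the fix is a one-line change from an augmented assignment to an expression that returns a new array, at the cost of one extra allocation.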

waterponey · 2015-10-20T09:56:33Z

sklearn/utils/validation.py

@@ -399,7 +401,8 @@ def check_array(array, accept_sparse=None, dtype="numeric", order=None,
            # To ensure that array flags are maintained
            array = np.array(array, dtype=dtype, order=order, copy=copy)

-        # make sure we acually converted to numeric:
+        array = np.asarray(array, dtype=dtype, order=order)


This is not trivial; I think it might benefit from a bit of explanation.

waterponey · 2015-10-20T16:40:42Z

sklearn/utils/validation.py

@@ -429,6 +432,10 @@ def check_array(array, accept_sparse=None, dtype="numeric", order=None,
        msg = ("Data with input dtype %s was converted to %s%s."
               % (dtype_orig, array.dtype, context))
        warnings.warn(msg, DataConversionWarning_)
+
+    if copy and array is array_orig:


I'm not sure, but if array is not array_orig while array.base is array_orig and copy=True, shouldn't we create a real copy?

Good remark. I think this should be:

if copy and np.may_share_memory(array, array_orig):

Also, I think that might fix the transformer part of #5481.
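The difference between the two conditions can be seen with a short sketch. TaggedArray is a hypothetical ndarray subclass standing in for np.memmap, and convert is an illustrative helper, not the check_array implementation.

```python
import numpy as np


class TaggedArray(np.ndarray):
    """Hypothetical ndarray subclass standing in for np.memmap."""


def convert(array, copy=False, use_share_check=True):
    """Illustrative helper comparing the two copy conditions."""
    array_orig = array
    # For a subclass instance, np.asarray returns a base-class view:
    # a new object that still points at the same buffer.
    array = np.asarray(array)
    if use_share_check:
        needs_copy = copy and np.may_share_memory(array, array_orig)
    else:
        needs_copy = copy and array is array_orig
    if needs_copy:
        array = array.copy()
    return array


original = np.arange(6, dtype=float).view(TaggedArray)

# Identity check: the view is a different object, so no copy is made.
by_identity = convert(original, copy=True, use_share_check=False)
# Memory check: the shared buffer is detected and a real copy is made.
by_sharing = convert(original, copy=True, use_share_check=True)
```

np.may_share_memory is a conservative bounds check, so it may occasionally trigger a copy that is not strictly needed, but it does not miss a view of the input like the one above.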

jnothman · 2017-06-18T15:11:04Z

Should this be labelled "need contributor", @arthurmensch?

raghavrv · 2017-07-16T16:50:36Z

What needs to be done here?

jnothman · 2018-02-04T09:13:14Z

Do we consider this to be blocked by #5481?

jnothman · 2018-02-11T23:50:22Z

Could we wrap this up by bypassing the test on estimators where we expect it to fail, and make it an issue to fix each? @arthurmensch, are you willing to finish it off, or should we find someone else?

lesteve · 2018-02-20T17:15:22Z

FYI I have a (for now) WIP attempt of reviving this PR at #10663.

arthurmensch referenced this pull request in arthurmensch/scikit-learn Jun 3, 2015

Added comment for PR discussion

976e28e

amueller reviewed Jun 3, 2015

arthurmensch changed the title ~~Read-only input data in common tests~~ [WIP] Read-only input data in common tests Jun 24, 2015

arthurmensch force-pushed the common_test_read_only_improvement branch 2 times, most recently from a6894e4 to b492b44 Compare July 1, 2015 09:36

arthurmensch changed the title ~~[WIP] Read-only input data in common tests~~ [MRG] Read-only input data in common tests Jul 1, 2015

amueller reviewed Jul 1, 2015

GaelVaroquaux changed the title ~~[MRG] Read-only input data in common tests~~ [MRG+1] Read-only input data in common tests Jul 2, 2015

arthurmensch force-pushed the common_test_read_only_improvement branch from 2e4776c to 31269f8 Compare July 2, 2015 09:27

ogrisel reviewed Jul 3, 2015

amueller reviewed Jul 11, 2015

arthurmensch force-pushed the common_test_read_only_improvement branch from e057735 to a664269 Compare October 19, 2015 13:45

Read only input checks in common tests

80ec85a

arthurmensch force-pushed the common_test_read_only_improvement branch from 610c6a0 to 80ec85a Compare October 20, 2015 09:01

waterponey reviewed Oct 20, 2015

arthurmensch added 2 commits October 20, 2015 13:25

Cleaning

75b093e

Cleaning

f72718c

arthurmensch mentioned this pull request Oct 20, 2015

Estimators should not try to modify X and y inplace in order to handle readonly memory maps #5481

Closed

waterponey reviewed Oct 20, 2015

amueller mentioned this pull request Oct 21, 2015

[WIP] Fix read only mmap tests #5507

Closed

arthurmensch added 3 commits October 21, 2015 16:11

FIX may share memory

ae088b0

FIX may share memory

bd869a6

test

0dfa874

amueller added the Waiting for Reviewer label Dec 10, 2015

arthurmensch mentioned this pull request Dec 11, 2015

ValueError: assignment destination is read-only, when paralleling with n_jobs > 1 #5956

Closed

jnothman added the Stalled label Jun 18, 2017

amueller added the Need Contributor label Jul 15, 2017

lesteve added help wanted and removed Need Contributor labels Oct 18, 2017

jnothman added the Blocker label Feb 4, 2018

lesteve mentioned this pull request Feb 20, 2018

[MRG+1] Read-only memmap input data in common tests #10663

Merged


glemaitre closed this in #10663 Apr 23, 2018

lesteve mentioned this pull request Jun 29, 2018

RFC On the relative harm of cosmetic changes #11336

Closed


