Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Read-only input data in common tests #4807

Conversation

arthurmensch
Copy link
Contributor

Following PR #4775, I added checks in estimator checks in order to verify Estimator behavior on read only mem-mapped data.

A few issues there :

  • registering _clean_temp_memory with atexit yield failure, as it called at the end of every test, and delete _TEMP_MEMORY whereas the next test has already begun.
  • I overloaded make_blobs into _make_blobs in order to be able to easily yield a read-only memmap X which is only positive. This looks quite messy though.
  • cf PR [MRG+1] Read-only data compatibility for Lasso #4775, this does not introduce tests that fails on current master, whereas sklearn/linear_model/cd_fast.pyx still raise errors on some use cases. This is related to the fact that we cannot make Lasso fails using simple read-only memmap as input.

arthurmensch referenced this pull request in arthurmensch/scikit-learn Jun 3, 2015
yield check_estimators_partial_fit_n_features


def _yield_all_checks(name, Estimator):
#yield check_parameters_default_constructible, name, Estimator
# yield check_parameters_default_constructible, name, Estimator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that wasn't you, but why is this commented? that seems... odd.. whoops.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

never mind, it is fine. just remove the line please.

@arthurmensch arthurmensch changed the title Read-only input data in common tests [WIP] Read-only input data in common tests Jun 24, 2015
@arthurmensch arthurmensch force-pushed the common_test_read_only_improvement branch 2 times, most recently from a6894e4 to b492b44 Compare July 1, 2015 09:36
@arthurmensch
Copy link
Contributor Author

Thanks to the range of check_array in validation.py, Transformer,Estimator,ClassifierandRegressor` are now tested on read only memmap input (I have checked it).

We observe no failure using files cd_fast.c from master, with which an error is raised in test_dict_learning.test_dict_learning_lassocd_readonly_data (see PR#4775)

@arthurmensch arthurmensch changed the title [WIP] Read-only input data in common tests [MRG] Read-only input data in common tests Jul 1, 2015
@@ -74,7 +73,7 @@

print('Learning the dictionary...')
t0 = time()
dico = MiniBatchDictionaryLearning(n_components=100, alpha=1, n_iter=500)
dico = MiniBatchDictionaryLearning(n_components=100, alpha=1, n_iter=500, batch_size=100, n_jobs=4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that on purpose? Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the noise

@GaelVaroquaux GaelVaroquaux changed the title [MRG] Read-only input data in common tests [MRG+1] Read-only input data in common tests Jul 2, 2015
@GaelVaroquaux
Copy link
Member

👍 for merge on my side. I just want you to add the comment on the copy aspect.

Thanks, this is super useful. It may seem to be a detail, but it actually is an important step for scalability.

@GaelVaroquaux
Copy link
Member

Oops, other TODO: the renaming 'Y' to lowercase :).

@arthurmensch arthurmensch force-pushed the common_test_read_only_improvement branch from 2e4776c to 31269f8 Compare July 2, 2015 09:27
@arthurmensch
Copy link
Contributor Author

I rebased this PR so that it does not include changes from PR #4775. Therefore the Y case is now a problem in PR #4775 (sorry i missed your comment on being pragmatic :) )

@arthurmensch
Copy link
Contributor Author

I suspect a glitch in CI... (same problem as nilearn/nilearn#623)

@GaelVaroquaux
Copy link
Member

Restarted CI => glitch fixed.

if not copy:
array = np.asarray(array, dtype=dtype, order=order)
else:
array = np.array(array, dtype=dtype, order=order, copy=copy)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if instead we should not do:

1- always use np.asarray here.
2- set array_orig = array at the beginning of the function
3- then at the end of this function, just before returning:

if copy and array is array_orig:
    array = array.copy()

@arthurmensch
Copy link
Contributor Author

Change in sklearn.utils.validation triggers errors in test_common, mostly due to in place modification of input in fit methods

def check_classifiers_train(name, Classifier):
X_m, y_m = make_blobs(random_state=0)
def check_classifiers_train(name, Classifier, readonly=False):
X_m, y_m = _make_blobs_with_mode(random_state=0, readonly=readonly)
X_m, y_m = shuffle(X_m, y_m, random_state=7)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blobs are shuffled, right?

@arthurmensch
Copy link
Contributor Author

I just resuscitated this PR, work needs to be done to fix numerous failures

@arthurmensch
Copy link
Contributor Author

assert_greater(accuracy_score(y, y_pred), 0.83) fails on AdaBoostClassifier, any idea @ogrisel ?

@arthurmensch arthurmensch force-pushed the common_test_read_only_improvement branch from e057735 to a664269 Compare October 19, 2015 13:45
@arthurmensch arthurmensch force-pushed the common_test_read_only_improvement branch from 610c6a0 to 80ec85a Compare October 20, 2015 09:01
@arthurmensch
Copy link
Contributor Author

My bad. check_array replaced memory map by array , which is an unwanted behavior that prevent estimator to run out-of-core. This is now fixed with regression test provided. I use np.asarray which transform memory map into arrays but use the provided memory map as the base memory representation. @waterponey is it clearer ?

To sum up these estimators fail on common tests as they have side effects on input X: this should be trivial to fix :

  • the whole PLS family
  • Factor analysis
  • Incremental PCA
  • NuSVC

Transformers :

  • KernelCenterer
  • MaxAbsScaler
  • MinAbsScaler
  • RobustScaler
  • StandardScaler

SkewedChi2Sampler fails on assertion, for a non trivial reason.

I am filling an issue which should be tagged as Easy.

@@ -399,7 +401,8 @@ def check_array(array, accept_sparse=None, dtype="numeric", order=None,
# To ensure that array flags are maintained
array = np.array(array, dtype=dtype, order=order, copy=copy)

# make sure we acually converted to numeric:
array = np.asarray(array, dtype=dtype, order=order)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not trivial, I think it might benefits from a bit of explanation.

@@ -429,6 +432,10 @@ def check_array(array, accept_sparse=None, dtype="numeric", order=None,
msg = ("Data with input dtype %s was converted to %s%s."
% (dtype_orig, array.dtype, context))
warnings.warn(msg, DataConversionWarning_)

if copy and array is array_orig:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure but if array is not array_orig but array.base is array_orig and copy = true, shouldn't we create a real copy ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good remark. I think this should be:

if copy and np.may_share_memory(array, array_orig):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I think that might fix the transformer part for the #5481

@jnothman
Copy link
Member

Should this be labelled "need contributor", @arthurmensch?

@raghavrv
Copy link
Member

What needs to be done here?

@jnothman
Copy link
Member

jnothman commented Feb 4, 2018

Do we consider this to be blocked by #5481?

@jnothman
Copy link
Member

Could we wrap this up by bypassing the test on estimators where we expect it to fail, and make it an issue to fix each? @arthurmensch, are you willing to finish it off, or should we find someone else?

@lesteve
Copy link
Member

lesteve commented Feb 20, 2018

FYI I have a (for now) WIP attempt of reviving this PR at #10663.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants