Change default copy value from True to None #13986

rth · 2019-05-29T20:46:57Z

A fair amount of estimators currently have copy=True (or copy_X=True) by default. In practice, this means that the code looks something like,

X = check_array(X, copy=copy)

and then some other calculations that may change or not X inplace. In the case when the following operations are not done inplace, we have just made a wasteful copy with no good reason.

As discussed in #13923, an example is for instance Ridge(fit_intercept=False) that will copy X, although it is not needed. ~~Actually, I can't find any inplace operations of X in Ridge even with fit_intercept=True, but maybe I am missing something.~~ (found it)

I think in general it would be better to avoid the,

X = check_array(X, copy=copy)

pattern, and instead make a copy explicitly where it is needed. Maybe it could be OK to not make a copy with copy=True if no copy is needed. Alternatively we could introduce copy=None by default.

Adding a common test that checks that Estimator(copy=True).fit(X, y) doesn't change X.

The text was updated successfully, but these errors were encountered:

amueller · 2019-05-29T21:01:39Z

Actually we have test with read-only X which probably will enforce that we don't change X by default.

rth · 2019-05-30T09:06:39Z

Actually we have test with read-only X which probably will enforce that we don't change X by default.

Very good point. I made #13987 to address this issue in preprocessing (e.g. StandardScaler).

For future reference, to find estimators that potentially have this issue, one can use a common test that checks that an exception is raised when one tries to use an estimator with copy=False on read-only array. If it is not raised, it is likely that with copy=True a copy is not actually necessary (though there are false positives).

@ignore_warnings(category=(DeprecationWarning, FutureWarning))  
def check_transformer_extra_copy(name, transformer):    
    X, y = make_blobs(n_samples=30, centers=[[0, 0, 0], [1, 1, 1]],  
                      random_state=0, n_features=2, cluster_std=0.1)  
    X = StandardScaler().fit_transform(X)  
    X -= X.min()  
  
    X, y = create_memmap_backed_data([X, y])  
  
    estimator = clone(transformer)
    sig = signature(transformer.__class__.__init__)
    if "copy" in sig.parameters:
    	estimator.set_params(copy=False)  
  
    	assert_raise_message(ValueError, "is read-only", estimator.fit, X, y)

It's not reliable enough to add it to common tests, but as a detection method, it works reasonably well. The same could be done for classifiers etc.

Also, it should be noted, that for more complex estimators with numerous options it is sometimes hard to decide whether a copy is needed (e.g. Birch.fit). In that case, it's probably better to keep the copy to be safe, particularly when the performance gained by avoiding the copy is negligible with respect to the fit or transform time.

amueller · 2019-05-30T16:58:11Z

awesome. though that will "only" detect spurious copies if there's a copy parameter (and that copy parameter is what causes the spurious copy).
I'm sure there's many other spurious copies that don't fulfill these requirements.

adrinjalali · 2019-06-03T13:04:41Z

I think I'd worry about not copying even if the user explicitly says so. Imagine you have a live system which is updating some matrix on the go, and every now and then does some training using a snapshot of the data. In that case, the user would probably want to have the model to use a copy of the data instead of the data which is changing in another thread as the model trains.

That said, I'd say the user should probably not do what I just described, which means it's pretty okay to avoid those copies if not needed, and probably raising an error if copy=True and it's not needed, instead of silently not creating a copy?

rth · 2019-06-14T08:32:54Z

I think I'd worry about not copying even if the user explicitly says so. Imagine you have a live system which is updating some matrix on the go, and every now and then does some training using a snapshot of the data. In that case, the user would probably want to have the model to use a copy of the data instead of the data which is changing in another thread as the model trains.

My point is that instead of using copy=True and making a copy when it's not needed, the default should be copy=None so that a copy is only made when the data is changed inplace. From the user perspective, this shouldn't change anything (if they didn't rely on scikit-learn to make that copy for his user code). The problem is that ideally, this would involve changing the default of the copy parameters in estimators.

For instance take StandardScaler() the copy parameters (default: True) says,

If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

While actually, currently after #13987 even with copy=True no copy will be made (if this doesn't change the data), and that's actually not a good behavior . That should be the behavior for copy=None.

So the questions is do we,

change the default copy=True to copy=None without a deprecation warning, considering this a bugfix since copying the data for the user is outside of the responsibility of estimators IMO, the only thing estimator should do is either pass the data through unmodified (default) or modify it inplace if copy=False.
the default copy=True to copy=None with a deprecation warning. This will pretty be annoying for the user as most estimator will then raise a warning.

Neither is ideal, but at the same time I feel this is something that we need to fix, as when you have a pipeline with N estimators, the overhead of copying the data N time even when it's not necessary can become non-negligible.

adrinjalali · 2019-06-14T12:59:02Z

The whole copy semantics is kinda not clean at the moment IMO, and I like your proposal. So it'd be:

None: only copy if there is going to be a change in the data, avoid inplace modificatoins.
True: always create a copy before doing anything, no matter what.
False: do not copy and do modifications in place. With the side note that we still create copies because some operations create a copy, or some modules require the data in a specific dtype (like float64), and therefore we still create a copy, even though we're trying to fix them?

rth · 2019-06-14T13:04:43Z

Agreed. Any ideas how we should change the default value?

True: always create a copy before doing anything, no matter what.

To make matters more complicated, the above should be true for transformers, but say estimator.fit doesn't need to copy the data (if no inplace modifications happen) even with copy=True.

adrinjalali · 2019-06-14T13:15:36Z

To make matters more complicated, the above should be true for transformers, but say estimator.fit doesn't need to copy the data (if no inplace modifications happen) even with copy=True.

But that's why we want to introduce None, isn't it? Why would we give the same behavior to True if we have None for it?

rth · 2019-06-14T13:26:01Z

But that's why we want to introduce None, isn't it? Why would we give the same behavior to True if we have None for it?

We need copy=None to avoid needless copies in estimator.transform.

For estimator.fit(X), the only case when the user may want to copy the data is if X is stored as a private attribute of the estimator (e.g. brute force nearest-neighbors). For other cases, if no inplace operations happen, it doesn't really make sense to guarantee that a copy will be made internally IMO.

jnothman · 2019-06-18T11:20:59Z

I think I had just assumed that copy=True meant copy=None. I'd prefer copy='on-write' or something...

rth · 2019-06-18T12:55:52Z

I think I had just assumed that copy=True meant copy=None.

That is another possibility to avoid a deprecation cycle, but then we need to clearly state it in the docstring.

thomasjpfan · 2019-06-19T14:40:54Z

I'd prefer copy='on-write' or something...

I am +1 on this idea. People with a slight CS background can quickly understand "copy-on-write" over None.

adrinjalali · 2019-07-29T09:25:33Z

@jeremiedbb we probably should fix this one first, and then we would know what to show the user in case of failure in #14481

amueller · 2019-07-29T20:25:21Z

I hadn't seen this discussion before, my 2c are in #14481 (comment)

I don't think there's three options that we care about. I think the two options that we care about on estimator level is "allow in-place operations" and "don't allow in-place operations".

I think maybe part of the confusion is that the copy keyword of check_array means something different from the copy parameter of an estimator.

The copy parameter of check_array says whether to force a copy, the copy parameter of an estimator says whether to allow inplace operations in X.

We could deprecate the copy in all estimators and rename it inplace to make the meaning more clear? I think for check_array the copy keyword makes sense: it's whether to make a copy or not.

X = check_array(X, copy=self.copy)

on the other hand makes little sense really.

When you say you want to add copy='on-write', is that for check_array or for estimators? Or for both?
Right now, I think for estimators, the (only sensible) meaning of True is 'on-write'.

rth mentioned this issue May 29, 2019

Linear regression unnecessarily slow #13923

Open

rth changed the title ~~Avoid copy=True by default~~ Avoid copy=True by default in check_array May 29, 2019

rth mentioned this issue May 30, 2019

[MRG] Avoid uncessary copies in sklearn.preprocessing #13987

Merged

rth mentioned this issue May 30, 2019

Avoid memory copies in linear_model when fit_intercept=False #13988

Open

rth changed the title ~~Avoid copy=True by default in check_array~~ Change default copy value from True to None Jun 14, 2019

rth mentioned this issue Jul 26, 2019

Read-only X + copy=False Fails for most estimators #14481

Open

thomasjpfan added the Needs Decision Requires decision label Oct 26, 2019

adrinjalali mentioned this issue Jul 9, 2020

lars_path should not overwrite X when copy_X=True and Gram='auto' #17789

Closed

jeremiedbb mentioned this issue Apr 12, 2024

RFC Trigger a copy when copy=False and X is read-only #28824

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change default copy value from True to None #13986

Change default copy value from True to None #13986

rth commented May 29, 2019 •

edited

amueller commented May 29, 2019

rth commented May 30, 2019 •

edited

amueller commented May 30, 2019

adrinjalali commented Jun 3, 2019

rth commented Jun 14, 2019 •

edited

adrinjalali commented Jun 14, 2019

rth commented Jun 14, 2019

adrinjalali commented Jun 14, 2019

rth commented Jun 14, 2019

jnothman commented Jun 18, 2019

rth commented Jun 18, 2019

thomasjpfan commented Jun 19, 2019

adrinjalali commented Jul 29, 2019

amueller commented Jul 29, 2019 •

edited

Change default copy value from True to None #13986

Change default copy value from True to None #13986

Comments

rth commented May 29, 2019 • edited

amueller commented May 29, 2019

rth commented May 30, 2019 • edited

amueller commented May 30, 2019

adrinjalali commented Jun 3, 2019

rth commented Jun 14, 2019 • edited

adrinjalali commented Jun 14, 2019

rth commented Jun 14, 2019

adrinjalali commented Jun 14, 2019

rth commented Jun 14, 2019

jnothman commented Jun 18, 2019

rth commented Jun 18, 2019

thomasjpfan commented Jun 19, 2019

adrinjalali commented Jul 29, 2019

amueller commented Jul 29, 2019 • edited

rth commented May 29, 2019 •

edited

rth commented May 30, 2019 •

edited

rth commented Jun 14, 2019 •

edited

amueller commented Jul 29, 2019 •

edited