Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change default copy value from True to None #13986

Open
rth opened this issue May 29, 2019 · 14 comments
Open

Change default copy value from True to None #13986

rth opened this issue May 29, 2019 · 14 comments
Labels
Needs Decision Requires decision

Comments

@rth
Copy link
Member

rth commented May 29, 2019

A fair amount of estimators currently have copy=True (or copy_X=True) by default. In practice, this means that the code looks something like,

X = check_array(X, copy=copy)

and then some other calculations that may change or not X inplace. In the case when the following operations are not done inplace, we have just made a wasteful copy with no good reason.

As discussed in #13923, an example is for instance Ridge(fit_intercept=False) that will copy X, although it is not needed. Actually, I can't find any inplace operations of X in Ridge even with fit_intercept=True, but maybe I am missing something. (found it)

I think in general it would be better to avoid the,

X = check_array(X, copy=copy)

pattern, and instead make a copy explicitly where it is needed. Maybe it could be OK to not make a copy with copy=True if no copy is needed. Alternatively we could introduce copy=None by default.

Adding a common test that checks that Estimator(copy=True).fit(X, y) doesn't change X.

@rth rth changed the title Avoid copy=True by default Avoid copy=True by default in check_array May 29, 2019
@amueller
Copy link
Member

Actually we have test with read-only X which probably will enforce that we don't change X by default.

@rth
Copy link
Member Author

rth commented May 30, 2019

Actually we have test with read-only X which probably will enforce that we don't change X by default.

Very good point. I made #13987 to address this issue in preprocessing (e.g. StandardScaler).

For future reference, to find estimators that potentially have this issue, one can use a common test that checks that an exception is raised when one tries to use an estimator with copy=False on read-only array. If it is not raised, it is likely that with copy=True a copy is not actually necessary (though there are false positives).

@ignore_warnings(category=(DeprecationWarning, FutureWarning))  
def check_transformer_extra_copy(name, transformer):    
    X, y = make_blobs(n_samples=30, centers=[[0, 0, 0], [1, 1, 1]],  
                      random_state=0, n_features=2, cluster_std=0.1)  
    X = StandardScaler().fit_transform(X)  
    X -= X.min()  
  
    X, y = create_memmap_backed_data([X, y])  
  
    estimator = clone(transformer)
    sig = signature(transformer.__class__.__init__)
    if "copy" in sig.parameters:
    	estimator.set_params(copy=False)  
  
    	assert_raise_message(ValueError, "is read-only", estimator.fit, X, y)

It's not reliable enough to add it to common tests, but as a detection method, it works reasonably well. The same could be done for classifiers etc.

Also, it should be noted, that for more complex estimators with numerous options it is sometimes hard to decide whether a copy is needed (e.g. Birch.fit). In that case, it's probably better to keep the copy to be safe, particularly when the performance gained by avoiding the copy is negligible with respect to the fit or transform time.

@amueller
Copy link
Member

awesome. though that will "only" detect spurious copies if there's a copy parameter (and that copy parameter is what causes the spurious copy).
I'm sure there's many other spurious copies that don't fulfill these requirements.

@adrinjalali
Copy link
Member

I think I'd worry about not copying even if the user explicitly says so. Imagine you have a live system which is updating some matrix on the go, and every now and then does some training using a snapshot of the data. In that case, the user would probably want to have the model to use a copy of the data instead of the data which is changing in another thread as the model trains.

That said, I'd say the user should probably not do what I just described, which means it's pretty okay to avoid those copies if not needed, and probably raising an error if copy=True and it's not needed, instead of silently not creating a copy?

@rth rth changed the title Avoid copy=True by default in check_array Change default copy value from True to None Jun 14, 2019
@rth
Copy link
Member Author

rth commented Jun 14, 2019

I think I'd worry about not copying even if the user explicitly says so. Imagine you have a live system which is updating some matrix on the go, and every now and then does some training using a snapshot of the data. In that case, the user would probably want to have the model to use a copy of the data instead of the data which is changing in another thread as the model trains.

My point is that instead of using copy=True and making a copy when it's not needed, the default should be copy=None so that a copy is only made when the data is changed inplace. From the user perspective, this shouldn't change anything (if they didn't rely on scikit-learn to make that copy for his user code). The problem is that ideally, this would involve changing the default of the copy parameters in estimators.

For instance take StandardScaler() the copy parameters (default: True) says,

If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

While actually, currently after #13987 even with copy=True no copy will be made (if this doesn't change the data), and that's actually not a good behavior . That should be the behavior for copy=None.

So the questions is do we,

  • change the default copy=True to copy=None without a deprecation warning, considering this a bugfix since copying the data for the user is outside of the responsibility of estimators IMO, the only thing estimator should do is either pass the data through unmodified (default) or modify it inplace if copy=False.
  • the default copy=True to copy=None with a deprecation warning. This will pretty be annoying for the user as most estimator will then raise a warning.

Neither is ideal, but at the same time I feel this is something that we need to fix, as when you have a pipeline with N estimators, the overhead of copying the data N time even when it's not necessary can become non-negligible.

@adrinjalali
Copy link
Member

The whole copy semantics is kinda not clean at the moment IMO, and I like your proposal. So it'd be:

  • None: only copy if there is going to be a change in the data, avoid inplace modificatoins.
  • True: always create a copy before doing anything, no matter what.
  • False: do not copy and do modifications in place. With the side note that we still create copies because some operations create a copy, or some modules require the data in a specific dtype (like float64), and therefore we still create a copy, even though we're trying to fix them?

@rth
Copy link
Member Author

rth commented Jun 14, 2019

Agreed. Any ideas how we should change the default value?

True: always create a copy before doing anything, no matter what.

To make matters more complicated, the above should be true for transformers, but say estimator.fit doesn't need to copy the data (if no inplace modifications happen) even with copy=True.

@adrinjalali
Copy link
Member

To make matters more complicated, the above should be true for transformers, but say estimator.fit doesn't need to copy the data (if no inplace modifications happen) even with copy=True.

But that's why we want to introduce None, isn't it? Why would we give the same behavior to True if we have None for it?

@rth
Copy link
Member Author

rth commented Jun 14, 2019

But that's why we want to introduce None, isn't it? Why would we give the same behavior to True if we have None for it?

We need copy=None to avoid needless copies in estimator.transform.

For estimator.fit(X), the only case when the user may want to copy the data is if X is stored as a private attribute of the estimator (e.g. brute force nearest-neighbors). For other cases, if no inplace operations happen, it doesn't really make sense to guarantee that a copy will be made internally IMO.

@jnothman
Copy link
Member

I think I had just assumed that copy=True meant copy=None. I'd prefer copy='on-write' or something...

@rth
Copy link
Member Author

rth commented Jun 18, 2019

I think I had just assumed that copy=True meant copy=None.

That is another possibility to avoid a deprecation cycle, but then we need to clearly state it in the docstring.

@thomasjpfan
Copy link
Member

I'd prefer copy='on-write' or something...

I am +1 on this idea. People with a slight CS background can quickly understand "copy-on-write" over None.

@adrinjalali
Copy link
Member

@jeremiedbb we probably should fix this one first, and then we would know what to show the user in case of failure in #14481

@amueller
Copy link
Member

amueller commented Jul 29, 2019

I hadn't seen this discussion before, my 2c are in #14481 (comment)

I don't think there's three options that we care about. I think the two options that we care about on estimator level is "allow in-place operations" and "don't allow in-place operations".

I think maybe part of the confusion is that the copy keyword of check_array means something different from the copy parameter of an estimator.

The copy parameter of check_array says whether to force a copy, the copy parameter of an estimator says whether to allow inplace operations in X.

We could deprecate the copy in all estimators and rename it inplace to make the meaning more clear? I think for check_array the copy keyword makes sense: it's whether to make a copy or not.

X = check_array(X, copy=self.copy)

on the other hand makes little sense really.

When you say you want to add copy='on-write', is that for check_array or for estimators? Or for both?
Right now, I think for estimators, the (only sensible) meaning of True is 'on-write'.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Decision Requires decision
Projects
None yet
Development

No branches or pull requests

5 participants