Change default copy value from True to None #13986
Comments
Actually, we have a test with read-only X, which will probably enforce that we don't change X by default.
Very good point. I made #13987 to address this issue in preprocessing (e.g.

For future reference, to find estimators that potentially have this issue, one can use a common test that checks that an exception is raised when one tries to use an estimator with

@ignore_warnings(category=(DeprecationWarning, FutureWarning))
def check_transformer_extra_copy(name, transformer):
    X, y = make_blobs(n_samples=30, centers=[[0, 0, 0], [1, 1, 1]],
                      random_state=0, n_features=2, cluster_std=0.1)
    X = StandardScaler().fit_transform(X)
    X -= X.min()
    X, y = create_memmap_backed_data([X, y])
    estimator = clone(transformer)
    sig = signature(transformer.__class__.__init__)
    if "copy" in sig.parameters:
        estimator.set_params(copy=False)
    assert_raise_message(ValueError, "is read-only", estimator.fit, X, y)

It's not reliable enough to add it to common tests, but as a detection method, it works reasonably well. The same could be done for classifiers etc. Also, it should be noted that for more complex estimators with numerous options it is sometimes hard to decide whether a copy is needed (e.g.
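For anyone who wants to try the read-only trick without the scikit-learn test helpers, here is a minimal NumPy-only sketch. ToyCenterer and writes_in_place are hypothetical names made up for illustration, and a plain array with its write flag cleared stands in for the memmap-backed data used in the check above.

```python
import numpy as np


class ToyCenterer:
    """Hypothetical transformer for illustration, not a scikit-learn class."""

    def __init__(self, copy=True):
        self.copy = copy

    def fit(self, X, y=None):
        # Copy up front only when requested; otherwise reuse the caller's buffer.
        if self.copy:
            X = np.array(X, dtype=float, copy=True)
        else:
            X = np.asarray(X, dtype=float)
        X -= X.mean(axis=0)  # in-place write: raises on a read-only buffer
        self.scale_ = X.std(axis=0)
        return self


def writes_in_place(estimator, X):
    """Return True if fitting with copy=False tries to write into X."""
    X_ro = np.array(X, dtype=float)
    X_ro.setflags(write=False)  # assumption: stands in for memmap-backed data
    estimator.copy = False
    try:
        estimator.fit(X_ro)
    except ValueError as exc:
        return "read-only" in str(exc)
    return False
```

An estimator for which this returns True really does need a copy on that code path; one for which it returns False is a candidate for skipping the copy, with the same reliability caveats as above.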
awesome. though that will "only" detect spurious copies if there's a
I think I'd worry about not copying even if the user explicitly says so. Imagine you have a live system which is updating some matrix on the go, and every now and then does some training using a snapshot of the data. In that case, the user would probably want the model to use a copy of the data instead of the data which is changing in another thread as the model trains. That said, I'd say the user should probably not do what I just described, which means it's pretty okay to avoid those copies if not needed, and probably raising an error if
My point is that instead of using For instance take
While actually, currently after #13987 even with So the question is, do we,
Neither is ideal, but at the same time I feel this is something that we need to fix: when you have a pipeline with N estimators, the overhead of copying the data N times even when it's not necessary can become non-negligible.
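To make the pipeline overhead concrete, here is a rough sketch that counts how many chained steps allocate a new buffer. eager_step, lazy_step and count_copies are hypothetical helpers, not scikit-learn code; both steps are pass-throughs that never write to X, so every copy counted is a wasted one.

```python
import numpy as np


def eager_step(X, copy=True):
    """Copy up front, even though nothing in this step writes to X."""
    X = np.array(X, copy=True) if copy else np.asarray(X)
    return X  # pass-through: the copy was never needed


def lazy_step(X):
    """Make no copy, because this step never writes to X."""
    return np.asarray(X)


def count_copies(step, X, n_steps):
    """Chain n_steps calls of `step` and count how many allocated a new buffer."""
    copies = 0
    for _ in range(n_steps):
        out = step(X)
        if not np.shares_memory(out, X):
            copies += 1
        X = out
    return copies
```

With a four-step chain, the eager variant copies at every step while the lazy one copies at none, which is the N-fold overhead described above.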
The whole
Agreed. Any ideas how we should change the default value?
To make matters more complicated, the above should be true for transformers, but say
But that's why we want to introduce
We need For
I think I had just assumed that copy=True meant copy=None. I'd prefer
That is another possibility to avoid a deprecation cycle, but then we need to clearly state it in the docstring.
I am +1 on this idea. People with a slight CS background can quickly understand "copy-on-write" over
@jeremiedbb we probably should fix this one first, and then we would know what to show the user in case of failure in #14481
I hadn't seen this discussion before, my 2c are in #14481 (comment) I don't think there are three options that we care about. I think the two options that we care about on estimator level are "allow in-place operations" and "don't allow in-place operations". I think maybe part of the confusion is that the The copy parameter of We could deprecate the X = check_array(X, copy=self.copy) on the other hand makes little sense really. When you say you want to add
A fair amount of estimators currently have copy=True (or copy_X=True) by default. In practice, this means that the code looks something like,

and then some other calculations that may or may not change X inplace. In the case when the following operations are not done inplace, we have just made a wasteful copy with no good reason.

As discussed in #13923, an example is for instance Ridge(fit_intercept=False) that will copy X, although it is not needed. Actually, I can't find any inplace operations of X in Ridge even with fit_intercept=True, but maybe I am missing something. (found it)

I think in general it would be better to avoid the,

pattern, and instead make a copy explicitly where it is needed. Maybe it could be OK to not make a copy with copy=True if no copy is needed. Alternatively we could introduce copy=None by default.

Adding a common test that checks that Estimator(copy=True).fit(X, y) doesn't change X.
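A rough sketch of what such a common test could look like, together with an estimator that copies only on the code path that writes. Both check_fit_does_not_modify_input and CopyWhereNeeded are hypothetical names for illustration; this is not the check that scikit-learn ships, just one way to express the idea with NumPy alone.

```python
import numpy as np


def check_fit_does_not_modify_input(estimator, X, y=None):
    """Fit on X and verify that the caller's array is left unchanged."""
    X_before = X.copy()
    estimator.fit(X, y)
    np.testing.assert_array_equal(X, X_before)


class CopyWhereNeeded:
    """Hypothetical estimator: copies only right before an in-place write."""

    def __init__(self, center=True, copy=True):
        self.center = center
        self.copy = copy

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)  # no copy for float input
        if self.center:
            if self.copy:
                X = X.copy()  # the only place a copy is actually required
            X -= X.mean(axis=0)
        self.norm_ = np.linalg.norm(X)
        return self
```

With center=False no copy is made at all, yet the check still passes with copy=True, which is the "don't copy if no copy is needed" behaviour proposed above. With center=True and copy=False the check fails, as it should.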