Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AnnData raw is a reference when a copy is expected #1139

Closed
3 tasks done
gtca opened this issue Sep 12, 2023 · 8 comments
Closed
3 tasks done

AnnData raw is a reference when a copy is expected #1139

gtca opened this issue Sep 12, 2023 · 8 comments
Labels

Comments

@gtca
Copy link

gtca commented Sep 12, 2023

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

Code:

import numpy as np
from anndata import AnnData

adata = AnnData(np.arange(1000).reshape(-1, 100))
adata.raw = adata

id(adata.raw.X) == id(adata.X)  # => True

With anndata v0.8, for instance, the last line returns False.

This causes issues like this one: scverse/muon#126.

Versions

Version info
-----
anndata             0.10.0rc1
muon                0.1.5
numpy               1.24.4
session_info        1.0.0
-----
PIL                 10.0.0
colorama            0.4.6
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.2
exceptiongroup      1.1.3
h5py                3.9.0
joblib              1.3.2
kiwisolver          1.4.5
llvmlite            0.40.1
matplotlib          3.7.2
mpl_toolkits        NA
mudata              0.2.3
natsort             8.4.0
numba               0.57.1
packaging           23.1
pandas              2.1.0
patsy               0.5.3
pkg_resources       NA
pynndescent         0.5.10
pyparsing           3.0.9
pytz                2023.3.post1
scanpy              1.9.3
scipy               1.11.2
seaborn             0.12.2
six                 1.16.0
sklearn             1.3.0
statsmodels         0.14.0
threadpoolctl       3.2.0
tqdm                4.66.1
typing_extensions   NA
umap                0.5.3
yaml                6.0
zoneinfo            NA
-----
Python 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:25:38) [Clang 11.1.0 ]
macOS-11.7.4-x86_64-i386-64bit
-----
@ivirshup
Copy link
Member

ivirshup commented Oct 4, 2023

I'm not sure I would consider this a bug. Assigning to other attributes of an AnnData doesn't cause a copy, and this would be a pretty large copy to make.

You can instead do adata.raw = adata.copy() if you need data to be copied.

@ivirshup ivirshup added question and removed Bug 🐛 labels Oct 4, 2023
@gtca
Copy link
Author

gtca commented Oct 4, 2023

I guess it's mainly the (unintended?) behaviour change that is a bit of an issue for the users, e.g. as seen in the linked issue.

Scanpy tutorials like the PBMC 3k one feature the code adata.raw = adata, which creates a problem when it is not a copy.

@flying-sheep
Copy link
Member

Would be easy to fix:

@raw.setter
def _set_raw(self, raw: AnnData | Raw) -> None:
    if self is raw:
        raw = raw.copy()
    ...

I don’t things adata.raw is adata ever makes sense

@ivirshup
Copy link
Member

ivirshup commented Oct 6, 2023

I think this behavior is expected, and the old behavior was wrong. We generally try not to make too many defensive copies due to the large performance hit.

adata.raw is adata

I don't think that happens? The behavior being described is that adata.raw.X is adata.X can happen, which I think is expected. Similarly: AnnData(X).X is X, adata.X = X; adata.X is X, adata.obsm = d; adata.obsm["mtx"] is d["mtx"].

@ivirshup ivirshup closed this as not planned Won't fix, can't repro, duplicate, stale Oct 6, 2023
@gtca
Copy link
Author

gtca commented Oct 6, 2023

@ivirshup Nevertheless I think the issue for the user persists as adata.raw = adata is in many tutorials, even on scanpy.rtfd.io. And considering the in-place modifications in workflows this is quite an issue in the end.

Given this, I think this behaviour change should come with some note or a warning (not sure there was one?).

Should we now open a cascade of issues to fix the tutorials? That's another topic but shouldn't we aim towards deprecating raw altogether?

Just searching adata.raw = adata line on GitHub returns 1.1k files that seem to have been relying on the old behaviour.

@flying-sheep
Copy link
Member

flying-sheep commented Oct 6, 2023

The behavior being described is that adata.raw.X is adata.X can happen

When X is a mutable DataFrame, that’s a problem (one of the reasons I didn’t want to allow data frames as .X)

shouldn't we aim towards deprecating raw altogether?

I’d say yes

@ivirshup
Copy link
Member

ivirshup commented Oct 6, 2023

Shouldn't we aim towards deprecating raw altogether?

Yes.

Should we now open a cascade of issues to fix the tutorials?

Maybe a comment in the tutorial would be appropriate? I don't think we should be recommending that people make big copies when it's not necessary.

I think typically there is no unintended mutation here since people subset the anndata after assigning to raw, which makes adata.X is not adata.raw.X. Sure this can lead to unintentional mutation when adata.raw = adata and then you immediately mutate X, but then you're skipping the use case for raw (different shape).

I would like to replace the scanpy tutorial with the new tutorial, but the tutorial workflow got changed and needs fixing.

@ivirshup
Copy link
Member

ivirshup commented Oct 6, 2023

When X is a mutable DataFrame

@flying-sheep Can X be a dataframe? I'm pretty sure we aren't supporting that in X or layers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants