[proof of concept/ wip] obsm, varm, layers refactor #140
Conversation
The only ones which don't are related to `Raw`. Next steps now include:

* `obsm`, `varm` as groups in the on-disk representation
* Make sure that sparse and dataframe values for `obsm`, `varm` work okay
* Removing dead code paths
This is pretty close to being a real implementation now. All non- Still to go:
|
Good one! I think we should figure out how to call the type “any matrix type we support”:

```python
>>> import numpy, scipy.sparse, zarr, xarray
>>> {c: c.mro() for c in [numpy.ndarray, scipy.sparse.spmatrix, zarr.Array, xarray.DataArray]}
{numpy.ndarray: [numpy.ndarray, object],
 scipy.sparse.base.spmatrix: [scipy.sparse.base.spmatrix, object],
 zarr.core.Array: [zarr.core.Array, object],
 xarray.core.dataarray.DataArray:
  [xarray.core.dataarray.DataArray,
   xarray.core.common.AbstractArray,
   xarray.core.common.ImplementsArrayReduce,
   xarray.core.common.DataWithCoords,
   xarray.core.arithmetic.SupportsArithmetic,
   xarray.core.common.AttrAccessMixin,
   object]}
```

So they don’t share any ancestor. But:

```python
>>> from functools import reduce
>>> reduce(set.intersection, [
...     set(dir(c))
...     for c in [numpy.ndarray, scipy.sparse.csr_matrix, zarr.Array, xarray.DataArray]
... ]) - set(dir(object))
{'__getitem__', '__len__', '__setitem__', 'astype', 'dtype', 'ndim', 'shape'}
>>> {
...     c: set(dir(c)) & {'toarray', '__array__'}
...     for c in [numpy.ndarray, scipy.sparse.csr_matrix, zarr.Array, xarray.DataArray]
... }
{numpy.ndarray: {'__array__'},
 scipy.sparse.csr.csr_matrix: {'toarray'},
 zarr.core.Array: {'__array__'},
 xarray.core.dataarray.DataArray: {'__array__'}}
```

but they do all have either `toarray` or `__array__`.
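Given that shared surface, a densifying helper only needs duck typing. A minimal sketch (the `to_ndarray` name is hypothetical, not anndata API):

```python
import numpy as np
from scipy import sparse

def to_ndarray(x):
    """Densify any of the supported matrix types.

    Sparse matrices expose `toarray`; numpy, zarr, and xarray
    objects all expose `__array__`, which `np.asarray` goes through.
    """
    if hasattr(x, "toarray"):  # scipy sparse
        return x.toarray()
    return np.asarray(x)

dense = to_ndarray(sparse.eye(3, format="csr"))
```

The same two-branch check could back an instance hook if we wanted a named type for it.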
|
I was thinking I'd go for more of a duck typing approach for now, with types that we explicitly support covered by tests. A large part of that is because I'd like to reduce the total amount of code in AnnData (especially

Question about |
ABCs are duck typing. You define an instance hook that can use whatever code you want to check if an object is an instance of your new ABC. My idea is that we’re already using a check like that somewhere else, without moving it to a central position. And the only difference between such a function and an ABC is that we can use the ABC in a type position. |
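A minimal sketch of what such a central ABC could look like (the `SupportedArray` name and its attribute list are hypothetical, not existing anndata code):

```python
from abc import ABCMeta
import numpy as np

class SupportedArray(metaclass=ABCMeta):
    """Hypothetical ABC: anything exposing the shared matrix interface
    counts as an instance, no subclassing or registration required."""

    _required = ("__getitem__", "__len__", "shape", "dtype", "ndim")

    @classmethod
    def __subclasshook__(cls, C):
        if cls is SupportedArray:
            # structural check: does the class expose the whole interface?
            return all(hasattr(C, attr) for attr in cls._required)
        return NotImplemented

# numpy arrays qualify structurally; strings have __getitem__/__len__
# but no shape/dtype/ndim, so they don't
assert isinstance(np.zeros(3), SupportedArray)
assert not isinstance("hello", SupportedArray)
```

Because it is an ABC, `SupportedArray` can also appear in a type position (annotations, `isinstance` checks in one central place).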
Looks really interesting: Thoughts on treating

PS: Brave undertaking for something that is mostly about simplifying the code. But I, of course, don't mind if you have the bandwidth. :) |
Further comments (hopefully not demotivating, just to make sure we're all on the same page):
|
@flying-sheep Just to preface, the aphorism for duck-typing I'm familiar with is "if it looks like a duck, and acts like a duck, it's a duck." I googled to make sure I was remembering it right, but it turns out there's a bunch of ways of saying it. I think creating an ABC would definitely cover "looks like a duck", but I'm not sure it covers "acts like a duck". It lets us know it has those methods, but not necessarily that they result in what we want. If the types were actually subclassed from a common ABC I'd be more comfortable having expectations about their behaviour, like how we expect

I also don't want to get overly ambitious with this. I definitely want this to support sparse arrays, ndarrays, and dataframes. I'll worry about the others once that's accomplished. Could you point me towards the other section you mentioned? I'd be interested in seeing that, as it could give me a better idea of the areas of application. |
@falexwolf Yeah, I think simplifying the code will eventually make it easier to build on top of it. I'd also like to understand how AnnDatas are initialized, and right now it's kinda hard to follow. This seems like a (relatively) simple step towards that. Also, simplifying now could reduce additional complexity later; for example, this could remove the need for a special attribute for sparse annotations. I definitely think

On your specific points:
|
@falexwolf and @flying-sheep I've got a little code sample that could be illustrative. I think it could be useful for you to look at it and decide what you think should happen, then try and see if what does happen matches up.

```python
import scanpy as sc
import numpy as np

pbmc = sc.datasets.pbmc68k_reduced()
v = pbmc[pbmc.obs["louvain"] == "1"]
pbmc.obsm["zeros"] = np.zeros_like(pbmc.obsm["X_pca"])
v._isview  # True
v.obsm["zeros"]  # What should this be?
frozen_obsm = v.obsm.copy()
v.obsm["ones"] = np.ones_like(v.obsm["X_pca"])  # Edited: originally put "ones" instead of "X_pca"
v._isview  # Should this be a view?
v = pbmc[pbmc.obs["louvain"] == "1"]
v  # now has "zeros" field
v.obsm = frozen_obsm  # What should this do? Is v still a view? If so, what's in `obsm`?
v.obsm = v.obsm  # What about this?
```
|
Everything else above also makes a ton of sense! I'm aware that
You shouldn't be able to set any value on views, except for writing to the data matrix. This is crucially necessary in the backed case (where you see a portion of the huge data on disk in a view and want to be able to modify it) and adds a lot of convenience in the in-memory case (otherwise one wouldn't be able to write into the data matrix using indexing based on obs and var names). But all other write operations are meaningless (I think), and that's why
This should not be a view anymore (btw, there should be a public
Cool! There is a special case about |
Setting values on views
I hadn't been thinking of this. I'll have to look into it for a bit, but will mostly be talking about the in-memory case in this comment.
It's good to have this formalized. The current implementation of the

I do think there could be API benefits to allowing modification via views. Some examples:

```python
# Run a pca just on expression data
sc.pp.pca(adata[:, adata.var["modality"] == "rna"])
# Differential expression, but only within one batch
sc.tl.rank_genes_groups(adata[adata.obs["batch"] == "1"], groupby="celltype")
```

Another case is raised by scverse/scanpy#612 and relates to
If the model is numpy arrays, should

As a side note, I personally don't like the automatic switching in numpy arrays. I think it should be explicit whether an operation returns a view or a copy, but for numpy arrays the following code (without type checking) is ambiguous:

Subsetting
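For concreteness, a small example of that ambiguity (plain numpy, nothing anndata-specific): whether `a[idx]` is a view or a copy depends entirely on the type of `idx`.

```python
import numpy as np

a = np.arange(10)

# Basic slicing returns a *view*: writes propagate back to `a`
v = a[2:5]
v[0] = 100
assert a[2] == 100

# Fancy (boolean) indexing returns a *copy*: `a` is untouched
c = a[a > 5]
c[0] = -1
assert -1 not in a
```

So `a[idx][0] = x` silently either mutates `a` or mutates a throwaway copy, and you can't tell from the expression alone.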
|
The first thing is exactly what you can do in an instance check. The python standard lib is all about duck typing, and they went a long way to formalize this using ABCs and all the

I agree that the slicing check would be harder to do there than just to try it out inline.
Sorry, I don’t follow. What section?
FYI: that’s called a “copy on write” structure. A “view” is something that modifies the original when written to. I think our views were read-only originally, but it’s time to rename them like |
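The distinction can be sketched in a few lines (hypothetical classes, not anndata's actual implementation): a true view writes through to the parent, while a copy-on-write structure detaches a private copy on the first write.

```python
import numpy as np

class COWSlice:
    """Copy-on-write sketch: reads delegate to the parent until the
    first write, which silently detaches a private copy."""

    def __init__(self, parent, idx):
        self._parent, self._idx, self._own = parent, idx, None

    def __getitem__(self, k):
        arr = self._own if self._own is not None else self._parent[self._idx]
        return arr[k]

    def __setitem__(self, k, value):
        if self._own is None:
            self._own = self._parent[self._idx].copy()  # detach on first write
        self._own[k] = value

data = np.arange(6)
cow = COWSlice(data, slice(0, 3))
cow[0] = 99
# the parent is untouched; with a real numpy view, data[0] would now be 99
```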
@flying-sheep by "this section" I meant whatever you were referring to by this:
|
Mostly regular instance checks for different things we support: |
Just removed

Questions

```python
pbmc = sc.datasets.pbmc68k_reduced()
v = pbmc[pbmc.obs["louvain"] == "1"]
pbmc.obsm["zeros"] = np.zeros_like(pbmc.obsm["X_pca"])
v._isview  # True
v.obsm["zeros"]  # What should this return? Currently is a KeyError
```
|
@flying-sheep, I've been thinking about the renaming, and I think I'd like to keep this PR to non-breaking changes. My goal here is just some backend centralization of code, and adding a few features, but not yet worthy of a breaking version bump. By splitting out that discussion we can also get more opinions on appropriate naming and behavior, which I think would be valuable. We could also bundle any api changes (like swapping out Here's what my goals for this PR are:
What I think is left to do:
Kinda related things that could go here
Issues this could close |
Changing the type names is a non-breaking change, as they’re private types and aren’t documented. Anyone doing |
Oh, I'd thought you'd wanted to rename

Also, isn't

Either way, it could happen in another PR. I'd also like to play around with making as much of AnnData as possible be backed, and having views on most elements be mutable. If that goes well, we might not want to change the name. |
I didn’t think about it, it was just an example 😅
You mean you want them to no longer be COW, but instead real views? |
I wanna try. I’ll give a shot at explaining my rationale for this: If your dataset is really big, it’s not just X that’s gonna use a lot of memory. So are any layers, obsm, varm, etc. If we can already deal with arrays being backed on disk, then it probably shouldn’t matter which array it is. I also think I’m just as likely (if not more) to modify any of the other arrays as I am X. I also think it'd be convenient if scanpy would do the work of translating indices when I want to modify some data in the object. I think this is what was being tried in #148. |
Having trouble describing it, but here's a code summary of what this does:

```python
v = adata[idx1]
v.obsm["k"][idx2] = 1
v.isview == False
```
Just getting back to this (had my confirmation, got real sick right after), but I think it's getting close to done. Were we going to go back to 3.5 compatibility for scanpy? If not, are we gonna keep 3.5 compatibility for anndata? |
I'm fine dropping 3.5 compat for AnnData. |
Question about this PR. I'd like to merge it soon, but it will add features that aren't totally supported. You can now put a dataframe or sparse matrix in obsm, varm, and layers, but those won't survive a roundtrip to disk. What's the best way to handle this until then? I'm thinking setting an unsupported value should either be a warning or an error. |
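One way to sketch the warning option, with hypothetical names (`SUPPORTED_ROUNDTRIP_TYPES` and `warn_if_not_writeable` are illustrative, not actual anndata API):

```python
import warnings
import numpy as np

# Hypothetical whitelist of value types known to survive a write/read roundtrip
SUPPORTED_ROUNDTRIP_TYPES = (np.ndarray,)

def warn_if_not_writeable(value, key):
    """Warn when a value set in obsm/varm/layers can't be written to disk yet."""
    if not isinstance(value, SUPPORTED_ROUNDTRIP_TYPES):
        warnings.warn(
            f"Value for {key!r} has type {type(value).__name__}, which will "
            "not survive a roundtrip to disk yet.",
            UserWarning,
        )
    return value
```

Swapping `warnings.warn` for a `raise` would give the stricter error behaviour instead.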
That is a perfect solution! I'm fine if we go ahead with this. Any necessity to release 0.6.21 before? This here should become 0.7, I'd say. |
Since it's a big code change, I think it'd be nice to have it sit on master for a little bit before a release. Just a chance to catch any bugs I've missed. I also ended up getting started on the io parts, and have made better progress than I expected. It's possible these can come out together. I'll have to work on it for another day or two to know for sure. |
Great. Completely agree that we shouldn't move too quickly with the next release once this is on master. Should your other PR (
It would be great to have a release of the

I think this is pretty much ready to go, but I think it's going to have to be merged from another PR since there's only so much time I'm willing to spend on managing a git history. |
Good, I'll make a release as soon as Volker has answered some trivial questions around

Feel free to merge all of this stuff here in any way you like after |
@flying-sheep based on our conversation in scverse/scanpy#562 (comment), I figured I'd give it a try. This branch is an attempt at replacing `BoundRecArray` with a `MutableMapping` subclass. I think this is beneficial largely from a code simplicity point of view. `recarray`s are poorly documented by numpy, and I personally find them more difficult than alternatives like `dict`.

This is WIP since it's not fully featured, and would definitely require a reworking of tests if implemented. One definitely broken thing is IO, but I don't think this will be difficult to solve. Currently this seems to work with the core of scanpy's api (pretty sure all the commands from the seurat based tutorial for `sc.datasets.pbmc3k` work).