
[proof of concept/ wip] obsm, varm, layers refactor #140

Closed
wants to merge 43 commits


@ivirshup (Member) commented Apr 16, 2019

@flying-sheep based on our conversation in scverse/scanpy#562 (comment), I figured I'd give it a try. This branch is an attempt at replacing BoundRecArray with a MutableMapping subclass.

I think this is beneficial largely from a code simplicity point of view.

This is WIP since it's not fully featured, and would definitely require a reworking of tests if implemented. One definitely broken thing is IO, but I don't think this will be difficult to solve. Currently this seems to work with the core of scanpy's API (pretty sure all the commands from the Seurat-based tutorial for sc.datasets.pbmc3k work).

The only ones which don't are related to `Raw`. Next steps now include:

* `obsm`, `varm` as groups in the on-disk representation
* Make sure that sparse and dataframe values for `obsm`, `varm` work okay
* Removing dead code paths

@ivirshup (Member Author) commented Apr 16, 2019

This is pretty close to being a real implementation now. All non-Raw tests now pass for both anndata and scanpy.

Still to go:

  • Tests for sparse matrices, dataframes, etc. in .obsm, .varm
    • Xarray, zarr?
  • Investigate on disk representation
  • Think about compatibility
    • Since the on-disk representation will likely change, how should forward compatibility be handled?
  • Clear out dead code
  • Docs

@flying-sheep (Member) commented Apr 16, 2019

Good one!

I think we should figure out how to call the type “any matrix type we support”:

>>> import numpy, scipy.sparse, zarr, xarray
>>> {c: c.mro() for c in [numpy.ndarray, scipy.sparse.spmatrix, zarr.Array, xarray.DataArray]}
{numpy.ndarray: [numpy.ndarray, object],
 scipy.sparse.base.spmatrix: [scipy.sparse.base.spmatrix, object],
 zarr.core.Array: [zarr.core.Array, object],
 xarray.core.dataarray.DataArray:
 [xarray.core.dataarray.DataArray,
  xarray.core.common.AbstractArray,
  xarray.core.common.ImplementsArrayReduce,
  xarray.core.common.DataWithCoords,
  xarray.core.arithmetic.SupportsArithmetic,
  xarray.core.common.AttrAccessMixin,
  object]}

So they don’t share any ancestor apart from object.

>>> from functools import reduce
>>> reduce(set.intersection, [
...     set(dir(c))
...     for c in [numpy.ndarray, scipy.sparse.csr_matrix, zarr.Array, xarray.DataArray]
... ]) - set(dir(object))
{'__getitem__', '__len__', '__setitem__', 'astype', 'dtype', 'ndim', 'shape'}

>>> {
...     c: set(dir(c)) & {'toarray', '__array__'}
...     for c in [numpy.ndarray, scipy.sparse.csr_matrix, zarr.Array, xarray.DataArray]
... }
... 
{numpy.ndarray: {'__array__'},
 scipy.sparse.csr.csr_matrix: {'toarray'},
 zarr.core.Array: {'__array__'},
 xarray.core.dataarray.DataArray: {'__array__'}}

but they do all have shape, __getitem__, and either __array__ or toarray! (and a few more)

  • So we could accept anything that has those. We could create an ABC for it and use it wherever we accept one of these arrays (a minimal sketch follows below).
  • When slicing the whole thing fails, we need to emit a good error message telling people exactly which entry in which annotation failed to be sliceable.
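
For concreteness, here's a minimal sketch of what such an ABC could look like; the name `SupportedArray` and the exact attribute set are placeholders, not anything that exists in anndata:

```python
from abc import ABCMeta


class SupportedArray(metaclass=ABCMeta):
    """Anything with a shape that can be indexed and converted to a dense array."""

    @classmethod
    def __subclasshook__(cls, C):
        if cls is SupportedArray:
            has_basics = all(hasattr(C, attr) for attr in ("shape", "__getitem__"))
            has_conversion = hasattr(C, "__array__") or hasattr(C, "toarray")
            return has_basics and has_conversion
        return NotImplemented

# e.g. isinstance(numpy.zeros((3, 3)), SupportedArray)             -> True
# and  isinstance(scipy.sparse.csr_matrix((3, 3)), SupportedArray) -> True
```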

@ivirshup (Member Author)

I was thinking I'd go for more of a duck typing approach for now, with types that we explicitly support covered by tests. A large part of that is because I'd like to reduce the total amount of code in AnnData (especially base.py!), apart from tests. I'd be up for introducing that kind of abstract type once I see how it will reduce complexity/lines of code.

Question about Raw: do you think we might remove it in favor of layers? I'm trying to figure out how much effort I should put into making it work (i.e., workaround vs. refactor).

@flying-sheep (Member) commented Apr 16, 2019

ABCs are duck typing. You define an instance hook that can use whatever code you want to check if an object is an instance of your new ABC.

My idea is that we’re using a check like that already somewhere else, and not moving it to a central position (an is_supported_array function or an ABC) would make the spots diverge.

And the only difference between such a function and an ABC is that we can use the ABC in a type position.
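
To illustrate the “type position” point, building on the hypothetical `SupportedArray` sketch above (the function name here is made up):

```python
def set_aligned_value(mapping: dict, key: str, value: "SupportedArray") -> None:
    # The same check an is_supported_array() helper would do, but the ABC can
    # also be used in annotations and isinstance checks.
    if not isinstance(value, SupportedArray):
        raise TypeError(f"{type(value).__name__} is not a supported array type for {key!r}")
    mapping[key] = value
```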

@falexwolf (Member)

Looks really interesting: Raw has a different functional role (different number of variables, only row slicing) than layers (same dimensions as .X, row and column slicing).

Thoughts on treating connectivities etc. within that instead of in .uns? It's a big flaw of views on .uns that one needs a deepcopy. And this is just due to slicing of the n_obs x n_obs graph adjacency matrices.

PS: Brave undertaking for something that is mostly about simplifying the code. But I, of course, don't mind if you have the bandwidth. :)

@falexwolf (Member)

Further comments (hopefully not demotivating, just to make sure we're all on the same page):

  • You mean n>2 and not "any n>1 dimensional values", right? obsm already holds n>1 dimensional values like X_pca etc.
  • We made BoundRecArray so that it can be used like a dict. It has .keys, you can del .obsm[key], you can append, etc. So, I'd be surprised if usage became even simpler than it is right now.
  • There could be some advantage to having a dedicated slot for dense representations and another one for sparse ones. I'd need to think a little about different arguments.

@ivirshup (Member Author)

@flying-sheep Just to preface, the aphorism for duck-typing I'm familiar with is "if it looks like a duck, and acts like a duck, it's a duck." I googled to make sure I was remembering it right, but it turns out there's a bunch of ways of saying it.

I think creating an ABC would definitely cover "looks like a duck", but I'm not sure it covers "acts like a duck". It lets us know it has those methods, but not necessarily that they result in what we want. If the types were actually subclassed from a common ABC I'd be more comfortable having expectations about their behaviour, like how we expect .update and .items to work for a mutable mapping. For now I think it's easier to say both "it should have shape" and "when I index into it, it'll return something of the shape I expect" via runtime checks.
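
Roughly the kind of runtime check I mean (illustrative only, not actual code from this branch; names are placeholders):

```python
def check_aligned(key, value, n_obs):
    """Check that `value` looks and acts like an array aligned to the obs axis."""
    # "looks like a duck": it should have a shape of the right size
    if not hasattr(value, "shape") or value.shape[0] != n_obs:
        raise ValueError(f"Value for {key!r} must have {n_obs} rows.")
    # "acts like a duck": indexing into it should return something of the expected shape
    subset = value[:2]
    if subset.shape[0] != min(2, n_obs):
        raise ValueError(f"Indexing into {key!r} did not return the expected shape.")
```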

I also don't want to get overly ambitious with this. I definitely want this to support sparse arrays, ndarrays, and dataframes. I'll worry about the others once that's accomplished. Could you point me towards the other section you mentioned? I'd be interested in seeing that, as it could give me a better idea of the areas of application.

@ivirshup (Member Author)

@falexwolf Yeah, I think simplifying the code will eventually make it easier to build on top of it. I'd also like to understand how AnnDatas are initialized, and right now it's kinda hard to follow. This seems like a (relatively) simple step towards that. Also, simplifying now could reduce additional complexity later; for example, this could remove the need for a special attribute for sparse annotations.

I definitely think connectivities could fit in obsm. That said, adata.uns["neighbors"] has a nested structure which is useful for keeping those related values together. I'm not sure what the best way to handle this is.

On your specific points:

@ivirshup (Member Author) commented Apr 17, 2019

@falexwolf and @flying-sheep
More specifically, I've hit a snag with figuring out what to do with setting values on views of an AnnData. I'd like to get your opinions on what the spec for this is.

I've got a little code sample that could be illustrative. I think it could be useful for you to look at it and decide what you think should happen, then try and see if what does happen matches up.

import scanpy as sc
import numpy as np

pbmc = sc.datasets.pbmc68k_reduced()
v = pbmc[pbmc.obs["louvain"] == "1"]
pbmc.obsm["zeros"] = np.zeros_like(pbmc.obsm["X_pca"])
v._isview # True
v.obsm["zeros"] # What should this be?

frozen_obsm = v.obsm.copy()

v.obsm["ones"] = np.ones_like(v.obsm["X_pca"]) # Edited: originally put "ones" instead of "X_pca"
v._isview # Should this be a view?

v = pbmc[pbmc.obs["louvain"] == "1"]
v # now has "zeros" field
v.obsm = frozen_obsm # What should this do? Is v still a view? If so, what's in `obsm`?
v.obsm = v.obsm # What about this?

@falexwolf (Member)

I think simplifying the code will eventually make it easier to build on top of it.

Yes, you're absolutely right!

Everything else above also makes a ton of sense! I'm aware that anndata/base.py is way too long and was developed way too iteratively just to satisfy certain feature requirements. It can be much simplified and better structured.

setting values on views

You shouldn't be able to set any value on views, except for writing to the data matrix. This is what is crucially necessary in the backed case (where you see a portion of the huge data on disk in a view and want to be able to modify it) and adds a lot of convenience in the in-memory case (otherwise one wouldn't be able to write into the data matrix using indexing based on obs and var names...).

But all other write operations are meaningless (I think), and that's why DictView, ArrayView etc. exist. They listen to whether you are modifying them and auto-generate a copy if you do so. This is also inspired by the "auto-view-switch" in numpy arrays.

v.obsm["ones"] = np.ones_like(v.obsm["ones"])
v._isview # Should this be a view?

This should not be a view anymore (btw, there should be a public v.isview). This should also be in the tests; if it's a view, it's a bug. I think this should also answer your other questions.


I definitely think connectivities could fit in obsm. That said, adata.uns["neighbors"] has a nested structure which is useful for keeping those related values together. I'm not sure what the best way to handle this is.

Cool! There is a special case for n_obs x n_obs sparse matrices though, as they need to be both row- and column-sliced, which is evidently not the default for fields in .obsm. Don't worry about a nested structure, we don't need it. The idea is to continue to have each tool 'x' write a dict of annotations to .uns['x'], as long as these are things we would never want to slice. The only case where slicing matters is sc.pp.neighbors with the annotations distances and connectivities; params would stay in .uns['neighbors'], for instance.
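
To illustrate the difference in slicing (plain numpy/scipy, no anndata involved):

```python
import numpy as np
from scipy import sparse

n_obs = 5
connectivities = sparse.random(n_obs, n_obs, density=0.4, format="csr")
obs_idx = np.array([0, 2, 3])

obsm_style = connectivities[obs_idx, :]               # row slicing only -> shape (3, 5)
graph_style = connectivities[obs_idx, :][:, obs_idx]  # rows and columns -> shape (3, 3)
```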

@ivirshup (Member Author)

Setting values on views

backed case

I hadn't been thinking of this. I'll have to look into it for a bit, but I will mostly be talking about the in-memory case in this comment.

You shouldn't be able to set any value on views, except for writing to the data matrix.

It's good to have this formalized. The current implementation of the .obsm property looks like it has code to update the parent if it's a view, though it doesn't seem to do this in practice.

I do think there could be API benefits to allowing modification via views. Some examples:

# Run a pca just on expression data
sc.pp.pca(adata[:, adata.var["modality"] == "rna"])

# Differential expression, but only within one batch
sc.tl.rank_genes_groups(adata[adata.obs["batch"] == "1"], groupby="celltype")

Another case is raised by scverse/scanpy#612 and relates to the

"auto-view-switch" in numpy arrays

If the model is numpy arrays, should adata[subset, :].{some subsetable attribute} = {values} work? This could be nice for when you do want to modify some array inplace, but want the indexing semantics of an AnnData.

As a side note, I personally don't like the automatic switching in numpy arrays. I think it should be explicit whether an operation returns a view or a copy, but for numpy arrays the following code (without type checking) is ambiguous: b = a[idx].
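
Spelling out that ambiguity:

```python
import numpy as np

a = np.arange(6)

b = a[1:4]                   # basic slice -> view: writing to b modifies a
b[0] = 100
assert a[1] == 100

c = a[np.array([1, 2, 3])]   # fancy indexing -> copy: writing to c leaves a alone
c[0] = -1
assert a[1] == 100
```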

Subsetting connectivities

That's a good point. Subsetting a graph can be pretty different from subsetting a matrix, in the sense that you can subset a graph on edges or nodes.

@flying-sheep (Member) commented Apr 17, 2019

For now I think it's easier to say both "it should have shape", and "when I index into it, it'll return something of the shape I expect" via runtime checks.

The first thing is exactly what you can do in an instance check. The Python standard lib is all about duck typing, and they went a long way towards formalizing this using ABCs and all the __dunder__ methods. Please read this so we’re on one page.

I agree that the slicing check would be harder to do there than just to try it out inline.

Could you point me towards the other section you mentioned?

Sorry, I don’t follow. What section?

But all other write operations are meaningless (I think), and that's why DictView, ArrayView etc. exist. They listen to whether you are modifying them and auto-generate a copy if you do so.

FYI: that’s called a “copy on write” structure. A “view” is something that modifies the original when written to. I think our views were read-only originally, but it’s time to rename them (e.g. DictView → COWDict) to avoid confusion.
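
A toy illustration of the copy-on-write idea (this is not how our view classes are actually implemented; `parent` and `attr_name` are placeholders):

```python
class COWDict(dict):
    """Toy copy-on-write mapping: writing to it hands an independent copy to the
    object holding it, instead of mutating anything shared."""

    def __init__(self, data, parent, attr_name):
        super().__init__(data)
        self._parent = parent          # e.g. the object exposing this mapping
        self._attr_name = attr_name    # attribute on the parent to replace

    def __setitem__(self, key, value):
        copied = dict(self)                              # materialise a real copy
        copied[key] = value
        setattr(self._parent, self._attr_name, copied)   # the holder stops seeing a view
```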

@ivirshup (Member Author)

@flying-sheep by "this section" I meant whatever you were referring to by this:

we’re using a check like that already somewhere else

@ivirshup (Member Author)

Just removed BoundRecArrays entirely from the codebase. Probably still have some cleaning up to do, however. Next on my list is adding some tests to make sure views are working, and getting .raw to work. I also want to come up with a better name than DictM, since I picked that back when I wasn't sure I'd try to merge this. I'm thinking something like AlignedAnnot?

Questions

  • There was one case I wasn't clear on for views:
pbmc = sc.datasets.pbmc68k_reduced()
v = pbmc[pbmc.obs["louvain"] == "1"]
pbmc.obsm["zeros"] = np.zeros_like(pbmc.obsm["X_pca"])
v._isview # True
v.obsm["zeros"] #  What should this return? Currently is a KeyError
  • Do you have opinions on changing the on disk representation? I think making obsm and varm groups instead of datasets will be necessary for supporting more container types, particularly sparse arrays.
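
Rough sketch of what I mean by groups, storing a sparse obsm entry as its CSR components with h5py (the layout and attribute names here are just an illustration, not a proposed spec):

```python
import h5py
from scipy import sparse

mat = sparse.random(50, 10, density=0.1, format="csr")

with h5py.File("example.h5ad", "w") as f:
    grp = f.create_group("obsm/X_sparse_example")   # one group per obsm entry
    grp.attrs["encoding"] = "csr_matrix"
    grp.attrs["shape"] = mat.shape
    grp.create_dataset("data", data=mat.data)
    grp.create_dataset("indices", data=mat.indices)
    grp.create_dataset("indptr", data=mat.indptr)
```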

@ivirshup (Member Author) commented May 9, 2019

@flying-sheep, I've been thinking about the renaming, and I think I'd like to keep this PR to non-breaking changes. My goal here is just some backend centralization of code and adding a few features, which isn't yet worthy of a breaking version bump.

By splitting out that discussion we can also get more opinions on appropriate naming and behavior, which I think would be valuable. We could also bundle any API changes (like swapping out isview) with any other API changes we'd like to make.

Here's what my goals for this PR are:

  • Code reorganization, and consolidation of redundant code for mappings with aligned values
  • Increased consistency of the API, mostly that views have copy-on-write behavior
  • Work towards a more flexible AnnData by allowing more types of array-like annotations for obsm, varm, and layers.
    • I think fully implementing this is out of scope for this PR, since I'd like IO changes to be separate.

What I think is left to do:

  • Return _SetItemMixin derived classes from operations like adata.obsm["x"][idx] and test they work as expected
  • Add narrative docs specifying that views are copy on write
  • Remove demonstrative {obs,var}p attribute
  • Remove usage of as_dict from this codebase

Kinda related things that could go here

Issues this could close

@flying-sheep (Member)

Changing the type names is a non-breaking change, as they’re private types and aren’t documented. Anyone doing from anndata import ArrayView now does so at their own risk.

@ivirshup (Member Author) commented May 9, 2019

Oh, I thought you wanted to rename .isview as well? If not, that'd be fine.

Also, isn't ArrayView one we wouldn't want to rename? Based on Alex's comments above, don't views have a mutable .X?

Either way, it could happen in another PR. I'd also like to play around with making as much of AnnData as possible be backed, and having views on most elements be mutable. If that goes well, we might not want to change the name.

@flying-sheep (Member) commented May 9, 2019

Also, isn't ArrayView one we wouldn't want to rename? Based on Alex's comments above, don't views have a mutable .X?

I didn’t think about it, it was just an example 😅

I'd also like to play around with making as much of AnnData be backed, and having views on them all be mutable. If that goes well, we might not want to change the name.

You mean you want them to no longer be COW, but instead be real views?

@ivirshup (Member Author)

I wanna try. I’ll take a shot at explaining my rationale for this:

If your dataset is really big, it’s not just X that’s gonna use a lot of memory. So are any layers, obsm, varm, etc. If we can already deal with arrays being backed on disk, then it probably shouldn’t matter which array it is. I also think I’m just as likely (if not more) to modify any of the other arrays as I am X.

I also think it'd be convenient if scanpy would do the work of translating indices when I want to modify some data in the object. I think this is what was being tried in #148.

Having trouble describing it, but here's a code summary of what this does:

```python
v = adata[idx1]
v.obsm["k"][idx2] = 1
v.isview == False
```
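
In plain numpy terms, the index translation I mean would look like this (no anndata involved):

```python
import numpy as np

parent = np.zeros((6, 3))
idx1 = np.array([0, 2, 4])   # rows selected when the view was made
idx2 = np.array([1, 2])      # rows selected on the view

# Writing through the view at idx2 should land on the parent at idx1[idx2]
parent[idx1[idx2]] = 1
assert parent[[2, 4]].all() and not parent[[0, 1, 3, 5]].any()
```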

@ivirshup (Member Author)

Just getting back to this (had my confirmation, then got really sick right after), but I think it's getting close to done.

Were we going to go back to 3.5 compatibility for scanpy? If not, are we gonna keep 3.5 compatibility for anndata?

@falexwolf (Member)

I'm fine dropping 3.5 compat for AnnData.

@ivirshup (Member Author) commented Jun 3, 2019

Question about this PR. I'd like to merge it soon, but it will add features that aren't totally supported.

You can now put a dataframe or sparse matrix in obsm, varm, and layers, but those won't survive a round trip to disk. What's the best way to handle this until that's supported? I'm thinking setting an unsupported value should either be a warning or an error.
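
To be concrete, a sketch of the kind of check I mean (names and the exact type list are placeholders, not proposed API):

```python
import warnings

import pandas as pd
from scipy import sparse

TYPES_NOT_YET_WRITABLE = (pd.DataFrame, sparse.spmatrix)

def warn_if_not_writable(key, value):
    """Warn when a value works in memory but won't currently survive writing to disk."""
    if isinstance(value, TYPES_NOT_YET_WRITABLE):
        warnings.warn(
            f"Value for {key!r} ({type(value).__name__}) is supported in memory, "
            "but will not survive a round trip to disk yet.",
            UserWarning,
        )
```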

@falexwolf (Member) commented Jun 3, 2019

I'm thinking setting an unsupported value should either be a warning or an error.

That is a perfect solution! I'm fine if we go ahead with this. Is there any necessity to release 0.6.21 before? This should become 0.7, I'd say.

@ivirshup (Member Author) commented Jun 4, 2019

Since it's a big code change, I think it'd be nice to have it sit on master for a little bit before a release. Just a chance to catch any bugs I've missed.

I also ended up getting started on the IO parts, and have made better progress than I expected. It's possible these can come out together. I'll have to work on it for another day or two to know for sure.

@falexwolf (Member)

Great. Completely agree that we shouldn't move too quickly with the next release once this is on master. Should your other PR (obs_vector) go into a 0.6.21 release before, or is it not necessary?

@ivirshup (Member Author) commented Jun 5, 2019

It would be great to have a release of the obs_vector stuff first, so I can merge some long-standing PRs in scanpy.

I think this is pretty much ready to go, but it's going to have to be merged from another PR, since there's only so much time I'm willing to spend managing a git history.

@falexwolf (Member)

Good, I'll make a release as soon as Volker has answered some trivial questions around read_loom(..., cleanup...). I'll let you know. Should be later today.

Feel free to merge all of this stuff here in any way you like after 0.6.21 is made. 🙂
