AnnData split-apply-combine #556

Closed
ivirshup opened this issue Apr 19, 2021 · 2 comments · Fixed by scverse/scanpy#2590

@ivirshup
Member

anndata should get a split-apply-combine framework. This is something we’ve wanted for a while; this issue establishes a concrete place to track it.

Questions

API

There are a number of split-apply-combine frameworks and APIs out there. Which do we want to emulate, and what are the most important features?

  • Access to intermediate elements (e.g. for k, sub_adata in adata.groupby(obs="celltype"))
  • Efficient reductions and combination
  • Nested groupby, group over multiple dimensions
  • Lagged/windowed grouping (e.g. group by neighbors of a cell, moving average over time)
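To make the first bullet concrete, here is a minimal sketch of iterating over groups. It is not an existing anndata API: `iter_groups` is a hypothetical helper, and a plain NumPy array plus a pandas DataFrame stand in for `adata.X` and `adata.obs`. With a real AnnData object, the same integer indices could presumably be used to slice the object itself.

```python
import numpy as np
import pandas as pd


def iter_groups(X, obs, key):
    """Yield (label, row_indices, X_subset) for each group in obs[key].

    Hypothetical helper for illustration only; not part of anndata.
    """
    # DataFrameGroupBy.indices maps each group label to integer row positions
    for label, idx in obs.groupby(key, observed=True).indices.items():
        yield label, idx, X[idx]


# Toy stand-ins for adata.X (6 cells x 2 genes) and adata.obs
X = np.arange(12, dtype=float).reshape(6, 2)
obs = pd.DataFrame({"celltype": ["B", "T", "B", "T", "NK", "B"]})

# "Apply" and "combine": per-group mean expression, collected in a dict
group_means = {
    label: sub.mean(axis=0) for label, _, sub in iter_groups(X, obs, "celltype")
}
```

Yielding the row indices alongside the subset keeps the intermediate elements accessible, so a caller can slice other aligned arrays with the same grouping.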

What is the correct default return type of a reduction? Can we be flexible about this?

  • AnnData would be nice for consistency, but do we want to keep all that metadata?
    • Also, how would we handle reductions over non-X data? E.g. mean point in PCA space for a cluster
  • DataFrames would let us keep labels, but should they be used as "labelled arrays"?
  • xarray.DataArrays make sense for just labelled arrays
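As one illustration of the DataFrame option, a reduction over non-X data could look roughly like the sketch below. The arrays are toy stand-ins (assumed here to play the role of a PCA embedding and a cluster annotation); nothing in it is an anndata API.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for a PCA embedding (4 cells x 2 components) and cluster labels
pca = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]])
clusters = pd.Series(["a", "b", "a", "b"], name="cluster")

# Mean point in PCA space per cluster; the result keeps group labels as its index
centroids = pd.DataFrame(pca, columns=["PC1", "PC2"]).groupby(clusters).mean()
```

The result is a small labelled table (clusters by components), which is convenient but drops every other piece of metadata an AnnData result would carry.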

Reductions and sparse data

How can we make reductions efficient for sparse matrices? One of the most annoying problems when writing code for scanpy and anndata is getting operations to work on both dense and sparse arrays, since sparse arrays have a different API. If reductions and combinations are included here, we'll have to make sure they work on sparse arrays, which could mean a lot of extra code. My personal preference is to address this by getting better array-API support for sparse data in upstream libraries like pydata/sparse.
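One common way to sidestep the dense/sparse API split for simple reductions is to express the grouped sum as a matrix product with a sparse group-indicator matrix. The sketch below assumes SciPy sparse matrices and labels without missing values; `group_mean` is a hypothetical name, not a function from anndata or scanpy.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def group_mean(X, labels):
    """Per-group column means for dense or scipy.sparse X (hypothetical helper).

    Assumes `labels` has no missing values.
    """
    cat = pd.Categorical(labels)
    n_obs = len(cat)
    # Indicator matrix of shape (n_groups, n_obs): entry (g, i) is 1 if cell i is in group g
    indicator = sparse.csr_matrix(
        (np.ones(n_obs), (cat.codes, np.arange(n_obs))),
        shape=(len(cat.categories), n_obs),
    )
    sums = indicator @ X  # grouped sums; stays sparse if X is sparse
    if sparse.issparse(sums):
        sums = sums.toarray()  # the result is only n_groups x n_vars
    counts = np.asarray(indicator.sum(axis=1)).reshape(-1, 1)
    return np.asarray(sums) / counts, list(cat.categories)


dense = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 4.0]])
labels = ["a", "b", "a", "b"]

means_dense, groups = group_mean(dense, labels)
means_sparse, _ = group_mean(sparse.csr_matrix(dense), labels)
```

This covers sums and means with a single code path, but reductions that are not linear (median, for example) would still need separate dense and sparse handling.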

Current and previous attempts/discussion

Ping @jbloom22, who has also done some work on this.

Other implementations

@grst
Contributor

grst commented May 2, 2021

Another (IMO interesting) implementation: siuba's group_by: https://siuba.readthedocs.io/en/latest/intro.html#Basic-use.
Behavior for other data types can be added via singledispatch.
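For readers unfamiliar with the mechanism, the sketch below shows how `functools.singledispatch` lets one function name cover several array types. It illustrates the standard-library feature only and is not siuba's API; `column_means` is a hypothetical function.

```python
from functools import singledispatch

import numpy as np
from scipy import sparse


@singledispatch
def column_means(X):
    """Fallback for types with no registered implementation."""
    raise NotImplementedError(f"no column_means for {type(X).__name__}")


@column_means.register(np.ndarray)
def _(X):
    return X.mean(axis=0)


@column_means.register(sparse.spmatrix)
def _(X):
    # Sparse matrices return a 2-D np.matrix from .mean; flatten it to 1-D
    return np.asarray(X.mean(axis=0)).ravel()


dense = np.array([[1.0, 2.0], [3.0, 4.0]])
dense_result = column_means(dense)
sparse_result = column_means(sparse.csr_matrix(dense))
```

New container types can be supported later by registering another implementation, without touching the existing ones.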

@jbloom22

jbloom22 commented May 21, 2021

We’ve found this approach to be a useful starting point at Cellarity and hope it might also benefit the wider community. There are lots of ways to generalize it; happy to discuss!
#564


5 participants