AnnData split-apply-combine #556

ivirshup · 2021-04-19T07:10:23Z

anndata should get a split-apply-combine framework. This is something we've wanted for a while; this issue just establishes a concrete place to track it.

Questions

API

There are a number of split-apply-combine frameworks/APIs out there. Which do we want to emulate, and what are the most important features?

Access to intermediate elements (e.g. for k, sub_adata in adata.groupby(obs="celltype"))
Efficient reductions and combination
Nested groupby, group over multiple dimensions
Lagged/windowed grouping (e.g. groupby neighbors of a cell, moving average over time)
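
To make the first item concrete, here is a minimal sketch of what access to intermediate elements could look like. It is not an existing anndata API: `iter_groups` is a hypothetical helper, and a plain matrix plus an `obs`-like DataFrame stand in for an AnnData object. It relies only on pandas' `GroupBy.indices`, which maps each group key to the integer positions of its rows.

```python
import numpy as np
import pandas as pd


def iter_groups(X, obs, by):
    """Yield (key, sub-matrix) pairs, one per group in obs[by].

    Hypothetical helper for illustration only. X can be a NumPy array or a
    scipy.sparse CSR matrix, since both accept integer-array row indexing.
    """
    # GroupBy.indices maps each group key to the positional row indices
    for key, idx in obs.groupby(by, observed=True).indices.items():
        yield key, X[idx]


obs = pd.DataFrame({"celltype": ["B", "T", "B"]})
X = np.arange(6).reshape(3, 2)

for key, sub_X in iter_groups(X, obs, "celltype"):
    print(key, sub_X.shape)
```

A real implementation would presumably yield AnnData views so that `obs`, `var`, and the other aligned elements travel with each group.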

What is the correct default return type of a reduction? Can we be flexible about this?

AnnData would be nice for consistency, but do we want to keep all that metadata?
- Also, how would we handle reductions over non-X data? E.g. mean point in PCA space for a cluster
DataFrames would let us keep labels, but should they be used as "labelled arrays"?
xarray.DataArrays make sense for just labelled arrays
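
As one illustration of the DataFrame option for non-X data, the sketch below computes the mean point in PCA space per cluster and returns it as a labelled DataFrame. The function name `group_centroids` and the `PC1`, `PC2`, ... column labels are assumptions made for the example, not part of any existing API.

```python
import numpy as np
import pandas as pd


def group_centroids(embedding, labels):
    """Mean embedding coordinate per group, returned as a labelled DataFrame.

    Hypothetical helper: `embedding` is an (n_obs, n_dims) array such as a
    PCA embedding, and `labels` holds one group label per observation.
    """
    columns = [f"PC{i + 1}" for i in range(embedding.shape[1])]
    df = pd.DataFrame(embedding, columns=columns)
    # Group rows by the label array; the result is indexed by group label
    return df.groupby(np.asarray(labels)).mean()


pca = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 7.0]])
centroids = group_centroids(pca, ["c0", "c1", "c0"])
print(centroids)
```

The result keeps group labels on the rows and component labels on the columns, but drops everything else an AnnData would carry, which is the trade-off raised above.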

Reductions and sparse data

How can we make reductions efficient for sparse matrices? One of the most annoying problems when writing code for scanpy and anndata is getting operations to work on both dense and sparse arrays, since sparse arrays have a different API. If reductions and combinations are included here, we'll have to be sure sparse arrays work, which could mean a lot of extra code. My personal preference for addressing this is to get better support for sparse data with an array API in upstream libraries like pydata/sparse.
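
One common way to get sum and mean reductions that work for both dense and sparse input is to multiply by a sparse group-indicator matrix. The sketch below is an illustration under assumptions, not the design chosen for anndata: `grouped_reduce` is a hypothetical name, and missing labels are rejected rather than handled.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def grouped_reduce(X, labels, how="sum"):
    """Per-group sum or mean of the rows of X via an indicator matrix.

    Hypothetical helper. X may be a NumPy array or a scipy.sparse matrix;
    the result is dense for dense input and sparse for sparse input.
    """
    if how not in ("sum", "mean"):
        raise ValueError("how must be 'sum' or 'mean'")
    cat = pd.Categorical(labels)
    codes = np.asarray(cat.codes, dtype=np.intp)
    if (codes < 0).any():
        raise ValueError("missing labels are not handled in this sketch")
    n_obs, n_groups = X.shape[0], len(cat.categories)

    weights = np.ones(n_obs)
    if how == "mean":
        # Scale each observation by 1 / (size of its group)
        weights /= np.bincount(codes, minlength=n_groups)[codes]

    # indicator[g, i] is non-zero only when observation i belongs to group g
    indicator = sparse.csr_matrix(
        (weights, (codes, np.arange(n_obs))), shape=(n_groups, n_obs)
    )
    return cat.categories, indicator @ X


X = sparse.csr_matrix(np.array([[1, 0], [0, 2], [3, 0]]))
groups, sums = grouped_reduce(X, ["a", "b", "a"], how="sum")
_, means = grouped_reduce(X, ["a", "b", "a"], how="mean")
```

This keeps the sparse case sparse and avoids a Python-level loop over groups, but it only covers reductions that can be written as a matrix product. Reductions such as median or max would still need separate dense and sparse code paths.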

Current and previous attempts/discussion

Ping @jbloom22, who has also done some work on this.

Other implementations

xarray
- Modeled on pandas, but has multidimensional groupby
pandas
hail
- MatrixTable groupby
- Aggregations
SplitApplyCombine.jl


grst · 2021-05-02T08:22:10Z

Another (IMO interesting) implementation: siuba's group_by: https://siuba.readthedocs.io/en/latest/intro.html#Basic-use.
Behavior for other data types can be added via singledispatch.

jbloom22 · 2021-05-21T02:26:12Z

We’ve found this approach to be a useful starting point at Cellarity, hope it might also benefit the wider community. Lots of ways to generalize, happy to discuss!
#564

ivirshup mentioned this issue May 11, 2021

Common plotting library for the Scanpy ecosystem scverse/scanpy#1832

Open

github-actions bot added the stale label Jun 26, 2023

flying-sheep added the enhancement label Jun 26, 2023

scverse deleted a comment from github-actions bot Jun 26, 2023

flying-sheep added topic: api and removed stale labels Jun 26, 2023

flying-sheep assigned ilan-gold Aug 3, 2023

ilan-gold mentioned this issue Aug 4, 2023

(feat): Aggregation via group-by in sc.get scverse/scanpy#2590

Merged

12 tasks

ivirshup closed this as completed in scverse/scanpy#2590 Feb 20, 2024
