
Cellarity: GroupBy with sparse-matrix agg, scores #564

Closed
wants to merge 11 commits

Conversation

@jbloom22 commented May 21, 2021

Cellarity benefits from anndata / scanpy open-source development and is proud to contribute a class I wrote in July 2020 to speed up our Platform.

The GroupBy class supports grouping and aggregating AnnData observations by key, per variable, and computing scores between pairs of groups.

Features:

  • For groups, supports count, sum, mean, and variance. Observations may be weighted.
  • For pairs of groups, supports difference of means, ratio of means (fold-change), t-score, v-score, pooled t-score, and pooled v-score. While the t-score detects a non-zero difference of means, the v-score is a measure of the distance between means on a variance scale that remains stable as the number of observations in each group grows.
  • Each observation may belong to multiple groups (and to any of them multiple times) via a tuple-valued key and explode; see the sketch after this list.
  • The sparse-matrix aggregation approach works natively with ndarray and scipy sparse formats. Slower pandas-based implementations are also included; they require dense data but mask NaN values instead of propagating them.
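
A minimal sketch of the tuple-key idea (the labels below are made up, and pandas explode stands in for whatever the class does internally):

import pandas as pd

# Hypothetical tuple-valued key: each cell may carry several group labels
key = pd.Series([("B", "lymphoid"), ("T", "lymphoid"), ("B",)])

# explode() repeats each observation once per group it belongs to;
# the repeated index records the original observation
memberships = key.explode()
# 0           B
# 0    lymphoid
# 1           T
# 1    lymphoid
# 2           B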

Implementation/Performance:

  • No view or copy overhead in runtime or memory (even when filtering keys via key_set). If the data loads, you can aggregate it.
  • Sums over groups of observations are computed as A * X, where A is a coordinate-sparse matrix encoding group memberships and weights (exposed via sparse_aggregator); see the sketch after this list.
  • To compute scores, sufficient statistics are computed per group, from which scores are computed per pair. In practice, runtime is dominated by the aggregation, so it is effectively independent of the number of pairs.
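
A minimal sketch of the aggregation idea (variable names are illustrative, not this PR's API): group counts, sums, and means all fall out of one sparse product.

import numpy as np
import scipy.sparse as sp

# X: (n_obs × n_vars) data; groups: one integer label per observation
n_obs, n_groups = 6, 2
X = sp.random(n_obs, 3, density=0.5, format="csr")
groups = np.array([0, 0, 1, 1, 1, 0])

# A: (n_groups × n_obs) sparse indicator matrix (weights could replace the ones)
A = sp.coo_matrix(
    (np.ones(n_obs), (groups, np.arange(n_obs))), shape=(n_groups, n_obs)
).tocsr()

counts = np.asarray(A.sum(axis=1)).ravel()  # observations per group
sums = A @ X                                # group sums, still sparse
means = sums.multiply(1 / counts[:, None])  # group means
# variances follow from E[X^2] - E[X]^2, e.g. via A @ X.multiply(X)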

So it's plenty fast to compute representations like t-scores on the fly for many pairs of groups of cells. For example, computing t-scores for 500,000 cells over 1000 genes for 1800 pairs drops from 30 min with scanpy.tl.rank_genes_groups to 1.9 s (~1000× speedup). For reference, the included pandas implementation takes 30 s, and sparse aggregation on the densified adata takes 5 s.

Over the full 43K genes (which don't fit densely in RAM), t-scores take 13 s for 148 pairs and 19 s for 1800 pairs at finer resolution.

While this was designed to be extensible and support computing and optimizing single-cell representations as the data and models grow, I’m hopeful folks will get creative with use-cases on other data types too!

@codecov bot commented May 21, 2021

Codecov Report

Merging #564 (1d423cd) into master (e7a3a3e) will increase coverage by 0.62%.
The diff coverage is 95.08%.

@@            Coverage Diff             @@
##           master     #564      +/-   ##
==========================================
+ Coverage   86.49%   87.12%   +0.62%     
==========================================
  Files          25       26       +1     
  Lines        3532     3728     +196     
==========================================
+ Hits         3055     3248     +193     
- Misses        477      480       +3     
Impacted Files Coverage Δ
anndata/_core/groupby.py 95.05% <95.05%> (ø)
anndata/__init__.py 100.00% <100.00%> (ø)
anndata/_core/anndata.py 82.75% <0.00%> (+0.02%) ⬆️
anndata/_core/merge.py 93.25% <0.00%> (+1.03%) ⬆️
anndata/utils.py 83.92% <0.00%> (+1.57%) ⬆️
anndata/_io/write.py 77.27% <0.00%> (+2.54%) ⬆️

@jbloom22 force-pushed the master branch 2 times, most recently from dfa1c27 to 29cbe8e on May 21, 2021
@flying-sheep (Member) commented May 23, 2021

Looks excellent. I'd personally also implement it as a method available on AnnData, which would make constructing it more convenient.

Also needs some minor details like

  • type annotations moved from docstring to code
  • make autosummary generate docs for it
  • a changelog entry

About the content:

  • Should the pd_ variants be a keyword argument on GroupBy instead, like mask_nans: bool = False?

Then we could have two GroupBy subclasses implementing each variant. We could do it like pathlib: an abstract base class GroupBy whose __new__ method returns a GroupByNumpy or GroupByPandas instance depending on the passed value:

from abc import ABC, abstractmethod

class GroupBy(ABC):
    """..."""
    def __new__(cls, *args, mask_nans: bool = False, **kwargs):
        # pathlib-style dispatch to a concrete subclass
        if cls is GroupBy:
            cls = GroupByPandas if mask_nans else GroupByNumpy
        return super().__new__(cls)

    @abstractmethod
    def mean(self): ...

class GroupByNumpy(GroupBy):
    def mean(self): ...  # fast ndarray/sparse implementation

class GroupByPandas(GroupBy):
    def mean(self): ...  # NaN-masking pandas implementation

@ivirshup what do you think?

@ivirshup (Member) left a comment


Thanks for the PR! This looks really great, and like a huge performance improvement! Really excited to see Cellarity contributing back. As we discussed offline (well, sorta offline):

Plan for merging:

I think there's a lot of value here, and I think there is a (probably long) discussion to have around the API. I think we could merge as private with relatively few changes – then implement the interface in separate PRs. This would of course be used as internals for an actual GroupBy object's reduction interface. Multiple people could then hack on API implementations working from a common (and up-to-date) codebase.

I believe this is @jbloom22's preferred path as well. While it would be nice to have all this done in one PR, I think waiting on API decisions and implementing whatever consensus we come to is too much to ask of a non-core contributor. Sound reasonable @flying-sheep?

Naming and API

I'm going to put a short version of this here, and will open an issue with the longer version if the proposed "plan for merging" sounds good.

Naming

I don't think this quite fits the generality of groupby, since that should allow more custom access to the grouped object and doesn't require complete reduction over the groups. @gokceneraslan proposed a function with some similar functionality over in scverse/scanpy#1390 called summarized_expression_df. I think this works well for computing the reduction values. Not so sure it works for computing the test statistics.

API

What's the right way to expose this to users? Since I don't think this quite fulfills the full split-apply-combine paradigm, I don't think it should be through a GroupBy object. Two main ideas:

  • Could this be a function?
  • Could we split the reductions and the comparisons?

I'd be interested to hear thoughts from people who have commented on this in the past / made similar PRs.

Right now I'm thinking something like:

stats = ad.get.summarized_df(adata, key, layer="lognorm", stats=["mean", "var", "count"])
# could also be called `grouped_stats`
# where stats is a dict[str, pd.DataFrame]
# for df in stats.values():
#     assert (df.index == adata.var_names).all()
#     assert (df.columns == ["mean", "var", "count"]).all()

Score calculation could then look something like:

scores = pd.DataFrame({
    "logfold": ad.stats.fold_score(stats["group_1"], stats["group_2"]),
    "t_score": ad.stats.t_score(stats["group_1"], stats["group_2"]),
})

# Or scores could operate on groups
scores = ad.get.grouped_scores(stats, "group_1", "group_2", scores=["t", "fold"])

Other topics (to address separately from API)

Docs

Could you add some examples to the docstrings? It would be nice to see how this is intended to be used.

pd_methods

@jbloom22, could you remind me why you left in the pd_ methods? IIRC it was about differences in handling null values? Could you provide some examples of cases that you know differ between the two?

Multilevel group encoding

We should have a different way to encode multilevel groupings since we can't write a column of tuples to disk. I'd propose just passing the indicator matrix directly here, since this is a reasonably common way to represent these.

This should also allow simplifying _extract_indices down to the equivalent of sklearn.preprocessing.OneHotEncoder(sparse=True).fit_transform(adata.obs[[key]]).
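
For concreteness, a sketch of that encoding (the obs column is a toy stand-in; note sparse= was renamed sparse_output= in scikit-learn 1.2):

import pandas as pd
import scipy.sparse as sp
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for adata.obs[[key]]
obs = pd.DataFrame({"cell_type": ["B", "T", "B", "NK"]})

# (n_obs × n_groups) sparse indicator matrix
indicator = OneHotEncoder(sparse=True).fit_transform(obs)

# Transposed, this is the A in the A * X group sums above
A = sp.csr_matrix(indicator.T)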

Null values in groups

It would be nice to have this handled. Currently, if any of the values in adata.obs[key] are null, this errors. I think the question is whether to consider null a group or not; I'd say no. I'd also say that null values shouldn't be allowed if an indicator matrix is passed, as I can't think of what that could mean.

Testing

I would like the tests to be split up a bit, so that if one fails we can see all the affected components instead of just the first one.

  • Could you break up the tests so they're parameterized? (A sketch follows this list.)
  • Could you test inputs of different dtypes? I'd like to see that this works for integers and floats in particular.
  • Could you also add a test ensuring that you're handling numerical instability? I'd hate for the handling of this to break without us noticing.
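
For illustration, a parameterized-test sketch (the test name and parameter lists are assumptions, not this PR's tests):

import numpy as np
import pytest

@pytest.mark.parametrize("dtype", [np.int32, np.float32, np.float64])
@pytest.mark.parametrize("stat", ["count", "sum", "mean", "var"])
def test_groupby_stat(dtype, stat):
    # build a small matrix of `dtype`, aggregate by a toy key,
    # and compare against a dense numpy reference computation
    ...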

@flying-sheep (Member)

While it would be nice to have all this done in one PR, I think waiting on API decisions and implementing whatever consensus we come to is too much to ask of a non-core contributor. Sound reasonable @flying-sheep?

Totally, if the API decisions do turn out to take a while. I like things as they are, except for the pd_* methods not having the same name as the base one. Whenever that happens, it should be a keyword argument instead.

Regarding them: from what I understood, they skip NaNs instead of propagating them – depending on what one wants, mean([1, 2, nan]) == 1.5 can be more useful than it being nan.
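
Concretely, with NumPy standing in for the two behaviours:

import numpy as np

x = np.array([1.0, 2.0, np.nan])
np.mean(x)     # nan – the NaN propagates
np.nanmean(x)  # 1.5 – the NaN is skipped, as in the pd_* variants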

@joshua-gould (Contributor)

It might also be useful to compute num_expressed_ = _toarray(A * (X != 0)); however, this does create an additional copy of X.
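
A sketch of that suggestion (toy shapes; A and X as in the PR description):

import numpy as np
import scipy.sparse as sp

# X: (n_obs × n_vars) sparse expression; A: (n_groups × n_obs) indicator
X = sp.random(6, 3, density=0.5, format="csr")
groups = np.array([0, 0, 1, 1, 1, 0])
A = sp.coo_matrix((np.ones(6), (groups, np.arange(6))), shape=(2, 6)).tocsr()

nonzero = (X != 0).astype(X.dtype)       # boolean pattern – the extra copy of X
num_expressed = (A @ nonzero).toarray()  # cells per group expressing each gene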

@flying-sheep (Member) commented Jun 10, 2021

I implemented some of the outstanding changes, and an idea about the API in jbloom22#1

Alternatively, we can stay with dicts here, but I'd still get rid of _unpack_stats and make all those functions operate on stat bundles.

The tests fail because I used Python 3.7 stuff (since we're going to depend on Python 3.7 anyway).

Create a CountMeanVar type
@ivirshup (Member)

My current plan for this PR:

I would like to split this PR out into a separate repo. This allows iterating on the API quickly and adding dependencies (some likely temporary).

Likely name: anndata-stats

ivirshup added a commit to ivirshup/anndata-stats that referenced this pull request May 30, 2022
Co-authored-by: Jon Bloom <jbloom@cellarity.com>
@ivirshup (Member)

Most of this feature has been released in scanpy 1.10 via sc.get.aggregate: https://scanpy.readthedocs.io/en/stable/generated/scanpy.get.aggregate.html#scanpy.get.aggregate
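
A quick sketch of the released API (the toy AnnData below is an assumption; per the docs, results land in the layers of the returned AnnData, keyed by function name):

import numpy as np
import pandas as pd
import anndata as ad
import scanpy as sc

# Toy AnnData with a categorical grouping column
adata = ad.AnnData(X=np.random.poisson(1.0, size=(6, 3)).astype(np.float32))
adata.obs["cell_type"] = pd.Categorical(["B", "B", "T", "T", "T", "B"])

agg = sc.get.aggregate(adata, by="cell_type", func=["mean", "var", "count_nonzero"])
agg.layers["mean"]  # (n_groups × n_vars) per-group means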

Thanks for getting this started, and apologies for how long it took to get out!

@ivirshup closed this Apr 16, 2024