
Cellarity: GroupBy with sparse-matrix agg, scores #564

Closed
wants to merge 11 commits

Conversation

@jbloom22 commented May 21, 2021

Cellarity benefits from anndata / scanpy open-source development and is proud to contribute a class I wrote in July 2020 to speed up our Platform.

The GroupBy class supports grouping and aggregating AnnData observations by key, per variable, and computing scores between pairs of groups.

Features:

  • For groups, supports count, sum, mean, and variance. Observations may be weighted.
  • For pairs of groups, supports difference of means, ratio of means (fold-change), t-score, v-score, pooled t-score, and pooled v-score. While the t-score detects a non-zero difference of means, the v-score is a measure of the distance between means on a variance scale that remains stable as the number of observations in each group grows.
  • Each observation may belong to multiple groups (and to any of them multiple times) via a tuple-valued key and explode; see the sketch after this list.
  • The sparse-matrix aggregation approach works natively with ndarray and scipy sparse formats. Slower pandas-based implementations are also included; they require dense data but mask NaN values instead of propagating them.
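
A minimal sketch of the tuple-key idea (the labels below are made up, and pandas explode stands in for whatever the class does internally):

import pandas as pd

# Hypothetical tuple-valued key: each cell may carry several group labels
key = pd.Series([("B", "lymphoid"), ("T", "lymphoid"), ("B",)])

# explode() repeats each observation once per group it belongs to;
# the repeated index records the original observation
memberships = key.explode()
# 0           B
# 0    lymphoid
# 1           T
# 1    lymphoid
# 2           B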

Implementation/Performance:

  • No view or copy overhead in runtime or memory (even when filtering keys via key_set). If the data loads, you can aggregate it.
  • Sums over groups of observations are computed as A * X, where A is a coordinate-sparse matrix encoding group memberships and weights (exposed via sparse_aggregator); see the sketch after this list.
  • To compute scores, sufficient statistics are computed per group, from which scores are computed per pair. In practice, runtime is dominated by the aggregation, so it is effectively independent of the number of pairs.
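
A minimal sketch of the aggregation idea (variable names are illustrative, not this PR's API): group counts, sums, and means all fall out of one sparse product.

import numpy as np
import scipy.sparse as sp

# X: (n_obs × n_vars) data; groups: one integer label per observation
n_obs, n_groups = 6, 2
X = sp.random(n_obs, 3, density=0.5, format="csr")
groups = np.array([0, 0, 1, 1, 1, 0])

# A: (n_groups × n_obs) sparse indicator matrix (weights could replace the ones)
A = sp.coo_matrix(
    (np.ones(n_obs), (groups, np.arange(n_obs))), shape=(n_groups, n_obs)
).tocsr()

counts = np.asarray(A.sum(axis=1)).ravel()  # observations per group
sums = A @ X                                # group sums, still sparse
means = sums.multiply(1 / counts[:, None])  # group means
# variances follow from E[X^2] - E[X]^2, e.g. via A @ X.multiply(X)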

So it's plenty fast to compute representations like t-scores on the fly for many pairs of groups of cells. For example, computing t-scores for 500,000 cells over 1000 genes for 1800 pairs drops from 30 min with scanpy.tl.rank_genes_groups to 1.9 s (~1000× speedup). For reference, the included pandas implementation takes 30 s, and sparse aggregation on the densified adata takes 5 s.

Over the full 43K genes (which don't fit densely in RAM), t-scores take 13 s for 148 pairs and 19 s for 1800 pairs at finer resolution.

While this was designed to be extensible and support computing and optimizing single-cell representations as the data and models grow, I’m hopeful folks will get creative with use-cases on other data types too!

@codecov bot commented May 21, 2021

Codecov Report

Merging #564 (1d423cd) into master (e7a3a3e) will increase coverage by 0.62%.
The diff coverage is 95.08%.

@@            Coverage Diff             @@
##           master     #564      +/-   ##
==========================================
+ Coverage   86.49%   87.12%   +0.62%     
==========================================
  Files          25       26       +1     
  Lines        3532     3728     +196     
==========================================
+ Hits         3055     3248     +193     
- Misses        477      480       +3     
Impacted Files Coverage Δ
anndata/_core/groupby.py 95.05% <95.05%> (ø)
anndata/__init__.py 100.00% <100.00%> (ø)
anndata/_core/anndata.py 82.75% <0.00%> (+0.02%) ⬆️
anndata/_core/merge.py 93.25% <0.00%> (+1.03%) ⬆️
anndata/utils.py 83.92% <0.00%> (+1.57%) ⬆️
anndata/_io/write.py 77.27% <0.00%> (+2.54%) ⬆️

@jbloom22 force-pushed the master branch 2 times, most recently from dfa1c27 to 29cbe8e on May 21, 2021
@flying-sheep (Member) commented May 23, 2021

Looks excellent. I'd personally also implement it as a method available on AnnData, which would make constructing it more convenient.

Also needs some minor details like

  • type annotations moved from docstring to code
  • make autosummary generate docs for it
  • a changelog entry

About the content:

  • Should the pd_ variants be a keyword argument on GroupBy instead, like mask_nans: bool = False?

Then we could have two GroupBy subclasses implementing each variant. We could do it like pathlib: an abstract base class GroupBy whose __new__ method returns a GroupByNumpy or GroupByPandas instance depending on the passed value:

from abc import ABC, abstractmethod

class GroupBy(ABC):
    """..."""
    def __new__(cls, *args, mask_nans: bool = False, **kwargs):
        # pathlib-style dispatch to a concrete subclass
        if cls is GroupBy:
            cls = GroupByPandas if mask_nans else GroupByNumpy
        return super().__new__(cls)

    @abstractmethod
    def mean(self): ...

class GroupByNumpy(GroupBy):
    def mean(self): ...  # fast ndarray/sparse implementation

class GroupByPandas(GroupBy):
    def mean(self): ...  # NaN-masking pandas implementation

@ivirshup what do you think?

@ivirshup (Member) left a comment


Thanks for the PR! This looks really great, and like a huge performance improvement! Really excited to see Cellarity contributing back. As we discussed offline (well, sorta offline):

Plan for merging:

I think there's a lot of value here, and I think there is a (probably long) discussion to have around the API. I think we could merge as private with relatively few changes – then implement the interface in separate PRs. This would of course be used as internals for an actual GroupBy object's reduction interface. Multiple people could then hack on API implementations working from a common (and up-to-date) codebase.

I believe this is @jbloom22's preferred path as well. While it would be nice to have all this done in one PR, I think waiting on API decisions and implementing whatever consensus we come to is too much to ask of a non-core contributor. Sound reasonable @flying-sheep?

Naming and API

I'm going to put a short version of this here, and will open an issue with the longer version if the proposed "plan for merging" sounds good.

Naming

I don't think this quite fits the generality of groupby, since that should allow more custom access to the grouped object and doesn't require complete reduction over the groups. @gokceneraslan proposed a function with some similar functionality over in scverse/scanpy#1390 called summarized_expression_df. I think this works well for computing the reduction values. Not so sure it works for computing the test statistics.

API

What's the right way to expose this to users? Since I don't think this quite fulfills the full split-apply-combine paradigm, I don't think it should be through a GroupBy object. Two main ideas:

  • Could this be a function?
  • Could we split the reductions and the comparisons?

I'd be interested to hear thoughts from people who have commented on this in the past / made similar PRs.

Right now I'm thinking something like:

stats = ad.get.summarized_df(adata, key, layer="lognorm", stats=["mean", "var", "count"])
# could also be called `grouped_stats`
# where stats is a dict[str, pd.DataFrame]
# for df in stats.values():
#     assert (df.index == adata.var_names).all()
#     assert (df.columns == ["mean", "var", "count"]).all()

Score calculation could then look something like:

scores = pd.DataFrame({
    "logfold": ad.stats.fold_score(stats["group_1"], stats["group_2"]),
    "t_score": ad.stats.t_score(stats["group_1"], stats["group_2"]),
})

# Or scores could operate on groups
scores = ad.get.grouped_scores(stats, "group_1", "group_2", scores=["t", "fold"])

Other topics (to address separately from API)

Docs

Could you add some examples to the docstrings? It would be nice to see how this is intended to be used.

pd_methods

@jbloom22, could you remind me why you left in the pd_ methods? IIRC it was about differences in handling null values? Could you provide some examples of cases that you know differ between the two?

Multilevel group encoding

We should have a different way to encode multilevel groupings since we can't write a column of tuples to disk. I'd propose just passing the indicator matrix directly here, since this is a reasonably common way to represent these.

This should also allow simplifying _extract_indices down to the equivalent of sklearn.preprocessing.OneHotEncoder(sparse=True).fit_transform(adata.obs[[key]]).
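
For concreteness, a sketch of that encoding (the obs column is a toy stand-in; note sparse= was renamed sparse_output= in scikit-learn 1.2):

import pandas as pd
import scipy.sparse as sp
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for adata.obs[[key]]
obs = pd.DataFrame({"cell_type": ["B", "T", "B", "NK"]})

# (n_obs × n_groups) sparse indicator matrix
indicator = OneHotEncoder(sparse=True).fit_transform(obs)

# Transposed, this is the A in the A * X group sums above
A = sp.csr_matrix(indicator.T)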

Null values in groups

It would be nice to have this handled. Currently, if any of the values in adata.obs[key] are null, this errors. I think the question is whether to consider null a group or not; I'd say no. I'd also say that null values shouldn't be allowed if an indicator matrix is passed, as I can't think of what that could mean.

Testing

I would like the tests to be split up a bit, so that if one fails we can see all the affected components instead of just the first one.

  • Could you break up the tests so they're parameterized? (A sketch follows this list.)
  • Could you test inputs of different dtypes? I'd like to see that this works for integers and floats in particular.
  • Could you also add a test ensuring that you're handling numerical instability? I'd hate for the handling of this to break without us noticing.
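
For illustration, a parameterized-test sketch (the test name and parameter lists are assumptions, not this PR's tests):

import numpy as np
import pytest

@pytest.mark.parametrize("dtype", [np.int32, np.float32, np.float64])
@pytest.mark.parametrize("stat", ["count", "sum", "mean", "var"])
def test_groupby_stat(dtype, stat):
    # build a small matrix of `dtype`, aggregate by a toy key,
    # and compare against a dense numpy reference computation
    ...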

@flying-sheep (Member)

While it would be nice to have all this done in one PR, I think waiting on API decisions and implementing whatever consensus we come to is too much to ask of a non-core contributor. Sound reasonable @flying-sheep?

Totally, if the API decisions do turn out to take a while. I like things as they are, except for the pd_* methods not having the same name as the base one. Whenever that happens, it should be a keyword argument instead.

Regarding them: from what I understood, they skip NaNs instead of propagating them – depending on what one wants, mean([1, 2, nan]) == 1.5 can be more useful than it being nan.
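
Concretely, with NumPy standing in for the two behaviours:

import numpy as np

x = np.array([1.0, 2.0, np.nan])
np.mean(x)     # nan – the NaN propagates
np.nanmean(x)  # 1.5 – the NaN is skipped, as in the pd_* variants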

@joshua-gould (Contributor)

It might also be useful to compute num_expressed_ = _toarray(A * (X != 0)); however, this does create an additional copy of X.
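
A sketch of that suggestion (toy shapes; A and X as in the PR description):

import numpy as np
import scipy.sparse as sp

# X: (n_obs × n_vars) sparse expression; A: (n_groups × n_obs) indicator
X = sp.random(6, 3, density=0.5, format="csr")
groups = np.array([0, 0, 1, 1, 1, 0])
A = sp.coo_matrix((np.ones(6), (groups, np.arange(6))), shape=(2, 6)).tocsr()

nonzero = (X != 0).astype(X.dtype)       # boolean pattern – the extra copy of X
num_expressed = (A @ nonzero).toarray()  # cells per group expressing each gene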

@flying-sheep (Member) commented Jun 10, 2021

I implemented some of the outstanding changes, and an idea about the API in jbloom22#1

Alternatively, we can stay with dicts here, but I'd still get rid of _unpack_stats and make all those functions operate on stat bundles.

The tests fail because I used Python 3.7 stuff (since we're going to depend on Python 3.7 anyway).

Create a CountMeanVar type
@ivirshup (Member)

My current plan for this PR:

I would like to split this PR out into a separate repo. This allows iterating on the API quickly and adding dependencies (some likely temporary).

Likely name: anndata-stats

ivirshup added a commit to ivirshup/anndata-stats that referenced this pull request May 30, 2022
Co-authored-by: Jon Bloom <jbloom@cellarity.com>
@ivirshup (Member)

Most of this feature has been released in scanpy 1.10 via sc.get.aggregate: https://scanpy.readthedocs.io/en/stable/generated/scanpy.get.aggregate.html#scanpy.get.aggregate
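
A quick sketch of the released API (the toy AnnData below is an assumption; per the docs, results land in the layers of the returned AnnData, keyed by function name):

import numpy as np
import pandas as pd
import anndata as ad
import scanpy as sc

# Toy AnnData with a categorical grouping column
adata = ad.AnnData(X=np.random.poisson(1.0, size=(6, 3)).astype(np.float32))
adata.obs["cell_type"] = pd.Categorical(["B", "B", "T", "T", "T", "B"])

agg = sc.get.aggregate(adata, by="cell_type", func=["mean", "var", "count_nonzero"])
agg.layers["mean"]  # (n_groups × n_vars) per-group means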

Thanks for getting this started, and apologies for how long it took to get out!

@ivirshup closed this Apr 16, 2024