Add pts and pts_rest to rank_genes_groups_df and allow multiple groups #1388

gokceneraslan · 2020-08-21T20:05:51Z

No description provided.

ivirshup · 2020-08-23T07:22:48Z

I had to look up what "pts" was for this (it's the proportion of samples with non-zero expression). Maybe it could have a better name? Also, this measurement is a lot like the mean or variance. Should those be included here too?
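For readers unfamiliar with the term, the definition above can be sketched in plain Python. This is only an illustration of the quantity (the fraction of cells with non-zero expression of a gene, inside a group and in all remaining cells); the helper name `fraction_nonzero` and the toy data are made up and this is not scanpy's implementation.

```python
def fraction_nonzero(values):
    """Fraction of entries that are non-zero (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(1 for v in values if v != 0) / len(values)


# Toy data (hypothetical): expression of one gene in six cells, with cluster labels.
expression = [0.0, 2.0, 1.5, 0.0, 0.0, 3.0]
labels = ["A", "A", "A", "B", "B", "B"]

in_group = [x for x, lab in zip(expression, labels) if lab == "A"]
in_rest = [x for x, lab in zip(expression, labels) if lab != "A"]

pts = fraction_nonzero(in_group)      # fraction of group "A" cells expressing the gene
pts_rest = fraction_nonzero(in_rest)  # same fraction over all other cells
```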

gokceneraslan · 2020-08-23T07:29:34Z

I had to look up what "pts" was for this (it's the proportion of samples with non-zero expression). Maybe it could have a better name?

I agree. I use fraction_expressed in my custom code, and we can use it here too. But this should be changed in sc.tl.rank_genes_groups, I guess, since pts is also the name of the option there now: rank_genes_groups(..., pts=True). @Koncopd wdyt?

Also, this measurement is a lot like the mean or variance. Should those be included here too?

Yes, I fully agree. I have ugly scripts to add these actually, but it'd be great to have them here too. Again, maybe we should do this also in rank_genes_groups? I'd also love to have something like groupby for anndata to easily calculate mean, var etc. of X.
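The kind of per-group summary wished for here could look roughly like the following. It is a plain-Python sketch for a single gene, not an AnnData or scanpy API; `group_stats` is a hypothetical name, and using the population variance is an arbitrary choice the thread does not specify.

```python
from collections import defaultdict
from statistics import mean, pvariance


def group_stats(values, labels):
    """Return {label: (mean, population variance)} for one gene."""
    grouped = defaultdict(list)
    for value, label in zip(values, labels):
        grouped[label].append(value)
    return {label: (mean(vals), pvariance(vals)) for label, vals in grouped.items()}


# Toy data (hypothetical): four cells in two groups.
stats = group_stats([1.0, 3.0, 0.0, 4.0], ["A", "A", "B", "B"])
```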

Koncopd · 2020-08-23T16:46:17Z

Yes, fraction_expressed seems to be a better name for this, I agree.

Maybe it is convenient to have groups' mean or variance calculation somewhere, but I'm not sure it should be in rank_genes_groups.

gokceneraslan · 2020-08-23T21:55:11Z

I renamed pts and pts_rest to fraction_group and fraction_rest. I'd like to merge this PR if there is no other feedback. We can think about how to provide aggregate statistics somewhere else. Then we can revisit this function and merge this info too.

ivirshup

Mostly looks good! Two minor changes:

`group` argument

I'm not sure I like the group argument accepting multiple groups. Could that get removed? If you'd like to keep it, why is this the right api?

New column name

I don't think fraction_group and fraction_rest are obvious either. fraction_expressed_group and fraction_expressed_rest? Those are a bit wordy. Also, is rest the right term?

The idea behind this function was to get the basic summary of differential expression that you'd expect to get from a DE package. Do other packages provide this value? If so, what do they call this column?

gokceneraslan · 2020-10-10T23:10:04Z

group argument

I'm not sure I like the group argument accepting multiple groups. Could that get removed? If you'd like to keep it, why is this the right api?

I wanna keep it this way, if possible. There are two reasons I can think of:

When I want to save the results of DE (e.g. as a table for a publication or to send to collaborators), I always need to loop over all categories which is not so hard but gets super annoying since I need it all the time. This is something almost all scanpy users I know also need, because they too need a table for DE results with all groups in it in almost every project.
No other DE-related function has a mandatory argument restricting it to a specific category; e.g. we don't have a mandatory group argument in sc.tl.rank_genes_groups, right? Why? Because it is more common to work with all groups. We could have made sc.tl.rank_genes_groups behave like the existing API of sc.get.rank_genes_groups_df, but we didn't. Another example is the groups argument in sc.pl.umap: it's more common to look at all groups in UMAPs rather than a specific group, so it is not mandatory. So maybe it makes sense to ask the question the other way around: why do we limit it to a single group here in sc.get.rank_genes_groups_df, and why is that the right API?
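The loop described in the first reason looks roughly like this. It is a schematic sketch in plain Python: `table_for_group` is a hypothetical stand-in for a function that returns the table for one group only, and the results are toy values, not real DE output.

```python
# Toy stand-in for per-group DE results (made up): {group: [(gene, score), ...]}.
results = {
    "A": [("g1", 5.0), ("g2", 3.0)],
    "B": [("g3", 4.0), ("g4", 1.0)],
}


def table_for_group(results, group):
    """Return one table (a list of row dicts) for a single group."""
    return [{"names": gene, "scores": score} for gene, score in results[group]]


# One call per category, tagging each row with its group before stacking
# everything into a single table.
combined = []
for group in results:
    for row in table_for_group(results, group):
        combined.append({"group": group, **row})
```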

New column name

I don't think fraction_group and fraction_rest are obvious either. fraction_expressed_group and fraction_expressed_rest? Those are a bit wordy. Also, is rest the right term?

The idea behind this function was to get the basic summary of differential expression that you'd expect to get from a DE package. Do other packages provide this value? If so, what do they call this column?

Again, mean expression and the fraction of cells "expressing" a gene are expected by A LOT OF people running DE in any single cell tool. I am not asking for these features randomly or only because I personally need them; they are indeed needed by many people. Seurat users expect them too. The scanpy API should prioritize the users a little bit more, in my opinion.

If it sounds too subjective, let's run an experiment: let's make a poll on Twitter using the scanpy account and ask people if they think fractions are needed or not and how they feel about these names :)

Main suggestion was indeed influenced by Seurat (see #1081 (comment)). But I don't really mind what the name will be. I think most people in the field are familiar with these names (even mu and alpha are used a lot for mean expression and fraction of cells expressing a gene, even though it is even more cryptic).

I do not think there is a super easy way to find a short and obvious name for the column, but I think using fraction_group and fraction_rest (or fraction_ref) are good enough. Reason I suggest fraction_rest is because "rest" is the default value of the reference argument in rank_genes_groups. We can also make it f"fraction_{reference}", if this is the way it is implemented.

Speaking of the column names, for example, do you think score is really obvious :) Try asking a few regular scanpy users what score means in the DE results we generate. Even our documentation is wrong: "Structured array to be indexed by group id storing the z-score underlying the computation of a p-value for each gene for each group." It is the logistic regression beta coefficient for logreg and the t-statistic for t-test; it's not a z-score at all...

So, even if the column name is not obvious, I think it's ok to explain it properly in the documentation and for fraction_group, it is easy (also easier than score) to explain.

LuckyMD · 2020-10-12T10:45:16Z

I just want to quickly agree with @gokceneraslan that fraction and mean expression of a DE gene in the group and in the rest is frequently asked for by data analysts or their wetlab collaborators.

gokceneraslan · 2020-11-10T17:42:31Z

Ping @ivirshup. I wanna merge this if possible, it'd be great if you can have a look at the reply above.

Together with this PR and #1488, it would be great to do gene-set enrichment of all cell types at once without loops \o/

ivirshup

Sorry for the late response on this!

`groups`

I still don't think this is quite right. When I'm sharing DE results, it's not going to be every comparison stacked together in one table. It would be a table per comparison, either in separate files or in a spreadsheet with a page per comparison.

But how about this for a compromise, groups stays a required argument. You can pass a list of groups, and a groups column will be added. You can also pass None, and all groups will be used. But you have to pass something. This means you can't just forget to pass a parameter and then open a bug report about how genes are showing up multiple times in your DE results. You had to opt in to either behavior.
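The proposed contract can be sketched as follows. This is a plain-Python illustration, not the code merged in this PR: `rank_genes_table` and the toy results are made up, and adding the group column only when more than one group is selected is an assumption.

```python
# Toy DE results (made up): {group: [(gene, score), ...]}.
TOY_RESULTS = {
    "A": [("g1", 5.0), ("g2", 3.0)],
    "B": [("g3", 4.0)],
}


def rank_genes_table(results, group):
    """Return rows for one group, several groups, or all groups (group=None).

    `group` has no default on purpose: the caller must choose explicitly.
    """
    if group is None:
        selected = list(results)  # explicit opt-in to all groups
    elif isinstance(group, str):
        selected = [group]
    else:
        selected = list(group)

    rows = []
    for name in selected:
        for gene, score in results[name]:
            row = {"names": gene, "scores": score}
            if len(selected) > 1:
                # Only stacked tables need a column saying which group a row is from.
                row = {"group": name, **row}
            rows.append(row)
    return rows


single = rank_genes_table(TOY_RESULTS, "A")    # one group, no "group" column
stacked = rank_genes_table(TOY_RESULTS, None)  # all groups, with a "group" column
```

Calling `rank_genes_table(TOY_RESULTS)` without the second argument raises a `TypeError`, which is the "you have to pass something" part of the compromise.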

New column name

I wasn't clear here. We should definitely include these values. I just think the names could be better and was wondering what other packages use as column names for these values.

AFAICT there is no agreed upon way to name these. Seems weird, since you'd think there'd be a technical name for "when logFC is positive the xxxx group had higher expression".

I would go for f"fraction_{reference}", but then you can't pass the output directly to a plotting function without also passing the value for reference.

How about:

pct_nz_group and pct_nz_reference/ pct_nz_ref? I could also go for lhs/ rhs instead of group/ reference, and fraction instead of pct. But group/reference is consistent with rank_genes_groups and pct is consistent with calculate_qc_metrics. I like having nz in there since otherwise it's not super clear what fraction we're talking about. Could be fraction of total expression, or something about proportion of the dataset? This way it's more clear in the table you show to a collaborator.
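The trade-off between the two naming schemes can be illustrated with a small sketch. `read_fractions` is a hypothetical downstream consumer (for example a plotting helper), and the rows are toy values; none of this is scanpy code.

```python
FIXED_COLUMNS = ["pct_nz_group", "pct_nz_reference"]


def dynamic_columns(reference):
    """Column names that embed the reference value, e.g. 'fraction_rest'."""
    return ["fraction_group", f"fraction_{reference}"]


def read_fractions(row, columns=FIXED_COLUMNS):
    """Hypothetical consumer that pulls the two fractions out of a row."""
    return tuple(row[name] for name in columns)


row_fixed = {"names": "g1", "pct_nz_group": 0.8, "pct_nz_reference": 0.1}
row_dynamic = {"names": "g1", "fraction_group": 0.8, "fraction_rest": 0.1}

# Fixed names: the consumer needs nothing besides the table itself.
fixed = read_fractions(row_fixed)

# Reference-dependent names: the caller must also know and pass the reference.
dynamic = read_fractions(row_dynamic, dynamic_columns("rest"))
```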

I agree score is a bit weird. Maybe statistic is a better choice? @davidsebfischer could probably be more authoritative on this. And yeah, we should change those z-score docs.

Performance

General question about performance. Is this faster than calling the previous function separately on each group, then concatenating the results?

davidsebfischer · 2020-11-20T09:53:16Z

I agree, statistic is better than score in my opinion.

gokceneraslan · 2020-12-07T18:20:05Z

groups

I still don't think this is quite right. When I'm sharing DE results, it's not going to be every comparison stacked together in one table. It would be a table per comparison, either in separate files or in a spreadsheet with a page per comparison.

But how about this for a compromise, groups stays a required argument. You can pass a list of groups, and a groups column will be added. You can also pass None, and all groups will be used. But you have to pass something. This means you can't just forget to pass a parameter and then open a bug report about how genes are showing up multiple times in your DE results. You had to opt in to either behavior.

OK, sounds good. Done.

New column name

I wasn't clear here. We should definitely include these values. I just think the names could be better and was wondering what other packages use as column names for these values.

AFAICT there is no agreed upon way to name these. Seems weird, since you'd think there'd be a technical name for "when logFC is positive the xxxx group had higher expression".

I would go for f"fraction_{reference}", but then you can't pass the output directly to a plotting function without also passing the value for reference.

How about:

pct_nz_group and pct_nz_reference/ pct_nz_ref? I could also go for lhs/ rhs instead of group/ reference, and fraction instead of pct. But group/reference is consistent with rank_genes_groups and pct is consistent with calculate_qc_metrics. I like having nz in there since otherwise it's not super clear what fraction we're talking about. Could be fraction of total expression, or something about proportion of the dataset? This way it's more clear in the table you show to a collaborator.

Sounds good, done.

I agree score is a bit weird. Maybe statistic is a better choice? @davidsebfischer could probably be more authoritative on this. And yeah, we should change those z-score docs.

Shall we change this in this function or in sc.tl.rank_genes_groups? I feel like renaming it here is not the best way.

Performance

General question about performance. Is this faster than calling the previous function separately on each group, then concatenating the results?

I think so:

ivirshup

Looks good to me, sans a minor bug with adding the "pts" (see suggestion).

Shall we change this in this function or in sc.tl.rank_genes_groups? I feel like renaming it here is not the best way.

I think this goes on the list of things to change in 2.0. Probably keep score for now.

Co-authored-by: Isaac Virshup <ivirshup@gmail.com>

Add pts and pts_rest to rank_genes_groups_df

a77bb9e

gokceneraslan requested a review from Koncopd August 21, 2020 20:05

gokceneraslan added 3 commits August 21, 2020 16:18

Add test for pts and pts_rest

dab6f4e

Add pts

430b028

Allow multiple groups in rank_genes_groups_df

4b64aae

gokceneraslan changed the title ~~Add pts and pts_rest to rank_genes_groups_df~~ Add pts and pts_rest to rank_genes_groups_df and allow multiple groups Aug 23, 2020

gokceneraslan added 5 commits August 22, 2020 22:43

Fix pos arguments

2a0e880

Fix concat

da8a498

remove group column for backward compat if len(group) == 1

c1cf164

Update _utils.py

2bf7027

Preserve order

5ad6685

gokceneraslan added 2 commits August 23, 2020 14:38

Rename pts and pts_rest to fraction_group and fraction_rest

0b26156

Fix test

1694383

gokceneraslan added 4 commits August 23, 2020 18:23

Update test_get.py

b6b6898

Fix ref

8b180ae

Merge branch 'master' into rank_genes_groups_df-pts

2dc7032

Merge branch 'master' into rank_genes_groups_df-pts

2176d47

gokceneraslan mentioned this pull request Aug 25, 2020

Implement sc.get.summarized_expression_df #1390

Open

ivirshup self-requested a review August 25, 2020 06:28

gokceneraslan added 2 commits August 25, 2020 12:58

Merge branch 'master' into rank_genes_groups_df-pts

35fa4b6

Merge branch 'master' into rank_genes_groups_df-pts

3ec1b04

ivirshup requested changes Aug 31, 2020

ivirshup reviewed Nov 20, 2020

Make group parameter mandatory and reduce code duplication

88ae992

Gokcen Eraslan added 3 commits December 7, 2020 13:23

Fix pts tests

ab38954

Black

c276c91

rename pts

5e6b694

ivirshup requested changes Dec 8, 2020

gokceneraslan and others added 2 commits December 8, 2020 11:38

Update scanpy/get.py

b71f63a

Co-authored-by: Isaac Virshup <ivirshup@gmail.com>

Add var_names.name test

4b7717e

gokceneraslan merged commit 8d9eec4 into master Dec 8, 2020

gokceneraslan deleted the rank_genes_groups_df-pts branch December 8, 2020 19:24
