Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pts and pts_rest to rank_genes_groups_df and allow multiple groups #1388

Merged
merged 23 commits into from Dec 8, 2020

Conversation

gokceneraslan
Copy link
Collaborator

No description provided.

@gokceneraslan gokceneraslan changed the title Add pts and pts_rest to rank_genes_groups_df Add pts and pts_rest to rank_genes_groups_df and allow multiple groups Aug 23, 2020
@ivirshup
Copy link
Member

I had to look up what "pts" was for this (its the proportion of samples with non-zero expression). Maybe it could have a better name? Also, this measurement is a lot like the mean or variance. Should those be included here too?

@gokceneraslan
Copy link
Collaborator Author

gokceneraslan commented Aug 23, 2020

I had to look up what "pts" was for this (its the proportion of samples with non-zero expression). Maybe it could have a better name?

I agree. I use fraction_expressed in my custom code, we can use it also here. But this should be changed in sc.tl.rank_genes_groups, I guess. This is also the name of the option now rank_genes_groups(..., pts=True). @Koncopd wdyt?

Also, this measurement is a lot like the mean or variance. Should those be included here too?

Yes, I fully agree. I have ugly scripts to add these actually, but it'd be great to have them here too. Again, maybe we should do this also in rank_genes_groups? I'd also love to have something like groupby for anndata to easily calculate mean, var etc. of X.

@Koncopd
Copy link
Member

Koncopd commented Aug 23, 2020

Yes, fraction_expressed seems to be a better name for this, i agree.

Maybe it is convenient to have groups' mean or variance calculation somewhere, i'm not sure it should be in rank_genes_groups.

@gokceneraslan
Copy link
Collaborator Author

I renamed pts and pts_rest to fraction_group and fraction_rest. I'd like to merge this PR if there is no other feedback. We can think about how to provide aggregate statistics somewhere else. Then we can revisit this function and merge this info too.

Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good! Two minor changes:

group argument

I'm not sure I like the group argument accepting multiple groups. Could that get removed? If you'd like to keep it, why is this the right api?

New column name

I don't think fraction_group and fraction_rest are obvious either. fraction_expressed_group and fraction_expressed_rest? Those are a bit wordy. Also, is rest the right term?

The idea behind this function was to get the basic summary of differential expression that you'd expect to get from a DE package. Do other packages provide this value? If so, what do they call this column?

scanpy/get.py Show resolved Hide resolved
@gokceneraslan
Copy link
Collaborator Author

gokceneraslan commented Oct 10, 2020

group argument

I'm not sure I like the group argument accepting multiple groups. Could that get removed? If you'd like to keep it, why is this the right api?

I wanna keep it this way, if possible. There are two reasons I can think of:

  1. When I want to save the results of DE (e.g. as a table for a publication or to send to collaborators), I always need to loop over all categories which is not so hard but gets super annoying since I need it all the time. This is something almost all scanpy users I know also need, because they too need a table for DE results with all groups in it in almost every project.

  2. We don't have such an argument limiting the function to a specific category in a mandatory way in any DE-related function e.g. we don't have a mandatory group argument in sc.tl.rank_genes_groups, right? Why? Because it is more common to do with all groups. We could have made sc.tl.rank_genes_groups also like existing API of sc.get.rank_genes_groups_df, but we didn't. Another example is the groups argument in sc.pl.umap, it's more common to look at all groups in UMAPs rather than a specific group, so not mandatory. So maybe it makes sense to ask the question other way around: why do we limit it to single group here in sc.get.rank_genes_groups_df and why is it the right API?

New column name

I don't think fraction_group and fraction_rest are obvious either. fraction_expressed_group and fraction_expressed_rest? Those are a bit wordy. Also, is rest the right term?

The idea behind this function was to get the basic summary of differential expression that you'd expect to get from a DE package. Do other packages provide this value? If so, what do they call this column?

Again, mean expression and fraction of cells "expressing" a gene is expected by A LOT OF people running DE in any single cell tool. I am not asking for these features randomly or only bcs I personally need them, it is indeed needed by many people. Seurat users expect them too. scanpy API should prioritize the users a little bit more, in my opinion.

If it sounds too subjective let's make an experiment: let's make a poll on Twitter using the scanpy account ask people if they think fractions are needed or not and how they feel about these names :)

Main suggestion was indeed influenced by Seurat (see #1081 (comment)). But I don't really mind what the name will be. I think most people in the field are familiar with these names (even mu and alpha are used a lot for mean expression and fraction of cells expressing a gene, even though it is even more cryptic).

I do not think there is a super easy way to find a short and obvious name for the column, but I think using fraction_group and fraction_rest (or fraction_ref) are good enough. Reason I suggest fraction_rest is because "rest" is the default value of the reference argument in rank_genes_groups. We can also make it f"fraction_{reference}", if this is the way it is implemented.

Speaking of the column names, for example, do you think score is really obvious :) Try to ask a few regular scanpy users what score means in the DE results we generate. Even our documentation is wrong: Structured array to be indexed by group id storing the z-score underlying the computation of a p-value for each gene for each group. It is the logistic regression beta coef for logreg and t-statistic for t-test, it's not z-score at all...

So, even if the column name is not obvious, I think it's ok to explain it properly in the documentation and for fraction_group, it is easy (also easier than score) to explain.

@LuckyMD
Copy link
Contributor

LuckyMD commented Oct 12, 2020

I just want to quickly agree with @gokceneraslan that fraction and mean expression of a DE gene in the group and in the rest is frequently asked for by data analysts or their wetlab collaborators.

@gokceneraslan
Copy link
Collaborator Author

Ping @ivirshup. I wanna merge this if possible, it'd be great if you can have a look at the reply above.

Together with this PR and #1488, it would be great to do gene-set enrichment of all cell types at once without loops \o/

Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late response on this!

groups

I still don't think this is quite right. When I'm sharing DE results, it's not going to be every comparison stacked together in one table. It would be a table per comparison, either in separate files or in a spreadsheet with a page per comparison.

But how about this for a compromise, groups stays a required argument. You can pass a list of groups, and a groups column will be added. You can also pass None, and all groups will be used. But you have to pass something. This means you can't just forget to pass a parameter and then open a bug report about how genes are showing up multiple times in your DE results. You had to opt in to either behavior.

New column name

I wasn't clear here. We should definitely include these values. I just think the names could be better and was wondering what other packages use as column names for these values.

AFAICT there is no agreed upon way to name these. Seems weird, since you'd think there'd be a technical name for "when logFC is positive the xxxx group had higher expression".

I would go for f"fraction_{reference}", but then you can't pass the output directly to a plotting function without also passing the value for reference.

How about:

pct_nz_group and pct_nz_reference/ pct_nz_ref? I could also go for lhs/ rhs instead of group/ reference, and fraction instead of pct. But group/reference is consistent with rank_genes_groups and pct is consistent with calculate_qc_metrics. I like having nz in there since otherwise it's not super clear what fraction we're talking about. Could be fraction of total expression, or something about proportion of the dataset? This way it's more clear in the table you show to a collaborator.

I agree score is a bit weird. Maybe statistic is a better choice? @davidsebfischer could probably be more authoritative on this. And yeah, we should change those z-score docs.

Performance

General question about performance. Is this faster than calling the previous function separately on each group, then concatenating the results?

scanpy/get.py Show resolved Hide resolved
@davidsebfischer
Copy link

I agree, statistic is better than score in my opinion.

@gokceneraslan
Copy link
Collaborator Author

groups

I still don't think this is quite right. When I'm sharing DE results, it's not going to be every comparison stacked together in one table. It would be a table per comparison, either in separate files or in a spreadsheet with a page per comparison.

But how about this for a compromise, groups stays a required argument. You can pass a list of groups, and a groups column will be added. You can also pass None, and all groups will be used. But you have to pass something. This means you can't just forget to pass a parameter and then open a bug report about how genes are showing up multiple times in your DE results. You had to opt in to either behavior.

OK, sounds good. Done.

New column name

I wasn't clear here. We should definitely include these values. I just think the names could be better and was wondering what other packages use as column names for these values.

AFAICT there is no agreed upon way to name these. Seems weird, since you'd think there'd be a technical name for "when logFC is positive the xxxx group had higher expression".

I would go for f"fraction_{reference}", but then you can't pass the output directly to a plotting function without also passing the value for reference.

How about:

pct_nz_group and pct_nz_reference/ pct_nz_ref? I could also go for lhs/ rhs instead of group/ reference, and fraction instead of pct. But group/reference is consistent with rank_genes_groups and pct is consistent with calculate_qc_metrics. I like having nz in there since otherwise it's not super clear what fraction we're talking about. Could be fraction of total expression, or something about proportion of the dataset? This way it's more clear in the table you show to a collaborator.

Sounds good, done.

I agree score is a bit weird. Maybe statistic is a better choice? @davidsebfischer could probably be more authoritative on this. And yeah, we should change those z-score docs.

Shall we change this in this function or in sc.tl.rank_genes_groups? I feel like renaming it here is not the best way.

Performance

General question about performance. Is this faster than calling the previous function separately on each group, then concatenating the results?

I think so:

image

Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, sans a minor bug with adding the "pts" (see suggestion).

Shall we change this in this function or in sc.tl.rank_genes_groups? I feel like renaming it here is not the best way.

I think this goes on the list of things to change in 2.0. Probably keep score for now.

scanpy/get.py Outdated Show resolved Hide resolved
scanpy/get.py Show resolved Hide resolved
gokceneraslan and others added 2 commits December 8, 2020 11:38
Co-authored-by: Isaac Virshup <ivirshup@gmail.com>
@gokceneraslan gokceneraslan merged commit 8d9eec4 into master Dec 8, 2020
@gokceneraslan gokceneraslan deleted the rank_genes_groups_df-pts branch December 8, 2020 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants