New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add pts and pts_rest to rank_genes_groups_df and allow multiple groups #1388
Conversation
I had to look up what |
I agree. I use
Yes, I fully agree. I have ugly scripts to add these actually, but it'd be great to have them here too. Again, maybe we should do this also in rank_genes_groups? I'd also love to have something like groupby for anndata to easily calculate mean, var etc. of X. |
Yes, Maybe it is convenient to have groups' mean or variance calculation somewhere, i'm not sure it should be in |
I renamed pts and pts_rest to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good! Two minor changes:
group
argument
I'm not sure I like the group argument accepting multiple groups. Could that get removed? If you'd like to keep it, why is this the right api?
New column name
I don't think fraction_group
and fraction_rest
are obvious either. fraction_expressed_group
and fraction_expressed_rest
? Those are a bit wordy. Also, is rest
the right term?
The idea behind this function was to get the basic summary of differential expression that you'd expect to get from a DE package. Do other packages provide this value? If so, what do they call this column?
I wanna keep it this way, if possible. There are two reasons I can think of:
Again, mean expression and fraction of cells "expressing" a gene is expected by A LOT OF people running DE in any single cell tool. I am not asking for these features randomly or only bcs I personally need them, it is indeed needed by many people. Seurat users expect them too. scanpy API should prioritize the users a little bit more, in my opinion. If it sounds too subjective let's make an experiment: let's make a poll on Twitter using the scanpy account ask people if they think fractions are needed or not and how they feel about these names :) Main suggestion was indeed influenced by Seurat (see #1081 (comment)). But I don't really mind what the name will be. I think most people in the field are familiar with these names (even mu and alpha are used a lot for mean expression and fraction of cells expressing a gene, even though it is even more cryptic). I do not think there is a super easy way to find a short and obvious name for the column, but I think using fraction_group and fraction_rest (or fraction_ref) are good enough. Reason I suggest fraction_rest is because "rest" is the default value of the reference argument in rank_genes_groups. We can also make it Speaking of the column names, for example, do you think So, even if the column name is not obvious, I think it's ok to explain it properly in the documentation and for |
I just want to quickly agree with @gokceneraslan that fraction and mean expression of a DE gene in the group and in the rest is frequently asked for by data analysts or their wetlab collaborators. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late response on this!
groups
I still don't think this is quite right. When I'm sharing DE results, it's not going to be every comparison stacked together in one table. It would be a table per comparison, either in separate files or in a spreadsheet with a page per comparison.
But how about this for a compromise, groups
stays a required argument. You can pass a list of groups, and a groups column will be added. You can also pass None
, and all groups will be used. But you have to pass something. This means you can't just forget to pass a parameter and then open a bug report about how genes are showing up multiple times in your DE results. You had to opt in to either behavior.
New column name
I wasn't clear here. We should definitely include these values. I just think the names could be better and was wondering what other packages use as column names for these values.
AFAICT there is no agreed upon way to name these. Seems weird, since you'd think there'd be a technical name for "when logFC is positive the xxxx group had higher expression".
I would go for f"fraction_{reference}"
, but then you can't pass the output directly to a plotting function without also passing the value for reference
.
How about:
pct_nz_group
and pct_nz_reference
/ pct_nz_ref
? I could also go for lhs
/ rhs
instead of group
/ reference
, and fraction
instead of pct
. But group
/reference
is consistent with rank_genes_groups
and pct
is consistent with calculate_qc_metrics
. I like having nz
in there since otherwise it's not super clear what fraction we're talking about. Could be fraction of total expression, or something about proportion of the dataset? This way it's more clear in the table you show to a collaborator.
I agree score
is a bit weird. Maybe statistic
is a better choice? @davidsebfischer could probably be more authoritative on this. And yeah, we should change those z-score
docs.
Performance
General question about performance. Is this faster than calling the previous function separately on each group, then concatenating the results?
I agree, statistic is better than score in my opinion. |
OK, sounds good. Done.
Sounds good, done.
Shall we change this in this function or in sc.tl.rank_genes_groups? I feel like renaming it here is not the best way.
I think so: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, sans a minor bug with adding the "pts"
(see suggestion).
Shall we change this in this function or in sc.tl.rank_genes_groups? I feel like renaming it here is not the best way.
I think this goes on the list of things to change in 2.0. Probably keep score
for now.
Co-authored-by: Isaac Virshup <ivirshup@gmail.com>
No description provided.