New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split rank_genes_groups #1081
Split rank_genes_groups #1081
Conversation
Thank you! This looks great! What do you think, @ivirshup? Now, @Koncopd, can you also add a And actually change the value? Finally, can you test this on the Scanpy default tutorial and a a test based on the numerical values in addition to the image based tests? https://github.com/theislab/scanpy/blob/7e058a1a6a082e34a101d65fc7ac5e9cb6563220/scanpy/tests/notebooks/test_pbmc3k.py#L109-L111 Thank you very much! Otherwise, this is good to be merged, IMO. |
@falexwolf , i'll do. |
Guys, now that you're dissecting rank_genes_groups, do you mind adding additional info to the results which is the fraction of cells expressing the genes (similar to what we have in dotplots) People calculate it manually ever time and it can be painful for those who are not familiar with pandas. |
@gokceneraslan could you point me to code or an example? |
Test fails seem unrelated to this PR. |
From Seurat tutorial: pct.1 and pct.2 are the ones I mentioned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This really needed doing, thanks for diving into it!
@Koncopd, how clean would you like to get this? I've noted a few initial things which I think would make this a bit more simple, but I'd like to hear your goals here. Are you interested in doing more with the differential expression tooling?
scanpy/tools/_rank_genes_groups.py
Outdated
if corr_method == 'benjamini-hochberg': | ||
_, pvals_adj, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh') | ||
elif corr_method == 'bonferroni': | ||
pvals_adj = np.minimum(pvals * n_genes, 1.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of doing this in the testing function (i.e. _t_test
, _wilcoxon
, etc.), could this be done afterwards since it's the same regardless of test type?
This would also let us remove the corr_method
argument from these functions.
scanpy/tools/_rank_genes_groups.py
Outdated
foldchanges = (expm1_func(mean_group) + 1e-9) / ( | ||
expm1_func(mean_rest) + 1e-9 | ||
) # add small value to remove 0's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this change for each testing function? If not, could it be separated?
scanpy/tools/_rank_genes_groups.py
Outdated
scores_sort = np.abs(scores) if rankby_abs else scores | ||
global_indices = _select_top_n(scores_sort, n_genes_user, n_genes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, can these be split out from these functions?
scanpy/tools/_rank_genes_groups.py
Outdated
|
||
if issparse(X): | ||
merge = lambda tpl: vstack(tpl).todense() | ||
adapt = lambda X: X.todense() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we avoid using np.matrix
. Could these be X.toarray()
?
@ivirshup thanks for catching these things, i'll fix them. |
@gokceneraslan i'll add these things. |
) # add small value to remove 0's | ||
if 'logfoldchanges' not in d: | ||
d['logfoldchanges'] = {} | ||
d['logfoldchanges'][group_name] = np.log2(foldchanges[global_indices]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for catching this. I'll check why it happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sorry for misunderstanding. Error I posted is from stable scanpy, not from the PR. I was asking if it's fixed now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the problem with sc.get
. logreg only has names
and scores
keys, but sc.get
tries to query nonexistent 'logfoldchanges' and so on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should i add logfoldchanges
for logreg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to interrupt the PR, but I think logfoldchange and percentages should be available for every method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the percentage for the reference group be also calculated and stored?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. It might look unnecessary for one-vs-rest kind of tests but for clusterA-vs-clusterB kind of tests it'd be useful.
This is still very messy and very inefficient with this separate calculation of pts. I should work on it more to restructure completely... |
Splits rank_genes_groups into helper functions.
Related to 1), 2) of this.
I don't see any point in using dataframe internally (the second point), dict is more convenient here.