-
Notifications
You must be signed in to change notification settings - Fork 602
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Calculate differential expression between user defined groups of clusters #397
Comments
Hi, You could just create a new
as there are only two groups the top-ranked genes for either groups will be the up-regulated genes in that group (and down-regulated in the other group) that are most differentially expressed between the groups. You should however note that |
Hi, Thank you so much for the prompt response. I was able to make the comparisons following your method. As you suggested, I am going to try using MAST or limma for DEG testing in the future. Thanks. |
Thank you so much for fielding so many of the issues, @LuckyMD! 😄 Can we elaborate a bit further on this one, though? For simple two-group comparisons, Can you point me to a reference that shows that a Wilcoxon-Rank-Sum test is less sensitive? How is this even a useful statement if you don't talk about the false positives you buy in? We should look at an AUC that scans different p-values, right? A bit more than a year ago, @tcallies and I had a full paper draft discussing AUCs for marker gene detection formulated as a classification problem, but we never finished it. In the general setting, it's not at all straightforward to make the evaluation a well-defined problem and other people will for sure have done a better job. Unfortunately, I have never fully caught up with the literature, I fear... |
And, I think, we'd want to start advertising diffxpy in addition to the usual suspects, shouldn't we? Any comments from your side, @davidsebfischer? |
@falexwolf I try to answer where I can. I should probably have clarified a bit above. I would argue that most real data DE tests benefit from accounting for technical covariates. For example, you should probably not perform batch correction on your data and then do a wilcoxon rank sum test, but instead take the normalized (and log transformed) data or the raw counts and include a batch covariate in the test. This also holds for technical covariates that describe the complexity of the data (such as size factors or n_genes). Often these factors are not sufficiently accounted for by simple normalization techniques (especially for plate-based data), and are thus included in the DE testing framework. This is done in MAST (and MAST performs better with this Also, according to the comparison paper you mention, there are not more false positives when using MAST or limma compared to t-tests or Wilcoxon rank sum tests. |
I agree with @LuckyMD about the points regarding covariates. With respect to two group comparisons without confounding, rank-sum tests have less statitical power than t-tests (https://stats.stackexchange.com/questions/130562/why-is-the-asymptotic-relative-efficiency-of-the-wilcoxon-test-3-pi-compared), disclaimer I haven't checked this proof, I think this is a standard statistics result though, this is also discussed here https://stats.stackexchange.com/questions/121852/how-to-choose-between-t-test-or-non-parametric-test-e-g-wilcoxon-in-small-sampl. I havent run simulations to check how big the influence of the difference in power is on the kind of data we encounter. However, as also pointed out by the second link, violations of the distributional assumptions for t-test impact these results and these violations will be major on scRNAseq. Intuitively I would therefore tend to rank-sum tests. With respect to diffxpy: We can account for other noise models in the two-group comparisons by performing model fitting, tutorial here. The bioarxiv will hopefully be up in the next few weeks. |
Also, I just found time to read the links you posted @davidsebfischer. Shouldn't we always use a welch t-test instead of a t-test in marker gene detection according to your second stackexchange link? They state that If you don't have a good reason to assume equal variances in the groups, then use the Welch correction... if we have a I think the default is currently |
Thank you for the whole in-depth discussion. It makes a lot of sense! 😄 To your question: Scanpy has used Welch's adaption of Student's t-test from the very beginning. Thank you! I guess it would be nice to have a single-cell tutorial in diffxpy that shows higher sensitivity by accounting for technical covariates. |
It's always a nice feeling when your understanding catches up to someone else's decision making a long time ago... ;) |
Any news when diffxpy will be up on biorxiv? The package looks super interesting and I'd like to learn more about the background to the methods you implemented. |
Now seurat performs DE analysis using alternative tests including MAST and DESeq2 in a convinent way, such as FindMarkers(pbmc, ident.1 = "CD14+ Mono", ident.2 = "FCGR3A+ Mono", test.use = "MAST"). So I hope that Scanpy could interated more methods too, such as diffxpy in this way: |
I agree with wangjiawen2013, Scanpy should add MAST and DESeq2 into sc.tl.rank_gene_groups too. |
This field is developing very fast, more and more advanced DE test methos are emerging, it's better to adopt these powerful methods. |
One of the shortcomings of scanpy's default DE testing is that p-values (or FDR) of a few genes are very significant (equal 0 or approximately 0 in some datasets), then it's impossible to execute -log transformation, even there is only one 0. The volcano plot will be not beautiful because of the high significance. |
@davidsebfischer is developing https://github.com/theislab/diffxpy, which is a great way to do differential expression in a scanpy workflow. We should document this! Do you think the p-values are inflated? If so, how could we propagate uncertainty to them? |
please refer to this: |
Thanks all for the interesting discussion- did a consensus emerge on the 'best' way to do differential expression testing in scanpy? |
We typically use |
Thank you @LuckyMD. Naive question, but what is the advantage of |
|
Hello,
I want to be able to use sc.tl.rank_genes_groups() to calculate differential expression between two groups of my choice. For example, if I have 16 clusters in my UMAP plot and I want to compare group 1 (all cells in clusters 1 to 8) to group 2 (all cells in clusters 9 to 16) how can I do this ?
Thank you in advance for any help.
The text was updated successfully, but these errors were encountered: