Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent avg_log2FC in FindMarkers outputs depending on the test used after SCTransform: workaround and correction suggestion #7095

Closed
lepriolc opened this issue Mar 31, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@lepriolc
Copy link

Hi Seurat dev team,

Thanks for the amazing work you have been doing to enable the community to perform single-cell data analysis.

I have noticed discrepancies in FindMarkers outputs regarding the average log2FC values in the different tests I did. Using the same input data and dealing with the same genes, I noticed that the average log2FC are different whether the test used in FindMarkers is "wilcox" or "negbinom". This problem has already been raised in different issues (#6701, #6654, #6976, #4892). Solutions have been proposed as in @caodudu's comment (#6701 (comment)). As suggested by @nathanhaigh (#6701 (comment)), I propose here a reproducible example to expose these discrepancies by reproducing SCTransform v2 vigenette: https://satijalab.org/seurat/articles/sctransform_v2_vignette.html

library(Seurat)
library(SeuratData)
library(patchwork)
library(dplyr)
library(ggplot2)

data(ifnb)
ifnb

# split the dataset into a list of two seurat objects (stim and CTRL)
ifnb.list <- SplitObject(ifnb, split.by="stim")
ctrl <- ifnb.list[["CTRL"]]
stim <- ifnb.list[["STIM"]]

# normalize and run dimensionality reduction on control dataset
ctrl <- SCTransform(ctrl, vst.flavor="v2", verbose=FALSE) %>%
    RunPCA(npcs=30, verbose=FALSE) %>%
    RunUMAP(reduction="pca", dims=1:30, verbose=FALSE) %>%
    FindNeighbors(reduction="pca", dims=1:30, verbose=FALSE) %>%
    FindClusters(resolution=0.7, verbose=FALSE)

p1 <- DimPlot(ctrl, label=TRUE, repel=TRUE) + ggtitle("Unsupervised clustering")
p2 <- DimPlot(ctrl, label=TRUE, repel=TRUE, group.by="seurat_annotations") + ggtitle("Annotated celltypes")
p1 | p2

# perform integration using Pearson residuals
stim <- SCTransform(stim, vst.flavor="v2", verbose=FALSE) %>%
    RunPCA(npcs=30, verbose=FALSE)
ifnb.list <- list(ctrl=ctrl, stim=stim)
features <- SelectIntegrationFeatures(object.list=ifnb.list, nfeatures=3000)
ifnb.list <- PrepSCTIntegration(object.list=ifnb.list, anchor.features=features)
immune.anchors <- FindIntegrationAnchors(object.list=ifnb.list, normalization.method="SCT", anchor.features=features)
immune.combined.sct <- IntegrateData(anchorset=immune.anchors, normalization.method="SCT")

# perform an integrated analysis
immune.combined.sct <- RunPCA(immune.combined.sct, verbose=FALSE)
immune.combined.sct <- RunUMAP(immune.combined.sct, reduction="pca", dims=1:30, verbose=FALSE)
immune.combined.sct <- FindNeighbors(immune.combined.sct, reduction="pca", dims=1:30)
immune.combined.sct <- FindClusters(immune.combined.sct, resolution=0.3)

DimPlot(immune.combined.sct, reduction="umap", split.by="stim")
p1 <- DimPlot(immune.combined.sct, reduction="umap", group.by="stim")
p2 <- DimPlot(immune.combined.sct, reduction="umap", group.by="seurat_clusters", label=TRUE, repel=TRUE)
p3 <- DimPlot(immune.combined.sct, reduction="umap", group.by="seurat_annotations", label=TRUE, repel=TRUE)
p1 | p2 | p3

# identify differentially expressed genes across conditions
immune.combined.sct$celltype.stim <- paste(immune.combined.sct$seurat_annotations, immune.combined.sct$stim, sep="_")
Idents(immune.combined.sct) <- "celltype.stim"
immune.combined.sct <- PrepSCTFindMarkers(immune.combined.sct)
b.interferon.response <- FindMarkers(immune.combined.sct, assay="SCT", ident.1="B_STIM", ident.2="B_CTRL", verbose=FALSE)
head(b.interferon.response, n=5)

A data.frame: 5 x 5 p_val avg_log2FC pct.1 pct.2 p_val_adj

ISG15 1.505693e-159 3.508888 0.998 0.229 2.003475e-155
IFIT3 4.128835e-154 2.568440 0.961 0.052 5.493827e-150
IFI6 2.479476e-153 2.460206 0.965 0.076 3.299190e-149
ISG20 9.385626e-152 2.561808 1.000 0.666 1.248851e-147
IFIT1 2.447118e-139 2.132814 0.904 0.029 3.256136e-135

## other tests
### LR
b.interferon.response.LR <- FindMarkers(immune.combined.sct, assay="SCT", ident.1="B_STIM", ident.2="B_CTRL", test.use="LR", verbose=FALSE)
head(b.interferon.response.LR, n=5)

A data.frame: 5 x 5 p_val avg_log2FC pct.1 pct.2 p_val_adj

ISG15 1.011450e-258 3.508888 0.998 0.229 1.345835e-254
ISG20 1.665382e-240 2.561808 1.000 0.666 2.215957e-236
IFIT3 1.078258e-230 2.568440 0.961 0.052 1.434731e-226
IFI6 2.781566e-228 2.460206 0.965 0.076 3.701151e-224
IFIT1 2.844733e-197 2.132814 0.904 0.029 3.785202e-193

### negbinom
b.interferon.response.NB <- FindMarkers(immune.combined.sct, assay="SCT", ident.1="B_STIM", ident.2="B_CTRL", test.use="negbinom", verbose=FALSE)
head(b.interferon.response.NB, n=5)

A data.frame: 5 x 5 p_val avg_log2FC pct.1 pct.2 p_val_adj

ISG15 0.000000e+00 103.311763 0.998 0.229 0.000000e+00
ISG20 0.000000e+00 41.286540 1.000 0.666 0.000000e+00
B2M 7.201407e-161 34.209536 1.000 1.000 9.582192e-157
MX1 3.615401e-132 9.806964 0.900 0.115 4.810653e-128
IFI6 4.913761e-132 13.616252 0.965 0.076 6.538251e-128

Identical average log2FC are obtained with wilcox and LR tests but different and unrealistic high values are obtained with negbinom test (see ISG15, ISG20 and IFI6 genes).

After exploring the source code of FindMarkers function (https://github.com/satijalab/seurat/blob/HEAD/R/differential_expression.R), if I understood well the code, the following commands are invoked:

  1. FindMarkers.Seurat function is called
  2. data.use is set to the SCT assay of Seurat object
  3. FindMarkers.SCTAssay function is called
  4. data.slot is set to "counts" if test.use is set to one of the following tests: negbinom, poisson, MAST, otherwise it is set to "data"
  5. mean.fxn is set to
function(x) {
     return(log(x = rowMeans(x = expm1(x = x)) + pseudocount.use, base = base))

   }
  1. the FoldChange function is called with the previously set data.slot and mean.fxn and returns the average log2FC values to fc.results
  2. the FindMarkers function is called with the previously set fc.results containing the average log2FC values which appear in FindMarkers outputs

The function set in mean.fxn to compute mean expression values is the correct function to use when the test uses the log-counts in the data slot, i.e. when FindMarkers test.use parameter is set to one of the following tests: wilcox, bimod, roc, t, LR, DESeq2, but is not correct when the test uses the counts in the counts slot, i.e. when FindMarkers test.use parameter is set to one of the following tests: negbinom, poisson, MAST.

Setting the FindMarkers mean.fxn parameter to the correct function enables to obtain average log2FC values with negbinom test identical to those obtained with wilcox or LR tests:

### negbinom with correct mean function
mean.fxn <- function(x, pseudocount.use=1, base=2) {
    return(log(x = rowMeans(x = x) + pseudocount.use, base = base))
  }
b.interferon.response.NB.mean.fxn <- FindMarkers(immune.combined.sct, assay="SCT", ident.1="B_STIM", ident.2="B_CTRL", test.use="negbinom", mean.fxn=mean.fxn, verbose=FALSE)
head(b.interferon.response.NB.mean.fxn, n=5)

A data.frame: 5 x 5 p_val avg_log2FC pct.1 pct.2 p_val_adj

ISG15 0.000000e+00 3.5088882 0.998 0.229 0.000000e+00
ISG20 0.000000e+00 2.5618077 1.000 0.666 0.000000e+00
B2M 7.201407e-161 0.5980104 1.000 1.000 9.582192e-157
MX1 3.615401e-132 1.9311875 0.900 0.115 4.810653e-128
IFI6 4.913761e-132 2.4602058 0.965 0.076 6.538251e-128

Thus, I suggest this correction in R/differential_expression.R:
Replace the following lines in FindMarkers.SCTAssay (lines 755 to 760):

if (is.null(x = mean.fxn)){
    mean.fxn <- function(x) {
      return(log(x = rowMeans(x = expm1(x = x)) + pseudocount.use, base = base))

    }
  }

by:

default.mean.fxn <- function(x) {
	return(log(x = rowMeans(x = expm1(x = x)) + pseudocount.use, base = base))
  }
  mean.fxn <- mean.fxn %||% switch(
    EXPR = slot,
    'counts' = function(x) {
        return(log(x = rowMeans(x = x) + pseudocount.use, base = base))
      },
    'scale.data' = rowMeans,
    default.mean.fxn
  )

The FoldChange.Assay function may also need to be corrected since it may generate incorrect average log2FC with SCTransform log-counts:

default.mean.fxn <- function(x) {
    return(log(x = rowMeans(x = x) + pseudocount.use, base = base))
  }
  mean.fxn <- mean.fxn %||% switch(
    EXPR = slot,
    'data' = switch(
      EXPR = norm.method %||% '',
      'LogNormalize' = function(x) {
        return(log(x = rowMeans(x = expm1(x = x)) + pseudocount.use, base = base))
      },
      default.mean.fxn
    ),
    'scale.data' = rowMeans,
    default.mean.fxn
  )

If slot is set to "data" and norm.method is not "LogNormalize", then default.mean.fxn is used, which is the correct function to use for counts values, but not for log-counts values.

Here is the output of sessionInfo():

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS/LAPACK: /opt/miniconda3/envs/r41/lib/libopenblasp-r0.3.17.so

locale:
 [1] LC_CTYPE=C.UTF-8    LC_NUMERIC=C        LC_TIME=C          
 [4] LC_COLLATE=C        LC_MONETARY=C       LC_MESSAGES=C      
 [7] LC_PAPER=C          LC_NAME=C           LC_ADDRESS=C       
[10] LC_TELEPHONE=C      LC_MEASUREMENT=C    LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.4.0         dplyr_1.0.7           patchwork_1.1.1      
[4] ifnb.SeuratData_3.0.0 SeuratData_0.2.1      SeuratObject_4.1.3   
[7] Seurat_4.3.0         

loaded via a namespace (and not attached):
  [1] uuid_0.1-4                  plyr_1.8.6                 
  [3] igraph_1.3.4                repr_1.1.3                 
  [5] lazyeval_0.2.2              sp_1.6-0                   
  [7] splines_4.1.0               listenv_0.8.0              
  [9] scattermore_0.7             GenomeInfoDb_1.28.4        
 [11] digest_0.6.27               htmltools_0.5.3            
 [13] fansi_0.5.0                 magrittr_2.0.1             
 [15] tensor_1.5                  cluster_2.1.2              
 [17] ROCR_1.0-11                 limma_3.48.3               
 [19] globals_0.14.0              matrixStats_0.61.0         
 [21] spatstat.sparse_3.0-0       colorspace_2.0-2           
 [23] rappdirs_0.3.3              ggrepel_0.9.1              
 [25] crayon_1.4.1                RCurl_1.98-1.5             
 [27] jsonlite_1.7.2              progressr_0.13.0           
 [29] spatstat.data_3.0-0         survival_3.2-13            
 [31] zoo_1.8-9                   glue_1.4.2                 
 [33] polyclip_1.10-0             gtable_0.3.0               
 [35] zlibbioc_1.40.0             XVector_0.32.0             
 [37] leiden_0.3.9                DelayedArray_0.18.0        
 [39] future.apply_1.8.1          BiocGenerics_0.38.0        
 [41] abind_1.4-5                 scales_1.2.1               
 [43] DBI_1.1.1                   spatstat.random_3.0-1      
 [45] miniUI_0.1.1.1              Rcpp_1.0.10                
 [47] viridisLite_0.4.0           xtable_1.8-4               
 [49] reticulate_1.22             stats4_4.1.0               
 [51] htmlwidgets_1.5.4           httr_1.4.2                 
 [53] RColorBrewer_1.1-2          ellipsis_0.3.2             
 [55] ica_1.0-2                   pkgconfig_2.0.3            
 [57] farver_2.1.0                uwot_0.1.14                
 [59] deldir_1.0-6                utf8_1.2.2                 
 [61] tidyselect_1.1.1            labeling_0.4.2             
 [63] rlang_1.0.6                 reshape2_1.4.4             
 [65] later_1.3.0                 munsell_0.5.0              
 [67] tools_4.1.0                 cli_3.4.1                  
 [69] generics_0.1.0              ggridges_0.5.3             
 [71] evaluate_0.14               stringr_1.4.0              
 [73] fastmap_1.1.0               goftest_1.2-3              
 [75] fitdistrplus_1.1-6          purrr_0.3.4                
 [77] RANN_2.6.1                  pbapply_1.5-0              
 [79] future_1.23.0               nlme_3.1-153               
 [81] sparseMatrixStats_1.4.2     mime_0.12                  
 [83] compiler_4.1.0              plotly_4.10.0              
 [85] png_0.1-7                   spatstat.utils_3.0-1       
 [87] tibble_3.1.8                glmGamPoi_1.6.0            
 [89] stringi_1.7.4               lattice_0.20-45            
 [91] IRdisplay_1.0               Matrix_1.5-3               
 [93] vctrs_0.5.1                 pillar_1.8.1               
 [95] lifecycle_1.0.3             spatstat.geom_3.0-5        
 [97] lmtest_0.9-39               RcppAnnoy_0.0.19           
 [99] data.table_1.14.2           cowplot_1.1.1              
[101] bitops_1.0-7                irlba_2.3.5                
[103] httpuv_1.6.4                GenomicRanges_1.44.0       
[105] R6_2.5.1                    promises_1.2.0.1           
[107] KernSmooth_2.23-20          gridExtra_2.3              
[109] IRanges_2.26.0              parallelly_1.29.0          
[111] codetools_0.2-18            MASS_7.3-54                
[113] assertthat_0.2.1            SummarizedExperiment_1.22.0
[115] withr_2.5.0                 sctransform_0.3.5          
[117] S4Vectors_0.30.1            GenomeInfoDbData_1.2.6     
[119] parallel_4.1.0              grid_4.1.0                 
[121] IRkernel_1.2                tidyr_1.1.4                
[123] DelayedMatrixStats_1.14.3   MatrixGenerics_1.4.3       
[125] Cairo_1.6-0                 Rtsne_0.15                 
[127] pbdZMQ_0.3-5                spatstat.explore_3.0-5     
[129] Biobase_2.52.0              shiny_1.7.1                
[131] base64enc_0.1-3            

I hope my post can help !

Christophe

@saketkc
Copy link
Collaborator

saketkc commented Jul 6, 2023

Thanks for your patience @lepriolc and for suggesting changes (and the PR!). I have pushed a fix here: 66e655a that should solve this. Please feel free to reopen if you still notice issues. You should be able to pull latest changes from the develop branch:

remotes::install_github("satijalab/seurat", ref="develop")

@saketkc saketkc closed this as completed Jul 6, 2023
@roi-meir
Copy link

roi-meir commented Sep 4, 2023

Hi Seurat dev team,

Thank you very much for the fix.
We also encountered this problem and we used the workaround of passing the right mean.fxn to the FindMarkers function. We don't want to start using the develop branch in all of our projects in the lab and passing the right mean.fxn every time is also error-prone.

Is there an expected release of a new version that includes the fix?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants