Inconsistent avg_log2FC in FindMarkers outputs depending on the test used after SCTransform: workaround and correction suggestion #7095

lepriolc · 2023-03-31T15:10:23Z

Hi Seurat dev team,

Thanks for the amazing work you have been doing to enable the community to perform single-cell data analysis.

I have noticed discrepancies in FindMarkers outputs regarding the average log2FC values in the different tests I did. Using the same input data and dealing with the same genes, I noticed that the average log2FC are different whether the test used in FindMarkers is "wilcox" or "negbinom". This problem has already been raised in different issues (#6701, #6654, #6976, #4892). Solutions have been proposed as in @caodudu's comment (#6701 (comment)). As suggested by @nathanhaigh (#6701 (comment)), I propose here a reproducible example to expose these discrepancies by reproducing SCTransform v2 vigenette: https://satijalab.org/seurat/articles/sctransform_v2_vignette.html

library(Seurat)
library(SeuratData)
library(patchwork)
library(dplyr)
library(ggplot2)

data(ifnb)
ifnb

# split the dataset into a list of two seurat objects (stim and CTRL)
ifnb.list <- SplitObject(ifnb, split.by="stim")
ctrl <- ifnb.list[["CTRL"]]
stim <- ifnb.list[["STIM"]]

# normalize and run dimensionality reduction on control dataset
ctrl <- SCTransform(ctrl, vst.flavor="v2", verbose=FALSE) %>%
    RunPCA(npcs=30, verbose=FALSE) %>%
    RunUMAP(reduction="pca", dims=1:30, verbose=FALSE) %>%
    FindNeighbors(reduction="pca", dims=1:30, verbose=FALSE) %>%
    FindClusters(resolution=0.7, verbose=FALSE)

p1 <- DimPlot(ctrl, label=TRUE, repel=TRUE) + ggtitle("Unsupervised clustering")
p2 <- DimPlot(ctrl, label=TRUE, repel=TRUE, group.by="seurat_annotations") + ggtitle("Annotated celltypes")
p1 | p2

# perform integration using Pearson residuals
stim <- SCTransform(stim, vst.flavor="v2", verbose=FALSE) %>%
    RunPCA(npcs=30, verbose=FALSE)
ifnb.list <- list(ctrl=ctrl, stim=stim)
features <- SelectIntegrationFeatures(object.list=ifnb.list, nfeatures=3000)
ifnb.list <- PrepSCTIntegration(object.list=ifnb.list, anchor.features=features)
immune.anchors <- FindIntegrationAnchors(object.list=ifnb.list, normalization.method="SCT", anchor.features=features)
immune.combined.sct <- IntegrateData(anchorset=immune.anchors, normalization.method="SCT")

# perform an integrated analysis
immune.combined.sct <- RunPCA(immune.combined.sct, verbose=FALSE)
immune.combined.sct <- RunUMAP(immune.combined.sct, reduction="pca", dims=1:30, verbose=FALSE)
immune.combined.sct <- FindNeighbors(immune.combined.sct, reduction="pca", dims=1:30)
immune.combined.sct <- FindClusters(immune.combined.sct, resolution=0.3)

DimPlot(immune.combined.sct, reduction="umap", split.by="stim")
p1 <- DimPlot(immune.combined.sct, reduction="umap", group.by="stim")
p2 <- DimPlot(immune.combined.sct, reduction="umap", group.by="seurat_clusters", label=TRUE, repel=TRUE)
p3 <- DimPlot(immune.combined.sct, reduction="umap", group.by="seurat_annotations", label=TRUE, repel=TRUE)
p1 | p2 | p3

# identify differentially expressed genes across conditions
immune.combined.sct$celltype.stim <- paste(immune.combined.sct$seurat_annotations, immune.combined.sct$stim, sep="_")
Idents(immune.combined.sct) <- "celltype.stim"
immune.combined.sct <- PrepSCTFindMarkers(immune.combined.sct)
b.interferon.response <- FindMarkers(immune.combined.sct, assay="SCT", ident.1="B_STIM", ident.2="B_CTRL", verbose=FALSE)
head(b.interferon.response, n=5)

A data.frame: 5 x 5 p_val avg_log2FC pct.1 pct.2 p_val_adj

ISG15 1.505693e-159 3.508888 0.998 0.229 2.003475e-155
IFIT3 4.128835e-154 2.568440 0.961 0.052 5.493827e-150
IFI6 2.479476e-153 2.460206 0.965 0.076 3.299190e-149
ISG20 9.385626e-152 2.561808 1.000 0.666 1.248851e-147
IFIT1 2.447118e-139 2.132814 0.904 0.029 3.256136e-135

## other tests
### LR
b.interferon.response.LR <- FindMarkers(immune.combined.sct, assay="SCT", ident.1="B_STIM", ident.2="B_CTRL", test.use="LR", verbose=FALSE)
head(b.interferon.response.LR, n=5)

A data.frame: 5 x 5 p_val avg_log2FC pct.1 pct.2 p_val_adj

ISG15 1.011450e-258 3.508888 0.998 0.229 1.345835e-254
ISG20 1.665382e-240 2.561808 1.000 0.666 2.215957e-236
IFIT3 1.078258e-230 2.568440 0.961 0.052 1.434731e-226
IFI6 2.781566e-228 2.460206 0.965 0.076 3.701151e-224
IFIT1 2.844733e-197 2.132814 0.904 0.029 3.785202e-193

### negbinom
b.interferon.response.NB <- FindMarkers(immune.combined.sct, assay="SCT", ident.1="B_STIM", ident.2="B_CTRL", test.use="negbinom", verbose=FALSE)
head(b.interferon.response.NB, n=5)

A data.frame: 5 x 5 p_val avg_log2FC pct.1 pct.2 p_val_adj

ISG15 0.000000e+00 103.311763 0.998 0.229 0.000000e+00
ISG20 0.000000e+00 41.286540 1.000 0.666 0.000000e+00
B2M 7.201407e-161 34.209536 1.000 1.000 9.582192e-157
MX1 3.615401e-132 9.806964 0.900 0.115 4.810653e-128
IFI6 4.913761e-132 13.616252 0.965 0.076 6.538251e-128

Identical average log2FC are obtained with wilcox and LR tests but different and unrealistic high values are obtained with negbinom test (see ISG15, ISG20 and IFI6 genes).

After exploring the source code of FindMarkers function (https://github.com/satijalab/seurat/blob/HEAD/R/differential_expression.R), if I understood well the code, the following commands are invoked:

FindMarkers.Seurat function is called
data.use is set to the SCT assay of Seurat object
FindMarkers.SCTAssay function is called
data.slot is set to "counts" if test.use is set to one of the following tests: negbinom, poisson, MAST, otherwise it is set to "data"
mean.fxn is set to

function(x) {
     return(log(x = rowMeans(x = expm1(x = x)) + pseudocount.use, base = base))

   }

the FoldChange function is called with the previously set data.slot and mean.fxn and returns the average log2FC values to fc.results
the FindMarkers function is called with the previously set fc.results containing the average log2FC values which appear in FindMarkers outputs

The function set in mean.fxn to compute mean expression values is the correct function to use when the test uses the log-counts in the data slot, i.e. when FindMarkers test.use parameter is set to one of the following tests: wilcox, bimod, roc, t, LR, DESeq2, but is not correct when the test uses the counts in the counts slot, i.e. when FindMarkers test.use parameter is set to one of the following tests: negbinom, poisson, MAST.

Setting the FindMarkers mean.fxn parameter to the correct function enables to obtain average log2FC values with negbinom test identical to those obtained with wilcox or LR tests:

### negbinom with correct mean function
mean.fxn <- function(x, pseudocount.use=1, base=2) {
    return(log(x = rowMeans(x = x) + pseudocount.use, base = base))
  }
b.interferon.response.NB.mean.fxn <- FindMarkers(immune.combined.sct, assay="SCT", ident.1="B_STIM", ident.2="B_CTRL", test.use="negbinom", mean.fxn=mean.fxn, verbose=FALSE)
head(b.interferon.response.NB.mean.fxn, n=5)

A data.frame: 5 x 5 p_val avg_log2FC pct.1 pct.2 p_val_adj

ISG15 0.000000e+00 3.5088882 0.998 0.229 0.000000e+00
ISG20 0.000000e+00 2.5618077 1.000 0.666 0.000000e+00
B2M 7.201407e-161 0.5980104 1.000 1.000 9.582192e-157
MX1 3.615401e-132 1.9311875 0.900 0.115 4.810653e-128
IFI6 4.913761e-132 2.4602058 0.965 0.076 6.538251e-128

Thus, I suggest this correction in R/differential_expression.R:
Replace the following lines in FindMarkers.SCTAssay (lines 755 to 760):

if (is.null(x = mean.fxn)){
    mean.fxn <- function(x) {
      return(log(x = rowMeans(x = expm1(x = x)) + pseudocount.use, base = base))

    }
  }

by:

default.mean.fxn <- function(x) {
	return(log(x = rowMeans(x = expm1(x = x)) + pseudocount.use, base = base))
  }
  mean.fxn <- mean.fxn %||% switch(
    EXPR = slot,
    'counts' = function(x) {
        return(log(x = rowMeans(x = x) + pseudocount.use, base = base))
      },
    'scale.data' = rowMeans,
    default.mean.fxn
  )

The FoldChange.Assay function may also need to be corrected since it may generate incorrect average log2FC with SCTransform log-counts:

default.mean.fxn <- function(x) {
    return(log(x = rowMeans(x = x) + pseudocount.use, base = base))
  }
  mean.fxn <- mean.fxn %||% switch(
    EXPR = slot,
    'data' = switch(
      EXPR = norm.method %||% '',
      'LogNormalize' = function(x) {
        return(log(x = rowMeans(x = expm1(x = x)) + pseudocount.use, base = base))
      },
      default.mean.fxn
    ),
    'scale.data' = rowMeans,
    default.mean.fxn
  )

If slot is set to "data" and norm.method is not "LogNormalize", then default.mean.fxn is used, which is the correct function to use for counts values, but not for log-counts values.

Here is the output of sessionInfo():

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS/LAPACK: /opt/miniconda3/envs/r41/lib/libopenblasp-r0.3.17.so

locale:
 [1] LC_CTYPE=C.UTF-8    LC_NUMERIC=C        LC_TIME=C          
 [4] LC_COLLATE=C        LC_MONETARY=C       LC_MESSAGES=C      
 [7] LC_PAPER=C          LC_NAME=C           LC_ADDRESS=C       
[10] LC_TELEPHONE=C      LC_MEASUREMENT=C    LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.4.0         dplyr_1.0.7           patchwork_1.1.1      
[4] ifnb.SeuratData_3.0.0 SeuratData_0.2.1      SeuratObject_4.1.3   
[7] Seurat_4.3.0         

loaded via a namespace (and not attached):
  [1] uuid_0.1-4                  plyr_1.8.6                 
  [3] igraph_1.3.4                repr_1.1.3                 
  [5] lazyeval_0.2.2              sp_1.6-0                   
  [7] splines_4.1.0               listenv_0.8.0              
  [9] scattermore_0.7             GenomeInfoDb_1.28.4        
 [11] digest_0.6.27               htmltools_0.5.3            
 [13] fansi_0.5.0                 magrittr_2.0.1             
 [15] tensor_1.5                  cluster_2.1.2              
 [17] ROCR_1.0-11                 limma_3.48.3               
 [19] globals_0.14.0              matrixStats_0.61.0         
 [21] spatstat.sparse_3.0-0       colorspace_2.0-2           
 [23] rappdirs_0.3.3              ggrepel_0.9.1              
 [25] crayon_1.4.1                RCurl_1.98-1.5             
 [27] jsonlite_1.7.2              progressr_0.13.0           
 [29] spatstat.data_3.0-0         survival_3.2-13            
 [31] zoo_1.8-9                   glue_1.4.2                 
 [33] polyclip_1.10-0             gtable_0.3.0               
 [35] zlibbioc_1.40.0             XVector_0.32.0             
 [37] leiden_0.3.9                DelayedArray_0.18.0        
 [39] future.apply_1.8.1          BiocGenerics_0.38.0        
 [41] abind_1.4-5                 scales_1.2.1               
 [43] DBI_1.1.1                   spatstat.random_3.0-1      
 [45] miniUI_0.1.1.1              Rcpp_1.0.10                
 [47] viridisLite_0.4.0           xtable_1.8-4               
 [49] reticulate_1.22             stats4_4.1.0               
 [51] htmlwidgets_1.5.4           httr_1.4.2                 
 [53] RColorBrewer_1.1-2          ellipsis_0.3.2             
 [55] ica_1.0-2                   pkgconfig_2.0.3            
 [57] farver_2.1.0                uwot_0.1.14                
 [59] deldir_1.0-6                utf8_1.2.2                 
 [61] tidyselect_1.1.1            labeling_0.4.2             
 [63] rlang_1.0.6                 reshape2_1.4.4             
 [65] later_1.3.0                 munsell_0.5.0              
 [67] tools_4.1.0                 cli_3.4.1                  
 [69] generics_0.1.0              ggridges_0.5.3             
 [71] evaluate_0.14               stringr_1.4.0              
 [73] fastmap_1.1.0               goftest_1.2-3              
 [75] fitdistrplus_1.1-6          purrr_0.3.4                
 [77] RANN_2.6.1                  pbapply_1.5-0              
 [79] future_1.23.0               nlme_3.1-153               
 [81] sparseMatrixStats_1.4.2     mime_0.12                  
 [83] compiler_4.1.0              plotly_4.10.0              
 [85] png_0.1-7                   spatstat.utils_3.0-1       
 [87] tibble_3.1.8                glmGamPoi_1.6.0            
 [89] stringi_1.7.4               lattice_0.20-45            
 [91] IRdisplay_1.0               Matrix_1.5-3               
 [93] vctrs_0.5.1                 pillar_1.8.1               
 [95] lifecycle_1.0.3             spatstat.geom_3.0-5        
 [97] lmtest_0.9-39               RcppAnnoy_0.0.19           
 [99] data.table_1.14.2           cowplot_1.1.1              
[101] bitops_1.0-7                irlba_2.3.5                
[103] httpuv_1.6.4                GenomicRanges_1.44.0       
[105] R6_2.5.1                    promises_1.2.0.1           
[107] KernSmooth_2.23-20          gridExtra_2.3              
[109] IRanges_2.26.0              parallelly_1.29.0          
[111] codetools_0.2-18            MASS_7.3-54                
[113] assertthat_0.2.1            SummarizedExperiment_1.22.0
[115] withr_2.5.0                 sctransform_0.3.5          
[117] S4Vectors_0.30.1            GenomeInfoDbData_1.2.6     
[119] parallel_4.1.0              grid_4.1.0                 
[121] IRkernel_1.2                tidyr_1.1.4                
[123] DelayedMatrixStats_1.14.3   MatrixGenerics_1.4.3       
[125] Cairo_1.6-0                 Rtsne_0.15                 
[127] pbdZMQ_0.3-5                spatstat.explore_3.0-5     
[129] Biobase_2.52.0              shiny_1.7.1                
[131] base64enc_0.1-3

I hope my post can help !

Christophe

The text was updated successfully, but these errors were encountered:

saketkc · 2023-07-06T21:22:41Z

Thanks for your patience @lepriolc and for suggesting changes (and the PR!). I have pushed a fix here: 66e655a that should solve this. Please feel free to reopen if you still notice issues. You should be able to pull latest changes from the develop branch:

remotes::install_github("satijalab/seurat", ref="develop")

roi-meir · 2023-09-04T06:12:48Z

Hi Seurat dev team,

Thank you very much for the fix.
We also encountered this problem and we used the workaround of passing the right mean.fxn to the FindMarkers function. We don't want to start using the develop branch in all of our projects in the lab and passing the right mean.fxn every time is also error-prone.

Is there an expected release of a new version that includes the fix?

lepriolc added the bug Something isn't working label Mar 31, 2023

This was referenced Apr 3, 2023

Fix mean.fxn setting in FindMarkers.SCTAssay function #7115

Closed

FindMarkers with MAST using SCT assay, erroneous result with slots = data ? #7110

Closed

mean.fxn Decision Not Explained #7107

Closed

saketkc closed this as completed Jul 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent avg_log2FC in FindMarkers outputs depending on the test used after SCTransform: workaround and correction suggestion #7095

Inconsistent avg_log2FC in FindMarkers outputs depending on the test used after SCTransform: workaround and correction suggestion #7095

lepriolc commented Mar 31, 2023

saketkc commented Jul 6, 2023 •

edited

Loading

roi-meir commented Sep 4, 2023

Inconsistent avg_log2FC in FindMarkers outputs depending on the test used after SCTransform: workaround and correction suggestion #7095

Inconsistent avg_log2FC in FindMarkers outputs depending on the test used after SCTransform: workaround and correction suggestion #7095

Comments

lepriolc commented Mar 31, 2023

saketkc commented Jul 6, 2023 • edited Loading

roi-meir commented Sep 4, 2023

saketkc commented Jul 6, 2023 •

edited

Loading