Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SCTransform not regressing out variables - Seurat v5.0.1 #8148

Closed
ChristopherStephens21 opened this issue Dec 6, 2023 · 7 comments
Closed
Assignees
Labels
bug Something isn't working

Comments

@ChristopherStephens21
Copy link

ChristopherStephens21 commented Dec 6, 2023

I am performing routine scRNAseq analysis on a single cell dataset using the scTransform framework. and have run into a weird issue. Briefly, running the code below exactly reproduces the results of the SCTransform vignette, with clear differences in the UMAP plot reflecting cell cycle or mitochondrial percentage regression (Here I regressed out mitochondrial gene percentage). However, when I run this script with my own data by substituting out the path name in line 1 with that for my own data, regressing out percent.mt or cell cycle score has no effect on the downstream UMAP or PCA plots. I'm not sure what to make of this, but my dataset is considerably larger, with around 9000 cells. I have also been able to reproduce these results with other datasets.

#Example 1: With Regression
pbmc_data <- Seurat::Read10X(data.dir = FilePath)
pbmc <- CreateSeuratObject(counts = pbmc_data)
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst")
s.genes <- cc.genes.updated.2019$s.genes #Cell cycle markers loaded from Seurat
s.genes <- sapply(s.genes, str_to_title)
g2m.genes <- cc.genes.updated.2019$g2m.genes #Separating into S and G2M markers
g2m.genes <- sapply(g2m.genes, str_to_title)
pbmc <- CellCycleScoring(pbmc, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE)
pbmc <- PercentageFeatureSet(pbmc, pattern = "^mt-", col.name = "percent.mt")
pbmc <- SCTransform(pbmc, vars.to.regress = c("percent.mt", "S.Score", "G2M.Score"), vst.flavor = "v2", method = "glmGamPoi", verbose = T)
pbmc <- RunPCA(pbmc, verbose = T)
pbmc <- RunUMAP(pbmc, dims = 1:30, verbose = T)
pbmc <- FindNeighbors(pbmc, dims = 1:30, verbose = T)
pbmc <- FindClusters(pbmc, verbose = T)
#Example 2: Without Regression
pbmc2 <- CreateSeuratObject(counts = pbmc_data)
pbmc2 <- NormalizeData(pbmc2)
pbmc2 <- FindVariableFeatures(pbmc2, selection.method = "vst")
pbmc2 <- CellCycleScoring(pbmc2, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE)
pbmc2 <- PercentageFeatureSet(pbmc2, pattern = "^mt-", col.name = "percent.mt")
pbmc2 <- SCTransform(pbmc2, vst.flavor = "v2", method = "glmGamPoi", verbose = T)
pbmc2 <- RunPCA(pbmc2, verbose = T)
pbmc2 <- RunUMAP(pbmc2, dims = 1:30, verbose = T)
pbmc2 <- FindNeighbors(pbmc2, dims = 1:30, verbose = T)
pbmc2 <- FindClusters(pbmc2, verbose = T)
Plot1 <- DimPlot(pbmc, label = TRUE) + ggtitle("With MT and CC Regression")
Plot2 <- DimPlot(pbmc2, label = TRUE) + ggtitle("Without MT and CC Regression")
Plot1 + Plot2
identical(pbmc@meta.data, pbmc2@meta.data)

pbmc_data <- Seurat::Read10X(data.dir = FilePath)
pbmc <- CreateSeuratObject(counts = pbmc_data)
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes
pbmc <- NormalizeData(pbmc)
Normalizing layer: counts
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst")
Finding variable features for layer counts
Calculating gene variances
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
s.genes <- cc.genes.updated.2019$s.genes #Cell cycle markers loaded from Seurat
s.genes <- sapply(s.genes, str_to_title)
g2m.genes <- cc.genes.updated.2019$g2m.genes #Separating into S and G2M markers
g2m.genes <- sapply(g2m.genes, str_to_title)
pbmc <- CellCycleScoring(pbmc, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE)
pbmc <- PercentageFeatureSet(pbmc, pattern = "^mt-", col.name = "percent.mt")
pbmc <- SCTransform(pbmc, vars.to.regress = c("percent.mt", "S.Score", "G2M.Score"), vst.flavor = "v2", method = "glmGamPoi", verbose = T)
Running SCTransform on assay: RNA
Running SCTransform on layer: counts
vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.
Variance stabilizing transformation of count matrix of size 22452 by 8924
Model formula is y ~ log_umi
Get Negative Binomial regression parameters per gene
Using 2000 genes, 5000 cells
Found 93 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 22452 genes
Computing corrected count matrix for 22452 genes
Calculating gene attributes
Wall clock passed: Time difference of 23.15094 secs
Determine variable features
Regressing out percent.mt, S.Score, G2M.Score
|=================================================================================================================================================================================================================================================================================| 100%
Centering data matrix
|=================================================================================================================================================================================================================================================================================| 100%
Getting residuals for block 1(of 2) for counts dataset
Getting residuals for block 2(of 2) for counts dataset
Centering data matrix
|=================================================================================================================================================================================================================================================================================| 100%
Finished calculating residuals for counts
Set default assay to SCT

pbmc <- RunPCA(pbmc, verbose = T)
PC_ 1
Positive: Col1a1, Col1a2, Col3a1, Tnfaip6, Dlk1, Cxcl1, Cxcl2, Meg3, Ccl2, Sparc
Itm2a, Tnc, Cdkn1c, Mfap4, Lum, Gm32647, Fstl1, Aspn, Dcn, Mest
Col6a1, Col5a2, Dlc1, Thbs1, Ebf1, Ccl7, Nrk, Igf2, Pcolce, Ccn1
Negative: Krt14, Krt15, Krt5, Krt1, Ly6d, Krt10, Lgals7, Cstdc5, Stfa3, Krt17
Krtdap, Perp, Dmkn, Stfa1, Sfn, Dsp, Srgn, Sox6, Grip1, Cpa3
Hdc, Cma1, Lgals3, Tpsb2, mt-Cytb, Calm4, mt-Atp6, Fcer1g, Bmpr1b, S100a14
PC_ 2
Positive: Krt14, Krt15, Krt1, Krt5, Ly6d, Krt10, Lgals7, Stfa3, Cstdc5, Krt17
Krtdap, Perp, Dmkn, Sfn, Dsp, Grip1, Fabp5, Sox6, Stfa1, Calm4
Cxcl14, S100a14, Bmpr1b, Fras1, Fgfbp1, Ptma, Tspear, Trp63, Dsc3, Serpinb5
Negative: Srgn, Fcer1g, Hdc, Cpa3, Cma1, Tpsb2, Serpinb1a, Tyrobp, Plek, Slc6a4
Csf2rb, Cxcl2, Gata2, Fxyd5, Mcpt4, C1qb, Calca, Slco2b1, Mrc1, Cd14
Cyp11a1, Pf4, C1qc, Slc18a2, C1qa, F13a1, Ccl7, Otulinl, Ccl4, Rac2
PC_ 3
Positive: C1qb, Pf4, Mrc1, C1qc, C1qa, F13a1, Cxcl2, Cd14, Ccl12, Cd86
Ccl4, C5ar1, Il1b, Csf1r, Apoe, Ccl7, Ccl3, Slfn2, Lyz2, Fcrls
Lyve1, Selenop, Aoah, Ptprj, Ifi207, Nlrp3, Cd83, Hmox1, Ms4a6c, Ms4a7
Negative: Cpa3, Cma1, Hdc, Tpsb2, Serpinb1a, Slc6a4, Gata2, Srgn, Csf2rb, Calca
Mcpt4, Cyp11a1, Slc18a2, Hs6st2, Rab27b, Itk, Slco2b1, Hs3st1, Fdx1, Il13
Otulinl, St6galnac3, Il4, Il1rl1, Fxyd5, Ctla2a, Mrgprb1, Csf2, Edil3, Adora3
PC_ 4
Positive: Krt1, Krt10, Ly6d, Stfa3, Cstdc5, Krtdap, Col1a1, Col1a2, Stfa1, Lgals7
Tnfaip6, Krt15, Dlk1, Perp, Fabp5, Dmkn, Lgals3, Krt14, Col3a1, Cxcl2
Krt5, Npl, Sfn, Itm2a, Cxcl1, Mfap4, Dsp, Dsg1a, Calm4, Krt16
Negative: Dct, Pmel, Ptgds, Tyr, Cpne4, Trps1, Pax3, Chsy3, Fmn1, Mlana
Pde3a, Ednrb, Col23a1, Rarb, Robo2, Acta2, Ebf1, Aff3, Rgs5, Fstl4
Synpr, Ptprj, Prex2, Sobp, Ctnna3, Myh11, Myl9, Syt4, Tagln, Mcoln3
PC_ 5
Positive: Dlk1, Dct, Pmel, Ptgds, Tyr, Itm2a, Mest, Cpne4, Chsy3, Mfap4
Pax3, Nrk, Cdkn1c, Clec3b, Aspn, Postn, Mfap5, Fmn1, Acta2, Tmeff2
Mlana, Igfbp7, Dpysl3, Col14a1, Ednrb, Krt1, Rgs5, Igf2, Krt10, Myh11
Negative: Gm32647, Robo2, Trps1, Col3a1, Col1a1, Egfl6, Cntn4, Cxcl2, Col1a2, Lum
Wif1, Slit3, Twist2, Ccl2, Grem2, Tmem132c, Tnc, Col23a1, Zfp536, Lsamp
Twist1, Cped1, Rspo3, Lgals1, St6galnac3, Cntn5, Itih5, Pth1r, Hecw2, Crabp1
pbmc <- RunUMAP(pbmc, dims = 1:30, verbose = T)
16:41:19 UMAP embedding parameters a = 0.9922 b = 1.112
Found more than one class "dist" in cache; using the first, from namespace 'spam'
Also defined by ‘BiocGenerics’
16:41:19 Read 8924 rows and found 30 numeric columns
16:41:19 Using Annoy for neighbor search, n_neighbors = 30
Found more than one class "dist" in cache; using the first, from namespace 'spam'
Also defined by ‘BiocGenerics’
16:41:19 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:41:19 Writing NN index file to temp file /var/folders/r7/psn_5h9s3ql986wjn3b1sxj90yv6n8/T//RtmpQyNO0H/file147d22f35981e
16:41:19 Searching Annoy index using 1 thread, search_k = 3000
16:41:21 Annoy recall = 100%
16:41:21 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
16:41:22 Initializing from normalized Laplacian + noise (using RSpectra)
16:41:22 Commencing optimization for 500 epochs, with 351934 positive edges
Using method 'umap'
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:41:30 Optimization finished
pbmc <- FindNeighbors(pbmc, dims = 1:30, verbose = T)
Computing nearest neighbor graph
Computing SNN
pbmc <- FindClusters(pbmc, verbose = T)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 8924
Number of edges: 274391

Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9218
Number of communities: 25
Elapsed time: 0 seconds

pbmc2 <- CreateSeuratObject(counts = pbmc_data)
pbmc2 <- NormalizeData(pbmc2)
Normalizing layer: counts
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
pbmc2 <- FindVariableFeatures(pbmc2, selection.method = "vst")
Finding variable features for layer counts
Calculating gene variances
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
pbmc2 <- CellCycleScoring(pbmc2, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE)
pbmc2 <- PercentageFeatureSet(pbmc2, pattern = "^MT-", col.name = "percent.mt")
pbmc2 <- SCTransform(pbmc2, vst.flavor = "v2", method = "glmGamPoi", verbose = T)
Running SCTransform on assay: RNA
Running SCTransform on layer: counts
vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.
Variance stabilizing transformation of count matrix of size 22452 by 8924
Model formula is y ~ log_umi
Get Negative Binomial regression parameters per gene
Using 2000 genes, 5000 cells
Found 93 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 22452 genes
Computing corrected count matrix for 22452 genes
Calculating gene attributes
Wall clock passed: Time difference of 22.24379 secs
Determine variable features
Centering data matrix
|=================================================================================================================================================================================================================================================================================| 100%
Getting residuals for block 1(of 2) for counts dataset
Getting residuals for block 2(of 2) for counts dataset
Centering data matrix
|=================================================================================================================================================================================================================================================================================| 100%
Finished calculating residuals for counts
Set default assay to SCT

pbmc2 <- RunPCA(pbmc2, verbose = T)
PC_ 1
Positive: Col1a1, Col1a2, Col3a1, Tnfaip6, Dlk1, Cxcl1, Cxcl2, Meg3, Ccl2, Sparc
Itm2a, Tnc, Cdkn1c, Mfap4, Lum, Gm32647, Fstl1, Aspn, Dcn, Mest
Col6a1, Col5a2, Dlc1, Thbs1, Ebf1, Ccl7, Nrk, Igf2, Pcolce, Ccn1
Negative: Krt14, Krt15, Krt5, Krt1, Ly6d, Krt10, Lgals7, Cstdc5, Stfa3, Krt17
Krtdap, Perp, Dmkn, Stfa1, Sfn, Dsp, Srgn, Sox6, Grip1, Cpa3
Hdc, Cma1, Lgals3, Tpsb2, mt-Cytb, Calm4, mt-Atp6, Fcer1g, Bmpr1b, S100a14
PC_ 2
Positive: Krt14, Krt15, Krt1, Krt5, Ly6d, Krt10, Lgals7, Stfa3, Cstdc5, Krt17
Krtdap, Perp, Dmkn, Sfn, Dsp, Grip1, Fabp5, Sox6, Stfa1, Calm4
Cxcl14, S100a14, Bmpr1b, Fras1, Fgfbp1, Ptma, Tspear, Trp63, Dsc3, Serpinb5
Negative: Srgn, Fcer1g, Hdc, Cpa3, Cma1, Tpsb2, Serpinb1a, Tyrobp, Plek, Slc6a4
Csf2rb, Cxcl2, Gata2, Fxyd5, Mcpt4, C1qb, Calca, Slco2b1, Mrc1, Cd14
Cyp11a1, Pf4, C1qc, Slc18a2, C1qa, F13a1, Ccl7, Otulinl, Ccl4, Rac2
PC_ 3
Positive: C1qb, Pf4, Mrc1, C1qc, C1qa, F13a1, Cxcl2, Cd14, Ccl12, Cd86
Ccl4, C5ar1, Il1b, Csf1r, Apoe, Ccl7, Ccl3, Slfn2, Lyz2, Fcrls
Lyve1, Selenop, Aoah, Ptprj, Ifi207, Nlrp3, Cd83, Hmox1, Ms4a6c, Ms4a7
Negative: Cpa3, Cma1, Hdc, Tpsb2, Serpinb1a, Slc6a4, Gata2, Srgn, Csf2rb, Calca
Mcpt4, Cyp11a1, Slc18a2, Hs6st2, Rab27b, Itk, Slco2b1, Hs3st1, Fdx1, Il13
Otulinl, St6galnac3, Il4, Il1rl1, Fxyd5, Ctla2a, Mrgprb1, Csf2, Edil3, Adora3
PC_ 4
Positive: Krt1, Krt10, Ly6d, Stfa3, Cstdc5, Krtdap, Col1a1, Col1a2, Stfa1, Lgals7
Tnfaip6, Krt15, Dlk1, Perp, Fabp5, Dmkn, Lgals3, Krt14, Col3a1, Cxcl2
Krt5, Npl, Sfn, Itm2a, Cxcl1, Mfap4, Dsp, Dsg1a, Calm4, Krt16
Negative: Dct, Pmel, Ptgds, Tyr, Cpne4, Trps1, Pax3, Chsy3, Fmn1, Mlana
Pde3a, Ednrb, Col23a1, Rarb, Robo2, Acta2, Ebf1, Aff3, Rgs5, Fstl4
Synpr, Ptprj, Prex2, Sobp, Ctnna3, Myh11, Myl9, Syt4, Tagln, Mcoln3
PC_ 5
Positive: Dlk1, Dct, Pmel, Ptgds, Tyr, Itm2a, Mest, Cpne4, Chsy3, Mfap4
Pax3, Nrk, Cdkn1c, Clec3b, Aspn, Postn, Mfap5, Fmn1, Acta2, Tmeff2
Mlana, Igfbp7, Dpysl3, Col14a1, Ednrb, Krt1, Rgs5, Igf2, Krt10, Myh11
Negative: Gm32647, Robo2, Trps1, Col3a1, Col1a1, Egfl6, Cntn4, Cxcl2, Col1a2, Lum
Wif1, Slit3, Twist2, Ccl2, Grem2, Tmem132c, Tnc, Col23a1, Zfp536, Lsamp
Twist1, Cped1, Rspo3, Lgals1, St6galnac3, Cntn5, Itih5, Pth1r, Hecw2, Crabp1
pbmc2 <- RunUMAP(pbmc2, dims = 1:30, verbose = T)
16:42:27 UMAP embedding parameters a = 0.9922 b = 1.112
Found more than one class "dist" in cache; using the first, from namespace 'spam'
Also defined by ‘BiocGenerics’
16:42:27 Read 8924 rows and found 30 numeric columns
16:42:27 Using Annoy for neighbor search, n_neighbors = 30
Found more than one class "dist" in cache; using the first, from namespace 'spam'
Also defined by ‘BiocGenerics’
16:42:27 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:42:27 Writing NN index file to temp file /var/folders/r7/psn_5h9s3ql986wjn3b1sxj90yv6n8/T//RtmpQyNO0H/file147d26795f1c8
16:42:27 Searching Annoy index using 1 thread, search_k = 3000
16:42:29 Annoy recall = 100%
16:42:29 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
16:42:30 Initializing from normalized Laplacian + noise (using RSpectra)
16:42:30 Commencing optimization for 500 epochs, with 351934 positive edges
Using method 'umap'
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:42:38 Optimization finished
pbmc2 <- FindNeighbors(pbmc2, dims = 1:30, verbose = T)
Computing nearest neighbor graph
Computing SNN
pbmc2 <- FindClusters(pbmc2, verbose = T)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 8924
Number of edges: 274391

Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9218
Number of communities: 25
Elapsed time: 0 seconds

Plot1 <- DimPlot(pbmc, label = TRUE) + ggtitle("With MT and CC Regression")
Plot2 <- DimPlot(pbmc2, label = TRUE) + ggtitle("Without MT and CC Regression")
Plot1 + Plot2
identical(pbmc@meta.data, pbmc2@meta.data)
[1] TRUE

# insert reproducible example here

image

@ChristopherStephens21 ChristopherStephens21 added the bug Something isn't working label Dec 6, 2023
@ChristopherStephens21
Copy link
Author

I am running the analysis on a Mac (MacOS Ventura 13.3.1). Code for cell cycle regression and mitochondrial gene percentage was changed from the pbmc code to reflect differences in gene name formatting for my mouse datasets, but no other changes were made between running it on the pbmc data and my own.

@ChristopherStephens21 ChristopherStephens21 changed the title SCTransform not regressing out variables - Seurat v5.1.0 SCTransform not regressing out variables - Seurat v5.0.1 Dec 6, 2023
@saketkc
Copy link
Collaborator

saketkc commented Dec 8, 2023

Thanks for catching this @ChristopherStephens21, I will push a fix.

@saketkc saketkc self-assigned this Dec 8, 2023
@saketkc
Copy link
Collaborator

saketkc commented Dec 19, 2023

I have pushed a fix here 64a6495
You can install the develop branch to test:

remotes::install_github("sataijalab/seurat", ref="develop")

Hope this helps!

@saketkc saketkc closed this as completed Dec 19, 2023
@Greysun109
Copy link

I was having similar observations to the original post. However, after installing the developer version it appears that it no longer is able to grab the values from the meta.data as I get the following message.

pbmc <- SCTransform(pbmc, vars.to.regress = c("percent.mt"), vst.flavor = "v2", method = "glmGamPoi", verbose = T)
Running SCTransform on assay: RNA
Running SCTransform on layer: counts
vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.
Variance stabilizing transformation of count matrix of size 19037 by 9723
Model formula is y ~ log_umi
Get Negative Binomial regression parameters per gene
Using 2000 genes, 5000 cells
Found 233 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 19037 genes
Computing corrected count matrix for 19037 genes
Calculating gene attributes
Wall clock passed: Time difference of 49.41705 secs
Determine variable features
Regressing out percent.mt
|============================================================================================================| 100%
Centering data matrix
|============================================================================================================| 100%
Getting residuals for block 1(of 2) for counts dataset
Getting residuals for block 2(of 2) for counts dataset
Error: None of the requested variables to regress are present in the object.

@Bildungsluecke
Copy link

Hi, I'm having the same issue as @Greysun109 Regression of mitochondrial gene content by SCTransform using vars.to.regress="percent.mito" leads to the same UMAP as without regression.

Unfortunately, installing develop branch as suggested by @saketkc did not resolve the issue.

I'm also getting the same error message as @Greysun109 (Error: None of the requested variables to regress are present in the object.) although the variable is included in the metadata slot of the seurat object.

I'm on macOS Ventura 13.6.1

@CroixJeremy2
Copy link

CroixJeremy2 commented Jan 29, 2024

Hello, I am having the same issue as @Bildungsluecke . I have installed your version using

remotes::install_github("sataijalab/seurat", ref="develop")

but I have this error at the end of SCTransform() calculations:

Getting residuals for block 1(of 2) for counts dataset
Getting residuals for block 2(of 2) for counts dataset
Error: None of the requested variables to regress are present in the object.

Before your version, I had the same problem as @ChristopherStephens21 , same UMAP from regressed out samples running SCTransform()

Here is my sessionInfo() if it can be useful:

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.7.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] C

time zone: Europe/Paris
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DoubletFinder_2.0.3 dplyr_1.1.4         MetaDE_2.2.3       
 [4] glmGamPoi_1.14.0    presto_1.0.0        data.table_1.14.10 
 [7] Rcpp_1.0.12         scCustomize_2.0.1   ggplot2_3.4.4      
[10] Seurat_5.0.1.9003   SeuratObject_5.0.1  sp_2.1-2           

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22            splines_4.3.1              
  [3] later_1.3.2                 bitops_1.0-7               
  [5] tibble_3.2.1                polyclip_1.10-6            
  [7] janitor_2.2.0               fastDummies_1.7.3          
  [9] lifecycle_1.0.4             edgeR_4.0.12               
 [11] hdf5r_1.3.9                 globals_0.16.2             
 [13] lattice_0.22-5              MASS_7.3-60.0.1            
 [15] magrittr_2.0.3              openxlsx_4.2.5.2           
 [17] limma_3.58.1                plotly_4.10.4              
 [19] httpuv_1.6.14               sctransform_0.4.1          
 [21] spam_2.10-0                 zip_2.3.1                  
 [23] spatstat.sparse_3.0-3       reticulate_1.34.0          
 [25] cowplot_1.1.3               pbapply_1.7-2              
 [27] RColorBrewer_1.1-3          lubridate_1.9.3            
 [29] abind_1.4-5                 zlibbioc_1.48.0            
 [31] Rtsne_0.17                  GenomicRanges_1.54.1       
 [33] purrr_1.0.2                 BiocGenerics_0.48.1        
 [35] RCurl_1.98-1.14             circlize_0.4.16            
 [37] GenomeInfoDbData_1.2.11     IRanges_2.36.0             
 [39] S4Vectors_0.40.2            ggrepel_0.9.5              
 [41] irlba_2.3.5.1               listenv_0.9.1              
 [43] spatstat.utils_3.0-4        goftest_1.2-3              
 [45] RSpectra_0.16-1             spatstat.random_3.2-2      
 [47] fitdistrplus_1.1-11         parallelly_1.36.0          
 [49] DelayedMatrixStats_1.24.0   leiden_0.4.3.1             
 [51] codetools_0.2-19            DelayedArray_0.28.0        
 [53] tidyselect_1.2.0            shape_1.4.6                
 [55] farver_2.1.1                matrixStats_1.2.0          
 [57] stats4_4.3.1                spatstat.explore_3.2-5     
 [59] jsonlite_1.8.8              ellipsis_0.3.2             
 [61] progressr_0.14.0            ggridges_0.5.6             
 [63] survival_3.5-7              tools_4.3.1                
 [65] ica_1.0-3                   glue_1.7.0                 
 [67] gridExtra_2.3               SparseArray_1.2.3          
 [69] DESeq2_1.42.0               MatrixGenerics_1.14.0      
 [71] GenomeInfoDb_1.38.5         withr_3.0.0                
 [73] combinat_0.0-8              fastmap_1.1.1              
 [75] fansi_1.0.6                 digest_0.6.34              
 [77] timechange_0.3.0            R6_2.5.1                   
 [79] mime_0.12                   ggprism_1.0.4              
 [81] samr_3.0                    colorspace_2.1-0           
 [83] scattermore_1.2             tensor_1.5                 
 [85] spatstat.data_3.0-4         utf8_1.2.4                 
 [87] tidyr_1.3.1                 generics_0.1.3             
 [89] httr_1.4.7                  htmlwidgets_1.6.4          
 [91] S4Arrays_1.2.0              GSA_1.03.2                 
 [93] uwot_0.1.16                 pkgconfig_2.0.3            
 [95] gtable_0.3.4                lmtest_0.9-40              
 [97] impute_1.76.0               XVector_0.42.0             
 [99] htmltools_0.5.7             dotCall64_1.1-1            
[101] scales_1.3.0                Biobase_2.62.0             
[103] png_0.1-8                   snakecase_0.11.1           
[105] rstudioapi_0.15.0           reshape2_1.4.4             
[107] nlme_3.1-164                zoo_1.8-12                 
[109] GlobalOptions_0.1.2         stringr_1.5.1              
[111] KernSmooth_2.23-22          parallel_4.3.1             
[113] miniUI_0.1.1.1              vipor_0.4.7                
[115] ggrastr_1.0.2               pillar_1.9.0               
[117] grid_4.3.1                  vctrs_0.6.5                
[119] RANN_2.6.1                  promises_1.2.1             
[121] shinyFiles_0.9.3            xtable_1.8-4               
[123] cluster_2.1.6               beeswarm_0.4.0             
[125] paletteer_1.6.0             locfit_1.5-9.8             
[127] cli_3.6.2                   compiler_4.3.1             
[129] rlang_1.1.3                 crayon_1.5.2               
[131] future.apply_1.11.1         labeling_0.4.3             
[133] rematch2_2.1.2              plyr_1.8.9                 
[135] forcats_1.0.0               fs_1.6.3                   
[137] ggbeeswarm_0.7.2            stringi_1.8.3              
[139] BiocParallel_1.36.0         viridisLite_0.4.2          
[141] deldir_2.0-2                munsell_0.5.0              
[143] lazyeval_0.2.2              spatstat.geom_3.2-8        
[145] Matrix_1.6-5                RcppHNSW_0.5.0             
[147] patchwork_1.2.0             sparseMatrixStats_1.14.0   
[149] bit64_4.0.5                 future_1.33.1              
[151] statmod_1.5.0               shiny_1.8.0                
[153] SummarizedExperiment_1.32.0 ROCR_1.0-11                
[155] igraph_1.6.0                bit_4.0.5             

EDIT (I have found clues about how to solve the problem!):

It is because in the function SCTransform(), the default option ncell = 5000 split my dataset in two at some point because with verbose = TRUE, I can see:

Getting residuals for block 1(of 2) for counts dataset
Getting residuals for block 2(of 2) for counts dataset

Indeed, my dataset is composed of 5640 cells, thus using the default option ncell = 5000 might produce some weird calculations maybe? Therefore I have tried to use ncell = 5640, in addition to vars.to.regress = "mtRNA.percent", and it works fine (I have a different UMAP indicating that regressing out mitochondrial RNA worked). Plus, I don't have the two following lines:

Getting residuals for block 1(of 2) for counts dataset
Getting residuals for block 2(of 2) for counts dataset

indicating that the calculations are at some point splitted in two I guess.

Remark 1: I have updated back my Seurat package to use the current version of the function (not the one you sent us with remotes::install_github("sataijalab/seurat", ref="develop") ), and the problem can be solved using ncell = your_number_of_cells_in_your_dataset

Remark 2: It seems that without any regression, ncell can be set to less than 5000 (I tried 5640/2), and the output is OK. Therefore, there is a problem with ncell only when using vars.to.regress. I hope it will help troubleshooting the function.

In any case, @saketkc , I think you should update the SCTransform function about this ncell option; either by setting it by default to the number of cells in the sample, or by updating the calculations that I know nothing about. Plus, you should talk about ncell in your tutorial vignette, because I didn't see it anywhere. Otherwise, many people will have the same issue...

Thanks again for your amazing work,
Best regards,

@kew24
Copy link

kew24 commented Feb 14, 2024

To add to this – I was seeing the same original issue of vars.to.regress not changing anything with Seurat_5.0.0. Can confirm that (for whatever reason) adding ncells = ncol(seurat_object) seemed to do the trick, as now my UMAP coordinates & clusters are different when I use different variables for vars.to.regress (thanks, @CroixJeremy2!).

Never installed from develop so I can't speak into the Error: None of the requested variables to regress are present in the object error. I've been using Seurat_5.0.0 this whole time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants