How to perform subclustering and DE analysis on a subset of an integrated object #1883
Comments
@attal-kush I hope it's okay to piggyback off your question. I have been following the SCTransform integration tutorial, and it doesn't mention how to run FindClusters or identify cluster-specific markers. I simply used the FindNeighbors and FindClusters commands to create the 'seurat_clusters' column in the meta.data. Could you please let me know if the steps below are the correct way to identify clusters and markers? Steps
Are these the correct steps to follow? I just want to make sure the Seurat Team agrees with my workflow for identifying the cell clusters and conserved markers in the integrated SCTransform analysis. Thank you! |
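For reference, the steps being asked about can be sketched roughly as below. This is only one reading of the commenter's proposed workflow, not an official recommendation; `obj` (the integrated Seurat object), the dims, and the resolution are placeholders.

```r
library(Seurat)

# Cluster on the integrated assay (batch-corrected values)
DefaultAssay(obj) <- "integrated"
obj <- RunPCA(obj, verbose = FALSE)
obj <- FindNeighbors(obj, dims = 1:30)
obj <- FindClusters(obj, resolution = 0.8)  # fills obj$seurat_clusters
obj <- RunUMAP(obj, dims = 1:30)

# Find conserved markers on the uncorrected RNA assay
DefaultAssay(obj) <- "RNA"
obj <- NormalizeData(obj)
markers <- FindConservedMarkers(obj, ident.1 = 0, grouping.var = "orig.ident")
```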
Hello Seurat Team, Thank you for the wonderful package. I have similar questions to @attal-kush regarding reclustering a subset of an integrated object. The ideal workflow is not clear to me, and perusing the vignettes and past issues did not clarify it fully. Is this workflow indeed the best? Or should we run RunPCA directly on the integrated dataset?
|
@MediciPrime That looks correct to me, though I would also like to know the recommended way of doing this. |
Thank you for your reply @j-andrews7! |
Hi all, I'm also interested in this topic: what is the best way to subset and recluster data starting from an integrated dataset? After subsetting the clusters of interest (subsetting by ident), I have a Seurat object with RNA, SCT, and integrated assays, plus dimensional reductions (pca, tsne, umap) inherited from the original Seurat object. The integrated assay consists of 3000 features coming from the original integration analysis (so chosen from the whole dataset, not only from the cells of the subset). Thanks!! |
Hi all, My intuition would have been that when I focus on one cluster, I can start the whole process (including integration) as if it were a fresh dataset, especially given that the integration focused on genes that were highly variable in the whole dataset but may not be variable at all in the population I'm focusing on. But reading a few posts and issues here, that is apparently not the way to go, and I would like to understand why and to know how to do it properly. Thank you @satijalab for this amazing tool and the amazing tutorials!!!! |
a) My approach would be to just run FindClusters() with a higher resolution on the whole dataset until the desired subclustering is reached, then find the DEGs between two clusters with FindMarkers(ident.1=, ident.2=). b) Running FindVariableFeatures() and RunPCA() again on the integrated dataset does not seem helpful to me because the limited feature space of 3000 genes is not changed. The alternative would be to subset() the population of interest and run the complete preprocessing, including integration, on only those cells again. That makes it possible to change the feature space. But I'm also curious how others approach this! EDIT: See this paper also: |
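Option (a) above can be sketched in code as follows; the resolution value and cluster ids are placeholders, and running FindMarkers() on the RNA assay is one common choice in this thread rather than a settled rule.

```r
library(Seurat)

# (a) Re-cluster the whole integrated object at a higher resolution
DefaultAssay(obj) <- "integrated"
obj <- FindClusters(obj, resolution = 1.5)

# Then test two subclusters against each other on the uncorrected data
DefaultAssay(obj) <- "RNA"
obj <- NormalizeData(obj)
degs <- FindMarkers(obj, ident.1 = "3", ident.2 = "8")
```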
Hi all, I'm also interested in this issue and wonder what the best way is to subset and recluster data starting from an integrated dataset? |
Hi All, I am also stuck on this issue. I followed a similar approach to @amayer21, treating the data set as new after removing clusters/cells, which of course included re-calculating the variable genes (on the "RNA" assay) and re-integration. For the same reasons, I felt this was the most intuitive way. From my understanding, passing all genes to the features.to.integrate argument of IntegrateData() will give you a gene matrix of all genes adjusted by the integration, but the PCA and subsequent non-linear dimensionality reduction and clustering will still be calculated based on the 2000 features found by FindIntegrationAnchors() (unless otherwise stated), which change depending on the original data used, i.e. subsetted or whole. Similar to @amayer21, I am wondering what the best way to approach this is, and why treating a subsetted data set as new is not the correct way to run an integrated analysis pipeline? Thank you @satijalab for this amazing analysis package. |
Hi All, I know that we shouldn't rescale subsetted data from an integrated object, but is it possible to run RunUMAP on the subsetted data so I can at least get a plot? I have increased the resolution in FindClusters to analyze the integrated object and get my cluster of interest subclustered enough for DEG analysis, but I would simply like a new UMAP plot to visualize expression within that group of clusters. |
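If the goal is only a fresh UMAP of the subset, without re-scaling or a new PCA, one possibility is to re-run RunUMAP() on the PCA embedding inherited from the integrated object. A minimal sketch, where `obj`, the cluster idents, and the dims are placeholders:

```r
library(Seurat)

sub <- subset(obj, idents = c(3, 7))                 # clusters of interest
sub <- RunUMAP(sub, reduction = "pca", dims = 1:15)  # reuse the inherited PCA
DimPlot(sub, group.by = "seurat_clusters")
```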
Hello, It seems that a repeated possibility would be to change the
In any case, could this workflow (slightly modified from @attal-kush's) be accepted for subclustering from an integrated object?
|
@attal-kush Your questions are so comprehensive, and I am also curious whether there is a practical way to analyse the subsetted cells. Btw, regarding the DE analysis in your question 1, according to #1836 (comment), both the RNA and SCT assays can be used for DE analysis, if my understanding is correct. |
It makes no sense to me not to use the integrated assay in every downstream analysis; isn't the whole point of integration to remove batch effects? I tried this two ways, and reading this issue I only got more confused. |
Hi @attal-kush, |
which do you use for DE and heatmap? |
did the Seurat team ever respond to this..? |
I think the proper way is to subset before integration, as in Smillie et al. 2019, referred to by @tilofreiwald. Note that @timoast from the Seurat team recommended otherwise, although I have never seen an explanation of why that would not be the best way to go. While I did not test the above, I tested running Without
|
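The subset-before-integration route discussed here (as in Smillie et al. 2019) can be sketched as follows for an SCT workflow. Object and metadata names (`obj`, `batch`), the cluster ids, and the dims are placeholders, and whether this is preferable to re-using the existing integration is exactly the open question in this thread.

```r
library(Seurat)

sub <- subset(obj, idents = c(2, 5))  # population of interest
DefaultAssay(sub) <- "RNA"

# Redo normalization and integration on the subset only, so that variable
# features are chosen from these cells rather than from the whole dataset
sub.list <- SplitObject(sub, split.by = "batch")
sub.list <- lapply(sub.list, SCTransform)
features <- SelectIntegrationFeatures(sub.list, nfeatures = 3000)
sub.list <- PrepSCTIntegration(sub.list, anchor.features = features)
anchors  <- FindIntegrationAnchors(sub.list, normalization.method = "SCT",
                                   anchor.features = features)
sub <- IntegrateData(anchorset = anchors, normalization.method = "SCT")

# Standard downstream steps on the freshly integrated subset
sub <- RunPCA(sub, verbose = FALSE)
sub <- FindNeighbors(sub, dims = 1:20)
sub <- FindClusters(sub)
sub <- RunUMAP(sub, dims = 1:20)
```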
Hi @vertesy , |
Glad to find so many of you thinking about the same problem here; sad to realize there is indeed no practical guide on how to do this properly yet. |
I would like some help with this thread as well. Thank you @satijalab !!!! |
This issue may help you address your question. |
It would be nice if the Satija lab could give clearer instructions on how to proceed in cases of high versus low heterogeneity after subsetting. @satijalab, could you please help us? Thank you! |
Hi all, I wonder if anyone has found a definitive answer to this? I also have a similar question: I ran the SCTransform() workflow, then subset a cluster of interest. Thank you so much! |
@vertesy just came here to chime in after seeing your comment mate. I tried what you are suggesting, and I see no marked difference; in fact, I don't have the data to show right now because I have a lot on my plate currently, but subset > integrate > re-cluster is more laborious and less useful than integrate > subset > re-cluster. (By re-cluster I mean the entire subsetted dataset is treated as an independent body of cells and re-analyzed, similar to what you allude to. I think of this as if you FACS-sorted cells and were to analyze them independently of the other cell types.)
Everyone: I strongly suggest using the RNA assay for all DE. This is because the RNA slot is a true representative of biological variation; when someone tries to reproduce your findings, they won't perform a negative binomial regression on their PCR. Now I understand that batch variation is a pain in the a**, but honestly one has to assume this will occur naturally in a PCR as well. Honestly, I'm now very stringent about my definition of a DE gene, because minor gene fluctuations in scRNA-seq data are very unreliable and reside within the realm of false-positive dropouts. Nevertheless, I have seen that normalized RNA (log-normalized) is very reproducible in a PCR/bulk RNA-seq/rnaFISH experiment (if your DE gene fold change is >1.5x and the gene is expressed in at least 50% of cells). Unless a gene is expressed (in reads) at 1/p*, try to forget about it just like a bad day (p* being the relative mean gene expression taking into account cDNA library construction efficiency, which in the case of 10x is 15%, or 1/p* = 1/0.15 ≅ 7 reads/cell/gene). You can read more on the concept in Martin's paper.
As far as heterogeneity goes, if you keep sub-sampling until you reach 2 cells, you will find differences even between them. So there is really no simple answer, because heterogeneous populations should themselves be reproducibly present in multiple individuals in order to identify that cell type as an organism-specific cell type rather than a donor-specific transcriptional fluctuation in a particular cell population. Cheers all, I look forward to learning more on this when the devs respond. |
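The stringent criteria suggested above (DE on the log-normalized RNA assay, fold change > 1.5x, detected in at least 50% of cells) translate roughly into FindMarkers() arguments like these. The thresholds are the commenter's heuristics, not an official cutoff, and the cluster names are placeholders; note that logfc.threshold is on a log2 scale in Seurat >= 4.

```r
library(Seurat)

DefaultAssay(obj) <- "RNA"
obj <- NormalizeData(obj)
degs <- FindMarkers(obj,
                    ident.1 = "subcluster_A", ident.2 = "subcluster_B",
                    logfc.threshold = log2(1.5),  # FC > 1.5x
                    min.pct = 0.5)                # detected in >= 50% of cells
```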
I am also wondering if there is an official recommendation for this task. Replies here and in some other GitHub issues take slightly different approaches, but they all make general sense. Of all the possible solutions, I feel like performing the analysis as in @tilofreiwald's "option b" would be the best. That way, one would avoid the pitfall described in @Zha0rong's first scenario, because the sub-clustering would be driven by the variable features recalculated on the data subset. We would all appreciate it if @timoast or others from @satijalab could chime in. Many, many thanks for the great package and continued support! |
As the proof is in the pudding, I decided to try different approaches on my own data and share my findings here. I hope it is useful. The sample code is also provided at the end. My scenario is very similar to what @attal-kush described. In short:
I found that the first and second approaches lead to a nice integration, while the third and fourth lead to an uncorrected batch effect (see the image below). It would be great if someone could confirm or deny :) Subsetting
Approach 1
Approach 2
Approach 3
Approach 4
|
Hi Team Seurat,
Similar to issue #1547,
I integrated samples across multiple batch conditions and diets after performing SCTransform (according to your most recent vignette for integration with SCTransform - Compiled: 2019-07-16). Now, I have a Seurat object with 3 assays: RNA, SCT, and Integrated.
So I have a couple of questions regarding my workflow:
For downstream DE analysis, the scale.data slot in the SCT assay has disappeared after integration. Therefore, I assume I cannot use Pearson residuals for DE analysis. Is it possible and valid instead to use values from the "data" slot of the SCT assay (log-normalized corrected values) for the MAST test?
I want to subset a specific cell type (cluster) and examine subtypes within it. So, here is my workflow:
SCT_integrated <- IntegrateData(anchorset = SCT_Integrated.anchors, normalization.method = "SCT", features.to.integrate = rownames(SCT_Integrated))
SCT_integrated <- RunPCA(SCT_integrated)
SCT_integrated <- FindNeighbors(SCT_integrated, dims = 1:15)
SCT_integrated <- RunUMAP(SCT_integrated, dims = 1:15)
SCT_integrated <- FindClusters(SCT_integrated)
control_subset <- subset(SCT_integrated, subset = orig.ident == 'Chow')
DefaultAssay(control_subset) <- "RNA"
control_subset <- FindVariableFeatures(control_subset, selection.method = "vst", nfeatures = 3000)
DefaultAssay(control_subset) <- "integrated"
control_subset <- ScaleData(control_subset, verbose = FALSE, vars.to.regress = c( "percent.mt"))
control_subset <- RunPCA(control_subset, npcs = 30, verbose = FALSE)
control_subset <- FindNeighbors(control_subset, dims = 1:15)
control_subset <- RunUMAP(control_subset, dims = 1:15)
control_subset <- FindClusters(control_subset)
I want to know:
a. Is it valid to set features.to.integrate to all the genes in the original Seurat object if I want to run subclustering on the subset using its integrated assay?
b. Is it necessary to run FindVariableFeatures on the RNA assay of the subset and use the new variable features in PCA in order to properly cluster the subset? I am worried that the top variable features of the original Seurat object are not the same as the variable features of the new subset. Also, instead of changing the default assay to "RNA", finding the variable features, and changing the default assay back to "integrated", would it make more sense to just delete those lines of code and change:
control_subset <- RunPCA(control_subset, npcs = 30, verbose = FALSE) to
control_subset <- RunPCA(control_subset, npcs = 30, verbose = FALSE, features = VariableFeatures(control_subset))
c. Should FindVariableFeatures be run on the RNA assay, the integrated assay, or the SCT assay? (I assume that if I just need to delete the 3 lines of code mentioned above and change
control_subset <- RunPCA(control_subset, npcs = 30, verbose = FALSE) to
control_subset <- RunPCA(control_subset, npcs = 30, verbose = FALSE, features = VariableFeatures(control_subset)),
then the answer is to run it on the integrated assay).
d. Should ScaleData be run on the subset prior to PCA, even though the subset comes from an integrated object prepped with SCT? (I ask because the new integration vignette explicitly says not to run ScaleData after running the IntegrateData function.)
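One reading of questions (b)/(c), recomputing variable features on the subset's RNA assay and then running PCA on the integrated assay restricted to those features, can be sketched as below. This only illustrates what is being asked, not an answer; the nfeatures, npcs, and dims values mirror the workflow above, and the ScaleData step is included only because that workflow includes it.

```r
library(Seurat)

# Pick variable features from the subset's own (uncorrected) RNA assay
DefaultAssay(control_subset) <- "RNA"
control_subset <- FindVariableFeatures(control_subset, nfeatures = 3000)
new.features <- VariableFeatures(control_subset)

# Run PCA on the integrated assay, restricted to those features
DefaultAssay(control_subset) <- "integrated"
new.features <- intersect(new.features, rownames(control_subset))
control_subset <- ScaleData(control_subset, features = new.features)
control_subset <- RunPCA(control_subset, features = new.features, npcs = 30)
control_subset <- FindNeighbors(control_subset, dims = 1:15)
control_subset <- FindClusters(control_subset)
```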
I have a conceptual question about the batch-correction (integration) model developed by Seurat (the one from the most recent vignette for integration with SCTransform - Compiled: 2019-07-16). Does this batch-correction overfit the data so much that legitimate biological differences in the gene expression profiles of cells from different diets (HFD, LFD, Chow) are gone? Since the data I am analyzing comes from different diets as well as different batches, will batch-correction make me unable to determine differences in gene expression between cells from different diets? If so, would performing batch correction only on batches of the same diet and then merging all the diets together without batch correction be a valid method of retaining gene expression differences between diets but not batches?
If I decide that batch correction is not required for my samples, could I subset cells from my original Seurat object (after running quality control and clustering on it), set the assay to "RNA", and run the standard SCTransform pipeline? In other words, is this workflow valid:
SCT_not_integrated <- FindClusters(SCT_not_integrated)
control_subset <- subset(SCT_not_integrated, subset = orig.ident == 'Chow')
DefaultAssay(control_subset) <- "RNA"
control_subset <- SCTransform(control_subset, vars.to.regress = "percent.mt") %>% RunPCA() %>% FindNeighbors(dims = 1:15) %>% RunUMAP(dims = 1:15) %>% FindClusters()
In addition, since I am not integrating the subset, is it recommended to use the "scale.data" slot of the SCT assay for DE analysis, or to continue using the "data" slot of the SCT assay for this subset?
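For the non-integrated SCT subset above, DE on the SCT assay's "data" slot would look like the sketch below; the cluster names are placeholders, and whether "scale.data" (Pearson residuals) would be preferable is exactly the open question.

```r
library(Seurat)

# Assumes control_subset was processed with SCTransform() as above and
# cluster identities are set
DefaultAssay(control_subset) <- "SCT"
degs <- FindMarkers(control_subset,
                    ident.1 = "0", ident.2 = "1",
                    slot = "data")  # log1p-transformed corrected counts
head(degs)
```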
Thanks for all your help