Subsetting integrated data #3465

Closed
fisherj-2212 opened this issue Sep 2, 2020 · 3 comments

@fisherj-2212

I have integrated data, computed using the standard workflow (not SCTransform). I want to subset the data for sub-clustering, using an iterative hierarchical clustering approach. I understand from the discussion I've been able to find that it's not recommended to re-scale the subsetted integrated assay. The alternatives I've seen are to use the RNA assay, or to use the scaled data from the original object as computed before subsetting.

The issue is that my RNA assay is too strongly affected by batch effects to use, and using the original scaled matrix seems strange for hierarchical clustering. I compute correlation distance on the scaled data to get the input for hierarchical clustering. Using genes scaled relative to a different set of cells seems likely to distort the correlation computation.
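
The scaling concern can be illustrated with a small base R sketch. Everything here is simulated and assumed for illustration only (the matrix sizes, the gene-specific shift `delta`, and the helper names `scale_genes` and `cor_dist` are not from the thread). It z-scores genes either across all cells or within the subset only, then compares the resulting cell-to-cell correlation distances.

```r
# Toy illustration only: simulated data, not the poster's dataset.
set.seed(1)

n_genes <- 50
delta <- rnorm(n_genes, sd = 3)   # assumed gene-specific shift of the subgroup

# 40 "other" cells plus 20 cells in the subgroup of interest (genes x cells).
other    <- matrix(rnorm(n_genes * 40), nrow = n_genes)
subgroup <- matrix(rnorm(n_genes * 20), nrow = n_genes) + delta
expr     <- cbind(other, subgroup)
sub_idx  <- 41:60

# Z-score each gene (row) across the cells (columns) of `m`.
scale_genes <- function(m) t(scale(t(m)))

global_scaled <- scale_genes(expr)[, sub_idx]   # scaled against all cells
local_scaled  <- scale_genes(expr[, sub_idx])   # rescaled within the subset

# Correlation distance between cells (columns).
cor_dist <- function(m) as.dist(1 - cor(m))

d_global <- cor_dist(global_scaled)
d_local  <- cor_dist(local_scaled)

# Under global scaling every subgroup cell carries the same gene-wise offset
# from the global mean, so the cells correlate strongly with one another and
# their pairwise distances are compressed.
print(mean(d_global))
print(mean(d_local))
```

In this toy setting the globally scaled distances are dominated by the profile that all subgroup cells share, which is one plausible reason a dendrogram built on them would show little internal structure.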

I've tried proceeding with a rescaled subset, which gives clusters that look sensible in the embedding and have clear DE genes (first dendrogram). Proceeding without rescaling, by contrast, gives a dendrogram that suggests a lack of well-defined subclusters and an overall failure to identify distinctions, even though we're confident the subgroup contains notable heterogeneity (second dendrogram). I worry that the globally scaled data doesn't show enough subgroup-specific contrast. What is the motivation behind discouraging scaling subsets of the integrated assay, and are there situations where it might be acceptable?

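[Image: first dendrogram, from the rescaled subset]
[Image: second dendrogram, from the subset without rescaling]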

@timoast
Collaborator

timoast commented Sep 4, 2020

> I understand from the discussion I've been able to find that it's not recommended to re-scale the subsetted integrated assay

What discussion are you referring to? I don't see any reason why you shouldn't rescale after subsetting and, as you point out, rescaling would generally be preferred.

timoast added the more-information-needed label Sep 4, 2020
@fisherj-2212
Author

Someone states here that it is not supported to rescale a subset of the integrated assay in Seurat v3. I am using v3.
#1547

Someone mentions here not to rescale a subset of the integrated assay (though they are talking about the SCTransform method).
#1883

In this case, I notice the poster does not rescale their subset before re-clustering.
#2340

Here they discourage running FindVariableFeatures() on a subset of the integrated assay and recommend switching to the RNA assay, and someone mentions that it is "still matter of debate" whether to work with a subset of the integrated assay.
#1528

Reading these has left me uneasy about the way I'm handling my sub-clustering approach. I'm looking for confirmation on whether there's a strong technical reason to discourage running ScaleData() after subsetting the integrated assay, at least for the standard v3 integration method: https://satijalab.org/seurat/v3.1/integration.html.

Perhaps I'm just getting confused between best practice for SCTransform vs the standard approach.

no-response bot removed the more-information-needed label Sep 7, 2020
@timoast
Collaborator

timoast commented Sep 7, 2020

> Someone states here that it is not supported to rescale a subset of the integrated assay in Seurat v3. I am using v3.
> #1547

I read that issue but couldn't see where anyone said not to rescale. I made a comment that you shouldn't repeat the integration using a subset of cells, which is a separate issue.

> Perhaps I'm just getting confused between best practice for SCTransform vs the standard approach.

When using SCTransform, you can't run ScaleData after integration because the integrated data is stored in the scale.data slot, so re-running ScaleData would overwrite the integration results. I suspect this is the source of the confusion around this issue.

To be clear: you can run ScaleData on a subset of the integrated assay when using log-normalized data, but not when using SCTransform-normalized data.
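
As a rough sketch of what this could look like for the log-normalized case, the hypothetical helper below subsets, rescales, and then builds a correlation-distance dendrogram as described in the original post. The function name `subcluster_hclust`, the `sct_integrated` flag, the assay name "integrated", and the choice of average linkage are all assumptions made for illustration, not something stated in the thread.

```r
library(Seurat)

# Hypothetical helper. `obj` is assumed to be a Seurat v3 object whose
# "integrated" assay was built from log-normalized data (not SCTransform).
subcluster_hclust <- function(obj, idents, sct_integrated = FALSE) {
  if (sct_integrated) {
    # With SCTransform the integrated values live in scale.data, so
    # ScaleData() would overwrite them.
    stop("Do not re-run ScaleData() on an SCTransform-integrated assay.")
  }

  sub <- subset(obj, idents = idents)
  DefaultAssay(sub) <- "integrated"

  # Rescale genes using only the cells in the subset.
  sub <- ScaleData(sub, verbose = FALSE)

  # Genes x cells matrix of rescaled values.
  scaled <- GetAssayData(sub, assay = "integrated", slot = "scale.data")

  # Correlation distance between cells, then hierarchical clustering.
  # Average linkage is an arbitrary choice for this sketch.
  d <- as.dist(1 - cor(as.matrix(scaled)))
  hclust(d, method = "average")
}
```

A call might look like `hc <- subcluster_hclust(integrated_obj, idents = "3")` followed by `plot(hc)`, where `integrated_obj` and the identity label "3" are placeholders for your own object and cluster.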
