Random picking of cells from an object #243

Closed

kathirij opened this issue Dec 12, 2017 · 9 comments

@kathirij

Hello All,
I am trying to compare two clusters from two different populations from two different runs... So, I would like to merge the clusters together (using MergeSeurat) and then recluster them to find overlaps/distinctions between the clusters.

However, one of the clusters has ~10-fold more cells than the other. So, I am afraid that when I calculate variable genes, the cluster with more cells will be overrepresented. Is there a way to pick a set number of cells (but randomly) from the larger cluster so that I am comparing a similar number of cells? In other words, is there a way to randomly subsample my cells in an unsupervised manner?

@leonfodoulian
Contributor

leonfodoulian commented Dec 12, 2017

Hi,

I guess you can randomly sample cells from that cluster using sample() (from base R). You can then create a vector of cells comprising the sampled cells plus the remaining cells, subset your Seurat object using SubsetData(), and compute the variable genes on this new Seurat object. These genes can then be used for dimensional reduction on the original data including all cells.

# Object obj1 is the Seurat object with the higher number of cells
# Object obj2 is the second Seurat object with the lower number of cells

# Number of cells to sample, i.e. the number of cells in obj2
cells.to.sample <- length(obj2@cell.names)

# Sample from obj1 as many cells as there are cells in obj2
# For reproducibility, set a random seed
set.seed(111)
sampled.cells <- sample(x = obj1@cell.names, size = cells.to.sample, replace = FALSE)

# Create a vector of cells on which variable genes will be computed
cells.to.subset <- c(sampled.cells, obj2@cell.names)

# Merge your Seurat objects
obj.merged <- MergeSeurat(object1 = obj1, object2 = obj2)

# Subset Seurat object
obj.sub <- SubsetData(object = obj.merged, cells.use = cells.to.subset)

# Compute variable genes, etc.
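# For example (a minimal sketch assuming Seurat v2; the NormalizeData() calls are
# illustrative and may be redundant if MergeSeurat() already normalized the data):
obj.sub <- NormalizeData(object = obj.sub)
obj.sub <- FindVariableGenes(object = obj.sub, do.plot = FALSE)

# Use the resulting genes for dimensional reduction on the full merged object
obj.merged <- NormalizeData(object = obj.merged)
obj.merged <- ScaleData(object = obj.merged, genes.use = obj.sub@var.genes)
obj.merged <- RunPCA(object = obj.merged, pc.genes = obj.sub@var.genes)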

Hope that helps!

Best,
Leon

@del2007

del2007 commented Dec 13, 2017

Hi Leon,

Related question: can "SubsetData" not be used directly to randomly sample 1000 cells (let's say) from a larger object?

SubsetData(object, cells.use = NULL, subset.name = NULL, ident.use = NULL, max.cells.per.ident = 1000)

If this new subset is not randomly sampled, then on what criteria is it sampled?

Thanks!

@jbalberge

Hi
If ident.use = NULL, then Seurat falls back on your current object@ident (see Seurat::WhichCells, l. 6), so it won't necessarily pick the expected number of cells:

# 8 distinct identities in object@ident
length(unique(object@ident))
[1] 8
n.cells <- 10
subset <- SubsetData(object, max.cells.per.ident = n.cells, random.seed = NULL)
# 10 cells sampled per identity x 8 identities = 80 cells, not 10
length(subset@cell.names)
[1] 80

This can be misleading. I would rather use the sample() function directly:

set.seed(NULL)
n.cells <- 10
subset <- SubsetData(object, cells.use = sample(x = object@cell.names, size = n.cells))
length(subset@cell.names)
[1] 10

Best,
Jean-Baptiste

@leonfodoulian
Contributor

leonfodoulian commented Dec 13, 2017

Hi,

@del2007: The example you showed samples at random a maximum of 1000 cells from each cluster whose identity is stored in object@ident. So if you have already clustered your cells (let's suppose into 8 clusters), want to subset your dataset using the code you wrote, and every cluster contains at least 1000 cells, your final Seurat object will include 8000 cells. However, if you have not yet run FindClusters(), all your cells carry in the object@ident slot the information stored in object@meta.data$orig.ident. In that case, if all cells share the same orig.ident, only 1000 cells are sampled at random, independent of the clusters they will be assigned to after running FindClusters(). This is pretty much what Jean-Baptiste was pointing out.

So if you want to randomly sample 1000 cells, independent of the clusters to which those cells belong, you can simply provide a vector of cell names to the cells.use argument. Otherwise, if you'd like an (optimally) equal number of cells per cluster in your final, subsetted dataset, then what you proposed will do the job; see the sketch below.
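As a minimal sketch of the two options (assuming a Seurat v2 object named object that has already been clustered; the seed and the cell number are arbitrary):

# Option 1: 1000 cells at random, regardless of cluster identity
set.seed(42)
random.cells <- sample(x = object@cell.names, size = 1000, replace = FALSE)
obj.random <- SubsetData(object = object, cells.use = random.cells)

# Option 2: up to 1000 cells per identity class stored in object@ident
obj.per.ident <- SubsetData(object = object, max.cells.per.ident = 1000, random.seed = 42)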

Hope that helps!

Best,
Leon

@kathirij
Author

Hello All,
I appreciate the lively discussion and great suggestions. @leonfodoulian, I used your method and was able to do exactly what I wanted. Appreciate the detailed code you wrote.

@leonfodoulian
Contributor

leonfodoulian commented Dec 14, 2017

Great. Happy to hear that. However, for robustness, I would resample from obj1 several times using different seed values (which you can store for reproducibility), compute the variable genes at each step as described above, and then take either the union or the intersection of those variable gene sets. The final vector of variable genes can then be used for dimensional reduction.
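A minimal sketch of that resampling loop, reusing obj1, obj2 and obj.merged from the snippet above (the seed values are arbitrary, and obj.merged is assumed to be normalized already):

seeds <- c(111, 222, 333)
var.genes.list <- lapply(seeds, function(seed) {
  set.seed(seed)
  sampled.cells <- sample(x = obj1@cell.names,
                          size = length(obj2@cell.names),
                          replace = FALSE)
  obj.sub <- SubsetData(object = obj.merged,
                        cells.use = c(sampled.cells, obj2@cell.names))
  obj.sub <- FindVariableGenes(object = obj.sub, do.plot = FALSE)
  obj.sub@var.genes
})

# Union (more permissive) or intersection (more conservative) of the resampled gene sets
var.genes.union <- Reduce(union, var.genes.list)
var.genes.intersect <- Reduce(intersect, var.genes.list)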

@kathirij
Author

Yep! I did it three times and got very similar clustering at the end, each run using its own set of variable genes... But using a union of the variable genes might be even more robust. My analysis is helped by the fact that the larger cluster is very homogeneous, so randomly sampling ~1000 cells is still very representative.

In any case, appreciate your help.

@del2007

del2007 commented Dec 14, 2017 via email

@leonfodoulian
Contributor

Yes, it does sample randomly (using the sample() function from base R); you can check lines 714 to 716 in interaction.R. However, note that for reproducibility a random seed is set (in this case random.seed = 1). So if you repeat your subsetting several times with the same max.cells.per.ident, you will always end up with the same cells. You can, however, change the seed value and end up with a different dataset. Try doing that, and see for yourself whether the mean or the median remains the same.
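For example (a sketch assuming Seurat v2; the seed values and cell number are arbitrary):

# Same max.cells.per.ident, different seeds -> different random subsets
sub1 <- SubsetData(object = object, max.cells.per.ident = 1000, random.seed = 1)
sub2 <- SubsetData(object = object, max.cells.per.ident = 1000, random.seed = 42)

# The two subsets should share only part of their cells
length(intersect(sub1@cell.names, sub2@cell.names))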

However, to avoid cases where different orig.ident values are stored in the object@meta.data slot (which happened in my case), I suggest you create a new metadata column holding the same identity for all your cells, and then set the identity of all your cells to that value.

# Create a new column having the same 'value' stored for all cells
object@meta.data$noIdent <- "noIdent"

# Change the identity of your cells to the common noIdent value
object <- SetAllIdent(object, id = "noIdent")

For your last question, I suggest you read this bioRxiv paper. Of course, your case does not exactly match theirs, since they have ~1.3M cells and, therefore, a better chance of maximally enriching for rare cell types, and the tissues you're studying might be very different. But this is something you can test by first subsetting your data minimally (i.e. to a point where R doesn't crash but where you lose as few cells as possible), and then progressively decreasing the number of sampled cells to see whether the results remain consistent and are recapitulated with fewer cells.
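One possible way to run that check (a sketch with hypothetical sample sizes, using the overlap in variable genes with the full dataset as a rough consistency measure; it assumes a normalized Seurat v2 object named object whose variable genes have already been computed):

sizes <- c(20000, 10000, 5000, 2000)
overlap <- sapply(sizes, function(n) {
  set.seed(1)
  obj.down <- SubsetData(object = object,
                         cells.use = sample(x = object@cell.names, size = n))
  obj.down <- FindVariableGenes(object = obj.down, do.plot = FALSE)
  length(intersect(obj.down@var.genes, object@var.genes)) / length(object@var.genes)
})
names(overlap) <- sizes
overlap  # fraction of the full-data variable genes recovered at each sample size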

Best,
Leon
