Random picking of cells from an object #243

Closed

kathirij opened this issue Dec 12, 2017 · 9 comments

@kathirij

Hello All,
I am trying to compare two clusters from two different populations from two different runs... So, I would like to merge the clusters together (using MergeSeurat) and then recluster them to find overlaps/distinctions between the clusters.

However, one of the clusters has ~10-fold more cells than the other. So, I am afraid that when I calculate variable genes, the cluster with more cells will be overrepresented. Is there a way to pick a set number of cells (but randomly) from the larger cluster so that I am comparing a similar number of cells? In other words, is there a way to randomly subsample my cells in an unsupervised manner?

@leonfodoulian
Contributor

leonfodoulian commented Dec 12, 2017

Hi,

I guess you can randomly sample cells from that cluster using sample() (from base R). You can then create a vector of cells comprising the sampled cells plus the remaining cells, subset your Seurat object using SubsetData(), and compute the variable genes on this new Seurat object. These genes can then be used for dimensional reduction on the original data including all cells.

# Object obj1 is the Seurat object with the higher number of cells
# Object obj2 is the second Seurat object with the lower number of cells

# Number of cells to sample, i.e. the number of cells in obj2
cells.to.sample <- length(obj2@cell.names)

# Sample from obj1 as many cells as there are cells in obj2
# For reproducibility, set a random seed
set.seed(111)
sampled.cells <- sample(x = obj1@cell.names, size = cells.to.sample, replace = FALSE)

# Create a vector of cells on which variable genes will be computed
cells.to.subset <- c(sampled.cells, obj2@cell.names)

# Merge your Seurat objects
obj.merged <- MergeSeurat(object1 = obj1, object2 = obj2)

# Subset Seurat object
obj.sub <- SubsetData(object = obj.merged, cells.use = cells.to.subset)

# Compute variable genes, etc.
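# For example (a minimal sketch assuming Seurat v2; the NormalizeData() calls are
# illustrative and may be redundant if MergeSeurat() already normalized the data):
obj.sub <- NormalizeData(object = obj.sub)
obj.sub <- FindVariableGenes(object = obj.sub, do.plot = FALSE)

# Use the resulting genes for dimensional reduction on the full merged object
obj.merged <- NormalizeData(object = obj.merged)
obj.merged <- ScaleData(object = obj.merged, genes.use = obj.sub@var.genes)
obj.merged <- RunPCA(object = obj.merged, pc.genes = obj.sub@var.genes)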

Hope that helps!

Best,
Leon

@del2007

del2007 commented Dec 13, 2017

Hi Leon,

Related question: can "SubsetData" not be used directly to randomly sample 1000 cells (let's say) from a larger object?

SubsetData(object, cells.use = NULL, subset.name = NULL, ident.use = NULL, max.cells.per.ident = 1000)

If this new subset is not randomly sampled, then on what criteria is it sampled?

Thanks!

@jbalberge

Hi
If ident.use = NULL, then Seurat falls back on your current object@ident (see Seurat::WhichCells, l. 6), so it won't necessarily pick the expected number of cells:

# 8 distinct identities in object@ident
length(unique(object@ident))
[1] 8
n.cells <- 10
subset <- SubsetData(object, max.cells.per.ident = n.cells, random.seed = NULL)
# 10 cells sampled per identity x 8 identities = 80 cells, not 10
length(subset@cell.names)
[1] 80

This can be misleading. I would rather use the sample() function directly:

set.seed(NULL)
n.cells <- 10
subset <- SubsetData(object, cells.use = sample(x = object@cell.names, size = n.cells))
length(subset@cell.names)
[1] 10

Best,
Jean-Baptiste

@leonfodoulian
Contributor

leonfodoulian commented Dec 13, 2017

Hi,

@del2007: The example you showed samples at random a maximum of 1000 cells from each cluster whose identity is stored in object@ident. So if you have already clustered your cells (let's suppose into 8 clusters), want to subset your dataset using the code you wrote, and every cluster contains at least 1000 cells, your final Seurat object will include 8000 cells. However, if you have not yet run FindClusters(), all your cells carry in the object@ident slot the information stored in object@meta.data$orig.ident. In that case, if all cells share the same orig.ident, only 1000 cells are sampled at random, independent of the clusters they will be assigned to after running FindClusters(). This is pretty much what Jean-Baptiste was pointing out.

So if you want to randomly sample 1000 cells, independent of the clusters to which those cells belong, you can simply provide a vector of cell names to the cells.use argument. Otherwise, if you'd like an (optimally) equal number of cells per cluster in your final, subsetted dataset, then what you proposed will do the job; see the sketch below.
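As a minimal sketch of the two options (assuming a Seurat v2 object named object that has already been clustered; the seed and the cell number are arbitrary):

# Option 1: 1000 cells at random, regardless of cluster identity
set.seed(42)
random.cells <- sample(x = object@cell.names, size = 1000, replace = FALSE)
obj.random <- SubsetData(object = object, cells.use = random.cells)

# Option 2: up to 1000 cells per identity class stored in object@ident
obj.per.ident <- SubsetData(object = object, max.cells.per.ident = 1000, random.seed = 42)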

Hope that helps!

Best,
Leon

@kathirij
Author

Hello All,
I appreciate the lively discussion and great suggestions. @leonfodoulian, I used your method and was able to do exactly what I wanted. Appreciate the detailed code you wrote.

@leonfodoulian
Contributor

leonfodoulian commented Dec 14, 2017

Great. Happy to hear that. However, for robustness, I would resample from obj1 several times using different seed values (which you can store for reproducibility), compute the variable genes at each step as described above, and then take either the union or the intersection of those variable gene sets. The final vector of variable genes can then be used for dimensional reduction.
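A minimal sketch of that resampling loop, reusing obj1, obj2 and obj.merged from the snippet above (the seed values are arbitrary, and obj.merged is assumed to be normalized already):

seeds <- c(111, 222, 333)
var.genes.list <- lapply(seeds, function(seed) {
  set.seed(seed)
  sampled.cells <- sample(x = obj1@cell.names,
                          size = length(obj2@cell.names),
                          replace = FALSE)
  obj.sub <- SubsetData(object = obj.merged,
                        cells.use = c(sampled.cells, obj2@cell.names))
  obj.sub <- FindVariableGenes(object = obj.sub, do.plot = FALSE)
  obj.sub@var.genes
})

# Union (more permissive) or intersection (more conservative) of the resampled gene sets
var.genes.union <- Reduce(union, var.genes.list)
var.genes.intersect <- Reduce(intersect, var.genes.list)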

@kathirij
Author

Yep! I did it three times and got very similar clustering at the end, each run using its own set of variable genes... But using a union of the variable genes might be even more robust. My analysis is helped by the fact that the larger cluster is very homogeneous, so randomly sampling ~1000 cells is still very representative.

In any case, appreciate your help.

@del2007

del2007 commented Dec 14, 2017 via email

@leonfodoulian
Contributor

Yes, it does sample randomly (using the sample() function from base R); you can check lines 714 to 716 in interaction.R. However, note that for reproducibility a random seed is set (in this case random.seed = 1). So if you repeat your subsetting several times with the same max.cells.per.ident, you will always end up with the same cells. You can, however, change the seed value and end up with a different dataset. Try doing that, and see for yourself whether the mean or the median remains the same.
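For example (a sketch assuming Seurat v2; the seed values and cell number are arbitrary):

# Same max.cells.per.ident, different seeds -> different random subsets
sub1 <- SubsetData(object = object, max.cells.per.ident = 1000, random.seed = 1)
sub2 <- SubsetData(object = object, max.cells.per.ident = 1000, random.seed = 42)

# The two subsets should share only part of their cells
length(intersect(sub1@cell.names, sub2@cell.names))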

However, to avoid cases where different orig.ident values are stored in the object@meta.data slot (which happened in my case), I suggest you create a new metadata column holding the same identity for all your cells, and then set the identity of all your cells to that value.

# Create a new column having the same 'value' stored for all cells
object@meta.data$noIdent <- "noIdent"

# Change the identity of your cells to the common noIdent value
object <- SetAllIdent(object, id = "noIdent")

For your last question, I suggest you read this bioRxiv paper. Of course, your case does not exactly match theirs, since they have ~1.3M cells and, therefore, a better chance of maximally enriching for rare cell types, and the tissues you're studying might be very different. But this is something you can test by first subsetting your data minimally (i.e. to a point where R doesn't crash but where you lose as few cells as possible), and then progressively decreasing the number of sampled cells to see whether the results remain consistent and are recapitulated with fewer cells.
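One possible way to run that check (a sketch with hypothetical sample sizes, using the overlap in variable genes with the full dataset as a rough consistency measure; it assumes a normalized Seurat v2 object named object whose variable genes have already been computed):

sizes <- c(20000, 10000, 5000, 2000)
overlap <- sapply(sizes, function(n) {
  set.seed(1)
  obj.down <- SubsetData(object = object,
                         cells.use = sample(x = object@cell.names, size = n))
  obj.down <- FindVariableGenes(object = obj.down, do.plot = FALSE)
  length(intersect(obj.down@var.genes, object@var.genes)) / length(object@var.genes)
})
names(overlap) <- sizes
overlap  # fraction of the full-data variable genes recovered at each sample size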

Best,
Leon
