glmpca giving error on simulated data #2

jakeyeung · 2019-06-13T14:08:26Z

Hello,

I enjoyed reading your paper and I thought I would try out the GLM-PCA.

I tried to simulate some data and then run glmpca, but I am getting an error:


# Generate data -----------------------------------------------------------

ngenes <- 10000
ncells <- 50
ncounts <- 30
ntranscripts <- 10^6

p1 <- exp(rnorm(n = ngenes, mean = 0, sd = 1))
p2 <- exp(rnorm(n = ngenes, mean = 0, sd = 1))
p3 <- mapply(function(x, y) sqrt(x * y), p1, p2)

#1 generate single-cell transcriptomes 

clstr1 <- rmultinom(ncells, ncounts, prob = p1)
clstr2 <- rmultinom(ncells, ncounts, prob = p2)
clstr3 <- rmultinom(ncells, ncounts, prob = p3)

dat.merged <- cbind(clstr1, clstr2, clstr3)

Yout <- glmpca(as.matrix(dat.merged), 2, fam = "mult")

I get an error message:

Error in if (t > 5 && abs(dev[t] - dev[t - 1])/(0.1 + abs(dev[t - 1])) < : missing value where TRUE/FALSE needed

Would you have any hints on where this error may be coming from?

The text was updated successfully, but these errors were encountered:

willtownes · 2019-06-13T14:25:53Z

Hi, thanks for your interest! This error can occur when there are rows or columns with all zero values (usually it is the rows/genes). You need to filter out these uninformative features before passing it to GLM-PCA. Also the error can occur if there are extremely large values say larger than 1e6, which would not be expected in a typical scRNA-Seq dataset. Let me know if that helps, otherwise I'll dig deeper. Also, we are actively working on turning GLM-PCA into an R package for CRAN and trying to improve error handling and provide more informative messages so I'll leave this issue open to remind myself to put in a more informative error message.

jakeyeung · 2019-06-13T14:51:12Z

Thanks I found the problem by removing empty rows.


ngenes <- 10000
ncells <- 50
ncounts <- 50
ntranscripts <- 10^6

p1 <- exp(rnorm(n = ngenes, mean = 0, sd = 1))
p2 <- exp(rnorm(n = ngenes, mean = 0, sd = 1))
p3 <- mapply(function(x, y) sqrt(x * y), p1, p2)

#1 generate single-cell transcriptomes 

clstr1 <- rmultinom(ncells, ncounts, prob = p1)
clstr2 <- rmultinom(ncells, ncounts, prob = p2)
clstr3 <- rmultinom(ncells, ncounts, prob = p3)

dat.merged <- cbind(clstr1, clstr2, clstr3)
rownames(dat.merged) <- paste("gene", seq(nrow(dat.merged)), sep = "_")
colnames(dat.merged) <- paste("cell", seq(ncol(dat.merged)), sep = "_")
# clean up rows
dat.merged.cleaned <- dat.merged[rowSums(dat.merged) > 0, ]
# rownamesd(dat.merged.cleaned) <- names(colnames(dat.merged)[which(rowSums(dat.merged) > 0)])

colvec <- c(rep("blue", ncells), rep("red", ncells), rep("black", ncells))

Yout <- glmpca(Y = as.matrix(dat.merged.cleaned), L=2, fam = "poi", verbose = TRUE). # fam="mult" gives same "ring"
plot(Yout$factors[, 1], Yout$factors[, 2], col = colvec, pch = 20, xlab = "GLMPC1 Loadings", ylab = "GLMPC2 Loadings", cex = 5, cex.lab = 1.5, cex.main = 1.5, main = "GLMPCA on Observed Space")

The output gives me something peculiar, some sort of ring. Any hints on why that may be? Seems to be solved by setting L=3. I suspect it may be coming from the SVD step, but just a guess.

bc2zb · 2019-06-13T15:14:45Z

I'm no expert here, but is that pretty much what you would expect for randomly simulated data and two latent dimensions?

willtownes · 2019-06-13T16:18:26Z

I played with your code a bit, and I think the issue is your simulation has extremely low total counts per cell (50). This causes the data become essentially binary (nearly all the nonzero values are ones. I found the maximum value was 3 across the entire matrix). In that case, it might be more appropriate to just set all the nonzero values to one and use fam="bern" (bernoulli likelihood). Try setting ncounts<-1000 (a value more typical of droplet scRNA-Seq) and the plot should show a better separation of the clusters.

By the way, I don't think GLM-PCA is giving a wrong projection even with the low total counts, because you still see a clear separation between red and blue clusters (p1 and p2) along the first dimension, and the black points (p3) are roughly an equal mixture of the red and blue.

Another interesting simulation would be to allow the total counts to vary across cells, perhaps according to two batches. You could have one batch where all the cells have total counts ranging from say 800-1,500 and another ranging from 2,000-5,000. Then, if you apply PCA to log2(1+CPM) you will see the two batches separating on PC1, whereas GLM-PCA should remove this batch effect and still show the clusters separating on dimension 1.

I also encourage you to try running GLM-PCA on real scRNA-Seq data such as that of Duo et al

willtownes · 2019-06-14T20:06:57Z

As of commit 507a7da , GLM-PCA should give an informative error message when there are rows in the data matrix that are all zero.

willtownes closed this as completed Jun 14, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

glmpca giving error on simulated data #2

glmpca giving error on simulated data #2

jakeyeung commented Jun 13, 2019 •

edited

Loading

willtownes commented Jun 13, 2019

jakeyeung commented Jun 13, 2019 •

edited

Loading

bc2zb commented Jun 13, 2019

willtownes commented Jun 13, 2019

willtownes commented Jun 14, 2019

glmpca giving error on simulated data #2

glmpca giving error on simulated data #2

Comments

jakeyeung commented Jun 13, 2019 • edited Loading

willtownes commented Jun 13, 2019

jakeyeung commented Jun 13, 2019 • edited Loading

bc2zb commented Jun 13, 2019

willtownes commented Jun 13, 2019

willtownes commented Jun 14, 2019

jakeyeung commented Jun 13, 2019 •

edited

Loading

jakeyeung commented Jun 13, 2019 •

edited

Loading