Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

glmpca giving error on simulated data #2

Closed
jakeyeung opened this issue Jun 13, 2019 · 5 comments
Closed

glmpca giving error on simulated data #2

jakeyeung opened this issue Jun 13, 2019 · 5 comments

Comments

@jakeyeung
Copy link

jakeyeung commented Jun 13, 2019

Hello,

I enjoyed reading your paper and I thought I would try out the GLM-PCA.

I tried to simulate some data and then run glmpca, but I am getting an error:


# Generate data -----------------------------------------------------------

ngenes <- 10000
ncells <- 50
ncounts <- 30
ntranscripts <- 10^6

p1 <- exp(rnorm(n = ngenes, mean = 0, sd = 1))
p2 <- exp(rnorm(n = ngenes, mean = 0, sd = 1))
p3 <- mapply(function(x, y) sqrt(x * y), p1, p2)

#1 generate single-cell transcriptomes 

clstr1 <- rmultinom(ncells, ncounts, prob = p1)
clstr2 <- rmultinom(ncells, ncounts, prob = p2)
clstr3 <- rmultinom(ncells, ncounts, prob = p3)

dat.merged <- cbind(clstr1, clstr2, clstr3)

Yout <- glmpca(as.matrix(dat.merged), 2, fam = "mult")

I get an error message:

Error in if (t > 5 && abs(dev[t] - dev[t - 1])/(0.1 + abs(dev[t - 1])) < : missing value where TRUE/FALSE needed

Would you have any hints on where this error may be coming from?

@willtownes
Copy link
Owner

Hi, thanks for your interest! This error can occur when there are rows or columns with all zero values (usually it is the rows/genes). You need to filter out these uninformative features before passing it to GLM-PCA. Also the error can occur if there are extremely large values say larger than 1e6, which would not be expected in a typical scRNA-Seq dataset. Let me know if that helps, otherwise I'll dig deeper. Also, we are actively working on turning GLM-PCA into an R package for CRAN and trying to improve error handling and provide more informative messages so I'll leave this issue open to remind myself to put in a more informative error message.

@jakeyeung
Copy link
Author

jakeyeung commented Jun 13, 2019

Thanks I found the problem by removing empty rows.


ngenes <- 10000
ncells <- 50
ncounts <- 50
ntranscripts <- 10^6

p1 <- exp(rnorm(n = ngenes, mean = 0, sd = 1))
p2 <- exp(rnorm(n = ngenes, mean = 0, sd = 1))
p3 <- mapply(function(x, y) sqrt(x * y), p1, p2)

#1 generate single-cell transcriptomes 

clstr1 <- rmultinom(ncells, ncounts, prob = p1)
clstr2 <- rmultinom(ncells, ncounts, prob = p2)
clstr3 <- rmultinom(ncells, ncounts, prob = p3)

dat.merged <- cbind(clstr1, clstr2, clstr3)
rownames(dat.merged) <- paste("gene", seq(nrow(dat.merged)), sep = "_")
colnames(dat.merged) <- paste("cell", seq(ncol(dat.merged)), sep = "_")
# clean up rows
dat.merged.cleaned <- dat.merged[rowSums(dat.merged) > 0, ]
# rownamesd(dat.merged.cleaned) <- names(colnames(dat.merged)[which(rowSums(dat.merged) > 0)])

colvec <- c(rep("blue", ncells), rep("red", ncells), rep("black", ncells))

Yout <- glmpca(Y = as.matrix(dat.merged.cleaned), L=2, fam = "poi", verbose = TRUE). # fam="mult" gives same "ring"
plot(Yout$factors[, 1], Yout$factors[, 2], col = colvec, pch = 20, xlab = "GLMPC1 Loadings", ylab = "GLMPC2 Loadings", cex = 5, cex.lab = 1.5, cex.main = 1.5, main = "GLMPCA on Observed Space")

The output gives me something peculiar, some sort of ring. Any hints on why that may be? Seems to be solved by setting L=3. I suspect it may be coming from the SVD step, but just a guess.

Rplot

@bc2zb
Copy link

bc2zb commented Jun 13, 2019

I'm no expert here, but is that pretty much what you would expect for randomly simulated data and two latent dimensions?

@willtownes
Copy link
Owner

I played with your code a bit, and I think the issue is your simulation has extremely low total counts per cell (50). This causes the data become essentially binary (nearly all the nonzero values are ones. I found the maximum value was 3 across the entire matrix). In that case, it might be more appropriate to just set all the nonzero values to one and use fam="bern" (bernoulli likelihood). Try setting ncounts<-1000 (a value more typical of droplet scRNA-Seq) and the plot should show a better separation of the clusters.

image

By the way, I don't think GLM-PCA is giving a wrong projection even with the low total counts, because you still see a clear separation between red and blue clusters (p1 and p2) along the first dimension, and the black points (p3) are roughly an equal mixture of the red and blue.

Another interesting simulation would be to allow the total counts to vary across cells, perhaps according to two batches. You could have one batch where all the cells have total counts ranging from say 800-1,500 and another ranging from 2,000-5,000. Then, if you apply PCA to log2(1+CPM) you will see the two batches separating on PC1, whereas GLM-PCA should remove this batch effect and still show the clusters separating on dimension 1.

I also encourage you to try running GLM-PCA on real scRNA-Seq data such as that of Duo et al

@willtownes
Copy link
Owner

As of commit 507a7da , GLM-PCA should give an informative error message when there are rows in the data matrix that are all zero.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants