-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
glmpca giving error on simulated data #2
Comments
Hi, thanks for your interest! This error can occur when there are rows or columns with all zero values (usually it is the rows/genes). You need to filter out these uninformative features before passing it to GLM-PCA. Also the error can occur if there are extremely large values say larger than 1e6, which would not be expected in a typical scRNA-Seq dataset. Let me know if that helps, otherwise I'll dig deeper. Also, we are actively working on turning GLM-PCA into an R package for CRAN and trying to improve error handling and provide more informative messages so I'll leave this issue open to remind myself to put in a more informative error message. |
Thanks I found the problem by removing empty rows.
The output gives me something peculiar, some sort of ring. Any hints on why that may be? Seems to be solved by setting L=3. I suspect it may be coming from the SVD step, but just a guess. |
I'm no expert here, but is that pretty much what you would expect for randomly simulated data and two latent dimensions? |
I played with your code a bit, and I think the issue is your simulation has extremely low total counts per cell (50). This causes the data become essentially binary (nearly all the nonzero values are ones. I found the maximum value was 3 across the entire matrix). In that case, it might be more appropriate to just set all the nonzero values to one and use fam="bern" (bernoulli likelihood). Try setting ncounts<-1000 (a value more typical of droplet scRNA-Seq) and the plot should show a better separation of the clusters. By the way, I don't think GLM-PCA is giving a wrong projection even with the low total counts, because you still see a clear separation between red and blue clusters (p1 and p2) along the first dimension, and the black points (p3) are roughly an equal mixture of the red and blue. Another interesting simulation would be to allow the total counts to vary across cells, perhaps according to two batches. You could have one batch where all the cells have total counts ranging from say 800-1,500 and another ranging from 2,000-5,000. Then, if you apply PCA to log2(1+CPM) you will see the two batches separating on PC1, whereas GLM-PCA should remove this batch effect and still show the clusters separating on dimension 1. I also encourage you to try running GLM-PCA on real scRNA-Seq data such as that of Duo et al |
As of commit 507a7da , GLM-PCA should give an informative error message when there are rows in the data matrix that are all zero. |
Hello,
I enjoyed reading your paper and I thought I would try out the GLM-PCA.
I tried to simulate some data and then run glmpca, but I am getting an error:
I get an error message:
Error in if (t > 5 && abs(dev[t] - dev[t - 1])/(0.1 + abs(dev[t - 1])) < : missing value where TRUE/FALSE needed
Would you have any hints on where this error may be coming from?
The text was updated successfully, but these errors were encountered: