
Fix bug on estimateDispersionsGeneEst when niter is larger than 1 #64

Merged (1 commit, Feb 21, 2023)

Conversation

@areyesq89 commented Feb 21, 2023

I think I found a minor bug in estimateDispersionsGeneEst when the parameter niter is set to > 1 and fitType="glmGamPoi".

  • The indexing fitMu[idx, ] inside the else if (type == "glmGamPoi") chunk is unnecessary, because fitMu already has the correct dimensions from the line that generates it (i.e. fit <- fitNbinomGLMs(objectNZ[fitidx,,drop=FALSE], ...)).
  • Also, within the iterations across niter, the convergence check returned NA values when alpha_hat and alpha_hat_new both contained zeros. This seemed to happen when one of the groups has all-zero counts and glmGamPoi is used.
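To make the first point concrete, here is a toy sketch (illustrative values, not the actual DESeq2 objects) of the subscript problem: the rows of fitMu line up with which(fitidx), so indexing it with a gene index taken from the full gene set can run past its last row:

```r
# Toy illustration, not DESeq2 code: 6 genes, 4 of which are being refit
fitidx <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
fitMu  <- matrix(1, nrow = sum(fitidx), ncol = 3)  # already subset: 4 rows
idx <- which(fitidx)[4]                            # gene 6 in the full set
# fitMu[idx, ]  # Error in fitMu[idx, ] : subscript out of bounds
fitMu[4, ]      # rows of fitMu align with positions in which(fitidx)
```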

Hope I'm not chasing my own tail here.

Here is the code that generated the error message:

data("pasillaDEXSeqDataSet", package="pasilla")
dxd <- estimateSizeFactors( dxd )
modelMatrix <- DEXSeq:::rmDepCols( model.matrix(design(dxd), as.data.frame(colData(dxd))))
tst <- estimateDispersionsGeneEst( dxd, quiet=FALSE, modelMatrix = modelMatrix, niter = 10, type="glmGamPoi" )

using 'glmGamPoi' as fitType. If used in published research, please cite:
    Ahlmann-Eltze, C., Huber, W. (2020) glmGamPoi: Fitting Gamma-Poisson
    Generalized Linear Models on Single Cell Count Data. Bioinformatics.
    https://doi.org/10.1093/bioinformatics/btaa1009
using 'glmGamPoi' as fitType. If used in published research, please cite:
    Ahlmann-Eltze, C., Huber, W. (2020) glmGamPoi: Fitting Gamma-Poisson
    Generalized Linear Models on Single Cell Count Data. Bioinformatics.
    https://doi.org/10.1093/bioinformatics/btaa1009
Error in fitMu[idx, ] : subscript out of bounds

@const-ae (Contributor)

Hi Alejandro,

Thanks for filing the bug report. I guess I never tested the DESeq2 / glmGamPoi integration with niter > 1.

Do I understand correctly that the problem alpha_hat = 0 is not fixed by your PR? I could try to take a look but won't manage this week because of a deadline on Friday.

Best, Constantin

@areyesq89 (Author) commented Feb 21, 2023

Hi @const-ae,

Yes, that's also fixed in my PR by the following line: fitidx[is.na(fitidx)] <- FALSE.
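For illustration, a toy sketch (illustrative values and threshold, not the actual DESeq2 convergence check) of how zero dispersions produce NA in a logical index, and what that guard line does:

```r
# Toy illustration, not DESeq2 code
alpha_hat     <- c(0.5, 0.0, 1.2)
alpha_hat_new <- c(0.6, 0.0, 1.2)
# log(0) - log(0) is -Inf - (-Inf) = NaN, so the comparison yields NA
fitidx <- abs(log(alpha_hat_new) - log(alpha_hat)) > 0.05
fitidx                           # TRUE NA FALSE
fitidx[is.na(fitidx)] <- FALSE   # the guard: treat NA as converged
fitidx                           # TRUE FALSE FALSE
```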

No rush, and best of luck with your deadline!

Best regards,
Alejandro

@mikelove (Collaborator)

Thanks Alejandro and Constantin. I will pull this and send to devel branch for now, as scanning the code it looks like you are correct. I can add a test case at some point for glmGamPoi and niter > 1.

@mikelove mikelove merged commit 5d85d7a into thelovelab:master Feb 21, 2023
@ycl6 (Contributor) commented Mar 4, 2024

Hi all,

I posted a question relating to this on the Bioconductor support site.
https://support.bioconductor.org/p/9156883/

I think this fix provides dnbinom() with the wrong mu input. My understanding is that, in the loop, dnbinom should be given the per-gene counts and the per-gene fitMu, and therefore it should be as it was before, i.e. fitMu[idx, ] rather than the whole fitMu matrix, e.g.:

# fitMu
[1] "matrix" "array" 
 num [1:11774, 1:49] 1.834 0.999 3.028 1 0.52 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:11774] "Mrpl15" "Lypla1" "Tcea1" "Atp6v1h" ...
  ..$ : chr [1:49] "AGGCCGTTCACCTCGT-1" "AGGGATGAGTGCGATG-1" "ATGCGATTCGTGTAGT-1" "CAACCAAAGAGTCGGT-1" ...

The time DESeq() takes to complete on my small single-cell subset (17881 genes and 49 cells) increased from 10 seconds to 30 minutes, and the results differ from before the fix because the estimated dispersions changed due to the different mu input.


Below is my original post on the support site:


Hi, I recently upgraded to the latest R 4.3 from R 4.1. While running the DEA workflow on scRNA-seq data, I noticed it was dramatically slower with my latest setup. I wonder if anyone is aware of changes that may have contributed to this, and how I can make the dispersion estimation step run as fast as before?

Using a small dataset with 49 cells as demo (22 vs. 27 in two tissues), here's how I run DESeq():

dds <- DESeqDataSetFromMatrix(countData = counts(sce),
                              colData = droplevels(colData(sce)[,c("Barcode","tissue")]),
                              rowData = rowData(sce), design = ~ tissue)
sizeFactors(dds) <- sizeFactors(sce)
dds

system.time({
    dds <- DESeq(dds, test = "LRT", useT = TRUE, minmu = 1e-6, minReplicatesForReplace = Inf,
                 fitType = "glmGamPoi", sfType = "poscounts", reduced = ~1, quiet = FALSE)
})
# dds
class: DESeqDataSet 
dim: 17881 49 
metadata(1): version
assays(1): counts
rownames(17881): Xkr4 Gm37381 ... CAAA01147332.1 AC149090.1
rowData names(5): ID Symbol Type SEQNAME is_mito
colnames(49): LR03_AGGCCGTTCACCTCGT-1 LR03_AGGGATGAGTGCGATG-1 ... LR04_TTGACTTAGGTGTTAA-1 LR04_TTGACTTGTCTGCGGT-1
colData names(3): Barcode tissue sizeFactor

When using R version 4.1.3, DESeq2_1.34.0 and glmGamPoi_1.6.0, the run time is:

   user  system elapsed 
 10.996   6.414   9.632 

When using R version 4.3.2, DESeq2_1.42.0 and glmGamPoi_1.14.3, the run time is:

    user   system  elapsed 
1661.721   25.540 1679.590 

Edit:

Digging further into the code, there seems to be a change in how initial_lp and last_lp are calculated, which greatly increases the amount of time vapply takes to perform the calculation.

In the current version, the whole fitMu matrix is used as mu:

      initial_lp <- vapply(which(fitidx), function(idx){
        sum(dnbinom(Counts[idx, ], mu = fitMu, size = 1 / alpha_hat[idx], log = TRUE))
      }, FUN.VALUE = 0.0)

      last_lp <- vapply(which(fitidx), function(idx){
        sum(dnbinom(Counts[idx, ], mu = fitMu, size = 1 / alpha_hat_new[idx], log = TRUE))
      }, FUN.VALUE = 0.0)

Whereas in the previous version, the per-gene vector fitMu[idx, ] is used as mu, which I think makes more sense. I believe this is also how fitMu is used in the fitting process in DESeq2.cpp:

      initial_lp <- vapply(which(fitidx), function(idx){
        sum(dnbinom(Counts[idx, ], mu = fitMu[idx, ], size = 1 / alpha_hat[idx], log = TRUE))
      }, FUN.VALUE = 0.0)

      last_lp <- vapply(which(fitidx), function(idx){
        sum(dnbinom(Counts[idx, ], mu = fitMu[idx, ], size = 1 / alpha_hat_new[idx], log = TRUE))
      }, FUN.VALUE = 0.0)
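As a toy sketch of the difference (illustrative data, not the DESeq2 objects): when mu is the whole matrix, R recycles the per-gene count vector against every entry of the matrix, so the sum runs over all genes' fitted means instead of one gene's row, giving both a different value and far more work per vapply step:

```r
# Toy illustration, not DESeq2 code: 4 genes x 3 samples
set.seed(1)
counts <- matrix(rpois(12, 5), nrow = 4)
mu     <- matrix(runif(12, 4, 6), nrow = 4)
idx <- 2
# per-gene row: 3 density terms, for gene 2 only
sum(dnbinom(counts[idx, ], mu = mu[idx, ], size = 10, log = TRUE))
# whole matrix: counts[idx, ] recycled against all 12 entries of mu
sum(dnbinom(counts[idx, ], mu = mu, size = 10, log = TRUE))
```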

Also, why is fitMu used as the mean input when running glmGamPoi::overdispersion_mle(), and not the subset fitMu[fitidx, ] as in the previous version?

      # New version
      dispersion_fits <- glmGamPoi::overdispersion_mle(Counts[fitidx, ], mean = fitMu,
                                                       model_matrix = modelMatrix, verbose = ! quiet)

      # Old version
      dispersion_fits <- glmGamPoi::overdispersion_mle(Counts[fitidx, ], mean = fitMu[fitidx, ],
                                                       model_matrix = modelMatrix, verbose = ! quiet)

@mikelove (Collaborator) commented Mar 4, 2024

From looking this over, I agree with @ycl6 that there is a mismatch in the arguments provided to dnbinom, because the intention of the vapply is to loop over genes.

@areyesq89 and @const-ae do you agree?

https://github.com/thelovelab/DESeq2/blob/devel/R/core.R#L796-L805

I can work on this but not until next week probably.

@mikelove (Collaborator) commented Mar 6, 2024

I haven't tested this (sorry just overloaded right now) but I think the fix would look something like this:

a9171e9

@ycl6 (Contributor) commented Mar 6, 2024

Hi @mikelove, I tested it and it works fine, albeit with a missing comma ;)
It is still a little slow, so I made some changes and submitted a PR to address that.
