
Fix bug on estimateDispersionsGeneEst when niter is larger than 1 #64

Merged (1 commit, Feb 21, 2023)

Conversation

@areyesq89 commented Feb 21, 2023

I think I found a minor bug in estimateDispersionsGeneEst when the parameter niter is set to > 1 and fitType="glmGamPoi".

  • The indexing fitMu[idx, ] inside the else if (type == "glmGamPoi") chunk is unnecessary, because fitMu already has the correct dimensions from the line that generates it (i.e. fit <- fitNbinomGLMs(objectNZ[fitidx,,drop=FALSE], ...)).
  • Also, within the iterations across niter, the convergence check returned NA values when alpha_hat and alpha_hat_new both contained zeros. This seemed to happen when one of the groups has all-zero counts and glmGamPoi is used.
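To make the first point concrete, here is a toy sketch (illustrative values, not the actual DESeq2 objects) of the subscript problem: the rows of fitMu line up with which(fitidx), so indexing it with a gene index taken from the full gene set can run past its last row:

```r
# Toy illustration, not DESeq2 code: 6 genes, 4 of which are being refit
fitidx <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
fitMu  <- matrix(1, nrow = sum(fitidx), ncol = 3)  # already subset: 4 rows
idx <- which(fitidx)[4]                            # gene 6 in the full set
# fitMu[idx, ]  # Error in fitMu[idx, ] : subscript out of bounds
fitMu[4, ]      # rows of fitMu align with positions in which(fitidx)
```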

Hope I'm not chasing my own tail here.

Here is the code that generated the error message:

data("pasillaDEXSeqDataSet", package="pasilla")
dxd <- estimateSizeFactors( dxd )
modelMatrix <- DEXSeq:::rmDepCols( model.matrix(design(dxd), as.data.frame(colData(dxd))))
tst <- estimateDispersionsGeneEst( dxd, quiet=FALSE, modelMatrix = modelMatrix, niter = 10, type="glmGamPoi" )

using 'glmGamPoi' as fitType. If used in published research, please cite:
    Ahlmann-Eltze, C., Huber, W. (2020) glmGamPoi: Fitting Gamma-Poisson
    Generalized Linear Models on Single Cell Count Data. Bioinformatics.
    https://doi.org/10.1093/bioinformatics/btaa1009
using 'glmGamPoi' as fitType. If used in published research, please cite:
    Ahlmann-Eltze, C., Huber, W. (2020) glmGamPoi: Fitting Gamma-Poisson
    Generalized Linear Models on Single Cell Count Data. Bioinformatics.
    https://doi.org/10.1093/bioinformatics/btaa1009
Error in fitMu[idx, ] : subscript out of bounds

@const-ae (Contributor)

Hi Alejandro,

Thanks for filing the bug report. I guess I never tested the DESeq2 / glmGamPoi integration with niter > 1.

Do I understand correctly that the problem alpha_hat = 0 is not fixed by your PR? I could try to take a look but won't manage this week because of a deadline on Friday.

Best, Constantin

@areyesq89 (Author) commented Feb 21, 2023

Hi @const-ae,

Yes, that's also fixed in my PR by the following line: fitidx[is.na(fitidx)] <- FALSE.
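For illustration, a toy sketch (illustrative values and threshold, not the actual DESeq2 convergence check) of how zero dispersions produce NA in a logical index, and what that guard line does:

```r
# Toy illustration, not DESeq2 code
alpha_hat     <- c(0.5, 0.0, 1.2)
alpha_hat_new <- c(0.6, 0.0, 1.2)
# log(0) - log(0) is -Inf - (-Inf) = NaN, so the comparison yields NA
fitidx <- abs(log(alpha_hat_new) - log(alpha_hat)) > 0.05
fitidx                           # TRUE NA FALSE
fitidx[is.na(fitidx)] <- FALSE   # the guard: treat NA as converged
fitidx                           # TRUE FALSE FALSE
```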

No rush, and best of luck with your deadline!

Best regards,
Alejandro

@mikelove (Collaborator)

Thanks Alejandro and Constantin. I will pull this and send to devel branch for now, as scanning the code it looks like you are correct. I can add a test case at some point for glmGamPoi and niter > 1.

@mikelove mikelove merged commit 5d85d7a into thelovelab:master Feb 21, 2023
@ycl6 (Contributor) commented Mar 4, 2024

Hi all,

I posted a question relating to this on the Bioconductor support site.
https://support.bioconductor.org/p/9156883/

I think this fix provides dnbinom() with the wrong mu input. My understanding is that, in the loop, dnbinom should be given the per-gene counts and the per-gene fitMu, and therefore it should be as it was before, i.e. fitMu[idx, ] rather than the whole fitMu matrix, e.g.:

# fitMu
[1] "matrix" "array" 
 num [1:11774, 1:49] 1.834 0.999 3.028 1 0.52 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:11774] "Mrpl15" "Lypla1" "Tcea1" "Atp6v1h" ...
  ..$ : chr [1:49] "AGGCCGTTCACCTCGT-1" "AGGGATGAGTGCGATG-1" "ATGCGATTCGTGTAGT-1" "CAACCAAAGAGTCGGT-1" ...

The time DESeq() takes to complete on my small single-cell subset (17881 genes and 49 cells) increased from 10 seconds to 30 minutes, and the results differ from before the fix because the estimated dispersions changed due to the different mu input.


Below is my original post on the support site:


Hi, I recently upgraded to the latest R 4.3 from R 4.1. While running the DEA workflow on scRNA-seq data, I noticed it was dramatically slower with my latest setup. I wonder if anyone is aware of changes that may have contributed to this, and how I can make the dispersion estimation step run as fast as before?

Using a small dataset with 49 cells as demo (22 vs. 27 in two tissues), here's how I run DESeq():

dds <- DESeqDataSetFromMatrix(countData = counts(sce),
                              colData = droplevels(colData(sce)[,c("Barcode","tissue")]),
                              rowData = rowData(sce), design = ~ tissue)
sizeFactors(dds) <- sizeFactors(sce)
dds

system.time({
    dds <- DESeq(dds, test = "LRT", useT = TRUE, minmu = 1e-6, minReplicatesForReplace = Inf,
                 fitType = "glmGamPoi", sfType = "poscounts", reduced = ~1, quiet = FALSE)
})
# dds
class: DESeqDataSet 
dim: 17881 49 
metadata(1): version
assays(1): counts
rownames(17881): Xkr4 Gm37381 ... CAAA01147332.1 AC149090.1
rowData names(5): ID Symbol Type SEQNAME is_mito
colnames(49): LR03_AGGCCGTTCACCTCGT-1 LR03_AGGGATGAGTGCGATG-1 ... LR04_TTGACTTAGGTGTTAA-1 LR04_TTGACTTGTCTGCGGT-1
colData names(3): Barcode tissue sizeFactor

When using R version 4.1.3, DESeq2_1.34.0 and glmGamPoi_1.6.0, the run time is:

   user  system elapsed 
 10.996   6.414   9.632 

When using R version 4.3.2, DESeq2_1.42.0 and glmGamPoi_1.14.3, the run time is:

    user   system  elapsed 
1661.721   25.540 1679.590 

Edit:

Digging further into the code, there seems to be a change in how initial_lp and last_lp are calculated, which greatly increases the amount of time vapply takes to perform the calculation.

In the current version, the whole fitMu matrix is used as mu:

      initial_lp <- vapply(which(fitidx), function(idx){
        sum(dnbinom(Counts[idx, ], mu = fitMu, size = 1 / alpha_hat[idx], log = TRUE))
      }, FUN.VALUE = 0.0)

      last_lp <- vapply(which(fitidx), function(idx){
        sum(dnbinom(Counts[idx, ], mu = fitMu, size = 1 / alpha_hat_new[idx], log = TRUE))
      }, FUN.VALUE = 0.0)

Whereas in the previous version, the per-gene vector fitMu[idx, ] is used as mu, which I think makes more sense. I believe this is also how fitMu is used in the fitting process in DESeq2.cpp:

      initial_lp <- vapply(which(fitidx), function(idx){
        sum(dnbinom(Counts[idx, ], mu = fitMu[idx, ], size = 1 / alpha_hat[idx], log = TRUE))
      }, FUN.VALUE = 0.0)

      last_lp <- vapply(which(fitidx), function(idx){
        sum(dnbinom(Counts[idx, ], mu = fitMu[idx, ], size = 1 / alpha_hat_new[idx], log = TRUE))
      }, FUN.VALUE = 0.0)
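As a toy sketch of the difference (illustrative data, not the DESeq2 objects): when mu is the whole matrix, R recycles the per-gene count vector against every entry of the matrix, so the sum runs over all genes' fitted means instead of one gene's row, giving both a different value and far more work per vapply step:

```r
# Toy illustration, not DESeq2 code: 4 genes x 3 samples
set.seed(1)
counts <- matrix(rpois(12, 5), nrow = 4)
mu     <- matrix(runif(12, 4, 6), nrow = 4)
idx <- 2
# per-gene row: 3 density terms, for gene 2 only
sum(dnbinom(counts[idx, ], mu = mu[idx, ], size = 10, log = TRUE))
# whole matrix: counts[idx, ] recycled against all 12 entries of mu
sum(dnbinom(counts[idx, ], mu = mu, size = 10, log = TRUE))
```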

Also, why is fitMu used as the mean input when running glmGamPoi::overdispersion_mle(), and not the subset fitMu[fitidx, ] as in the previous version?

      # New version
      dispersion_fits <- glmGamPoi::overdispersion_mle(Counts[fitidx, ], mean = fitMu,
                                                       model_matrix = modelMatrix, verbose = ! quiet)

      # Old version
      dispersion_fits <- glmGamPoi::overdispersion_mle(Counts[fitidx, ], mean = fitMu[fitidx, ],
                                                       model_matrix = modelMatrix, verbose = ! quiet)

@mikelove (Collaborator) commented Mar 4, 2024

From looking this over, I agree with @ycl6 that there is a mismatch in the arguments provided to dnbinom, because the intention of the vapply is to loop over genes.

@areyesq89 and @const-ae do you agree?

https://github.com/thelovelab/DESeq2/blob/devel/R/core.R#L796-L805

I can work on this but not until next week probably.

@mikelove (Collaborator) commented Mar 6, 2024

I haven't tested this (sorry just overloaded right now) but I think the fix would look something like this:

a9171e9

@ycl6 (Contributor) commented Mar 6, 2024

Hi @mikelove, I tested it and it works fine, albeit with a missing comma ;)
It is still a little slow, so I made some changes and submitted a PR to address that.
