Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

glmGamPoi fit fails if normMatrix is provided #29

Closed
mschubert opened this issue Nov 19, 2020 · 17 comments
Closed

glmGamPoi fit fails if normMatrix is provided #29

mschubert opened this issue Nov 19, 2020 · 17 comments

Comments

@mschubert
Copy link

Looking at the glmGamPoi test case:

dds <- makeExampleDESeqDataSet(n=100, m=8)
dds$group <- factor(rep(1:2,times=4))
design(dds) <- ~group + condition
dds <- DESeq(dds, test="LRT", reduced=~1, fitType="glmGamPoi")

This also works when I call estimateSizeFactors and estimateDispersions manually:

dds <- makeExampleDESeqDataSet(n=100, m=8)
dds$group <- factor(rep(1:2,times=4))
design(dds) <- ~group + condition
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds, fitType="glmGamPoi")
dds <- DESeq(dds, test="LRT", reduced=~1, fitType="glmGamPoi")

However, when I supply the normMatrix argument the subsequent glmGamPoi fails:

dds <- makeExampleDESeqDataSet(n=100, m=8)
nm = matrix(2, nrow=nrow(dds), ncol=ncol(dds))
dds$group <- factor(rep(1:2,times=4))
design(dds) <- ~group + condition
dds <- estimateSizeFactors(dds, normMatrix=nm)
dds <- estimateDispersions(dds, fitType="glmGamPoi")
dds <- nbinomLRT(dds, reduced=~1, type="glmGamPoi")
# Error in combine_size_factors_and_offset(offset, size_factors, Y, verbose = verbose) :
#   is.numeric(size_factors) && (length(size_factors) == 1 || length(size_factors) ==  .... is not TRUE
#dds <- DESeq(dds, test="LRT", reduced=~1, fitType="glmGamPoi") # same if used instead of nbinomLRT

Is it possible to use normMatrix with glmGamPoi?

The reason why I want to do that is that I'm looking at single-cell gene expression changes with chromosome aneuploidies.

(Filing this here instead of Bioc support because I think it's a bug.)

@const-ae
Copy link
Contributor

const-ae commented Nov 20, 2020

Hi Michael,

thanks for filing a bug report. I wasn't aware of the normMatrix argument and that's why I missed it, when I integrated glmGamPoi into DESeq2.

Is it possible to use normMatrix with glmGamPoi?

In theory yes. glmGamPoi::glm_gp() has an offset parameter which works as the log(normalizationFactors(dds)). So if you are willing to use glmGamPoi directly, this would be an option.

@mikelove, for fixing this bug. I think the solution would be to check here if is.null(normalizationFactors(objectNZ)) and only if true get sizeFactors(objectNZ), right? I could draft a PR to fix the problem, if you want.

Best,
Constantin

@mschubert
Copy link
Author

Hi Constantin,

Thanks a lot for your answer!

It would be great if this worked from within DESeq2, then I can use an API I'm already familiar with

@mikelove
Copy link
Collaborator

Thanks Michael and Constantin, I took a shot at addressing this here: 40d3cba

Let me know if that makes sense on your end(s).

@mikelove
Copy link
Collaborator

You'll actually need to use 1.31.3 : bc03212

@const-ae pointed out that I needed to add size_factors = FALSE to make it work correctly.

@mschubert
Copy link
Author

Wow, that was quick! Thank you so much.

I'm running a large test case now and will let you know if there are any issues. So far it's looking good

@mschubert
Copy link
Author

I have now run this for 45 treatment vs. 2947 control cells, and the normMatrix parameter behaves es expected: for instance, I got a log2FoldChange of -0.289 with and 0.267 without normMatrix, which is consistent with a normalization value of 3 vs. 2 for that gene 👍

However, I'm also confused that in contrast to the test case above, I get NA for all stat fields (maybe because many p-values are 0?) And it would be great for me if I had an estimate for the lfcSE, which is always NA including in the test case above.

@const-ae
Copy link
Contributor

I have now run this for 45 treatment vs. 2947 control cells, and the normMatrix parameter behaves es expected: for instance, I got a log2FoldChange of -0.289 with and 0.267 without normMatrix, which is consistent with a normalization value of 3 vs. 2 for that gene 👍

That is great to hear :)

It is common that you get very small p-values for a single cell analysis, because you have so many cells. However, keep in mind that cells from the same sample are not independent replicates, so you should be careful how you interpret the results.

I get NA for all stat fields

That is weird. Can you say a bit more how you exactly do the test? Or provide some example data or subset of the data, so that I can reproduce the problem?

And it would be great for me if I had an estimate for the lfcSE, which is always NA including in the test case above.

Indeed, glmGamPoi doesn't store the standard error of the coefficients, that's why I fill the lfcSE column with NA's in DESeq2 (https://github.com/mikelove/DESeq2/blob/master/R/core.R#L1909). The reason that I don't store the standard error of the coefficients in glmGamPoi is that I don't need for the likelihood-ratio test. Could you say a bit more for what you use the lfcSE?

@mschubert
Copy link
Author

It is common that you get very small p-values

In this case, I was surprised at the difference between the "normal" DESeq2 LRT (smallest p-values around 1e-30) vs. the glmGamPoi (300 genes < 1e-300)

I get NA for all stat fields

That was my mistake. I tried to calculate the Wald statistic from the log2FoldChange / lfcSE, which of course didn't work without lfcSE (where I use the LRT results for e.g. MA/volcano plots, but the Wald stat to compare 2 samples on x/y axis)

Could you say a bit more for what you use the lfcSE?

One of the analyses I do downstream of the DE is to model whether fold changes have a discontinuity along the genome (e.g. due to an amplification). The "raw" log2FCs are too noisy, in order to get anything out I need to weigh them by their SE when using them e.g. in a change point analysis.

@const-ae
Copy link
Contributor

In this case, I was surprised at the difference between the "normal" DESeq2 LRT (smallest p-values around 1e-30) vs. the glmGamPoi (300 genes < 1e-300)

There can be several reasons for the discrepancy:

  1. the minmu argument can influence how low the estimate for a particular group gets
  2. I previous observed that the kind of test (glmGamPoi=quasi-likelihood ratio vs DESeq2=likelihood ratio) influences the result (see attached picture which is Suppl. Fig 7 from our preprint). image
    It's not that glmGamPoi will always calculate smaller p-values, but glmGamPoi and DESeq2 use different tests and this can have a large effect on the log-scale (e.g. p-value = 1e-30 vs 1e-300), because with so many replicates the exact behavior of the tail of the distribution matters greatly.
  3. there was a bug in a recent version of glmGamPoi which meant that it sometimes converged to bad coefficient estimates. This in turn caused an unreasonable large deviance and thus very small p-values. I did fix it in version 1.3.1, which you can install from GitHub.

@const-ae
Copy link
Contributor

I get NA for all stat fields

That was my mistake. I tried to calculate the Wald statistic from the log2FoldChange / lfcSE, which of course didn't work without lfcSE (where I use the LRT results for e.g. MA/volcano plots, but the Wald stat to compare 2 samples on x/y axis)

Okay that is reassuring to hear :)

One of the analyses I do downstream of the DE is to model whether fold changes have a discontinuity along the genome (e.g. due to an amplification). The "raw" log2FCs are too noisy, in order to get anything out I need to weigh them by their SE when using them e.g. in a change point analysis.

Oh, that is very interesting and a cool use-case I haven't considered so far. When you say, that you weight the log-fold changes by the standard error, do you do something like lfc / se, i.e. use the test-statistic? In that case, could you try to plug-in the test-statistic that glmGamPoi calculates? (You would still need to incorporate the sign of the LFC. It's just an idea, not sure it would work.)

@mschubert
Copy link
Author

There can be several reasons for the discrepancy

Thanks for the linked figure! It seems to be a DESeq/glmGamPoi difference, no change with setting minmu or or updating glmGamPoi.

When you say, that you weight the log-fold changes by the standard error, do you do something like lfc / se, i.e. use the test-statistic?

No, I want to keep them separate to be able to call break points for different ploidies. In the case of the ecp package, their change point detection takes a weights parameter

In that case, could you try to plug-in the test-statistic that glmGamPoi calculates?

I don't think so. If the FCs are not changing, I still want to separate strong evidence for no change (lfc=0, low se, low stat) vs. weak evidence for no change (lfc=0, high se, low stat) (unless I misunderstood and the LRT stat is not testing H0)

@const-ae
Copy link
Contributor

I opened an issue at glmGamPoi about the standard error estimates.

I am open to add a way to calculate the standard error and provide those estimates, but I would need to spend more time on this, than I currently can spare. I would, however, be open for a PR or any other input, but let's continue the discussion at const-ae/glmGamPoi#12

@mschubert
Copy link
Author

Coming back on-topic: It seems I was too quick to judge the normMatrix argument as working. The log2FoldChange seems correct, as I noted before, e.g. if you look at the correlation between a DESeq run where I do not set it vs. one where I set it to 1 everywhere (maybe with a small deviation on very large negative lfcs):

image

However, when I look at the correlation of the pvalue, I get the following:

image

This doesn't look right.

The plots are from my own data, but you also see the discrepancy when plotting it for the normMatrix test case:

library(DESeq2)
makeExampleDESeqDataSet = DESeq2:::makeExampleDESeqDataSet

dds <- makeExampleDESeqDataSet(n=100, m=8)
dds$group <- factor(rep(1:2,times=4))
design(dds) <- ~group + condition

res1 <- DESeq(dds, test="LRT", reduced=~1, fitType="glmGamPoi")

nm = matrix(2, nrow=nrow(dds), ncol=ncol(dds))
dds <- estimateSizeFactors(dds, normMatrix=nm)
dds <- estimateDispersions(dds, fitType="glmGamPoi")
res2 <- nbinomLRT(dds, reduced=~1, type="glmGamPoi")

plot(log10(results(res1)$pvalue), log10(results(res2)$pvalue))

image

@const-ae
Copy link
Contributor

Okay, I can reproduce the problem. Here is a bit shorter code that shows that the problem

library(glmGamPoi)
Y <- matrix(rnbinom(n = 30 * 10, mu = 4, size = 0.3), nrow = 30, ncol  =10)
annot <- data.frame(group = sample(c("A", "B"), size = 10, replace = TRUE),
                    cont1 = rnorm(10), cont2 = rnorm(10, mean = 30))
design <- model.matrix(~ group + cont1 + cont2, data = annot)
reduced_des <- model.matrix(~ group + cont1, data = annot)

fit1 <- glm_gp(Y, design = design)
offset <- matrix(log(fit1$size_factors), byrow = TRUE, nrow = nrow(Y), ncol = ncol(Y))
fit2 <- glm_gp(Y, design = design, offset = offset, size_factors = FALSE)

# The size factor is at the 6th position
testthat::expect_equal(fit1[-6], fit2[-6])

res1 <- test_de(fit1, reduced_design = reduced_des)
res2 <- test_de(fit2, reduced_design = reduced_des)

testthat::expect_equal(res1, res2)
#> Error: `res1` not equal to `res2`.
#> Component "pval": Mean relative difference: 0.5928957
#> Component "adj_pval": Mean relative difference: 0.4650072
#> Component "f_statistic": Mean relative difference: 0.9238129

Created on 2020-11-25 by the reprex package (v0.3.0.9001)

The problem is that I currently, ignore the offset when I do the fit with the reduced design

@const-ae
Copy link
Contributor

I fixed the issue in const-ae/glmGamPoi@0deacb3#diff-e923364f1bda63722e222d7080a9a0dd30a5cd77712563c71b0d18eea3ffa239R230

To test the new version, you will need to install glmGamPoi version 1.3.4

library(DESeq2)
dds <- DESeq2:::makeExampleDESeqDataSet(n=100, m=8)
dds$group <- factor(rep(1:2,times=4))
design(dds) <- ~group + condition

res1 <- DESeq(dds, test="LRT", reduced=~1, fitType="glmGamPoi")

nm = matrix(2, nrow=nrow(dds), ncol=ncol(dds))
dds <- estimateSizeFactors(dds, normMatrix=nm)
dds <- estimateDispersions(dds, fitType="glmGamPoi")


plot(log10(results(res1)$pvalue), log10(results(res2)$pvalue)); abline(0,1)

Created on 2020-11-25 by the reprex package (v0.3.0.9001)

Thank you again @mschubert for your effort to track down the bug :)

@mikelove
Copy link
Collaborator

Thanks Michael and Constantin!

@mikelove
Copy link
Collaborator

mikelove commented Dec 2, 2020

I'm going to close this for now, but please feel free to re-open as needed.

@mikelove mikelove closed this as completed Dec 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants