Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A helper function allowing arbitrarily complex custom contrasts #11

Open
atakanekiz opened this issue May 9, 2024 · 5 comments
Open

Comments

@atakanekiz
Copy link

atakanekiz commented May 9, 2024

Hi Hugo,

Thank you for the tremendously helpful tutorial! I like the getNumericCoef() function you mentioned in your Gist and I wanted to improve on that a little bit to allow specifying custom groups to compare across multiple conditions. I would appreciate it if you can take a look and let me know if this could be universally (?) used to streamline analyses. In a nutshell, the function has 4 inputs:

  • A DESeq object, dds, with design and colData
  • group1 and group2 arguments to specify groups to compare
  • weighted argument to weight differing group sizes differently

The group1 and group2 arguments have the same syntax: character vector of length 2+ where the first position points to one of the column names in colData(dds) (ie grouping variable), and the second position (and onwards) indicate(s) specific subsets to include in that comparison group. The contrast is extracted as group1 - group2 (ie group1 is the group of interest, and the group2 is the reference)

This is the function

contraster <- function(dds,    # should contain colData and design
                       group1, # list of character vectors each with 2 or more items 
                       group2, # list of character vectors each with 2 or more items
                       weighted = F
){
  
  mod_mat <- model.matrix(design(dds), colData(dds))
  
  grp1_rows <- list()
  grp2_rows <- list()
  
  
  for(i in 1:length(group1)){
    grp1_rows[[i]] <- colData(dds)[[group1[[i]][1]]] %in% group1[[i]][2:length(group1[[i]])]
  }
  
  
  for(i in 1:length(group2)){
    grp2_rows[[i]] <- colData(dds)[[group2[[i]][1]]] %in% group2[[i]][2:length(group2[[i]])]
  }
  
  grp1_rows <- Reduce(function(x, y) x & y, grp1_rows)
  grp2_rows <- Reduce(function(x, y) x & y, grp2_rows)
  
  mod_mat1 <- mod_mat[grp1_rows, ,drop=F]
  mod_mat2 <- mod_mat[grp2_rows, ,drop=F]
  
  if(!weighted){
    mod_mat1 <- mod_mat1[!duplicated(mod_mat1),,drop=F]
    mod_mat2 <- mod_mat2[!duplicated(mod_mat2),,drop=F]
  }
  
  return(colMeans(mod_mat1)-colMeans(mod_mat2))
}

This is the reprex


# simulate data
dds3 <- makeExampleDESeqDataSet(n = 1000, m = 9, betaSD = 2)
dds3$condition <- factor(rep(c("control", "treatmentA", "treatmentB"), each = 3))

design(dds3) <- ~ condition
dds3 <- DESeq(dds3)

# check the coefficients estimated by DEseq
resultsNames(dds3)

# Original way of accessing contrasts
# treatmentA vs control
res1 <- results(dds3, contrast = list("condition_treatmentA_vs_control"))
# treatmentB vs control
res2 <- results(dds3, contrast = list("condition_treatmentB_vs_control"))
# treatmentA vs treatmentB
res3 <- results(dds3, contrast= list("condition_treatmentA_vs_control",
                                     "condition_treatmentB_vs_control"))

# Alternatively we can use our new function
res1 <- results(dds3,
                contrast = contraster(dds3,
                                      group1 = list(c("condition","treatmentA")),
                                      group2 = list(c("condition", "control"))))

res2 <- results(dds3,
                contrast = contraster(dds3,
                                      group1 = list(c("condition","treatmentB")),
                                      group2 = list(c("condition", "control"))))

res3 <- results(dds3,
                contrast = contraster(dds3,
                                      group1 = list(c("condition","treatmentA")),
                                      group2 = list(c("condition", "treatment"))))

A more complex reprex


# simulate data
dds <- makeExampleDESeqDataSet(n = 1000, m = 24, betaSD = 2)
dds$condition <- factor(rep(c("control", "treatmentA", "treatmentB"), each = 8))
dds$sex <- factor(rep(c("male","male","male", "female", "female", "female"), 4))
dds$batch <- factor(rep(c(1,2), 12))

design(dds) <- ~ condition + sex + batch
dds <- DESeq(dds)

# check the coefficients estimated by DEseq
resultsNames(dds)

contraster(dds,
           group1=list(c("batch", 2), c("sex", "male"), c("condition", "treatmentB")),
           group2=list(c("batch", 1), c("sex", "female"), c("condition", "treatmentA")))

contraster(dds,
           group1=list(c("batch", 1), c("sex", "male"), c("condition", "treatmentB", "treatmentA")),
           group2=list(c("batch", 1), c("sex", "male"), c("condition", "control")))

contraster(dds,
           group1=list(c("batch", 1), c("sex", "male"), c("condition", "treatmentB", "treatmentA")),
           group2=list(c("batch", 1), c("sex", "male"), c("condition", "control")), weighted = T) 
@atakanekiz
Copy link
Author

A little update for using the function in interactions:

dds4 <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 2)
dds4$sex <- factor(rep(c("female", "male"), each = 6))
dds4$condition <- factor(rep(c("treatment", "control"), 6))
dds4 <- dds4[, order(dds4$sex, dds4$condition)]
colnames(dds4) <- paste0("sample", 1:ncol(dds4))

design(dds4) <- ~ sex + condition + sex:condition
dds4 <- DESeq(dds4)
resultsNames(dds4)

# male_vs_female in the treatment condition
diff1 <-  contraster(dds4,
                     group1 = list(c("sex", "male"), c("condition", "treatment")),
                     group2 = list(c("sex", "female"), c("condition", "treatment")))

# male_vs_female in the control condition
diff2 <- contraster(dds4,
                     group1 = list(c("sex", "male"), c("condition", "control")),
                     group2 = list(c("sex", "female"), c("condition", "control")))
  
res1 <- results(dds4, 
                contrast = diff1-diff2)

# or equivalently
res2 <- results(dds4, contrast = list("sexmale.conditiontreatment"))

# head(res1,3)
# head(res2, 3)

@tavareshugo
Copy link
Owner

Hi @atakanekiz, that does look like a useful and intuitive function to use.

However, as I've mentioned in my warning (now on the README as well), I'm in general not confident of the "universality" of this approach anymore, in particular for imbalanced designs.
For example, imagine that in batch 1 we only had females in the control group:

# replace all control males in batch 1 with female
dds$sex[c(1, 3, 7)] <- "female"

But we still want to adjust for average sex differences, so we fit your model:

design(dds) <- ~ condition + sex + batch
dds <- DESeq(dds)

Now, let's say I wanted to simply contrast treatment A vs control:

contraster(dds,
           group1=list(c("condition", "treatmentA")),
           group2=list(c("condition", "control")))

The result is:

(Intercept) conditiontreatmentA conditiontreatmentB          sexmale           batch2 
0.0000000           1.0000000           0.0000000           0.1666667          -0.1666667

We get coefficient weights for sexmale and batch2, because now the design is imbalanced.
And I think this is wrong, we do not want to add any weight to those coefficients in this case.
And the results will indeed be different from using the standard DESeq contrast:

results(dds, name = "condition_treatmentA_vs_control")

In summary, yes your function seems like a really nice modification, but the general strategy might not be as "universal" as we hoped it would be.

@atakanekiz
Copy link
Author

atakanekiz commented May 10, 2024

I see. I find your approach of creating contrast vectors the most straightforward and intuitively understandable way to specify contrasts in DESeq2. Simple designs (one factor-two/three levels or two factor-two levels) are easy to deal with by using the list or coefficient names, but things get complicated quickly.

I am curious about what you think about the following points:

1- I played around with it a bit and found out that even if there is a single male in batch1, the coefficients become zero. Of course the statistics may not be too powerful at that point due to n=1, but at least the coefficients are identified correctly. In your example, since both sexes are not found in batch1 of treament, we cannot possibly distinguish the effects of batch and sex. One option is not allowing these two together in the design if data are not available for all the groups. However, somebody may want to distinguish the effects of sex and batch for treatmentB, thus not allowing two contrasts together is clearly not the best choice. How about if I engineered a logic inside the function to throw an error if the specified groups contain no data?

2- The other approach is, making all coefficients with decimals zero when weighted = FALSE. This seems to solve the problem here, but I am not sure whether this is a right approach in general.

3- When would you need to assign weights in the contrasts? When multiple data subsets are grouped together, they should be assigned the same weight as you alluded to in #7

That would be tremendous if we can find a way to fix this unwanted behavior -- I think it will be a great service to many in the field 🦸

@atakanekiz atakanekiz changed the title A helper function allowing arbitrarily complex custom constrasts A helper function allowing arbitrarily complex custom contrasts May 10, 2024
@tavareshugo
Copy link
Owner

I find your approach of creating contrast vectors the most straightforward and intuitively understandable way to specify contrasts in DESeq2.

I agree, except that I eventually realised that this approach breaks down for imbalanced designs. So, it can be dangerous to trust it blindly.

Regarding your points:

1- For the example design I gave (one of the sexes missing in one of the batches), we don't want to find sex-by-batch interactions, we just want to account for an average effect of batch across our groups (that's what the additive design is doing effectively). We're assuming that there is an average offset for each batch. For example, in batch 2 on average the expression was 3 units higher and in batch 3 on average the expression was 1 unit lower, regardless of sex or treatment. Something like that. So, we do want to include batch in the formula, but when we make our contrasts we don't want to include the batch coefficient.

2 - I'm not sure it would work, because sometimes we may want to have decimals. For example, if I want to compare the average of treatment2 and treatment3 against control, then the coefficients for treatment2 and treatment3 should be 0.5 each.

3 - Yes, as I say in #7 I'm not sure the weights make sense at all. I need to revise the whole tutorial to remove that.

The more people open issues, the more skeptical I am of my initial approach!

@atakanekiz
Copy link
Author

At the risk of making you scream "Enough already!", another possible solution comes to mind: The unbalanced experiments cause erroneously weighted (and different) contrasts. As you said, we may want to have numbers like 0.5, 0.33, 0.25 in the contrasts vector (representing averages of 2,3 and 4 groups respectively). However, there should be more than one of these numbers in the contrasts vector (eg. condA--> 0.5 condB --> 0.5). Would it work to engineer something into contraster() function to throw an error when weight = FALSE and the contrast vector contains only one occurrence of a number (eg. sexmale ---> 0.16, batch2 ---> -0.16). There must be a way to deal with this problem (except of course directly accessing the contrasts as it was initially intended)...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants