Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running MAST with random effect for subject via FindMarkers #3712

Closed
esrasefik opened this issue Nov 11, 2020 · 9 comments
Closed

Running MAST with random effect for subject via FindMarkers #3712

esrasefik opened this issue Nov 11, 2020 · 9 comments

Comments

@esrasefik
Copy link

Dear Seurat Team,

I am contacting you in regards to a question about how to use your FindMarkers function to run MAST with a random effect added for subject. I have a single cell RNAseq dataset with two genotypes (4 subjects each) and I’m trying to run a cluster specific differential expression analysis between these two genotypes to identify cell-type specific DEGs. I was able to successfully run MAST for this purpose via your FindMarkers function by using the instructions provided in your vignettes. I now would like to slightly modify my approach and add a random effect for subject to avoid a simple pseudoreplication type issue while keeping other parameters (i.e., logfc.threshold, min.pct etc.) the same as my original approach. To my knowledge, it should be possible to tweak the MAST algorithm to incorporate a random effect but I couldn’t find any instructions online on using Seurat to do this. I would appreciate any recommendations you could give me.

Here is the code I used for running MAST without a random effect (ident.1 and ident.2 indicate wild type and mutant genotypes):

FindMarkers(s_obj4, ident.1 = "0_2", ident.2 = "0_1", min.pct = 0.1, logfc.threshold = 0.1, test.use = "MAST", verbose = FALSE, assay = "RNA", slot = "data", latent.vars = "nCount_RNA", pseudocount.use = 0.001)

Thank you!
Esra

@esrasefik esrasefik changed the title Running MAST with random effects via FindMarkers Running MAST with random effect for subject via FindMarkers Nov 11, 2020
@jaisonj708
Copy link
Collaborator

We currently do not support this, but you may try altering our MAST wrapper to incorporate a random effect. The hurdle model is specified by the formula parameter in the zlm function, called in MASTDETest:

zlmCond <- MAST::zlm(formula = fmla, sca = sca, ...)

@jgamache014
Copy link

@esrasefik just wondering if you've gotten this to work, and if you wouldn't mind sharing your script? I am trying to do the same with my differential expression analysis.

@esrasefik
Copy link
Author

esrasefik commented Jan 27, 2021

@jgamache014 - I unfortunately haven't been able to. I decided to switch to using MAST directly instead of using the Seurat interface given the little support the developers provided for this issue. I don't yet have a solution I can share - still working on it. Sorry!

@jgamache014
Copy link

@esrasefik Thanks, I'll try using MAST directly as well!

@JonThom
Copy link

JonThom commented Feb 26, 2021

@jgamache014 @esrasefik have any of you succeeded? If not, what have been the stumbling blocks? I'd like to have a crack at it.

@rdalbanus
Copy link

This clunky solution will add this functionality to FindMarkers by making minor changes to two internal functions in Seurat and MAST. I haven't tested this extensively, but it seems to be working as expected so far. It's helpful for exploratory analyses when you only have a seurat object. However, I'd strongly suggest going through the steps in the MAST vignette (converting Seurat to SummarizedExperiment, etc) if you are planning to use this for important analyses.

Note that it doesn't allow using empirical Bayes estimation to regularize variance, and it's considerably slower.

# We want a different behavior from MASTDETest, where the latent variable is
# used as a random effect in the model
rem_MASTDETest <- function(
  data.use,
  cells.1,
  cells.2,
  latent.vars = NULL,
  verbose = TRUE,
  # New option - random effect variable (should be included in latent.vars)
  re.var = NULL,
  ...
) {
  # Check for MAST
  if (!PackageCheck('MAST', error = FALSE)) {
    stop("Please install MAST - learn more at https://github.com/RGLab/MAST")
  }
  group.info <- data.frame(row.names = c(cells.1, cells.2))
  latent.vars <- latent.vars %||% group.info
  group.info[cells.1, "group"] <- "Group1"
  group.info[cells.2, "group"] <- "Group2"
  group.info[, "group"] <- factor(x = group.info[, "group"])
  latent.vars.names <- c("condition", colnames(x = latent.vars))
  latent.vars <- cbind(latent.vars, group.info)
  latent.vars$wellKey <- rownames(x = latent.vars)
  fdat <- data.frame(rownames(x = data.use))
  colnames(x = fdat)[1] <- "primerid"
  rownames(x = fdat) <- fdat[, 1]
  sca <- MAST::FromMatrix(
    exprsArray = as.matrix(x = data.use),
    check_sanity = FALSE,
    cData = latent.vars,
    fData = fdat
  )
  cond <- factor(x = SummarizedExperiment::colData(sca)$group)
  cond <- relevel(x = cond, ref = "Group1")
  SummarizedExperiment::colData(sca)$condition <- cond

  # This is the main change in the code - we want ~ ( 1 | re.var) in the formula:
  # ~ condition + lat.vars + (1 | re.var)
  if (!is.null(re.var)) {
    if (!re.var %in% latent.vars.names) {
      stop("Random effect variable should be included in latent variables!")
    }
    latent.vars.names <- latent.vars.names[!latent.vars.names %in% re.var]
    fmla <- as.formula(
      object = paste0(
        " ~ ", paste(latent.vars.names, collapse = "+"), glue("+ (1|{re.var})")
      )
    )
    # print(fmla)
    # We need glmer to make this work
    method <-  "glmer" # trying to troubleshoot this - it can clash with the already existing method var in the function call    
    zlmCond <- MAST::zlm(formula = fmla, sca = sca, method = "glmer", ...)
  } else {
    # Original code
    fmla <- as.formula(
      object = paste0(" ~ ", paste(latent.vars.names, collapse = "+"))
    )
    zlmCond <- MAST::zlm(formula = fmla, sca = sca, ...)
  }
  summaryCond <- MAST::summary(object = zlmCond, doLRT = 'conditionGroup2')
  summaryDt <- summaryCond$datatable
  
  # The output format is slightly different, so we need adapt the code
  if(!is.null(re.var)) {
    p_val <- summaryDt[summaryDt$"component" == "H", 4]$`Pr(>Chisq)`
    genes.return <- summaryDt[summaryDt$"component" == "H", 1]$primerid
  } else {
    p_val <- summaryDt[summaryDt[, "component"] == "H", 4]
    genes.return <- summaryDt[summaryDt[, "component"] == "H", 1]
  }
  
  to.return <- data.frame(p_val, row.names = genes.return)
  return(to.return)
}

# This is the zlm function with a minor change to fix a buggy variable being passed possibly due to namespace clash
my_zlm <- function (formula, sca, method = "bayesglm", silent = TRUE, ebayes = TRUE, 
    ebayesControl = NULL, force = FALSE, hook = NULL, parallel = TRUE, 
    LMlike, onlyCoef = FALSE, exprs_values = MAST::assay_idx(sca)$aidx, 
    ...) 
{
    dotsdata = list(...)$data
    if (!is.null(dotsdata)) {
        if (!missing(sca)) 
            stop("Cannot provide both `sca` and `data`")
        sca = dotsdata
    }
    if (!inherits(sca, "SingleCellAssay")) {
        if (inherits(sca, "data.frame")) {
            if (!is.null(dotsdata)) {
                return(.zlm(formula, method = method, silent = silent, 
                  ...))
            }
            else {
                return(.zlm(formula, data = sca, method = method, 
                  silent = silent, ...))
            }
        }
        else {
            stop("`sca` must inherit from `data.frame` or `SingleCellAssay`")
        }
    }
    if (missing(LMlike)) {
        method <- match.arg(method, MAST:::methodDict[, keyword])
        method <- MAST:::methodDict[keyword == method, lmMethod]
        if (!is(sca, "SingleCellAssay")) 
            stop("'sca' must be (or inherit) 'SingleCellAssay'")
        if (!is(formula, "formula")) 
            stop("'formula' must be class 'formula'")
        Formula <- MAST:::removeResponse(formula)
        priorVar <- 1
        priorDOF <- 0
        if (ebayes) {
            # if (!methodDict[lmMethod == method, implementsEbayes]) 
          if (!all(MAST:::methodDict[lmMethod == method, implementsEbayes]))
          stop("Method", method, " does not implement empirical bayes variance shrinkage.")
            ebparm <- MAST:::ebayes(t(SummarizedExperiment:::assay(sca, exprs_values)), ebayesControl, 
                model.matrix(Formula, SummarizedExperiment::colData(sca)))
            priorVar <- ebparm[["v"]]
            priorDOF <- ebparm[["df"]]
            stopifnot(all(!is.na(ebparm)))
        }
        obj <- MAST:::new_with_repaired_slots(classname = method, design = SummarizedExperiment::colData(sca), 
            formula = Formula, priorVar = priorVar, priorDOF = priorDOF, 
            extra = list(...))
    }
    else {
        if (!missing(formula)) 
            warning("Ignoring formula and using model defined in 'objLMLike'")
        if (!inherits(LMlike, "LMlike")) 
            stop("'LMlike' must inherit from class 'LMlike'")
        obj <- LMlike
    }
    ee <- t(SummarizedExperiment:::assay(sca, exprs_values))
    genes <- colnames(ee)
    ng <- length(genes)
    MM <- MAST:::model.matrix(obj)
    coefNames <- colnames(MM)
    listEE <- setNames(seq_len(ng), genes)
    obj <- MAST:::fit(obj, ee[, 1], silent = silent)
    nerror <- totalerr <- 0
    pb = progress::progress_bar$new(total = ng, format = " Completed [:bar] :percent with :err failures")
    .fitGeneSet <- function(idx) {
        hookOut <- NULL
        tt <- try({
            obj <- MAST:::fit(obj, response = ee[, idx], silent = silent, 
                quick = TRUE)
            if (!is.null(hook)) 
                hookOut <- hook(obj)
            nerror <- 0
        })
        if (is(tt, "try-error")) {
            obj@fitC <- obj@fitD <- NULL
            obj@fitted <- c(C = FALSE, D = FALSE)
            nerror <- nerror + 1
            totalerr = totalerr + 1
            if (nerror > 5 & !force) {
                stop("We seem to be having a lot of problems here...are your tests specified correctly?  \n If you're sure, set force=TRUE.", 
                  tt)
            }
        }
        pb$tick(tokens = list(err = totalerr))
        if (onlyCoef) 
            return(cbind(C = coef(obj, "C"), D = coef(obj, "D")))
        summaries <- MAST:::summarize(obj)
        structure(summaries, hookOut = hookOut)
    }
    if (!parallel || getOption("mc.cores", 1L) == 1) {
        listOfSummaries <- lapply(listEE, .fitGeneSet)
    }
    else {
        listOfSummaries <- parallel::mclapply(listEE, .fitGeneSet, 
            mc.preschedule = TRUE, mc.silent = silent)
    }
    if (onlyCoef) {
        out <- do.call(abind, c(listOfSummaries, rev.along = 0))
        return(aperm(out, c(3, 1, 2)))
    }
    cls <- sapply(listOfSummaries, function(x) class(x))
    complain <- if (force) 
        warning
    else stop
    if (mean(cls == "try-error") > 0.5) 
        complain("Lots of errors here..something is amiss.")
    hookOut <- NULL
    if (!is.null(hook)) 
        hookOut <- lapply(listOfSummaries, attr, which = "hookOut")
    message("\nDone!")
    summaries <- MAST:::collectSummaries(listOfSummaries)
    summaries[["LMlike"]] <- obj
    summaries[["sca"]] <- sca
    summaries[["priorVar"]] <- obj@priorVar
    summaries[["priorDOF"]] <- obj@priorDOF
    summaries[["hookOut"]] <- hookOut
    summaries[["exprs_values"]] <- exprs_values
    summaries[["Class"]] <- "ZlmFit"
    zfit <- do.call(new, as.list(summaries))
    zfit
}

# We will replace the original function with ours inside the Seurat namespace
assignInNamespace('MASTDETest', rem_MASTDETest, asNamespace("Seurat"))
# getFromNamespace("MASTDETest", "Seurat")

assignInNamespace('zlm', my_zlm, asNamespace("MAST"))
# getFromNamespace("zlm", "MAST")

# And now we can call our FindMarkers with random effects model for sample_ID
# Don't forget to include ebayes = F
markers <- FindMarkers(
    sobj, ident.1 = "foo", ident.2 = "bar", only.pos = TRUE, 
    test.use = "MAST", verbose = F
    , latent.vars = "sample_ID", re.var = "sample_ID", ebayes = F
  )

@JonThom
Copy link

JonThom commented Jan 12, 2023

@rdalbanus thanks for sharing!

I also wrote a function - it's a while back now, but here it is:
https://github.com/CBMR-Single-Cell-Omics-Platform/SCOPfunctions/blob/main/R/differential_expression.R

@andrewdchen
Copy link

This clunky solution will add this functionality to FindMarkers by making minor changes to two internal functions in Seurat and MAST. I haven't tested this extensively, but it seems to be working as expected so far. It's helpful for exploratory analyses when you only have a seurat object. However, I'd strongly suggest going through the steps in the MAST vignette (converting Seurat to SummarizedExperiment, etc) if you are planning to use this for important analyses.

Note that it doesn't allow using empirical Bayes estimation to regularize variance, and it's considerably slower.

# We want a different behavior from MASTDETest, where the latent variable is
# used as a random effect in the model
rem_MASTDETest <- function(
  data.use,
  cells.1,
  cells.2,
  latent.vars = NULL,
  verbose = TRUE,
  # New option - random effect variable (should be included in latent.vars)
  re.var = NULL,
  ...
) {
  # Check for MAST
  if (!PackageCheck('MAST', error = FALSE)) {
    stop("Please install MAST - learn more at https://github.com/RGLab/MAST")
  }
  group.info <- data.frame(row.names = c(cells.1, cells.2))
  latent.vars <- latent.vars %||% group.info
  group.info[cells.1, "group"] <- "Group1"
  group.info[cells.2, "group"] <- "Group2"
  group.info[, "group"] <- factor(x = group.info[, "group"])
  latent.vars.names <- c("condition", colnames(x = latent.vars))
  latent.vars <- cbind(latent.vars, group.info)
  latent.vars$wellKey <- rownames(x = latent.vars)
  fdat <- data.frame(rownames(x = data.use))
  colnames(x = fdat)[1] <- "primerid"
  rownames(x = fdat) <- fdat[, 1]
  sca <- MAST::FromMatrix(
    exprsArray = as.matrix(x = data.use),
    check_sanity = FALSE,
    cData = latent.vars,
    fData = fdat
  )
  cond <- factor(x = SummarizedExperiment::colData(sca)$group)
  cond <- relevel(x = cond, ref = "Group1")
  SummarizedExperiment::colData(sca)$condition <- cond

  # This is the main change in the code - we want ~ ( 1 | re.var) in the formula:
  # ~ condition + lat.vars + (1 | re.var)
  if (!is.null(re.var)) {
    if (!re.var %in% latent.vars.names) {
      stop("Random effect variable should be included in latent variables!")
    }
    latent.vars.names <- latent.vars.names[!latent.vars.names %in% re.var]
    fmla <- as.formula(
      object = paste0(
        " ~ ", paste(latent.vars.names, collapse = "+"), glue("+ (1|{re.var})")
      )
    )
    # print(fmla)
    # We need glmer to make this work
    method <-  "glmer" # trying to troubleshoot this - it can clash with the already existing method var in the function call    
    zlmCond <- MAST::zlm(formula = fmla, sca = sca, method = "glmer", ...)
  } else {
    # Original code
    fmla <- as.formula(
      object = paste0(" ~ ", paste(latent.vars.names, collapse = "+"))
    )
    zlmCond <- MAST::zlm(formula = fmla, sca = sca, ...)
  }
  summaryCond <- MAST::summary(object = zlmCond, doLRT = 'conditionGroup2')
  summaryDt <- summaryCond$datatable
  
  # The output format is slightly different, so we need adapt the code
  if(!is.null(re.var)) {
    p_val <- summaryDt[summaryDt$"component" == "H", 4]$`Pr(>Chisq)`
    genes.return <- summaryDt[summaryDt$"component" == "H", 1]$primerid
  } else {
    p_val <- summaryDt[summaryDt[, "component"] == "H", 4]
    genes.return <- summaryDt[summaryDt[, "component"] == "H", 1]
  }
  
  to.return <- data.frame(p_val, row.names = genes.return)
  return(to.return)
}

# This is the zlm function with a minor change to fix a buggy variable being passed possibly due to namespace clash
my_zlm <- function (formula, sca, method = "bayesglm", silent = TRUE, ebayes = TRUE, 
    ebayesControl = NULL, force = FALSE, hook = NULL, parallel = TRUE, 
    LMlike, onlyCoef = FALSE, exprs_values = MAST::assay_idx(sca)$aidx, 
    ...) 
{
    dotsdata = list(...)$data
    if (!is.null(dotsdata)) {
        if (!missing(sca)) 
            stop("Cannot provide both `sca` and `data`")
        sca = dotsdata
    }
    if (!inherits(sca, "SingleCellAssay")) {
        if (inherits(sca, "data.frame")) {
            if (!is.null(dotsdata)) {
                return(.zlm(formula, method = method, silent = silent, 
                  ...))
            }
            else {
                return(.zlm(formula, data = sca, method = method, 
                  silent = silent, ...))
            }
        }
        else {
            stop("`sca` must inherit from `data.frame` or `SingleCellAssay`")
        }
    }
    if (missing(LMlike)) {
        method <- match.arg(method, MAST:::methodDict[, keyword])
        method <- MAST:::methodDict[keyword == method, lmMethod]
        if (!is(sca, "SingleCellAssay")) 
            stop("'sca' must be (or inherit) 'SingleCellAssay'")
        if (!is(formula, "formula")) 
            stop("'formula' must be class 'formula'")
        Formula <- MAST:::removeResponse(formula)
        priorVar <- 1
        priorDOF <- 0
        if (ebayes) {
            # if (!methodDict[lmMethod == method, implementsEbayes]) 
          if (!all(MAST:::methodDict[lmMethod == method, implementsEbayes]))
          stop("Method", method, " does not implement empirical bayes variance shrinkage.")
            ebparm <- MAST:::ebayes(t(SummarizedExperiment:::assay(sca, exprs_values)), ebayesControl, 
                model.matrix(Formula, SummarizedExperiment::colData(sca)))
            priorVar <- ebparm[["v"]]
            priorDOF <- ebparm[["df"]]
            stopifnot(all(!is.na(ebparm)))
        }
        obj <- MAST:::new_with_repaired_slots(classname = method, design = SummarizedExperiment::colData(sca), 
            formula = Formula, priorVar = priorVar, priorDOF = priorDOF, 
            extra = list(...))
    }
    else {
        if (!missing(formula)) 
            warning("Ignoring formula and using model defined in 'objLMLike'")
        if (!inherits(LMlike, "LMlike")) 
            stop("'LMlike' must inherit from class 'LMlike'")
        obj <- LMlike
    }
    ee <- t(SummarizedExperiment:::assay(sca, exprs_values))
    genes <- colnames(ee)
    ng <- length(genes)
    MM <- MAST:::model.matrix(obj)
    coefNames <- colnames(MM)
    listEE <- setNames(seq_len(ng), genes)
    obj <- MAST:::fit(obj, ee[, 1], silent = silent)
    nerror <- totalerr <- 0
    pb = progress::progress_bar$new(total = ng, format = " Completed [:bar] :percent with :err failures")
    .fitGeneSet <- function(idx) {
        hookOut <- NULL
        tt <- try({
            obj <- MAST:::fit(obj, response = ee[, idx], silent = silent, 
                quick = TRUE)
            if (!is.null(hook)) 
                hookOut <- hook(obj)
            nerror <- 0
        })
        if (is(tt, "try-error")) {
            obj@fitC <- obj@fitD <- NULL
            obj@fitted <- c(C = FALSE, D = FALSE)
            nerror <- nerror + 1
            totalerr = totalerr + 1
            if (nerror > 5 & !force) {
                stop("We seem to be having a lot of problems here...are your tests specified correctly?  \n If you're sure, set force=TRUE.", 
                  tt)
            }
        }
        pb$tick(tokens = list(err = totalerr))
        if (onlyCoef) 
            return(cbind(C = coef(obj, "C"), D = coef(obj, "D")))
        summaries <- MAST:::summarize(obj)
        structure(summaries, hookOut = hookOut)
    }
    if (!parallel || getOption("mc.cores", 1L) == 1) {
        listOfSummaries <- lapply(listEE, .fitGeneSet)
    }
    else {
        listOfSummaries <- parallel::mclapply(listEE, .fitGeneSet, 
            mc.preschedule = TRUE, mc.silent = silent)
    }
    if (onlyCoef) {
        out <- do.call(abind, c(listOfSummaries, rev.along = 0))
        return(aperm(out, c(3, 1, 2)))
    }
    cls <- sapply(listOfSummaries, function(x) class(x))
    complain <- if (force) 
        warning
    else stop
    if (mean(cls == "try-error") > 0.5) 
        complain("Lots of errors here..something is amiss.")
    hookOut <- NULL
    if (!is.null(hook)) 
        hookOut <- lapply(listOfSummaries, attr, which = "hookOut")
    message("\nDone!")
    summaries <- MAST:::collectSummaries(listOfSummaries)
    summaries[["LMlike"]] <- obj
    summaries[["sca"]] <- sca
    summaries[["priorVar"]] <- obj@priorVar
    summaries[["priorDOF"]] <- obj@priorDOF
    summaries[["hookOut"]] <- hookOut
    summaries[["exprs_values"]] <- exprs_values
    summaries[["Class"]] <- "ZlmFit"
    zfit <- do.call(new, as.list(summaries))
    zfit
}

# We will replace the original function with ours inside the Seurat namespace
assignInNamespace('MASTDETest', rem_MASTDETest, asNamespace("Seurat"))
# getFromNamespace("MASTDETest", "Seurat")

assignInNamespace('zlm', my_zlm, asNamespace("MAST"))
# getFromNamespace("zlm", "MAST")

# And now we can call our FindMarkers with random effects model for sample_ID
# Don't forget to include ebayes = F
markers <- FindMarkers(
    sobj, ident.1 = "foo", ident.2 = "bar", only.pos = TRUE, 
    test.use = "MAST", verbose = F
    , latent.vars = "sample_ID", re.var = "sample_ID", ebayes = F
  )

@rdalbanus Thanks for writing this script! I'm wondering why you wrote it in a way such that the random effect variable needs to also be a latent variable?

@rdalbanus
Copy link

rdalbanus commented Nov 2, 2023

@andrewdchen
I only did it this way to keep code changes to a minimum - you can see that there's some internal processing of the latent.vars flag, and I didn't want risk introducing bugs.

The latent.vars flag captures all the covariates you want to include in the analyses, which get converted into a formula in the fmla <- as.formula(...) line (e.g. latent.vars = c("sex", "age") results in a model ~ sex + age in the original code).

I only added a flag that selects one of these covariates and puts it in a (1 | x ) enclosing (e.g. latent.vars = c("donor", "sex", "age"), re.var = "donor" results in ~ age + sex + (1|donor)).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants