Skip to content

thecailab/SCUTA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

40 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SCUTA

linear mixed modelling for single cell RNAseq time series data with multilevel design

install SCUTA

library(devtools)
install_github("thecailab/SCUTA")

Analysis flowchart

pipelineflowchart

Model fitting tutorial

SCUTA is a tools that designed for fitting linear mixed models for single cell RNAseq datasets, especially with longitudinal multi-level design. Two multi-level designs are shown in the flowchart. Focusing on design 1, we show how to integrate precison weights and dropout weights into the differential expression analysis in scRNA-seq data.

Load library and data
library("SCUTA")

The imput matrix is the count matrix and meta data. counts is expression matrix whhere columns represent cells, rows represent genes. coldata is the meta data including condition, individual and time information of each cell.

data(example.data)

Load libraries

library("zinbwave")
library("DESeq2")
library("edgeR")
library("variancePartition")

prepare DESeqDataSet file and specify the variables that are taken into acount handling dropout events.

counts  <- example.data$counts
coldata <- example.data$coldata

fluidigm <- DESeqDataSetFromMatrix(countData = counts,
                                   colData = coldata,
                                   design = ~ condition + time)

The zinbwave function is used to compute dropout weights which alleviate the bias casued by dropout event and thus unlock bulk RNA-seq tools for single-cell applications, as illustrated in Van den Berge et al. 2018.

zinb <- zinbFit(fluidigm, K=2, epsilon=1000)
fluidigm_zinb <- zinbwave(fluidigm, fitted_model = zinb, K = 2, epsilon=1000, observationalWeights = TRUE)
weights <- assay(fluidigm_zinb, "weights")

voomWithDreamWeights function is used to transform count data to log2-counts per million (logCPM), estimate the mean-variance relationship and compute the precision weights, which capture the measurement error in RNA-seq counts. Then, the zinb weights and mean-variance weights are combined together as the overall weights. At last, a three-level linear mixed model is fitted for the data. The model can be expressed as

$$ E(log_2(Y_ig)) = \beta_{0g} + \beta_{1g} * condition_{i} + \beta_{2g} * time_{i} + \alpha_{0ig} + \alpha_{1ig} * time_{i} $$

Where $i$ denotes cell, $g$ represents gene, $\alpha_{0ig}$ is the random sample effect, and $\alpha_{1ig}$ is the random time effect.

d0 <- DGEList(counts)
d0 <- calcNormFactors(d0)
form <- ~ condition + time + (1 + time |individual)
vobjDream = voomWithDreamWeights( d0, form, coldata )
vobjDream.weight<-vobjDream
vobjDream.weight$weights <- weights*vobjDream.weight$weights

getContrast is used to specify the contrast matrix for linear mixed model. The three-level linear mixed model is fitted by suing SCUTA function.

fit.SCUTA     <- SCUTA( vobjDream.weight, form, coldata)

The SCUTA() function is modified to replace the 'dream()' function in variancePartition, so that they share the same parameters and any function in variance partition that used combined with dream() function can be used in conjuction with 'SCUTA()' function. For example, the top 6 differentiall expressed genes between two conditions are

fit.SCUTA.res <- topTable(fit.SCUTA, coef="condition2", number=nrow(counts) )
head(fit.SCUTA.res)

            logFC  AveExpr         t      P.Value    adj.P.Val     z.std
Gene558 -1.451388 4.991292 -9.796536 1.481199e-09 1.478237e-06 -6.046398
Gene569 -1.479936 4.852454 -8.988242 6.702487e-08 3.195357e-05 -5.398969
Gene740 -1.337853 5.140980 -8.551072 9.605281e-08 3.195357e-05 -5.334037
Gene379 -1.452054 5.376776 -7.665623 4.786163e-07 9.380289e-05 -5.034693
Gene98  -1.358545 5.261751 -7.681800 5.068412e-07 9.380289e-05 -5.023705
Gene213 -1.286129 5.877459 -7.457352 5.639453e-07 9.380289e-05 -5.003172

Where logFC is the log fold change comparing condition 2 with condition 1, P.Value and adj.P.Val are the unadjusted and FDR adjusted P values, respectively.

Example of linear mixed models

For the real data application in the manuscript, we fit 3 linear mixed models with respect to different aims. The scRNA-seq study consists of 1529 cells from 88 human embryos across 5 days (Day 3 - Day 7). Three lineages were identified after Day 5, including trophectoderm (TE), primitive endoderm (PE), and epiblast (EPI). Cells from each embryo in each time point were sequenced. Model 1 was fitted for cells in each lineage, separately, by also including cells in Day 3 and Day 4 in the model. The aim was to identify genes which show temporal trend in each lineage when considering correlations among cells within the same embryo.

$$Model 1: E(log_2(Y_ig)) = \beta_{0g} + \beta_{1g} * time_{i} + \alpha_{ig} $$

Where g represents gene while i denotes embryo. Model 2 was fitted for cells in each time point from Day 5 to Day 7, respectively, to detect genes that show differential expression between lineages.

$$Model 2: E(log_2(Y_ig)) = \beta_{0g} + \beta_{2g} * lineage_{i} + \alpha_{ig} $$

Model 3 was fitted for cells from Day 5 to Day 7 with fixed effect of time and lineage, random effect of lineage and embryo according to the hierarchy structure.

$$Model 3: E(log_2(Y_ig)) = \beta_{0g} + \beta_{1g} * time_{i} + \beta_{2g} * lineage_{i} + \alpha_{1ig} + \alpha_{2ig} $$

Where $\alpha_{1ig}$ represents the random effect for embryo, and $\alpha_{2ig}$ capture the within lineage variation with k denoting the lineage.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages