# ZINB_WaVE

This notebook shows how to apply zinbwave to a sample scRNA-seq dataset that contains 17 clusters, each with 20 genes.

In [1]:
# BiocManager::install("zinbwave")

In [2]:
library(zinbwave)
library(matrixStats)
library(magrittr)
library(ggplot2)
library(biomaRt)
library(data.table)

Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats

Attaching package: ‘MatrixGenerics’

The following objects are masked from ‘package:matrixStats’:

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges

In [3]:
df_g = read.csv(file = '../data/model_sel_genes.csv')
df_m = read.csv(unz('../data/meta.zip', "meta.tsv"),sep ="\t")
df = read.csv(unz('../data/model_sel_count.zip', "model_sel_count.csv"))

In [4]:
setnames(df_m, 
         old = c('post.mortem.interval..hours.', 'RNA.Integrity.Number', 'RNA.mitochondr..percent', 'RNA.ribosomal.percent'), 
         new = c('PMI', 'RIN', 'mito_pct', 'ribo_pct')  # order must match `old`
        )

In [5]:
# Natural-log transforms of the per-cell UMI and detected-gene counts
df_m['UMIs_log'] = log(df_m['UMIs'])
df_m['genes_log'] = log(df_m['genes'])

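zinbwave is fit here with a single design matrix, so the same features enter both the negative binomial (NB) part and the zero-inflation (logit) part. The features selected for the NB part are therefore applied to both parts. The log-likelihood obtained with these features is expected to be higher than the one obtained with different features on the NB and logit parts.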
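
As a sketch of what this means for the fit below (no latent factors, `K = 0`, and an empty `V`), the two parts can be written in the usual ZINB-WaVE parameterization. The symbols $x_i$, $\beta_{\mu,j}$ and $\beta_{\pi,j}$ are notation assumed here for illustration; they are not names used in the notebook.

$$
\ln \mu_{ij} = x_i^\top \beta_{\mu,j}, \qquad \operatorname{logit}(\pi_{ij}) = x_i^\top \beta_{\pi,j}
$$

Here $i$ indexes cells, $j$ indexes genes, $x_i$ is the row of the design matrix `X` for cell $i$, $\mu_{ij}$ is the NB mean and $\pi_{ij}$ is the zero-inflation probability. Both parts share the same covariates $x_i$ but have their own gene-specific coefficients.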

In [6]:
# z-score a numeric vector, ignoring missing values
normFunc <- function(x){(x-mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE)}
features = c("UMIs",
    "genes",
    "UMIs_log",
    "genes_log",
    "sex",
    "age",
    "Capbatch",
    "PMI",
    "RIN",
    "ribo_pct",
    "mito_pct")

f_to_norm  = c('UMIs_log',
 'age',
 'PMI',
 'RIN',
 'ribo_pct',
 'mito_pct')

clusters = unique(df_m[,'cluster'])
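
To illustrate what `normFunc` does to the columns listed in `f_to_norm`, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are hypothetical and are not taken from the dataset.

```r
# Same z-score helper as defined in the cell above
normFunc <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Hypothetical toy covariates, including one missing value
toy <- data.frame(age = c(20, 35, 50, NA, 65),
                  PMI = c(3, 12, 7, 9, 24))

# Standardize each selected column; missing values stay missing
toy_z <- toy
toy_z[c("age", "PMI")] <- lapply(toy[c("age", "PMI")], normFunc)

# After standardization each column has mean 0 and standard deviation 1
col_means <- sapply(toy_z, mean, na.rm = TRUE)
col_sds <- sapply(toy_z, sd, na.rm = TRUE)
print(round(col_means, 10))
print(round(col_sds, 10))
```

Note that a column that is constant within a cluster has a standard deviation of zero, so this helper would return `NaN` for it.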

In [7]:
formula_base =paste('~',paste(features,collapse='+'),sep='')
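
The formula string built above is later expanded into a design matrix with `model.matrix`. The following sketch shows that expansion on a hypothetical three-feature subset; `features_toy` and `toy` are made up for illustration.

```r
# Hypothetical subset of the feature list
features_toy <- c("UMIs_log", "sex", "age")
formula_toy <- paste0("~", paste(features_toy, collapse = "+"))

# Hypothetical toy metadata with one categorical covariate
toy <- data.frame(UMIs_log = c(7.1, 8.3, 6.9, 7.7),
                  sex = factor(c("F", "M", "M", "F")),
                  age = c(20, 35, 50, 65))

# model.matrix adds an intercept column and dummy-codes the factor
X_toy <- model.matrix(as.formula(formula_toy), data = toy)
print(formula_toy)
print(colnames(X_toy))
```

Categorical features such as `sex` or `Capbatch` therefore contribute one column per non-reference level, so the number of columns in `X` can exceed the number of features.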

In [8]:
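# zinb.loglik.matrix appears to have been removed in zinbwave 1.22.0, so it is redefined here.
# Returns the per-entry ZINB log-likelihood of the counts x (cells x genes).
zinb.loglik.matrix <- function(model, x) {
    mu <- getMu(model)        # NB means, cells x genes
    theta <- getTheta(model)  # gene-wise NB size (inverse dispersion) parameters
    # repeat each gene's theta down its column to match the shape of x
    theta_mat <- matrix(rep(theta, each = nrow(x)), ncol = ncol(x))
    pi <- getPi(model)        # zero-inflation probabilities, cells x genes
    lik <- pi * (x == 0) + (1 - pi) * dnbinom(x, size = theta_mat, mu = mu)
    lik[lik == 0] <- min(lik[lik != 0]) # floor zero likelihoods so the log stays finite
    log(lik)
}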
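
The likelihood computed by this function is a mixture of a point mass at zero and an NB distribution. The base-R sketch below reproduces that mixture on scalar toy parameters, without zinbwave; the helper name `zinb_loglik_toy` and all parameter values are hypothetical.

```r
# ZINB log-likelihood for counts x, with NB mean mu, NB size theta,
# and zero-inflation probability pi0 (hypothetical helper)
zinb_loglik_toy <- function(x, mu, theta, pi0) {
  lik <- pi0 * (x == 0) + (1 - pi0) * dnbinom(x, size = theta, mu = mu)
  log(lik)
}

x <- c(0, 0, 3, 7)

# Plain NB log-likelihood for comparison
ll_nb <- dnbinom(x, size = 2, mu = 4, log = TRUE)

# With pi0 = 0 the ZINB likelihood reduces to the NB likelihood
ll_zinb0 <- zinb_loglik_toy(x, mu = 4, theta = 2, pi0 = 0)

# With pi0 > 0 zeros become more likely and positive counts less likely
ll_zinb <- zinb_loglik_toy(x, mu = 4, theta = 2, pi0 = 0.3)

print(rbind(nb = ll_nb, zinb = ll_zinb))
```

For positive counts the ZINB log-likelihood differs from the NB one by exactly `log(1 - pi0)`, which is a quick sanity check on the mixture form.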

In [9]:
df_r = NULL
for (cluster in clusters) {
    print(cluster)
    gene_ids = df_g[df_g$cluster == cluster, "gene_id"]
    df_f = df_m[df_m$cluster == cluster, features]
    df_f[f_to_norm] <- apply(df_f[f_to_norm], 2, normFunc)
    dfy = df[df$cell %in% df_m[df_m$cluster == cluster, "cell"], names(df) %in% gene_ids]
    # zinb will throw an error if sum of counts in a cell is zero
    # create a dummy gene to resolve this issue
    dfy["dummy"] = 1
    Y = t(dfy)
    X = with(df_f, model.matrix(as.formula(formula_base)))

    # zinb wave
    start_time <- Sys.time()
    mod <- zinbFit(Y, X, V = matrix(nrow = dim(Y)[1], ncol = 0), 
        K = 0, epsilon = 0, commondispersion = FALSE)
    end_time <- Sys.time()
    # note: this is elapsed wall-clock time for the fit, despite the variable name
    cpu_time = difftime(end_time, start_time, units = "secs")
    llk = colSums(zinb.loglik.matrix(mod, t(Y)))
    df_t <- data.frame(gene_id = rownames(Y), llf = llk)
    df_t["cluster"] = cluster
    df_t["cpu_time"] = cpu_time/length(gene_ids)
    df_t["model"] = "zinb"
    df_t["method"] = "zinb_wave"
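    # drop the last row, which holds the dummy gene
    # (the original `1:n - 1` parses as `(1:n) - 1` and only worked because index 0 is ignored)
    df_t = df_t[seq_len(nrow(df_t) - 1), ]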

    if (is.null(df_r)) {
        df_r = df_t
    } else {
        df_r = rbind(df_r, df_t)
    }
    write.csv(df_r, "ZINB_WaVE.csv")
}

[1] "Neu-NRGN-II"
[1] "L5/6"
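
The dummy gene in the loop above guards against cells whose selected genes all have zero counts. A minimal sketch of how such cells could be detected, and of what the dummy row does, is shown here on a hypothetical toy matrix; `Y_toy` is not part of the dataset.

```r
# Hypothetical toy count matrix: genes in rows, cells in columns
Y_toy <- matrix(c(0, 2, 1,
                  0, 0, 4,
                  0, 5, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("gene", 1:3), paste0("cell", 1:3)))

# Cells whose total count over the selected genes is zero
zero_cells <- colnames(Y_toy)[colSums(Y_toy) == 0]
print(zero_cells)

# Appending a constant dummy gene makes every cell total positive
Y_dummy <- rbind(Y_toy, dummy = 1)
print(colSums(Y_dummy))

# The dummy row can later be dropped by name rather than by position
Y_back <- Y_dummy[rownames(Y_dummy) != "dummy", ]
```

Dropping the dummy by name, as in the last line, does not depend on where the dummy row sits in the matrix.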


In [10]:
df_r

| gene_id | llf | cluster | cpu_time | model | method |
|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<chr>` | `<drtn>` | `<chr>` | `<chr>` |
| ENSG00000162545 | -10680.0 | Neu-NRGN-II | 0.3713011 secs | zinb | zinb_wave |
| ENSG00000117632 | -13077.787 | Neu-NRGN-II | 0.3713011 secs | zinb | zinb_wave |
| ENSG00000034510 | -15290.443 | Neu-NRGN-II | 0.3713011 secs | zinb | zinb_wave |
| ENSG00000138814 | -10512.199 | Neu-NRGN-II | 0.3713011 secs | zinb | zinb_wave |
| ENSG00000171617 | -14341.369 | Neu-NRGN-II | 0.3713011 secs | zinb | zinb_wave |
| ENSG00000169567 | -8008.342 | Neu-NRGN-II | 0.3713011 secs | zinb | zinb_wave |
| ENSG00000189043 | -7851.619 | Neu-NRGN-II | 0.3713011 secs | zinb | zinb_wave |
| ENSG00000205542 | -14630.023 | Neu-NRGN-II | 0.3713011 secs | zinb | zinb_wave |
| ENSG00000133169 | -12316.382 | Neu-NRGN-II | 0.3713011 secs | zinb | zinb_wave |
| ENSG00000166681 | -12259.992 | Neu-NRGN-II | 0.3713011 secs | zinb | zinb_wave |
