## Gene regulatory network inference using RNA and ATAC features

Using chromatin-accessible peaks together with target genes is helpful because of an assumed mechanistic relationship between the two during gene regulation. Briefly, regulatory regions annotated as promoters and proximal/distal enhancers act during the early phases of gene expression regulation, and an increase or decrease in their accessibility can be used as a proxy for changes in the overall regulatory state. Hence, the correlation between proximal and distal accessible elements and target genes within a genomic neighborhood (e.g. less than 200 kbp) is a relevant and useful way to annotate regulatory relationships and to include them in Gene Regulatory Network (GRN) inference.

Given sequencing data describing gene (RNA) and peak (ATAC) features, tools that build correlation matrices between peaks and genes are useful for summarizing the strongest interactions. In this notebook, we use the package FigR to walk through these GRN-building steps on one donor of the NeurIPS dataset. The preparation steps of this notebook use cisTopic to describe the most important peak groups that explain the cell-to-cell variability in the observed count matrix.
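As a minimal sketch of the idea above, the following toy example pairs peaks with a gene when they fall inside a window around its transcription start site and then correlates accessibility with expression. All coordinates, counts, object names and the 200 kbp window are assumptions made for illustration; this is not how FigR itself is implemented.

```r
# Toy example: pair peaks with a gene's TSS inside a fixed window, then correlate.
# All coordinates and counts below are made up for illustration.
set.seed(1)
n_cells <- 50

# Hypothetical peak midpoints and one gene TSS on the same chromosome
peak_mid <- c(peak_a = 95000, peak_b = 180000, peak_c = 2500000)
tss <- 100000
window_bp <- 200000  # assumed neighborhood: 200 kbp on either side of the TSS

# Keep only peaks within the window around the TSS
in_window <- abs(peak_mid - tss) <= window_bp
candidate_peaks <- names(peak_mid)[in_window]

# Simulated accessibility (peaks x cells); the gene follows peak_a plus noise
acc <- matrix(rpois(3 * n_cells, lambda = 3), nrow = 3,
              dimnames = list(names(peak_mid), NULL))
expr <- acc["peak_a", ] + rnorm(n_cells, sd = 0.5)

# Spearman correlation between each candidate peak and the gene
peak_gene_cor <- sapply(candidate_peaks, function(p) {
  cor(acc[p, ], expr, method = "spearman")
})
print(round(peak_gene_cor, 2))
```

In this simulation only `peak_a` drives the gene, so it should stand out from `peak_b`, while `peak_c` is never tested because it lies outside the window.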

### Environment setup

In [1]:
suppressWarnings(library(FigR))

The downloaded data belongs to a SHARE-seq dataset and is part of the core FigR tutorials.

In [3]:
setwd('/mnt/c/Users/ignacio.ibarra/Dropbox/workspace/theislab/extended-single-cell-best-practices/jupyter-book/mechanisms')

In [4]:
# for testing purposes, here we define a subset of features
ncells = 100
nfeatures = 6227

Load multiome data from NeurIPS (donor s1d1)

Loading RNA

In [5]:
library(Matrix)
m <- read.csv('../../data/openproblems_bmmc_multiome_genes_filtered_rna_s1d1.csv.gz', nrows=ncells, row.names=1)
# this is a cells x genes matrix, so we have to transpose it to match with FigR (genes x cells)
RNAmat <- t(as(as.matrix(m), "sparseMatrix"))
dim(RNAmat)
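The orientation change above can be checked on a toy table. This is a small sketch with made-up values and hypothetical names (`m_toy`, `rna_toy`), using only the Matrix package.

```r
library(Matrix)

# Toy cells x genes table, shaped like the CSV read above (values are made up)
m_toy <- data.frame(GeneA = c(0, 2, 0, 1),
                    GeneB = c(1, 0, 0, 0),
                    GeneC = c(0, 0, 5, 0),
                    row.names = c("cell1", "cell2", "cell3", "cell4"))

# Convert to a sparse matrix and transpose to genes x cells
rna_toy <- t(Matrix(as.matrix(m_toy), sparse = TRUE))

dim(rna_toy)       # genes x cells
rownames(rna_toy)  # gene symbols
colnames(rna_toy)  # cell barcodes
```

After the transpose, gene symbols are the row names and cell barcodes are the column names, which is the layout the comment above describes for FigR.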

### Load the ATAC-seq data

In [7]:
obs <- read.csv('../../data/openproblems_bmmc_multiome_genes_filtered_atac_s1d1_obs.csv.gz', row.names=1, nrows=ncells)
dim(obs)

Convert the raw counts from donor s1d1 into a SummarizedExperiment object

In [8]:
library(SummarizedExperiment)

Load peak counts for donor s1d1

In [22]:
ATAC <- read.csv('../../data/openproblems_bmmc_multiome_genes_filtered_atac_s1d1_counts_transposed_1000.csv.gz', nrows=nfeatures, row.names=1)
print(dim(ATAC))

# ncells == -1 means "keep all cells"; otherwise keep the first ncells columns
if (ncells != -1) {
    ATAC <- ATAC[, seq_len(ncells)]
}

[1] 1008 6227


Convert into SummarizedExperiment object

In [23]:
# colnames(ATAC)

In [25]:
ATAC.se <- makeSummarizedExperimentFromDataFrame(ATAC)
ATAC_colData <- read.csv('../../data/openproblems_bmmc_multiome_genes_filtered_atac_s1d1_obs.csv.gz', nrows=ncells, row.names=1)


Set up the colData columns from ATAC

In [26]:
# copy every per-cell annotation column into the SummarizedExperiment
suppressWarnings(for (col_name in colnames(ATAC_colData)) {
    colData(ATAC.se)[[col_name]] <- ATAC_colData[[col_name]]
})

In [27]:
c(dim(ATAC.se), dim(RNAmat))
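Comparing dimensions only shows that the cell counts agree. For paired data it is also worth confirming that both objects list the same cells in the same order. The sketch below uses hypothetical barcodes; in the notebook they would come from `colnames(ATAC.se)` and `colnames(RNAmat)`.

```r
# Hypothetical barcodes for illustration only
atac_cells <- c("AAAC-1", "AAAG-1", "AACT-1", "AAGG-1")
rna_cells  <- c("AAGG-1", "AAAC-1", "AACT-1", "TTTT-1")

# Cells present in both modalities, kept in ATAC order
shared <- intersect(atac_cells, rna_cells)

# Index vectors that would reorder each object to the shared cell set
atac_idx <- match(shared, atac_cells)
rna_idx  <- match(shared, rna_cells)

# After subsetting, both modalities list the same cells in the same order
stopifnot(identical(atac_cells[atac_idx], rna_cells[rna_idx]))
shared
```

The index vectors could then be used to subset the columns of both objects before running any correlation step.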

Preprocess the SummarizedExperiment object and the RNA matrix (optional cell subsampling, removal of unexpressed genes)

In [28]:
set.seed(123)
# cellsToKeep <- sample(colnames(ATAC.se), size = 10000, replace = FALSE)
# ATAC.se <- ATAC.se[,cellsToKeep]
# RNAmat <- RNAmat[,cellsToKeep]

# Remove genes with zero expression across all cells
RNAmat <- RNAmat[Matrix::rowSums(RNAmat) != 0,]

### cisTopic step

Here we prepare the cisTopic input from the raw ATAC counts and compute the cisTopic object.

Installation of cisTopic (if required).

In [91]:
# devtools::install_github("aertslab/cisTopic")

In [92]:
suppressWarnings(library(cisTopic))

In [None]:
cistopic_bkp_path <- "../../data/openproblems_bmmc_multiome_genes_filtered_atac_s1d1_counts_cisTopic_100.rds"

if(!file.exists(cistopic_bkp_path)){
    # the number of topics to test
    n_topics = 1:25

    atac <- read.csv('../../data/openproblems_bmmc_multiome_genes_filtered_atac_s1d1_counts.csv.gz', row.names=1)
    rownames(atac) <- paste0(atac$chr, ':', atac$start, '-', atac$end)
    atac$chr <- NULL
    atac$start <- NULL
    atac$end <- NULL
    # head(atac, 2)
    
    cisTopicObject <- createcisTopicObject(atac, project.name='neurips_s1d1')
    
    cisTopicObject <- runCGSModels(cisTopicObject, topic = n_topics,
                                   seed = 987, nCores = 2, burnin = 120,
                                   iterations = 150, addModels = FALSE)
    
    cisTopicObject <- selectModel(cisTopicObject, type='maximum')

    # cisTopicObject
    cisTopicObject <- runUmap(cisTopicObject, target='cell')

    topic.mat <- modelMatSelection(cisTopicObject, 'cell', 'Probability')
    topic.mat <- t(topic.mat)
    topic.mat <- as.matrix(topic.mat)
    saveRDS(topic.mat, cistopic_bkp_path)
}
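The region naming used above can be illustrated on a toy peak table. The values and the names `atac_toy` and `counts_toy` are made up. Coordinates are stored as integers here because large non-integer numerics may be pasted in scientific notation (for example `1e+05`), which would corrupt the region names.

```r
# Toy peak table: coordinate columns followed by per-cell counts (made up)
atac_toy <- data.frame(chr = c("chr1", "chr1", "chr2"),
                       start = c(100L, 5000L, 250L),
                       end = c(600L, 5500L, 750L),
                       cell1 = c(1, 0, 2),
                       cell2 = c(0, 3, 0))

# Region names in "chr:start-end" form
region_names <- paste0(atac_toy$chr, ":", atac_toy$start, "-", atac_toy$end)

# Keep only the count columns and store them as a regions x cells matrix
count_cols <- setdiff(colnames(atac_toy), c("chr", "start", "end"))
counts_toy <- as.matrix(atac_toy[, count_cols])
rownames(counts_toy) <- region_names
counts_toy
```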


In [None]:
print('done calculating cisTopic object')

Once the cisTopic object is generated, we can use it to extract relevant features

In [32]:
cisAssign <- readRDS(cistopic_bkp_path)
dim(cisAssign) # Cells x Topics

# all(cellsToKeep %in% rownames(cisAssign))
# # Subset
# cisAssign <- cisAssign[cellsToKeep,]

Calculate a kNN-graph using the topics matrix from cisTopic

In [33]:
library(dplyr)
library(FNN)

In [34]:
set.seed(123)
cellkNN <- get.knn(cisAssign, k = 30)$nn.index
dim(cellkNN)
# rownames(cellkNN) <- cellsToKeep
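To make the shape of the neighbor matrix concrete, here is a base-R sketch of a k-nearest-neighbor search on a simulated cells x topics matrix. It is a brute-force illustration with hypothetical names, not the algorithm used by `FNN::get.knn`, and it only scales to small inputs.

```r
set.seed(123)
# Toy cells x topics matrix; rows are normalized to sum to 1 (simulated values)
topics_toy <- matrix(runif(8 * 3), nrow = 8,
                     dimnames = list(paste0("cell", 1:8), paste0("topic", 1:3)))
topics_toy <- topics_toy / rowSums(topics_toy)

k <- 3
# Pairwise Euclidean distances between cells
d <- as.matrix(dist(topics_toy))
diag(d) <- Inf  # a cell is not its own neighbor

# For each cell, the indices of its k closest cells (cells x k)
knn_toy <- t(apply(d, 1, function(x) order(x)[seq_len(k)]))
dim(knn_toy)
```

Each row holds the integer indices of that cell's neighbors, which is the same layout as the `nn.index` matrix used above.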

Visualize cells using traditional UMAP

In [35]:
# for a custom color scheme
# annoCols <- readRDS("./shareseq/shareseq_skin_annoCols.rds")
# annoCols

In [37]:
colData(ATAC.se)$cellAnnot <- colData(ATAC.se)$cell_type


In [None]:
# Plot
library(ggplot2)
colData(ATAC.se) %>% as.data.frame() %>% ggplot(aes(UMAP1,UMAP2,color=cellAnnot)) + 
  geom_point(size=0.5) + # scale_color_manual(values=annoCols)+
  theme_classic() + guides(colour = guide_legend(override.aes = list(size=2)))

The main FigR algorithm is executed here.

In [114]:
library(BSgenome.Hsapiens.UCSC.hg38)

In [115]:
# check object dimensions
c(dim(ATAC.se), dim(RNAmat))

The following step cannot be run interactively; it has to be executed from a terminal using R.

In [50]:
# Don't run interactively
bkp_path_ciscorr <- '../../data/openproblems_bmmc_multiome_genes_filtered_s1d1_ciscorr.rds'
if(!file.exists(bkp_path_ciscorr)){
    cisCorr <- FigR::runGenePeakcorr(ATAC.se = ATAC.se,
                               RNAmat = RNAmat,
                               genome = "hg38", # One of hg19, mm10 or hg38
                               nCores = 2,
                               p.cut = NULL, # Set this to NULL and we can filter later
                               n_bg = 100)
    saveRDS(cisCorr, bkp_path_ciscorr)
}
cisCorr <- readRDS(bkp_path_ciscorr)
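The `n_bg` argument and the `pvalZ` column used later suggest that each observed correlation is compared against correlations obtained from background peaks. The sketch below illustrates that idea with simulated numbers and a one-sided Z-test. It is an assumption-based illustration of the principle, not FigR's implementation, and the background values are random draws rather than matched peaks.

```r
set.seed(42)
# Observed peak-gene correlation (made-up value)
obs_cor <- 0.35
# Correlations of the same gene with 100 background peaks (simulated null)
bg_cor <- rnorm(100, mean = 0, sd = 0.1)

# Z-score of the observed correlation relative to the background distribution
z <- (obs_cor - mean(bg_cor)) / sd(bg_cor)
# One-sided upper-tail p-value: is the observed correlation larger than expected?
p_z <- pnorm(z, lower.tail = FALSE)
c(z = z, p = p_z)
```

A peak-gene pair is retained when this p-value falls below the chosen threshold.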


Filter relevant peak-gene correlations by p-value

In [49]:
cisCorr.filt <- cisCorr %>% filter(pvalZ <= 0.05)
# Determine DORC genes
dorcGenes <- cisCorr.filt %>% dorcJPlot(cutoff=7, # Default
                                       returnGeneList = TRUE)
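Conceptually, DORC genes are genes associated with many significant peaks. The toy sketch below counts peaks per gene and applies a cutoff. The table, the gene names and the inclusive `>=` comparison are assumptions for illustration and do not reproduce `dorcJPlot` itself.

```r
# Toy table of significant peak-gene pairs (made up)
pairs_toy <- data.frame(
  Gene = c(rep("GENE1", 8), rep("GENE2", 3), rep("GENE3", 7)),
  Peak = paste0("peak", 1:18),
  stringsAsFactors = FALSE
)

# Number of significantly correlated peaks per gene, sorted decreasingly
peaks_per_gene <- sort(table(pairs_toy$Gene), decreasing = TRUE)

# Genes meeting the cutoff are treated as DORC genes (inclusive cutoff assumed)
cutoff <- 7
dorc_toy <- names(peaks_per_gene)[as.vector(peaks_per_gene) >= cutoff]
dorc_toy
```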


Get DORC scores

In [48]:
dorcMat <- getDORCScores(ATAC.se, dorcTab = cisCorr.filt, geneList = dorcGenes, nCores = 4)
# Smooth DORC scores (using the cell kNN matrix computed above)
dorcMat.smooth <- smoothScoresNN(NNmat = cellkNN, mat = dorcMat, nCores = 4)


To run the smoothScoresNN function with multiple cores, doParallel is required.

In [26]:
library(doParallel)

Loading required package: foreach

Loading required package: iterators

Loading required package: parallel



In [47]:
# Smooth dorc scores using cell KNNs (k=30)
dorcMat.s <- smoothScoresNN(NNmat = cellkNN[,1:30], mat = dorcMat, nCores = 4)

# Smooth RNA using cell KNNs
# This takes longer since it's all genes
RNAmat.s <- smoothScoresNN(NNmat = cellkNN[,1:30], mat = RNAmat, nCores = 4)
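Smoothing over nearest neighbors can be pictured as replacing each cell's score with the mean over its neighbors. The following base-R sketch uses a made-up 2 x 4 score matrix and a hand-written neighbor matrix. Whether a cell contributes to its own average depends on the neighbor matrix supplied; here it does not.

```r
# Toy scores: 2 genes x 4 cells (made up)
scores <- matrix(c(1, 0, 3, 0, 5, 0, 7, 8), nrow = 2,
                 dimnames = list(c("g1", "g2"), paste0("cell", 1:4)))

# Toy neighbor index matrix (cells x k), laid out like get.knn()$nn.index
nn <- matrix(c(2, 3,
               1, 3,
               2, 4,
               3, 2), ncol = 2, byrow = TRUE)

# Replace each cell's score by the mean over its neighbors
smoothed <- sapply(seq_len(ncol(scores)), function(i) {
  rowMeans(scores[, nn[i, ], drop = FALSE])
})
colnames(smoothed) <- colnames(scores)
smoothed
```

In this toy case the isolated high value of `g2` in `cell4` is spread to `cell3`, its only neighbor that lists `cell4`, which is the denoising effect smoothing is meant to provide for sparse single-cell counts.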


In [46]:
library(ggplot2)
library(ggrastr)

In [45]:
# Visualize on pre-computed UMAP
umap.d <- as.data.frame(colData(ATAC.se)[,c("UMAP1","UMAP2")])


DORC score for a marker gene (TBX21)

In [44]:

marker_gene <- 'TBX21'
dorcg <- plotMarker2D(umap.d,dorcMat.s,markers = c(marker_gene),maxCutoff = "q0.99",
                      colorPalette = "brewer_heat") + ggtitle(paste0(marker_gene, ' DORC'))



RNA expression for the same marker gene

In [25]:
rnag <- plotMarker2D(umap.d,RNAmat.s,markers = c(marker_gene),maxCutoff = "q0.99",
                     colorPalette = "brewer_purple") + ggtitle(paste0(marker_gene, ' RNA'))


Using patchwork, we can combine the dorcg and rnag objects and visualize them side by side

In [43]:
library(patchwork)
dorcg + rnag


Attaching package: ‘patchwork’


The following object is masked from ‘package:cowplot’:

    align_plots





In [None]:
figR.d <- runFigRGRN(ATAC.se = ATAC.se, # Must be the same input as used in runGenePeakcorr()
                     dorcTab = cisCorr.filt, # Filtered peak-gene associations
                     genome = "hg38", # same genome as used in runGenePeakcorr()
                     dorcMat = dorcMat.s,
                     rnaMat = RNAmat.s, 
                     nCores = 2)
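The resulting table contains, for each TF-DORC pair, a correlation p-value, a motif enrichment p-value and a signed score. The sketch below shows one way two such p-values can be combined into a signed score. The formula and all numbers are assumptions for illustration and should be checked against the FigR documentation before relying on them.

```r
# Made-up inputs for three TF-DORC pairs
corr     <- c(0.6, -0.5, 0.05)   # TF expression vs DORC score correlation
p_corr   <- c(1e-4, 1e-3, 0.6)   # significance of that correlation
p_enrich <- c(1e-3, 1e-2, 0.7)   # motif enrichment among the DORC's peaks

# Combine both p-values and sign the result by the correlation direction
# (assumed formula, see the hedge in the text above)
score <- sign(corr) * -log10(1 - (1 - p_enrich) * (1 - p_corr))
round(score, 2)
```

Under this assumed formula, a pair needs both a significant correlation and a significant motif enrichment to reach a large absolute score, and the sign separates putative activators (positive) from repressors (negative).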

## Results visualization

TF-DORC regulation scores (scatter plot)

In [None]:
require(ggplot2)
require(ggrastr)
require(BuenColors) # https://github.com/caleblareau/BuenColors

figR.d %>% 
  ggplot(aes(Corr.log10P,Enrichment.log10P,color=Score)) + 
  ggrastr::geom_point_rast(size=0.01,shape=16) + 
  theme_classic() + 
  scale_color_gradientn(colours = jdb_palette("solar_extra"),limits=c(-3,3),oob = scales::squish,breaks=scales::breaks_pretty(n=3))

Rank-based visualization of driver TFs

In [None]:
rankDrivers(figR.d,score.cut = 2,rankBy = "nTargets",interactive = TRUE)
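Ranking by number of targets amounts to counting, per TF, how many associations pass the score cutoff. Below is a base-R sketch on a toy table. The column names `Motif`, `DORC` and `Score` and all values are assumed for illustration, and the counting logic is a simplification rather than the `rankDrivers` implementation.

```r
# Toy TF-DORC table (Motif = TF, DORC = target gene); values are made up
figr_toy <- data.frame(
  Motif = c("TF1", "TF1", "TF1", "TF2", "TF2", "TF3"),
  DORC  = c("G1", "G2", "G3", "G1", "G4", "G2"),
  Score = c(2.5, 3.1, -0.4, -2.2, -2.8, 1.0),
  stringsAsFactors = FALSE
)
score_cut <- 2

# Keep associations whose absolute score passes the cutoff
strong <- figr_toy[abs(figr_toy$Score) >= score_cut, ]

# Count activated (positive) and repressed (negative) targets per TF
n_act <- tapply(strong$Score > 0, strong$Motif, sum)
n_rep <- tapply(strong$Score < 0, strong$Motif, sum)
ranking <- data.frame(TF = names(n_act),
                      activated = as.vector(n_act),
                      repressed = as.vector(n_rep))

# Order TFs by their total number of targets
ranking <- ranking[order(-(ranking$activated + ranking$repressed)), ]
ranking
```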


Heatmap-based visualization of FigR scores

In [None]:
library(ComplexHeatmap)
plotfigRHeatmap(figR.d = figR.d,
                score.cut = 2,
                column_names_gp = gpar(fontsize=6), # from ComplexHeatmap
                show_row_dend = FALSE # from ComplexHeatmap
                )
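A heatmap of this kind needs the long TF-DORC table reshaped into a DORC x TF matrix. The sketch below does this with base R on a toy table. Column names and values are assumed, and the filtering rule (keep rows and columns with at least one score passing the cutoff) is an illustration rather than the exact behavior of `plotfigRHeatmap`.

```r
# Toy long-format table of scores (made up)
figr_toy <- data.frame(
  DORC  = c("G1", "G1", "G2", "G2", "G3", "G3"),
  Motif = c("TF1", "TF2", "TF1", "TF2", "TF1", "TF2"),
  Score = c(2.4, -0.3, 0.1, -2.6, 0.2, 0.5),
  stringsAsFactors = FALSE
)
score_cut <- 2

# Long table -> DORC x TF matrix of scores
score_mat <- tapply(figr_toy$Score, list(figr_toy$DORC, figr_toy$Motif), mean)

# Keep rows and columns with at least one association passing the cutoff
keep_rows <- apply(abs(score_mat) >= score_cut, 1, any)
keep_cols <- apply(abs(score_mat) >= score_cut, 2, any)
heat_input <- score_mat[keep_rows, keep_cols, drop = FALSE]
heat_input
```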

Heatmap-based visualization of FigR scores for candidate TFs

In [None]:
library(ComplexHeatmap)
plotfigRHeatmap(figR.d = figR.d,
                score.cut = 1,
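                # NOTE: mouse gene symbols carried over from the SHARE-seq tutorial;
                # replace with TFs of interest present in this human dataset
                TFs = c("Lef1","Dlx3","Grhl1","Gata6","Klf3","Barx2","Pou2f3"),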
                column_names_gp = gpar(fontsize=6), # from ComplexHeatmap
                show_row_dend = FALSE # from ComplexHeatmap
                )

Network-based visualization of generated results (using package networkD3)

In [41]:
library(networkD3)

In [None]:
plotfigRNetwork(figR.d,
                score.cut = 2,
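                # NOTE: mouse gene symbols carried over from the SHARE-seq tutorial;
                # replace with TFs of interest present in this human dataset
                TFs = c("Lef1","Dlx3","Grhl1","Gata6","Klf3","Barx2","Pou2f3"),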
                weight.edges = TRUE)

## Takeaways

In this notebook, we have:

1. Prepared RNA and ATAC objects in R for processing with FigR and cisTopic.
2. Calculated DORC scores with FigR and visualized the results as scatter plots, heatmaps and networks.

## Quiz

### Theory

1. Why are peak-to-gene associations considered mechanistically valid?
2. Why are there more peaks than genes and, as a consequence, multiple peaks targeting a single gene when building GRNs?
3. What is considered a TF activator/repressor at the level of peaks, and at the level of gene groups?
4. Which additional readouts are complementary to scRNA-seq and scATAC-seq when interpreting ATAC+RNA GRN models?

### FigR

1. What is the DORC score, and how could it be useful for identifying regulatory interactions between peaks and genes?

## References

## Contributors
We gratefully acknowledge the contributions of:
### Authors
* Ignacio Ibarra
### Reviewers
* Lukas Heumos