DGEclustering

DGEclustering is an R package for multidimensional clustering of differential gene expression (DGE) datasets, and it integrates GO annotations to improve the clustering result.

Introduction

DGEclustering performs two primary tasks:
I. Searching from a starting directory for differential gene expression datasets (DESeq2 outputs), and automatically generating diagnostic plots for paired DGE datasets, which includes: Q-Q plots, fish plots, and scatter plots.

II. Conducting multidimensional clustering analysis integrated with GO terms for paired (could be two or more) DGE datasets. The GO terms are intergrated using the InteGO package (Verbanck, M., Lê, S. & Pagès, J. A new unsupervised gene clustering algorithm based on the integration of biological knowledge into expression data. BMC Bioinformatics 14, 42 (2013).)

Installation

1. Clone the github repository:

git clone https://github.com/reneechou123/DGEclustering.git

2. Create and activate conda environment:

conda create -n r-dgeclustering --file DGEclustering/requirements.txt
source activate r-dgeclustering

3. Install package in R:

install.packages('DGEclustering', type='source', repos=NULL)

Alternatively, the user can install DGEclustering from Bioconda, but it may not be the latest version:

conda install r-dgeclustering

*The versions of some dependencies are not specified for the one that is put onto Bioconda

Usage

Import libraries:

library(DGEclustering)
library(AnnotationHub)
library(grid)
# Code from https://github.com/lcdb/lcdb-wf/blob/master/workflows/rnaseq/downstream/helpers.Rmd
hub <- AnnotationHub::.Hub("AnnotationHub",
        getAnnotationHubOption("URL"),
        './AnnotationHubCache',
        httr::use_proxy(Sys.getenv("http_proxy")),
        FALSE)

I. Automation of diagnostic plots: `automation`

This function will create plotting folders for the three types of diagnostic plots in the starting directory: qq_plots, fish_plots, and scatter_plots.

# Specify the column name for gene IDs in DGE datasets as well as the starting directory
gene.col <- 'gene'
dir <- './'

# Automated plotting of diagnostic plots
automation(rootDir=dir, geneCol=gene.col, qqPlot=TRUE, fishPlot=TRUE, scatterPlot=TRUE)

## (optional) search for files in the database
library(RSQLite)
mydb <- dbConnect(RSQLite::SQLite(), 'rnaseq.db')
dbListTables(mydb)

## remember to close and disconnect the database
dbDisconnect(mydb)
unlink('rnaseq.db')

Troubleshooting:

If automation failed, there may be some tsv files that are similar to a DGE dataset but miss some columns. In that case, please look for the error message "KeyError" and find out the missing column(s).
If automation does not work for all the DGE datasets, please be sure that the column name for gene IDs are exactly the same in each dataset.

II. Multidimensional clustering integrated with GO terms

Step 0-1: Specify the column name for gene IDs, organism database, and key type

gene.col <- 'gene'
orgdb <- query(hub, 'Drosophila melanogaster')[['AH57972']]
keytype <- 'FLYBASE'

Step 0-2: Specify the paired input files

# Import example datasets
data(list=c('treatment1.vs.control', 'treatment2.vs.control', 'treatment3.vs.control'))

# Create a list of datasets
## in this example we cluster on only two paired datasets
## the user can also cluster on more paired datasets by putting them into a list
datasets <- list(treatment1.vs.control, treatment2.vs.control)
names(datasets) <- c('treatment1.vs.control', 'treatment2.vs.control')

Step 1-1: Set the (adjusted) p-value thresholds for selecting genes

adjPvalue <- TRUE
x.threshold <- 0.05 ## for the first dataset
y.threshold <- 0.05 ## for the second dataset
## the significant threshold for multiple files will be the smaller one between x. and y.threshold

Step 1-2: Merge datasets and subset the significant genes: `sig.subset`

This function Generates significant scatter plot as well as merged significant subsets

## 'x.dsNumber' and 'y.dsNumber' is for plotting purposes (x- and y-axis)
sig.res <- sig.subset(datasets, geneCol=gene.col, x.dsNumber=1, y.dsNumber=2, x.threshold=x.threshold, 
y.threshold=y.threshold, adjPvalue=adjPvalue)

## show Plot
sig.res$p

## show datasets
## for two paired datasets, there will be discordant and concordant datasets. Discordant dataset contains 
## a set of genes having different signs of log2 fold changes between the paired datasets, whereas 
## concordant dataset contains genes having same signs of log2 fold changes.
if (length(sig.res) == 3) { ## for two paired datasets
  head(sig.res$dis)
  head(sig.res$con)
} else { ## For more paired datasets
  head(sig.res$dat)
}

Step 2: Annotate Genes: `annotate.genes`

# Assign the `dat` variable
if (length(sig.res) == 3) { ## for two paired datasets we can combine the discordant and condordant datasets
  dat <- rbind(sig.res$dis, sig.res$con)
} else { ## for more paired datasets
  dat <- sig.res$dat
}

# Annotate genes
ann <- annotate.genes(OrgDb=orgdb, keyType=keytype, genes=unlist(dat[gene.col]), GOEnrichment=FALSE)
rownames(ann) <- make.names(dat[,gene.col], unique=TRUE)

Step 3: Prepare expression and annotation datasets for clustering

Expression dataset: choose the desired dimensions
Annotation dataset: choose the number of GO terms for optimal clustering
Number of groups: choose the number of group for optimal clustering

# expression dataset
## show column names for dat
colnames(dat)

## this example selects log2 fold changes and adjusted p-values as input
## no scaling required
exp <- dat[,grepl("log2FoldChange|pvalue", colnames(dat))]
rownames(exp) <- make.names(dat[,gene.col], unique=TRUE)

# annotaiton dataset for `intego` integration method
## calculate number of GO terms assign to a specific number of genes
sum(apply(ann, 2, sum) >= 10)

## choose the appropriate number of GO terms for annotation dataset
ann <- ann[, apply(ann, 2, sum) >= 10]

# annotation semantic similarity matrix for `newdis` integration method
GO.sim <- create.GO.sim.mat(genes=rownames(exp), OrgDb=orgdb, ont='BP', keyType=keytype, computeIC=FALSE, measure='Wang', out.file.path=NULL)

# choose the number of clustering groups
nb.group=8

Step 4: Cluster genes, and visualize the result: `DGE.clust`, `cluster.plot`

# Clustering analysis
## the user can choose between two types of integration method through `integrate.method`:
## 1. `intego`, which decomposes expression matrix into a binary matrix (InteGO; DOI: 10.1186/1471-2105-14-42)
## 2. `newdis`, which combines GO term semantic dissimilarity matrix and gene expression dissimilarity matirx
##     to create a new distance matrix
##
## the user can also choose from two types of clustering algorithm through `clust.method`:
## 1. `agnes` (Agglomerative Nesting (Hierarchical Clustering); DOI:10.1002/9780470316801)
## 2. `genclust` (GenClust; DOI: 10.1186/1471-2105-6-289)

# integrate.method='intego'
res <- DGE.clust(expressions=exp, annotations=ann, integrate.method='intego', clust.method='agnes', nb.group=nb.group)

## view clustering result and evaluation scores
res$groups 
res$evaluation 

# visualize the clustering result
p <- cluster.plot(datasets, res$groups, x.dsNumber=1, y.dsNumber=2, geneCol=gene.col, adjPvalue=adjPvalue)

## show plot
p

# integrate.method='newdis'
library(GOSemSim)
semData <- godata(OrgDb=orgdb, ont='BP', keytype=keytype, computeIC=FALSE)
GO.sim <- mgeneSim(rownames(exp), semData, measure='Wang')
res2 <- DGE.clust(expressions=exp, annotations=ann, integrate.method='newdis', clust.method='agnes', nb.group=nb.group,
OrgDb=orgdb, keyType=keytype, alpha=1, nb.dim=2)

## view clustering result and evaluation scores
res2$groups 
res2$evaluation 

# visualize the clustering result
p2 <- cluster.plot(datasets, res2$groups, x.dsNumber=1, y.dsNumber=2, geneCol=gene.col, adjPvalue=adjPvalue)

## show plot
p2

# visualize the clustering result (MCA plot)
p.MCA <- cluster.plot(res.groups=res$groups, res.MCA=res$MCA, MCA=TRUE, geneCol=gene.col, adjPvalue=adjPvalue)

## show plot
p.MCA

Step 5: GO enrichment of the clustering result (visualization): `cluster.enrich`

# Background genes for GO enrichment
bg.genes <- treatment1.vs.control[gene.col]

# Visualizing GO enrichment result, and showing only the top 5 terms for each cluster
p.GO <- cluster.enrich(clusterGroups=res$groups, OrgDb=orgdb, keyType=keytype, BgGenes=bg.genes, 
ont='BP', top=5)

## show plot
p.GO

Session Information

sessionInfo()

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 28 (Twenty Eight)

## Matrix products: default
## BLAS: /home/guest/miniconda3/envs/r-dgeclustering/lib/R/lib/libRblas.so
## LAPACK: /home/guest/miniconda3/envs/r-dgeclustering/lib/R/lib/libRlapack.so

## locale:
##  [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
##  [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
##  [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8   
##  [7] LC_PAPER=en_US.utf8       LC_NAME=C                
##  [9] LC_ADDRESS=C              LC_TELEPHONE=C           
## [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      

## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     

## other attached packages:
##  [1] InteGO_2.0            AnnotationDbi_1.40.0  IRanges_2.12.0       
##  [4] S4Vectors_0.16.0      Biobase_2.38.0        clusterProfiler_3.6.0
##  [7] DOSE_3.4.0            ggplot2_2.2.1         stringr_1.3.0        
## [10] RSQLite_2.0           AnnotationHub_2.10.1  BiocGenerics_0.24.0  
## [13] DGEclustering_0.1.0   cluster_2.0.6         FactoMineR_1.41      

## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.15                  lattice_0.20-34              
##  [3] tidyr_0.8.1                   GO.db_3.5.0                  
##  [5] png_0.1-7                     digest_0.6.15                
##  [7] mime_0.5                      R6_2.2.2                     
##  [9] plyr_1.8.4                    BiocInstaller_1.28.0         
## [11] httr_1.3.1                    pillar_1.2.2                 
## [13] rlang_0.2.1                   curl_3.2                     
## [15] lazyeval_0.2.1                data.table_1.10.4            
## [17] blob_1.1.1                    qvalue_2.10.0                
## [19] splines_3.4.1                 BiocParallel_1.12.0          
## [21] igraph_1.2.1                  bit_1.1-12                   
## [23] munsell_0.5.0                 shiny_1.0.5                  
## [25] fgsea_1.4.0                   compiler_3.4.1               
## [27] httpuv_1.3.6.2                pkgconfig_2.0.1              
## [29] htmltools_0.3.6               flashClust_1.01-2            
## [31] tibble_1.4.2                  gridExtra_2.3                
## [33] interactiveDisplayBase_1.16.0 MASS_7.3-48                  
## [35] leaps_3.0                     grid_3.4.1                   
## [37] xtable_1.8-2                  gtable_0.2.0                 
## [39] DBI_1.0.0                     magrittr_1.5                 
## [41] scales_0.5.0                  rlist_0.4.6.1                
## [43] stringi_1.1.7                 GOSemSim_2.4.0               
## [45] reshape2_1.4.3                scatterplot3d_0.3-41         
## [47] DO.db_2.9                     rvcheck_0.0.9                
## [49] fastmatch_1.1-0               tools_3.4.1                  
## [51] bit64_0.9-5                   glue_1.2.0                   
## [53] purrr_0.2.4                   yaml_2.1.18                  
## [55] colorspace_1.3-2              memoise_1.1.0

Name		Name	Last commit message	Last commit date
Latest commit History 315 Commits
R		R
data		data
inst		inst
temp		temp
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
DGEclustering.Rproj		DGEclustering.Rproj
LICENSE		LICENSE
NAMESPACE		NAMESPACE
README.md		README.md
requirements.txt		requirements.txt

License

RTChou/DGEclustering

Folders and files

Latest commit

History

Repository files navigation

DGEclustering

Introduction

Installation

1. Clone the github repository:

2. Create and activate conda environment:

3. Install package in R:

*The versions of some dependencies are not specified for the one that is put onto Bioconda

Usage

I. Automation of diagnostic plots: automation

Troubleshooting:

II. Multidimensional clustering integrated with GO terms

Step 0-1: Specify the column name for gene IDs, organism database, and key type

Step 0-2: Specify the paired input files

Step 1-1: Set the (adjusted) p-value thresholds for selecting genes

Step 1-2: Merge datasets and subset the significant genes: sig.subset

Step 2: Annotate Genes: annotate.genes

Step 3: Prepare expression and annotation datasets for clustering

Step 4: Cluster genes, and visualize the result: DGE.clust, cluster.plot

Step 5: GO enrichment of the clustering result (visualization): cluster.enrich

Session Information

About

Topics

Resources

License

Stars

Watchers

Forks

Languages

I. Automation of diagnostic plots: `automation`

Step 1-2: Merge datasets and subset the significant genes: `sig.subset`

Step 2: Annotate Genes: `annotate.genes`

Step 4: Cluster genes, and visualize the result: `DGE.clust`, `cluster.plot`

Step 5: GO enrichment of the clustering result (visualization): `cluster.enrich`