remotes::install_github("Sage-Bionetworks/sageseqr")
The sageseqr package integrates the drake R package, the config package for R, and Synapse. drake tracks dependency relationships in the workflow and only updates data when it has changed. A config file allows inputs and parameters to be explicitly defined in one location. Synapse is a data repository that allows sensitive data to be stored and shared responsibly.
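As a rough illustration of how a config file keeps inputs and parameters in one place, the sketch below writes a small YAML file and reads it back with config::get(). Every key name and both Synapse IDs are hypothetical placeholders chosen for this example; they are not sageseqr's actual configuration schema.

```r
# Write a small config file to a temporary location.
# All keys below are hypothetical and are NOT sageseqr's actual schema.
cfg_file <- tempfile(fileext = ".yml")
writeLines(c(
  "default:",
  "  counts_id: syn00000000     # placeholder Synapse ID",
  "  metadata_id: syn00000001   # placeholder Synapse ID",
  "  outlier_sd: 3",
  "  factors: [diagnosis, sex]"
), cfg_file)

# With no `value` argument, config::get() returns the active
# configuration ("default" here) as a list.
cfg <- config::get(file = cfg_file)

cfg$outlier_sd
cfg$factors
```

Calling config::get() with the namespace prefix, rather than attaching the package, avoids masking base::get().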
The workflow takes RNA-seq gene counts and sample metadata as inputs, normalizes counts by conditional quantile normalization (CQN), removes outliers based on a user-defined threshold, empirically selects meaningful covariates, and returns differential expression analysis results. The data is also visualized in several ways to help you understand meaningful trends. The visualizations include a heatmap identifying highly correlated covariates, a sample-specific X and Y marker gene check, boxplots visualizing the distribution of continuous variables, and a principal component analysis (PCA) to visualize sample distribution.
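To make the idea of threshold-based outlier removal concrete, here is a minimal base R sketch that flags samples lying more than a chosen number of standard deviations from the mean on the first two principal components. The simulated data, the sample names, and the exact rule are assumptions made for illustration; sageseqr's own implementation may differ.

```r
set.seed(1)

# Simulated log-expression matrix: 200 genes x 20 samples (hypothetical data).
expr <- matrix(
  rnorm(200 * 20),
  nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("s", 1:20))
)
expr[, "s20"] <- expr[, "s20"] + 3  # shift one sample so it is aberrant

# PCA with samples as rows.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Standardize the scores on PC1 and PC2, then flag any sample whose
# absolute z-score exceeds the user-defined threshold on either component.
z <- scale(pca$x[, 1:2])
threshold <- 3
outliers <- rownames(z)[apply(abs(z) > threshold, 1, any)]

# Drop flagged samples before downstream analysis.
expr_clean <- expr[, !colnames(expr) %in% outliers, drop = FALSE]
```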
The series of steps that make up the workflow are called targets. The target objects are stored in a cache and can either be read or loaded into your environment with the drake functions readd or loadd. Source code for each target can be visualized by setting show_source = TRUE with loadd and readd.
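The difference between reading and loading a target can be sketched with a toy drake plan. The two targets below (raw and doubled) are hypothetical stand-ins for the sageseqr targets, and a temporary cache is used so that no existing project is touched.

```r
library(drake)

# A toy two-target plan (hypothetical targets, not part of sageseqr).
plan <- drake_plan(
  raw = c(1, 2, 3, 4),
  doubled = raw * 2
)

# Build the targets into a throwaway cache.
cache <- new_cache(path = tempfile())
make(plan, cache = cache)

# readd() returns the stored value so it can be assigned to any name.
doubled_value <- readd(doubled, cache = cache)

# loadd() assigns the target into the current environment under its own name.
loadd(raw, cache = cache)

# show_source = TRUE additionally reports how the target was produced.
readd(doubled, cache = cache, show_source = TRUE)
```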
Importantly, running clean will remove the data stored as targets (but the data is never completely gone!). You may remove specific targets by name by passing them to the clean function.
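A minimal sketch of both uses of clean, again with a toy plan whose target names (a and b) are hypothetical. By default clean() does not run garbage collection on the cache, which is consistent with the note above that the data is not completely gone.

```r
library(drake)

# Toy plan with two hypothetical targets.
plan <- drake_plan(
  a = 1 + 1,
  b = a * 10
)

cache <- new_cache(path = tempfile())
make(plan, cache = cache)

# Remove a single target by name; the other target stays in the cache.
clean(b, cache = cache)
remaining <- cached(cache = cache)

# Calling clean() without target names removes all targets.
clean(cache = cache)
remaining_after_full_clean <- cached(cache = cache)
```

After a target is cleaned, the next call to make() rebuilds it.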
The targets are called by the sageseqr rnaseq_plan() function and are:
Raw data:
import_metadata - imports the raw metadata directly from Synapse.
import_counts - imports the raw counts directly from Synapse.
biomart_results - the complete list of genes with biomaRt annotations.
Exploratory data visualizations:
gene_coexpression - the distribution of correlated gene counts.
boxplots - the distribution of continuous variables.
sex_plot - the distribution of samples by X and Y marker genes.
sex_plot_pca - a PCA of sex-specific expression to visualize more dimensionality than sex_plot.
correlation_plot - the correlation of covariates.
significant_covariates_plot - the correlation of covariates to gene expression.
outliers - the clustering of samples by PCA.
plot_de_volcano - volcano plot of differentially expressed genes.
Transformed or normalized data:
clean_md - metadata with factor and numeric types.
filtered_counts - counts matrix with low gene expression removed.
biotypes - gene proportions summarized by biotype.
cqn_counts - CQN normalized counts.
model - model selected by multivariate forward stepwise regression, evaluated by the Bayesian Information Criterion (BIC).
de - differential expression results including adjusted p-values and gene list.
report - output markdown report rendered as HTML.
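To give a feel for how the model target chooses covariates, the following sketch runs forward stepwise selection scored by BIC using base R's step(). This is a deliberately simplified, single-response stand-in and not the multivariate procedure the workflow uses; the metadata columns and the simulated expression values are hypothetical.

```r
set.seed(42)
n <- 100

# Hypothetical sample metadata plus one simulated expression value per sample.
md <- data.frame(
  diagnosis = factor(sample(c("control", "case"), n, replace = TRUE)),
  age = rnorm(n, mean = 70, sd = 8),
  batch = factor(sample(c("b1", "b2"), n, replace = TRUE))
)
md$expr <- 2 * (md$diagnosis == "case") + rnorm(n)

# Start from the intercept-only model and add covariates one at a time.
null_model <- lm(expr ~ 1, data = md)
selected <- step(
  null_model,
  scope = ~ diagnosis + age + batch,  # candidate covariates
  direction = "forward",
  k = log(n),                         # penalty of log(n) gives BIC instead of AIC
  trace = 0
)

# Covariates retained by the selection.
selected_terms <- attr(terms(selected), "term.labels")
selected_terms
```

A covariate is kept only if adding it lowers the BIC, so covariates with no real relationship to the response tend to be left out.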
Anyone can create a Synapse account and access public data in a variety of disciplines, such as the Alzheimer's Disease Knowledge Portal and the CommonMind Consortium.