A workflow for exploring and analyzing RNA-seq data.

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

tmonjo/sageseqr

 
 


Installation

remotes::install_github("Sage-Bionetworks/sageseqr")

RNA-seq normalization workflow in R

The sageseqr package integrates the drake R package, the config package for R, and Synapse. drake tracks dependency relationships in the workflow and only updates data when it has changed. A config file allows inputs and parameters to be explicitly defined in one location. Synapse is a data repository that allows sensitive data to be stored and shared responsibly.

The workflow takes RNA-seq gene counts and sample metadata as inputs, normalizes counts by conditional quantile normalization (CQN), removes outliers based on a user-defined threshold, empirically selects meaningful covariates, and returns differential expression analysis results. The data are also visualized in several ways to help you identify meaningful trends: a heatmap identifying highly correlated covariates, a sample-specific X and Y marker gene check, boxplots showing the distribution of continuous variables, and a principal component analysis (PCA) showing sample distribution.
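To make the outlier step concrete, here is a minimal sketch in base R. It is not the package's own code. It runs a PCA on simulated expression values and flags samples whose score on PC1 or PC2 lies more than a chosen number of standard deviations from the mean. The simulated matrix, the variable names, and the `z_threshold` value are all assumptions made for illustration.

```r
set.seed(42)

# Simulated log-scale expression: 200 genes x 12 samples (hypothetical data).
expr <- matrix(
  rnorm(200 * 12),
  nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("sample", 1:12))
)
# Shift one sample so that it behaves like an outlier.
expr[, "sample12"] <- expr[, "sample12"] + 3

# PCA with samples as rows; keep the first two principal components.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]

# Hypothetical user-defined threshold, in standard deviations.
z_threshold <- 2.5

# Standardize each PC and flag samples beyond the threshold on either PC.
z <- scale(scores)
is_outlier <- apply(abs(z) > z_threshold, 1, any)
outlier_samples <- names(is_outlier)[is_outlier]
print(outlier_samples)
```

Lowering `z_threshold` flags more samples, so the value is worth choosing with the PCA plot in view.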

The Targets

The series of steps that make up the workflow are called targets. The target objects are stored in a cache and can be either returned as a value with the drake function readd or loaded into your environment with loadd. Source code for each target can be displayed by setting show_source = TRUE with loadd and readd.

Importantly, running clean will remove the data stored as targets (but the data is never completely gone!). You may specify targets by name by passing them to the clean function.
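The following sketch shows how readd, loadd, and clean behave, using a toy two-target plan instead of the sageseqr plan. The target names `toy_counts` and `toy_totals` and the temporary cache are hypothetical, chosen so the example does not touch a real project. The code only runs if the drake package is installed.

```r
if (requireNamespace("drake", quietly = TRUE)) {
  library(drake)

  # Throwaway cache in a temporary directory (assumption for this example).
  cache <- new_cache(path = tempfile())

  # A toy plan with two hypothetical targets.
  plan <- drake_plan(
    toy_counts = matrix(1:6, nrow = 2),
    toy_totals = colSums(toy_counts)
  )
  make(plan, cache = cache)

  # readd() returns the stored value of a target.
  totals <- readd(toy_totals, cache = cache)

  # show_source = TRUE also reports where the target came from.
  readd(toy_totals, cache = cache, show_source = TRUE)

  # loadd() assigns the target into the calling environment under its own name.
  loadd(toy_counts, cache = cache)

  # clean() with a target name removes only that target from the cache.
  clean(toy_totals, cache = cache)

  # List the targets that are still cached.
  print(cached(cache = cache))
}
```

In a real project the `cache` argument can usually be left at its default, which is the drake cache of the working directory.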

The targets are called by the sageseqr rnaseq_plan() function and are:

Raw data:

  • import_metadata - imports the raw metadata directly from Synapse.
  • import_counts - imports the raw counts directly from Synapse.
  • biomart_results - the complete list of genes with biomaRt annotations.

Exploratory data visualizations:

  • gene_coexpression - the distribution of correlated gene counts.
  • boxplots - the distribution of continuous variables.
  • sex_plot - the distribution of samples by X and Y marker genes.
  • sex_plot_pca - a PCA of sex-specific expression to visualize more dimensionality than sex_plot.
  • correlation_plot - the correlation of covariates.
  • significant_covariates_plot - the correlation of covariates to gene expression.
  • outliers - the clustering of samples by PCA.
  • plot_de_volcano - volcano plot of differentially expressed genes.
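As an illustration of what an X and Y marker gene check looks for, here is a rough sketch in base R on simulated data. It assumes XIST as the X-linked marker and UTY as the Y-linked marker. That choice, the simulated values, and the simple comparison rule are assumptions for this example and may differ from what sex_plot uses.

```r
set.seed(7)

# Simulated log expression for 20 samples: 10 female, then 10 male.
true_sex <- rep(c("female", "male"), each = 10)
xist <- ifelse(true_sex == "female", rnorm(20, 8, 0.5), rnorm(20, 1, 0.5))
uty  <- ifelse(true_sex == "male",   rnorm(20, 6, 0.5), rnorm(20, 0.5, 0.5))

# Reported sex from the metadata, with one deliberately mislabeled sample.
reported_sex <- true_sex
reported_sex[3] <- "male"

# Simple rule (assumption): higher XIST than UTY suggests a female sample.
inferred_sex <- ifelse(xist > uty, "female", "male")

# Samples whose reported sex disagrees with the marker genes.
mismatch <- which(inferred_sex != reported_sex)
print(mismatch)
```

Plotting `xist` against `uty` and coloring the points by `reported_sex` gives the visual version of the same check: a mislabeled sample sits in the cluster of the other sex.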

Transformed or normalized data:

  • clean_md - metadata with factor and numeric types.
  • filtered_counts - counts matrix with low gene expression removed.
  • biotypes - gene proportions summarized by biotype.
  • cqn_counts - CQN normalized counts.
  • model - model selected by multivariate forward stepwise regression (evaluated by the Bayesian Information Criterion (BIC)).
  • de - differential expression results including adjusted p-values and gene list.
  • report - output markdown report rendered as HTML.
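To show the idea behind BIC-guided forward selection, here is a simplified sketch using base R's `step()`. It fits a single simulated response, whereas the model target is described above as multivariate, so treat this as an analogy and not as the package's method. The covariate names and the simulated data are hypothetical.

```r
set.seed(1)
n <- 100

# Hypothetical sample covariates.
covariates <- data.frame(
  age   = rnorm(n, mean = 60, sd = 10),
  rin   = rnorm(n, mean = 7, sd = 1),
  batch = factor(sample(c("a", "b"), n, replace = TRUE))
)

# Simulated response that depends on age and batch, but not on rin.
covariates$y <- 0.5 * covariates$age +
  5 * (covariates$batch == "b") +
  rnorm(n)

# Start from the intercept-only model and add one covariate at a time.
null_model <- lm(y ~ 1, data = covariates)
selected <- step(
  null_model,
  scope = list(lower = ~ 1, upper = ~ age + rin + batch),
  direction = "forward",
  k = log(n),  # a penalty of log(n) makes step() rank models by BIC
  trace = 0
)

selected_terms <- attr(terms(selected), "term.labels")
print(selected_terms)
print(BIC(selected))
```

Each step adds the covariate that lowers the BIC most and stops when no addition improves it, which is why strongly associated covariates are retained while uninformative ones tend to be left out.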

Access to Data

Anyone can create a Synapse account and access public data in a variety of disciplines, for example through the Alzheimer's Disease Knowledge Portal or the CommonMind Consortium.
