Commit

Revert "updates to readme"
This reverts commit 54a46ca.
kopardev committed Jun 28, 2023
1 parent 54a46ca commit 78dc92c
Showing 1 changed file with 61 additions and 29 deletions.
90 changes: 61 additions & 29 deletions README.md
@@ -1,36 +1,40 @@
# RENEE - **R**na s**E**quencing a**N**alysis pip**E**lin**E**
# RNA-seek

[![DOI](https://zenodo.org/badge/305525443.svg)](https://zenodo.org/badge/latestdoi/305525443) [![GitHub releases](https://img.shields.io/github/release/skchronicles/RNA-seek)](https://github.com/skchronicles/RNA-seek/releases) ![Docker Pulls](https://img.shields.io/docker/pulls/nciccbr/ccbr_arriba_2.0.0) [![Build](https://github.com/skchronicles/RNA-seek/workflows/Tests/badge.svg)](https://github.com/skchronicles/RNA-seek/actions) [![GitHub issues](https://img.shields.io/github/issues/skchronicles/RNA-seek?color=brightgreen)](https://github.com/skchronicles/RNA-seek/issues) [![GitHub license](https://img.shields.io/github/license/skchronicles/RNA-seek)](https://github.com/skchronicles/RNA-seek/blob/main/LICENSE)

An open-source, reproducible, and scalable solution for analyzing RNA-seq data.

### Table of Contents
- [RENEE - **R**na s**E**quencing a**N**alysis pip**E**lin**E**](#renee---rna-sequencing-analysis-pipeline)
- [Table of Contents](#table-of-contents)
- [1. Introduction](#1-introduction)
- [2. Overview](#2-overview)
- [2.1 RENEE Pipeline](#21-renee-pipeline)
- [2.2 Reference Genomes](#22-reference-genomes)
- [2.3 Dependencies](#23-dependencies)
- [3. Run RENEE pipeline](#3-run-renee-pipeline)
- [3.3 Biowulf](#33-biowulf)
- [5. References](#5-references)
1. [Introduction](#1-Introduction)
2. [Overview](#2-Overview-of-Pipeline)
2.1 [RNA-seek Pipeline](#21-RNA-seek-Pipeline)
2.2 [Reference Genomes](#22-Reference-Genomes)
2.3 [Dependencies](#23-Dependencies)
2.4 [Installation](#24-Installation)
3. [Run RNA-seek pipeline](#3-Run-RNA-seek-pipeline)
3.1 [Using Singularity](#31-Using-Singularity)
3.2 [Using Docker](#32-Using-Docker)
3.3 [Biowulf](#33-Biowulf)
4. [Contribute](#4-Contribute)
5. [References](#5-References)

### 1. Introduction
RNA-sequencing (*RNA-seq*) has a wide variety of applications. This popular transcriptome profiling technique can be used to quantify gene and isoform expression, detect alternative splicing events, predict gene-fusions, call variants and much more.

**RENEE** is a comprehensive, open-source RNA-seq pipeline that relies on technologies like [Docker<sup>20</sup>](https://www.docker.com/why-docker) and [Singularity<sup>21</sup>](https://singularity.lbl.gov/) to maintain the highest level of reproducibility. The pipeline consists of a series of data processing and quality-control steps orchestrated by [Snakemake<sup>19</sup>](https://snakemake.readthedocs.io/en/stable/), a flexible and scalable workflow management system, to submit jobs to a cluster or cloud provider.
**RNA-seek** is a comprehensive, open-source RNA-seq pipeline that relies on technologies like [Docker<sup>20</sup>](https://www.docker.com/why-docker) and [Singularity<sup>21</sup>](https://singularity.lbl.gov/) to maintain the highest level of reproducibility. The pipeline consists of a series of data processing and quality-control steps orchestrated by [Snakemake<sup>19</sup>](https://snakemake.readthedocs.io/en/stable/), a flexible and scalable workflow management system, to submit jobs to a cluster or cloud provider.

![RENEE_overview_diagram](./resources/overview.svg)
![RNA-seek_overview_diagram](https://github.com/skchronicles/RNA-seek/blob/main/resources/overview.svg)
<sup>**Fig 1. Run locally on a compute instance, on-premise using a cluster, or on the cloud using AWS.** A user can define the method or mode of execution. The pipeline can submit jobs to a cluster using a job scheduler like SLURM, or run on AWS using Tibanna (feature coming soon!). A hybrid approach ensures the pipeline is accessible to all users. As an optional step, relevant output files and metadata can be stored in object storage using HPC DME (NIH users) or Amazon S3 for archival purposes (coming soon!).</sup>

### 2. Overview

#### 2.1 RENEE Pipeline
A bioinformatics pipeline is more than the sum of its data processing steps. A pipeline without quality-control steps provides a myopic view of the potential sources of variation within your data (i.e., biological versus technical sources of variation). The RENEE pipeline is composed of a series of quality-control and data processing steps.
#### 2.1 RNA-seek Pipeline
A bioinformatics pipeline is more than the sum of its data processing steps. A pipeline without quality-control steps provides a myopic view of the potential sources of variation within your data (i.e., biological versus technical sources of variation). The RNA-seek pipeline is composed of a series of quality-control and data processing steps.

The accuracy of downstream interpretations made from transcriptomic data is highly dependent on the initial sample library. Unwanted sources of technical variation, if not accounted for properly, can influence the results. RENEE's comprehensive quality-control helps ensure your results are reliable and _reproducible across experiments_. In the data processing steps, RENEE quantifies gene and isoform expression and predicts gene fusions. Please note that the detection of alternative splicing events and variant calling will be incorporated in a later release.
The accuracy of downstream interpretations made from transcriptomic data is highly dependent on the initial sample library. Unwanted sources of technical variation, if not accounted for properly, can influence the results. RNA-seek's comprehensive quality-control helps ensure your results are reliable and _reproducible across experiments_. In the data processing steps, RNA-seek quantifies gene and isoform expression and predicts gene fusions. Please note that the detection of alternative splicing events and variant calling will be incorporated in a later release.


![RNA-seq quantification pipeline](./resources/RENEE_Pipeline.svg) <sup>**Fig 2. An Overview of RENEE Pipeline.** Gene and isoform counts are quantified and a series of QC-checks are performed to assess the quality of the data. This pipeline stops at the generation of a raw counts matrix and gene-fusion calling. To run the pipeline, a user must select their raw data, a reference genome, and output directory (i.e., the location where the pipeline performs the analysis). Quality-control information is summarized across all samples in a MultiQC report.</sup>
![RNA-seq quantification pipeline](https://github.com/skchronicles/RNA-seek/blob/main/resources/RNA-seek_Pipeline.svg) <sup>**Fig 2. An Overview of RNA-seek Pipeline.** Gene and isoform counts are quantified and a series of QC-checks are performed to assess the quality of the data. This pipeline stops at the generation of a raw counts matrix and gene-fusion calling. To run the pipeline, a user must select their raw data, a reference genome, and output directory (i.e., the location where the pipeline performs the analysis). Quality-control information is summarized across all samples in a MultiQC report.</sup>

**Quality Control**
[*FastQC*<sup>2</sup>](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is used to assess the sequencing quality. FastQC is run twice, before and after adapter trimming. It generates a set of basic statistics to identify problems that can arise during sequencing or library preparation. FastQC will summarize per base and per read QC metrics such as quality scores and GC content. It will also summarize the distribution of sequence lengths and will report the presence of adapter sequences.
@@ -48,36 +52,53 @@ The accuracy of the downstream interpretations made from transcriptomic data are
**Quantification**
[*Cutadapt*<sup>3</sup>](https://cutadapt.readthedocs.io/en/stable/) is used to remove adapter sequences, perform quality trimming, and remove very short sequences that would otherwise multi-map all over the genome prior to alignment.

[*STAR*<sup>4</sup>](https://github.com/alexdobin/STAR) is used to align reads to the reference genome. The RENEE pipeline runs STAR in two passes: splice junctions detected in the first pass are collected and aggregated across all samples, then provided to the second pass. In the second pass, these splice junctions are inserted into the genome indices prior to alignment.
[*STAR*<sup>4</sup>](https://github.com/alexdobin/STAR) is used to align reads to the reference genome. The RNA-seek pipeline runs STAR in two passes: splice junctions detected in the first pass are collected and aggregated across all samples, then provided to the second pass. In the second pass, these splice junctions are inserted into the genome indices prior to alignment.
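
The aggregation step between the two passes can be sketched roughly as follows. This is only an illustration, not the pipeline's own rule: the function name `merge_junctions` is hypothetical, and it assumes each sample's first pass produced STAR's standard `SJ.out.tab` file (tab-separated; columns 1 to 4 are chromosome, intron start, intron end, and strand coded as 0, 1, or 2). It keeps every junction once and prints it in the four-column form that STAR's `--sjdbFileChrStartEnd` option accepts.

```bash
# merge_junctions FILE...
# Hypothetical helper: pool first-pass junctions from several SJ.out.tab files,
# keeping each (chromosome, start, end, strand) combination only once.
merge_junctions() {
    awk -F'\t' -v OFS='\t' '
        # STAR codes strand as 0 (undefined), 1 (+), 2 (-)
        BEGIN { strand[0] = "."; strand[1] = "+"; strand[2] = "-" }
        {
            key = $1 OFS $2 OFS $3 OFS strand[$4]
            if (!(key in seen)) {
                seen[key] = 1
                print key
            }
        }
    ' "$@"
}

# Possible usage (file names are placeholders):
#   merge_junctions pass1/*.SJ.out.tab > merged.SJ.tab
#   STAR ... --sjdbFileChrStartEnd merged.SJ.tab
```

A real workflow might additionally filter junctions, for example by the read-support columns of `SJ.out.tab`, before passing them to the second pass.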

[*RSEM*<sup>5</sup>](https://github.com/deweylab/RSEM) is used to quantify gene and isoform expression. The expected counts from RSEM are merged across samples to create two counts matrices, one for gene counts and one for isoform counts.
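
As a rough sketch of what merging expected counts across samples involves (not the pipeline's actual implementation), the hypothetical function below reads RSEM `*.genes.results` files, which carry a header row with an `expected_count` column and the gene identifier in the first column, and prints one row per gene with one column per sample. The function name and the choice to fill missing genes with `0` are assumptions made for illustration.

```bash
# merge_expected_counts FILE...
# Hypothetical helper: build a tab-separated counts matrix from RSEM
# *.genes.results files (one column per input file).
merge_expected_counts() {
    awk -F'\t' -v OFS='\t' '
        FNR == 1 {                                  # header row of each file
            nfiles++
            name = FILENAME
            sub(/.*\//, "", name)                   # strip the directory
            sub(/\.genes\.results$/, "", name)      # strip the RSEM suffix
            sample[nfiles] = name
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "expected_count") col = i
            next
        }
        col > 0 {
            if (!($1 in seen)) { seen[$1] = 1; order[++ngenes] = $1 }
            count[$1, nfiles] = $col
        }
        END {
            line = "gene_id"
            for (f = 1; f <= nfiles; f++) line = line OFS sample[f]
            print line
            for (g = 1; g <= ngenes; g++) {
                line = order[g]
                for (f = 1; f <= nfiles; f++) {
                    # assumption: a gene absent from a sample counts as 0
                    value = ((order[g], f) in count) ? count[order[g], f] : 0
                    line = line OFS value
                }
                print line
            }
        }
    ' "$@"
}

# Possible usage (paths are placeholders):
#   merge_expected_counts results/*.genes.results > gene_counts.tsv
```

The same idea would apply to `*.isoforms.results` files, where the first column holds the transcript identifier instead; the suffix-stripping line would need adjusting.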

[*Arriba*<sup>22</sup>](https://arriba.readthedocs.io/en/latest/) is used to predict gene-fusion events. The pre-built human and mouse reference genomes use Arriba blacklists to reduce the false-positive rate.

#### 2.2 Reference Genomes
Reference files are pulled from an S3 bucket to the compute instance or local filesystem prior to execution.
RENEE comes bundled with pre-built reference files for the following genomes:
RNA-seek comes bundled with pre-built reference files for the following genomes:
| Name | Species | Genome | Annotation |
| -------- | ------- | ------------------ | -------- |
| hg38_30 | Homo sapiens (human) | [GRCh38](http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh38.primary_assembly.genome.fa.gz) | [Gencode<sup>6</sup> Release 30](http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz) |
| mm10_M21 | Mus musculus (mouse) | [GRCm38](http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M18/GRCm38.primary_assembly.genome.fa.gz) | [Gencode Release M21](http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M21/gencode.vM21.annotation.gtf.gz) |

> **Warning:** This section contains FTP links for downloading each reference file. Open the link in a new tab to start a download.
> **Note:** Release 30 for hg38 and Release M21 for mm10 were the only annotation versions available at the time of writing this documentation. Newer annotation versions may be added upon request and may already be available. Please contact [Vishal Koparde](mailto:vishal.koparde@nih.gov) for details.
#### 2.3 Dependencies
**Requires:** `singularity>=3.5` `snakemake>=6.0`

[Snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html) and [singularity](https://singularity.lbl.gov/all-releases) must be installed on the target system. Snakemake orchestrates the execution of each step in the pipeline. To guarantee reproducibility, each step relies on pre-built images from [DockerHub](https://hub.docker.com/orgs/nciccbr/repositories). Snakemake pulls these docker images, converting them to singularity on the fly, and saves them onto the local filesystem prior to job execution. As such, snakemake and singularity are the only two dependencies.
[Snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html) and [singularity](https://singularity.lbl.gov/all-releases) must be installed on the target system. Snakemake orchestrates the execution of each step in the pipeline. To guarantee reproducibility, each step relies on pre-built images from [DockerHub](https://hub.docker.com/orgs/nciccbr/repositories). Snakemake uses singularity to pull these images onto the local filesystem prior to job execution. As such, snakemake and singularity are the only two dependencies.
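
One way to confirm the two version requirements before a run is a small pre-flight check along these lines. It is a sketch under assumptions: the helper names (`version_ge`, `extract_version`, `check_tool`) are hypothetical and not part of the pipeline, `sort -V` is assumed to be available (GNU coreutils), and each tool is assumed to print a dotted version number in response to `--version`.

```bash
# Succeed when version $1 is greater than or equal to version $2.
# Uses version-aware sorting: the smaller of the two versions sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Read text on stdin and print the first dotted version number found,
# e.g. "singularity version 3.8.0" -> "3.8.0".
extract_version() {
    grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# Hypothetical helper: check that a tool is on PATH and meets a minimum version.
check_tool() {
    local tool="$1" minimum="$2" found
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "missing: $tool (need >= $minimum)"
        return 1
    fi
    found="$("$tool" --version 2>&1 | extract_version)"
    if version_ge "$found" "$minimum"; then
        echo "ok: $tool $found"
    else
        echo "too old: $tool $found (need >= $minimum)"
        return 1
    fi
}

# Possible usage, mirroring the requirements listed above:
#   check_tool singularity 3.5 && check_tool snakemake 6.0
```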

#### 2.4 Installation
Please clone this repository to your local filesystem using the following command:
```bash
# Clone Repository from Github
git clone https://github.com/skchronicles/RNA-seek.git
# Change your working directory to the RNA-seek repo
cd RNA-seek/
```

### 3. Run RNA-seek pipeline

#### 3.1 Using Singularity
```bash
# Coming Soon!
```

### 3. Run RENEE pipeline
#### 3.2 Using Docker
```bash
# Coming Soon!
```

#### 3.3 Biowulf
```bash
# RENEE is configured to use different execution backends: local or slurm
# rna-seek is configured to use different execution backends: local or slurm
# view the help page for more information
module load ccbrpipeliner
RENEE run --help
./rna-seek run --help

# @local: uses local singularity execution method
# The local MODE will run serially on compute
@@ -91,16 +112,27 @@ RENEE run --help
sinteractive --mem=110g --cpus-per-task=12 --gres=lscratch:200
module purge
module load singularity snakemake
RENEE run --input .tests/*.R?.fastq.gz --output /data/$USER/RNA_hg38 --genome hg38_30 --mode local
./rna-seek run --input .tests/*.R?.fastq.gz --output /data/$USER/RNA_hg38 --genome hg38_30 --mode local

# @slurm: uses slurm and singularity execution method
# The slurm MODE will submit jobs to the cluster.
# Running RENEE in this mode is recommended.
# Running rna-seek in this mode is recommended.
module purge
module load singularity snakemake
./RENEE run --input .tests/*.R?.fastq.gz --output /data/$USER/RNA_hg38 --genome hg38_30 --mode slurm
./rna-seek run --input .tests/*.R?.fastq.gz --output /data/$USER/RNA_hg38 --genome hg38_30 --mode slurm
```
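
The `--input` glob in the commands above (`.tests/*.R?.fastq.gz`) suggests mate files named `<sample>.R1.fastq.gz` and `<sample>.R2.fastq.gz`. Assuming paired-end data with that naming, a quick check like the following can catch a missing mate before jobs are launched. The function `find_unpaired` is a hypothetical helper, not something the pipeline provides, and it would not be appropriate for single-end data.

```bash
# find_unpaired DIR
# Hypothetical helper: print every *.R1.fastq.gz in DIR that has no matching
# *.R2.fastq.gz, and return non-zero if any mate is missing.
find_unpaired() {
    local dir="$1" r1 missing=0
    for r1 in "$dir"/*.R1.fastq.gz; do
        [ -e "$r1" ] || continue            # the glob matched nothing
        if [ ! -e "${r1%.R1.fastq.gz}.R2.fastq.gz" ]; then
            echo "$r1"
            missing=1
        fi
    done
    return "$missing"
}

# Possible usage before launching a run:
#   find_unpaired .tests || echo "some R1 files have no R2 mate"
```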

### 4. Contribute

This section is for new developers working with the RNA-seek pipeline. If you have added new features or made other changes, please consider contributing them back to the original repository:

1. [Fork](https://help.github.com/en/articles/fork-a-repo) the original repo to a personal or org account.
2. [Clone](https://help.github.com/en/articles/cloning-a-repository) the fork to your local filesystem.
3. Copy the modified files to the cloned fork.
4. Commit and push your changes to your fork.
5. Create a [pull request](https://help.github.com/en/articles/creating-a-pull-request) to this repository.


### 5. References

<sup>**1.** Daley, T. and A.D. Smith, Predicting the molecular complexity of sequencing libraries. Nat Methods, 2013. 10(4): p. 325-7.</sup>
@@ -129,5 +161,5 @@ module load singularity snakemake

<hr>
<p align="center">
<a href="#RENEE">Back to Top</a>
<a href="#RNA-seek">Back to Top</a>
</p>
