rNA

Please note: This repository is no longer maintained. All development of rNA has moved to another location. Please visit our new repository for all new and future development.

View a demo of an interactive rNA report with The Cancer Genome Atlas Glioblastoma Multiforme (TCGA-GBM) RNA-seq data.

1. Introduction

report Not Applicable, also known as rNA, is an interactive report that lets users identify problematic samples prior to downstream analysis. rNA has been designed to work with the output from our RNA-seq pipeline.

Before drawing biological conclusions, it is important to assess the quality of each sample to ensure that there are no signs of sequencing error or systematic bias in your data. Modern high-throughput sequencers generate millions of reads per run, and in practice, problems can arise at any step. rNA allows a user to interactively filter samples based on different quality-control metrics. This is especially useful when working with large cohorts of hundreds of samples.

2. Overview

rNA has been designed to work with the output from our RNA-seq pipeline. Here is an overview of the pipeline's major data processing and quality control steps.

2.1 Primary Analysis

The quality of each sample is independently assessed using FastQC², Preseq¹, Picard tools¹⁰, RSeQC⁹, SAMtools¹³, and QualiMap¹⁶. FastQ Screen¹⁷ and Kraken¹⁴ + Krona¹⁵ are used to screen for various sources of contamination. Adapter sequences are removed using Cutadapt³ prior to mapping to the hg38 reference genome. STAR⁴ is run in two-pass mode: splice junctions are collected and aggregated across all samples, then provided to the second pass of STAR. Gene expression levels are quantified using RSEM⁵. The expected counts from RSEM are merged across samples to create a counts matrix for downstream analysis. RSeQC's⁹ tin.py is used to calculate transcript integrity numbers (the TIN counts matrix) for all canonical protein-coding transcripts.

2.2 Downstream Analysis

rNA takes the raw counts matrix, the TIN counts matrix, and the QC metadata table generated by the pipeline described above as input. These are among the required command-line arguments to rNA.R. rNA then performs the following steps. The expected counts from RSEM are filtered to remove lowly expressed genes with edgeR's¹⁸ filterByExpr() function using the following criteria: genes must have at least 10 reads in >= 70% of samples. The normalisation factors calculated using edgeR's TMM method are used as scaling factors for library size. Using the voom() function in limma⁸, the counts are converted to log2-counts-per-million (logCPM) and quantile normalized.
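The filtering rule stated above can be sketched directly on a tab-delimited counts matrix. This is a simplified illustration, not rNA's implementation: edgeR's filterByExpr() applies its threshold on the CPM scale and takes library sizes into account, whereas the hypothetical `filter_genes` function below compares raw counts. The file layout (one header row, gene IDs in the first column) is also an assumption.

```shell
# filter_genes MIN_COUNT MIN_PROP < counts.tsv
# Keep genes with at least MIN_COUNT reads in at least MIN_PROP of samples.
# Assumes a tab-delimited matrix: header row first, gene ID in column 1.
filter_genes() {
    awk -F '\t' -v min_count="$1" -v min_prop="$2" '
        NR == 1 { print; next }                 # pass the header through
        {
            n_samples = NF - 1
            n_pass = 0
            for (i = 2; i <= NF; i++)           # count samples meeting the threshold
                if ($i + 0 >= min_count) n_pass++
            if (n_samples > 0 && n_pass / n_samples >= min_prop) print
        }
    '
}
```

For example, `filter_genes 10 0.7 < counts.tsv > filtered_counts.tsv` would apply the thresholds quoted above to a matrix with that layout.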

voom⁷ is an acronym for mean-variance modelling at the observational level. The key concern is to estimate the mean-variance relationship in the data, then use this to compute appropriate weights for each observation. Count data almost always show non-trivial mean-variance relationships. Raw counts show increasing variance with increasing count size, while log-counts typically show a decreasing mean-variance trend. The voom() function estimates the mean-variance trend for log-counts, then assigns a weight to each observation based on its predicted variance. The weights are then used in the linear modelling process to adjust for heteroscedasticity.
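As a small worked example of the logCPM scale that voom works on: a count r in a sample with effective library size R (library size multiplied by its TMM normalisation factor) is transformed as log2((r + 0.5) / (R + 1) * 10^6). The sketch below only evaluates that formula for a single value; the function name `log_cpm` is hypothetical, and the quantile normalisation and weight estimation performed by voom() are not reproduced here.

```shell
# log_cpm COUNT LIB_SIZE
# Print log2-counts-per-million for one count, using the 0.5 and 1 offsets
# that keep the logarithm defined when the count is zero.
log_cpm() {
    awk -v count="$1" -v lib="$2" 'BEGIN {
        cpm = (count + 0.5) / (lib + 1) * 1e6
        printf "%.4f\n", log(cpm) / log(2)      # awk log() is natural log
    }'
}
```

For instance, `log_cpm 15.5 999999` prints 4.0000, since (15.5 + 0.5) / 1,000,000 * 10^6 = 16 and log2(16) = 4.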

3. Run rNA

3.1 Usage

$ ./rNA.R [OPTIONS] -m RMARKDOWN -r RAW_COUNTS -t TIN_COUNTS -q QC_TABLE -o OUTPUT_DIR

3.2 Required Arguments

Flag	Type	Description
-m, --rmarkdown	Script	rNA Rmarkdown file
-r, --raw_counts	File	Input Raw counts matrix
-t, --tin_count	File	Input TIN counts matrix
-q, --qc_table	File	Input QC Metadata Table
-o, --output_dir	Path	Path to Output Directory

3.3 OPTIONS

Flag	Description
-h, --help	Displays help and usage information
-f, --output_filename	Output HTML filename, Default: rNA.html

3.4 Example

Rscript rNA.R -m src/rNA.Rmd \
              -r data/TCGA-GBM_Raw_RSEM_Genes.txt \
              -t data/TCGA-GBM_TINs.txt \
              -q data/multiqc_matrix.txt -o "$PWD"

If you receive an error that pandoc was not found, you will need to set the following bash environment variable: RSTUDIO_PANDOC. You can add the line below to your ~/.bash_profile or your ~/.bashrc so that it is set whenever you open a new shell.

echo 'export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc' >> ~/.bash_profile
source ~/.bash_profile
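Before rendering a report, it can help to confirm that pandoc is actually reachable. The following is a minimal sketch, assuming RSTUDIO_PANDOC names the directory that contains the pandoc executable (as in the path above); the function name `pandoc_status` is hypothetical and not part of rNA.

```shell
# pandoc_status: report where pandoc can be found, if anywhere.
# Returns non-zero when neither RSTUDIO_PANDOC nor PATH provides pandoc.
pandoc_status() {
    if [ -n "${RSTUDIO_PANDOC:-}" ] && [ -x "${RSTUDIO_PANDOC}/pandoc" ]; then
        echo "pandoc found via RSTUDIO_PANDOC: ${RSTUDIO_PANDOC}"
    elif command -v pandoc >/dev/null 2>&1; then
        echo "pandoc found on PATH: $(command -v pandoc)"
    else
        echo "pandoc not found; set RSTUDIO_PANDOC to the directory containing pandoc" >&2
        return 1
    fi
}
```

Running `pandoc_status` in the same shell you use to launch rNA.R shows which of the two locations, if either, is in effect.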

4. Filtering Criteria

General Recommendations

Here is a set of generalized guidelines for filtering based on different QC metrics:

Metric	Guideline
medTIN	> 65
Trimmed Reads	> 10000000
% Aligned to Reference	> 65%
% Duplicates	< 65%
% rRNA	< 10%
% Coding	> 35%

Please note: Some of these metrics vary from genome to genome depending on the quality of the assembly and annotation; this has been taken into consideration for our set of supported reference genomes.
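The guidelines in the table above can also be applied outside the interactive report, for example to pre-screen a QC table on the command line. The sketch below is illustrative only: the function name `qc_pass` and every column name in it are hypothetical, so rename them to match the header of your own QC metadata table. The thresholds are copied from the table and may need adjusting for your data.

```shell
# qc_pass < qc_table.tsv
# Print the names of samples that satisfy every guideline listed above.
# NOTE: all column names here are hypothetical placeholders.
qc_pass() {
    awk -F '\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i    # map column name -> index
            n = split("sample medTIN trimmed_reads pct_aligned pct_duplicates pct_rRNA pct_coding", need, " ")
            for (j = 1; j <= n; j++)
                if (!(need[j] in col)) {
                    print "missing column: " need[j] > "/dev/stderr"
                    exit 1
                }
            next
        }
        $col["medTIN"]         > 65       &&
        $col["trimmed_reads"]  > 10000000 &&
        $col["pct_aligned"]    > 65       &&
        $col["pct_duplicates"] < 65       &&
        $col["pct_rRNA"]       < 10       &&
        $col["pct_coding"]     > 35 {
            print $col["sample"]
        }
    '
}
```

Looking columns up by name rather than by position means the function keeps working if the table's column order changes, and it stops with an error if an expected column is absent.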

5. References

1. Daley, T. and A.D. Smith, Predicting the molecular complexity of sequencing libraries. Nat Methods, 2013. 10(4): p. 325-7.
2. Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data.
3. Martin, M. (2011). "Cutadapt removes adapter sequences from high-throughput sequencing reads." EMBnet 17(1): 10-12.
4. Dobin, A., et al., STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 2013. 29(1): p. 15-21.
5. Li, B. and C.N. Dewey, RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 2011. 12: p. 323.
6. Harrow, J., et al., GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res, 2012. 22(9): p. 1760-74.
7. Law, C.W., et al., voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol, 2014. 15(2): p. R29.
8. Smyth, G.K., Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol, 2004. 3: p. Article3.
9. Wang, L., et al. (2012). "RSeQC: quality control of RNA-seq experiments." Bioinformatics 28(16): 2184-2185.
10. The Picard toolkit. https://broadinstitute.github.io/picard/.
11. Ewels, P., et al. (2016). "MultiQC: summarize analysis results for multiple tools and samples in a single report." Bioinformatics 32(19): 3047-3048.
12. R Core Team (2018). R: A Language and Environment for Statistical Computing. Vienna, Austria, R Foundation for Statistical Computing.
13. Li, H., et al. (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25(16): 2078-2079.
14. Wood, D. E. and S. L. Salzberg (2014). "Kraken: ultrafast metagenomic sequence classification using exact alignments." Genome Biol 15(3): R46.
15. Ondov, B. D., et al. (2011). "Interactive metagenomic visualization in a Web browser." BMC Bioinformatics 12(1): 385.
16. Okonechnikov, K., et al. (2015). "Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data." Bioinformatics 32(2): 292-294.
17. Wingett, S. and S. Andrews (2018). "FastQ Screen: A tool for multi-genome mapping and quality control." F1000Research 7(2): 1338.
18. Robinson, M. D., et al. (2009). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics 26(1): 139-140.
