# Background

This RNA-seq workflow is adapted from the RNA-seq tutorial at [rnabio.org](https://rnabio.org).

``` Malachi Griffith*, Jason R. Walker, Nicholas C. Spies, Benjamin J. Ainscough, Obi L. Griffith*. 2015. Informatics for RNA-seq: A web resource for analysis on the cloud. PLoS Comp Biol. 11(8):e1004393```

The [ENCODE project](https://www.encodeproject.org) publishes [guidelines and best practices for RNA-seq](https://www.encodeproject.org/documents/cede0cbe-d324-4ce7-ace4-f0c3eddf5972/@@download/attachment/ENCODE%20Best%20Practices%20for%20RNA_v2.pdf) that are a must-read.

## Initial setup
The tutorial uses the AWS cloud, but I am writing this pipeline to be run locally on a Linux-based machine. More specifically, I am using Manjaro with kernel version 6.1.7-1-MANJARO (64-bit) and KDE Plasma version 5.26.4 on X11.

[Configure your environment](https://rnabio.org/module-00-setup/0000/09/01/Environment/) and then [install all necessary software](https://rnabio.org/module-00-setup/0000/10/01/Installation/) according to the rnabio website. The specific steps you follow while doing this will depend on your system and how you want to install the tools. I would suggest looking for maintained packages for your specific distribution before trying to install from source.

Specific version information of software installed on my setup:
- SAMtools v1.16.1, installed from AUR
    - run with ```samtools``` from CL
- bam-readcount v1.0.1, compiled from source
    - run from ```$HOME/bioinformatics/bam-readcount/build/bin``` with ```./bam-readcount```
- HISAT2 v2.2.1-1, installed from AUR
    - run with ```hisat2``` from CL
- stringtie v2.2.1, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/stringtie-2.2.1.Linux_x86_64``` with ```./stringtie```
- gffcompare v0.12.6-2, installed from AUR
    - run with ```gffcompare``` from CL
- htseq-count v0.11.3-1, installed from AUR and also in bioinformatics conda env
    - run with ```htseq-count``` from CL (the HTSeq library can also be imported in Python)
- tophat v2.1.1, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/tophat-2.1.1.Linux_x86_64``` with ```./gtf_to_fasta```
- kallisto v0.48.0-1, installed from AUR
    - run with ```kallisto``` from CL
- fastqc v0.11.9-1, installed from AUR
    - run with ```fastqc``` from CL
- multiqc v1.14, installed with pip3 in bioinformatics conda env
    - run with ```multiqc --help``` from CL
- picard v2.26.4, downloaded as JAR file
    - run from ```$HOME/bioinformatics/``` with ```java -jar picard.jar```
- flexbar v3.5.0, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/flexbar-3.5.0-linux``` with ```./flexbar```
- regtools v1.0.0, installed from AUR (git source)
    - run with ```regtools``` from CL
- rseqc v5.0.1, installed with pip3 in bioinformatics conda env
    - run the individual scripts, e.g. ```read_GC.py```, from CL in conda env
- bedops v2.4.40, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/bedops_linux_x86_64-v2.4.40/bin``` with ```./bedops``` and ```./gff2bed```
- gtfToGenePred v unknown, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/gtfToGenePred``` with ```./gtfToGenePred```
- genePredToBed v unknown, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/genePredToBed``` with ```./genePredToBed```
- how_are_we_stranded_here v unknown, installed with pip3 in bioinformatics conda env
    - run with ```check_strandedness``` in CL
- cell ranger v7.1.0, have to register to download, pre-compiled binary
    - run from ```$HOME/bioinformatics/cellranger-7.1.0``` with ```./cellranger```
- tabix, the website only offers an apt installation, which is not relevant on an Arch-based distro. Tabix seems to be included with HTSlib, which is already installed as a dependency of samtools and bcftools
- BWA v0.7.17-r1188, compiled from source
    - run from ```$HOME/bioinformatics/bwa``` with ```./bwa``` or ```bwa```
- BCFtools v1.16, installed from AUR
    - run with ```bcftools``` from CL
- peddy v0.4.8, cloned and installed with pip in bioinformatics conda env
    - run with ```peddy``` from CL in conda env
- slivar v0.2.7, pre-compiled binary
    - run from ```$HOME/bioinformatics``` with ```./slivar```
- STRling v0.5.1, pre-compiled binary
    - run from ```$HOME/bioinformatics``` with ```./strling```
- freebayes v1.3.6, pre-compiled binary
    - run from ```$HOME/bioinformatics``` with ```./freebayes-1.3.6-linux-amd64-static```
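
Before starting, it can help to confirm which of the tools marked "from CL" are actually reachable on your PATH. The snippet below is a minimal sketch; ```check_tools``` is a helper name made up for this example, and the tool list in the example call is only a subset of the list above.

```shell
# Report which command-line tools are reachable on PATH.
# check_tools is a hypothetical helper; pass it any command names to test.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf 'found    %s\n' "$tool"
    else
      printf 'MISSING  %s\n' "$tool"
    fi
  done
}

# Example call with some of the tools listed above as being on PATH.
check_tools samtools hisat2 gffcompare kallisto fastqc multiqc regtools bcftools
```

Tools that are run from their own directory with ```./``` will show up as MISSING here unless you add those directories to PATH.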

### R Libraries needed:
- devtools
- dplyr
- gplots
- ggplot2

Install from CRAN with
```install.packages(c("devtools","dplyr","gplots","ggplot2"),repos="http://cran.us.r-project.org")```

### Bioconductor libraries needed:
- [genefilter](http://bioconductor.org/packages/release/bioc/html/genefilter.html)
- [ballgown](http://bioconductor.org/packages/release/bioc/html/ballgown.html)
- [edgeR](http://www.bioconductor.org/packages/release/bioc/html/edgeR.html)
- [GenomicRanges](http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html)
- [rhdf5](https://www.bioconductor.org/packages/release/bioc/html/rhdf5.html)
- [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html)

### Sleuth
[Source](https://pachterlab.github.io/sleuth/download)

devtools needs to be installed first, if it is not already: ```install.packages("devtools")```

Then install sleuth with ```devtools::install_github("pachterlab/sleuth")```


# General notes on RNA-seq workflow

Due to inconsistencies in the naming conventions for chromosomes, you should obtain your reference genome and annotation packages from the same source (Ensembl, NCBI, UCSC). Additionally, your annotations must correspond to the same reference genome version (e.g. both correspond to NCBI human GRCh38).

[This guide](https://pmbio.org/module-02-inputs/0002/02/01/Reference_Genome/) explains reference genomes and how to obtain them.



# Obtain known gene/transcript annotations

Download the annotations from Ensembl via ```wget http://genomedata.org/rnaseq-tutorial/annotations/GRCh38/chr22_with_ERCC92.gtf```. 

You can preview some of the file with ```cat chr22_with_ERCC92.gtf | column -t | less -p exon -S```.


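Count the lines that contain ```gene``` as a whole word, which approximates the number of gene records:

```cat chr22_with_ERCC92.gtf | grep -w gene | wc -l```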
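
Matching the word "gene" anywhere on a line can over-count if that word also appears in another column. As a stricter alternative, the sketch below counts records whose feature type (column 3) is exactly ```gene```, and separately counts distinct ```gene_id``` values. Both function names are made up for this example.

```shell
# Count GTF records whose feature type (column 3) is exactly "gene".
count_gene_features() {
  awk -F'\t' '$3 == "gene"' "$1" | wc -l | tr -d ' '
}

# Count distinct gene_id values across all records in the GTF.
count_unique_gene_ids() {
  grep -o 'gene_id "[^"]*"' "$1" | sort -u | wc -l | tr -d ' '
}
```

For this dataset they would be called as ```count_gene_features chr22_with_ERCC92.gtf``` and ```count_unique_gene_ids chr22_with_ERCC92.gtf```. The two numbers can differ when some genes have no record of type ```gene``` in the file.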

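You can view the structure of one transcript with ```grep ENST00000342247 $RNA_REF_GTF | less -p "exon\s" -S```. This assumes ```$RNA_REF_GTF``` points to the GTF file, as set during the rnabio environment configuration; otherwise substitute ```chr22_with_ERCC92.gtf```. Exit with q.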

For other sources of annotation files for a HISAT2/StringTie/Ballgown pipeline, refer to [the rnabio annotations page](https://rnabio.org/module-01-inputs/0001/03/01/Annotations/).

HISAT2 has prebuilt indexes for DNA and RNA for both Ensembl and UCSC genomes [here](https://ccb.jhu.edu/software/hisat2/index.shtml).

**Important**: chromosome names in the .gtf file must be the same as those in the reference genome .fa file. If StringTie returns expression values that are all 0, a mismatch here is the likely cause. Download the annotation and reference genome files from the same source.

**Make sure your reference genome and annotations are for the same build of the genome (e.g. both for GRCh38).**
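
The chromosome-name requirement can be checked before running anything else. The sketch below lists sequence names that appear in column 1 of the GTF but not among the FASTA headers; the function name is made up for this example, and it assumes the sequence name is the first whitespace-delimited token of each FASTA header.

```shell
# List sequence names that appear in the GTF but not in the FASTA.
# The body runs in a subshell so its variables do not leak into your session.
gtf_names_missing_from_fasta() (
  gtf="$1"
  fasta="$2"
  tmp_gtf="$(mktemp)"
  tmp_fa="$(mktemp)"
  # Column 1 of every non-comment GTF line is the sequence name.
  grep -v '^#' "$gtf" | cut -f1 | sort -u > "$tmp_gtf"
  # FASTA headers start with '>'; keep the first token after it.
  grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$tmp_fa"
  # comm -23 prints lines that are only in the first (GTF) list.
  comm -23 "$tmp_gtf" "$tmp_fa"
  rm -f "$tmp_gtf" "$tmp_fa"
)
```

Called as ```gtf_names_missing_from_fasta chr22_with_ERCC92.gtf chr22_with_ERCC92.fa```, empty output means every name used in the annotation exists in the reference.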

# Creating index with HISAT2

I did this using the Python scripts directly from the [HISAT2 github repo](https://github.com/DaehwanKimLab/hisat2); I am not sure whether the installed package listed above provides them. Just clone the repo and use the scripts you need.

Run the following commands from the directory that holds your reference .gtf and .fa files (here, chr22_with_ERCC92.gtf and chr22_with_ERCC92.fa). You may need a ```./``` in front of the first two commands.

```hisat2_extract_splice_sites.py chr22_with_ERCC92.gtf > splicesites.tsv```

```hisat2_extract_exons.py chr22_with_ERCC92.gtf > exons.tsv```

```hisat2-build -p 8 --ss splicesites.tsv --exon exons.tsv chr22_with_ERCC92.fa chr22_with_ERCC92```

The output should be a collection of binary files with names like chr22_with_ERCC92.x.ht2, the x being a number.
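
A quick way to confirm the build produced its files is to count those that share the index prefix. This is a minimal sketch with a made-up function name; it only counts files matching ```PREFIX.*.ht2``` and makes no claim about their contents.

```shell
# Count the files that share a given HISAT2 index prefix.
# The body runs in a subshell so its variables do not leak into your session.
count_index_files() (
  prefix="$1"
  n=0
  for f in "$prefix".*.ht2; do
    # If the glob matched nothing, the literal pattern remains; skip it.
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
)
```

For the index built above, run ```count_index_files chr22_with_ERCC92``` in the directory where ```hisat2-build``` was run. A result of 0 means the build did not write any index files there.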

# RNA-seq test data

```mkdir test_data && cd test_data```

```wget http://genomedata.org/rnaseq-tutorial/HBR_UHR_ERCC_ds_5pc.tar```

Unpack the tarball.

```tar -xvf HBR_UHR_ERCC_ds_5pc.tar```

Look at the head of one of the files.

```zcat UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz | head -n 8```

Determine how many reads are in the first library.

```zcat UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz | grep -P "^\@HWI" | wc -l```
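
The header-based count above depends on read names starting with ```@HWI```, which is specific to this dataset. An alternative sketch, assuming the standard four-lines-per-record FASTQ layout, divides the line count by four. The function name is made up for this example.

```shell
# Count reads in a gzipped FASTQ file, assuming four lines per record.
count_fastq_reads() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}
```

For the first library this would be ```count_fastq_reads UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz```. A non-integer result suggests a truncated or malformed file.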


# Check_strandedness (optional, but not really)

This step is not required but due to differences in usage of RNA-seq library construction kits, it is **highly** recommended to empirically verify strandedness using the below method.

check_strandedness should already be installed from the initial setup (it is provided by how_are_we_stranded_here). It can also be run from a [Docker image](https://hub.docker.com/r/mgibio/checkstrandedness).

Convert gtf to genepred

```gtfToGenePred chr22_with_ERCC92.gtf chr22_with_ERCC92.genePred```

Convert genePred to bed12

```genePredToBed chr22_with_ERCC92.genePred chr22_with_ERCC92.bed12```

Use bedtools to create a transcript FASTA from the bed12 file

```bedtools getfasta -fi chr22_with_ERCC92.fa -bed chr22_with_ERCC92.bed12 -s -split -name -fo chr22_ERCC92_transcripts.fa```

Clean up the header lines of the transcript file

```cat chr22_ERCC92_transcripts.fa | perl -ne 'if($_ =~/^\>\S+\:\:(ERCC\-\d+)\:.*/){print ">$1\n"}elsif ($_ =~/^\>(\S+)\:\:.*/){print ">$1\n"}else{print $_}' > chr22_ERCC92_transcripts.clean.fa```

View the clean file

```less chr22_ERCC92_transcripts.clean.fa```

Add an empty transcript_id field to any line of the .gtf file that lacks one

```awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' chr22_with_ERCC92.gtf > chr22_with_ERCC92_tidy.gtf```

Check strandedness

```check_strandedness --gtf chr22_with_ERCC92_tidy.gtf --transcripts chr22_ERCC92_transcripts.clean.fa --reads_1 test_data/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz --reads_2 test_data/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz```

OUTPUT:

Fraction of reads failed to determine: 0.1123

Fraction of reads explained by "1++,1--,2+-,2-+": 0.0155 (1.7% of explainable reads)

Fraction of reads explained by "1+-,1-+,2++,2--": 0.8722 (98.3% of explainable reads)

Over 90% of reads explained by "1+-,1-+,2++,2--"

Data is likely RF/fr-firststrand
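
The percentages in parentheses are each fraction divided by the sum of the two "explained by" fractions, so the undetermined reads are excluded. The arithmetic can be reproduced with a short sketch; the function name is made up for this example.

```shell
# Convert the two "explained by" fractions into percentages of explainable reads.
explainable_percent() {
  awk -v a="$1" -v b="$2" \
    'BEGIN { printf "%.1f %.1f\n", 100 * a / (a + b), 100 * b / (a + b) }'
}

# Fractions taken from the output above; prints "1.7 98.3".
explainable_percent 0.0155 0.8722
```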

More info on strandedness can be found [here](https://rnabio.org/module-09-appendix/0009/12/01/StrandSettings/)

# Pre-alignment QC

To get a sense of the raw reads quality in your fastq files, run the following programs.

```fastqc *.fastq.gz``` in the same directory as your reads.

It will output an HTML report for each FASTQ file, in the same directory, containing the full set of quality measures.

In the "over-represented sequences" section, try [BLASTing](https://blast.ncbi.nlm.nih.gov/Blast.cgi) the sequences to try determining a source organism.

Run multiqc to combine all the fastqc reports.

```multiqc .``` run in the same directory as the reports.

Make a new directory for the fastqc .html outputs and move them

```mkdir fastqc && mv *_fastqc* fastqc```

# Trim adapters

Make a new folder for the trimmed reads

```mkdir trimmed```

Download the Illumina adapter sequence files, or the relevant files for your sequencing machine.

```wget http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa```

Use flexbar to remove adapter sequences and trim the first 13 bases of each read. The commands below use 24 threads and take a few seconds each on my machine; adjust ```--threads``` to something manageable for your system. Expect longer run times with a slower CPU.

Possible options for flexbar and their meaning:

   - ```--adapter-min-overlap 7``` requires a minimum of 7 bases to match the adapter
   - ```--adapter-trim-end RIGHT``` removes the adapter from the 3 prime (RIGHT) end of the read
   - ```--max-uncalled 300``` allows as many as 300 uncalled (N) bases (MiSeq read lengths can be 300bp)
   - ```--min-read-length 25``` sets the minimum read length allowed after trimming to 25bp
   - ```--threads 8``` uses 8 threads
   - ```--zip-output GZ``` writes gzipped FASTQ output to save space (the input FASTQ files are gzipped as well)
   - ```--adapters``` defines the path to the FASTA file of adapter sequences to trim
   - ```--reads``` defines the path to the read 1 FASTQ file
   - ```--reads2``` defines the path to the read 2 FASTQ file
   - ```--target``` sets a base path for the output files. ```_1.fastq.gz``` and ```_2.fastq.gz``` are appended to it for read 1 and read 2 respectively
   - ```--pre-trim-left``` trims a fixed number of bases from the left end of each read. For example, to trim 5 bases from the left side of reads: ```--pre-trim-left 5```
   - ```--pre-trim-right``` trims a fixed number of bases from the right end of each read. For example, to trim 5 bases from the right side of reads: ```--pre-trim-right 5```
   - ```--pre-trim-phred``` trims based on Phred quality value to deal with higher error rates towards the end of reads. For example, to trim the 3 prime end until a quality value of 30 or higher is reached, specify: ```--pre-trim-phred 30```

The commands to run for flexbar for this dataset are as follows:

```./flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters /home/user/bioinformatics/ref_genome/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 24 --zip-output GZ --reads /home/user/bioinformatics/ref_genome/test_data/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz --reads2 /home/user/bioinformatics/ref_genome/test_data/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz --target /home/user/bioinformatics/ref_genome/test_data/trimmed/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22```

```./flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters /home/user/bioinformatics/ref_genome/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 24 --zip-output GZ --reads /home/user/bioinformatics/ref_genome/test_data/UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz --reads2 /home/user/bioinformatics/ref_genome/test_data/UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz --target /home/user/bioinformatics/ref_genome/test_data/trimmed/UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22```

```./flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters /home/user/bioinformatics/ref_genome/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 24 --zip-output GZ --reads /home/user/bioinformatics/ref_genome/test_data/UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz --reads2 /home/user/bioinformatics/ref_genome/test_data/UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz --target /home/user/bioinformatics/ref_genome/test_data/trimmed/UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22```

```./flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters /home/user/bioinformatics/ref_genome/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 24 --zip-output GZ --reads /home/user/bioinformatics/ref_genome/test_data/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz --reads2 /home/user/bioinformatics/ref_genome/test_data/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz --target /home/user/bioinformatics/ref_genome/test_data/trimmed/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22```

```./flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters /home/user/bioinformatics/ref_genome/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 24 --zip-output GZ --reads /home/user/bioinformatics/ref_genome/test_data/HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz --reads2 /home/user/bioinformatics/ref_genome/test_data/HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz --target /home/user/bioinformatics/ref_genome/test_data/trimmed/HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22```

```./flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters /home/user/bioinformatics/ref_genome/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 24 --zip-output GZ --reads /home/user/bioinformatics/ref_genome/test_data/HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz --reads2 /home/user/bioinformatics/ref_genome/test_data/HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz --target /home/user/bioinformatics/ref_genome/test_data/trimmed/HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22```
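
The six commands above differ only in the sample name, so they can be generated in a loop. The sketch below prints the commands rather than running them (a dry run), so you can inspect them first. The variable names and the function name are made up for this example, and the paths are the ones used in the commands above.

```shell
# Assumed locations, copied from the commands above; adjust for your system.
BASE=/home/user/bioinformatics/ref_genome
ADAPTERS="$BASE/illumina_multiplex.fa"
READS_DIR="$BASE/test_data"
OUT_DIR="$READS_DIR/trimmed"

# Print one flexbar command per sample (dry run, nothing is executed).
# The body runs in a subshell so the loop variables stay local.
print_flexbar_commands() (
  for sample in \
    UHR_Rep1_ERCC-Mix1 UHR_Rep2_ERCC-Mix1 UHR_Rep3_ERCC-Mix1 \
    HBR_Rep1_ERCC-Mix2 HBR_Rep2_ERCC-Mix2 HBR_Rep3_ERCC-Mix2
  do
    name="${sample}_Build37-ErccTranscripts-chr22"
    echo ./flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT \
      --adapters "$ADAPTERS" --pre-trim-left 13 --max-uncalled 300 \
      --min-read-length 25 --threads 24 --zip-output GZ \
      --reads "$READS_DIR/$name.read1.fastq.gz" \
      --reads2 "$READS_DIR/$name.read2.fastq.gz" \
      --target "$OUT_DIR/$name"
  done
)

print_flexbar_commands
```

Once the printed commands look right, they can be executed from the flexbar directory with ```print_flexbar_commands | sh```. This works here because none of the paths contain spaces.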

Compare the impact of trimming with the original QC reports by running fastqc and multiqc on the trimmed reads

In the directory of the trimmed reads, run:

```fastqc *.fastq.gz```

```multiqc .```

Same as before, make a new directory for the fastqc .html outputs and move them

```mkdir fastqc && mv *_fastqc* fastqc```


# Alignment with HISAT2

For more details on HISAT2, refer to the [manual on GitHub](http://daehwankimlab.github.io/hisat2/manual/).

The outputs after running HISAT2 will be a SAM/BAM file for each data set.

Options for HISAT2 include:


   - ```-p 8``` tells HISAT2 to use eight threads for alignment.
   - ```--rna-strandness RF``` specifies the strandedness of the RNA-seq library. RF is used because these libraries were made with the TruSeq strand-specific kit, which agrees with the check_strandedness result above. See the strandedness link above for the other options.
   - ```--rg-id $ID``` specifies a read group ID that is a unique identifier.
   - ```--rg SM:$SAMPLE_NAME``` specifies a read group sample name. This together with rg-id will allow you to determine which reads came from which sample in the merged bam later on.
   - ```--rg LB:$LIBRARY_NAME``` specifies a read group library name. This together with rg-id will allow you to determine which reads came from which library in the merged bam later on.
   - ```--rg PL:ILLUMINA``` specifies a read group sequencing platform.
   - ```--rg PU:$PLATFORM_UNIT``` specifies a read group sequencing platform unit. Typically this consists of FLOWCELL-BARCODE.LANE
   - ```--dta``` reports alignments tailored for transcript assemblers.
   - ```-x /path/to/hisat2/index``` The HISAT2 index filename prefix (minus the trailing .X.ht2) built earlier, including splice sites and exons.
   - ```-1 /path/to/read1.fastq.gz``` The read 1 FASTQ file, optionally gzip (.gz) or bzip2 (.bz2) compressed.
   - ```-2 /path/to/read2.fastq.gz``` The read 2 FASTQ file, optionally gzip (.gz) or bzip2 (.bz2) compressed.
   - ```-S /path/to/output.sam``` The output SAM format text file of alignments.
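
Put together, the options above form a single command per sample. The sketch below prints such a command rather than running it, so it can be checked first. The function name is made up for this example, and the read group values, platform unit and file paths in the example call are illustrative placeholders, not real metadata.

```shell
# Print a HISAT2 command for one paired-end sample (dry run).
# Arguments: read group ID, sample name, library name, platform unit,
#            index prefix, read 1 file, read 2 file, output SAM file.
print_hisat2_command() {
  echo hisat2 -p 8 --rna-strandness RF \
    --rg-id "$1" --rg "SM:$2" --rg "LB:$3" --rg PL:ILLUMINA --rg "PU:$4" \
    --dta -x "$5" -1 "$6" -2 "$7" -S "$8"
}

# Placeholder values for the first UHR replicate; the platform unit is made up.
print_hisat2_command UHR_Rep1 UHR UHR_Rep1_ERCC-Mix1 FLOWCELL-BARCODE.1 \
  chr22_with_ERCC92 \
  trimmed/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22_1.fastq.gz \
  trimmed/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22_2.fastq.gz \
  UHR_Rep1.sam
```

The read file names follow the ```_1.fastq.gz``` and ```_2.fastq.gz``` suffixes that flexbar appends to its ```--target``` path.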

