# Introduction

This RNA-seq pipeline uses the GRCm39 reference genome, HISAT2 for alignment, and HTSeq-count for generating raw counts. The rest of the analysis is currently done using iDEP.96.

## Initial setup
The tutorial being followed uses the AWS cloud, but I am writing this pipeline to be run locally on a Linux machine. More specifically, I am using Manjaro with kernel version 6.1.11-1-MANJARO (64-bit).

[Configure your environment](https://rnabio.org/module-00-setup/0000/09/01/Environment/) and then [install all necessary software](https://rnabio.org/module-00-setup/0000/10/01/Installation/) according to the rnabio website. The specific steps you follow while doing this will depend on your system and how you want to install the tools. I would suggest looking for maintained packages for your specific distribution before trying to install from source.

Specific version information of software installed on my setup:
- SAMtools v1.16.1, installed from AUR
    - run with ```samtools``` from CL
- bam-readcount v1.0.1, compiled from source
    - run from ```$HOME/bioinformatics/bam-readcount/build/bin``` with ```./bam-readcount```
- HISAT2 v2.2.1-1, installed from AUR
    - run with ```hisat2``` from CL
- stringtie v2.2.1, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/stringtie-2.2.1.Linux_x86_64``` with ```./stringtie```
- gffcompare v0.12.6-2, installed from AUR
    - run with ```gffcompare``` from CL
- htseq-count v0.11.3-1, installed from AUR and also in bioinformatics conda env
    - run with ```htseq-count``` from terminal with conda bioinformatics env active
- tophat v2.1.1, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/tophat-2.1.1.Linux_x86_64``` with ```./gtf_to_fasta```
- kallisto v0.48.0-1, installed from AUR
    - run with ```kallisto``` from CL
- fastqc v0.11.9-1, installed from AUR
    - run with ```fastqc``` from CL
- multiqc v1.14, installed with pip3 in bioinformatics conda env
    - run with ```multiqc --help``` from CL
- picard v2.26.4, downloaded as JAR file
    - run from ```$HOME/bioinformatics/``` with ```java -jar picard.jar```
- flexbar v3.5.0, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/flexbar-3.5.0-linux``` with ```./flexbar```
- regtools v1.0.0, installed from AUR (git source)
    - run with ```regtools``` from CL
- rseqc v5.0.1, installed with pip3 in bioinformatics conda env
    - run the individual scripts (e.g. ```read_GC.py```) from CL with the conda env active
- bedops v2.4.40, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/bedops_linux_x86_64-v2.4.40/bin``` with ```./bedops``` and ```./gff2bed```
- gtfToGenePred, version unknown, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/gtfToGenePred``` with ```./gtfToGenePred```
- genePredToBed, version unknown, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/genePredToBed``` with ```./genePredToBed```
- how_are_we_stranded_here, version unknown, installed with pip3 in bioinformatics conda env
    - run with ```check_strandedness``` in CL
- Cell Ranger v7.1.0, pre-compiled binary (registration required to download)
    - run from ```$HOME/bioinformatics/cellranger-7.1.0``` with ```./cellranger```
- tabix: the website only offers apt installation, which is not relevant on an Arch-based distro. Tabix seems to be included with HTSlib, which is already installed as a dependency of samtools and bcftools
- BWA v0.7.17-r1188, compiled from source
    - run from ```$HOME/bioinformatics/bwa``` with ```./bwa``` or ```bwa```
- BCFtools v1.16, installed from AUR
    - run with ```bcftools``` from CL
- peddy v0.4.8, cloned and installed with pip in bioinformatics conda env
    - run with ```peddy``` from CL in conda env
- slivar v0.2.7, pre-compiled binary
    - run from ```$HOME/bioinformatics``` with ```./slivar```
- STRling v0.5.1, pre-compiled binary
    - run from ```$HOME/bioinformatics``` with ```./strling```
- freebayes v1.3.6, pre-compiled binary
    - run from ```$HOME/bioinformatics``` with ```./freebayes-1.3.6-linux-amd64-static```
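
Since the tools above are installed in several different ways, it can help to confirm which ones are actually reachable from the command line before starting. The function below is a minimal sketch of such a check; ```check_tools``` is a name I made up for illustration, and it only tests whether a command is on the PATH, not whether the version matches.

```shell
# Report any tools that cannot be found on PATH.
# Returns 0 when every tool is found and 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not on PATH: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call, using tool names from the list above:
# check_tools samtools hisat2 gffcompare htseq-count kallisto fastqc multiqc regtools bcftools
```

Tools that are run from their own folder (stringtie, flexbar, bedops, and so on) will be reported as missing unless their folder has been added to the PATH first.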

### R Libraries needed:
- devtools
- dplyr
- gplots
- ggplot2
- UpSetR

Install with 
```install.packages(c("devtools","dplyr","gplots","ggplot2","UpSetR"),repos="http://cran.us.r-project.org")```

### Bioconductor libraries needed:
- [genefilter](http://bioconductor.org/packages/release/bioc/html/genefilter.html)
- [ballgown](http://bioconductor.org/packages/release/bioc/html/ballgown.html)
- [edgeR](http://www.bioconductor.org/packages/release/bioc/html/edgeR.html)
- [GenomicRanges](http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html)
- [rhdf5](https://www.bioconductor.org/packages/release/bioc/html/rhdf5.html)
- [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html)
- [sva](https://www.bioconductor.org/packages/release/bioc/html/sva.html)
- [GAGE](https://bioconductor.org/packages/release/bioc/html/gage.html)

### Sleuth
[Source](https://pachterlab.github.io/sleuth/download)

devtools needs to be installed first if it is not already: ```install.packages("devtools")```

Install with ```devtools::install_github("pachterlab/sleuth")```


***
# General notes on RNA-seq workflow

Due to inconsistencies in the naming conventions for chromosomes, you should obtain your reference genome and annotation packages from the same source (Ensembl, NCBI, UCSC). Additionally, your annotations must correspond to the same reference genome version (e.g. both correspond to NCBI human GRCh38).

[This guide](https://pmbio.org/module-02-inputs/0002/02/01/Reference_Genome/) explains reference genomes and how to obtain them under the "Reference Genome Options" section at the bottom.



***
# Obtain reference genome and known gene/transcript annotations

### Reference genome

The Mus musculus GRCm39 reference genome in FASTA format was downloaded from Ensembl with ```wget https://ftp.ensembl.org/pub/release-109/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz``` and then decompressed.

Location: ```/home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.dna.primary_assembly.fa```

### Gene annotations in .gtf format

Gene annotations in .gtf format were downloaded from Ensembl as well via the command ```wget ftp://ftp.ensembl.org/pub/release-109/gtf/mus_musculus/Mus_musculus.GRCm39.109.gtf.gz```.

Location: ```/home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.gtf```

You can preview the file with ```head -n 50 /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.gtf | column -t | less -p exon -S```

You can count the gene records in the file with ```grep -w gene /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.gtf | wc -l```

You can view the structure of one gene with ```grep ENSMUSG00000102693 /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.gtf | less -p "exon\s" -S```

**Important**: chromosome names in the .gtf file must be the same as those in the reference genome .fa file. If StringTie returns expression values that are all 0, this is the likely cause. Download the annotation and reference genome files from the same source.

**Make sure your reference genome and annotations are for the same build of the genome (e.g. both for GRCh38).**
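
A quick way to catch a naming mismatch before running anything downstream is to compare the sequence names in the two files. The sketch below is one way to do that, assuming an Ensembl-style FASTA where the sequence name is the text before the first space on each header line. The function names are my own and are not part of any tool.

```shell
# List sequence names in a FASTA file (text up to the first space on header lines).
fasta_seqnames() {
    grep '^>' "$1" | sed 's/^>//' | cut -d' ' -f1 | sort -u
}

# List sequence names in a GTF file (first column, skipping '#' comment lines).
gtf_seqnames() {
    grep -v '^#' "$1" | cut -f1 | sort -u
}

# Print GTF sequence names that are absent from the FASTA.
# No output means every name in the GTF also exists in the FASTA.
gtf_names_missing_from_fasta() {
    gtf_tmp=$(mktemp) && fa_tmp=$(mktemp) || return 1
    gtf_seqnames "$1" > "$gtf_tmp"
    fasta_seqnames "$2" > "$fa_tmp"
    comm -23 "$gtf_tmp" "$fa_tmp"
    rm -f "$gtf_tmp" "$fa_tmp"
}

# Example call with the files used in this pipeline:
# gtf_names_missing_from_fasta Mus_musculus.GRCm39.109.gtf Mus_musculus.GRCm39.dna.primary_assembly.fa
```

If the command prints names such as ```chr1``` while the FASTA uses ```1```, the two files come from sources with different naming conventions.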

***
# Creating index with HISAT2

**WARNING** This step will take **A LOT** of RAM (well over 100 GB), so make sure the system has enough.

It is much easier and faster, if possible, to download a pre-built index provided by the HISAT2 devs at [their website](https://daehwankimlab.github.io/hisat2/download/).

I downloaded the GRCm38 index and renamed the files with ```rename genome Mus_musculus.GRCm38.dna.primary_assembly *.ht2```.

To build the index yourself, the splice sites and exons are extracted using the Python scripts from the [HISAT2 github repo](https://github.com/DaehwanKimLab/hisat2). I am not sure whether the same can be done with the installed package listed above. Just clone the repo and use the scripts you need.

Switch to the cloned hisat2 repo with ```cd /home/user/bioinformatics/hisat2```, or provide the full file path.

Run the following commands.

```/home/user/bioinformatics/hisat2/hisat2_extract_splice_sites.py /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.gtf > /home/user/bioinformatics/reference_genomes/GRCm39/splicesites.tsv```

```/home/user/bioinformatics/hisat2/hisat2_extract_exons.py /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.gtf > /home/user/bioinformatics/reference_genomes/GRCm39/exons.tsv```

I have hisat2 installed separately from the repo, which is why I don't need to provide a file path for ```hisat2-build```. This index was not built on my machine; it was built on a Google Cloud Platform instance with 256 GB of RAM.

```hisat2-build -p 24 --ss /home/user/bioinformatics/reference_genomes/GRCm39/splicesites.tsv --exon /home/user/bioinformatics/reference_genomes/GRCm39/exons.tsv /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.dna.primary_assembly.fa /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.dna.primary_assembly```

The output should be a collection of binary files with names like Mus_musculus.GRCm39.dna.primary_assembly.x.ht2 or .rf, the x being a number.
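
Before starting a long alignment run, it is worth checking that the index is complete. The sketch below assumes a standard small-genome HISAT2 index made of eight files with suffixes ```.1.ht2``` through ```.8.ht2``` (large indexes use ```.ht2l``` instead, which this does not handle). The function name is hypothetical.

```shell
# Print any of the eight expected index files that are missing or empty
# for the given index prefix. No output means the index looks complete.
hisat2_index_missing() {
    prefix=$1
    for i in 1 2 3 4 5 6 7 8; do
        [ -s "$prefix.$i.ht2" ] || echo "$prefix.$i.ht2"
    done
}

# Example call with the prefix used in the hisat2-build command:
# hisat2_index_missing /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.dna.primary_assembly
```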

***
# Look at some of the actual RNA-seq data

Look at the head of the files.

```cd ~/bioinformatics/PyMT_Exercise_Paper_2023```

```zcat EP_01_S1_R1_001.fastq.gz | head -n 100```

A cursory look shows that the reads appear to be around 50 bp long. Since these are paired-end reads, that makes sense. For the quality lines, most characters seem to be ```C```, which has ASCII code 67 and, with the Phred+33 encoding, corresponds to a Phred quality score of 34, or roughly a 0.04% chance that the base was incorrectly called.
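
The conversion from a quality character to an error probability can be sketched in the shell as follows. This assumes the Phred+33 encoding, where the score is the ASCII code of the character minus 33 and the error probability is 10^(-Q/10). Both function names are made up for this example.

```shell
# Convert one FASTQ quality character to its Phred score (Phred+33 encoding).
phred_from_char() {
    # A leading quote makes printf return the character's numeric code.
    ascii=$(printf '%d' "'$1")
    echo $((ascii - 33))
}

# Convert a Phred score Q to the base-call error probability 10^(-Q/10).
error_prob_from_phred() {
    awk -v q="$1" 'BEGIN { printf "%.5f\n", 10 ^ (-q / 10) }'
}

# Example calls:
# phred_from_char C          # prints 34
# error_prob_from_phred 34   # prints 0.00040
```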

Determine how many reads are in a file.

```zcat EP_01_S1_R1_001.fastq.gz | grep -P "^\@VH00" | wc -l```

Note, you may have to change the grep command to match the instrument ID.

There is a bash file in the Github repo under the PyMT_Exercise_Paper_2023 folder titled ```total_PE_read_counts.sh``` that will count all the reads for all FASTQ files and output them to a txt file. Simply define the directory for the FASTQ files, the output directory for the text file, and if there is a different naming scheme besides ```R1``` and ```R2``` for the paired reads, change those values in the script. You may also need to change the search value for grep from ```VH00``` to what is in your FASTQ files.
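
An alternative that does not depend on the instrument ID is to count lines and divide by four, since each FASTQ record takes exactly four lines when the sequence is not line-wrapped. The sketch below is not the ```total_PE_read_counts.sh``` script from the repo, just a minimal illustration with function names I chose myself.

```shell
# Count reads in a FASTQ file (plain or gzip-compressed),
# assuming four lines per record.
count_fastq_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { printf "%d\n", NR / 4 }'
}

# Succeed only when the R1 and R2 files contain the same number of reads.
same_read_count() {
    [ "$(count_fastq_reads "$1")" -eq "$(count_fastq_reads "$2")" ]
}

# Example calls:
# count_fastq_reads EP_01_S1_R1_001.fastq.gz
# same_read_count EP_01_S1_R1_001.fastq.gz EP_01_S1_R2_001.fastq.gz && echo "pair OK"
```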

### After looking at the total read counts there are some potential issues

The script confirms that the R1 and R2 files of every pair contain the same number of reads.

***
# Check strandedness

This step is not required but due to differences in usage of RNA-seq library construction kits, it is **highly** recommended to empirically verify strandedness using the below method.

The ```check_strandedness``` tool (from how_are_we_stranded_here) should already be installed from the initial setup. It can also be installed as a [Docker image](https://hub.docker.com/r/mgibio/checkstrandedness).

The intermediate files generated below are stored in the GRCm39 reference folder for now.

```/home/user/bioinformatics/reference_genomes/GRCm39```

Convert gtf to genepred

```cd /home/user/bioinformatics/gtfToGenePred```, or just provide full file path.

```/home/user/bioinformatics/gtfToGenePred/gtfToGenePred /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.gtf /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.genePred```

Convert genePred to bed12

```cd /home/user/bioinformatics/genePredToBed```, or just provide full file path.

```/home/user/bioinformatics/genePredToBed/genePredToBed /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.genePred /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.bed12```

Use bedtools to create a transcript FASTA file from the bed12 file

```bedtools getfasta -fi /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.dna.primary_assembly.fa -bed /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.bed12 -s -split -name -fo /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.dna.primary_assembly.transcripts.fa```

Clean up the header lines of the transcript file

```cat /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.dna.primary_assembly.transcripts.fa | perl -ne 'if($_ =~/^\>\S+\:\:(ERCC\-\d+)\:.*/){print ">$1\n"}elsif ($_ =~/^\>(\S+)\:\:.*/){print ">$1\n"}else{print $_}' > /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.dna.primary_assembly.transcripts.clean.fa```

View the clean file

```less /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.dna.primary_assembly.transcripts.clean.fa```

Add in transcript_id field to .gtf file

```awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.gtf > /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109_tidy.gtf```

Run check_strandedness

**Everything up to here should be done already**.

```conda activate bioinformatics```

This command saves some files to the directory it is run from, best to change to where the fastq files are.

```cd /home/user/bioinformatics/PyMT_Exercise_Paper_2023```

```check_strandedness --gtf /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109_tidy.gtf --transcripts /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.dna.primary_assembly.transcripts.clean.fa --reads_1 /home/user/bioinformatics/PyMT_Exercise_Paper_2023/EP_01_S1_R1_001.fastq.gz --reads_2 /home/user/bioinformatics/PyMT_Exercise_Paper_2023/EP_01_S1_R2_001.fastq.gz```

OUTPUT:

Total 200000 usable reads were sampled

This is PairEnd Data

Fraction of reads failed to determine: 0.1196

Fraction of reads explained by "1++,1--,2+-,2-+": 0.0044 (0.5% of explainable reads)

Fraction of reads explained by "1+-,1-+,2++,2--": 0.8760 (99.5% of explainable reads)

Over 90% of reads explained by "1+-,1-+,2++,2--"

Data is likely RF/fr-firststrand

More info on strandedness can be found [here](https://rnabio.org/module-09-appendix/0009/12/01/StrandSettings/)

Based on the strandedness, this sequencing was likely done using the dUTP method.
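
The strandedness result feeds into several later steps, so it can help to keep the per-tool settings in one place. The lookup below is a sketch: the RF row matches the settings used in this document, while the FR and unstranded rows follow the options those tools document and should be double-checked against the strand settings page linked above. The function name and its arguments are my own.

```shell
# Print the strand-related option for a tool given the library type.
# Library type: RF (fr-firststrand), FR (fr-secondstrand) or unstranded.
# Tool: hisat2, htseq or picard.
strand_setting() {
    case "$1:$2" in
        RF:hisat2)         echo "--rna-strandness RF" ;;
        RF:htseq)          echo "--stranded reverse" ;;
        RF:picard)         echo "STRAND=SECOND_READ_TRANSCRIPTION_STRAND" ;;
        FR:hisat2)         echo "--rna-strandness FR" ;;
        FR:htseq)          echo "--stranded yes" ;;
        FR:picard)         echo "STRAND=FIRST_READ_TRANSCRIPTION_STRAND" ;;
        unstranded:hisat2) echo "" ;;
        unstranded:htseq)  echo "--stranded no" ;;
        unstranded:picard) echo "STRAND=NONE" ;;
        *) echo "unknown library type or tool: $1 $2" >&2; return 1 ;;
    esac
}

# Example call:
# strand_setting RF htseq   # prints --stranded reverse
```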

***
# Pre-alignment QC

To get a sense of the raw reads quality in your fastq files, run the following programs.

### fastqc [docs](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/)

```cd /home/user/bioinformatics/PyMT_Exercise_Paper_2023```

```fastqc -t 24 *.fastq.gz``` in the same directory as your reads.

It will output an HTML report (and a .zip archive) for each FASTQ file in the same directory, containing a full report of the quality measures.

In the "over-represented sequences" section, try [BLASTing](https://blast.ncbi.nlm.nih.gov/Blast.cgi) the sequences to try determining a source organism.

### multiqc

Run multiqc to combine all the fastqc reports.

```cd /home/user/bioinformatics/PyMT_Exercise_Paper_2023```

Run ```multiqc .``` in the same directory as the reports.

I renamed the report to ```pre-alignment_multiqc_report.html```

Make a new directory for the fastqc .html outputs and move them

```mkdir fastqc && mv *_fastqc* fastqc```

## Bash script

I made a bash script to handle this step, ```pre-alignment_QC.sh```. All that should need to be changed is the fastq_dir variable, which should point to where the FASTQ files are. The script does not yet account for any sub-folders.

## View reports

Looking at the MultiQC report, none of the warnings are concerning or unusual. 

***
# Alignment with HISAT2

For more details on HISAT2, refer to the [manual on Github](http://daehwankimlab.github.io/hisat2/manual/)

The outputs after running HISAT2 will be a SAM/BAM file for each data set.

Options for HISAT2 include:


   - ```-p 24``` tells HISAT2 to use 24 threads for alignment.
   - ```--rna-strandness``` specifies the strandedness of the RNA-seq library. check_strandedness reported that the data is likely RF/fr-firststrand, so we use ```RF``` (the single-end equivalent would be ```R```).
   - ```--rg-id $ID``` specifies a read group ID that is a unique identifier.
   - ```--rg SM:$SAMPLE_NAME``` specifies a read group sample name. This together with rg-id will allow you to determine which reads came from which sample in the merged bam later on.
   - ```--rg LB:$LIBRARY_NAME``` specifies a read group library name. This together with rg-id will allow you to determine which reads came from which library in the merged bam later on.
   - ```--rg PL:ILLUMINA``` specifies a read group sequencing platform.
   - ```--rg PU:$PLATFORM_UNIT``` specifies a read group sequencing platform unit. Typically this consists of FLOWCELL-BARCODE.LANE
   - ```--dta``` Reports alignments tailored for transcript assemblers.
   - ```-x /path/to/hisat2/index``` The HISAT2 index filename prefix (minus the trailing .X.ht2) built earlier including splice sites and exons.
   - ```-1 /path/to/read1.fastq.gz``` The read 1 FASTQ file, optionally gzip(.gz) or bzip2(.bz2) compressed.
   - ```-2 /path/to/read2.fastq.gz``` The read 2 FASTQ file, optionally gzip(.gz) or bzip2(.bz2) compressed.
   - ```-U /path/to/unpaired.fastq.gz``` If the data is unpaired (single-end), provide the path to the FASTQ file (optionally compressed) here instead of using ```-1``` and ```-2```.
   - ```-S /path/to/output.sam``` The output SAM format text file of alignments.
   
Create an alignment folder for the outputs.

```cd /home/user/bioinformatics/PyMT_Exercise_Paper_2023 && mkdir -p alignments && cd alignments```

Run a command for each of the replicates. I will try running on the trimmed FASTQ files and see how it goes.

I wrote a script to handle this, ```HISAT2_alignment.sh```

```hisat2 -p 24 --rg-id EP_01_S1_001 --rg SM:EP_01_S1_001 --rg LB:EP_01_S1_001_Lib --rg PL:ILLUMINA -x /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.dna.primary_assembly --rna-strandness RF --dta -1 /home/user/bioinformatics/PyMT_Exercise_Paper_2023/EP_01_S1_001.read1.fastq.gz -2 /home/user/bioinformatics/PyMT_Exercise_Paper_2023/EP_01_S1_001.read2.fastq.gz -S /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments/EP_01_S1_001.sam```
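
To avoid typing the command once per sample, the command line can be generated from the sample name. The sketch below is not the ```HISAT2_alignment.sh``` script mentioned above. It is a hypothetical helper that prints the command so it can be inspected first and then piped to ```sh```, and it assumes the ```<sample>.read1.fastq.gz``` and ```<sample>.read2.fastq.gz``` naming used in the example command and paths without spaces.

```shell
# Print the hisat2 command for one sample.
# Arguments: sample name, index prefix, FASTQ directory, output directory.
hisat2_cmd() {
    sample=$1; index=$2; fastq_dir=$3; out_dir=$4
    echo "hisat2 -p 24 --rg-id $sample --rg SM:$sample --rg LB:${sample}_Lib --rg PL:ILLUMINA -x $index --rna-strandness RF --dta -1 $fastq_dir/$sample.read1.fastq.gz -2 $fastq_dir/$sample.read2.fastq.gz -S $out_dir/$sample.sam"
}

# Example: print the commands for two samples, then run them by appending "| sh".
# for s in EP_01_S1_001 EP_02_S2_001; do
#     hisat2_cmd "$s" /path/to/index_prefix /path/to/fastq /path/to/alignments
# done
```

Printing first and running second mirrors the ```find ... -exec echo ... | sh``` pattern used elsewhere in these notes.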


### Convert SAM to BAM files

I also wrote a script for this, ```Convert_SAM_to_BAM.sh```

```-@ int``` is the number of threads to use.

```cd /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments```

```samtools sort -@ 24 -o EP_01_S1_001.bam EP_01_S1_001.sam```

And so on for the remaining samples.

The SAM files are very large and will take up hundreds of gigabytes, so once the BAM files have been created, run ```rm -f *.sam``` in the alignments directory.
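
A slightly safer variant is to delete each SAM file only if its own sort finished successfully. The sketch below is not the ```Convert_SAM_to_BAM.sh``` script. It is a hypothetical helper that prints one command per SAM file in a directory, assuming paths without spaces.

```shell
# Print one "sort, then delete on success" command per SAM file in a directory.
sam_to_bam_cmds() {
    for sam in "$1"/*.sam; do
        [ -e "$sam" ] || continue      # skip when no SAM files are present
        bam=${sam%.sam}.bam
        echo "samtools sort -@ 24 -o $bam $sam && rm -f $sam"
    done
}

# Example: inspect the commands, then run them by appending "| sh".
# sam_to_bam_cmds /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments
```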

### Merge all BAM files
Create one merged BAM file for each group using Picard MergeSamFiles. Other tools such as "samtools merge" and "bamtools merge" can also achieve this. I can't do this for the current dataset yet because I'm not sure which sample belongs to which group.

```java -Xmx2g -jar /home/user/bioinformatics/picard.jar MergeSamFiles -OUTPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/FC.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P11B_FC1_S1_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P11D_FC2_S2_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P13B_FC3_S3_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P13C_FC4_S4_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P13E_FC5_S5_L001_001.bam```

```java -Xmx2g -jar /home/user/bioinformatics/picard.jar MergeSamFiles -OUTPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/FT.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P12BL_FT1_S6_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P12CRR_FT2_S7_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P13A_FT3_S8_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P14E_FT4_S9_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P13F_FT5_S10_L001_001.bam```

```java -Xmx2g -jar /home/user/bioinformatics/picard.jar MergeSamFiles -OUTPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/MC.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P186-5_MC1_S11_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P186-7_MC2_S12_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P189-1_MC3_S13_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P187-1_MC4_S14_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P188-2_MC5_S15_L001_001.bam```

```java -Xmx2g -jar /home/user/bioinformatics/picard.jar MergeSamFiles -OUTPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/MT.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P187-3_MT1_S16_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P186-6_MT2_S17_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P188-1_MT3_S18_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P188-4_MT4_S19_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P188-5_MT5_S20_L001_001.bam```

You can verify that all BAM files were created successfully; there should be 24.

In the same directory as the aligned .bam files, ```ls -l *.bam | wc -l```

***
# Using the Integrative Genomics Viewer (IGV)

```cd /home/user/bioinformatics/IGV_Linux_2.15.4```

IGV can be run from the download directory using ```./igv.sh```

To re-visit the shortish tutorial on navigating IGV, go [here](https://rnabio.org/module-02-alignment/0002/04/01/IGV/). You can follow the tutorial there using the provided files.

For our dataset, we need to index the BAM files before loading them into IGV. This is done with samtools. The following command indexes all .bam files in the directory at once, which is easier than running one command per file.

```cd /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments```

```find *.bam -exec echo samtools index -@28 {} \; | sh```

Load the .bam files in IGV and look around. The .bam.bai index needs to be in the same directory as the .bam files.
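
If IGV complains about a missing index, a quick check like the one below lists the BAM files that have no ```.bam.bai``` next to them. This is a minimal sketch with a made-up function name, and it only looks for the ```<file>.bam.bai``` naming mentioned above.

```shell
# Print every BAM file in a directory that has no matching .bam.bai index.
bams_missing_index() {
    for bam in "$1"/*.bam; do
        [ -e "$bam" ] || continue      # skip when no BAM files are present
        [ -e "$bam.bai" ] || echo "$bam"
    done
}

# Example call:
# bams_missing_index /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments
```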

***
# Alignment QC

The command ```samtools flagstat``` can provide a summary of the alignments.

```mkdir -p /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments/flagstat```

```cd /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments```

```samtools flagstat /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments/EP_01_S1_001.bam > /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments/flagstat/EP_01_S1_001.bam.flagstat```

And so on for the remaining BAM files.

Or, you can also use ```find``` to do it in one command.

```find *_*.bam -exec echo samtools flagstat {} \> flagstat/{}.flagstat \; | sh```

### FastQC on BAM files

From your alignment directory (where the bam files are).

```mkdir /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments/fastqc```

```cd /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments```

```fastqc -t 24 *.bam```

Clean up the current directory and move the reports into the fastqc folder.

```mv *fastqc.html fastqc/ && mv *fastqc.zip fastqc/```

### Picard

Using Picard we can generate metrics and figures that express the quality of our RNA-seq data.

We first need to create input files for Picard based on the reference genome so it can properly run the function CollectRnaSeqMetrics.

```java -jar /home/user/bioinformatics/picard.jar CreateSequenceDictionary -R /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.dna.primary_assembly.fa -O /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.dna.primary_assembly.dict```

Create a bed file of the location of ribosomal sequences in our reference (first extract from the gtf then convert to bed). Note that here we pull all the "rrna" transcripts from the GTF.

```grep --color=none -i -P "rrna" /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.gtf > /home/user/bioinformatics/reference_genomes/GRCm39/ref_ribosome.gtf```

Note: ```gff2bed``` is part of BEDOPS. Run the script directly from the bedops/bin folder. You may need to add the /bin folder to your path variable ```export PATH="$PATH:/home/user/bioinformatics/bedops_linux_x86_64-v2.4.40/bin"```, otherwise convert2bed may not be found.

```cd /home/user/bioinformatics/bedops_linux_x86_64-v2.4.40/bin```

```gff2bed < /home/user/bioinformatics/reference_genomes/GRCm39/ref_ribosome.gtf > /home/user/bioinformatics/reference_genomes/GRCm39/ref_ribosome.bed```

Create interval list file to identify the location of the ribosomal sequences in the reference

```java -jar /home/user/bioinformatics/picard.jar BedToIntervalList -I /home/user/bioinformatics/reference_genomes/GRCm39/ref_ribosome.bed -O /home/user/bioinformatics/reference_genomes/GRCm39/ref_ribosome.interval_list -SD /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.dna.primary_assembly.dict```

Create genePred file for the reference transcriptome

```cd /home/user/bioinformatics/gtfToGenePred```

```./gtfToGenePred -genePredExt /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.gtf /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.ref_flat.txt```

Reformat the genePred file

```cat /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.ref_flat.txt | awk '{print $12"\t"$0}' | cut -d$'\t' -f1-11 > /home/user/bioinformatics/reference_genomes/GRCm39/tmp.txt```

```mv /home/user/bioinformatics/reference_genomes/GRCm39/tmp.txt /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.ref_flat.txt```

Go to the directory with your alignments.

```cd /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments && mkdir picard```

This data is stranded paired-end reads, specifically ```RF/fr-firststrand stranded (dUTP)``` as determined in the check strandedness step, so the STRAND flag should be set to ```STRAND=SECOND_READ_TRANSCRIPTION_STRAND```.

```find *_*.bam -exec echo java -jar /home/user/bioinformatics/picard.jar CollectRnaSeqMetrics I={} O=picard/{}.RNA_Metrics REF_FLAT=/home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.ref_flat.txt STRAND=SECOND_READ_TRANSCRIPTION_STRAND RIBOSOMAL_INTERVALS=/home/user/bioinformatics/reference_genomes/GRCm39/ref_ribosome.interval_list \; | sh```

### RSeQC

Another tool for generating RNA-seq QC reports.

You will need
- aligned bam files
- index file for each bam file
- transcript bed file (in bed12 format)

Start by converting the reference gtf to genePred. This step has probably already been done.

```gtfToGenePred /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.gtf /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.genePred```

Then convert the genePred to bed12. This step has probably also been done already.

```genePredToBed /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.genePred /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.bed12```

Go back to the alignment directory. Make sure the bioinformatics env is active.

```cd /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments && mkdir rseqc```

```geneBody_coverage.py -i Pistilli_P11B_FC1_S1_L001_001.bam,Pistilli_P11D_FC2_S2_L001_001.bam,Pistilli_P13B_FC3_S3_L001_001.bam,Pistilli_P13C_FC4_S4_L001_001.bam,Pistilli_P13E_FC5_S5_L001_001.bam -r /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.bed12 -o rseqc/FC```

```geneBody_coverage.py -i Pistilli_P12BL_FT1_S6_L001_001.bam,Pistilli_P12CRR_FT2_S7_L001_001.bam,Pistilli_P13A_FT3_S8_L001_001.bam,Pistilli_P14E_FT4_S9_L001_001.bam,Pistilli_P13F_FT5_S10_L001_001.bam -r /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.bed12 -o rseqc/FT```

```geneBody_coverage.py -i Pistilli_P186-5_MC1_S11_L001_001.bam,Pistilli_P186-7_MC2_S12_L001_001.bam,Pistilli_P189-1_MC3_S13_L001_001.bam,Pistilli_P187-1_MC4_S14_L001_001.bam,Pistilli_P188-2_MC5_S15_L001_001.bam -r /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.bed12 -o rseqc/MC```

```geneBody_coverage.py -i Pistilli_P187-3_MT1_S16_L001_001.bam,Pistilli_P186-6_MT2_S17_L001_001.bam,Pistilli_P188-1_MT3_S18_L001_001.bam,Pistilli_P188-4_MT4_S19_L001_001.bam,Pistilli_P188-5_MT5_S20_L001_001.bam -r /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.bed12 -o rseqc/MT```

Run the same geneBody_coverage.py command once for each group, as shown above. The BAM files passed to ```-i``` must be separated by commas with no spaces.

The following commands should also be run from the alignment directory ```cd /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments```

```find *_*.bam -exec echo inner_distance.py -i {} -r /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.bed12 -o rseqc/{} \; | sh && find *_*.bam -exec echo junction_annotation.py -i {} -r /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.bed12 -o rseqc/{} \; | sh && find *_*.bam -exec echo junction_saturation.py -i {} -r /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.bed12 -o rseqc/{} \; | sh && find *_*.bam -exec echo read_distribution.py -i {} -r /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.bed12 \> rseqc/{}.read_dist.txt \; | sh && find *_*.bam -exec echo RNA_fragment_size.py -i {} -r /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.bed12 \> rseqc/{}.frag_size.txt \; | sh && find *_*.bam -exec echo bam_stat.py -i {} \> {}.bam_stat.txt \; | sh && rm -f log.txt```

### MultiQC

Now we will compile a multiQC report from all of the QC done above.

In the alignment directory ```cd /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments```.

```multiqc ./```

***
# Expression Analysis

For this run of the pipeline, **I will not be generating normalized expression values and instead only the raw counts, as below**.

### BamQC

Before generating counts, especially raw counts, it is useful to perform some QC on the aligned BAM files produced by HISAT2. This will let us better examine the quality of our alignments and set a reasonable quality filter value in the ```--minaqual``` flag for HTSeq-count.

This is done using the [BamQC](https://github.com/s-andrews/BamQC) program. After compiling the program, you can generate a report for your BAM file using the following command:

```cd /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments```

I ran the command below for now.

```/home/user/bioinformatics/BamQC/bin/bamqc --gff /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.gtf *.bam```

You can list multiple BAM files in one command. It will save the report to the same directory as your BAM file.

Of particular interest is the "Mapping Quality Distribution" section, as it shows the spread of the MAPQ values generated by HISAT2, which should range from 0 to 60. A value closer to 60 means a higher-quality alignment.

Looking at the BamQC reports for each group, the majority of reads (around 50-60%) have a MAPQ score of 60; the rest are 1 or 0. Running the HTSeq commands with ```--minaqual``` set to 1, 10, and 40 gave essentially the same counts. If the spread were more varied, 40 would probably be a good value, but here it doesn't change anything since there aren't any alignments with a score between 2 and 59.
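
The same distribution can be checked from the command line, which is handy for seeing how many alignments a given ```--minaqual``` cutoff would keep. The sketch below reads SAM-format text from standard input (for example from ```samtools view```) and uses column 5, which holds the MAPQ value. Both function names are hypothetical.

```shell
# Tabulate MAPQ values (column 5 of SAM records) read from standard input.
# Output: one line per MAPQ value as "<MAPQ><TAB><count>", sorted by MAPQ.
mapq_histogram() {
    grep -v '^@' | cut -f5 | sort -n | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Count SAM records on standard input whose MAPQ is at least the given threshold.
count_mapq_at_least() {
    grep -v '^@' | awk -F '\t' -v min="$1" '$5 >= min { n++ } END { print n + 0 }'
}

# Example calls:
# samtools view EP_01_S1_001.bam | mapq_histogram
# samtools view EP_01_S1_001.bam | count_mapq_at_least 10
```

Note that these count alignment records, not read pairs, so the numbers will not match the htseq-count totals exactly.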

### Generate raw counts instead of FPKM/TPM values

If you only want to generate the raw counts this can be done using HTSeq-count [docs](https://htseq.readthedocs.io/).

Some notes about the flags used:


- ```--format``` specifies the input file format, either BAM or SAM. Since we have BAM format files, select 'bam' for this option.
- ```--order``` provides the expected sort order of the input file. Previously we generated position-sorted BAM files, so use 'pos'.
- ```--mode``` determines how to deal with reads that overlap more than one feature. Various sources say 'intersection-strict' mode is best in most cases.
- ```--stranded``` specifies whether the data is stranded or not. TruSeq strand-specific RNA libraries call for the 'reverse' option. This data is RF/fr-firststrand, so the flag is set to 'reverse'.
- ```--minaqual``` will skip all reads with alignment quality lower than the given minimum value.
- ```--type``` specifies the feature type (3rd column in the GFF file) to be used. The default, suitable for RNA-Seq and Ensembl GTF files, is exon.
- ```--idattr``` specifies the feature ID used to identify the counts in the output table. The default, suitable for RNA-Seq and Ensembl GTF files, is gene_id.

Change directories to where the expression folder is ```cd /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments```

```mkdir -p expression/htseq_counts && cd expression/htseq_counts```

Run htseq-count to get the raw counts for each BAM file, i.e. the command below once per .bam file. Or, use the "run_htseq-count.sh" script to run them all in parallel, because htseq-count is not multithreaded.

```htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments/EP_01_S1_001.bam /home/user/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.109.gtf > /home/user/bioinformatics/PyMT_Exercise_Paper_2023/alignments/expression/htseq_counts/EP_01_gene.tsv```

Merge all the results files into one file.

```cd /home/user/bioinformatics/PyMT_Paper_2023/alignments/expression/htseq_counts```

```join FC1_gene.tsv FC2_gene.tsv | join - FC3_gene.tsv | join - FC4_gene.tsv | join - FC5_gene.tsv | join - FT1_gene.tsv | join - FT2_gene.tsv | join - FT3_gene.tsv | join - FT4_gene.tsv | join - FT5_gene.tsv | join - MC1_gene.tsv | join - MC2_gene.tsv | join - MC3_gene.tsv | join - MC4_gene.tsv | join - MC5_gene.tsv | join - MT1_gene.tsv | join - MT2_gene.tsv | join - MT3_gene.tsv | join - MT4_gene.tsv | join - MT5_gene.tsv > gene_read_counts_table_all.tsv```

```echo "GeneID FC1 FC2 FC3 FC4 FC5 FT1 FT2 FT3 FT4 FT5 MC1 MC2 MC3 MC4 MC5 MT1 MT2 MT3 MT4 MT5" > header.txt```

```cat header.txt gene_read_counts_table_all.tsv | grep -v "__" | awk -v OFS="\t" '$1=$1' > gene_read_counts_table_all_final.tsv```

If you're curious, ```grep -v "__"``` filters out the summary lines at the end of each htseq-count file, which tally reads that did not align, had poor-quality alignments, etc.

```rm -f gene_read_counts_table_all.tsv header.txt```

```head gene_read_counts_table_all_final.tsv | column -t```
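
The chain of ```join``` commands above has to be rewritten whenever the sample list changes. A loop-based version, sketched below, merges any number of htseq-count files in one call. It assumes each file is named ```<sample>_gene.tsv``` and contains the same gene IDs. The function name is my own, and the files are sorted before joining because ```join``` expects sorted input.

```shell
# Merge htseq-count output files into one tab-separated table with a header.
# Summary lines starting with "__" are dropped from every file.
merge_htseq_counts() {
    merged=$(mktemp) || return 1
    header="GeneID"
    first=1
    for f in "$@"; do
        header="$header $(basename "$f" _gene.tsv)"
        if [ "$first" -eq 1 ]; then
            grep -v '^__' "$f" | LC_ALL=C sort -k1,1 > "$merged"
            first=0
        else
            next=$(mktemp) || return 1
            grep -v '^__' "$f" | LC_ALL=C sort -k1,1 | LC_ALL=C join "$merged" - > "$next"
            mv "$next" "$merged"
        fi
    done
    # join separates fields with spaces; convert them to tabs for the final table.
    { echo "$header"; cat "$merged"; } | tr ' ' '\t'
    rm -f "$merged"
}

# Example call:
# merge_htseq_counts FC1_gene.tsv FC2_gene.tsv FT1_gene.tsv > gene_read_counts_table_all_final.tsv
```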

***
# Differential Expression (DESeq2)

The differential expression is done using the bioconductor package [DESeq2](https://master.bioconductor.org/packages/release/bioc/html/DESeq2.html).

```cd /home/user/bioinformatics/PyMT_Paper_2023/alignments```

```mkdir -p de/deseq2/ && cd de/deseq2/```

Start an R session by typing ```R``` in a terminal.

The following code is an R script that will perform DE using DESeq2. You can run each command line by line.