# Introduction

This RNA-seq pipeline uses HISAT2 for alignment and htseq-count for generating raw counts. The rest of the analysis is currently done using iDEP.96.

## Initial setup
The tutorial being followed uses the AWS cloud, but I am writing this pipeline to be run locally on a Linux-based machine. More specifically, I am using Manjaro with kernel version 6.1.11-1-MANJARO (64-bit).

[Configure your environment](https://rnabio.org/module-00-setup/0000/09/01/Environment/) and then [install all necessary software](https://rnabio.org/module-00-setup/0000/10/01/Installation/) according to the rnabio website. The specific steps you follow while doing this will depend on your system and how you want to install the tools. I would suggest looking for maintained packages for your specific distribution before trying to install from source.

Specific version information of software installed on my setup:
- SAMtools v1.16.1, installed from AUR
    - run with ```samtools``` from CL
- bam-readcount v1.0.1, compiled from source
    - run from ```~/bioinformatics/bam-readcount/build/bin``` with ```./bam-readcount```
- HISAT2 v2.2.1-1, installed from AUR
    - run with ```hisat2``` from CL
- stringtie v2.2.1, downloaded pre-compiled binary
    - run from ```~/bioinformatics/stringtie-2.2.1.Linux_x86_64``` with ```./stringtie```
- gffcompare v0.12.6-2, installed from AUR
    - run with ```gffcompare``` from CL
- htseq-count v0.11.3-1, installed from AUR and also in bioinformatics conda env
    - run with ```htseq-count``` from terminal with conda bioinformatics env active
- tophat v2.1.1, downloaded pre-compiled binary
    - run from ```~/bioinformatics/tophat-2.1.1.Linux_x86_64``` with ```./gtf_to_fasta```
- kallisto v0.48.0-1, installed from AUR
    - run with ```kallisto``` from CL
- fastqc v0.11.9-1, installed from AUR
    - run with ```fastqc``` from CL
- multiqc v1.14, installed with pip3 in bioinformatics conda env
    - run with ```multiqc --help``` from CL
- picard v2.26.4, downloaded as JAR file
    - run from ```~/bioinformatics/``` with ```java -jar picard.jar```
- flexbar v3.5.0, downloaded pre-compiled binary
    - run from ```~/bioinformatics/flexbar-3.5.0-linux``` with ```./flexbar```
- regtools v1.0.0, installed from AUR (git source)
    - run with ```regtools``` from CL
- rseqc v5.0.1, installed with pip3 in bioinformatics conda env
    - import in python with ```rseqc``` or run with ```read_GC.py``` in CL?
- bedops v2.4.40, downloaded pre-compiled binary
    - run from ```~/bioinformatics/bedops_linux_x86_64-v2.4.40/bin``` with ```./bedops``` and ```./gff2bed```
- gtfToGenePred v unknown, downloaded pre-compiled binary
    - run from ```~/bioinformatics/gtfToGenePred``` with ```./gtfToGenePred```
- genePredToBed v unknown, downloaded pre-compiled binary
    - run from ```~/bioinformatics/genePredToBed``` with ```./genePredToBed```
- how_are_we_stranded_here, v unknown, installed with pip3 in bioinformatics conda env
    - run with ```check_strandedness``` in CL
- cell ranger v7.1.0, have to register to download, pre-compiled binary
    - run from ```~/bioinformatics/cellranger-7.1.0``` with ```./cellranger```
- tabix: the website only offers an apt installation, which is not relevant on an Arch-based distro. tabix appears to be included with HTSlib, which is already installed as a dependency of samtools and bcftools
- BWA v0.7.17-r1188, compiled from source
    - run from ```~/bioinformatics/bwa``` with ```./bwa``` or ```bwa```
- BCFtools v1.16, installed from AUR
    - run with ```bcftools``` from CL
- peddy v0.4.8, cloned and installed with pip in bioinformatics conda env
    - run with ```peddy``` from CL in conda env
- slivar v0.2.7, pre-compiled binary
    - run from ```~/bioinformatics``` with ```./slivar```
- STRling v0.5.1, pre-compiled binary
    - run from ```~/bioinformatics``` with ```./strling```
- freebayes v1.3.6, pre-compiled binary
    - run from ```~/bioinformatics``` with ```./freebayes-1.3.6-linux-amd64-static```

### R Libraries needed:
- devtools
- dplyr
- gplots
- ggplot2
- UpSetR

install with 
```install.packages(c("devtools","dplyr","gplots","ggplot2","UpSetR"),repos="http://cran.us.r-project.org")```

### Bioconductor libraries needed:
- [genefilter](http://bioconductor.org/packages/release/bioc/html/genefilter.html)
- [ballgown](http://bioconductor.org/packages/release/bioc/html/ballgown.html)
- [edgeR](http://www.bioconductor.org/packages/release/bioc/html/edgeR.html)
- [GenomicRanges](http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html)
- [rhdf5](https://www.bioconductor.org/packages/release/bioc/html/rhdf5.html)
- [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html)
- [sva](https://www.bioconductor.org/packages/release/bioc/html/sva.html)
- [GAGE](https://bioconductor.org/packages/release/bioc/html/gage.html)

### Sleuth
[Source](https://pachterlab.github.io/sleuth/download)

devtools needs to be installed first if it is not already: ```install.packages("devtools")```

Install with ```devtools::install_github("pachterlab/sleuth")```


***
# General notes on RNA-seq workflow

Due to inconsistencies in the naming conventions for chromosomes, you should obtain your reference genome and annotation packages from the same source (Ensembl, NCBI, UCSC). Additionally, your annotations must correspond to the same reference genome version (e.g. both correspond to NCBI human GRCh38).

[This guide](https://pmbio.org/module-02-inputs/0002/02/01/Reference_Genome/) explains reference genomes and how to obtain them under the "Reference Genome Options" section at the bottom.



***
# Obtain reference genome and known gene/transcript annotations

### Reference genome

The Mus musculus GRCm38 reference genome in FASTA format was downloaded from [Ensembl](http://nov2020.archive.ensembl.org/Mus_musculus/Info/Index#) via the command ```wget ftp://ftp.ensembl.org:21/pub/release-102/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz```. This is the reference genome that was stated to be used in the paper.

Location on disk: ```~/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.fa```

The Mus musculus GRCm39 reference genome in FASTA format was also downloaded with ```wget https://ftp.ensembl.org/pub/release-109/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz```, but it was not used in this analysis.

Location: ```~/bioinformatics/reference_genomes/GRCm39/Mus_musculus.GRCm39.dna.primary_assembly.fa```

### Gene annotations in .gtf format

Gene annotations in .gtf format were downloaded from Ensembl as well via the command ```wget ftp://ftp.ensembl.org:21/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz```. There are other versions of the .gtf files so I will see if those are needed later on.

Location: ```~/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf```

I also downloaded the gene annotations in .gtf format from Ensembl for GRCm39 via the command ```wget ftp://ftp.ensembl.org/pub/release-109/gtf/mus_musculus/Mus_musculus.GRCm39.109.gtf.gz```

You can preview the file with ```head -n 50 ~/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf | column -t | less -p exon -S```

Count the number of gene records with ```grep -w gene /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf | wc -l```

You can view the structure of one gene (its transcripts and exons) with ```grep ENSMUSG00000102693 /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf | less -p "exon\s" -S```
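Note that ```grep -w gene``` matches the word anywhere on the line, not only in the feature column. As a stricter alternative, the records can be counted per feature type from column 3. This is a minimal sketch; the function name ```count_gtf_features``` is my own and not part of any tool.

```shell
# Count GTF records per feature type (column 3), skipping comment lines.
# Hypothetical helper; expects an uncompressed, tab-separated GTF file.
count_gtf_features() {
    awk -F '\t' '!/^#/ { n[$3]++ } END { for (t in n) printf "%s\t%d\n", t, n[t] }' "$1" | sort
}
```

Run it as ```count_gtf_features Mus_musculus.GRCm38.102.gtf```; the ```gene``` row gives the gene count.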

**Important**: chromosome names in the .gtf file must be the same as those in the reference genome .fa file. If StringTie returns expression values that are all 0, this is the likely cause. Download the annotation and reference genome files from the same source.

**Make sure your reference genome and annotations are for the same build of the genome (e.g. both for GRCh38).**
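A naming mismatch can be caught before running anything downstream by comparing the sequence names in the FASTA headers with the first column of the GTF. The sketch below assumes uncompressed files; ```check_seqnames``` is a hypothetical helper name.

```shell
# Print sequence names that appear in the GTF but not in the FASTA.
# Hypothetical helper; expects uncompressed .fa and .gtf files.
check_seqnames() {
    fasta=$1
    gtf=$2
    tmpdir=$(mktemp -d)
    # FASTA: first word of every header line, without the leading '>'.
    grep '^>' "$fasta" | sed 's/^>//' | cut -d' ' -f1 | sort -u > "$tmpdir/fa.txt"
    # GTF: column 1 of every non-comment line.
    grep -v '^#' "$gtf" | cut -f1 | sort -u > "$tmpdir/gtf.txt"
    # comm -13 keeps lines that are unique to the second file (GTF-only names).
    comm -13 "$tmpdir/fa.txt" "$tmpdir/gtf.txt"
    rm -r "$tmpdir"
}
```

If ```check_seqnames genome.fa annotation.gtf``` prints nothing, every chromosome name used in the GTF also exists in the FASTA.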

***
# Creating index with HISAT2

**WARNING** This step will take **A LOT** of RAM, well over 100 GB, so make sure the system has enough.

It is much easier and faster, if possible, to download a pre-built index provided by the HISAT2 devs at [their website](https://daehwankimlab.github.io/hisat2/download/).

I downloaded the pre-built GRCm38 index and renamed the files with ```rename genome Mus_musculus.GRCm38.dna.primary_assembly *.ht2```.

To build the index yourself instead, follow the steps below. The splice site and exon extraction was done directly with the Python scripts from the [HISAT2 github repo](https://github.com/DaehwanKimLab/hisat2); I am not sure whether it can be done with the installed program listed above. Just clone the repo and use the necessary files.

Switch to hisat cloned repo ```cd /home/user/bioinformatics/hisat2```

Run the following commands.

```./hisat2_extract_splice_sites.py ~/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf > ~/bioinformatics/BP_PRJNA496042/splicesites.tsv```

```./hisat2_extract_exons.py ~/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf > ~/bioinformatics/BP_PRJNA496042/exons.tsv```

```hisat2-build -p 24 --ss ~/bioinformatics/BP_PRJNA496042/splicesites.tsv --exon ~/bioinformatics/BP_PRJNA496042/exons.tsv ~/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.fa ~/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly```

The output should be a collection of binary files with names like Mus_musculus.GRCm38.dna.primary_assembly.x.ht2 or .rf, the x being a number.
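Whether the index was built or downloaded, it is worth checking that the full set of files is present before aligning. The sketch below assumes a standard (small) HISAT2 index made of eight files, ```.1.ht2``` to ```.8.ht2```; both that count and the function name ```check_hisat2_index``` are assumptions on my part.

```shell
# Check that all eight index files exist and are non-empty for an index prefix.
# Assumes a standard small index (.1.ht2 ... .8.ht2); hypothetical helper.
check_hisat2_index() {
    prefix=$1
    missing=0
    for i in 1 2 3 4 5 6 7 8; do
        if [ ! -s "$prefix.$i.ht2" ]; then
            echo "missing: $prefix.$i.ht2"
            missing=$((missing + 1))
        fi
    done
    # Exit status is 0 only when nothing is missing.
    [ "$missing" -eq 0 ]
}
```

The prefix is the same value that is later passed to ```hisat2 -x```.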

***
# Look at some of the actual RNA-seq data

Look at the head of the files.

```cd ~/bioinformatics/PyMT_Paper_2023```

```zcat Pistilli_P11B_FC1_S1_L001_R1_001.fastq.gz | head -n 100```

A cursory look shows us that the average read length appears to be around 50 bp. Since these are paired-end reads, that makes sense. For the quality lines, most characters seem to be ```C```, which in the Phred+33 encoding corresponds to a quality score of 34 (ASCII 67 minus 33), or roughly a 0.04% chance that the base was incorrectly called.
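The quality characters can be decoded directly in the shell. This small sketch assumes the Phred+33 encoding used by current Illumina instruments; ```phred_of``` is a hypothetical helper name.

```shell
# Convert one FASTQ quality character to its Phred score and error probability,
# assuming the Phred+33 encoding.
phred_of() {
    # printf with a leading quote yields the numeric code of the character.
    ascii=$(printf '%d' "'$1")
    q=$((ascii - 33))
    # The error probability is 10^(-Q/10).
    awk -v q="$q" 'BEGIN { printf "Q%d %.5f\n", q, 10^(-q/10) }'
}
```

For example, ```phred_of C``` prints ```Q34 0.00040```.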

Determine how many reads are in a file.

```zcat Pistilli_P11B_FC1_S1_L001_R1_001.fastq.gz | grep -P "^\@VH00" | wc -l```

Note, you may have to change the grep command to match the instrument ID.

There are 22,898,344 reads in this file.
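An alternative that does not depend on the instrument ID is to divide the line count by four, since each FASTQ record is four lines long. It also avoids over-counting when a quality line happens to start with ```@```. This is a sketch that assumes unwrapped FASTQ (exactly four lines per read); ```count_reads``` is a hypothetical name.

```shell
# Count reads in a gzipped FASTQ file, assuming four lines per record.
count_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $((lines / 4))
}
```

Usage: ```count_reads Pistilli_P11B_FC1_S1_L001_R1_001.fastq.gz```.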

There is a bash script in the GitHub repo under the PyMT_Paper_2023 folder titled ```total_PE_read_counts.sh``` that counts the reads in all FASTQ files and writes the counts to a text file. Simply define the directory containing the FASTQ files and the output directory for the text file. If the paired reads use a naming scheme other than ```R1``` and ```R2```, change those values in the script. You may also need to change the grep search value from ```VH00``` to whatever appears in your FASTQ files.

### After looking at the total read counts there are some potential issues

Three files returned an 'unexpected end of file' error from gzip and contained far fewer reads than the rest of the files (1-5 million vs 20+ million):

```Pistilli_P13C_FC4_S4_L001_R2_001.fastq.gz```

```Pistilli_P188-2_MC5_S15_L001_R2_001.fastq.gz```

```Pistilli_P188-4_MT4_S19_L001_R2_001.fastq.gz```

I will need to figure out what to do about these. First I will check the source files to make sure they were downloaded correctly.

After re-downloading the files, it seems they simply had not been copied correctly. The full-size files have been added to the directory. The results from the bash script are located at ```/home/user/bioinformatics/PyMT_Paper_2023/total_read_counts_2023-02-21_19-15-35.txt```.
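Truncated copies like these can be detected up front with ```gzip -t```, which tests the integrity of a compressed file without writing any output. A minimal sketch follows; ```check_gz``` is a hypothetical helper name.

```shell
# Report for each gzipped file whether it passes gzip's integrity test.
check_gz() {
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            echo "OK $f"
        else
            echo "CORRUPT $f"
        fi
    done
}
```

Running ```check_gz *.fastq.gz``` in the FASTQ directory lists any file that needs to be copied again.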

The script output confirms that the R1 and R2 files of every pair contain the same number of reads.

***
# Check strandedness

This step is not required but due to differences in usage of RNA-seq library construction kits, it is **highly** recommended to empirically verify strandedness using the below method.

The check_strandedness tool should already have been installed during the initial setup. It can also be installed as a [Docker image](https://hub.docker.com/r/mgibio/checkstrandedness).

The intermediate files created below are stored in the GRCm38 folder for now.

```/home/user/bioinformatics/reference_genomes/GRCm38```

Convert gtf to genepred

```cd /home/user/bioinformatics/gtfToGenePred```

```./gtfToGenePred /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.genePred```

Convert genePred to bed12

```cd /home/user/bioinformatics/genePredToBed```

```./genePredToBed /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.genePred /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.bed12```

Use bedtools to extract the transcript sequences from the reference genome as FASTA, using the bed12 file

```bedtools getfasta -fi /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.fa -bed /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.bed12 -s -split -name -fo /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.transcripts.fa```

Clean up the header lines of the transcript file

```cat /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.transcripts.fa | perl -ne 'if($_ =~/^\>\S+\:\:(ERCC\-\d+)\:.*/){print ">$1\n"}elsif ($_ =~/^\>(\S+)\:\:.*/){print ">$1\n"}else{print $_}' > /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.transcripts.clean.fa```

View the clean file

```less /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.transcripts.clean.fa```

Add in transcript_id field to .gtf file

```awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf > /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102_tidy.gtf```

Check strandedness

**Everything up to here should already be done.**

```conda activate bioinformatics```

This command saves some files to the directory it is run from, so it is best to change to the directory containing the fastq files first.

```cd /home/user/bioinformatics/PyMT_Paper_2023```

```check_strandedness --gtf /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102_tidy.gtf --transcripts /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.transcripts.clean.fa --reads_1 /home/user/bioinformatics/PyMT_Paper_2023/Pistilli_P11B_FC1_S1_L001_R1_001.fastq.gz --reads_2 /home/user/bioinformatics/PyMT_Paper_2023/Pistilli_P11B_FC1_S1_L001_R2_001.fastq.gz```

OUTPUT:

Total 200000 usable reads were sampled

This is PairEnd Data

Fraction of reads failed to determine: 0.1266

Fraction of reads explained by "1++,1--,2+-,2-+": 0.0249 (2.9% of explainable reads)

Fraction of reads explained by "1+-,1-+,2++,2--": 0.8485 (97.1% of explainable reads)

Over 90% of reads explained by "1+-,1-+,2++,2--"

Data is likely RF/fr-firststrand

More info on strandedness can be found [here](https://rnabio.org/module-09-appendix/0009/12/01/StrandSettings/)

Based on the strandedness, this sequencing was likely done using the dUTP method.
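The percentages in the output above are each fraction divided by the sum of the two explainable fractions. The sketch below reproduces that arithmetic and the call; the 0.9 cut-off mirrors the "Over 90%" line in the output and should be treated as an assumption, as should the function name ```classify_strand```.

```shell
# Classify strandedness from the two fractions reported by check_strandedness.
# $1 = fraction for "1++,1--,2+-,2-+" (FR), $2 = fraction for "1+-,1-+,2++,2--" (RF).
# The 0.9 cut-off is an assumption based on the "Over 90%" line in the output.
classify_strand() {
    awk -v fr="$1" -v rf="$2" 'BEGIN {
        total = fr + rf
        if (total == 0) { print "no explainable reads"; exit }
        fr_share = fr / total
        rf_share = rf / total
        if (rf_share > 0.9)      label = "RF/fr-firststrand"
        else if (fr_share > 0.9) label = "FR/fr-secondstrand"
        else                     label = "undetermined or unstranded"
        printf "FR %.1f%% RF %.1f%% -> %s\n", fr_share * 100, rf_share * 100, label
    }'
}
```

With the values above, ```classify_strand 0.0249 0.8485``` prints ```FR 2.9% RF 97.1% -> RF/fr-firststrand```. For an RF/fr-firststrand library, the strand settings guide linked above lists ```--rna-strandness RF``` for HISAT2 and ```--stranded reverse``` for htseq-count.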

***
# Pre-alignment QC

To get a sense of the raw read quality in your fastq files, run the following programs.

### fastqc [docs](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/)

```cd /home/user/bioinformatics/BP_PRJNA496042```

```fastqc -t 24 *.fastq.gz``` in the same directory as your reads.

It will output an HTML file for each FASTQ file in the same directory, containing a full report of the quality measures.

For sequences listed in the "over-represented sequences" section, try [BLASTing](https://blast.ncbi.nlm.nih.gov/Blast.cgi) them to determine a source organism.

### multiqc

Run multiqc to combine all the fastqc reports.

```cd /home/user/bioinformatics/BP_PRJNA496042```

Run ```multiqc .``` in the same directory as the reports.

I renamed the report to ```pre-alignment_multiqc_report.html```

Make a new directory for the fastqc .html outputs and move them

```mkdir fastqc && mv *_fastqc* fastqc```

## Bash script

I made a bash script to handle this step, ```pre-alignment_QC.sh```. Only the fastq_dir variable should need to be changed; it should point to the directory containing the FASTQ files. The script does not yet account for any sub-folders.

## View reports

Looking at the MultiQC report, none of the warnings are concerning or unusual. 

***
# Remove adapters and trim

Make a new folder for the trimmed reads

```cd /home/user/bioinformatics/PyMT_Paper_2023```

```mkdir trimmed && cd trimmed```

Download the Illumina adapter sequence files, or the relevant files for your sequencing machine.

```wget http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa```

Downloaded to ```/home/user/bioinformatics/reference_genomes/GRCm38/illumina_multiplex.fa```

Use flexbar to remove adapter sequences and trim first 13 bases of each read. Adjust the number of threads to something manageable for your system. It is set to 24 threads but may take much longer with a slower CPU.

Possible options for flexbar and their meaning:

   - ```--adapter-min-overlap 7``` requires a minimum of 7 bases to match the adapter
   - ```--adapter-trim-end RIGHT``` uses a trimming strategy to remove the adapter from the 3 prime or RIGHT end of the read
   - ```--max-uncalled 300``` allows as many as 300 uncalled or N bases (MiSeq read lengths can be 300bp)
   - ```--min-read-length 25``` the minimum read length allowed after trimming is 25bp
   - ```--threads 24``` use 24 threads
   - ```--zip-output GZ``` the input FASTQ files are gzipped so we will output gzipped FASTQ to save space
   - ```--adapters``` define the path to the adapter FASTA file to trim
   - ```--reads``` define the path to the read 1 FASTQ file of reads
   - ```--reads2``` define the path to the read 2 FASTQ file of reads
   - ```--target``` a base path for the output files. The value will have _1.fastq.gz and _2.fastq.gz appended for read 1 and read 2 respectively
   - ```--pre-trim-left``` trim a fixed number of bases at left read end. For example, to trim 5 bases at the left side of reads: ```--pre-trim-left 5```
   - ```--pre-trim-right``` trim a fixed number of bases at right read end. For example, to trim 5 bases at the right side of reads: ```--pre-trim-right 5```
   - ```--pre-trim-phred``` trim based on phred quality value to deal with higher error rates towards the end of reads. For example, to trim the 3 prime end until quality offset value 30 or higher is reached, specify: ```--pre-trim-phred 30```

The commands to run for flexbar for this dataset are as follows:

```cd /home/user/bioinformatics/flexbar-3.5.0-linux```

You will need to export the library path environment variable for this session, or add it to your shell startup scripts.

```export LD_LIBRARY_PATH=/home/user/bioinformatics/flexbar-3.5.0-linux:$LD_LIBRARY_PATH```

```/home/user/bioinformatics/flexbar-3.5.0-linux/flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters /home/user/bioinformatics/reference_genomes/GRCm38/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 24 --zip-output GZ --reads /home/user/bioinformatics/PyMT_Paper_2023/Pistilli_P11B_FC1_S1_L001_001.read1.fastq.gz --reads2 /home/user/bioinformatics/PyMT_Paper_2023/Pistilli_P11B_FC1_S1_L001_001.read2.fastq.gz --target /home/user/bioinformatics/PyMT_Paper_2023/trimmed/Pistilli_P11B_FC1_S1_L001_001```

The rest of the commands were generated using the script ```remove_adapters_and_trim.sh```, but they all follow this same format.
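For reference, the per-sample commands can be generated with a short loop. This is only a sketch and not the contents of ```remove_adapters_and_trim.sh```; it assumes the pairs are named ```<sample>.read1.fastq.gz``` and ```<sample>.read2.fastq.gz``` as in the command above, that paths contain no spaces, and that ```flexbar``` is on the PATH. The function name ```flexbar_cmds``` is hypothetical.

```shell
# Print (not run) one flexbar command per read pair found in a directory.
# Assumes <sample>.read1.fastq.gz / <sample>.read2.fastq.gz naming.
flexbar_cmds() {
    in_dir=$1
    out_dir=$2
    adapters=$3
    for r1 in "$in_dir"/*.read1.fastq.gz; do
        [ -e "$r1" ] || continue    # no matches: skip the literal pattern
        sample=$(basename "$r1" .read1.fastq.gz)
        r2="$in_dir/$sample.read2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "# skipping $sample: no read 2 file" >&2
            continue
        fi
        echo "flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT" \
             "--adapters $adapters --pre-trim-left 13 --max-uncalled 300" \
             "--min-read-length 25 --threads 24 --zip-output GZ" \
             "--reads $r1 --reads2 $r2 --target $out_dir/$sample"
    done
}
```

Review the printed commands first, then pipe them to ```sh``` to run them.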

Compare the impact of trimming with the original QC reports by running fastqc and multiqc on the trimmed reads

In the directory of the trimmed reads ```cd /home/user/bioinformatics/PyMT_Paper_2023/trimmed```, run:

```fastqc *.fastq.gz```

```multiqc .```

Same as before, make a new directory for the fastqc .html outputs and move them

```mkdir fastqc && mv *_fastqc* fastqc```

A script for the QC was also written, ```post-trim_QC.sh```

Looking at the MultiQC report after trimming, the overall quality looks better, with fewer failures and warnings. What remains is mostly sequence duplication, which is expected.


***
# Alignment with HISAT2

For more details on HISAT2, refer to the [manual on Github](http://daehwankimlab.github.io/hisat2/manual/)

The outputs after running HISAT2 will be a SAM/BAM file for each data set.

Options for HISAT2 include:


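   - ```-p 24``` tells HISAT2 to use 24 threads for alignment.
   - ```--rna-strandness``` specifies the strandedness of the RNA-seq library (```RF``` or ```FR``` for paired-end reads). The flag is left out of the command below, so HISAT2 falls back to its default of unstranded. Note that the check_strandedness output above reports RF/fr-firststrand for this dataset, which would correspond to ```--rna-strandness RF```.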
   - ```--rg-id $ID``` specifies a read group ID that is a unique identifier.
   - ```--rg SM:$SAMPLE_NAME``` specifies a read group sample name. This together with rg-id will allow you to determine which reads came from which sample in the merged bam later on.
   - ```--rg LB:$LIBRARY_NAME``` specifies a read group library name. This together with rg-id will allow you to determine which reads came from which library in the merged bam later on.
   - ```--rg PL:ILLUMINA``` specifies a read group sequencing platform.
   - ```--rg PU:$PLATFORM_UNIT``` specifies a read group sequencing platform unit. Typically this consists of FLOWCELL-BARCODE.LANE
   - ```--dta``` Reports alignments tailored for transcript assemblers.
   - ```-x /path/to/hisat2/index``` The HISAT2 index filename prefix (minus the trailing .X.ht2) built earlier including splice sites and exons.
   - ```-1 /path/to/read1.fastq.gz``` The read 1 FASTQ file, optionally gzip(.gz) or bzip2(.bz2) compressed.
   - ```-2 /path/to/read2.fastq.gz``` The read 2 FASTQ file, optionally gzip(.gz) or bzip2(.bz2) compressed.
   - ```-U /path/to/unpaired.fastq.gz``` If data is unpaired (single-end), provide the path to the FASTQ or .gz version here.
   - ```-S /path/to/output.sam``` The output SAM format text file of alignments.
   
Create an alignment folder for the outputs.

```cd /home/user/bioinformatics/PyMT_Paper_2023 && mkdir -p alignments && cd alignments```

Run a command for each of the replicates. I will try running based off the trimmed fastq files and see how it goes.

I wrote a script to handle this, ```HISAT2_alignment.sh```

```hisat2 -p 24 --rg-id Pistilli_P11B_FC1_S1_L001_001 --rg SM:Pistilli_P11B_FC1_S1_L001_001 --rg LB:Pistilli_P11B_FC1_S1_L001_001_Lib --rg PL:ILLUMINA -x /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly --dta -1 /home/user/bioinformatics/PyMT_Paper_2023/trimmed/Pistilli_P11B_FC1_S1_L001_001_1.fastq.gz -2 /home/user/bioinformatics/PyMT_Paper_2023/trimmed/Pistilli_P11B_FC1_S1_L001_001_2.fastq.gz -S /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P11B_FC1_S1_L001_001.sam```
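The same command can be produced for every sample by deriving the sample name from the trimmed file names. This is a sketch rather than the contents of ```HISAT2_alignment.sh```; it assumes flexbar output named ```<sample>_1.fastq.gz``` and ```<sample>_2.fastq.gz```, paths without spaces, and reuses the sample name for the read group fields as in the command above. ```hisat2_cmds``` is a hypothetical name.

```shell
# Print (not run) one HISAT2 command per trimmed read pair.
# Assumes <sample>_1.fastq.gz / <sample>_2.fastq.gz naming in the trimmed folder.
hisat2_cmds() {
    trimmed_dir=$1
    index_prefix=$2
    out_dir=$3
    for r1 in "$trimmed_dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue
        sample=$(basename "$r1" _1.fastq.gz)
        r2="$trimmed_dir/${sample}_2.fastq.gz"
        [ -e "$r2" ] || continue    # skip samples without a read 2 file
        echo "hisat2 -p 24 --rg-id $sample --rg SM:$sample --rg LB:${sample}_Lib" \
             "--rg PL:ILLUMINA -x $index_prefix --dta" \
             "-1 $r1 -2 $r2 -S $out_dir/$sample.sam"
    done
}
```

As with the other generated commands, inspect the output and then pipe it to ```sh```.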



### Convert SAM to BAM files

I also wrote a script for this, ```Convert_SAM_to_BAM.sh```

```-@ int``` is the number of threads to use.

```cd /home/user/bioinformatics/PyMT_Paper_2023/alignments```

```samtools sort -@ 24 -o Pistilli_P11B_FC1_S1_L001_001.bam Pistilli_P11B_FC1_S1_L001_001.sam```

And so on for the remaining samples.

### Merge all BAM files
Picard MergeSamFiles is used here, but other tools such as "samtools merge" and "bamtools merge" can also achieve this. Create one merged BAM file for each group (FC, FT, MC, MT).

```java -Xmx2g -jar /home/user/bioinformatics/picard.jar MergeSamFiles -OUTPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/FC.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P11B_FC1_S1_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P11D_FC2_S2_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P13B_FC3_S3_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P13C_FC4_S4_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P13E_FC5_S5_L001_001.bam```

```java -Xmx2g -jar /home/user/bioinformatics/picard.jar MergeSamFiles -OUTPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/FT.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P12BL_FT1_S6_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P12CRR_FT2_S7_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P13A_FT3_S8_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P14E_FT4_S9_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P13F_FT5_S10_L001_001.bam```

```java -Xmx2g -jar /home/user/bioinformatics/picard.jar MergeSamFiles -OUTPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/MC.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P186-5_MC1_S11_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P186-7_MC2_S12_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P189-1_MC3_S13_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P187-1_MC4_S14_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P188-2_MC5_S15_L001_001.bam```

```java -Xmx2g -jar /home/user/bioinformatics/picard.jar MergeSamFiles -OUTPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/MT.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P187-3_MT1_S16_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P186-6_MT2_S17_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P188-1_MT3_S18_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P188-4_MT4_S19_L001_001.bam -INPUT /home/user/bioinformatics/PyMT_Paper_2023/alignments/Pistilli_P188-5_MT5_S20_L001_001.bam```
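Since the group label is part of every file name, the four merge commands can also be generated instead of typed out. The sketch below assumes the group label is followed by a replicate number and surrounded by underscores (```Pistilli_P11B_FC1_...``` belongs to ```FC```), as in the file names above, and that paths contain no spaces. The function name ```merge_cmd``` is hypothetical, and the inputs are listed in alphabetical order rather than replicate order.

```shell
# Print a Picard MergeSamFiles command for one group (e.g. FC) from the
# per-sample BAM files in a directory. Hypothetical helper.
merge_cmd() {
    bam_dir=$1
    group=$2
    picard_jar=$3
    inputs=""
    # Match e.g. *_FC1_*.bam but not the merged FC.bam itself.
    for bam in "$bam_dir"/*_"$group"[0-9]*_*.bam; do
        [ -e "$bam" ] || continue
        inputs="$inputs -INPUT $bam"
    done
    [ -n "$inputs" ] || { echo "no BAM files for group $group" >&2; return 1; }
    echo "java -Xmx2g -jar $picard_jar MergeSamFiles -OUTPUT $bam_dir/$group.bam$inputs"
}
```

For example, ```for g in FC FT MC MT; do merge_cmd alignments "$g" picard.jar; done``` prints all four commands.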

You can verify that all bam files were created successfully. There should be 24: 20 per-sample files plus the 4 merged group files.

In the same directory as the aligned .bam files, ```ls -l *.bam | wc -l```

***
# Using the Integrative Genomics Viewer (IGV)

```cd /home/user/bioinformatics/IGV_Linux_2.15.4```

IGV can be run from the download directory using ```./igv.sh```

To re-visit the shortish tutorial on navigating IGV, go [here](https://rnabio.org/module-02-alignment/0002/04/01/IGV/). You can follow the tutorial there using the provided files.

For our dataset, we need to index the BAM files before loading them into IGV. This is done with samtools.

The following command indexes all .bam files in the directory, which is easier than running one command per file.

```cd /home/user/bioinformatics/PyMT_Paper_2023/alignments```

```find *.bam -exec echo samtools index {} \; | sh```

In this case, since I don't want to index all the individual sample files, I will only index the merged files.

```find FC.bam FT.bam MC.bam MT.bam -exec echo samtools index {} \; | sh```

Load the merged .bam files into IGV and look around. Each .bam.bai index needs to be in the same directory as its .bam file.

***
# Alignment QC

Use samtools to view the format of the alignment bam files.

```samtools view /home/user/bioinformatics/PyMT_Paper_2023/alignments/FC.bam | head | column -t | less -S```

```samtools view /home/user/bioinformatics/PyMT_Paper_2023/alignments/FT.bam | head | column -t | less -S```

```samtools view /home/user/bioinformatics/PyMT_Paper_2023/alignments/MC.bam | head | column -t | less -S```

```samtools view /home/user/bioinformatics/PyMT_Paper_2023/alignments/MT.bam | head | column -t | less -S```

You can add additional filtering flags to the samtools command. [This useful guide explains all of them](https://broadinstitute.github.io/picard/explain-flags.html). ```-f``` keeps only alignments that have all of the given flag bits set, and ```-F``` excludes alignments that have any of the given flag bits set.

Look for alignments that are flagged as 'PCR or optical duplicates'.

```samtools view -f 1024 /home/user/bioinformatics/PyMT_Paper_2023/alignments/FC.bam | head```

Output: nothing, which is expected because no duplicate-marking step has been run on these files.
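The flag value is a sum of bits, which is why 1024 selects duplicates. A value can be decoded in the shell with bitwise AND, as in this sketch. The bit values follow the SAM specification; the short labels and the function name ```decode_flag``` are my own.

```shell
# Decode a SAM flag value into the names of the bits that are set.
# Bit values are from the SAM specification; the labels are informal.
decode_flag() {
    flag=$1
    for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                16:reverse 32:mate_reverse 64:read1 128:read2 \
                256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
        bit=${pair%%:*}
        name=${pair#*:}
        # Bitwise AND is non-zero when this bit is set in the flag.
        if [ $((flag & bit)) -ne 0 ]; then
            echo "$name"
        fi
    done
}
```

```decode_flag 1024``` prints ```duplicate```, and ```decode_flag 99``` prints ```paired```, ```proper_pair```, ```mate_reverse``` and ```read1```, one per line.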

The command ```samtools flagstat``` can provide a summary of the alignments.

```mkdir -p /home/user/bioinformatics/PyMT_Paper_2023/alignments/flagstat```

```cd /home/user/bioinformatics/PyMT_Paper_2023/alignments```

```samtools flagstat /home/user/bioinformatics/BP_PRJNA496042/alignments/Control_1.bam > /home/user/bioinformatics/BP_PRJNA496042/alignments/flagstat/Control_1.bam.flagstat```

```samtools flagstat /home/user/bioinformatics/BP_PRJNA496042/alignments/Control_2.bam > /home/user/bioinformatics/BP_PRJNA496042/alignments/flagstat/Control_2.bam.flagstat```

```samtools flagstat /home/user/bioinformatics/BP_PRJNA496042/alignments/Control_3.bam > /home/user/bioinformatics/BP_PRJNA496042/alignments/flagstat/Control_3.bam.flagstat```

```samtools flagstat /home/user/bioinformatics/BP_PRJNA496042/alignments/Control_4.bam > /home/user/bioinformatics/BP_PRJNA496042/alignments/flagstat/Control_4.bam.flagstat```

```samtools flagstat /home/user/bioinformatics/BP_PRJNA496042/alignments/PDX_1.bam > /home/user/bioinformatics/BP_PRJNA496042/alignments/flagstat/PDX_1.bam.flagstat```

```samtools flagstat /home/user/bioinformatics/BP_PRJNA496042/alignments/PDX_2.bam > /home/user/bioinformatics/BP_PRJNA496042/alignments/flagstat/PDX_2.bam.flagstat```

```samtools flagstat /home/user/bioinformatics/BP_PRJNA496042/alignments/PDX_3.bam > /home/user/bioinformatics/BP_PRJNA496042/alignments/flagstat/PDX_3.bam.flagstat```

```samtools flagstat /home/user/bioinformatics/BP_PRJNA496042/alignments/PDX_4.bam > /home/user/bioinformatics/BP_PRJNA496042/alignments/flagstat/PDX_4.bam.flagstat```

You can also run an expression to do it in one command.

```find *_*.bam -exec echo samtools flagstat {} \> flagstat/{}.flagstat \; | sh```

### FastQC on bam file

From your alignment directory (where the bam files are).

```mkdir /home/user/bioinformatics/BP_PRJNA496042/alignments/fastqc```

```cd /home/user/bioinformatics/BP_PRJNA496042/alignments```

```fastqc Control_*.bam PDX_*.bam -t 24```

Clean up the current directory and move the reports into the fastqc folder.

```mv *fastqc.html fastqc/ && mv *fastqc.zip fastqc/```

### Picard

Using Picard we can generate metrics and figures that express the quality of our RNA-seq data.

We first need to create input files for Picard based on the reference genome so it can properly run the function CollectRnaSeqMetrics.

```java -jar /home/user/bioinformatics/picard.jar CreateSequenceDictionary -R /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.fa -O /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.dict```

Create a bed file of the location of ribosomal sequences in our reference (first extract from the gtf then convert to bed). Note that here we pull all the "rrna" transcripts from the GTF.

```grep --color=none -i -P "rrna" /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf > /home/user/bioinformatics/reference_genomes/GRCm38/ref_ribosome.gtf```

Note: ```gff2bed``` is part of BEDOPS. Run the script directly from the bedops/bin folder. You may need to add the /bin folder to your path variable ```export PATH="$PATH:/home/user/bioinformatics/bedops_linux_x86_64-v2.4.40/bin"```, otherwise convert2bed may not be found.

```cd /home/user/bioinformatics/bedops_linux_x86_64-v2.4.40/bin```

```gff2bed < /home/user/bioinformatics/reference_genomes/GRCm38/ref_ribosome.gtf > /home/user/bioinformatics/reference_genomes/GRCm38/ref_ribosome.bed```

Create interval list file to identify the location of the ribosomal sequences in the reference

```java -jar /home/user/bioinformatics/picard.jar BedToIntervalList -I /home/user/bioinformatics/reference_genomes/GRCm38/ref_ribosome.bed -O /home/user/bioinformatics/reference_genomes/GRCm38/ref_ribosome.interval_list -SD /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.dict```

Create genePred file for the reference transcriptome

```cd /home/user/bioinformatics/gtfToGenePred```

```./gtfToGenePred -genePredExt /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.ref_flat.txt```

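Reformat the genePred file into the refFlat layout that Picard expects: the gene name (column 12) moves to the front, followed by the first ten genePred columns.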

```cat /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.ref_flat.txt | awk '{print $12"\t"$0}' | cut -d$'\t' -f1-11 > /home/user/bioinformatics/reference_genomes/GRCm38/tmp.txt```

```mv /home/user/bioinformatics/reference_genomes/GRCm38/tmp.txt /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.ref_flat.txt```
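As a rough illustration of the reformatting step, the sketch below runs the same `awk` and `cut` logic on one made-up 15-column line. The values `f1` to `f15` and the function name `reformat_refflat` are placeholders, and `cut` is used with its default tab delimiter.

```bash
# Same logic as the reformat command above: copy field 12 to the front,
# then keep only the first 11 tab-separated columns of the result.
reformat_refflat() {
  awk '{print $12"\t"$0}' | cut -f1-11
}

# One toy line with 15 tab-separated fields: f1, f2, ..., f15.
toy=$(printf 'f%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | paste -s -)

printf '%s\n' "$toy" | reformat_refflat
```

The output starts with `f12` and is followed by `f1` through `f10`, which mirrors the gene name followed by the first ten genePred columns.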

Go to the directory with your alignments.

```cd /home/user/bioinformatics/BP_PRJNA496042/alignments && mkdir picard```

Because this data is unstranded, I set ```STRAND=NONE``` instead of the ```STRAND=SECOND_READ_TRANSCRIPTION_STRAND``` flag. With the stranded setting, the quality report showed about half the reads as incorrectly mapped.

```find *_*.bam -exec echo java -jar /home/user/bioinformatics/picard.jar CollectRnaSeqMetrics I={} O=picard/{}.RNA_Metrics REF_FLAT=/home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.ref_flat.txt STRAND=NONE RIBOSOMAL_INTERVALS=/home/user/bioinformatics/reference_genomes/GRCm38/ref_ribosome.interval_list \; | sh```
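The `find ... -exec echo ... \; | sh` pattern appears throughout this pipeline, so here is a small sketch of how it works, using placeholder files and `wc -c` instead of a real tool. The temporary directory, file contents and the `.size.txt` suffix are all assumptions made for the demonstration, and substituting `{}` inside a longer argument relies on GNU find behaviour.

```bash
work=$(mktemp -d)
mkdir "$work/out"
printf 'abc'   > "$work/Control_1.bam"   # placeholder files, not real BAMs
printf 'abcde' > "$work/PDX_1.bam"

(
  cd "$work" || exit 1
  # Step 1: with echo, find only prints one command line per matching file.
  find *_*.bam -exec echo wc -c {} \> out/{}.size.txt \;
  # Step 2: piping those same lines to sh actually runs them.
  find *_*.bam -exec echo wc -c {} \> out/{}.size.txt \; | sh
)

cat "$work"/out/*.size.txt
```

Running the command without `| sh` first is a convenient dry run: you can read the generated command lines and check the paths before anything is executed.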

### RSeQC

Another tool for generating RNA-seq QC reports.

You will need
- aligned bam files
- index file for each bam file
- transcript bed file (in bed12 format)

Start by converting the reference gtf to genePred. This step has probably already been done.

```gtfToGenePred /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.genePred```

Then convert the genePred to bed12. This step has probably also been done already.

```genePredToBed /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.genePred /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.bed12```

Go back to the alignment directory. Make sure the bioinformatics env is active.

```cd /home/user/bioinformatics/BP_PRJNA496042/alignments && mkdir rseqc```

```geneBody_coverage.py -i Control_1.bam,Control_2.bam,Control_3.bam,Control_4.bam -r /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.bed12 -o rseqc/CONTROL```

```geneBody_coverage.py -i PDX_1.bam,PDX_2.bam,PDX_3.bam,PDX_4.bam -r /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.bed12 -o rseqc/PDX```

The following commands should also be run from the alignment directory ```cd /home/user/bioinformatics/BP_PRJNA496042/alignments```

```find *_*.bam -exec echo inner_distance.py -i {} -r /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.bed12 -o rseqc/{} \; | sh```

```find *_*.bam -exec echo junction_annotation.py -i {} -r /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.bed12 -o rseqc/{} \; | sh```

```find *_*.bam -exec echo junction_saturation.py -i {} -r /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.bed12 -o rseqc/{} \; | sh```

```find *_*.bam -exec echo read_distribution.py -i {} -r /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.bed12 \> rseqc/{}.read_dist.txt \; | sh```

```find *_*.bam -exec echo RNA_fragment_size.py -i {} -r /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.bed12 \> rseqc/{}.frag_size.txt \; | sh```

```find *_*.bam -exec echo bam_stat.py -i {} \> {}.bam_stat.txt \; | sh```

```rm -f log.txt```

### MultiQC

Now we will compile a multiQC report from all of the QC done above.

In the alignment directory ```cd /home/user/bioinformatics/BP_PRJNA496042/alignments```.

```multiqc ./```

***
# Expression Analysis

Some notes before we start.

First, there are multiple things you can do with your aligned reads from HISAT2. This pipeline is written with the intention of going through the differential expression analysis and visualizing it. However, you can also use **HTSeq-count** to get raw read counts instead of FPKM values.

FPKM = fragments per kilobase of transcript per million mapped reads. FPKM attempts to normalize for gene size and library depth. RPKM (reads per kilobase of transcript per million mapped reads) is essentially the same thing but counts reads instead of fragments. Opinions differ on just how similar the two are, and on which is better.

Mathematically, FPKM can be represented as:

```FPKM = (C / (N / 1000000)) / (L / 1000)```, where C = number of mappable fragments for a gene, N = total number of mappable fragments in the library, and L = number of base pairs in the gene.

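Another normalization metric that may be mentioned is TPM (transcripts per million). In terms of calculation, the only difference between FPKM and TPM is the order of operations: FPKM normalizes for library depth first and then for gene length, while TPM normalizes for gene length first and then for library depth. As a result, the TPM values of a sample always sum to one million.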
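As a minimal sketch of both calculations, the following computes FPKM and TPM with `awk` for three made-up genes. The gene names, counts, lengths and the helper names `toy_counts` and `fpkm_tpm` are all hypothetical, and using the plain gene length L for TPM is a simplifying assumption. Real tools work per transcript and may differ in detail.

```bash
# Columns: gene, fragment count (C), gene length in bp (L). All values are made up.
toy_counts() {
  printf 'geneA\t100\t1000\n'
  printf 'geneB\t300\t2000\n'
  printf 'geneC\t600\t4000\n'
}

fpkm_tpm() {
  awk -F'\t' -v OFS='\t' '
    # First pass over the input: store values, sum counts (N) and length-normalized rates.
    { gene[NR]=$1; c[NR]=$2; len[NR]=$3; n+=$2; rate_sum+=$2/($3/1000) }
    END {
      print "gene", "FPKM", "TPM"
      for (i=1; i<=NR; i++) {
        fpkm = (c[i] / (n/1000000)) / (len[i]/1000)          # depth first, then length
        tpm  = (c[i] / (len[i]/1000)) / (rate_sum/1000000)   # length first, then depth
        printf "%s\t%.1f\t%.1f\n", gene[i], fpkm, tpm
      }
    }'
}

toy_counts | fpkm_tpm
```

With these toy numbers geneB and geneC get the same FPKM and the same TPM, because geneC has twice the fragments on twice the length. The TPM column sums to one million, while the FPKM column does not.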

### Stringtie

If you want to know more about the math behind how Stringtie works, you can [read the paper](https://www.nature.com/articles/nbt.3122).

An outline of the Stringtie flags used:


- ```--rf``` tells StringTie that our data is stranded and to use the correct strand-specific mode (i.e. assume a stranded fr-firststrand library). Our data is unstranded, so this flag is not used.
- ```-p 24``` tells StringTie to use 24 threads.
- ```-G``` reference annotation to use for guiding the assembly process (GTF/GFF3)
- ```-e``` only estimate the abundance of given reference transcripts (requires -G)
- ```-B``` enable output of Ballgown table files which will be created in the same directory as the output GTF (requires -G, -o recommended)
- ```-o``` output path/file name for the assembled transcripts GTF (default: stdout)
- ```-A``` output path/file name for gene abundance estimates

I'll make this folder within the alignment folder ```cd /home/user/bioinformatics/BP_PRJNA496042/alignments```.

```mkdir -p expression/stringtie/ref_only/ && cd expression/stringtie/ref_only/```

```cd /home/user/bioinformatics/stringtie-2.2.1.Linux_x86_64```

```./stringtie -p 24 -G /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf -e -B -o /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/Control_1/transcripts.gtf -A /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/Control_1/gene_abundances.tsv /home/user/bioinformatics/BP_PRJNA496042/alignments/Control_1.bam```

```./stringtie -p 24 -G /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf -e -B -o /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/Control_2/transcripts.gtf -A /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/Control_2/gene_abundances.tsv /home/user/bioinformatics/BP_PRJNA496042/alignments/Control_2.bam```

```./stringtie -p 24 -G /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf -e -B -o /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/Control_3/transcripts.gtf -A /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/Control_3/gene_abundances.tsv /home/user/bioinformatics/BP_PRJNA496042/alignments/Control_3.bam```

```./stringtie -p 24 -G /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf -e -B -o /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/Control_4/transcripts.gtf -A /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/Control_4/gene_abundances.tsv /home/user/bioinformatics/BP_PRJNA496042/alignments/Control_4.bam```

```./stringtie -p 24 -G /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf -e -B -o /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/PDX_1/transcripts.gtf -A /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/PDX_1/gene_abundances.tsv /home/user/bioinformatics/BP_PRJNA496042/alignments/PDX_1.bam```

```./stringtie -p 24 -G /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf -e -B -o /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/PDX_2/transcripts.gtf -A /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/PDX_2/gene_abundances.tsv /home/user/bioinformatics/BP_PRJNA496042/alignments/PDX_2.bam```

```./stringtie -p 24 -G /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf -e -B -o /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/PDX_3/transcripts.gtf -A /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/PDX_3/gene_abundances.tsv /home/user/bioinformatics/BP_PRJNA496042/alignments/PDX_3.bam```

```./stringtie -p 24 -G /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf -e -B -o /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/PDX_4/transcripts.gtf -A /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/PDX_4/gene_abundances.tsv /home/user/bioinformatics/BP_PRJNA496042/alignments/PDX_4.bam```
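The eight commands above differ only in the sample name, so they could also be generated in a loop. This is a sketch that reuses the same paths and flags and follows the `echo ... | sh` style used elsewhere in this pipeline. The variable names `aln`, `gtf` and `out` and the function name `stringtie_cmds` are my own.

```bash
aln=/home/user/bioinformatics/BP_PRJNA496042/alignments
gtf=/home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf
out=$aln/expression/stringtie/ref_only

# Print one StringTie command per sample (nothing is executed here).
stringtie_cmds() {
  for s in Control_1 Control_2 Control_3 Control_4 PDX_1 PDX_2 PDX_3 PDX_4; do
    echo ./stringtie -p 24 -G "$gtf" -e -B \
      -o "$out/$s/transcripts.gtf" \
      -A "$out/$s/gene_abundances.tsv" \
      "$aln/$s.bam"
  done
}

stringtie_cmds
```

To actually run the commands, call `stringtie_cmds | sh` from the StringTie folder after checking the printed lines.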


Have a look at what the raw output from StringTie looks like.

```less -S /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/Control_1/transcripts.gtf```

Improve formatting of the *less* output.

```grep -v "^#" /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/Control_1/transcripts.gtf | grep -w "transcript" | column -t | less -S```

You can also limit the view to just the transcript records and the corresponding FPKM values.

```awk '{if ($3=="transcript") print}' /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/Control_1/transcripts.gtf | cut -f 1,4,9 | column -t | less -S```

You can also view the transcript-level and gene-level expression values for each sample in the two files ```t_data.ctab``` and ```gene_abundances.tsv```. They are located in each sample's folder (```Control_1```, ```PDX_1```, and so on) under ```expression/stringtie/ref_only/``` in the alignment directory.

Download the script for creating an expression matrix based off the Stringtie outputs.

```cd /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only```

```wget https://raw.githubusercontent.com/griffithlab/rnabio.org/master/assets/scripts/stringtie_expression_matrix.pl```

```chmod +x stringtie_expression_matrix.pl```

```./stringtie_expression_matrix.pl --expression_metric=TPM --result_dirs='Control_1,Control_2,Control_3,Control_4,PDX_1,PDX_2,PDX_3,PDX_4' --transcript_matrix_file=transcript_tpm_all_samples.tsv --gene_matrix_file=gene_tpm_all_samples.tsv```

```./stringtie_expression_matrix.pl --expression_metric=FPKM --result_dirs='Control_1,Control_2,Control_3,Control_4,PDX_1,PDX_2,PDX_3,PDX_4' --transcript_matrix_file=transcript_fpkm_all_samples.tsv --gene_matrix_file=gene_fpkm_all_samples.tsv```

```./stringtie_expression_matrix.pl --expression_metric=Coverage --result_dirs='Control_1,Control_2,Control_3,Control_4,PDX_1,PDX_2,PDX_3,PDX_4' --transcript_matrix_file=transcript_coverage_all_samples.tsv --gene_matrix_file=gene_coverage_all_samples.tsv```

View the TPM output files.

```column -t transcript_tpm_all_samples.tsv | less -S```

```column -t gene_tpm_all_samples.tsv | less -S```

### Generate raw counts instead of FPKM/TPM values

If you only want to generate the raw counts this can be done using HTSeq-count [docs](https://htseq.readthedocs.io/).

Some notes about the flags used:


- ```--format``` specifies the input file format, either BAM or SAM. Since we have BAM format files, select 'bam' for this option.
- ```--order``` provides the expected sort order of the input file. Previously we generated position sorted BAM files so use 'pos'.
- ```--mode``` determines how to deal with reads that overlap more than one feature. We believe the 'intersection-strict' mode is best.
- ```--stranded``` specifies whether data is stranded or not. The TruSeq strand-specific RNA libraries suggest the 'reverse' option for this parameter. This data is unstranded so the flag is set to 'no'.
- ```--minaqual``` will skip all reads with alignment quality lower than the given minimum value.
- ```--type``` specifies the feature type (3rd column in GFF file) to be used. (default, suitable for RNA-Seq and Ensembl GTF files: exon)
- ```--idattr``` is the feature ID used to identify the counts in the output table. The default, suitable for RNA-Seq and Ensembl GTF files, is gene_id.

Change directories to where the expression folder is ```cd /home/user/bioinformatics/BP_PRJNA496042/alignments```

```mkdir -p expression/htseq_counts && cd expression/htseq_counts```

Run each htseq-count command to get the raw counts for each read file.

```htseq-count --format bam --order pos --mode intersection-strict --stranded no --minaqual 1 --type exon --idattr gene_id /home/user/bioinformatics/BP_PRJNA496042/alignments/Control_1.bam /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf > /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/htseq_counts/Control_1_gene.tsv```

```htseq-count --format bam --order pos --mode intersection-strict --stranded no --minaqual 1 --type exon --idattr gene_id /home/user/bioinformatics/BP_PRJNA496042/alignments/Control_2.bam /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf > /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/htseq_counts/Control_2_gene.tsv```

```htseq-count --format bam --order pos --mode intersection-strict --stranded no --minaqual 1 --type exon --idattr gene_id /home/user/bioinformatics/BP_PRJNA496042/alignments/Control_3.bam /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf > /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/htseq_counts/Control_3_gene.tsv```

```htseq-count --format bam --order pos --mode intersection-strict --stranded no --minaqual 1 --type exon --idattr gene_id /home/user/bioinformatics/BP_PRJNA496042/alignments/Control_4.bam /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf > /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/htseq_counts/Control_4_gene.tsv```

```htseq-count --format bam --order pos --mode intersection-strict --stranded no --minaqual 1 --type exon --idattr gene_id /home/user/bioinformatics/BP_PRJNA496042/alignments/PDX_1.bam /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf > /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/htseq_counts/PDX_1_gene.tsv```

```htseq-count --format bam --order pos --mode intersection-strict --stranded no --minaqual 1 --type exon --idattr gene_id /home/user/bioinformatics/BP_PRJNA496042/alignments/PDX_2.bam /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf > /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/htseq_counts/PDX_2_gene.tsv```

```htseq-count --format bam --order pos --mode intersection-strict --stranded no --minaqual 1 --type exon --idattr gene_id /home/user/bioinformatics/BP_PRJNA496042/alignments/PDX_3.bam /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf > /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/htseq_counts/PDX_3_gene.tsv```

```htseq-count --format bam --order pos --mode intersection-strict --stranded no --minaqual 1 --type exon --idattr gene_id /home/user/bioinformatics/BP_PRJNA496042/alignments/PDX_4.bam /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf > /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/htseq_counts/PDX_4_gene.tsv```

I don't feel like paraphrasing this next section:

"Merge results files into a single matrix for use in edgeR. The following joins the results for each replicate together, adds a header, reformats the result as a tab delimited file, and shows you the first 10 lines of the resulting file :"

```cd /home/user/bioinformatics/BP_PRJNA496042/alignments/expression/htseq_counts```

```join Control_1_gene.tsv Control_2_gene.tsv | join - Control_3_gene.tsv | join - Control_4_gene.tsv | join - PDX_1_gene.tsv | join - PDX_2_gene.tsv | join - PDX_3_gene.tsv | join - PDX_4_gene.tsv > gene_read_counts_table_all.tsv```

```echo "GeneID Control_1 Control_2 Control_3 Control_4 PDX_1 PDX_2 PDX_3 PDX_4" > header.txt```

```cat header.txt gene_read_counts_table_all.tsv | grep -v "__" | awk -v OFS="\t" '$1=$1' > gene_read_counts_table_all_final.tsv```

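If you're curious, ```grep -v "__"``` filters out the summary lines at the end of each htseq-count file (```__no_feature```, ```__ambiguous```, ```__too_low_aQual```, ```__not_aligned``` and ```__alignment_not_unique```). These lines count reads that were not assigned to any gene because they did not align, had poor alignment quality, etc.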

```rm -f gene_read_counts_table_all.tsv header.txt```

```head gene_read_counts_table_all_final.tsv | column -t```
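To make the merge steps easier to follow, here is a small sketch that runs the same `join`, header and `grep -v "__"` logic on two tiny stand-in files. The sample names `S1` and `S2`, the gene IDs and the counts are placeholders. `LC_ALL=C` is my addition, used only to make the sort order that `join` expects predictable.

```bash
work=$(mktemp -d)

# Two tiny stand-ins for htseq-count output (gene ID, count), sorted by gene ID,
# each ending with a "__" summary line.
printf 'G1\t5\nG2\t0\n__no_feature\t7\n' > "$work/S1_gene.tsv"
printf 'G1\t9\nG2\t2\n__no_feature\t3\n' > "$work/S2_gene.tsv"

# join pairs lines with the same gene ID and writes space-separated output.
LC_ALL=C join "$work/S1_gene.tsv" "$work/S2_gene.tsv" > "$work/all.tsv"

# Add a header, drop the "__" summary lines, and convert spaces to tabs.
echo "GeneID S1 S2" > "$work/header.txt"
cat "$work/header.txt" "$work/all.tsv" | grep -v "__" | awk -v OFS="\t" '$1=$1' > "$work/final.tsv"

cat "$work/final.tsv"
```

The result is a tab-delimited table with one header line and one row per gene. The `__no_feature` row is gone because it matched the `grep -v "__"` filter.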

***
# Differential Expression (Ballgown)

The differential expression is done using the bioconductor package [Ballgown](https://www.bioconductor.org/packages/release/bioc/html/ballgown.html).

```cd /home/user/bioinformatics/BP_PRJNA496042/alignments```

```mkdir -p de/ballgown/ref_only/ && cd de/ballgown/ref_only/```

Start an R session by typing ```R``` in a terminal.

The following code is an R script that will perform DE using Ballgown. You can run each command line by line.

In [1]:
# load the required libraries
library(ballgown)
library(genefilter)
library(dplyr)
library(devtools)

# Create phenotype data needed for ballgown analysis
ids=c("Control_1","Control_2","Control_3","Control_4","PDX_1","PDX_2","PDX_3","PDX_4")
type=c("Control","Control","Control","Control","PDX","PDX","PDX","PDX")
results="/home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/" # be sure to include the trailing "/"
path=paste(results,ids,sep="")
pheno_data=data.frame(ids,type,path)

# Load ballgown data structure and save it to a variable "bg"
bg = ballgown(samples=as.vector(pheno_data$path), pData=pheno_data)

# Display a description of this object
bg

# Load all attributes including gene name
bg_table = texpr(bg, 'all')
bg_gene_names = unique(bg_table[, 9:10])

# Save the ballgown object to a file for later use
save(bg, file='/home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/bg.rda')

# Perform differential expression (DE) analysis with no filtering
results_transcripts = stattest(bg, feature="transcript", covariate="type", getFC=TRUE, meas="FPKM")
results_genes = stattest(bg, feature="gene", covariate="type", getFC=TRUE, meas="FPKM")
results_genes = merge(results_genes, bg_gene_names, by.x=c("id"), by.y=c("gene_id"))

# Save a tab delimited file for both the transcript and gene results
write.table(results_transcripts, "/home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/Control_vs_PDX_transcript_results.tsv", sep="\t", quote=FALSE, row.names = FALSE)
write.table(results_genes, "/home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/Control_vs_PDX_gene_results.tsv", sep="\t", quote=FALSE, row.names = FALSE)

# Filter low-abundance genes. Here we remove all transcripts with a variance across the samples of less than one
bg_filt = subset (bg,"rowVars(texpr(bg)) > 1", genomesubset=TRUE)

# Load all attributes including gene name
bg_filt_table = texpr(bg_filt , 'all')
bg_filt_gene_names = unique(bg_filt_table[, 9:10])

# Perform DE analysis now using the filtered data
results_transcripts = stattest(bg_filt, feature="transcript", covariate="type", getFC=TRUE, meas="FPKM")
results_genes = stattest(bg_filt, feature="gene", covariate="type", getFC=TRUE, meas="FPKM")
results_genes = merge(results_genes, bg_filt_gene_names, by.x=c("id"), by.y=c("gene_id"))

# Output the filtered list of genes and transcripts and save to tab delimited files
write.table(results_transcripts, "/home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/Control_vs_PDX_transcript_results_filtered.tsv", sep="\t", quote=FALSE, row.names = FALSE)
write.table(results_genes, "/home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/Control_vs_PDX_gene_results_filtered.tsv", sep="\t", quote=FALSE, row.names = FALSE)

# Identify the significant genes with p-value < 0.05
sig_transcripts = subset(results_transcripts, results_transcripts$pval<0.05)
sig_genes = subset(results_genes, results_genes$pval<0.05)

# Output the significant gene results to a pair of tab delimited files
write.table(sig_transcripts, "/home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/Control_vs_PDX_transcript_results_sig.tsv", sep="\t", quote=FALSE, row.names = FALSE)
write.table(sig_genes, "/home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/Control_vs_PDX_gene_results_sig.tsv", sep="\t", quote=FALSE, row.names = FALSE)

# Exit the R session
quit(save="no")


Attaching package: ‘ballgown’


The following object is masked from ‘package:base’:

    structure



Attaching package: ‘dplyr’


The following objects are masked from ‘package:ballgown’:

    contains, expr, last


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: usethis

Fri Feb 17 23:41:07 2023

Fri Feb 17 23:41:07 2023: Reading linking tables

Fri Feb 17 23:41:08 2023: Reading intron data files

Fri Feb 17 23:41:10 2023: Merging intron data

Fri Feb 17 23:41:11 2023: Reading exon data files

Fri Feb 17 23:41:17 2023: Merging exon data

Fri Feb 17 23:41:19 2023: Reading transcript data files

Fri Feb 17 23:41:21 2023: Merging transcript data

Wrapping up the results

Fri Feb 17 23:41:22 2023



ballgown instance with 142699 transcripts and 8 samples

View the raw output from Ballgown.

```head /home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/Control_vs_PDX_gene_results.tsv | column -t```

Count how many genes are present in the unfiltered results (all chromosomes).

```grep -v feature /home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/Control_vs_PDX_gene_results.tsv | wc -l```

Output: 55487

Count the genes that passed the filtering step.

```grep -v feature /home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/Control_vs_PDX_gene_results_filtered.tsv | wc -l```

Output: 5301

View how many genes were considered differentially expressed (p < 0.05).

```grep -v feature /home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/Control_vs_PDX_gene_results_sig.tsv | wc -l```

Output: 1254

As an exercise or just a visualization, you can look at the top 20 DE genes in IGV and see if they seem to make sense.

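```grep -v feature /home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/Control_vs_PDX_gene_results_sig.tsv | sort -rnk 3 | head -n 20 | column -t``` (largest fold change first; higher abundance in Control?)

```grep -v feature /home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/Control_vs_PDX_gene_results_sig.tsv | sort -nk 3 | head -n 20 | column -t``` (smallest fold change first; higher abundance in PDX?)

The direction of the fold change in column 3 depends on which group Ballgown treats as the baseline, so confirm it against the FPKM values before interpreting these two lists.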
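The sorting step can be tried on a toy table first. The sketch below assumes the same column layout as the gene results file (id, feature, fc, pval, qval, gene_name). The rows and the helper name `toy_results` are made up.

```bash
# A made-up results table with the same six columns as the gene results file.
toy_results() {
  printf 'id\tfeature\tfc\tpval\tqval\tgene_name\n'
  printf 'g1\tgene\t0.25\t0.01\t0.04\tAaa\n'
  printf 'g2\tgene\t8.00\t0.02\t0.04\tBbb\n'
  printf 'g3\tgene\t1.50\t0.03\t0.05\tCcc\n'
}

# grep -v feature drops the header line; sort -k 3 orders numerically by fold change.
toy_results | grep -v feature | sort -rnk 3 | head -n 1 | cut -f 6   # largest fc
toy_results | grep -v feature | sort -nk 3  | head -n 1 | cut -f 6   # smallest fc
```

A fold change above 1 means higher abundance in one group and a value below 1 means higher abundance in the other, so the two sorted lists show opposite ends of the comparison.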

Save the names of all the genes that have a p < 0.05.

```grep -v feature /home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/Control_vs_PDX_gene_results_sig.tsv | cut -f 6 | sed 's/\"//g' > /home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/DE_genes.txt```

```head /home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/DE_genes.txt```


***
# Ballgown DE Visualization

```cd /home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only```

Run the R code in the cell below. Uncomment all the lines with ctrl + / if needed.

In [None]:
#load libraries
library(ballgown)
library(genefilter)
library(dplyr)
library(devtools)

#designate output file
outfile="/home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/BP_PRJNA496042_ballgown_output.pdf"

# Generate phenotype data
ids = c("Control_1","Control_2","Control_3","Control_4","PDX_1","PDX_2","PDX_3","PDX_4")
type = c("Control","Control","Control","Control","PDX","PDX","PDX","PDX")
results = "/home/user/bioinformatics/BP_PRJNA496042/alignments/expression/stringtie/ref_only/" # be sure to include the trailing "/"
path = paste(results,ids,sep="")
pheno_data = data.frame(ids,type,path)

# Display the phenotype data
pheno_data

# Load the ballgown object from file
# Make sure to set the path to 'bg.rda', this file was created in the ballgown DE R script
load('/home/user/bioinformatics/BP_PRJNA496042/alignments/de/ballgown/ref_only/bg.rda')

# The load command loads an R object from a file into memory in our R session.
# You can use ls() to view the names of variables that have been loaded
ls()

# Print a summary of the ballgown object
bg

# Open a PDF file where we will save some plots. We will save all figures and then view the PDF at the end
pdf(file=outfile)

# Extract FPKM values from the 'bg' object
fpkm = texpr(bg,meas="FPKM")

# View the last several rows of the FPKM table
tail(fpkm)

# Transform the FPKM values by adding 1 and convert to a log2 scale
# Adding 1 to the FPKM (a pseudocount) prevents issues with taking the log of
# zero and keeps every transformed value non-negative.
# The log2 transform compresses the range of values, allowing easier visualization.
fpkm = log2(fpkm+1)

# View the last several rows of the transformed FPKM table
tail(fpkm)

# Create boxplots to display summary statistics for the FPKM values for each sample
boxplot(fpkm,col=as.numeric(as.factor(pheno_data$type))+1,las=2,ylab='log2(FPKM+1)')

# col=as.numeric(as.factor(pheno_data$type))+1 - set color based on pheno_data$type which is Control vs PDX
# las=2 - labels are perpendicular to axis 
# ylab='log2(FPKM+1)' - set ylab to indicate that values are log2 transformed

# Display the transcript ID for a single row of data
ballgown::transcriptNames(bg)[2763]

# Display the gene name for a single row of data
ballgown::geneNames(bg)[2763]

# Create a BoxPlot comparing the expression of a single gene for all replicates of both conditions
boxplot(fpkm[2763,] ~ pheno_data$type, border=c(2,3), main=paste(ballgown::geneNames(bg)[2763],' : ', ballgown::transcriptNames(bg)[2763]),pch=19, xlab="Type", ylab='log2(FPKM+1)')

# border=c(2,3) - set border color for each of the boxplots
# main=paste(ballgown::geneNames(bg)[2763],' : ', ballgown::transcriptNames(bg)[2763]) - set title to gene : transcript
# xlab="Type" - set x label to Type
# ylab='log2(FPKM+1)' - set ylab to indicate that values are log2 transformed


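# Add the FPKM values for each sample onto the plot
# The x positions and colors are taken from pheno_data$type so they match the 8 samples (4 Control, 4 PDX)
points(fpkm[2763,] ~ jitter(as.numeric(as.factor(pheno_data$type))), col=as.numeric(as.factor(pheno_data$type))+1, pch=16)
# pch=16 - set plot symbol to solid circle, default is empty circle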


# Create a plot of transcript structures observed in each replicate and color transcripts by expression level
plotTranscripts(ballgown::geneIDs(bg)[2763], bg, main=paste(ballgown::geneNames(bg)[2763], 'in all Control samples'), sample=c('Control_1', 'Control_2', 'Control_3', 'Control_4'), labelTranscripts=TRUE)
plotTranscripts(ballgown::geneIDs(bg)[2763], bg, main=paste(ballgown::geneNames(bg)[2763], 'in all PDX samples'), sample=c('PDX_1', 'PDX_2', 'PDX_3', 'PDX_4'), labelTranscripts=TRUE)

#plotMeans(ballgown::geneIDs(bg)[2763],bg,groupvar="type",legend=FALSE)

# Close the PDF device where we have been saving our plots
dev.off()