# Introduction

This RNA-seq pipeline is based on **RNA-seq_pipeline_practice_v1.0.0** and is meant to test how well it works on a full dataset with known outcomes to compare against.

The full dataset is available on [BioProject - Accession PRJNA496042](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA496042).

The paper this dataset is from:

```Wilson, Hannah E., Kacey Rhodes, Daniel Rodriguez, Daniel A. Rodriguez, Daniel Rodriguez, Ikttesh Chahal, David A. Stanton, et al. “Human Breast Cancer Xenograft Model Implicates Peroxisome Proliferator–Activated Receptor Signaling as Driver of Cancer-Induced Muscle Fatigue.” Clinical Cancer Research 25, no. 7 (April 1, 2019): 2336–47. https://doi.org/10.1158/1078-0432.ccr-18-1565.```

## Initial setup
The tutorial being followed uses the AWS cloud, but I am writing this pipeline to be run locally on a Linux-based machine. More specifically, I am using Manjaro with kernel version 6.1.11-1-MANJARO (64-bit) and KDE Plasma version 5.26.4 on Wayland.

[Configure your environment](https://rnabio.org/module-00-setup/0000/09/01/Environment/) and then [install all necessary software](https://rnabio.org/module-00-setup/0000/10/01/Installation/) according to the rnabio website. The specific steps you follow while doing this will depend on your system and how you want to install the tools. I would suggest looking for maintained packages for your specific distribution before trying to install from source.

Specific version information for the software installed on my setup:
- SAMtools v1.16.1, installed from AUR
    - run with ```samtools``` from CL
- bam-readcount v1.0.1, compiled from source
    - run from ```$HOME/bioinformatics/bam-readcount/build/bin``` with ```./bam-readcount```
- HISAT2 v2.2.1-1, installed from AUR
    - run with ```hisat2``` from CL
- stringtie v2.2.1, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/stringtie-2.2.1.Linux_x86_64``` with ```./stringtie```
- gffcompare v0.12.6-2, installed from AUR
    - run with ```gffcompare``` from CL
- htseq-count v0.11.3-1, installed from AUR and also in bioinformatics conda env
    - run with ```htseq-count``` from terminal with conda bioinformatics env active
- tophat v2.1.1, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/tophat-2.1.1.Linux_x86_64``` with ```./gtf_to_fasta```
- kallisto v0.48.0-1, installed from AUR
    - run with ```kallisto``` from CL
- fastqc v0.11.9-1, installed from AUR
    - run with ```fastqc``` from CL
- multiqc v1.14, installed with pip3 in bioinformatics conda env
    - run with ```multiqc --help``` from CL
- picard v2.26.4, downloaded as JAR file
    - run from ```$HOME/bioinformatics/``` with ```java -jar picard.jar```
- flexbar v3.5.0, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/flexbar-3.5.0-linux``` with ```./flexbar```
- regtools v1.0.0, installed from AUR (git source)
    - run with ```regtools``` from CL
- rseqc v5.0.1, installed with pip3 in bioinformatics conda env
    - import in Python with ```rseqc``` or run with ```read_GC.py``` in CL?
- bedops v2.4.40, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/bedops_linux_x86_64-v2.4.40/bin``` with ```./bedops``` and ```./gff2bed```
- gtfToGenePred v unknown, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/gtfToGenePred``` with ```./gtfToGenePred```
- genePredToBed v unknown, downloaded pre-compiled binary
    - run from ```$HOME/bioinformatics/genePredToBed``` with ```./genePredToBed```
- how_are_we_stranded_here, v unknown, installed with pip3 in bioinformatics conda env
    - run with ```check_strandedness``` in CL
- cell ranger v7.1.0, have to register to download, pre-compiled binary
    - run from ```$HOME/bioinformatics/cellranger-7.1.0``` with ```./cellranger```
- tabix, website only offers apt installation, which is not relevant on an Arch-based distro. Tabix seems to be included with HTSlib, which is already installed as a dependency of samtools and bcftools
- BWA v0.7.17-r1188, compiled from source
    - run from ```$HOME/bioinformatics/bwa``` with ```./bwa``` or ```bwa```
- BCFtools v1.16, installed from AUR
    - run with ```bcftools``` from CL
- peddy v0.4.8, cloned and installed with pip in bioinformatics conda env
    - run with ```peddy``` from CL in conda env
- slivar v0.2.7, pre-compiled binary
    - run from ```$HOME/bioinformatics``` with ```./slivar```
- STRling v0.5.1, pre-compiled binary
    - run from ```$HOME/bioinformatics``` with ```./strling```
- freebayes v1.3.6, pre-compiled binary
    - run from ```$HOME/bioinformatics``` with ```./freebayes-1.3.6-linux-amd64-static```
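
Before starting, it can help to confirm that the tools expected on the ```PATH``` actually resolve. The following is a minimal sketch, assuming a POSIX shell. The function name ```missing_tools``` is hypothetical, and the tool list in the usage comment is only the subset described above as runnable directly from the command line.

```shell
# Print the name of every command in the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not part of any tool listed above.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage (tools installed from the AUR in the list above):
# missing_tools samtools hisat2 gffcompare kallisto fastqc regtools bcftools
```

An empty result means every listed tool was found. Tools run from their own directory with ```./``` are not covered by this check.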

### R Libraries needed:
- devtools
- dplyr
- gplots
- ggplot2
- UpSetR

Install with:
```install.packages(c("devtools","dplyr","gplots","ggplot2","UpSetR"),repos="http://cran.us.r-project.org")```

### Bioconductor libraries needed:
- [genefilter](http://bioconductor.org/packages/release/bioc/html/genefilter.html)
- [ballgown](http://bioconductor.org/packages/release/bioc/html/ballgown.html)
- [edgeR](http://www.bioconductor.org/packages/release/bioc/html/edgeR.html)
- [GenomicRanges](http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html)
- [rhdf5](https://www.bioconductor.org/packages/release/bioc/html/rhdf5.html)
- [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html)
- [sva](https://www.bioconductor.org/packages/release/bioc/html/sva.html)
- [GAGE](https://bioconductor.org/packages/release/bioc/html/gage.html)

### Sleuth
[Source](https://pachterlab.github.io/sleuth/download)

devtools needs to be installed if it is not already: ```install.packages("devtools")```

Install sleuth with ```devtools::install_github("pachterlab/sleuth")```


***
# General notes on RNA-seq workflow

Because chromosome naming conventions differ between providers, you should obtain your reference genome and annotation files from the same source (Ensembl, NCBI, or UCSC). Additionally, your annotations must correspond to the same reference genome version (e.g. both correspond to NCBI human GRCh38).

[This guide](https://pmbio.org/module-02-inputs/0002/02/01/Reference_Genome/) explains reference genomes and how to obtain them under the "Reference Genome Options" section at the bottom.



***
# Obtain reference genome and known gene/transcript annotations

### Reference genome

The Mus musculus GRCm38 reference genome in FASTA format was downloaded from [Ensembl](http://nov2020.archive.ensembl.org/Mus_musculus/Info/Index#) via the command ```wget ftp://ftp.ensembl.org:21/pub/release-102/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz```. This is the reference genome the paper states was used.

Location on disk (after decompression): ```/home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.fa```

The Mus musculus GRCm39 reference genome in FASTA format was also downloaded with ```wget https://ftp.ensembl.org/pub/release-109/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz```, but it was not used in this analysis.

### Gene annotations in .gtf format

Gene annotations in .gtf format were also downloaded from Ensembl via the command ```wget ftp://ftp.ensembl.org:21/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz```. Other versions of the .gtf file exist, so I will see later on whether those are needed.

Location (after decompression): ```/home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf```

You can preview the file with ```head -n 50 /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf | column -t | less -p exon -S```

Count the gene records with ```cat /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf | grep -w gene | wc -l```

You can view the structure of one gene with ```grep ENSMUSG00000102693 /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf | less -p "exon\s" -S```
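
The third column of a GTF file holds the feature type, so counts can also be taken from that column directly. This avoids matching the word "gene" elsewhere on a line. A sketch, assuming a tab-separated, uncompressed GTF; ```count_features``` is a hypothetical helper name.

```shell
# Tally feature types (column 3) in a GTF file, skipping '#' header lines.
# Output: one "feature<TAB>count" line per feature type, sorted by name.
count_features() {
    awk -F '\t' '!/^#/ { n[$3]++ } END { for (f in n) print f "\t" n[f] }' "$1" | sort
}

# Usage:
# count_features /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf
```

The ```gene``` row of the output should agree with the count from the ```grep -w gene``` command.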

**Important**: chromosome names in the .gtf file must be the same as those in the reference genome .fa file. If StringTie returns expression values that are all 0, this is the likely cause. Download the annotation and reference genome files from the same source.

**Make sure your reference genome and annotations are for the same build of the genome (e.g. both for GRCh38).**
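
One way to check the naming is to compare the sequence names in the two files. A minimal sketch, assuming an uncompressed FASTA whose header lines start with ```>name``` and a tab-separated GTF; ```seqnames_missing_from_fasta``` is a hypothetical helper name.

```shell
# List sequence names that appear in the GTF (column 1) but not in the FASTA headers.
seqnames_missing_from_fasta() {
    fasta=$1
    gtf=$2
    fa_names=$(mktemp)
    gtf_names=$(mktemp)
    # FASTA header: '>' followed by the name, then an optional description
    grep '^>' "$fasta" | sed 's/^>//' | cut -d ' ' -f 1 | sort -u > "$fa_names"
    # GTF: first tab-separated column, ignoring '#' comment lines
    grep -v '^#' "$gtf" | cut -f 1 | sort -u > "$gtf_names"
    # comm -13 keeps only the lines unique to the second file
    comm -13 "$fa_names" "$gtf_names"
    rm -f "$fa_names" "$gtf_names"
}

# Usage:
# seqnames_missing_from_fasta genome.fa annotation.gtf
```

No output means every chromosome or scaffold named in the annotation also exists in the reference.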

***
# Creating index with HISAT2

**WARNING** This step will take **A LOT** of RAM, well over 100 GB, so make sure the system has enough.

It is much easier and faster, if possible, to download a pre-built index provided by the HISAT2 devs at [their website](https://daehwankimlab.github.io/hisat2/download/).

I downloaded the GRCm38 index and renamed the files with ```rename genome Mus_musculus.GRCm38.dna.primary_assembly *.ht2```.

If you build the index yourself, the splice site and exon extraction below uses the Python scripts from the [HISAT2 GitHub repo](https://github.com/DaehwanKimLab/hisat2) directly. I am not sure whether this can be done with the installed package listed above. Just clone the repo and use the necessary files.

Switch to hisat cloned repo ```cd /home/user/bioinformatics/hisat2```

Run the following commands.

```./hisat2_extract_splice_sites.py /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf > /home/user/bioinformatics/BP_PRJNA496042/splicesites.tsv```

```./hisat2_extract_exons.py /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf > /home/user/bioinformatics/BP_PRJNA496042/exons.tsv```

```hisat2-build -p 24 --ss /home/user/bioinformatics/BP_PRJNA496042/splicesites.tsv --exon /home/user/bioinformatics/BP_PRJNA496042/exons.tsv /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.fa /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly```

The output should be a collection of binary files with names like Mus_musculus.GRCm38.dna.primary_assembly.x.ht2 or .rf, the x being a number.
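
Whether the index was downloaded or built, it is worth confirming that the files are in place under the expected prefix before aligning. A sketch, assuming a standard (small) HISAT2 index made of eight files numbered ```.1.ht2``` to ```.8.ht2```; ```missing_index_files``` is a hypothetical helper name.

```shell
# Report any of the expected index files PREFIX.1.ht2 ... PREFIX.8.ht2
# that are absent or empty.
# Assumption: a standard (small) HISAT2 index consists of eight .ht2 files.
missing_index_files() {
    prefix=$1
    i=1
    while [ "$i" -le 8 ]; do
        # -s is true only for files that exist and are non-empty
        [ -s "$prefix.$i.ht2" ] || printf '%s\n' "$prefix.$i.ht2"
        i=$((i + 1))
    done
}

# Usage:
# missing_index_files /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly
```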

***
# Look at some of the actual RNA-seq data

Look at the head of the files.

```cd /home/user/bioinformatics/BP_PRJNA496042```

```zcat Control_1_Illumina_HiSeq2500_paired.fastq.gz | head -n 100```

A cursory look shows that the reads are around 100 bp long. The quality string of every read in the head consists only of question marks. In Phred+33 encoding, a ```?``` corresponds to a Phred quality score of 30, which means a 0.001 probability that the base was called incorrectly. In other words, it has a base call accuracy of 99.9%.
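
The mapping from a quality character to a Phred score and an error probability can be worked out in the shell. A sketch, assuming Phred+33 encoding; the function names ```phred_score``` and ```error_probability``` are hypothetical.

```shell
# Phred score of a single quality character, assuming Phred+33 encoding.
phred_score() {
    # printf with a leading quote in the argument yields the character's ASCII code
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}

# Error probability for a Phred score Q: P = 10^(-Q/10)
error_probability() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_score '?'        # ASCII 63 - 33 = 30
error_probability 30   # 0.001
```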

Determine how many reads are in a file.

```zcat Control_1_Illumina_HiSeq2500_paired.fastq.gz | grep -P "^\@SRR" | wc -l```

There are 30,789,570 reads in this file.
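
The ```grep``` count depends on the header prefix and could overcount if a quality line happened to start with ```@SRR```. An alternative sketch that relies only on each FASTQ record being exactly four lines; ```count_fastq_reads``` is a hypothetical helper name.

```shell
# Count reads in a gzipped FASTQ file: every record is exactly four lines.
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Usage:
# count_fastq_reads Control_1_Illumina_HiSeq2500_paired.fastq.gz
```

Both methods should report the same number for a well-formed file.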

***
# Check strandedness (optional, but not really)

This step is not strictly required, but because RNA-seq library construction kits differ, it is **highly** recommended to verify strandedness empirically using the method below.

check_strandedness should already have been installed during the initial setup. It can also be installed as a [Docker image](https://hub.docker.com/r/mgibio/checkstrandedness).

Putting the intermediate files in the GRCm38 folder should work for now.

```/home/user/bioinformatics/reference_genomes/GRCm38```

Convert GTF to genePred

```cd /home/user/bioinformatics/gtfToGenePred```

```./gtfToGenePred /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.genePred```

Convert genePred to BED12

```cd /home/user/bioinformatics/genePredToBed```

```./genePredToBed /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.genePred /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.bed12```

Use bedtools to extract the transcript sequences in FASTA format from the BED12 file

```bedtools getfasta -fi /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.fa -bed /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.bed12 -s -split -name -fo /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.transcripts.fa```

Clean up the header lines of the transcript file

```cat /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.transcripts.fa | perl -ne 'if($_ =~/^\>\S+\:\:(ERCC\-\d+)\:.*/){print ">$1\n"}elsif ($_ =~/^\>(\S+)\:\:.*/){print ">$1\n"}else{print $_}' > /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.transcripts.clean.fa```
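
To see what the header cleanup does, here is a sketch of the non-ERCC case using ```sed```, applied to a made-up header of the form the Perl pattern expects (a name, two colons, then coordinates). The sample header and the function name ```clean_headers``` are illustrative, not taken from the actual file, and the ```sed``` pattern is a simplification of the Perl one.

```shell
# Strip everything from '::' onward in FASTA header lines.
# Sequence lines do not match the pattern and pass through unchanged.
clean_headers() {
    sed -E 's/^>([^:]+)::.*/>\1/'
}

# Hypothetical input header, followed by one sequence line
printf '>ENSMUST00000193812::1:3073252-3074322(+)\nACGT\n' | clean_headers
```

The header comes out as ```>ENSMUST00000193812```, which is the bare transcript ID that check_strandedness needs to match against the .gtf file.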

View the clean file

```less /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.transcripts.clean.fa```

Add an empty transcript_id field to every line of the .gtf file that lacks one

```awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102.gtf > /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102_tidy.gtf```

Check strandedness

```conda activate bioinformatics```

```check_strandedness --gtf /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.102_tidy.gtf --transcripts /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.transcripts.clean.fa --reads_1 /home/user/bioinformatics/BP_PRJNA496042/Control_1_Illumina_HiSeq2500_paired.fastq.gz --reads_2 /home/user/bioinformatics/BP_PRJNA496042/Control_2_Illumina_HiSeq2500_paired.fastq.gz```

**Note**: ```--reads_1``` and ```--reads_2``` are intended for the R1 and R2 files of the same sample. The command above passes two different samples (Control_1 and Control_2), so interpret the output below with caution.

OUTPUT:

Total 50314 usable reads were sampled

This is PairEnd Data

Fraction of reads failed to determine: 0.1124

Fraction of reads explained by "1++,1--,2+-,2-+": 0.4433 (49.9% of explainable reads)

Fraction of reads explained by "1+-,1-+,2++,2--": 0.4443 (50.1% of explainable reads)

Under 60% of reads explained by one direction

Data is likely unstranded

More information on strandedness settings can be found [here](https://rnabio.org/module-09-appendix/0009/12/01/StrandSettings/).
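
The percentages in the output are each orientation's share of the reads that could be assigned to either orientation. The arithmetic can be reproduced as follows; ```explained_share``` is a hypothetical helper name, and the 60% cutoff is the one quoted in the tool's output above.

```shell
# Share of explainable reads supporting the first orientation, as a percentage.
# $1 = fraction explained by that orientation, $2 = fraction explained by the other.
explained_share() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / (a + b) }'
}

explained_share 0.4433 0.4443   # "1++,1--,2+-,2-+" pattern: 49.9
explained_share 0.4443 0.4433   # "1+-,1-+,2++,2--" pattern: 50.1
```

Neither orientation reaches 60%, which is why the tool reports the data as likely unstranded.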

***
# Pre-alignment QC

To get a sense of the raw reads quality in your fastq files, run the following programs.

### fastqc

```cd /home/user/bioinformatics/BP_PRJNA496042```

Run ```fastqc *.fastq.gz``` in the same directory as your reads.

It will output an HTML file for each FASTQ file in the same directory, containing a full report of the quality measures.

In the "Overrepresented sequences" section, try [BLASTing](https://blast.ncbi.nlm.nih.gov/Blast.cgi) the sequences to determine a source organism.

### multiqc

Run multiqc to combine all the fastqc reports.

```cd /home/user/bioinformatics/BP_PRJNA496042```

Run ```multiqc .``` in the same directory as the reports.

I renamed the report to ```pre-alignment_multiqc_report.html```.

Make a new directory for the fastqc outputs and move them:

```mkdir fastqc && mv *_fastqc* fastqc```

## View reports

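The per base sequence content section shows a skewed base distribution over roughly the first 13 bp. This could be adapter sequence, although a biased read start like this is also commonly seen in RNA-seq libraries made with random-hexamer priming. Either way, these bases are trimmed in the next step.

Blasting some of the overrepresented sequences shows 18S rRNA as a common hit, which most likely reflects residual ribosomal RNA in the libraries.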

***
# Trim adapters

Make a new folder for the trimmed reads

```cd /home/user/bioinformatics/BP_PRJNA496042```

```mkdir trimmed && cd trimmed```

Download the Illumina adapter sequence files, or the relevant files for your sequencing machine.

```wget http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa```

Downloaded to ```/home/user/bioinformatics/BP_PRJNA496042/trimmed/illumina_multiplex.fa```

Use flexbar to remove adapter sequences and trim first 13 bases of each read. Adjust the number of threads to something manageable for your system. It is set to 24 threads but may take much longer with a slower CPU.

Possible options for flexbar and their meaning:

   - ```--adapter-min-overlap 7``` requires a minimum of 7 bases to match the adapter
   - ```--adapter-trim-end RIGHT``` uses a trimming strategy to remove the adapter from the 3 prime or RIGHT end of the read
   - ```--max-uncalled 300``` allows as many as 300 uncalled or N bases (MiSeq read lengths can be 300bp)
   - ```--min-read-length 25``` the minimum read length allowed after trimming is 25bp
   - ```--threads 24``` use 24 threads
   - ```--zip-output GZ``` the input FASTQ files are gzipped so we will output gzipped FASTQ to save space
   - ```--adapters``` define the path to the adapter FASTA file to trim
   - ```--reads``` define the path to the read 1 FASTQ file of reads
   - ```--reads2``` define the path to the read 2 FASTQ file of reads
   - ```--target``` a base path for the output files. The value will have _1.fastq.gz and _2.fastq.gz appended for read 1 and read 2 respectively
   - ```--pre-trim-left``` trim a fixed number of bases at left read end. For example, to trim 5 bases at the left side of reads: ```--pre-trim-left 5```
   - ```--pre-trim-right``` trim a fixed number of bases at right read end. For example, to trim 5 bases at the right side of reads: ```--pre-trim-right 5```
   - ```--pre-trim-phred``` trim based on phred quality value to deal with higher error rates towards the end of reads. For example, to trim the 3 prime end until quality offset value 30 or higher is reached, specify: ```--pre-trim-phred 30```

The commands to run for flexbar for this dataset are as follows:

```cd /home/user/bioinformatics/flexbar-3.5.0-linux```

You will need to export the library path environment variable for this session, or add it to your shell startup scripts.

```export LD_LIBRARY_PATH=/home/user/bioinformatics/flexbar-3.5.0-linux:$LD_LIBRARY_PATH```

```./flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters /home/user/bioinformatics/BP_PRJNA496042/trimmed/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 24 --zip-output GZ --reads /home/user/bioinformatics/BP_PRJNA496042/Control_1_Illumina_HiSeq2500_paired.fastq.gz --target /home/user/bioinformatics/BP_PRJNA496042/trimmed/Control_1_Illumina_HiSeq2500_paired```

```./flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters /home/user/bioinformatics/BP_PRJNA496042/trimmed/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 24 --zip-output GZ --reads /home/user/bioinformatics/BP_PRJNA496042/Control_2_Illumina_HiSeq2500_paired.fastq.gz --target /home/user/bioinformatics/BP_PRJNA496042/trimmed/Control_2_Illumina_HiSeq2500_paired```

```./flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters /home/user/bioinformatics/BP_PRJNA496042/trimmed/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 24 --zip-output GZ --reads /home/user/bioinformatics/BP_PRJNA496042/Control_3_Illumina_HiSeq2500_paired.fastq.gz --target /home/user/bioinformatics/BP_PRJNA496042/trimmed/Control_3_Illumina_HiSeq2500_paired```

```./flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters /home/user/bioinformatics/BP_PRJNA496042/trimmed/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 24 --zip-output GZ --reads /home/user/bioinformatics/BP_PRJNA496042/Control_4_Illumina_HiSeq2500_paired.fastq.gz --target /home/user/bioinformatics/BP_PRJNA496042/trimmed/Control_4_Illumina_HiSeq2500_paired```

```./flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters /home/user/bioinformatics/BP_PRJNA496042/trimmed/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 24 --zip-output GZ --reads /home/user/bioinformatics/BP_PRJNA496042/PDX_1_Illumina_HiSeq2500_paired.fastq.gz --target /home/user/bioinformatics/BP_PRJNA496042/trimmed/PDX_1_Illumina_HiSeq2500_paired```

```./flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters /home/user/bioinformatics/BP_PRJNA496042/trimmed/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 24 --zip-output GZ --reads /home/user/bioinformatics/BP_PRJNA496042/PDX_2_Illumina_HiSeq2500_paired.fastq.gz --target /home/user/bioinformatics/BP_PRJNA496042/trimmed/PDX_2_Illumina_HiSeq2500_paired```

```./flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters /home/user/bioinformatics/BP_PRJNA496042/trimmed/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 24 --zip-output GZ --reads /home/user/bioinformatics/BP_PRJNA496042/PDX_3_Illumina_HiSeq2500_paired.fastq.gz --target /home/user/bioinformatics/BP_PRJNA496042/trimmed/PDX_3_Illumina_HiSeq2500_paired```

```./flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters /home/user/bioinformatics/BP_PRJNA496042/trimmed/illumina_multiplex.fa --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 --threads 24 --zip-output GZ --reads /home/user/bioinformatics/BP_PRJNA496042/PDX_4_Illumina_HiSeq2500_paired.fastq.gz --target /home/user/bioinformatics/BP_PRJNA496042/trimmed/PDX_4_Illumina_HiSeq2500_paired```
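
Since the eight commands differ only in the sample name, they can also be generated in a loop. A sketch using the same paths and options as above; the variable names ```PROJECT_DIR```, ```SAMPLES``` and ```RUN``` and the function name ```trim_all``` are hypothetical. By default the function only prints the commands so they can be inspected first.

```shell
# Hypothetical variable names; the paths are the ones used above.
PROJECT_DIR=/home/user/bioinformatics/BP_PRJNA496042
SAMPLES="Control_1 Control_2 Control_3 Control_4 PDX_1 PDX_2 PDX_3 PDX_4"

# Prints one flexbar command per sample (dry run).
# Call it as `RUN= trim_all` to execute the commands instead of printing them.
trim_all() {
    for sample in $SAMPLES; do
        ${RUN-echo} ./flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT \
            --adapters "$PROJECT_DIR/trimmed/illumina_multiplex.fa" \
            --pre-trim-left 13 --max-uncalled 300 --min-read-length 25 \
            --threads 24 --zip-output GZ \
            --reads "$PROJECT_DIR/${sample}_Illumina_HiSeq2500_paired.fastq.gz" \
            --target "$PROJECT_DIR/trimmed/${sample}_Illumina_HiSeq2500_paired"
    done
}
```

To execute for real, run it from the flexbar directory after exporting ```LD_LIBRARY_PATH``` as described above.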

Compare the impact of trimming with the original QC reports by running fastqc and multiqc on the trimmed reads

In the directory of the trimmed reads ```cd /home/user/bioinformatics/BP_PRJNA496042/trimmed```, run:

```fastqc *.fastq.gz```

```multiqc .```

Same as before, make a new directory for the fastqc .html outputs and move them

```mkdir fastqc && mv *_fastqc* fastqc```


***
# Alignment with HISAT2

For more details on HISAT2, refer to the [manual](http://daehwankimlab.github.io/hisat2/manual/).

HISAT2 writes one SAM file per sample. These are converted to BAM in the next step.

Options for HISAT2 include:


   - ```-p 24``` tells HISAT2 to use 24 threads for alignment.
   - ```--rna-strandness``` specifies strandness of RNAseq library. check_strandedness reported that the data is likely unstranded, which is the HISAT2 default, so this flag is omitted in this run.
   - ```--rg-id $ID``` specifies a read group ID that is a unique identifier.
   - ```--rg SM:$SAMPLE_NAME``` specifies a read group sample name. This together with rg-id will allow you to determine which reads came from which sample in the merged bam later on.
   - ```--rg LB:$LIBRARY_NAME``` specifies a read group library name. This together with rg-id will allow you to determine which reads came from which library in the merged bam later on.
   - ```--rg PL:ILLUMINA``` specifies a read group sequencing platform.
   - ```--rg PU:$PLATFORM_UNIT``` specifies a read group sequencing platform unit. Typically this consists of FLOWCELL-BARCODE.LANE
   - ```--dta``` Reports alignments tailored for transcript assemblers.
   - ```-x /path/to/hisat2/index``` The HISAT2 index filename prefix (minus the trailing .X.ht2) built earlier including splice sites and exons.
   - ```-1 /path/to/read1.fastq.gz``` The read 1 FASTQ file, optionally gzip(.gz) or bzip2(.bz2) compressed.
   - ```-2 /path/to/read2.fastq.gz``` The read 2 FASTQ file, optionally gzip(.gz) or bzip2(.bz2) compressed.
   - ```-S /path/to/output.sam``` The output SAM format text file of alignments.
   
Create an alignment folder for the outputs.

```cd /home/user/bioinformatics/BP_PRJNA496042 && mkdir alignments && cd alignments```

Run one command for each replicate. I am running these on the trimmed FASTQ files to see how it goes.

```hisat2 -p 24 --rg-id Control_1 --rg SM:Control --rg LB:Control_Lib --rg PL:ILLUMINA -x /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly --dta -U /home/user/bioinformatics/BP_PRJNA496042/trimmed/Control_1_Illumina_HiSeq2500_paired.fastq.gz -S /home/user/bioinformatics/BP_PRJNA496042/alignments/Control_1.sam```

```hisat2 -p 24 --rg-id Control_2 --rg SM:Control --rg LB:Control_Lib --rg PL:ILLUMINA -x /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly --dta -U /home/user/bioinformatics/BP_PRJNA496042/trimmed/Control_2_Illumina_HiSeq2500_paired.fastq.gz -S /home/user/bioinformatics/BP_PRJNA496042/alignments/Control_2.sam```

```hisat2 -p 24 --rg-id Control_3 --rg SM:Control --rg LB:Control_Lib --rg PL:ILLUMINA -x /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly --dta -U /home/user/bioinformatics/BP_PRJNA496042/trimmed/Control_3_Illumina_HiSeq2500_paired.fastq.gz -S /home/user/bioinformatics/BP_PRJNA496042/alignments/Control_3.sam```

```hisat2 -p 24 --rg-id Control_4 --rg SM:Control --rg LB:Control_Lib --rg PL:ILLUMINA -x /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly --dta -U /home/user/bioinformatics/BP_PRJNA496042/trimmed/Control_4_Illumina_HiSeq2500_paired.fastq.gz -S /home/user/bioinformatics/BP_PRJNA496042/alignments/Control_4.sam```

```hisat2 -p 24 --rg-id=PDX_1 --rg SM:PDX --rg LB:PDX_Lib --rg PL:ILLUMINA -x /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly --dta -U /home/user/bioinformatics/BP_PRJNA496042/trimmed/PDX_1_Illumina_HiSeq2500_paired.fastq.gz -S /home/user/bioinformatics/BP_PRJNA496042/alignments/PDX_1.sam```

```hisat2 -p 24 --rg-id=PDX_2 --rg SM:PDX --rg LB:PDX_Lib --rg PL:ILLUMINA -x /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly --dta -U /home/user/bioinformatics/BP_PRJNA496042/trimmed/PDX_2_Illumina_HiSeq2500_paired.fastq.gz -S /home/user/bioinformatics/BP_PRJNA496042/alignments/PDX_2.sam```

```hisat2 -p 24 --rg-id=PDX_3 --rg SM:PDX --rg LB:PDX_Lib --rg PL:ILLUMINA -x /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly --dta -U /home/user/bioinformatics/BP_PRJNA496042/trimmed/PDX_3_Illumina_HiSeq2500_paired.fastq.gz -S /home/user/bioinformatics/BP_PRJNA496042/alignments/PDX_3.sam```

```hisat2 -p 24 --rg-id=PDX_4 --rg SM:PDX --rg LB:PDX_Lib --rg PL:ILLUMINA -x /home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly --dta -U /home/user/bioinformatics/BP_PRJNA496042/trimmed/PDX_4_Illumina_HiSeq2500_paired.fastq.gz -S /home/user/bioinformatics/BP_PRJNA496042/alignments/PDX_4.sam```
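
The alignment commands follow one pattern, with the read group sample and library names derived from the part of the sample name before the underscore. A sketch of the same commands as a loop; ```PROJECT_DIR```, ```INDEX```, ```SAMPLES```, ```RUN``` and ```align_all``` are hypothetical names, while the paths and options are copied from the commands above. By default the function only prints the commands.

```shell
# Hypothetical variable names; the paths are the ones used above.
PROJECT_DIR=/home/user/bioinformatics/BP_PRJNA496042
INDEX=/home/user/bioinformatics/reference_genomes/GRCm38/Mus_musculus.GRCm38.dna.primary_assembly
SAMPLES="Control_1 Control_2 Control_3 Control_4 PDX_1 PDX_2 PDX_3 PDX_4"

# Prints one hisat2 command per sample (dry run).
# Call it as `RUN= align_all` to execute the commands instead of printing them.
align_all() {
    for sample in $SAMPLES; do
        group=${sample%_*}   # Control_1 -> Control, PDX_1 -> PDX
        ${RUN-echo} hisat2 -p 24 --rg-id "$sample" --rg "SM:$group" \
            --rg "LB:${group}_Lib" --rg PL:ILLUMINA -x "$INDEX" --dta \
            -U "$PROJECT_DIR/trimmed/${sample}_Illumina_HiSeq2500_paired.fastq.gz" \
            -S "$PROJECT_DIR/alignments/$sample.sam"
    done
}
```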


### Convert SAM to BAM files

```samtools sort``` sorts the alignments by coordinate and writes them as BAM in one step. ```-@ int``` is the number of threads to use.

```cd /home/user/bioinformatics/BP_PRJNA496042/alignments```

```samtools sort -@ 24 -o Control_1.bam Control_1.sam```

```samtools sort -@ 24 -o Control_2.bam Control_2.sam```

```samtools sort -@ 24 -o Control_3.bam Control_3.sam```

```samtools sort -@ 24 -o Control_4.bam Control_4.sam```

```samtools sort -@ 24 -o PDX_1.bam PDX_1.sam```

```samtools sort -@ 24 -o PDX_2.bam PDX_2.sam```

```samtools sort -@ 24 -o PDX_3.bam PDX_3.sam```

```samtools sort -@ 24 -o PDX_4.bam PDX_4.sam```
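
The same conversion can be written as a loop over the sample names. A sketch meant to be run from the alignments directory; ```SAMPLES```, ```RUN``` and ```sort_all``` are hypothetical names. By default the function only prints the commands.

```shell
SAMPLES="Control_1 Control_2 Control_3 Control_4 PDX_1 PDX_2 PDX_3 PDX_4"

# Prints one samtools sort command per sample (dry run).
# Call it as `RUN= sort_all` to execute the commands instead of printing them.
sort_all() {
    for sample in $SAMPLES; do
        ${RUN-echo} samtools sort -@ 24 -o "$sample.bam" "$sample.sam"
    done
}
```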


### Merge all BAM files
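Merge the replicate BAM files into one BAM file per group using Picard. Other tools such as "samtools merge" and "bamtools merge" can also achieve this. The commands below are carried over from the practice pipeline and use its groups (UHR and HBR) and paths. For this dataset the groups are Control and PDX.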

```java -Xmx2g -jar picard.jar MergeSamFiles -OUTPUT /home/user/bioinformatics/ref_genome/test_data/align_output/UHR.bam -INPUT /home/user/bioinformatics/ref_genome/test_data/align_output/UHR_Rep1.bam -INPUT /home/user/bioinformatics/ref_genome/test_data/align_output/UHR_Rep2.bam -INPUT /home/user/bioinformatics/ref_genome/test_data/align_output/UHR_Rep3.bam```

```java -Xmx2g -jar picard.jar MergeSamFiles -OUTPUT /home/user/bioinformatics/ref_genome/test_data/align_output/HBR.bam -INPUT /home/user/bioinformatics/ref_genome/test_data/align_output/HBR_Rep1.bam -INPUT /home/user/bioinformatics/ref_genome/test_data/align_output/HBR_Rep2.bam -INPUT /home/user/bioinformatics/ref_genome/test_data/align_output/HBR_Rep3.bam```
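
For this dataset the groups are Control and PDX, with four replicates each. A sketch of the equivalent merge, under two assumptions: the sorted BAM files from the previous step are in the alignments directory, and ```picard.jar``` sits in ```/home/user/bioinformatics```. ```ALIGN_DIR```, ```PICARD_JAR```, ```RUN``` and ```merge_groups``` are hypothetical names. By default the function only prints the commands.

```shell
# Assumed locations; adjust to where the files actually are.
ALIGN_DIR=/home/user/bioinformatics/BP_PRJNA496042/alignments
PICARD_JAR=/home/user/bioinformatics/picard.jar

# Prints one MergeSamFiles command per group (dry run).
# Call it as `RUN= merge_groups` to execute the commands instead of printing them.
merge_groups() {
    for group in Control PDX; do
        ${RUN-echo} java -Xmx2g -jar "$PICARD_JAR" MergeSamFiles \
            -OUTPUT "$ALIGN_DIR/$group.bam" \
            -INPUT "$ALIGN_DIR/${group}_1.bam" -INPUT "$ALIGN_DIR/${group}_2.bam" \
            -INPUT "$ALIGN_DIR/${group}_3.bam" -INPUT "$ALIGN_DIR/${group}_4.bam"
    done
}
```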

Verify that all BAM files were created successfully. There should be 8 replicate files, or 10 if the two merged group files were written to the same directory.

```ls -l *.bam | wc -l```