<a href="https://colab.research.google.com/github/shreyansegnyte/NASA-GeneLab-Code/blob/main/read%20alignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div>
<img src="https://www.nasa.gov/wp-content/uploads/2024/07/osdr-gl4hs-logo.png" width="600"/>
</div>

# **NOTEBOOK 4: Aligning sequencing reads to a reference genome**


In this notebook you will map (align) the sequence reads from the FASTQ files to the reference chromosome index (chr17) you built in the previous notebook.

## **Objectives of this notebook**
The primary objective of this notebook is to align the two trimmed FASTQ files to a reference genome index for chromosome 17. Once you perform the alignment, you will inspect it using a command called `samtools`. You can learn more about the `samtools` command in this [Wikipedia article](https://en.wikipedia.org/wiki/SAMtools).

## **UNIX commands introduced in this notebook**


# Prepare your runtime environment for this lab

In [None]:
# mount the google drive
from google.colab import drive
drive.flush_and_unmount()
drive.mount("/content/mnt")


In [None]:
# time the notebook
import datetime
start_time = datetime.datetime.now()
print('notebook start time: ', start_time.strftime('%Y-%m-%d %H:%M:%S'))

In [None]:
# import the os module which you'll use throughout the notebook
import os

In [None]:
# set FASTQ_DIR and verify it exists
FASTQ_DIR="/content/mnt/MyDrive/NASA/GL4HS/FASTQ"
if not os.path.exists(FASTQ_DIR):
  raise Exception("STOP! You have not finished the previous notebooks!")

In [None]:
# set REFERENCE_DIR and verify it exists
REFERENCE_DIR="/content/mnt/MyDrive/NASA/GL4HS/REFERENCE"
if not os.path.exists(REFERENCE_DIR):
  raise Exception("STOP! You have not finished the previous notebooks!")

In [None]:
# set STAR_DIR and verify it exists
STAR_DIR="/content/mnt/MyDrive/NASA/GL4HS/STAR"
if not os.path.exists(STAR_DIR):
  raise Exception("STOP! You have not finished the previous notebooks!")

In [None]:
# check for the trimmed fastq files
# (compressed, or already unzipped by an earlier run of this notebook)
for fq in ("reduced_r1_val_1.fq", "reduced_r2_val_2.fq"):
  fq_path = f"{FASTQ_DIR}/TRIM/PAIRED/{fq}"
  if not os.path.exists(fq_path) and not os.path.exists(fq_path + ".gz"):
    raise Exception(f"STOP: {fq} (or {fq}.gz) does not exist. You must first run the previous notebooks!")

In [None]:
# install samtools (-y answers "yes" to the install prompt so the cell cannot stall)
!sudo apt-get install -y samtools > /dev/null 2>&1

In [None]:
# check version of samtools
!samtools --version

# Align the reads to the reference

In [None]:
# unzip the reduced trimmed fq.gz files, because the STAR command below reads
# uncompressed FASTQ (it does not use STAR's --readFilesCommand option)
import os
if os.path.exists(f"{FASTQ_DIR}/TRIM/PAIRED/reduced_r1_val_1.fq.gz") and not os.path.exists(f"{FASTQ_DIR}/TRIM/PAIRED/reduced_r1_val_1.fq"):
  !gunzip -f {FASTQ_DIR}/TRIM/PAIRED/reduced_r1_val_1.fq.gz
if os.path.exists(f"{FASTQ_DIR}/TRIM/PAIRED/reduced_r2_val_2.fq.gz") and not os.path.exists(f"{FASTQ_DIR}/TRIM/PAIRED/reduced_r2_val_2.fq"):
  !gunzip -f {FASTQ_DIR}/TRIM/PAIRED/reduced_r2_val_2.fq.gz
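
Before aligning, it can help to confirm that both FASTQ files hold the same number of reads, since paired-end alignment expects R1 and R2 to line up read for read. The block below is a minimal sketch, not part of the original workflow: `count_fastq_reads` is a hypothetical helper name, and it assumes the standard four-lines-per-read FASTQ layout.

```python
import gzip


def count_fastq_reads(path):
    """Count the reads in a FASTQ file (plain text or gzip-compressed).

    Assumes the standard layout of 4 lines per read:
    header, sequence, '+' separator, quality string.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # ignore completely empty lines, e.g. a trailing blank line
        n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

In this notebook you could call it on `f"{FASTQ_DIR}/TRIM/PAIRED/reduced_r1_val_1.fq"` and `f"{FASTQ_DIR}/TRIM/PAIRED/reduced_r2_val_2.fq"` and compare the two counts.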

## Use the `STAR` command to perform the alignment

Read Sections 3.1 and 3.2 of the [STAR manual](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf) to learn about the STAR command options for running mapping (alignment) jobs.

For more detail on how the STAR aligner works, feel free to read [this tutorial](https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html).

In [None]:
# run STAR to align the reads to the reference
# this may take a LONG time (several hours)
# runtime depends on the REDUCTION_FACTOR setting from the first notebook
# runtime also depends on how much data you're aligning
# runtime also depends on how much memory is on the computer where STAR is running
import datetime
start = datetime.datetime.now()
print('starting time: ', start.strftime('%Y-%m-%d %H:%M:%S'))

!chmod +x {STAR_DIR}/STAR-2.7.11b/bin/Linux_x86_64_static/STAR

if os.path.exists(f"{STAR_DIR}/ALIGNMENT"):
  !rm -rf {STAR_DIR}/ALIGNMENT
!mkdir {STAR_DIR}/ALIGNMENT
!{STAR_DIR}/STAR-2.7.11b/bin/Linux_x86_64_static/STAR \
  --outFileNamePrefix {STAR_DIR}/ALIGNMENT/chr17 \
  --readFilesIn {FASTQ_DIR}/TRIM/PAIRED/reduced_r1_val_1.fq {FASTQ_DIR}/TRIM/PAIRED/reduced_r2_val_2.fq \
  --genomeDir {REFERENCE_DIR}/MM39_CHR17 \
  --runThreadN 2

end = datetime.datetime.now()
print('ending time: ', end.strftime('%Y-%m-%d %H:%M:%S'))


In [None]:
# check alignment output files
# there should be 5 files in all
# the alignment file itself is called chr17Aligned.out.sam
!ls -lh {STAR_DIR}/ALIGNMENT/chr17*
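
One of the output files, `chr17Log.final.out`, is a plain-text summary of the mapping run in which each statistic appears as `name | value`. The block below is a minimal sketch of reading such a summary into a dictionary. `parse_star_log` is a hypothetical helper name, and the numbers in `SAMPLE_LOG` are invented for illustration; they are not results from this dataset.

```python
def parse_star_log(text):
    """Turn 'name | value' lines from a STAR Log.final.out file into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip blank lines and section titles without a value
        key, _, value = line.partition("|")
        if value.strip():
            stats[key.strip()] = value.strip()
    return stats


# Invented example values, laid out like a few lines of Log.final.out
SAMPLE_LOG = (
    "                          Number of input reads |\t1000\n"
    "                                    UNIQUE READS:\n"
    "                   Uniquely mapped reads number |\t850\n"
    "                        Uniquely mapped reads % |\t85.00%\n"
)

stats = parse_star_log(SAMPLE_LOG)
unique_pct = float(stats["Uniquely mapped reads %"].rstrip("%"))
print("uniquely mapped:", unique_pct, "%")
```

To use it on your own run, read the file with `open(f"{STAR_DIR}/ALIGNMENT/chr17Log.final.out").read()` and pass the text to the function.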

## Use `samtools` view to examine the alignment

Read the [samtools view manpage](https://www.htslib.org/doc/samtools-view.html) to learn more about this command.

In [None]:
# look at the first 10 lines of the SAM file (header lines start with @)
# Question: At which position does the first read in this (unsorted) SAM file align?
!samtools view -h {STAR_DIR}/ALIGNMENT/chr17Aligned.out.sam | head -10
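
Each alignment line in a SAM file has 11 mandatory tab-separated fields; the reference name is the third field and the 1-based alignment position is the fourth. The block below is a minimal sketch of pulling those fields apart. `parse_sam_line` is a hypothetical helper name and the example record is made up, not taken from your alignment.

```python
from collections import namedtuple

# The 11 mandatory SAM fields, in order
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)


def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("@") or len(fields) < 11:
        raise ValueError("not a SAM alignment line")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)


# Made-up example record for illustration only
example = "read1\t99\t17\t3050000\t255\t10M\t=\t3050100\t110\tACGTACGTAC\tIIIIIIIIII"
record = parse_sam_line(example)
print(record.rname, record.pos, record.mapq)
```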

In [None]:
# sort alignment and save in BAM file
!samtools sort {STAR_DIR}/ALIGNMENT/chr17Aligned.out.sam -o {STAR_DIR}/ALIGNMENT/chr17Aligned.out.bam

In [None]:
# look at the first 10 lines of the sorted BAM file (header lines start with @)
# Question: At which position does the first read in this (sorted) BAM file align?
!samtools view -h {STAR_DIR}/ALIGNMENT/chr17Aligned.out.bam | head -10
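
One way to see what sorting changed is to look at the `@HD` header line, where the SAM format records the sort order in an optional `SO:` tag, and to check that positions never decrease from one record to the next. The block below is a minimal sketch under those assumptions. Both function names are hypothetical, and the position check assumes all records come from a single reference sequence, as is the case for this chr17-only alignment.

```python
def sort_order(header_lines):
    """Return the SO: value from the @HD header line, or 'unknown' if absent."""
    for line in header_lines:
        if line.startswith("@HD"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return "unknown"


def positions_are_sorted(positions):
    """True if the alignment positions (one reference only) never decrease."""
    return all(a <= b for a, b in zip(positions, positions[1:]))


print(sort_order(["@HD\tVN:1.4\tSO:coordinate"]))
print(positions_are_sorted([3050000, 3050000, 3050410]))
```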

In [None]:
# count all alignment records in BAM file (a read that aligns to several places is counted once per alignment)
!samtools view -c {STAR_DIR}/ALIGNMENT/chr17Aligned.out.bam

In [None]:
# count number of reads with MAPQ quality score 20 or higher in BAM file
!samtools view -q 20 -c {STAR_DIR}/ALIGNMENT/chr17Aligned.out.bam

In [None]:
# calculate percentage of reads with MAPQ >= 20 in BAM file
q20 = !samtools view -q 20 -c {STAR_DIR}/ALIGNMENT/chr17Aligned.out.bam
total = !samtools view -c {STAR_DIR}/ALIGNMENT/chr17Aligned.out.bam
100 * int(q20[0]) / int(total[0])

## Use `samtools` flagstat to get statistics on the alignment

Read the [samtools flagstat manpage](https://www.htslib.org/doc/samtools-flagstat.html) for more information.

In [None]:
# get general stats on alignment from BAM file
!samtools flagstat {STAR_DIR}/ALIGNMENT/chr17Aligned.out.bam

In the `flagstat` output, singletons are reads from paired-end sequencing that mapped while their mate (R1 or R2) did not. You can read more [here](https://www.seqanswers.com/forum/bioinformatics/bioinformatics-aa/41311-what-is-singletons).
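
The categories that `flagstat` reports come from the FLAG field of each alignment, which packs several yes/no properties into the bits of one integer. The block below is a minimal sketch of decoding a FLAG value and testing for a singleton. The bit values follow the SAM format; the short names and the two function names are chosen here for illustration.

```python
# SAM FLAG bits, with short illustrative names
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def is_singleton(flag):
    """Paired read that is mapped itself but whose mate is unmapped."""
    return bool(flag & 0x1) and bool(flag & 0x8) and not (flag & 0x4)


print(decode_flag(99))
print(is_singleton(73))
```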

## Use the `samtools mpileup` command to examine individual base mappings

You can read more about `samtools mpileup` in the Pileup Format section of the [samtools mpileup manpage](https://www.htslib.org/doc/samtools-mpileup.html) and in this [Wikipedia article](https://en.wikipedia.org/wiki/Pileup_format).

In [None]:
# run mpileup on the BAM file and show the first 30 lines of output
!samtools mpileup -f {REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa {STAR_DIR}/ALIGNMENT/chr17Aligned.out.bam | sed -n '1,30p'
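
Each pileup line describes one reference position: sequence name, 1-based position, reference base, read depth, read bases, and base qualities. In the read-bases column, `.` and `,` mean a match to the reference on the forward or reverse strand, a letter means a mismatch, `^` plus one character marks the start of a read, `$` marks the end of a read, and `+` or `-` followed by a length and a sequence marks an insertion or deletion. The block below is a minimal sketch of tallying that column. The function names are hypothetical and the example line is made up.

```python
import re


def parse_pileup_line(line):
    """Split one pileup line into its first six columns."""
    chrom, pos, ref, depth, bases, quals = line.rstrip("\n").split("\t")[:6]
    return {"chrom": chrom, "pos": int(pos), "ref": ref,
            "depth": int(depth), "bases": bases, "quals": quals}


def summarize_bases(bases):
    """Count matches, mismatches, indels, deletion placeholders and reference skips."""
    counts = {"match": 0, "mismatch": 0, "indel": 0, "deletion": 0, "ref_skip": 0}
    # drop read-start markers (^ plus one mapping-quality character), then read-end markers
    cleaned = re.sub(r"\^.", "", bases).replace("$", "")
    i = 0
    while i < len(cleaned):
        ch = cleaned[i]
        if ch in "+-":
            # indel: sign, decimal length, then that many sequence characters
            j = i + 1
            while j < len(cleaned) and cleaned[j].isdigit():
                j += 1
            i = j + int(cleaned[i + 1:j])
            counts["indel"] += 1
            continue
        if ch in ".,":
            counts["match"] += 1
        elif ch in "*#":
            counts["deletion"] += 1   # base deleted in this read
        elif ch in "<>":
            counts["ref_skip"] += 1   # reference skip, e.g. a spliced intron
        else:
            counts["mismatch"] += 1
        i += 1
    return counts


# Made-up example line for illustration only
row = parse_pileup_line("17\t3000\tG\t3\t.,A\tIII")
print(row["pos"], summarize_bases(row["bases"]))
```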

# Check your work before moving on

In [None]:
# check disk space utilization (should be about 1.2GB)
!du -sh /content/mnt/MyDrive/NASA/GL4HS

In [None]:
# time the notebook
import datetime
end_time = datetime.datetime.now()
print('notebook end time: ', end_time.strftime('%Y-%m-%d %H:%M:%S'))

total_time = end_time - start_time
print('notebook total runtime: ', total_time)