<a href="https://colab.research.google.com/github/shreyansegnyte/NASA-GeneLab-Code/blob/main/preparingRawReads.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div>
<img src="https://www.nasa.gov/wp-content/uploads/2024/07/osdr-gl4hs-logo.png" width="600"/>
</div>

# **NOTEBOOK 2: Preparing RNA-seq reads**

In this notebook, you will prepare the RNA-seq reads you downloaded in the previous notebook for subsequent processing. First, you will calculate and compare the quality of the two FASTQ files (R1 and R2). Next, you will trim adapter sequences and low-quality bases from both files and discard reads that become too short. Last, you will recalculate the quality of one of the paired-end FASTQ files and compare it with its quality before trimming.

## **Objectives of this notebook**
The primary objective of this notebook is to ensure that the quality of the RNA-seq reads for our sample is sufficient. To measure the quality of the FASTQ data, you will use the FastQC tool developed by Simon Andrews. You can learn more about FastQC in this [tutorial](https://rtsf.natsci.msu.edu/genomics/technical-documents/fastqc-tutorial-and-faq.aspx).
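The quality values that FastQC summarizes are stored in the fourth line of every FASTQ record, one character per base. The following is a minimal sketch of how such a line can be decoded, assuming the common Phred+33 encoding. The function names and the example quality string are made up for illustration and are not part of FastQC.

```python
# Decode a FASTQ quality line, assuming the Phred+33 encoding
# (each character's ASCII code minus 33 is the quality score Q).

def phred_scores(quality_line, offset=33):
    """Return the Phred quality score of every base in a FASTQ quality line."""
    return [ord(ch) - offset for ch in quality_line]

def error_probability(q):
    """Convert a Phred score Q into the probability that the base call is wrong."""
    return 10 ** (-q / 10)

example_quality = "IIII5555##"  # made-up quality line, not taken from the real data
example_scores = phred_scores(example_quality)
print(example_scores)
print([round(error_probability(q), 4) for q in example_scores])
```

Under this encoding, a score of 20 corresponds to a 1 in 100 chance of a wrong base call and a score of 40 to a 1 in 10,000 chance.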

You will install and use a tool called `trim_galore` to remove adapter sequences and low-quality bases from the reads in your FASTQ files and to discard reads that end up too short. The `trim_galore` command is itself a wrapper around both the FastQC tool and the `cutadapt` tool. You can learn more about `trim_galore` in this [user guide](https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md).
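To give a feel for what quality trimming does, here is a simplified sketch of the 3' end quality-trimming procedure described in the `cutadapt` documentation: subtract the cutoff from every quality score, sum the results from the end of the read backwards, and cut where that running sum is lowest. This is an illustration only, not the code `cutadapt` or `trim_galore` actually runs, and the function names and example read are hypothetical.

```python
def quality_trim_index(qualities, cutoff):
    """Return the position at which to cut the 3' end of a read.

    qualities: list of Phred scores, one per base
    cutoff:    quality threshold (the value given with -q)
    """
    running_sum = 0
    lowest_sum = 0
    stop = len(qualities)          # default: keep the whole read
    for i in reversed(range(len(qualities))):
        running_sum += qualities[i] - cutoff
        if running_sum > 0:        # good bases now outweigh the bad tail
            break
        if running_sum < lowest_sum:
            lowest_sum = running_sum
            stop = i               # best cut position found so far
    return stop

def quality_trim(sequence, qualities, cutoff):
    """Return the sequence with its low-quality 3' end removed."""
    return sequence[:quality_trim_index(qualities, cutoff)]

# made-up read for illustration
example_seq = "ACGTACGTAC"
example_quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
trimmed = quality_trim(example_seq, example_quals, cutoff=10)
print(trimmed)
```

Note that the base with quality 11 is still removed even though it is above the cutoff, because it sits inside a stretch of otherwise poor bases.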

Note that there are a few ways you can display the images generated in this notebook. One way is to look at them inside the notebook. If they are too small to read, you can click on an image to enlarge it. Another way is to navigate inside your Google Drive folder, download the HTML/PDF/image files to your laptop, and look at them from your laptop.

## **UNIX commands introduced in this notebook**

[`apt-get`](https://manpages.ubuntu.com/manpages/lunar/man8/apt-get.8.html) command to install operating system packages from the Internet.

[`pip`](https://pip.pypa.io/en/stable/cli/pip_install/) command to install Python packages from the Internet.

[`tar`](https://man7.org/linux/man-pages/man1/tar.1.html) command to extract an archive of files.

[`wget`](https://linux.die.net/man/1/wget) command to download files from the Internet.

[`chmod`](https://man7.org/linux/man-pages/man1/chmod.1p.html) command to change permissions of a file.

[`wkhtmltopdf`](https://wkhtmltopdf.org/usage/wkhtmltopdf.txt) command to convert HTML to PDF.



# Prepare notebook environment

In [None]:
# mount google drive
from google.colab import drive
drive.flush_and_unmount()
# use an absolute mount point so it matches the /content/mnt paths used below
drive.mount("/content/mnt")


In [None]:
# time the notebook
import datetime
start_time = datetime.datetime.now()
print('notebook start time: ', start_time.strftime('%Y-%m-%d %H:%M:%S'))

In [None]:
# make sure FASTQ_DIR directory exists on google drive
import os
FASTQ_DIR="/content/mnt/MyDrive/NASA/GL4HS/FASTQ"
if not os.path.exists(FASTQ_DIR):
  raise Exception('STOP! Make sure you run the previous notebook before running this one')


In [None]:
# make sure reduced_r1.fastq.gz and reduced_r2.fastq.gz files are in place
if not os.path.exists(f"{FASTQ_DIR}/reduced_r1.fastq.gz") or not os.path.exists(f"{FASTQ_DIR}/reduced_r2.fastq.gz"):
  raise Exception('STOP! Make sure you run the previous notebook before running this one')

In [None]:
# create the TRIM directory for this notebook
if os.path.exists(f"{FASTQ_DIR}/TRIM"):
  !rm -rf {FASTQ_DIR}/TRIM
!mkdir {FASTQ_DIR}/TRIM

# Run FastQC to check the quality of the FASTQ files before trimming

In [None]:
# install FastQC
if os.path.exists(f"{FASTQ_DIR}/FastQC"):
  print('FastQC already installed - removing now to reinstall')
  !rm -rf {FASTQ_DIR}/FastQC
!mkdir {FASTQ_DIR}/FastQC
!wget -O {FASTQ_DIR}/fastqc.zip https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip
!unzip {FASTQ_DIR}/fastqc.zip -d {FASTQ_DIR} > /dev/null
!chmod +x {FASTQ_DIR}/FastQC/fastqc

In [None]:
# check the version of fastqc
!{FASTQ_DIR}/FastQC/fastqc --version

In [None]:
# remove the fastqc.zip file as we don't need it anymore
!rm -f {FASTQ_DIR}/fastqc.zip

In [None]:
# make directories for FASTQC output
if os.path.exists(f"{FASTQ_DIR}/FASTQC_R1_OUT"):
  !rm -rf {FASTQ_DIR}/FASTQC_R1_OUT
!mkdir {FASTQ_DIR}/FASTQC_R1_OUT
if os.path.exists(f"{FASTQ_DIR}/FASTQC_R2_OUT"):
  !rm -rf {FASTQ_DIR}/FASTQC_R2_OUT
!mkdir {FASTQ_DIR}/FASTQC_R2_OUT

In [None]:
# run fastqc on compressed reduced R1 fastq file
!{FASTQ_DIR}/FastQC/fastqc  {FASTQ_DIR}/reduced_r1.fastq.gz -t 2 -o {FASTQ_DIR}/FASTQC_R1_OUT
# check for error message

In [None]:
# list fastqc output directory
# you should see the reduced R1 fastqc.html and fastqc.zip files that got generated in the previous step
!ls -lh {FASTQ_DIR}/FASTQC_R1_OUT

In [None]:
# run fastqc on compressed reduced R2 fastq file
!{FASTQ_DIR}/FastQC/fastqc  {FASTQ_DIR}/reduced_r2.fastq.gz -t 2 -o {FASTQ_DIR}/FASTQC_R2_OUT

In [None]:
# list fastqc output directory
# you should see the reduced R2 fastqc.html and fastqc.zip files that got generated in the previous step
!ls -lh {FASTQ_DIR}/FASTQC_R2_OUT

In [None]:
# install packages to convert HTML to PDF for rendering in google colab
!sudo DEBIAN_FRONTEND=noninteractive apt-get update -y > /dev/null
!sudo DEBIAN_FRONTEND=noninteractive apt-get install -y wkhtmltopdf > /dev/null
!sudo DEBIAN_FRONTEND=noninteractive apt-get install -y poppler-utils > /dev/null
!pip install pdf2image > /dev/null


In [None]:
# run the wkhtmltopdf command to convert the FastQC HTML files into PDF files
!wkhtmltopdf {FASTQ_DIR}/FASTQC_R1_OUT/reduced_r1_fastqc.html {FASTQ_DIR}/FASTQC_R1_OUT/out.pdf
!wkhtmltopdf {FASTQ_DIR}/FASTQC_R2_OUT/reduced_r2_fastqc.html {FASTQ_DIR}/FASTQC_R2_OUT/out.pdf

In [None]:
# convert the PDF to images (there should be 6 images created from each PDF)
from pdf2image import convert_from_path
import os
images_r1 = convert_from_path(f"{FASTQ_DIR}/FASTQC_R1_OUT/out.pdf")
print('len r1 = ', str(len(images_r1)))
images_r2 = convert_from_path(f"{FASTQ_DIR}/FASTQC_R2_OUT/out.pdf")
print('len r2 = ', str(len(images_r2)))

In [None]:
# define a function to display one page of two reports side by side

def display_page(report_1, report_2, page_num):
  import matplotlib.pyplot as plt

  # one row, two columns: report_1 on the left, report_2 on the right
  fig, axes = plt.subplots(1, 2, figsize=(30,30))
  axes[0].imshow(report_1[page_num])
  axes[0].axis('off')
  axes[1].imshow(report_2[page_num])
  axes[1].axis('off')

In [None]:
# display the first page of the report
display_page(images_r1, images_r2, 0)


Note the total number of sequences, total number of bases, number of sequences flagged as poor quality, sequence length, and percentage GC content for R1.
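Several of these figures can also be recomputed directly from a FASTQ file as a cross-check. The following is a minimal sketch, assuming every record occupies exactly four lines. The function name `fastq_basic_stats` and the tiny demo file are made up for illustration and are not part of FastQC.

```python
import gzip
import os
import tempfile

def fastq_basic_stats(path):
    """Count reads, bases, read-length range and %GC in a gzipped FASTQ file.

    Assumes each record is exactly four lines: header, sequence, '+', quality.
    """
    n_reads = n_bases = gc_bases = 0
    min_len = max_len = None
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 1:   # only the 2nd line of a record is sequence
                continue
            seq = line.strip().upper()
            n_reads += 1
            n_bases += len(seq)
            gc_bases += seq.count("G") + seq.count("C")
            min_len = len(seq) if min_len is None else min(min_len, len(seq))
            max_len = len(seq) if max_len is None else max(max_len, len(seq))
    percent_gc = 100 * gc_bases / n_bases if n_bases else 0.0
    return {
        "reads": n_reads,
        "bases": n_bases,
        "min_length": min_len,
        "max_length": max_len,
        "percent_gc": percent_gc,
    }

# write a tiny made-up FASTQ file to a temporary directory and summarize it
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fastq.gz")
with gzip.open(demo_path, "wt") as out:
    out.write("@read1\nACGT\n+\nIIII\n@read2\nGGCCAA\n+\nIIIIII\n")
demo_stats = fastq_basic_stats(demo_path)
print(demo_stats)
```

On the real data you could call `fastq_basic_stats(f"{FASTQ_DIR}/reduced_r1.fastq.gz")` and compare the result with the FastQC report. Reading the whole file in pure Python may take a while, and FastQC rounds its percentages, so small differences are expected.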

In [None]:
# display the second page of the report
display_page(images_r1, images_r2, 1)

In [None]:
# display the third page of the report
display_page(images_r1, images_r2, 2)

In [None]:
# display the fourth page of the report
display_page(images_r1, images_r2, 3)

In [None]:
# display the fifth page of the report
display_page(images_r1, images_r2, 4)


# Use `trim_galore` to trim the reads

In [None]:
# Install Trim Galore
import os
if os.path.exists(f"{FASTQ_DIR}/TRIM/TrimGalore-0.6.10/trim_galore"):
  print('trim_galore already installed - removing now to reinstall')
  !rm -rf {FASTQ_DIR}/TRIM
# -p avoids an error if the TRIM directory was already created earlier in this notebook
!mkdir -p {FASTQ_DIR}/TRIM
# download the Trim Galore source archive
!wget -O {FASTQ_DIR}/TRIM/trim_galore.tar.gz https://github.com/FelixKrueger/TrimGalore/archive/0.6.10.tar.gz
!tar xzf {FASTQ_DIR}/TRIM/trim_galore.tar.gz -C {FASTQ_DIR}/TRIM
# make the trim_galore command executable
!chmod +x {FASTQ_DIR}/TRIM/TrimGalore-0.6.10/trim_galore
# remove the compressed tar file
!rm -f {FASTQ_DIR}/TRIM/trim_galore.tar.gz

In [None]:
# check version of trim_galore
!{FASTQ_DIR}/TRIM/TrimGalore-0.6.10/trim_galore -v

In [None]:
# install cutadapt
!pip install cutadapt

In [None]:
# find path to cutadapt executable
!which cutadapt

In [None]:
# run trim_galore on R1 and R2
# set --path_to_cutadapt to the path printed by `which cutadapt` in the previous code cell
if os.path.exists(f"{FASTQ_DIR}/TRIM/PAIRED"):
  !rm -rf {FASTQ_DIR}/TRIM/PAIRED
!mkdir -p {FASTQ_DIR}/TRIM/PAIRED

!{FASTQ_DIR}/TRIM/TrimGalore-0.6.10/trim_galore \
  --path_to_cutadapt /usr/local/bin/cutadapt \
  --paired \
  -o {FASTQ_DIR}/TRIM/PAIRED \
  -q 20 \
  -j 2 \
  {FASTQ_DIR}/reduced_r1.fastq.gz \
  {FASTQ_DIR}/reduced_r2.fastq.gz


Note the following:
1. Which adapter sequence was auto-detected and most prevalent?
2. What is the minimum required sequence length for both reads before a sequence pair gets removed?
3. Approximately what percentage of sequence pairs were removed because at least one read was shorter than the length cutoff?
4. Which base was the most prevalent preceding removed adapters?
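The paired-end length filter behind questions 2 and 3 can be pictured with a small sketch. It assumes a pair is kept only when both reads are at least `min_length` bases long after trimming. The function name, the example reads, and the cutoff of 5 are hypothetical, so use the trimming output above to find the cutoff that was actually applied.

```python
def filter_pairs(pairs, min_length):
    """Keep only read pairs in which both reads have at least min_length bases.

    pairs: list of (read_1_sequence, read_2_sequence) tuples
    Returns the kept pairs and the number of pairs removed.
    """
    kept = [
        (r1, r2)
        for r1, r2 in pairs
        if len(r1) >= min_length and len(r2) >= min_length
    ]
    return kept, len(pairs) - len(kept)

# made-up trimmed read pairs for illustration
demo_pairs = [
    ("ACGTACGT", "TTGGCCAA"),  # both long enough: kept
    ("ACG", "TTGGCCAA"),       # read 1 too short: whole pair removed
    ("ACGTACGT", "TT"),        # read 2 too short: whole pair removed
    ("ACGTA", "GGCCA"),        # exactly at the cutoff: kept
]
kept_pairs, removed_count = filter_pairs(demo_pairs, min_length=5)
percent_removed = 100 * removed_count / len(demo_pairs)
print(len(kept_pairs), removed_count, percent_removed)
```

Removing the whole pair, rather than only the short read, keeps the R1 and R2 files in step, so that the same line position in both files still refers to the same fragment.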

In [None]:
# validate the trimmed output files got created
# should be about 2.1GB each
!ls -lh {FASTQ_DIR}/TRIM/PAIRED


In [None]:
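# examine the trimming reports (the same information is printed by the trim_galore cell above)
# trim_galore writes the reports to the output directory, which is TRIM/PAIRED in this notebook
# uncomment the lines below to view them
#!cat {FASTQ_DIR}/TRIM/PAIRED/reduced_r1.fastq.gz_trimming_report.txt
#!cat {FASTQ_DIR}/TRIM/PAIRED/reduced_r2.fastq.gz_trimming_report.txt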

# Run FastQC to check the quality of the FASTQ file after trimming

In this section, the images we display on the left will be the pre-trimming images for R1 only and the images on the right will be the corresponding post-trimming images for R1 only. We will not compare pre- and post-trimming reports for R2.
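Besides comparing the images by eye, the PASS/WARN/FAIL status of each FastQC module can be compared as text. Each FastQC `.zip` output contains a tab-separated `summary.txt` file with one line per module. The following is a minimal sketch of that comparison, in which the function names and the demo summaries are made up for illustration.

```python
import zipfile

def read_summary_from_zip(zip_path):
    """Return the text of summary.txt stored inside a FastQC .zip output."""
    with zipfile.ZipFile(zip_path) as archive:
        name = next(n for n in archive.namelist() if n.endswith("/summary.txt"))
        return archive.read(name).decode()

def parse_fastqc_summary(text):
    """Map each module name to its status (PASS, WARN or FAIL)."""
    statuses = {}
    for line in text.splitlines():
        parts = line.split("\t")   # columns: status, module name, file name
        if len(parts) >= 2:
            statuses[parts[1]] = parts[0]
    return statuses

def changed_modules(before, after):
    """Return the modules whose status differs, as {module: (before, after)}."""
    return {
        module: (before[module], after[module])
        for module in before
        if module in after and before[module] != after[module]
    }

# made-up summaries for illustration, not results from the real data
demo_before = "PASS\tBasic Statistics\tdemo.fastq.gz\nFAIL\tAdapter Content\tdemo.fastq.gz\n"
demo_after = "PASS\tBasic Statistics\tdemo_val_1.fq.gz\nPASS\tAdapter Content\tdemo_val_1.fq.gz\n"
demo_changes = changed_modules(
    parse_fastqc_summary(demo_before), parse_fastqc_summary(demo_after)
)
print(demo_changes)
```

In this notebook, the pre-trimming archive should be `{FASTQ_DIR}/FASTQC_R1_OUT/reduced_r1_fastqc.zip`, and the post-trimming archive should appear at `{FASTQ_DIR}/TRIM/R1/FASTQC_OUT/reduced_r1_val_1_fastqc.zip` once the cells in this section have run. Pass each path to `read_summary_from_zip` to compare the real reports.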

In [None]:
# repeat the FastQC run on the trimmed R1 file
# the cells below compare the pre- and post-trimming reports side by side
if os.path.exists(f"{FASTQ_DIR}/TRIM/R1/FASTQC_OUT"):
  !rm -rf {FASTQ_DIR}/TRIM/R1/FASTQC_OUT
!mkdir -p {FASTQ_DIR}/TRIM/R1/FASTQC_OUT
!{FASTQ_DIR}/FastQC/fastqc  {FASTQ_DIR}/TRIM/PAIRED/reduced_r1_val_1.fq.gz -o {FASTQ_DIR}/TRIM/R1/FASTQC_OUT

In [None]:
# convert HTML to pdf
!wkhtmltopdf {FASTQ_DIR}/TRIM/R1/FASTQC_OUT/reduced_r1_val_1_fastqc.html {FASTQ_DIR}/TRIM/R1/FASTQC_OUT/out.pdf

In [None]:
# convert PDF to pillow image
from pdf2image import convert_from_path
images_r1_trim = convert_from_path(f"{FASTQ_DIR}/TRIM/R1/FASTQC_OUT/out.pdf")
print('len = ', str(len(images_r1_trim)))

In [None]:
# compare first page of FastQC output for pre- and post-trim R1 data
display_page(images_r1, images_r1_trim, 0)

In [None]:
# compare second page of FastQC output for pre- and post-trim R1 data
display_page(images_r1, images_r1_trim, 1)

In [None]:
# compare third page of FastQC output for pre- and post-trim R1 data
display_page(images_r1, images_r1_trim, 2)

In [None]:
# compare fourth page of FastQC output for pre- and post-trim R1 data
display_page(images_r1, images_r1_trim, 3)

In [None]:
# compare fifth page of FastQC output for pre- and post-trim R1 data
display_page(images_r1, images_r1_trim, 4)

In [None]:
# compare sixth page of FastQC output for pre- and post-trim R1 data
display_page(images_r1, images_r1_trim, 5)

# Check your work before moving on

In [None]:
# check size of all GL4HS drive usage
# should be about 260MB
!du -sh /content/mnt/MyDrive/NASA/GL4HS

In [None]:
# time the notebook
import datetime
end_time = datetime.datetime.now()
print('notebook end time: ', end_time.strftime('%Y-%m-%d %H:%M:%S'))

total_notebook_time = end_time - start_time
print('total notebook time: ', total_notebook_time)