<a href="https://colab.research.google.com/github/shreyansegnyte/NASA-GeneLab-Code/blob/main/refgenome.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div>
<img src="https://www.nasa.gov/wp-content/uploads/2024/07/osdr-gl4hs-logo.png" width="600"/>
</div>

# **NOTEBOOK 3: Building a reference genome index**


In this notebook you will use STAR to build an index for a single reference chromosome, mouse chromosome 17 (chr17, GRCm39), for hands-on experience and demonstration purposes.

## **Objectives of this notebook**
The primary objective of this notebook is to build a reference genome index for chromosome 17. Like any index, a reference genome index makes searching faster, which in turn speeds up the mapping/alignment step. In subsequent notebooks, you will use this index to align the sequence records from the FASTQ files to the reference genome. You can learn more about reference genomes in this [Wikipedia article](https://en.wikipedia.org/wiki/Reference_genome).
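
To see why an index speeds up searching, here is a toy sketch in plain Python. It is not how STAR works internally (STAR builds a suffix array of the genome). The function names `build_kmer_index` and `find_with_index` and the short sequences are made up for illustration. The idea is that the index is built once, after which each read is matched by a dictionary lookup plus a short verification instead of a scan of the whole reference.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def find_with_index(index, reference, read, k):
    """Look up the read's first k-mer, then verify the full read at each candidate."""
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits

# toy reference and read (made-up sequences, not real genome data)
toy_reference = "ACGTTGCAACGTAGCTAGCTTGCAACGG"
toy_index = build_kmer_index(toy_reference, k=4)
toy_hits = find_with_index(toy_index, toy_reference, "TGCAACG", k=4)
print(toy_hits)
```

Building the index costs time and memory up front, which is why it is done once in this notebook and reused in later ones.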

## **UNIX commands introduced in this notebook**

[`tail`](https://man7.org/linux/man-pages/man1/tail.1.html) command to see the last n lines of a file.

[`mkdir`](https://man7.org/linux/man-pages/man1/mkdir.1.html) command to make a directory.

# Prepare runtime environment

In [None]:
# mount google drive for notebook
from google.colab import drive
drive.flush_and_unmount()
drive.mount("mnt")


In [None]:
# time the notebook
import datetime
start_time = datetime.datetime.now()
print('notebook start time: ', start_time.strftime('%Y-%m-%d %H:%M:%S'))

In [None]:
# define FASTQ_DIR
import os
FASTQ_DIR="/content/mnt/MyDrive/NASA/GL4HS/FASTQ"
if not os.path.exists(FASTQ_DIR):
  raise Exception("STOP! You haven't completed the previous notebooks yet")

In [None]:
# create the REFERENCE directory for this lab (deletes any existing copy first)
import os
REFERENCE_DIR='/content/mnt/MyDrive/NASA/GL4HS/REFERENCE'
if os.path.exists(REFERENCE_DIR):
  !rm -rf {REFERENCE_DIR}
!mkdir {REFERENCE_DIR}


In [None]:
# create the STAR directory for this lab (deletes any existing copy first)
import os
STAR_DIR='/content/mnt/MyDrive/NASA/GL4HS/STAR'
if os.path.exists(STAR_DIR):
  !rm -rf {STAR_DIR}
!mkdir {STAR_DIR}

In [None]:
# download and install STAR
STAR_BIN = f"{STAR_DIR}/STAR-2.7.11b/bin/Linux_x86_64_static/STAR"
if not os.path.exists(STAR_BIN):
  !wget -O {STAR_DIR}/STAR.tar.gz https://github.com/alexdobin/STAR/archive/2.7.11b.tar.gz
  !tar -xzf {STAR_DIR}/STAR.tar.gz -C {STAR_DIR}
  # remove the compressed tar file once it has been extracted
  !rm {STAR_DIR}/STAR.tar.gz

# make sure the STAR binary is executable
!chmod +x {STAR_BIN}

In [None]:
# check version of STAR
!{STAR_DIR}/STAR-2.7.11b/bin/Linux_x86_64_static/STAR --version

In [None]:
# download GRCm39 reference for chromosome 17
import os
if not os.path.exists(f"{REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa.gz"):
  !wget -O {REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa.gz https://ftp.ensembl.org/pub/release-113/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.chromosome.17.fa.gz
  !gunzip -c {REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa.gz > {REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa


In [None]:
# look at the first 10 lines of the reference fasta file
!head {REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa

In [None]:
# look at 10 lines in the middle of the reference fasta file (lines 100000 to 100009)
!sed -n '100000,100009 p' {REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa

In [None]:
# look at the last 10 lines of the reference fasta file
!tail {REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa

Read [this discussion thread](https://www.reddit.com/r/genetics/comments/rz47pq/why_there_is_a_lot_of_ns_at_the_begining_of_the/) to see why there may be many 'N' characters at the beginning and end of a reference genome FASTA file.
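
If you want to measure how many 'N' characters sit at each end of the sequence, a minimal sketch like the one below could be used. The helper names `count_flanking_n` and `read_fasta_sequence` are hypothetical, the toy sequence is made up, and the reader assumes a FASTA file that holds a single record, as the chr17 file does.

```python
def count_flanking_n(sequence):
    """Return (leading, trailing) counts of 'N' bases in a sequence string."""
    seq = sequence.upper()
    leading = len(seq) - len(seq.lstrip("N"))
    trailing = len(seq) - len(seq.rstrip("N"))
    return leading, trailing

def read_fasta_sequence(path):
    """Read a single-record FASTA file and return its sequence as one string."""
    with open(path) as handle:
        # skip the '>' header line and join the remaining sequence lines
        return "".join(line.strip() for line in handle if not line.startswith(">"))

# toy example (made-up sequence, not real genome data)
toy_counts = count_flanking_n("NNNNACGTNACGTNN")
print(toy_counts)
```

In this notebook the same helpers could be applied to the real file with `count_flanking_n(read_fasta_sequence(f"{REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa"))`, which loads the whole chromosome into memory.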

# Run STAR to build reference chromosome index

Read [this discussion thread](https://www.biostars.org/p/251736/) and search the [documentation](https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf) to learn more about the `genomeChrBinNbits` option of the STAR command.
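
The STAR command in this notebook also sets `--genomeSAindexNbases 12`. The STAR manual advises scaling this parameter down for small genomes to min(14, log2(GenomeLength)/2 - 1). The sketch below applies that rule; `suggested_sa_index_nbases` is a hypothetical helper name, and the length of 95 million bases is a rounded approximation of chr17 used only for illustration.

```python
import math

def suggested_sa_index_nbases(genome_length):
    """Apply the STAR manual's small-genome rule: min(14, log2(GenomeLength)/2 - 1)."""
    return min(14, int(math.log2(genome_length) / 2 - 1))

# assumption: rounded, approximate length of mouse chr17 in bases
approx_chr17_length = 95_000_000
chr17_nbases = suggested_sa_index_nbases(approx_chr17_length)
print(chr17_nbases)
```

For a single chromosome of this size the rule gives 12, which matches the value passed to STAR in this notebook. A genome of about 3 billion bases would hit the cap of 14.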

In [None]:
# run STAR to create index of GRCm39 chr 17 reference
if os.path.exists(REFERENCE_DIR + '/MM39_CHR17'):
  !rm -rf {REFERENCE_DIR}/MM39_CHR17
!mkdir -p {REFERENCE_DIR}/MM39_CHR17
!{STAR_DIR}/STAR-2.7.11b/bin/Linux_x86_64_static/STAR \
        --runThreadN 2 \
        --runMode genomeGenerate \
        --genomeDir {REFERENCE_DIR}/MM39_CHR17 \
        --genomeFastaFiles {REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa  \
        --genomeSAindexNbases 12 \
        --genomeChrBinNbits 5

In [None]:
# list the index files that STAR created
!ls -lh {REFERENCE_DIR}/MM39_CHR17
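
Beyond reading the listing by eye, a short check like the following could confirm that the index looks complete. This is a sketch under assumptions: `EXPECTED_INDEX_FILES` lists files that STAR typically writes during `genomeGenerate`, it is not taken from this notebook's output, and `missing_index_files` is a hypothetical helper. The demonstration runs on a throwaway directory rather than the real index.

```python
import os
import tempfile

# assumption: files STAR typically writes when generating a genome index
EXPECTED_INDEX_FILES = [
    "Genome", "SA", "SAindex",
    "chrName.txt", "chrLength.txt", "chrStart.txt",
    "chrNameLength.txt", "genomeParameters.txt",
]

def missing_index_files(index_dir, expected=EXPECTED_INDEX_FILES):
    """Return the expected file names that are not present in index_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(index_dir, name))]

# demonstration on a temporary directory holding only two of the files
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["Genome", "SA"]:
        open(os.path.join(demo_dir, name), "w").close()
    demo_missing = missing_index_files(demo_dir)
print(demo_missing)
```

In this notebook the call would be `missing_index_files(REFERENCE_DIR + '/MM39_CHR17')`. An empty list suggests the index build finished and wrote every expected file.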

In [None]:
# check size of google drive usage for the reference (should be about 1.1G)
!du -sh {REFERENCE_DIR}/MM39_CHR17

# Check your work before moving on

In [None]:
# check size of all GL4HS drive usage (should be about 1.4G)
!du -sh /content/mnt/MyDrive/NASA/GL4HS

In [None]:
# time the notebook
import datetime
end_time = datetime.datetime.now()
print('notebook end time: ', end_time.strftime('%Y-%m-%d %H:%M:%S'))

total_time = end_time - start_time
print('notebook total runtime: ', total_time)