<a href="https://colab.research.google.com/github/shreyansegnyte/NASA-GeneLab-Code/blob/main/5_quantifying_gene_expression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div>
<img src="https://www.nasa.gov/wp-content/uploads/2024/07/osdr-gl4hs-logo.png" width="600"/>
</div>

# **NOTEBOOK 5: Quantifying gene expression**
In this notebook, you will assign each gene a read count: a number that serves as a measure of how many RNA transcripts of that gene your sample expressed in the tissue at a point in time.

## **Objectives of this notebook**
The primary objective of this notebook is to quantify gene expression from one sample's reduced chromosome 17 alignment. You will then compare the gene expression counts for your sample's chromosome 17 to those obtained by the GeneLab processing team. Because your alignment is a reduced one, expect your counts to be smaller than GeneLab's by a factor of roughly `REDUCTION_FACTOR`.

## **UNIX commands introduced in this notebook**

`grep` command to search for lines in files that have a matching pattern.

`htseq-count` command to quantify gene expression.
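
As a rough illustration of what `grep` does, the sketch below filters a list of lines with a regular expression in Python. The function name `grep_lines` and the example lines are hypothetical and exist only for this illustration; the real `grep` reads files or standard input and has many more options.

```python
import re

def grep_lines(lines, pattern):
    """Return the lines that contain a match for the regular expression `pattern`."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# hypothetical annotation-like lines, for illustration only
example_lines = [
    "chr1\tHAVANA\tgene",
    "chr17\tHAVANA\tgene",
    "chr17\tENSEMBL\texon",
    "chrX\tHAVANA\tgene",
]

# similar in spirit to the shell command: grep ^chr17 <file>
chr17_lines = grep_lines(example_lines, r"^chr17")
print(len(chr17_lines), "of", len(example_lines), "lines start with chr17")
```

The `^` anchors the pattern to the start of the line, so a line that merely mentions `chr17` somewhere in the middle would not match.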



# Prepare your environment for this lab

In [None]:
# mount your google drive
from google.colab import drive
drive.flush_and_unmount()
drive.mount("mnt")


In [None]:
# time the notebook
import datetime
start_time=datetime.datetime.now()
print('notebook start time: ', start_time.strftime('%Y-%m-%d %H:%M:%S'))

In [None]:
# set env variables for OSD dataset to use in this lab
OSD_DATASET='104'
GLDS_DATASET='104'

In [None]:
# set FASTQ_DIR directory location in google drive
import os
FASTQ_DIR="/content/mnt/MyDrive/NASA/GL4HS/FASTQ"
if not os.path.exists(FASTQ_DIR):
  raise Exception("STOP! You must finish the previous notebooks before running this one")

In [None]:
# read the sample name saved by a previous notebook
import os
with open(f"{FASTQ_DIR}/SAMPLE_NAME.txt", "r") as f:
  OSD_SAMPLE=f.read().strip()
if not OSD_SAMPLE:
  raise Exception("STOP! You must finish the previous notebooks before running this one")
print(OSD_SAMPLE)

In [None]:
# read env var for reduction factor from first notebook
import os
with open(f"{FASTQ_DIR}/REDUCTION_FACTOR.txt", "r") as f:
  REDUCTION_FACTOR=f.read().strip()
if not REDUCTION_FACTOR:
  raise Exception("STOP! You must finish the previous notebooks before running this one")
print(REDUCTION_FACTOR)

In [None]:
# set REFERENCE_DIR directory location in google drive
import os
REFERENCE_DIR="/content/mnt/MyDrive/NASA/GL4HS/REFERENCE"
if not os.path.exists(REFERENCE_DIR):
  raise Exception("STOP! You must finish the previous notebooks before running this one")

In [None]:
# set ALIGNMENT_DIR directory location in google drive
import os
ALIGNMENT_DIR="/content/mnt/MyDrive/NASA/GL4HS/STAR/ALIGNMENT"
if not os.path.exists(ALIGNMENT_DIR):
  raise Exception("STOP! You must finish the previous notebooks before running this one")

In [None]:
# set COUNTS_DIR directory location in google drive
import os
COUNTS_DIR="/content/mnt/MyDrive/NASA/GL4HS/COUNTS"
if os.path.exists(COUNTS_DIR):
  !rm -rf {COUNTS_DIR}

!mkdir -p {COUNTS_DIR}

In [None]:
# install htseq
!pip install HTSeq

In [None]:
# determine the version of htseq installed
!htseq-count --version

In [None]:
# download the GENCODE M36 gene annotation for GRCm39
import os
# start from a clean REFERENCE_DIR
if os.path.exists(REFERENCE_DIR):
  !rm -rf {REFERENCE_DIR}
!mkdir -p {REFERENCE_DIR}
!wget -O {REFERENCE_DIR}/gencode.vM36.primary_assembly.basic.annotation.gtf.gz \
  https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M36/gencode.vM36.primary_assembly.basic.annotation.gtf.gz

In [None]:
# decompress the annotation, then keep only the chr17 lines
# rename "chr17" to "17" because htseq-count needs the chromosome names in the GTF to match those in the alignment
# gzip the filtered file
!gunzip -c {REFERENCE_DIR}/gencode.vM36.primary_assembly.basic.annotation.gtf.gz > {REFERENCE_DIR}/gencode.vM36.primary_assembly.basic.annotation.gtf
!grep ^chr17 {REFERENCE_DIR}/gencode.vM36.primary_assembly.basic.annotation.gtf | sed 's/^chr17/17/' > {REFERENCE_DIR}/chr17.gtf
!gzip -c {REFERENCE_DIR}/chr17.gtf > {REFERENCE_DIR}/chr17.gtf.gz


Read the [GTF documentation](https://www.gencodegenes.org/mouse/) and the [GTF Wikipedia page](https://en.wikipedia.org/wiki/Gene_transfer_format) for more information about basic mouse gene annotation.
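
To make the GTF layout concrete, here is a minimal sketch that splits one GTF line into its nine tab-separated columns and turns the attribute column into a dictionary. The helper `parse_gtf_line` and the example line (including its gene ID and gene name) are hypothetical; the parsing is simplified and keeps only the last value when a key such as `tag` appears more than once.

```python
# the nine standard GTF columns
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]

def parse_gtf_line(line):
    """Parse one GTF line into a dict; the attribute column becomes a nested dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    # attributes look like: key "value"; key "value"; ...
    for item in record["attribute"].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip('"')
    record["attribute"] = attributes
    return record

# hypothetical GTF line, for illustration only
example_line = ('17\tHAVANA\tgene\t100\t200\t.\t+\t.\t'
                'gene_id "ENSMUSG00000000000.1"; gene_type "lncRNA"; gene_name "Example1";')

record = parse_gtf_line(example_line)
print(record["feature"], record["start"], record["end"], record["attribute"]["gene_type"])
```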

In [None]:
# check the first 10 lines of the GTF annotation file
!head -10 {REFERENCE_DIR}/chr17.gtf

Question: What is lncRNA? Feel free to read more about that in this [Wikipedia article](https://en.wikipedia.org/wiki/Long_non-coding_RNA).
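
If you are curious how common each gene type is in an annotation, the sketch below tallies the `gene_type` attribute of `gene` features. The function `count_gene_types` and the example lines are hypothetical, and the sketch assumes GENCODE-style attributes of the form `gene_type "..."`.

```python
import re
from collections import Counter

# assumes GENCODE-style attributes such as: gene_type "lncRNA";
GENE_TYPE_RE = re.compile(r'gene_type "([^"]+)"')

def count_gene_types(gtf_lines):
    """Count the gene_type values of all 'gene' features in an iterable of GTF lines."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only count each gene once, not its transcripts or exons
        match = GENE_TYPE_RE.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts

# hypothetical GTF lines, for illustration only
example_gtf = [
    '##description: example',
    '17\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_type "lncRNA";',
    '17\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; gene_type "lncRNA";',
    '17\tHAVANA\tgene\t300\t900\t.\t-\t.\tgene_id "G2"; gene_type "protein_coding";',
    '17\tHAVANA\tgene\t950\t999\t.\t+\t.\tgene_id "G3"; gene_type "lncRNA";',
]

gene_type_counts = count_gene_types(example_gtf)
print(gene_type_counts.most_common())
```

To try it on the annotation prepared above, you could pass an open file handle for `chr17.gtf` in `REFERENCE_DIR` instead of the example list.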

# Use HTSeq to quantify gene expression

Read the [htseq-count manual](https://htseq.readthedocs.io/en/master/htseqcount.html) for more information.
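
By default, `htseq-count` uses the `union` counting mode described in the manual: a read is counted for a gene only if it overlaps exactly one gene, and is otherwise reported as `__no_feature` or `__ambiguous`. The toy sketch below illustrates that decision rule only. It is not HTSeq's implementation, and it ignores strand, mapping quality, paired-end mates, and multi-mapped reads. The function `assign_read_union` and the gene coordinates are hypothetical.

```python
def assign_read_union(read_start, read_end, genes):
    """Assign one read to a gene using a simplified 'union' rule.

    genes maps gene_id -> list of (start, end) exon intervals, 1-based inclusive.
    """
    overlapping = set()
    for gene_id, exons in genes.items():
        for exon_start, exon_end in exons:
            # two closed intervals overlap if each starts before the other ends
            if read_start <= exon_end and exon_start <= read_end:
                overlapping.add(gene_id)
                break
    if not overlapping:
        return "__no_feature"
    if len(overlapping) > 1:
        return "__ambiguous"
    return next(iter(overlapping))

# hypothetical genes with overlapping exons, for illustration only
toy_genes = {
    "geneA": [(100, 200)],
    "geneB": [(150, 300)],
}

for start, end in [(110, 140), (160, 180), (400, 450)]:
    print((start, end), "->", assign_read_union(start, end, toy_genes))
```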

In [None]:
# run htseq to quantify the gene expression

!htseq-count -n 2 \
  --format bam \
  --order pos \
  --stranded reverse \
  {ALIGNMENT_DIR}/chr17Aligned.out.bam \
  {REFERENCE_DIR}/chr17.gtf.gz \
  > {COUNTS_DIR}/chr17-counts.tsv

Note that you may get a warning about "mate records missing" from `htseq-count`. You can ignore this warning -- it's a known bug in the `htseq-count` software. Read [this GitHub issue](https://github.com/simon-anders/htseq/issues/37) for more information if you're curious.

In [None]:
# look at the first 10 lines of the counts file
!head -10 {COUNTS_DIR}/chr17-counts.tsv

In [None]:
# read count data from file into dataframe
import pandas as pd
counts_df=pd.read_csv(f"{COUNTS_DIR}/chr17-counts.tsv", sep="\t", header=None)
counts_df.head()

In [None]:
# remove any rows from the counts_df that do not begin with 'ENSMUSG'
print('length before filter: ', len(counts_df))
counts_df=counts_df[counts_df[0].str.startswith('ENSMUSG')]
print('length after filter: ', len(counts_df))
counts_df[0]
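
The filter above keeps only rows whose first column starts with `ENSMUSG`. The rows it drops are the summary counters that `htseq-count` appends to its output, whose names begin with two underscores (for example `__no_feature` and `__ambiguous`). The sketch below, using hypothetical counts rather than your real file, separates the two kinds of rows so you can see what fraction of reads was assigned to genes.

```python
# hypothetical htseq-count output lines, for illustration only
example_lines = [
    "ENSMUSG00000000001.1\t12",
    "ENSMUSG00000000002.3\t0",
    "__no_feature\t40",
    "__ambiguous\t3",
    "__too_low_aQual\t0",
    "__not_aligned\t0",
    "__alignment_not_unique\t7",
]

gene_counts = {}     # gene ID -> read count
summary_counts = {}  # special counter name -> read count
for line in example_lines:
    name, value = line.split("\t")
    # summary counters are the rows whose name starts with "__"
    target = summary_counts if name.startswith("__") else gene_counts
    target[name] = int(value)

assigned = sum(gene_counts.values())
total = assigned + sum(summary_counts.values())
print(f"reads assigned to genes: {assigned} of {total}")
```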

# Compare your count data to the GeneLab-processed count data for the same sample

In [None]:
# open another tab in your web browser and navigate to the following site:
url=!echo https://visualization.osdr.nasa.gov/biodata/api/v2/dataset/OSD-{OSD_DATASET}/files/\?format=browser
print(url[0])

In [None]:
# download genelab-processed data for OSD dataset
import pandas as pd
url = f"https://osdr.nasa.gov/geode-py/ws/studies/OSD-{OSD_DATASET}/download?source=datamanager&file=GLDS-{GLDS_DATASET}_rna_seq_Unnormalized_Counts.csv"
osd_df = pd.read_csv(url)
osd_df.head()

In [None]:
# get the first 20 genes from the OSD_SAMPLE counts data to compare with the counts for the entire OSD_DATASET
# remove the version suffix (everything after the ".") from each Ensembl gene ID
sample_genes = list(counts_df[0].values)[:20]
sample_genes = [gene.split(".")[0] for gene in sample_genes]
sample_genes

In [None]:
# look up the GeneLab counts for the same 20 genes, in the same order as sample_genes
# (genes missing from the GeneLab table show up as NaN)
osd_counts = osd_df.drop_duplicates('Unnamed: 0').set_index('Unnamed: 0')[OSD_SAMPLE]
gene_counts_from_osd = list(osd_counts.reindex(sample_genes).values)

In [None]:
# capture the first 20 lines of the counts_df dataframe
gene_counts_from_you = list(counts_df[1].values)[:20]

In [None]:
# compare the counts side by side: gene ID, GeneLab count, your count
# (your count should be roughly 1/REDUCTION_FACTOR of the GeneLab count)
for gene_count in zip(sample_genes, gene_counts_from_osd, gene_counts_from_you):
  print(gene_count[0], '\t', gene_count[1], '\t', gene_count[2])


In [None]:
# look at the first 500 gene counts in both
# determine if fraction of abundance is approximately 1/REDUCTION_FACTOR
import numpy as np
count_fractions = list()
for gene in counts_df[0].values[:500]:
  _gene = gene.split(".")[0]
  if _gene in osd_df['Unnamed: 0'].values:
    genelab_val = osd_df[osd_df['Unnamed: 0'] == _gene][OSD_SAMPLE].values[0]
    your_val = int(counts_df[counts_df[0] == gene][1].values[0])
    if genelab_val != 0:
      frac = your_val/genelab_val
      count_fractions.append(frac)

print(np.mean(count_fractions))
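
The mean of the ratios can be pulled around by a few genes with very small counts, so it can help to look at the median as well and to compare both with the expected value of `1/REDUCTION_FACTOR`. The sketch below uses hypothetical counts and an assumed reduction factor of 10; the function `count_ratios` is an illustration, not part of the notebook. Note that the notebook reads `REDUCTION_FACTOR` from a text file as a string, so it would need converting with `float()` before being used in arithmetic.

```python
from statistics import mean, median

def count_ratios(your_counts, genelab_counts):
    """Return {gene_id: your_count / genelab_count} for genes present in both tables."""
    ratios = {}
    for gene_id, your_val in your_counts.items():
        base_id = gene_id.split(".")[0]  # drop the Ensembl version suffix
        genelab_val = genelab_counts.get(base_id)
        if genelab_val:  # skips genes that are missing or have a zero GeneLab count
            ratios[base_id] = your_val / genelab_val
    return ratios

# hypothetical counts, for illustration only
your_counts = {
    "ENSMUSG00000000001.1": 10,
    "ENSMUSG00000000002.3": 52,
    "ENSMUSG00000000003.2": 0,
    "ENSMUSG00000000004.1": 7,
}
genelab_counts = {
    "ENSMUSG00000000001": 100,
    "ENSMUSG00000000002": 500,
    "ENSMUSG00000000003": 0,
}
reduction_factor = 10  # assumed value; in the notebook use float(REDUCTION_FACTOR)

ratios = count_ratios(your_counts, genelab_counts)
expected = 1 / reduction_factor
print("genes compared:", len(ratios))
print("mean ratio:", mean(ratios.values()))
print("median ratio:", median(ratios.values()))
print("expected ratio:", expected)
```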

# Check your work before moving on

In [None]:
# check disk space utilization in google drive (should be about 2.4GB)
!du -sh /content/mnt/MyDrive/NASA/GL4HS

In [None]:
# time the notebook
import datetime
end_time=datetime.datetime.now()
print('notebook end time: ', end_time.strftime('%Y-%m-%d %H:%M:%S'))

print('notebook runtime: ', end_time - start_time)