# CompLabNGS2024 Final Project

## Getting the data
Visit the [Gene Expression Omnibus (GEO) website](https://www.ncbi.nlm.nih.gov/geo/) and search for the accession number you were given. Review the information on the dataset and explore the publication that generated the data.

Write a short but comprehensive description of the dataset you were given. Include the name of the dataset, the publication, the authors, the number and condition of the samples, and any other relevant information that would allow others to understand the dataset.

After completing the entire analysis, create a conda environment with all the packages (and their versions) needed to run the analysis. In the cell below, write the command you used to create the environment and the command to activate it. The environment should be named `<first name>_<last name>`. You can run `conda list` to list all the packages in the environment and their versions.

Modify the example command below to include the packages you used in your analysis.

In [None]:
#module load miniconda/miniconda3-4.7.12-environmentally
!conda create -n ngs_project \
    python=3.7 \
    pandas=1.0.3 \
    numpy=1.18.1 \
    matplotlib=3.1.3 \
    scikit-learn=0.22.1  \
    pydeseq2=1.26.0  \
    fastqc=0.12.1  \
    trimmomatic=0.39  \
    fastuniq=1.1
!conda activate ngs_project

## Getting the raw fastq files
At the bottom of the GEO page, you will find a link to the `SRA Run Selector`. Click on the link. 

Explain, in general terms, what the SRA is, what an SRR accession is, and how they are used to store sequencing data.


Use the `prefetch` and `fasterq-dump` commands from the `SRA Toolkit` to download the FASTQ files. Write the commands you used to download the files. [SRA Tools Conda installation](https://anaconda.org/bioconda/sra-tools)

In [7]:
import os
import subprocess
from multiprocessing import Pool

BASE_DIR = '/home/yair/ngs_project'
srr_accessions = ['SRR25115639', 'SRR25115640', 'SRR25115641', 'SRR25115642', 'SRR25115643', 'SRR25115644', 
                  'SRR25115645', 'SRR25115646', 'SRR25115647']

INPUT_FASTQ_FILES = os.path.join(BASE_DIR, 'input_fastq')
TRIMMOMATIC_OUTPUT_DIR = os.path.join(BASE_DIR, 'trimmomatic_outputs')
FASTUNIQ_INPUTS_FILES_DIR = os.path.join(BASE_DIR, 'fastuniq_inputs')
FASTUNIQ_OUTPUT_DIR = os.path.join(BASE_DIR, 'fastuniq_outputs')

#generate folders
os.makedirs(INPUT_FASTQ_FILES, exist_ok=True)
os.makedirs(TRIMMOMATIC_OUTPUT_DIR, exist_ok=True)
os.makedirs(FASTUNIQ_INPUTS_FILES_DIR, exist_ok=True)
os.makedirs(FASTUNIQ_OUTPUT_DIR, exist_ok=True)

In [None]:
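def download_fastq(srr_accession):
    # Run both tools inside the input directory so the prefetched .sra data and
    # the resulting FASTQ files end up there; check=True surfaces failed downloads.
    subprocess.run(['prefetch', srr_accession], cwd=INPUT_FASTQ_FILES, check=True)
    subprocess.run(['fasterq-dump', srr_accession], cwd=INPUT_FASTQ_FILES, check=True)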

with Pool() as pool:
    pool.map(download_fastq, srr_accessions)

os.chdir(BASE_DIR)
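
Before moving on to QC, it can help to confirm that every accession produced both mate files. The sketch below assumes the `<SRR>_1.fastq` / `<SRR>_2.fastq` naming that the later cells rely on; `missing_fastq_files` is a hypothetical helper name, not part of the SRA Toolkit.

```python
import os

def missing_fastq_files(srr_accessions, fastq_dir):
    """Return the expected paired-end FASTQ paths that are absent from fastq_dir."""
    missing = []
    for srr in srr_accessions:
        for mate in (1, 2):
            # Assumed naming convention: <accession>_<mate>.fastq
            path = os.path.join(fastq_dir, f'{srr}_{mate}.fastq')
            if not os.path.isfile(path):
                missing.append(path)
    return missing
```

Calling `missing_fastq_files(srr_accessions, INPUT_FASTQ_FILES)` should return an empty list when all downloads succeeded.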

## Quality Control
Below, include all the commands, figures, and output you used and generated to perform quality control on the FASTQ files. You can use `fastqc`, `trimmomatic`, `multiqc`, and any other tool you find necessary to provide comprehensive QC and filtering of the data.

Write a brief description of the quality control process and explain the figures and tables you generated.

In [None]:
# First we run FastQC on 3 of the input fastq files (1 for each of the three groups).
srr_acessions_representatives = ['SRR25115639', 'SRR25115642', 'SRR25115645']

FASTQC_OUTPUTS_DIR = os.path.join(BASE_DIR, 'fastqc_outputs')
os.makedirs(FASTQC_OUTPUTS_DIR, exist_ok=True)

def run_fastqc(srr_accession):
    srr_1_path = os.path.join(INPUT_FASTQ_FILES, f'{srr_accession}_1.fastq')
    srr_2_path = os.path.join(INPUT_FASTQ_FILES, f'{srr_accession}_2.fastq')
    subprocess.run(f'fastqc {srr_1_path} {srr_2_path} -o {FASTQC_OUTPUTS_DIR}', shell=True)

with Pool() as pool:
    pool.map(run_fastqc, srr_acessions_representatives)

We examined several of the FastQC reports and noticed two major issues:
* Per base sequence content tab: the base composition of the first 10 positions differs from that of the rest of the read. This is unexpected, so we will use Trimmomatic's `HEADCROP` step to remove the first 10 bases of each read.
* Sequence Duplication Levels tab: this tab indicated that the data contain duplicated reads, which we will remove with FastUniq.

![fastqc_per_base_sequence_content.jpg](fastqc_output/fastqc_per_base_sequence_content.jpg)
![fastqc_duplication_levels.jpg](fastqc_output/fastqc_duplication_levels.jpg)
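
To make the first observation concrete, the fraction of each base at every read position can be computed directly from a FASTQ file. This is only a rough sketch of the idea behind the "Per base sequence content" plot, not FastQC's implementation; it assumes plain four-line FASTQ records, and `per_base_composition` and `max_reads` are hypothetical names.

```python
from collections import Counter
from itertools import islice

def per_base_composition(fastq_path, max_reads=100_000):
    """Return one {base: fraction} dict per read position."""
    counts = []
    with open(fastq_path) as handle:
        # Each FASTQ record spans 4 lines; the sequence is the 2nd line.
        sequences = islice(handle, 1, None, 4)
        for seq in islice(sequences, max_reads):
            seq = seq.strip()
            # Grow the per-position counters to fit the longest read seen.
            while len(counts) < len(seq):
                counts.append(Counter())
            for pos, base in enumerate(seq):
                counts[pos][base] += 1
    return [{b: n / sum(c.values()) for b, n in c.items()} for c in counts]
```

If the first 10 entries of the returned list differ markedly from the later ones, that matches the bias seen in the report.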

In [None]:
# Now we will run trimmomatic and fastuniq

def run_trimmomatic(srr_accession):
    srr_1_path = os.path.join(INPUT_FASTQ_FILES, f'{srr_accession}_1.fastq')
    srr_2_path = os.path.join(INPUT_FASTQ_FILES, f'{srr_accession}_2.fastq')
    trimmomatic_output_path = os.path.join(TRIMMOMATIC_OUTPUT_DIR, f'{srr_accession}.fastq')
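    # Options such as -baseout are placed before the input files; -baseout derives the
    # four output names (_1P, _1U, _2P, _2U) from the given path.
    subprocess.run(f'trimmomatic PE -baseout {trimmomatic_output_path} {srr_1_path} {srr_2_path} HEADCROP:10 MINLEN:90', shell=True)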

with Pool() as pool:
    pool.map(run_trimmomatic, srr_accessions)
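
The effect of the two Trimmomatic steps used here can be illustrated in plain Python. This is a simplified sketch, not Trimmomatic's code: `headcrop_minlen` and `pair_survives` are hypothetical helpers, and each read is represented as a `(sequence, quality)` tuple.

```python
def headcrop_minlen(sequence, quality, headcrop=10, minlen=90):
    """Drop the first `headcrop` bases, then discard reads shorter than `minlen`."""
    trimmed_seq = sequence[headcrop:]
    trimmed_qual = quality[headcrop:]
    if len(trimmed_seq) < minlen:
        return None  # read is dropped
    return trimmed_seq, trimmed_qual

def pair_survives(read1, read2, headcrop=10, minlen=90):
    """A pair stays in the paired (_1P/_2P) outputs only if both mates survive."""
    return (headcrop_minlen(*read1, headcrop, minlen) is not None
            and headcrop_minlen(*read2, headcrop, minlen) is not None)
```

With `HEADCROP:10 MINLEN:90`, a read therefore needs at least 100 bases before trimming to be kept.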

In [8]:
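def run_fastuniq(srr_accession):
    # FastUniq takes a text file listing the two paired FASTQ files, one per line.
    fastuniq_input_path = os.path.join(FASTUNIQ_INPUTS_FILES_DIR, f'{srr_accession}.txt')
    with open(fastuniq_input_path, 'w') as fp:
        fp.write(os.path.join(TRIMMOMATIC_OUTPUT_DIR, f'{srr_accession}_1P.fastq') + '\n')
        fp.write(os.path.join(TRIMMOMATIC_OUTPUT_DIR, f'{srr_accession}_2P.fastq'))

    fastuniq_1_output = os.path.join(FASTUNIQ_OUTPUT_DIR, f'{srr_accession}_1.fastq')
    fastuniq_2_output = os.path.join(FASTUNIQ_OUTPUT_DIR, f'{srr_accession}_2.fastq')
    subprocess.run(f'fastuniq -i {fastuniq_input_path} -o {fastuniq_1_output} -p {fastuniq_2_output}', shell=True, check=True)

# Running all samples at once with Pool() ended with "Killed", most likely because
# FastUniq holds the read pairs in memory and the parallel jobs exhausted the RAM.
# Processing the samples one at a time limits peak memory to a single sample.
for srr_accession in srr_accessions:
    run_fastuniq(srr_accession)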
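
Conceptually, removing duplicates from paired-end data means keeping one copy of each distinct pair of sequences. The sketch below shows that idea only; FastUniq's actual algorithm differs, and `drop_duplicate_pairs` is a hypothetical name. It also hints at why memory can become a problem: every distinct pair has to be remembered.

```python
def drop_duplicate_pairs(pairs):
    """Keep the first occurrence of each (read1_sequence, read2_sequence) pair."""
    seen = set()
    unique = []
    for read1, read2 in pairs:
        key = (read1, read2)  # both mates must match to count as a duplicate
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```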


In [None]:
import os
import subprocess
from multiprocessing import Pool

# Finally, we run FastQC again on the cleaned FASTQ files of the representative samples.

CLEAN_READS_FASTQC_OUTPUTS_DIR = os.path.join(BASE_DIR, 'clean_reads_fastqc_outputs')
os.makedirs(CLEAN_READS_FASTQC_OUTPUTS_DIR, exist_ok=True)

def run_fastqc_on_clean_reads(srr_accession):
    srr_1_path = os.path.join(FASTUNIQ_OUTPUT_DIR, f'{srr_accession}_1.fastq')
    srr_2_path = os.path.join(FASTUNIQ_OUTPUT_DIR, f'{srr_accession}_2.fastq')
    subprocess.run(f'fastqc {srr_1_path} {srr_2_path} -o {CLEAN_READS_FASTQC_OUTPUTS_DIR}', shell=True)

with Pool() as pool:
    pool.map(run_fastqc_on_clean_reads, srr_acessions_representatives)

## Alignment and DGE

Similar to what we did in class and in the assignments, you will now run through the steps of mapping and conducting DGE. Include below a detailed report of your analysis: the study design, example count and normalized count matrices, summary statistics, and any other information you think is relevant for reporting your findings (if any).

Because running `STAR` on a laptop or PC is almost impossible, we will use one of the more recent pseudoaligners, [`kallisto`](https://pachterlab.github.io/kallisto/manual), which will allow us to quantify transcript-level abundance more efficiently.

1. Download the index files from kallisto's GitHub
```bash
wget https://github.com/pachterlab/kallisto-transcriptome-indices/releases/download/v1/mouse_index_standard.tar.xz
tar -xf mouse_index_standard.tar.xz
```
2. Install and run `kallisto`
```bash
conda install -c bioconda -c conda-forge kallisto==0.50.1
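# Paired-end reads: pass both mate files (a single file would also need --single, -l and -s)
kallisto quant -i index.idx -o output -t 16 sample_1.fastq sample_2.fastq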
```
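
To quantify every sample rather than a single one, the command from step 2 can be generated per accession. The helper below is a sketch under a few assumptions: reads are paired-end and named `<SRR>_1.fastq` / `<SRR>_2.fastq`, each sample gets its own output folder, and the names `kallisto_quant_command` and `output_root` are hypothetical.

```python
import os

def kallisto_quant_command(srr_accession, fastq_dir, index_path, output_root, threads=4):
    """Build the `kallisto quant` argument list for one paired-end sample."""
    return [
        'kallisto', 'quant',
        '-i', index_path,
        # One output directory per sample so abundance.tsv files do not overwrite each other.
        '-o', os.path.join(output_root, srr_accession),
        '-t', str(threads),
        os.path.join(fastq_dir, f'{srr_accession}_1.fastq'),
        os.path.join(fastq_dir, f'{srr_accession}_2.fastq'),
    ]
```

Each list can then be passed to `subprocess.run(..., check=True)` in a loop over the accessions.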

In the `output` directory you should have `abundance.tsv`. View the file and make sure you understand it. Quantify all of your samples in the same way.
We now need to sum the transcript-level counts into gene-level counts before proceeding to `PyDESeq2`.


```python
import pandas as pd

# Load transcript to gene mapping file
t2g_file = 't2g.txt'
t2g_df = pd.read_csv(t2g_file, sep='\t', usecols=[0, 2], header=None, names=['transcript_id', 'gene_id'])

abundance_file = 'output/abundance.tsv'
abundance_df = pd.read_csv(abundance_file, sep='\t')

# Merge transcript abundance with gene mapping
merged_df = pd.merge(abundance_df, t2g_df, left_on='target_id', right_on='transcript_id', how='inner')

# Sum estimated counts at the gene level (DESeq2-style tools expect counts, not TPM)
gene_abundance_df = merged_df.groupby('gene_id')['est_counts'].sum().reset_index()

# Save gene-level abundance to a new file
print(gene_abundance_df)
gene_abundance_df.to_csv('gene_abundance.csv', index=False)
```

Do you understand the code? Add `print` statements and rerun it until you fully understand what is going on.

Repeat this operation for each sample's abundance file. Then, using Python or a different tool, merge all samples into a single file.
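
One possible way to do the merge with only the standard library is sketched below. It assumes each per-sample CSV has a `gene_id` column plus a single value column, as written by the code above; the function names are hypothetical, and genes missing from a sample are filled with 0.

```python
import csv

def read_gene_table(path):
    """Read a per-sample CSV with a `gene_id` column and one value column."""
    with open(path, newline='') as handle:
        reader = csv.DictReader(handle)
        value_column = [c for c in reader.fieldnames if c != 'gene_id'][0]
        return {row['gene_id']: float(row[value_column]) for row in reader}

def merge_gene_tables(sample_to_values):
    """Combine {sample: {gene_id: value}} into a header and one row per gene."""
    samples = sorted(sample_to_values)
    genes = sorted({g for values in sample_to_values.values() for g in values})
    header = ['gene_id'] + samples
    rows = [[g] + [sample_to_values[s].get(g, 0) for s in samples] for g in genes]
    return header, rows

def write_merged_csv(path, header, rows):
    """Write the merged table, e.g. to all_samples_gene_abundance.csv."""
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
```

The result has genes as rows and samples as columns; depending on how the downstream tool expects its input, the table may need to be transposed after loading.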

You should now be able to take the resulting `all_samples_gene_abundance.csv` and use it in `PyDESeq2`, as we did in class.

## **BONUS** Gene Set Enrichment Analysis

Up to this point we have covered material learned in class. If you want to boost your score, try learning and running GSEA on your DGE results.

Explain what GSEA is and what insights we can draw from it.
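
As a starting point for the explanation, the running-sum statistic at the core of GSEA can be sketched in a few lines. This is a simplified illustration, not GSEApy's implementation: it omits the permutation-based significance testing and score normalization, and `enrichment_score` and its parameters are hypothetical names.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score of one gene set over a ranked gene list.

    ranked_genes: gene names sorted from most up- to most down-regulated.
    scores: ranking metric for each gene, in the same order.
    """
    hits = [g in gene_set for g in ranked_genes]
    hit_weights = [abs(s) ** weight if h else 0.0 for s, h in zip(scores, hits)]
    total_hit_weight = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(hits)
    if total_hit_weight == 0 or n_miss == 0:
        raise ValueError('gene set must cover some, but not all, of the ranked genes')
    running, best = 0.0, 0.0
    for h, w in zip(hits, hit_weights):
        # Step up at genes in the set (weighted by their score), step down otherwise.
        running += w / total_hit_weight if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best
```

A score near +1 means the set's genes cluster at the top of the ranking, and a score near -1 means they cluster at the bottom.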

Run GSEA using [GSEApy](https://gseapy.readthedocs.io/en/latest/introduction.html) and summarize your results below.

## Notes

This exercise is meant to throw you into the water and might prove challenging. Try working in groups to overcome challenges.

If you get stuck, try posting an issue, but give as much background and explanation of your problem as possible so that I can help.

### Good luck!