# Project 2: M. tuberculosis Genome Assembly
## 01 - Quality Control (QC) and Trimming

* **Author:** Youssef Mimoune
* **Date:** 25-Oct-2025
* **Sample IDs:** `DRR749571` (Control), `DRR749572` (Resistant)

### Objective
This notebook performs four key steps:
1.  **Raw QC:** Run `FastQC` on the four raw FASTQ files to assess their initial quality.
2.  **Trimming:** Use `fastp` to clean the data (remove adapters, filter low-quality reads).
3.  **Trimmed QC:** Run `FastQC` again on the cleaned files to verify the improvement.
4.  **Aggregation:** Use `MultiQC` to create summary reports for easy comparison.

### Tools
* `FastQC` (v0.12.1): Generates QC reports for sequence data.
* `fastp` (v0.23.4): Performs fast all-in-one preprocessing (trimming, filtering).
* `MultiQC` (v1.19): Aggregates QC reports into a single summary.

In [None]:
print("--- 1. Creating QC and Trimming directories ---")
# We use '..' to go "up" one level from the 'notebooks' directory
!mkdir -p ../analysis/01_fastqc_raw
!mkdir -p ../analysis/02_fastp_trimmed
!mkdir -p ../analysis/03_fastqc_trimmed

print("Directories created:")
!ls -lR ../analysis/

In [None]:
print("--- FIX: Moving FASTQ files to the correct project data directory ---")

# 1. Just in case, we make sure the *correct* directory exists
# (It should exist from our setup, but this is safer)
!mkdir -p ../data/01_raw_fastq

# 2. Move all FASTQ files from the wrong location to the correct one
# 'data/01_raw_fastq/*' <-- This is the wrong path (relative to the notebook)
# '../data/01_raw_fastq/' <-- This is the correct path (relative to the notebook)
!mv data/01_raw_fastq/*.fastq ../data/01_raw_fastq/

# 3. Verify the move
print("\n--- Verification ---")
print("Wrong location (should be empty now):")
!ls -lh data/01_raw_fastq/

print("\nCorrect location (should have files now):")
!ls -lh ../data/01_raw_fastq/
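
The `mv` call above fails with an error if the glob matches nothing, for example when this cell is re-run after the files have already been moved. A minimal Python sketch of the same move that is safe to re-run is shown here; `move_fastq` is a hypothetical helper name, not part of the original notebook.

```python
from pathlib import Path
import shutil


def move_fastq(src_dir, dst_dir, pattern="*.fastq"):
    """Move files matching `pattern` from src_dir to dst_dir.

    Returns the sorted list of file names that were actually moved.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    if not src.is_dir():
        return []  # the misplaced directory does not exist: nothing to do
    moved = []
    for path in sorted(src.glob(pattern)):
        target = dst / path.name
        if target.exists():
            continue  # never overwrite a file that is already in place
        shutil.move(str(path), str(target))
        moved.append(path.name)
    return moved
```

In this notebook it would be called as `move_fastq("data/01_raw_fastq", "../data/01_raw_fastq")`; a second call simply returns an empty list.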

In [None]:
print("--- 2. Running FastQC on Raw Data ---")
# We run FastQC on all 4 raw FASTQ files
# -o : Output directory
# -t : Number of threads (to speed it up)
# The '\' just lets us split the command onto multiple lines for readability

!fastqc \
  -o ../analysis/01_fastqc_raw \
  -t 4 \
  ../data/01_raw_fastq/DRR749571_1.fastq \
  ../data/01_raw_fastq/DRR749571_2.fastq \
  ../data/01_raw_fastq/DRR749572_1.fastq \
  ../data/01_raw_fastq/DRR749572_2.fastq

print("--- Raw FastQC complete ---")
!ls -lh ../analysis/01_fastqc_raw
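
Rather than reading the `ls` listing by eye, the expected report files can be checked programmatically. The sketch below assumes FastQC's usual naming, where `.gz` and `.fastq`/`.fq` are stripped from the input name and `_fastqc.html` / `_fastqc.zip` are appended; `fastqc_report_names` and `missing_reports` are hypothetical helper names.

```python
from pathlib import Path


def fastqc_report_names(fastq_path):
    """Return the (html, zip) report names FastQC is assumed to write."""
    name = Path(fastq_path).name
    # Strip compression first, then the FASTQ extension.
    for suffix in (".gz", ".fastq", ".fq"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name + "_fastqc.html", name + "_fastqc.zip"


def missing_reports(fastq_paths, out_dir):
    """List expected report files that are absent from out_dir."""
    out = Path(out_dir)
    missing = []
    for fastq in fastq_paths:
        for report in fastqc_report_names(fastq):
            if not (out / report).exists():
                missing.append(report)
    return missing
```

An empty list from `missing_reports([...], "../analysis/01_fastqc_raw")` would indicate that all four inputs produced both report files.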

In [None]:
print("--- 3. Running MultiQC on Raw QC Reports ---")

# -o : where to write the report (in the same directory)
# ../analysis/01_fastqc_raw : where to look for reports (scan this directory)
!multiqc -o ../analysis/01_fastqc_raw ../analysis/01_fastqc_raw

print("\n--- MultiQC complete. Checking directory contents: ---")
# This command will show us ALL files in that directory
!ls -lh ../analysis/01_fastqc_raw

## 4. Trimming and Filtering with `fastp`

Based on the raw `MultiQC` report, we confirmed the presence of adapters. We will now use `fastp` to clean the data.

`fastp` will perform several steps automatically:
1.  **Auto-detect and trim adapters:** (This solves our main problem).
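2.  **Quality trimming:** `fastp` does not apply sliding-window quality trimming by default; trimming from the 3' end when quality drops requires `--cut_right`, which the commands below do not set.
3.  **Quality filtering:** We count a base as qualified only at Phred 20 or higher (`--qualified_quality_phred 20`); `fastp` discards reads in which more than 40% of bases are unqualified (its default `--unqualified_percent_limit`).
4.  **Length filtering:** We will discard any read pair in which a read becomes shorter than 50 bp after trimming (`--length_required 50`).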

We will also output compressed files (`.fastq.gz`) to save space, and generate new HTML/JSON reports for `fastp`.
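
To make the quality and length filters concrete, here is a simplified Python illustration of the per-read decision. It is a sketch of the rule, not `fastp`'s actual implementation. It assumes Phred+33 encoded quality strings and the 40% unqualified-base limit that `fastp` documents as its default; the function names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def passes_filters(quality_string, min_phred=20, max_unqualified_pct=40,
                   min_length=50):
    """Return True if a read would survive the length and quality filters."""
    scores = phred_scores(quality_string)
    # Length filter: reads shorter than min_length are discarded.
    if len(scores) < min_length:
        return False
    # Quality filter: a base is "unqualified" if its score is below min_phred.
    unqualified = sum(1 for q in scores if q < min_phred)
    # Keep the read only if the unqualified share is within the limit.
    return 100 * unqualified <= max_unqualified_pct * len(scores)
```

For example, a 60-base read whose qualities are all `I` (Phred 40) passes, while a 60-base read with 30 bases of `#` (Phred 2) fails because half of its bases are unqualified.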

In [None]:
print("--- 4. Running fastp on Sample DRR749571 (Control) ---")

!fastp \
  -i ../data/01_raw_fastq/DRR749571_1.fastq \
  -I ../data/01_raw_fastq/DRR749571_2.fastq \
  -o ../analysis/02_fastp_trimmed/DRR749571.trimmed_1.fastq.gz \
  -O ../analysis/02_fastp_trimmed/DRR749571.trimmed_2.fastq.gz \
  --qualified_quality_phred 20 \
  --length_required 50 \
  -w 4 \
  -j ../analysis/02_fastp_trimmed/DRR749571.fastp.json \
  -h ../analysis/02_fastp_trimmed/DRR749571.fastp.html

print("--- fastp complete for DRR749571 ---")

In [None]:
print("--- 5. Running fastp on Sample DRR749572 (Resistant) ---")

!fastp \
  -i ../data/01_raw_fastq/DRR749572_1.fastq \
  -I ../data/01_raw_fastq/DRR749572_2.fastq \
  -o ../analysis/02_fastp_trimmed/DRR749572.trimmed_1.fastq.gz \
  -O ../analysis/02_fastp_trimmed/DRR749572.trimmed_2.fastq.gz \
  --qualified_quality_phred 20 \
  --length_required 50 \
  -w 4 \
  -j ../analysis/02_fastp_trimmed/DRR749572.fastp.json \
  -h ../analysis/02_fastp_trimmed/DRR749572.fastp.html

print("--- fastp complete for DRR749572 ---")
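
The two `fastp` cells differ only in the sample ID, so the command can be built once and reused. The sketch below is one way to do that, assuming the same directory layout as this notebook and using `fastp`'s `--thread` option for worker threads; `build_fastp_cmd` is a hypothetical helper name.

```python
def build_fastp_cmd(sample, raw_dir="../data/01_raw_fastq",
                    out_dir="../analysis/02_fastp_trimmed", threads=4):
    """Return the fastp argument list for one paired-end sample."""
    return [
        "fastp",
        "-i", f"{raw_dir}/{sample}_1.fastq",            # read 1 input
        "-I", f"{raw_dir}/{sample}_2.fastq",            # read 2 input
        "-o", f"{out_dir}/{sample}.trimmed_1.fastq.gz",  # read 1 output
        "-O", f"{out_dir}/{sample}.trimmed_2.fastq.gz",  # read 2 output
        "--qualified_quality_phred", "20",
        "--length_required", "50",
        "--thread", str(threads),
        "-j", f"{out_dir}/{sample}.fastp.json",
        "-h", f"{out_dir}/{sample}.fastp.html",
    ]


# Example (not executed here): run both samples in a loop.
# import subprocess
# for sample in ("DRR749571", "DRR749572"):
#     subprocess.run(build_fastp_cmd(sample), check=True)
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` would stop the notebook if `fastp` exits with an error.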

In [None]:
print("\n--- 6. Verifying Trimming Output ---")
!ls -lh ../analysis/02_fastp_trimmed
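
File sizes alone say little about how many reads survived. The JSON report written by `fastp` can be summarized with a few lines of Python. This sketch assumes the report has a `summary` section with `before_filtering` and `after_filtering` entries, each holding `total_reads` and `q30_rate`; the helper names are hypothetical.

```python
import json


def load_fastp_report(path):
    """Load a fastp JSON report from disk."""
    with open(path) as handle:
        return json.load(handle)


def summarize_fastp(report):
    """Extract read counts and Q30 rates before and after filtering."""
    before = report["summary"]["before_filtering"]
    after = report["summary"]["after_filtering"]
    total = before["total_reads"]
    return {
        "reads_before": total,
        "reads_after": after["total_reads"],
        # Fraction of reads kept; 0.0 if the input was empty.
        "fraction_kept": after["total_reads"] / total if total else 0.0,
        "q30_before": before["q30_rate"],
        "q30_after": after["q30_rate"],
    }
```

For the control sample this would be `summarize_fastp(load_fastp_report("../analysis/02_fastp_trimmed/DRR749571.fastp.json"))`.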

## 5. Post-Trimming Quality Control

Now that we have cleaned and trimmed our data with `fastp`, we must verify the results. We will run `FastQC` and `MultiQC` *again* on the new trimmed files.

Our goal is to compare the new `multiqc_report.html` with the old one and confirm that:
1.  The "Adapter Content" graph is now flat (all adapters are gone).
2.  The "Per Base Sequence Quality" remains high, or even improves.
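
Both checks can also be read from the FastQC archives instead of the HTML report. The sketch below assumes each `*_fastqc.zip` contains a tab-separated `summary.txt` whose lines hold a status (`PASS`, `WARN` or `FAIL`), a module name and the file name; `fastqc_module_status` is a hypothetical helper name.

```python
import zipfile


def fastqc_module_status(zip_path):
    """Return {module name: status} from a FastQC zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # summary.txt sits inside a folder named after the input file.
        summary_name = next(
            name for name in archive.namelist() if name.endswith("summary.txt")
        )
        text = archive.read(summary_name).decode()
    statuses = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2:
            statuses[parts[1]] = parts[0]  # module name -> PASS/WARN/FAIL
    return statuses
```

Comparing `fastqc_module_status(...)["Adapter Content"]` for a raw archive and its trimmed counterpart gives a quick before/after check for the first criterion.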

In [None]:
print("--- 7. Running FastQC on Trimmed Data ---")
# We run FastQC on the 4 *new* trimmed/compressed files
# -o : Output directory
# -t : Number of threads

!fastqc \
  -o ../analysis/03_fastqc_trimmed \
  -t 4 \
  ../analysis/02_fastp_trimmed/DRR749571.trimmed_1.fastq.gz \
  ../analysis/02_fastp_trimmed/DRR749571.trimmed_2.fastq.gz \
  ../analysis/02_fastp_trimmed/DRR749572.trimmed_1.fastq.gz \
  ../analysis/02_fastp_trimmed/DRR749572.trimmed_2.fastq.gz

print("--- Trimmed FastQC complete ---")
!ls -lh ../analysis/03_fastqc_trimmed

In [None]:
print("--- 8. Running MultiQC on Trimmed QC Reports ---")
# This scans the *new* directory for reports

!multiqc \
  -o ../analysis/03_fastqc_trimmed \
  ../analysis/03_fastqc_trimmed

print("--- Trimmed MultiQC complete ---")
!ls -lh ../analysis/03_fastqc_trimmed