# 01. Read Quality Control and Trimming (FastQC, MultiQC & fastp)

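This notebook performs initial quality control (QC) on our raw sequencing reads, trims them, and then checks the result.

**Workflow:**
1.  **Run FastQC:** Run `FastQC` on all 14 `.fastq.gz` files in the `../data/` directory. This generates an individual HTML report (and a `.zip` archive of the underlying data) for each read file.
2.  **Run MultiQC:** Use `MultiQC` to scan the `FastQC` output directory and build a single summary report, so the quality of all samples can be compared at a glance.
3.  **Trim with fastp:** Install `fastp` and use it to remove adapters and low-quality bases from each of the 7 read pairs.
4.  **Re-check with MultiQC:** Aggregate the `fastp` reports to confirm that trimming worked.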
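
Before running anything, it can help to confirm that every Read 1 file has a matching Read 2 file. The following is a minimal sketch, assuming the `_1.fastq.gz` / `_2.fastq.gz` naming used in this project; `find_read_pairs` is a hypothetical helper name, not part of any tool used here.

```python
from pathlib import Path


def find_read_pairs(data_dir):
    """Map each sample ID to its (Read 1, Read 2) paths; raise if a mate is missing."""
    data_dir = Path(data_dir)
    pairs = {}
    for r1 in sorted(data_dir.glob("*_1.fastq.gz")):
        # Strip the Read 1 suffix to recover the sample ID (IDs may contain '_')
        sample_id = r1.name[: -len("_1.fastq.gz")]
        r2 = data_dir / f"{sample_id}_2.fastq.gz"
        if not r2.is_file():
            raise FileNotFoundError(f"No Read 2 file for sample {sample_id}: {r2}")
        pairs[sample_id] = (r1, r2)
    return pairs
```

For this project, `len(find_read_pairs("../data"))` should be 7 (14 files in 7 pairs).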

In [None]:
# Create a dedicated directory for the FastQC reports
!mkdir -p ../results/fastqc

# Run FastQC on all 14 files
# -o: specifies the output directory (../results/fastqc); it must already exist
# -t 8: processes up to 8 files in parallel (one per thread, to match our VM's 8 CPUs)
# ../data/*.fastq.gz: wildcard to select all fastq files in the data folder

print("--- Running FastQC on 14 files (this will take several minutes)... ---")
!fastqc -o ../results/fastqc/ -t 8 ../data/*.fastq.gz

# Verification
print("\n--- FastQC complete. Verifying output files: ---")
!ls -lh ../results/fastqc/
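
Reading the `ls` listing by eye is easy to get wrong with 14 inputs. As a sketch, the check can be automated, relying on FastQC's convention of writing `<name>_fastqc.html` and `<name>_fastqc.zip` for an input called `<name>.fastq.gz`; `missing_fastqc_reports` is a hypothetical helper name.

```python
from pathlib import Path


def missing_fastqc_reports(data_dir, fastqc_dir):
    """Return the names of expected FastQC outputs that are absent from fastqc_dir."""
    data_dir, fastqc_dir = Path(data_dir), Path(fastqc_dir)
    missing = []
    for fq in sorted(data_dir.glob("*.fastq.gz")):
        stem = fq.name[: -len(".fastq.gz")]
        # FastQC writes one HTML report and one zip archive per input file
        for suffix in ("_fastqc.html", "_fastqc.zip"):
            report = fastqc_dir / f"{stem}{suffix}"
            if not report.is_file():
                missing.append(report.name)
    return missing
```

An empty list from `missing_fastqc_reports("../data", "../results/fastqc")` means every input produced both outputs.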

## 2. Aggregate QC Reports (MultiQC)

Now that we have 14 individual FastQC reports, we will use `MultiQC` to parse all of them and create a single, unified HTML report. This is essential for quickly comparing the quality metrics across all 7 samples (and their Read 1 / Read 2 files) at a glance.

We will create a new directory, `../results/multiqc`, to store this summary report.

In [None]:
# Create a dedicated directory for the MultiQC report
!mkdir -p ../results/multiqc

# Run MultiQC
# -o: specifies the output directory (../results/multiqc)
# ../results/fastqc/: the directory where MultiQC will scan for reports

print("--- Running MultiQC to aggregate reports... ---")
!multiqc ../results/fastqc/ -o ../results/multiqc/

# Verification
print("\n--- MultiQC complete. Verifying output file: ---")
!ls -lh ../results/multiqc/

## 3. Install Trimming Tool (fastp)

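The "Before" MultiQC report shows significant adapter contamination and low-quality read tails (per-base quality dropping into the red zone of the FastQC plots). Left uncorrected, this is "garbage in, garbage out": adapter sequence and low-quality bases can cause misalignments and false variant calls in our downstream GATK results.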
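
The same finding can be checked outside the HTML report. Each FastQC `.zip` archive contains a tab-separated `summary.txt` with one `STATUS<TAB>module<TAB>filename` line per module. The sketch below lists every module that did not pass; `flagged_modules` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def flagged_modules(fastqc_dir, statuses=("WARN", "FAIL")):
    """Collect (filename, module, status) for FastQC modules with the given statuses."""
    rows = []
    for archive in sorted(Path(fastqc_dir).glob("*_fastqc.zip")):
        with zipfile.ZipFile(archive) as zf:
            # The archive holds a single folder; locate the summary.txt inside it
            names = [n for n in zf.namelist() if n.endswith("/summary.txt")]
            if not names:
                continue
            text = zf.read(names[0]).decode("utf-8")
        for line in text.splitlines():
            parts = line.split("\t")
            if len(parts) == 3 and parts[0] in statuses:
                status, module, filename = parts
                rows.append((filename, module, status))
    return rows
```

Calling `flagged_modules("../results/fastqc", statuses=("FAIL",))` restricts the list to outright failures, for example in the "Adapter Content" or "Per base sequence quality" modules.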

We will use `fastp`, a modern and ultra-fast tool, to:
1.  Auto-detect and remove adapter sequences.
2.  Trim low-quality bases from the ends of the reads.

First, let's install it.

In [None]:
# Install fastp from bioconda
# (bioconda packages depend on conda-forge, so list that channel first)
!mamba install -c conda-forge -c bioconda fastp -y

# Verify the installation
print("\n--- fastp Version ---")
!fastp --version
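
If the version check fails, the usual cause is that the tool is not on the `PATH` of the environment the notebook kernel runs in. A small sketch for checking all three tools used in this notebook at once; `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(tools=("fastqc", "multiqc", "fastp")):
    """Return the tools from the given list that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from `missing_tools()` means all three executables are visible to the kernel.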

## 4. Run Trimming (fastp)

Now we will execute `fastp` on all 7 sample pairs.

**Workflow for each sample:**
1.  **Input:** The `_1.fastq.gz` (Read 1) and `_2.fastq.gz` (Read 2) files from `../data/`.
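2.  **Process:** `fastp` will remove adapters (for paired-end data it finds them by read-overlap analysis by default), trim low-quality bases from the 3' ends (this is off by default and needs a cutting option such as `--cut_tail`), and filter out any reads that become too short (under 15 bp by default).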
3.  **Output (Data):** New, clean `_1.trimmed.fastq.gz` and `_2.trimmed.fastq.gz` files will be saved in a new directory: `../data/trimmed/`.
4.  **Output (Reports):** A new HTML and JSON report for each sample will be saved in `../results/fastp_reports/`.

We will use a `bash` loop inside this cell to automate the process for all 7 samples.
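
The per-sample naming scheme above can also be written down in Python, which makes it easy to inspect a command before anything runs. This is a sketch only: `build_fastp_command` and its parameters are hypothetical names, and the function only assembles the argument list; it does not call `fastp`.

```python
def build_fastp_command(
    sample_id,
    data_dir="../data",
    trimmed_dir="../data/trimmed",
    report_dir="../results/fastp_reports",
    threads=8,
    extra_args=(),
):
    """Return the fastp argument list for one paired-end sample (nothing is executed)."""
    return [
        "fastp",
        "-i", f"{data_dir}/{sample_id}_1.fastq.gz",              # Read 1 input
        "-I", f"{data_dir}/{sample_id}_2.fastq.gz",              # Read 2 input
        "-o", f"{trimmed_dir}/{sample_id}_1.trimmed.fastq.gz",   # Read 1 output
        "-O", f"{trimmed_dir}/{sample_id}_2.trimmed.fastq.gz",   # Read 2 output
        "--html", f"{report_dir}/{sample_id}.html",
        "--json", f"{report_dir}/{sample_id}.json",
        "--thread", str(threads),
        *extra_args,  # optional extra flags, e.g. ["--cut_tail"]
    ]
```

The resulting list can be printed for review or passed to `subprocess.run(cmd, check=True)` once the output directories exist.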

In [None]:
%%bash
# This 'magic command' tells Jupyter to run this entire cell as a bash script

# Stop at the first error instead of carrying on and reporting success
set -euo pipefail

# 1. Create output directories for our new clean data and reports
echo "--- Creating output directories... ---"
mkdir -p ../data/trimmed/
mkdir -p ../results/fastp_reports/

# 2. Loop through all Read 1 files in the ../data/ directory
for r1_path in ../data/*_1.fastq.gz; do
    
    # --- Construct all filenames ---
    
    # Get the base filename (e.g., "SRR11187849_1.fastq.gz")
    r1_filename=$(basename "$r1_path")
    
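    # Get the sample ID (e.g., "SRR11187849") by stripping the Read 1 suffix.
    # Unlike cutting at the first '_', this also works if the ID itself contains '_'.
    sample_id="${r1_filename%_1.fastq.gz}"
    
    echo "--- Processing sample: $sample_id ---"
    
    # Construct the path to the matching Read 2 file and make sure it exists
    r2_path="../data/${sample_id}_2.fastq.gz"
    if [ ! -f "$r2_path" ]; then
        echo "ERROR: missing Read 2 file for $sample_id: $r2_path" >&2
        exit 1
    fi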
    
    # Construct output paths for the new *trimmed* data
    out_r1="../data/trimmed/${sample_id}_1.trimmed.fastq.gz"
    out_r2="../data/trimmed/${sample_id}_2.trimmed.fastq.gz"
    
    # Construct output paths for the new fastp QC reports
    out_html="../results/fastp_reports/${sample_id}.html"
    out_json="../results/fastp_reports/${sample_id}.json"
    
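    # --- Run fastp ---
    # Adapter removal, quality filtering, and length filtering are on by default.
    # --cut_tail adds sliding-window trimming of low-quality 3' ends,
    # which fastp does not do unless asked.
    # --thread 8 uses all 8 CPUs.
    fastp \
        -i "$r1_path" \
        -I "$r2_path" \
        -o "$out_r1" \
        -O "$out_r2" \
        --html "$out_html" \
        --json "$out_json" \
        --cut_tail \
        --thread 8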
done

echo "--- fastp trimming complete. ---"

## 5. Post-Trimming QC (MultiQC "After")

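We have now created 14 new trimmed FASTQ files. The final step in this notebook is to **verify** that the trimming was successful.

`fastp` wrote an HTML and a JSON report for each sample to `../results/fastp_reports/`, and each report records statistics both before and after filtering. We will now run `MultiQC` on this *new* directory; it reads the JSON files and combines all 7 samples into one report.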

This "After" report will allow us to visually confirm that:
1.  The adapter contamination is gone.
2.  The low-quality tails (Red Zone) have been clipped.
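
The visual check can be backed by numbers taken straight from the `fastp` JSON reports. The sketch below assumes the usual report layout, with `summary.before_filtering` and `summary.after_filtering` blocks (each holding `total_reads` and `q30_rate`) and an `adapter_cutting` block; `summarize_fastp_report` is a hypothetical helper name.

```python
import json
from pathlib import Path


def summarize_fastp_report(json_path):
    """Extract before/after read counts and Q30 rates from one fastp JSON report."""
    report = json.loads(Path(json_path).read_text())
    before = report["summary"]["before_filtering"]
    after = report["summary"]["after_filtering"]
    reads_before = before["total_reads"]
    reads_after = after["total_reads"]
    return {
        "reads_before": reads_before,
        "reads_after": reads_after,
        # Fraction of reads that survived filtering (0.0 if the input was empty)
        "reads_kept_fraction": reads_after / reads_before if reads_before else 0.0,
        "q30_rate_before": before["q30_rate"],
        "q30_rate_after": after["q30_rate"],
        # Assumed key; falls back to 0 if the adapter block is absent
        "adapter_trimmed_reads": report.get("adapter_cutting", {}).get(
            "adapter_trimmed_reads", 0
        ),
    }
```

Looping over `Path("../results/fastp_reports").glob("*.json")` and calling this function on each file gives one summary per sample; a higher `q30_rate_after` than `q30_rate_before` is the expected sign that trimming helped.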

In [None]:
# Create a new, separate directory for our "After" MultiQC report
!mkdir -p ../results/multiqc_trimmed

# Run MultiQC
# Note: We are now pointing it at the 'fastp_reports' directory
print("--- Running MultiQC on 'fastp' reports (After Trimming)... ---")
!multiqc ../results/fastp_reports/ -o ../results/multiqc_trimmed/

# Verification
print("\n--- MultiQC (After) complete. Verifying output file: ---")
!ls -lh ../results/multiqc_trimmed/