# RNA-Seq Pipeline: Step 2 - Quality Control (QC) and Trimming

This notebook runs the quality control and trimming steps on the raw `.fastq.gz` files.

**Workflow:**
1.  **Raw QC:** Run `FastQC` on the 12 compressed raw data files to assess initial quality.
2.  **Aggregate Raw QC:** Run `MultiQC` to create a single summary report for the raw data.
3.  **Trimming:** Use `fastp` to remove adapters, trim the biased leading bases, and filter out low-quality and short reads.
4.  **Trimmed QC:** Run `FastQC` again on the trimmed/cleaned files.
5.  **Aggregate Trimmed QC:** Run `MultiQC` on the trimmed reports to assess the results of cleaning.

**Tools:** `FastQC`, `MultiQC`, `fastp`
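
Before running anything, it can help to confirm that all three tools are installed. The following is a minimal sketch, assuming the executables are named `fastqc`, `multiqc`, and `fastp` on your `PATH`; the helper `find_missing_tools` is a hypothetical name used only for illustration.

```python
import shutil


def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Assumed executable names; adjust if your installation differs.
required = ["fastqc", "multiqc", "fastp"]
missing = find_missing_tools(required)
if missing:
    print(f"Missing from PATH: {', '.join(missing)}")
else:
    print("All required tools found.")
```

If anything is reported missing, install it (for example in the active conda environment) before continuing.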

In [None]:
import os

# --- Define Core Paths ---

# Input directory (from previous step)
raw_data_dir = "01_Raw_Data"

# Output directory for all QC reports
qc_dir = "00_Data_QC"

# Output directory for trimmed data
trimmed_dir = "02_Trimmed_Data"

# --- Create Output Directories ---
# We will create sub-directories for clarity

# Directory for FastQC reports on RAW data
raw_fastqc_dir = os.path.join(qc_dir, "01_raw_fastqc")
os.makedirs(raw_fastqc_dir, exist_ok=True)

# Directory for FastQC reports on TRIMMED data
trimmed_fastqc_dir = os.path.join(qc_dir, "02_trimmed_fastqc")
os.makedirs(trimmed_fastqc_dir, exist_ok=True)

# Directory for the trimmed .fastq.gz files
os.makedirs(trimmed_dir, exist_ok=True)

print("All output directories created/verified.")

In [None]:
%%bash -s "$raw_data_dir" "$raw_fastqc_dir"
# $1 = raw_data_dir
# $2 = raw_fastqc_dir

RAW_DIR=$1
RAW_QC_OUT=$2

echo "--- 1. Running FastQC on RAW Data ---"
echo "Input Directory: $RAW_DIR"
echo "Output Directory: $RAW_QC_OUT"

# Run FastQC
# -t 8 : Use 8 threads
# -o $RAW_QC_OUT : Output directory
# $RAW_DIR/*.fastq.gz : Run on all .fastq.gz files in the input directory
fastqc -t 8 -o "$RAW_QC_OUT" "$RAW_DIR"/*.fastq.gz

echo "--- Raw FastQC complete. ---"
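
FastQC names each report after its input file (for example, `SRR34134109_1.fastq.gz` yields `SRR34134109_1_fastqc.html`), so a quick check can confirm that every input produced a report. This is a rough sketch; `missing_fastqc_reports` is a hypothetical helper, and the directory names are the ones defined in the setup cell.

```python
import os


def missing_fastqc_reports(input_dir, report_dir, suffix=".fastq.gz"):
    """Return input files in `input_dir` that have no FastQC HTML report."""
    missing = []
    for name in sorted(os.listdir(input_dir)):
        if name.endswith(suffix):
            # FastQC replaces the FASTQ extension with "_fastqc.html".
            report = name[: -len(suffix)] + "_fastqc.html"
            if not os.path.exists(os.path.join(report_dir, report)):
                missing.append(name)
    return missing


# Paths assumed from the setup cell of this notebook.
raw_dir = "01_Raw_Data"
report_dir = os.path.join("00_Data_QC", "01_raw_fastqc")
if os.path.isdir(raw_dir) and os.path.isdir(report_dir):
    print("Inputs without a FastQC report:", missing_fastqc_reports(raw_dir, report_dir))
```

An empty list means all inputs were processed.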

In [None]:
%%bash -s "$raw_fastqc_dir"
# $1 = raw_fastqc_dir (this variable was set in Cell 2)

RAW_QC_OUT=$1

echo "--- 2. Running MultiQC on RAW FastQC Reports ---"
echo "Target Directory: $RAW_QC_OUT"

# Run MultiQC
# -o $RAW_QC_OUT : Output directory
# $RAW_QC_OUT : Directory to scan for reports
multiqc -o "$RAW_QC_OUT" "$RAW_QC_OUT"

echo "--- Raw MultiQC complete. ---"
echo "Check the 'multiqc_report.html' file in $RAW_QC_OUT"
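
Besides reading the HTML report, the PASS/WARN/FAIL flags can be pulled out of the FastQC zip archives directly. The sketch below assumes each archive contains a tab-separated `summary.txt` (status, module name, file name); `read_fastqc_statuses` is a hypothetical helper, not part of FastQC or MultiQC.

```python
import glob
import os
import zipfile


def read_fastqc_statuses(zip_path):
    """Return {module name: status} parsed from summary.txt in a FastQC zip."""
    with zipfile.ZipFile(zip_path) as archive:
        # summary.txt sits inside the archive's single top-level folder.
        summary_name = next(
            name for name in archive.namelist() if name.endswith("summary.txt")
        )
        text = archive.read(summary_name).decode("utf-8")
    statuses = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            status, module = fields[0], fields[1]
            statuses[module] = status
    return statuses


# Hypothetical usage: print the "Adapter Content" flag for each raw report.
pattern = os.path.join("00_Data_QC", "01_raw_fastqc", "*_fastqc.zip")
for zip_path in sorted(glob.glob(pattern)):
    statuses = read_fastqc_statuses(zip_path)
    print(os.path.basename(zip_path), statuses.get("Adapter Content", "n/a"))
```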

### 3. Trimming and Filtering (fastp)

**Analysis of Raw QC Report (multiqc_report.html):**
* **Good News:** The overall sequence quality (`Per Base Sequence Quality`) is excellent.
* **Problem 1 (High Priority):** The `Adapter Content` plot shows significant **Illumina adapter contamination**. These must be removed.
* **Problem 2 (Medium Priority):** The `Per Base Sequence Content` plot shows a strong bias in the first ~10-15 bases (a common **random primer** artifact).

**Action:**
We will use `fastp` to clean the data by:
1.  Removing adapters automatically.
2.  Trimming the first 10 bases from both Read 1 and Read 2 (`--trim_front1=10`, `--trim_front2=10`) to remove the bias.
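3.  Filtering out low-quality and too-short reads with `fastp`'s default filters. (Per-base quality trimming from the read ends is off by default in `fastp` and would need an option such as `--cut_tail`, which the command below does not set.)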
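
To make the effect of front trimming concrete, here is a toy illustration of what removing the first 10 bases does to a single read. It is not `fastp`'s implementation; `trim_front` and the example read are made up for demonstration.

```python
def trim_front(sequence, quality, n_bases):
    """Drop the first `n_bases` from a read and from its quality string."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    return sequence[n_bases:], quality[n_bases:]


# A made-up 20-base read with uniform quality scores.
seq = "ACGTACGTACGTACGTACGT"
qual = "I" * 20
trimmed_seq, trimmed_qual = trim_front(seq, qual, 10)
print(trimmed_seq)   # GTACGTACGT
print(trimmed_qual)  # IIIIIIIIII
```

Every read loses 10 bases this way, so reads that were already short may fall below `fastp`'s minimum length filter and be discarded.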

In [None]:
# We will use Python to create the loop.

import os

# --- Get Sample Names ---
# We get the variables (like raw_data_dir) from Cell 2
# We only need the R1 files to get the sample names
input_files = sorted([f for f in os.listdir(raw_data_dir) if f.endswith("_1.fastq.gz")])

# Get just the sample names (e.g., "SRR34134109")
sample_names = [f.split("_1.fastq.gz")[0] for f in input_files]

print(f"Found {len(sample_names)} samples to trim: {sample_names}")

# --- Start the loop ---
print("\n--- Starting fastp Trimming Loop ---")

for sample in sample_names:
    print(f"Processing sample: {sample} ...")
    
    # Define input paths for R1 and R2
    in_r1 = f"{raw_data_dir}/{sample}_1.fastq.gz"
    in_r2 = f"{raw_data_dir}/{sample}_2.fastq.gz"
    
    # Define output paths for cleaned R1 and R2
    out_r1 = f"{trimmed_dir}/{sample}.trimmed_R1.fastq.gz"
    out_r2 = f"{trimmed_dir}/{sample}.trimmed_R2.fastq.gz"
    
    # Define paths for the reports that fastp creates
    report_html = f"{trimmed_dir}/{sample}.fastp.html"
    report_json = f"{trimmed_dir}/{sample}.fastp.json"
    
    # Build the fastp command
    # We use '!' to run the command in the shell
    !fastp \
        --in1 $in_r1 \
        --in2 $in_r2 \
        --out1 $out_r1 \
        --out2 $out_r2 \
        --html $report_html \
        --json $report_json \
        --thread 8 \
        --detect_adapter_for_pe \
        --trim_front1 10 \
        --trim_front2 10
        
    print(f"Finished trimming {sample}.")

print("--- fastp Trimming complete. ---")
print(f"Trimmed files are in: {trimmed_dir}")
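
Each `fastp` run also writes a JSON report, which makes it easy to see how many reads survived filtering. This is a small sketch, assuming the report has `summary.before_filtering.total_reads` and `summary.after_filtering.total_reads` fields; `summarize_fastp_report` is a hypothetical helper name.

```python
import glob
import json
import os


def summarize_fastp_report(json_path):
    """Return read counts before/after filtering from a fastp JSON report."""
    with open(json_path) as handle:
        report = json.load(handle)
    before = report["summary"]["before_filtering"]["total_reads"]
    after = report["summary"]["after_filtering"]["total_reads"]
    # Guard against an empty input file to avoid division by zero.
    retained = after / before if before else 0.0
    return {"reads_before": before, "reads_after": after, "fraction_retained": retained}


# Report location assumed from the trimming loop (02_Trimmed_Data/<sample>.fastp.json).
for json_path in sorted(glob.glob(os.path.join("02_Trimmed_Data", "*.fastp.json"))):
    print(os.path.basename(json_path), summarize_fastp_report(json_path))
```

A sample that retains far fewer reads than the others is worth inspecting before alignment.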

### 4. Post-Trimming Quality Control

Now that `fastp` has finished, we have a new set of cleaned `.fastq.gz` files in the `02_Trimmed_Data` directory.

We must run `FastQC` and `MultiQC` **again** on this new data to:
1.  **Verify** that the adapters are gone.
2.  **Confirm** that the random primer bias at the start of the reads is gone or at least reduced (the raw data showed it in the first ~10-15 bases, and we trimmed 10).
3.  **Ensure** that we didn't introduce any new problems.
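
One way to make this before/after comparison explicit is to diff the per-module FastQC flags of a raw file and its trimmed counterpart. The sketch below works on plain dictionaries of `{module: status}`; `changed_modules` is a hypothetical helper, and the example flags are invented for illustration, not results from this dataset.

```python
def changed_modules(raw_statuses, trimmed_statuses):
    """Return (module, raw flag, trimmed flag) for every module whose flag differs."""
    changes = []
    for module in sorted(set(raw_statuses) | set(trimmed_statuses)):
        before = raw_statuses.get(module, "n/a")
        after = trimmed_statuses.get(module, "n/a")
        if before != after:
            changes.append((module, before, after))
    return changes


# Made-up flags for illustration only.
raw = {
    "Adapter Content": "FAIL",
    "Per base sequence content": "FAIL",
    "Per base sequence quality": "PASS",
}
trimmed = {
    "Adapter Content": "PASS",
    "Per base sequence content": "WARN",
    "Per base sequence quality": "PASS",
}
for module, before, after in changed_modules(raw, trimmed):
    print(f"{module}: {before} -> {after}")
```

A module that moves from PASS to WARN or FAIL after trimming would point to a newly introduced problem.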

In [None]:
%%bash -s "$trimmed_dir" "$trimmed_fastqc_dir"
# $1 = trimmed_dir (variable from Cell 2)
# $2 = trimmed_fastqc_dir (variable from Cell 2)

TRIMMED_DIR=$1
TRIMMED_QC_OUT=$2

echo "--- 4. Running FastQC on TRIMMED Data ---"
echo "Input Directory: $TRIMMED_DIR"
echo "Output Directory: $TRIMMED_QC_OUT"

# Run FastQC
# -t 8 : Use 8 threads
# -o $TRIMMED_QC_OUT : Output directory
# $TRIMMED_DIR/*.fastq.gz : Run on all .fastq.gz files in the TRIMMED directory
fastqc -t 8 -o "$TRIMMED_QC_OUT" "$TRIMMED_DIR"/*.fastq.gz

echo "--- Trimmed FastQC complete. ---"

In [None]:
%%bash -s "$trimmed_fastqc_dir"
# $1 = trimmed_fastqc_dir (variable from Cell 2)

TRIMMED_QC_OUT=$1

echo "--- 5. Running MultiQC on TRIMMED FastQC Reports ---"
echo "Target Directory: $TRIMMED_QC_OUT"

# Run MultiQC
# -o $TRIMMED_QC_OUT : Output directory
# $TRIMMED_QC_OUT : Directory to scan for reports
multiqc -o "$TRIMMED_QC_OUT" "$TRIMMED_QC_OUT"

echo "--- Trimmed MultiQC complete. ---"
echo "Check the 'multiqc_report.html' file in $TRIMMED_QC_OUT"