# RNA-Seq Pipeline: Step 2 - Quality Control (QC) and Trimming

This notebook runs the quality control and trimming steps on the raw `.fastq.gz` files.

**Workflow:**
1.  **Raw QC:** Run `FastQC` on the 12 compressed raw data files to assess initial quality.
2.  **Aggregate Raw QC:** Run `MultiQC` to create a single summary report for the raw data.
3.  **Trimming:** Use `fastp` to remove adapters, trim the biased leading bases of each read, and filter out low-quality and short reads.
4.  **Trimmed QC:** Run `FastQC` again on the trimmed/cleaned files.
5.  **Aggregate Trimmed QC:** Run `MultiQC` on the trimmed reports to assess the results of cleaning.

**Tools:** `FastQC`, `MultiQC`, `fastp`
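
Before running anything, it can help to confirm that the three command-line tools are reachable from the notebook's environment. The sketch below is an optional preflight check and not part of the original workflow. The helper name `find_tools` is hypothetical, and the check assumes the executables are named `fastqc`, `multiqc`, and `fastp`, as they are invoked later in this notebook.

```python
import shutil

def find_tools(names):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

tools = find_tools(["fastqc", "multiqc", "fastp"])
for name, path in tools.items():
    status = path if path else "NOT FOUND (install it or activate the right environment)"
    print(f"{name}: {status}")
```

A `None` entry means the later cells that call that tool would fail, so it is cheaper to catch it here.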

In [2]:
import os

# --- Define Core Paths ---

# Input directory (from previous step)
raw_data_dir = "01_Raw_Data"

# Output directory for all QC reports
qc_dir = "00_Data_QC"

# Output directory for trimmed data
trimmed_dir = "02_Trimmed_Data"

# --- Create Output Directories ---
# We will create sub-directories for clarity

# Directory for FastQC reports on RAW data
raw_fastqc_dir = os.path.join(qc_dir, "01_raw_fastqc")
os.makedirs(raw_fastqc_dir, exist_ok=True)

# Directory for FastQC reports on TRIMMED data
trimmed_fastqc_dir = os.path.join(qc_dir, "02_trimmed_fastqc")
os.makedirs(trimmed_fastqc_dir, exist_ok=True)

# Directory for the trimmed .fastq.gz files
os.makedirs(trimmed_dir, exist_ok=True)

print("All output directories created/verified.")

All output directories created/verified.


In [2]:
%%bash -s "$raw_data_dir" "$raw_fastqc_dir"
# $1 = raw_data_dir
# $2 = raw_fastqc_dir

RAW_DIR=$1
RAW_QC_OUT=$2

echo "--- 1. Running FastQC on RAW Data ---"
echo "Input Directory: $RAW_DIR"
echo "Output Directory: $RAW_QC_OUT"

# Run FastQC
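# -t 8 : Use 8 threads (FastQC processes one file per thread)
# -o "$RAW_QC_OUT" : Output directory
# "$RAW_DIR"/*.fastq.gz : Run on all .fastq.gz files in the input directory
# Variables are quoted so paths containing spaces are passed intact
fastqc -t 8 -o "$RAW_QC_OUT" "$RAW_DIR"/*.fastq.gz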

echo "--- Raw FastQC complete. ---"

--- 1. Running FastQC on RAW Data ---
Input Directory: 01_Raw_Data
Output Directory: 00_Data_QC/01_raw_fastqc
application/gzip
application/gzip


Started analysis of SRR34134104_1.fastq.gz


application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip


Started analysis of SRR34134104_2.fastq.gz
Started analysis of SRR34134105_1.fastq.gz
Started analysis of SRR34134105_2.fastq.gz
Started analysis of SRR34134106_1.fastq.gz
Started analysis of SRR34134106_2.fastq.gz
Started analysis of SRR34134107_1.fastq.gz
Started analysis of SRR34134107_2.fastq.gz
Approx 5% complete for SRR34134105_1.fastq.gz
Approx 5% complete for SRR34134105_2.fastq.gz
Approx 5% complete for SRR34134106_1.fastq.gz
Approx 5% complete for SRR34134106_2.fastq.gz
Approx 5% complete for SRR34134107_1.fastq.gz
Approx 5% complete for SRR34134104_2.fastq.gz
Approx 5% complete for SRR34134104_1.fastq.gz
Approx 5% complete for SRR34134107_2.fastq.gz
Approx 10% complete for SRR34134105_1.fastq.gz
Approx 10% complete for SRR34134105_2.fastq.gz
Approx 10% complete for SRR34134106_1.fastq.gz
Approx 10% complete for SRR34134106_2.fastq.gz
Approx 10% complete for SRR34134107_1.fastq.gz
Approx 10% complete for SRR34134107_2.fastq.gz
Approx 10% complete for SRR34134104_2.fastq.gz
Ap

Analysis complete for SRR34134105_1.fastq.gz
Analysis complete for SRR34134105_2.fastq.gz


Approx 95% complete for SRR34134106_1.fastq.gz
Approx 95% complete for SRR34134106_2.fastq.gz
Approx 90% complete for SRR34134107_1.fastq.gz
Started analysis of SRR34134108_1.fastq.gz
Started analysis of SRR34134108_2.fastq.gz
Approx 90% complete for SRR34134107_2.fastq.gz
Approx 75% complete for SRR34134104_1.fastq.gz
Approx 75% complete for SRR34134104_2.fastq.gz


Analysis complete for SRR34134106_1.fastq.gz
Analysis complete for SRR34134106_2.fastq.gz


Approx 5% complete for SRR34134108_1.fastq.gz
Approx 95% complete for SRR34134107_1.fastq.gz
Approx 5% complete for SRR34134108_2.fastq.gz
Started analysis of SRR34134109_1.fastq.gz
Approx 95% complete for SRR34134107_2.fastq.gz
Started analysis of SRR34134109_2.fastq.gz
Approx 80% complete for SRR34134104_1.fastq.gz
Approx 80% complete for SRR34134104_2.fastq.gz
Approx 10% complete for SRR34134108_1.fastq.gz
Approx 10% complete for SRR34134108_2.fastq.gz


Analysis complete for SRR34134107_1.fastq.gz


Approx 5% complete for SRR34134109_1.fastq.gz
Approx 5% complete for SRR34134109_2.fastq.gz


Analysis complete for SRR34134107_2.fastq.gz


Approx 85% complete for SRR34134104_1.fastq.gz
Approx 85% complete for SRR34134104_2.fastq.gz
Approx 15% complete for SRR34134108_2.fastq.gz
Approx 15% complete for SRR34134108_1.fastq.gz
Approx 10% complete for SRR34134109_2.fastq.gz
Approx 10% complete for SRR34134109_1.fastq.gz
Approx 90% complete for SRR34134104_1.fastq.gz
Approx 90% complete for SRR34134104_2.fastq.gz
Approx 20% complete for SRR34134108_1.fastq.gz
Approx 20% complete for SRR34134108_2.fastq.gz
Approx 15% complete for SRR34134109_2.fastq.gz
Approx 15% complete for SRR34134109_1.fastq.gz
Approx 25% complete for SRR34134108_1.fastq.gz
Approx 95% complete for SRR34134104_1.fastq.gz
Approx 20% complete for SRR34134109_1.fastq.gz
Approx 95% complete for SRR34134104_2.fastq.gz
Approx 25% complete for SRR34134108_2.fastq.gz
Approx 20% complete for SRR34134109_2.fastq.gz
Approx 30% complete for SRR34134108_1.fastq.gz


Analysis complete for SRR34134104_1.fastq.gz


Approx 25% complete for SRR34134109_1.fastq.gz
Approx 25% complete for SRR34134109_2.fastq.gz
Approx 30% complete for SRR34134108_2.fastq.gz


Analysis complete for SRR34134104_2.fastq.gz


Approx 35% complete for SRR34134108_1.fastq.gz
Approx 30% complete for SRR34134109_1.fastq.gz
Approx 30% complete for SRR34134109_2.fastq.gz
Approx 35% complete for SRR34134108_2.fastq.gz
Approx 40% complete for SRR34134108_1.fastq.gz
Approx 35% complete for SRR34134109_1.fastq.gz
Approx 35% complete for SRR34134109_2.fastq.gz
Approx 40% complete for SRR34134108_2.fastq.gz
Approx 45% complete for SRR34134108_1.fastq.gz
Approx 40% complete for SRR34134109_1.fastq.gz
Approx 40% complete for SRR34134109_2.fastq.gz
Approx 45% complete for SRR34134108_2.fastq.gz
Approx 50% complete for SRR34134108_1.fastq.gz
Approx 45% complete for SRR34134109_1.fastq.gz
Approx 45% complete for SRR34134109_2.fastq.gz
Approx 50% complete for SRR34134108_2.fastq.gz
Approx 55% complete for SRR34134108_1.fastq.gz
Approx 50% complete for SRR34134109_1.fastq.gz
Approx 50% complete for SRR34134109_2.fastq.gz
Approx 55% complete for SRR34134108_2.fastq.gz
Approx 60% complete for SRR34134108_1.fastq.gz
Approx 55% co

Analysis complete for SRR34134108_1.fastq.gz


Approx 95% complete for SRR34134109_1.fastq.gz
Approx 95% complete for SRR34134109_2.fastq.gz


Analysis complete for SRR34134108_2.fastq.gz
Analysis complete for SRR34134109_1.fastq.gz
Analysis complete for SRR34134109_2.fastq.gz
--- Raw FastQC complete. ---


In [6]:
%%bash -s "$raw_fastqc_dir"
# $1 = raw_fastqc_dir (this variable was set in Cell 2)

RAW_QC_OUT=$1

echo "--- 2. Running MultiQC on RAW FastQC Reports ---"
echo "Target Directory: $RAW_QC_OUT"

# Run MultiQC
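# -o "$RAW_QC_OUT" : Output directory
# "$RAW_QC_OUT" : Directory to scan for reports
# Note: if a report already exists in the output directory, MultiQC writes
# multiqc_report_1.html instead; pass --force to overwrite the existing report.
multiqc -o "$RAW_QC_OUT" "$RAW_QC_OUT"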

echo "--- Raw MultiQC complete. ---"
echo "Check the 'multiqc_report.html' file in $RAW_QC_OUT"

--- 2. Running MultiQC on RAW FastQC Reports ---
Target Directory: 00_Data_QC/01_raw_fastqc



/// MultiQC 🔍 v1.30

     version_check | MultiQC Version v1.31 now available!
       file_search | Search path: /home/refm_youssef/rnaseq_project/00_Data_QC/01_raw_fastqc
         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 28/28
            fastqc | Found 12 reports
     write_results | Existing reports found, adding suffix to filenames. Use '--force' to overwrite.
     write_results | Data        : 00_Data_QC/01_raw_fastqc/multiqc_data_1
     write_results | Report      : 00_Data_QC/01_raw_fastqc/multiqc_report_1.html
           multiqc | MultiQC complete


--- Raw MultiQC complete. ---
Check the 'multiqc_report.html' file in 00_Data_QC/01_raw_fastqc


### 3. Trimming and Filtering (fastp)

**Analysis of Raw QC Report (multiqc_report.html):**
* **Good News:** The overall sequence quality (`Per Base Sequence Quality`) is excellent.
* **Problem 1 (High Priority):** The `Adapter Content` plot shows significant **Illumina adapter contamination**. These must be removed.
* **Problem 2 (Medium Priority):** The `Per Base Sequence Content` plot shows a strong base-composition bias in the first ~10-15 bases of each read, a common artifact of **random primer** (hexamer) annealing during RNA-seq library preparation.

**Action:**
We will use `fastp` to clean the data by:
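1.  Detecting and removing adapters automatically (`--detect_adapter_for_pe`).
2.  Trimming the first 10 bases from both Read 1 and Read 2 (`--trim_front1 10`, `--trim_front2 10`) to remove the bias.
3.  Filtering out low-quality and too-short reads with `fastp`'s default filters. The command below does not enable per-read quality trimming of read ends (for example `--cut_tail`), so low-quality tails are filtered at the read level rather than trimmed.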
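
As a sketch of how these actions map onto a `fastp` invocation, the helper below assembles the argument list for one paired-end sample. The function name `build_fastp_cmd` and its parameters are hypothetical; the flags and file-name patterns are the ones used in this notebook's trimming cell. Building a list rather than a shell string would also allow running it with `subprocess.run(cmd, check=True)`, which is an alternative to the `!` shell escape used in this notebook, not what the notebook itself does.

```python
import os

def build_fastp_cmd(sample, raw_dir, out_dir, trim_front=10, threads=8):
    """Return the fastp argument list for one paired-end sample (hypothetical helper)."""
    return [
        "fastp",
        "--in1", os.path.join(raw_dir, f"{sample}_1.fastq.gz"),
        "--in2", os.path.join(raw_dir, f"{sample}_2.fastq.gz"),
        "--out1", os.path.join(out_dir, f"{sample}.trimmed_R1.fastq.gz"),
        "--out2", os.path.join(out_dir, f"{sample}.trimmed_R2.fastq.gz"),
        "--html", os.path.join(out_dir, f"{sample}.fastp.html"),
        "--json", os.path.join(out_dir, f"{sample}.fastp.json"),
        "--thread", str(threads),
        "--detect_adapter_for_pe",         # action 1: adapter detection for paired-end data
        "--trim_front1", str(trim_front),  # action 2: drop the leading bases of read 1
        "--trim_front2", str(trim_front),  # action 2: same for read 2
    ]

cmd = build_fastp_cmd("SRR34134104", "01_Raw_Data", "02_Trimmed_Data")
print(" ".join(cmd))
```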

In [7]:
# We will use Python to create the loop.

import os

# --- Get Sample Names ---
# We get the variables (like raw_data_dir) from Cell 2
# We only need the R1 files to get the sample names
input_files = sorted([f for f in os.listdir(raw_data_dir) if f.endswith("_1.fastq.gz")])

# Get just the sample names (e.g., "SRR34134109")
sample_names = [f.split("_1.fastq.gz")[0] for f in input_files]

print(f"Found {len(sample_names)} samples to trim: {sample_names}")

# --- Start the loop ---
print("\n--- Starting fastp Trimming Loop ---")

for sample in sample_names:
    print(f"Processing sample: {sample} ...")
    
    # Define input paths for R1 and R2
    in_r1 = f"{raw_data_dir}/{sample}_1.fastq.gz"
    in_r2 = f"{raw_data_dir}/{sample}_2.fastq.gz"
    
    # Define output paths for cleaned R1 and R2
    out_r1 = f"{trimmed_dir}/{sample}.trimmed_R1.fastq.gz"
    out_r2 = f"{trimmed_dir}/{sample}.trimmed_R2.fastq.gz"
    
    # Define paths for the reports that fastp creates
    report_html = f"{trimmed_dir}/{sample}.fastp.html"
    report_json = f"{trimmed_dir}/{sample}.fastp.json"
    
    # Build the fastp command
    # We use '!' to run the command in the shell
    !fastp \
        --in1 $in_r1 \
        --in2 $in_r2 \
        --out1 $out_r1 \
        --out2 $out_r2 \
        --html $report_html \
        --json $report_json \
        --thread 8 \
        --detect_adapter_for_pe \
        --trim_front1 10 \
        --trim_front2 10
        
    print(f"Finished trimming {sample}.")

print("--- fastp Trimming complete. ---")
print(f"Trimmed files are in: {trimmed_dir}")

Found 6 samples to trim: ['SRR34134104', 'SRR34134105', 'SRR34134106', 'SRR34134107', 'SRR34134108', 'SRR34134109']

--- Starting fastp Trimming Loop ---
Processing sample: SRR34134104 ...
Detecting adapter sequence for read1...
>Nextera_LMP_Read1_External_Adapter | >Illumina Multiplexing Index Sequencing Primer
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC

Detecting adapter sequence for read2...
>Illumina TruSeq Adapter Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Read1 before filtering:
total reads: 20798968
total bases: 3119845200
Q20 bases: 3067097368(98.3093%)
Q30 bases: 2962826594(94.9671%)
Q40 bases: 0(0%)

Read2 before filtering:
total reads: 20798968
total bases: 3119845200
Q20 bases: 3046404045(97.646%)
Q30 bases: 2915427732(93.4478%)
Q40 bases: 0(0%)

Read1 after filtering:
total reads: 20728434
total bases: 2847514975
Q20 bases: 2802226697(98.4096%)
Q30 bases: 2708023546(95.1013%)
Q40 bases: 0(0%)

Read2 after filtering:
total reads: 20728434
total bases: 2847468201
Q20 bases: 2792856725
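
To put the logged counts in perspective, the arithmetic below computes what fraction of Read 1 survived filtering for SRR34134104. The numbers are copied from the fastp log above; the variable names are illustrative only.

```python
# Read 1 counts for SRR34134104, copied from the fastp log above
reads_before, reads_after = 20_798_968, 20_728_434
bases_before, bases_after = 3_119_845_200, 2_847_514_975

read_retention = reads_after / reads_before
base_retention = bases_after / bases_before
read_length = bases_before / reads_before  # mean raw read length in bases

print(f"Reads retained: {read_retention:.2%}")
print(f"Bases retained: {base_retention:.2%}")
print(f"Mean raw read length: {read_length:.0f} bp")
```

About 99.7% of reads and about 91% of bases are retained. Most of the base loss is expected: the raw reads are 150 bp, so trimming 10 leading bases removes roughly 6.7% of bases before any adapter removal.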

### 4. Post-Trimming Quality Control

Now that `fastp` has finished, we have a new set of cleaned `.fastq.gz` files in the `02_Trimmed_Data` directory.

We must run `FastQC` and `MultiQC` **again** on this new data to:
1.  **Verify** that the adapters are gone.
2.  **Confirm** that the random primer bias (first 10 bases) is gone.
3.  **Ensure** that we didn't introduce any new problems.
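
One way to make these checks less manual is to read the PASS/WARN/FAIL flags that FastQC writes to `summary.txt` inside each `*_fastqc.zip` archive. The sketch below is an optional helper and not part of the original notebook: `parse_fastqc_summary` and `summary_from_zip` are hypothetical names, and the parser assumes the usual tab-separated layout of status, module name, and file name. The example parses an illustrative string, not real results from this dataset.

```python
import zipfile

def parse_fastqc_summary(text):
    """Parse the content of a FastQC summary.txt into {module_name: status}."""
    statuses = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2:
            statuses[parts[1]] = parts[0]
    return statuses

def summary_from_zip(zip_path):
    """Read summary.txt from a FastQC zip archive (assumes one summary.txt per archive)."""
    with zipfile.ZipFile(zip_path) as archive:
        name = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        return parse_fastqc_summary(archive.read(name).decode("utf-8"))

# Illustrative input only; not taken from this project's reports
example = (
    "PASS\tAdapter Content\tsample.fastq.gz\n"
    "WARN\tPer base sequence content\tsample.fastq.gz\n"
)
flags = parse_fastqc_summary(example)
print(flags["Adapter Content"], flags["Per base sequence content"])
```

Calling `summary_from_zip` on each archive in the trimmed FastQC directory would give a quick table of which modules still warn or fail; the MultiQC report remains the place to inspect the actual plots.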

In [3]:
%%bash -s "$trimmed_dir" "$trimmed_fastqc_dir"
# $1 = trimmed_dir (variable from Cell 2)
# $2 = trimmed_fastqc_dir (variable from Cell 2)

TRIMMED_DIR=$1
TRIMMED_QC_OUT=$2

echo "--- 4. Running FastQC on TRIMMED Data ---"
echo "Input Directory: $TRIMMED_DIR"
echo "Output Directory: $TRIMMED_QC_OUT"

# Run FastQC
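# -t 8 : Use 8 threads (FastQC processes one file per thread)
# -o "$TRIMMED_QC_OUT" : Output directory
# "$TRIMMED_DIR"/*.fastq.gz : Run on all .fastq.gz files in the TRIMMED directory
# Variables are quoted so paths containing spaces are passed intact
fastqc -t 8 -o "$TRIMMED_QC_OUT" "$TRIMMED_DIR"/*.fastq.gz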

echo "--- Trimmed FastQC complete. ---"

--- 4. Running FastQC on TRIMMED Data ---
Input Directory: 02_Trimmed_Data
Output Directory: 00_Data_QC/02_trimmed_fastqc
application/gzip
application/gzip


Started analysis of SRR34134104.trimmed_R1.fastq.gz


application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip


Started analysis of SRR34134104.trimmed_R2.fastq.gz
Started analysis of SRR34134105.trimmed_R1.fastq.gz
Started analysis of SRR34134105.trimmed_R2.fastq.gz
Started analysis of SRR34134106.trimmed_R1.fastq.gz
Started analysis of SRR34134106.trimmed_R2.fastq.gz
Started analysis of SRR34134107.trimmed_R1.fastq.gz
Started analysis of SRR34134107.trimmed_R2.fastq.gz
Approx 5% complete for SRR34134105.trimmed_R1.fastq.gz
Approx 5% complete for SRR34134105.trimmed_R2.fastq.gz
Approx 5% complete for SRR34134106.trimmed_R1.fastq.gz
Approx 5% complete for SRR34134106.trimmed_R2.fastq.gz
Approx 5% complete for SRR34134104.trimmed_R1.fastq.gz
Approx 5% complete for SRR34134107.trimmed_R1.fastq.gz
Approx 5% complete for SRR34134104.trimmed_R2.fastq.gz
Approx 5% complete for SRR34134107.trimmed_R2.fastq.gz
Approx 10% complete for SRR34134105.trimmed_R1.fastq.gz
Approx 10% complete for SRR34134105.trimmed_R2.fastq.gz
Approx 10% complete for SRR34134106.trimmed_R1.fastq.gz
Approx 10% complete for SRR3

Analysis complete for SRR34134105.trimmed_R1.fastq.gz
Analysis complete for SRR34134105.trimmed_R2.fastq.gz


Approx 95% complete for SRR34134106.trimmed_R1.fastq.gz
Approx 75% complete for SRR34134104.trimmed_R1.fastq.gz
Approx 90% complete for SRR34134107.trimmed_R1.fastq.gz
Started analysis of SRR34134108.trimmed_R1.fastq.gz
Approx 90% complete for SRR34134107.trimmed_R2.fastq.gz
Approx 75% complete for SRR34134104.trimmed_R2.fastq.gz
Approx 95% complete for SRR34134106.trimmed_R2.fastq.gz
Started analysis of SRR34134108.trimmed_R2.fastq.gz


Analysis complete for SRR34134106.trimmed_R1.fastq.gz


Approx 5% complete for SRR34134108.trimmed_R1.fastq.gz


Analysis complete for SRR34134106.trimmed_R2.fastq.gz


Approx 95% complete for SRR34134107.trimmed_R2.fastq.gz
Approx 95% complete for SRR34134107.trimmed_R1.fastq.gz
Approx 5% complete for SRR34134108.trimmed_R2.fastq.gz
Started analysis of SRR34134109.trimmed_R1.fastq.gz
Approx 80% complete for SRR34134104.trimmed_R1.fastq.gz
Approx 80% complete for SRR34134104.trimmed_R2.fastq.gz
Started analysis of SRR34134109.trimmed_R2.fastq.gz
Approx 10% complete for SRR34134108.trimmed_R1.fastq.gz


Analysis complete for SRR34134107.trimmed_R2.fastq.gz


Approx 10% complete for SRR34134108.trimmed_R2.fastq.gz


Analysis complete for SRR34134107.trimmed_R1.fastq.gz


Approx 5% complete for SRR34134109.trimmed_R1.fastq.gz
Approx 5% complete for SRR34134109.trimmed_R2.fastq.gz
Approx 85% complete for SRR34134104.trimmed_R1.fastq.gz
Approx 85% complete for SRR34134104.trimmed_R2.fastq.gz
Approx 15% complete for SRR34134108.trimmed_R2.fastq.gz
Approx 15% complete for SRR34134108.trimmed_R1.fastq.gz
Approx 10% complete for SRR34134109.trimmed_R1.fastq.gz
Approx 10% complete for SRR34134109.trimmed_R2.fastq.gz
Approx 90% complete for SRR34134104.trimmed_R2.fastq.gz
Approx 90% complete for SRR34134104.trimmed_R1.fastq.gz
Approx 20% complete for SRR34134108.trimmed_R1.fastq.gz
Approx 15% complete for SRR34134109.trimmed_R2.fastq.gz
Approx 20% complete for SRR34134108.trimmed_R2.fastq.gz
Approx 15% complete for SRR34134109.trimmed_R1.fastq.gz
Approx 95% complete for SRR34134104.trimmed_R2.fastq.gz
Approx 95% complete for SRR34134104.trimmed_R1.fastq.gz
Approx 25% complete for SRR34134108.trimmed_R1.fastq.gz
Approx 20% complete for SRR34134109.trimmed_R2.fas

Analysis complete for SRR34134104.trimmed_R2.fastq.gz


Approx 30% complete for SRR34134108.trimmed_R1.fastq.gz


Analysis complete for SRR34134104.trimmed_R1.fastq.gz


Approx 30% complete for SRR34134108.trimmed_R2.fastq.gz
Approx 25% complete for SRR34134109.trimmed_R1.fastq.gz
Approx 25% complete for SRR34134109.trimmed_R2.fastq.gz
Approx 35% complete for SRR34134108.trimmed_R1.fastq.gz
Approx 35% complete for SRR34134108.trimmed_R2.fastq.gz
Approx 30% complete for SRR34134109.trimmed_R1.fastq.gz
Approx 30% complete for SRR34134109.trimmed_R2.fastq.gz
Approx 40% complete for SRR34134108.trimmed_R1.fastq.gz
Approx 40% complete for SRR34134108.trimmed_R2.fastq.gz
Approx 35% complete for SRR34134109.trimmed_R1.fastq.gz
Approx 35% complete for SRR34134109.trimmed_R2.fastq.gz
Approx 45% complete for SRR34134108.trimmed_R1.fastq.gz
Approx 45% complete for SRR34134108.trimmed_R2.fastq.gz
Approx 40% complete for SRR34134109.trimmed_R1.fastq.gz
Approx 40% complete for SRR34134109.trimmed_R2.fastq.gz
Approx 50% complete for SRR34134108.trimmed_R1.fastq.gz
Approx 45% complete for SRR34134109.trimmed_R1.fastq.gz
Approx 50% complete for SRR34134108.trimmed_R2.f

Analysis complete for SRR34134108.trimmed_R1.fastq.gz


Approx 95% complete for SRR34134109.trimmed_R1.fastq.gz


Analysis complete for SRR34134108.trimmed_R2.fastq.gz


Approx 95% complete for SRR34134109.trimmed_R2.fastq.gz


Analysis complete for SRR34134109.trimmed_R1.fastq.gz
Analysis complete for SRR34134109.trimmed_R2.fastq.gz
--- Trimmed FastQC complete. ---


In [4]:
%%bash -s "$trimmed_fastqc_dir"
# $1 = trimmed_fastqc_dir (variable from Cell 2)

TRIMMED_QC_OUT=$1

echo "--- 5. Running MultiQC on TRIMMED FastQC Reports ---"
echo "Target Directory: $TRIMMED_QC_OUT"

# Run MultiQC
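# -o "$TRIMMED_QC_OUT" : Output directory
# "$TRIMMED_QC_OUT" : Directory to scan for reports
# Note: MultiQC shortens file names into sample names, so R1/R2 reports can
# collapse into one sample. If fewer reports are found than FastQC files,
# consider rerunning with --fn_as_s_name to keep full file names.
multiqc -o "$TRIMMED_QC_OUT" "$TRIMMED_QC_OUT"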

echo "--- Trimmed MultiQC complete. ---"
echo "Check the 'multiqc_report.html' file in $TRIMMED_QC_OUT"

--- 5. Running MultiQC on TRIMMED FastQC Reports ---
Target Directory: 00_Data_QC/02_trimmed_fastqc



/// MultiQC 🔍 v1.30

     version_check | MultiQC Version v1.31 now available!
       file_search | Search path: /home/refm_youssef/rnaseq_project/00_Data_QC/02_trimmed_fastqc
         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 24/24
            fastqc | Found 6 reports
     write_results | Data        : 00_Data_QC/02_trimmed_fastqc/multiqc_data
     write_results | Report      : 00_Data_QC/02_trimmed_fastqc/multiqc_report.html
           multiqc | MultiQC complete


--- Trimmed MultiQC complete. ---
Check the 'multiqc_report.html' file in 00_Data_QC/02_trimmed_fastqc
