# Notebook 01: Quality Control (QC)

### Objective
The goal of this notebook is to assess the quality of our 510 raw FASTQ files (255 paired-end samples).

### Workflow
1.  **Install Tools:** We will install `FastQC` and `MultiQC`.
2.  **Run FastQC:** We will run `FastQC` on all 510 files located in `../data/raw_fastq/`.
3.  **Run MultiQC:** We will use `MultiQC` to aggregate all the individual FastQC reports into a single, interactive HTML report.
4.  **Analyze Report:** We will analyze the MultiQC report to determine the optimal **truncation lengths (trimming parameters)** for DADA2. This is the most critical step for ensuring a high-quality analysis.

In [None]:
# Install FastQC and MultiQC using mamba
# -n qiime2_env: install into our existing qiime2_env environment
# -c bioconda: the channel that hosts these bioinformatics tools
# -y: automatically answer 'yes' to the installation prompt

!mamba install -n qiime2_env -c bioconda fastqc multiqc -y

### 2. Run FastQC on all Samples

Now we will run `FastQC` on all 510 `.fastq` files in parallel.

**Workflow for this cell:**
1.  **Create Output Directory:** We will create a new, clean directory `../results/01_fastqc_reports/` to store the FastQC output (510 `.html` reports and 510 `.zip` archives).
2.  **Run FastQC:** We will execute `fastqc` with the following key parameters:
    * `--threads 8`: FastQC processes one file per thread, so this lets it work on up to 8 files at once, using all 8 CPUs of our VM and speeding up the run significantly.
    * `-o ../results/01_fastqc_reports/`: Specifies the output directory we just created.
    * `../data/raw_fastq/*.fastq`: The input files (the `*` is a wildcard meaning "all files that end in .fastq").
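The wildcard is expanded by the shell, so a wrong path or extension only shows up once `fastqc` fails. As a minimal sketch (not part of the original workflow), the same command can be assembled in Python with an up-front check that the pattern matches at least one file. The helper name `build_fastqc_command` and the placeholder file names are hypothetical.

```python
from pathlib import Path
import tempfile


def build_fastqc_command(input_dir, output_dir, threads=8, pattern="*.fastq"):
    """Return the fastqc argument list for every file matching `pattern`.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    files = sorted(Path(input_dir).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern} in {input_dir}")
    return ["fastqc", "--threads", str(threads), "-o", str(output_dir),
            *map(str, files)]


# Demonstration on a throwaway directory with two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sampleA_R1.fastq", "sampleA_R2.fastq"):
        (Path(tmp) / name).touch()
    demo_cmd = build_fastqc_command(tmp, "reports", threads=2)
    print(len(demo_cmd) - 5, "input files found")  # prints "2 input files found"
```

In the notebook, the returned list could be passed to `subprocess.run(cmd, check=True)` in place of the `!fastqc` line, assuming `fastqc` is on the `PATH` of the active environment.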

In [None]:
# 1. Create the output directory
# The -p flag creates any missing parent folders and does not complain if the folder already exists
!mkdir -p ../results/01_fastqc_reports

# 2. Run FastQC on all files
# This command will take a long time (20-30+ minutes)
!fastqc --threads 8 -o ../results/01_fastqc_reports/ ../data/raw_fastq/*.fastq

### 3. Verify FastQC Output

Before running MultiQC, we must verify that FastQC ran successfully. We expect to find 1020 new files (510 `.html` reports + 510 `.zip` archives) in our output directory.
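A raw file count cannot tell us *which* samples failed. The sketch below, offered as an optional extra check, lists the expected outputs that are missing. It assumes FastQC's default naming, where `name.fastq` produces `name_fastqc.html` and `name_fastqc.zip`. The helper name `missing_fastqc_outputs` and the placeholder file names are hypothetical.

```python
from pathlib import Path
import tempfile


def missing_fastqc_outputs(input_dir, report_dir, pattern="*.fastq"):
    """List expected FastQC output files that are absent from `report_dir`."""
    missing = []
    for fq in sorted(Path(input_dir).glob(pattern)):
        for ext in (".html", ".zip"):
            # Assumed default naming: <input name without .fastq>_fastqc<ext>
            expected = Path(report_dir) / f"{fq.stem}_fastqc{ext}"
            if not expected.is_file():
                missing.append(expected.name)
    return missing


# Demonstration: two placeholder inputs, but reports exist for only one of them.
with tempfile.TemporaryDirectory() as tmp:
    raw, out = Path(tmp) / "raw", Path(tmp) / "reports"
    raw.mkdir()
    out.mkdir()
    for name in ("s1_R1.fastq", "s1_R2.fastq"):
        (raw / name).touch()
    for name in ("s1_R1_fastqc.html", "s1_R1_fastqc.zip"):
        (out / name).touch()
    demo_missing = missing_fastqc_outputs(raw, out)
    print(demo_missing)  # ['s1_R2_fastqc.html', 's1_R2_fastqc.zip']
```

On the real data, an empty list from `missing_fastqc_outputs("../data/raw_fastq", "../results/01_fastqc_reports")` would indicate that every input has both of its outputs.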

In [None]:
# Count the files in the output directory (expected: 1020)
# Plain `ls` is used because `ls -l` adds a "total" header line that would inflate the count by one
!ls ../results/01_fastqc_reports/ | wc -l

### 4. Run MultiQC to Aggregate Reports

Now that we have 510 individual FastQC reports (1020 files in total), we will use `MultiQC` to parse all of them and create a single, unified HTML report. This single report is what we will analyze to choose the DADA2 trimming parameters.

* **Input:** The `../results/01_fastqc_reports/` directory (which contains all the `.zip` and `.html` files).
* **Output:** A new directory, `../results/02_multiqc_report/`, which will contain the final report.
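As a small sketch of what to expect in the output directory: with default settings, MultiQC writes a `multiqc_report.html` file and a `multiqc_data/` folder. These names are MultiQC defaults and would differ if a custom `--filename` were passed. The helper name `check_multiqc_output` is hypothetical.

```python
from pathlib import Path
import tempfile


def check_multiqc_output(report_dir):
    """Report whether the default MultiQC outputs exist in `report_dir`."""
    report_dir = Path(report_dir)
    return {
        "report": (report_dir / "multiqc_report.html").is_file(),
        "data_dir": (report_dir / "multiqc_data").is_dir(),
    }


# Demonstration: a placeholder report without its data folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "multiqc_report.html").touch()
    demo_status = check_multiqc_output(tmp)
    print(demo_status)  # {'report': True, 'data_dir': False}
```

After the real run, both values should be `True` for `../results/02_multiqc_report/`.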

In [None]:
# 1. Create the output directory for the MultiQC report
!mkdir -p ../results/02_multiqc_report

# 2. Run MultiQC
# -o: Specify the output directory
# The last argument is the input directory (where MultiQC should scan for reports)
!multiqc -o ../results/02_multiqc_report/ ../results/01_fastqc_reports/

In [None]:
# List the contents of the MultiQC output directory
!ls -lh ../results/02_multiqc_report/

### 5. Analysis and Conclusion

**Finding:** The "Per Base Sequence Quality" plot (summarized by boxplots) shows **excellent quality** across the entire 160 bp read length for *both* R1 (blue/purple cloud) and R2 (red cloud).

**Decision:**
The median quality (boxplot body) remains high (well within the green zone, Q > 30) for the full read length. Therefore, **no truncation (trimming) is necessary.** We will keep the full 160 bp of both forward and reverse reads by setting each truncation length to the read length. Note that DADA2 discards any read shorter than its truncation length, so these values assume every read is a full 160 bp.

**Parameters for DADA2:**
* `--p-trunc-len-f 160`
* `--p-trunc-len-r 160`
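The decision rule used above can be written down as a simple sketch: keep bases up to the first position where the per-position quality drops below a threshold. The function name `suggest_trunc_len`, the Q30 threshold, and the quality values below are illustrative assumptions, not numbers taken from our MultiQC report.

```python
def suggest_trunc_len(quality_by_position, min_q=30):
    """Return how many leading positions have quality >= min_q.

    `quality_by_position` holds one summary quality value (for example the
    median) per read position, starting at position 1.
    """
    for index, q in enumerate(quality_by_position):
        if q < min_q:
            return index  # number of positions before the first drop
    return len(quality_by_position)


# Hypothetical profile where quality stays high for all 160 positions.
flat_profile = [36] * 160
print(suggest_trunc_len(flat_profile))  # 160

# Hypothetical profile where the last 10 positions fall below Q30.
decaying_profile = [36] * 150 + [28] * 10
print(suggest_trunc_len(decaying_profile))  # 150
```

A rule like this is only a starting point: for paired-end data, the truncated forward and reverse reads must still overlap enough to be merged.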