# Notebook 03: Statistics, Taxonomy, and Filtering

## Introduction

In Notebook 02, we completed the most computationally intensive step: DADA2 denoising. Using a "Split-Apply-Combine" strategy, we processed all 255 samples and merged the results into two final, validated artifacts:
* `table.qza`: The final ASV Feature Table.
* `rep-seqs.qza`: The final ASV Representative Sequences.

In this notebook, we will **analyze** those results to understand our dataset and **assign** biological meaning (taxonomy) to our ASVs. Finally, we will filter out unwanted data to create a "clean" table for downstream analysis.

### Objectives:

1.  **Analyze DADA2 Statistics:** Use the `table.qzv` visualization to review the final read counts per sample. This helps us identify failed samples or samples with very low read counts that need to be removed.
2.  **Assign Taxonomy:** Use a pre-trained classifier (SILVA) to assign taxonomic names (Phylum, Class, Order, Family, Genus, Species) to our `rep-seqs.qza`.
3.  **Visualize Taxonomy:** Create an interactive bar plot to visualize the taxonomic composition of our samples.
4.  **Filter Data:** Remove any non-bacterial (e.g., Mitochondria, Chloroplast) or unassigned ASVs from our table and sequences.

In [None]:
# ---  Imports, Settings, and Verification ---
import pandas as pd
import os

print("--- 1. Verification: Checking for input files from Notebook 02 ---")

# Define file paths
TABLE_QZA = "../results/table.qza"
REP_SEQS_QZA = "../results/rep-seqs.qza"
TABLE_QZV = "../results/table.qzv"
METADATA_TSV = "../data/metadata.tsv"

# Check if all required files exist
files_to_check = [TABLE_QZA, REP_SEQS_QZA, TABLE_QZV, METADATA_TSV]
all_files_exist = True

for f in files_to_check:
    if not os.path.exists(f):
        print(f"!!! ERROR: Required file not found: {f}")
        all_files_exist = False
    else:
        print(f"Found: {f}")

if all_files_exist:
    print("\n--- All required input files are present. Ready to start Notebook 03. ---")
else:
    print("\n--- !!! ERROR: Please ensure Notebook 02 ran successfully before proceeding. ---")

### 1. Analyze DADA2 Statistics

Our first objective is to analyze the output of the DADA2 pipeline. We have the summary file `table.qzv`, which is an interactive visualization.

Instead of viewing this file manually on `view.qiime2.org`, we will programmatically export its contents. This will give us access to the raw data tables inside it, specifically the table showing the number of reads (frequencies) per sample. We can then load this data into `pandas` to analyze it.
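
As an aside, a `.qzv` file is an ordinary zip archive, so the same table can also be read directly from Python. The following is a minimal sketch, not the method used in this notebook. The helper name `read_qzv_csv` is hypothetical, and it assumes the archive stores the file under `<uuid>/data/`, which is the usual layout for QIIME 2 visualizations.

```python
import io
import zipfile

import pandas as pd


def read_qzv_csv(qzv_path, filename="sample-frequency-detail.csv"):
    """Read a headerless two-column CSV stored inside a .qzv archive.

    Hypothetical helper: assumes the file lives under '<uuid>/data/'.
    """
    with zipfile.ZipFile(qzv_path) as archive:
        # Find the archive member whose path ends with 'data/<filename>'
        matches = [name for name in archive.namelist()
                   if name.endswith("data/" + filename)]
        if not matches:
            raise FileNotFoundError(f"{filename} not found in {qzv_path}")
        # Read the member fully into memory, then parse it with pandas
        payload = io.BytesIO(archive.read(matches[0]))
        return pd.read_csv(payload, header=None,
                           names=["sample-id", "TotalReads"])
```

With the paths defined earlier, `read_qzv_csv(TABLE_QZV)` would return the per-sample read counts without creating an export directory. The `qiime tools export` route remains the supported one, since it does not depend on the internal archive layout.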

In [None]:
# --- (Cell 5) Export data from table.qzv ---

print("--- 1. Exporting data from table.qzv ---")

# Define the export directory
EXPORTED_STATS_DIR = "../results/07_exported_stats"

# Use qiime tools export
# --input-path is our .qzv file
# --output-path is the new directory where the contents will be saved
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime tools export \
    --input-path {TABLE_QZV} \
    --output-path {EXPORTED_STATS_DIR}

print(f"\n--- 2. Verification: Listing contents of the export directory ---")
print(f"Contents of {EXPORTED_STATS_DIR}:")
!ls -lh {EXPORTED_STATS_DIR}

print("\nWe are looking for 'sample-frequency-detail.csv'.")

### 1.1 Load Statistics into Pandas

The export was successful, and we now have the file `sample-frequency-detail.csv`. We will load this file into a Pandas DataFrame to analyze the read distribution across all 255 samples.

In [None]:
# --- (Cell 8) Load per-sample read counts ---

print("--- 1. Loading stats file ---")
# Path to the per-sample frequency table exported from table.qzv
STATS_CSV_PATH = os.path.join(EXPORTED_STATS_DIR, "sample-frequency-detail.csv")

# The exported CSV has no header row, so we pass header=None
# and provide the column names manually (names=[...]).
# Without header=None, pandas would consume the first sample as the header.
df_stats = pd.read_csv(STATS_CSV_PATH, header=None, names=['sample-id', 'TotalReads'])

print("DataFrame Head:")
print(df_stats.head())

print("\n--- 2. Descriptive Statistics for Total Reads ---")
# This should show 'count 255.00'
print(df_stats['TotalReads'].describe())

### 1.2 Statistical Analysis and Filtering Decision

The statistics from Cell 8 cover all **255 samples**.

The descriptive statistics give us the key values for our filtering decision:
* **count:** 255 (all samples were successfully processed)
* **min:** approximately 2,957 reads
* **max:** approximately 152,748 reads

**Decision:**
The minimum read count (the `min` value) is our most critical metric. Common practice is to filter out samples with very low read counts (thresholds of 1,000 or 3,000 reads are typical), as they may not be representative.

Our minimum of about 2,957 reads is well above the 1,000-read threshold and only marginally below 3,000. Therefore, **no samples will be filtered** based on sequencing depth, and we retain all 255 samples for our analysis.
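
The depth check behind this decision can be sketched in a few lines of pandas. The helper name `samples_below_depth` and the demo values are illustrative assumptions. In the notebook, the input would be `df_stats`.

```python
import pandas as pd


def samples_below_depth(df, threshold, count_col="TotalReads"):
    """Return the rows whose read count is strictly below `threshold`."""
    return df[df[count_col] < threshold].sort_values(count_col)


# Illustrative data only; in the notebook this would be df_stats.
demo = pd.DataFrame({
    "sample-id": ["S1", "S2", "S3"],
    "TotalReads": [2957.0, 48210.0, 152748.0],
})

# Compare the two commonly used cutoffs side by side
for cutoff in (1000, 3000):
    flagged = samples_below_depth(demo, cutoff)
    print(f"threshold {cutoff}: {len(flagged)} sample(s) below")
```

Listing the flagged samples for more than one cutoff makes it easy to see how sensitive the decision is to the chosen threshold.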

### 2. Assign Taxonomy

We have confirmed that all 255 samples have sufficient read counts. We will proceed with all samples.

Our next objective is to assign taxonomic names (e.g., Phylum, Genus, Species) to our list of ASVs (`rep-seqs.qza`). To do this, we need a pre-trained taxonomic classifier. We will use the SILVA 138 99% OTUs classifier, which is compatible with our QIIME 2 version (2020.8).

**Step 1: Download the Classifier**
First, we must download this classifier file. We will save it in the `../data/` directory.
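
One caveat when skipping a download because the file already exists: an interrupted `wget -O` can leave a partial file at the target path. Since `.qza` artifacts are zip archives, a quick integrity check is possible. This is a minimal sketch, and the helper name `is_valid_qza` is hypothetical.

```python
import os
import zipfile


def is_valid_qza(path):
    """Return True if `path` exists and looks like a complete zip archive.

    QIIME 2 artifacts (.qza) are zip files; a truncated download lacks the
    zip end-of-archive record and therefore fails this check.
    """
    return os.path.isfile(path) and zipfile.is_zipfile(path)
```

A download cell could then test `is_valid_qza(CLASSIFIER_PATH)` instead of `os.path.exists(CLASSIFIER_PATH)` before deciding to skip the download. This only confirms the zip structure, not that the contents match the published file.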

In [None]:
# --- (Cell 11) Download the SILVA Classifier ---

# Define the path where we will save the classifier
CLASSIFIER_PATH = "../data/silva-138-99-nb-classifier.qza"
CLASSIFIER_URL = "https://data.qiime2.org/2020.8/common/silva-138-99-nb-classifier.qza"

print(f"--- 1. Downloading SILVA Classifier ---")
print(f"From: {CLASSIFIER_URL}")
print(f"To: {CLASSIFIER_PATH}")

# We use 'wget' to download the file.
# '-O' specifies the output file path.
# We check if the file *already* exists first to avoid re-downloading
if not os.path.exists(CLASSIFIER_PATH):
    !wget {CLASSIFIER_URL} -O {CLASSIFIER_PATH}
else:
    print("\nClassifier file already exists. Skipping download.")

print("\n--- 2. Verification of Download ---")
# We use 'ls -lh' to check if the file was downloaded and its size (~550M)
!ls -lh {CLASSIFIER_PATH}

### 2.2 Run Taxonomic Classification

We have successfully downloaded the SILVA classifier.

Now, we will use the `classify-sklearn` command from the `feature-classifier` plugin. This command will take our `rep-seqs.qza` artifact and the `classifier.qza` file as input, and it will output a new artifact: `taxonomy.qza`.

This file will contain the taxonomic assignment for every single ASV in our dataset.
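
To give a feel for what these assignments look like, here is a small sketch that splits taxon strings into rank columns. It assumes SILVA-style rank prefixes (`d__`, `p__`, `c__`, and so on) and a table with `Feature ID` and `Taxon` columns, as found in an exported `taxonomy.tsv`. The demo rows and the helper name `split_taxon` are illustrative, not taken from our data.

```python
import pandas as pd

RANKS = ["Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def split_taxon(taxon):
    """Split a 'd__X; p__Y; ...' string into a list of 7 rank names.

    Missing ranks are returned as None. Assumes SILVA-style 'x__' prefixes.
    """
    parts = [p.strip() for p in taxon.split(";") if p.strip()]
    # Drop the rank prefix ('d__', 'p__', ...) when one is present
    names = [p.split("__", 1)[1] if "__" in p else p for p in parts]
    names = names[:len(RANKS)]
    # Pad with None so every row has exactly one entry per rank
    return names + [None] * (len(RANKS) - len(names))


# Illustrative rows only; real data would come from the exported taxonomy.tsv.
demo = pd.DataFrame({
    "Feature ID": ["asv1", "asv2"],
    "Taxon": [
        "d__Bacteria; p__Firmicutes; c__Clostridia",
        "Unassigned",
    ],
})
ranks = pd.DataFrame([split_taxon(t) for t in demo["Taxon"]],
                     columns=RANKS, index=demo["Feature ID"])
print(ranks[["Domain", "Phylum", "Class"]])
```

ASVs that are only resolved to a higher rank simply have empty lower ranks, which is worth keeping in mind when reading genus- or species-level summaries.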

**(Warning: This step is computationally intensive. With a single job (`--p-n-jobs 1`), it can take 30-60+ minutes to complete.)**

In [None]:
# ---  Run Classification ---

print("--- 1. Initializing variables (after Kernel Restart) ---")
# We must redefine variables because the kernel was restarted.
CLASSIFIER_PATH = "../data/silva-138-99-nb-classifier.qza"
REP_SEQS_QZA = "../results/rep-seqs.qza"
TAXONOMY_QZA = "../results/taxonomy.qza"

print(f"--- 2. Starting Taxonomic Classification (Safe Mode) ---")
print(f"Using classifier: {CLASSIFIER_PATH}")
print(f"Using reads: {REP_SEQS_QZA}")
print("\nThis step will take a long time (30-60+ min), but will not freeze...")
print("Please be patient. The notebook is working...")

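# We use '--p-n-jobs 1' to be safe and avoid freezing the VM.
# Note: this classifier was built for QIIME 2 2020.8. If 'qiime2/core:latest'
# resolves to a different release, classify-sklearn may reject it because of a
# scikit-learn version mismatch; pinning the image tag to 2020.8 avoids this.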
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime feature-classifier classify-sklearn \
    --i-classifier {CLASSIFIER_PATH} \
    --i-reads {REP_SEQS_QZA} \
    --o-classification {TAXONOMY_QZA} \
    --p-n-jobs 1

print("\n--- 3. Classification Finished. Verifying output file ---")
!ls -lh {TAXONOMY_QZA}

### 3. Visualize Taxonomic Composition

We have successfully assigned taxonomy to our ASVs. Now we want to visualize these results.

We will use the `qiime taxa barplot` command. This will take our final feature table (`table.qza`), our new taxonomy file (`taxonomy.qza`), and our metadata file (`metadata.tsv`) to generate an interactive bar plot. This visualization (`.qzv`) is one of the most important outputs of the analysis, as it shows us the relative abundance of different bacteria across all samples.
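
The quantity shown in the bar plot, relative abundance per taxon, can be illustrated with a small pandas sketch. This is not how `qiime taxa barplot` is implemented. The counts, the phylum labels, and the helper name `relative_abundance_by_taxon` are made up for illustration.

```python
import pandas as pd


def relative_abundance_by_taxon(counts, taxon_of_feature):
    """Collapse a features x samples count table by taxon label, then
    convert each sample (column) to proportions that sum to 1."""
    collapsed = counts.groupby(taxon_of_feature).sum()
    return collapsed / collapsed.sum(axis=0)


# Illustrative counts: rows are ASVs, columns are samples.
counts = pd.DataFrame(
    {"S1": [30, 50, 20], "S2": [10, 0, 90]},
    index=["asv1", "asv2", "asv3"],
)
# Illustrative phylum label for each ASV, aligned on the ASV index
phylum = pd.Series(
    ["Firmicutes", "Firmicutes", "Bacteroidota"],
    index=counts.index,
)

rel = relative_abundance_by_taxon(counts, phylum)
print(rel)
```

Each column of `rel` corresponds to one stacked bar in the plot: the ASV counts are first summed within each taxon, then divided by the sample's total.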

In [None]:
# ---  Generate Taxonomic Bar Plot ---

print(f"--- 1. Generating interactive taxonomy bar plot ---")

# Define file paths
TABLE_QZA = "../results/table.qza"
TAXONOMY_QZA = "../results/taxonomy.qza"
METADATA_TSV = "../data/metadata.tsv"
TAXA_BARPLOT_QZV = "../results/taxa-barplot.qzv"

# Run the command
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime taxa barplot \
    --i-table {TABLE_QZA} \
    --i-taxonomy {TAXONOMY_QZA} \
    --m-metadata-file {METADATA_TSV} \
    --o-visualization {TAXA_BARPLOT_QZV}

print("\n--- 2. Bar Plot Finished. Verifying output file ---")
!ls -lh {TAXA_BARPLOT_QZV}

### 4. Filter Contaminants from the Data

We have generated our taxonomy bar plot. However, our feature table likely still contains ASVs that the classifier assigned to non-bacterial sources (contaminants), and we must remove these before downstream analysis.

These contaminants typically include:
* **Mitochondria:** 16S sequences from the host's (human) own cells.
* **Chloroplasts:** 16S sequences from plant matter (e.g., diet).
* **Unassigned:** ASVs that could not be assigned any taxonomy.

We will now perform two filtering steps:
1.  **Filter the Feature Table:** Use `qiime taxa filter-table` to remove any ASV assigned to "Mitochondria", "Chloroplast", or "Unassigned".
2.  **Filter the Rep-Seqs:** Use `qiime feature-table filter-seqs` to create a new sequence file that only contains the ASVs we kept in the filtered table.
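
Step 1 can be sketched as follows. This is a minimal, hedged example that only builds and prints the `qiime taxa filter-table` command line. The helper name `build_filter_table_cmd` is hypothetical, and the lowercase search terms rely on the plugin's default substring matching, so the result should be checked against the taxonomy labels actually present.

```python
import shlex


def build_filter_table_cmd(table, taxonomy, output,
                           exclude=("mitochondria", "chloroplast", "unassigned")):
    """Build the argument list for `qiime taxa filter-table`.

    `--p-exclude` takes a comma-separated list of search terms; features
    whose taxonomy annotation matches one of them are dropped.
    """
    return [
        "qiime", "taxa", "filter-table",
        "--i-table", table,
        "--i-taxonomy", taxonomy,
        "--p-exclude", ",".join(exclude),
        "--o-filtered-table", output,
    ]


cmd = build_filter_table_cmd(
    "../results/table.qza",
    "../results/taxonomy.qza",
    "../results/table-filtered.qza",
)
# Print a copy-pasteable command line; run it inside the QIIME 2 container.
print(shlex.join(cmd))
```

The printed command would be run with the same `docker run` wrapper used in the other cells. Its output, `table-filtered.qza`, is the input that the sequence-filtering step expects.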

In [None]:
# ---  Filter the Representative Sequences ---

print(f"--- 1. Filtering contaminants from the Rep-Seqs ---")
print("This ensures our sequence file matches our new filtered table.")

# Define input paths
REP_SEQS_QZA = "../results/rep-seqs.qza" # The *original* sequences
# The *new* filtered table; it must already have been created with `qiime taxa filter-table`
TABLE_FILTERED_QZA = "../results/table-filtered.qza"
# Define output path
REP_SEQS_FILTERED_QZA = "../results/rep-seqs-filtered.qza"

# Run the command
# This command keeps only the sequences that are present in the filtered table
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime feature-table filter-seqs \
    --i-data {REP_SEQS_QZA} \
    --i-table {TABLE_FILTERED_QZA} \
    --o-filtered-data {REP_SEQS_FILTERED_QZA}

print("\n--- 2. Rep-Seqs Filtering Finished. Verifying output file ---")
!ls -lh {REP_SEQS_FILTERED_QZA}

### 5. Conclusion & Next Steps

In this notebook, we analyzed our denoised data and prepared it for downstream analysis.

We have:
1.  **Analyzed DADA2 Statistics:** We confirmed that all 255 samples were processed successfully and had sufficient read depths (minimum of 2,957 reads), meaning no samples needed to be filtered based on depth.
2.  **Assigned Taxonomy:** We downloaded the SILVA classifier and successfully assigned taxonomic names to all our ASVs (saved in `taxonomy.qza`).
3.  **Visualized Taxonomy:** We generated an interactive bar plot (`taxa-barplot.qzv`) to visualize the bacterial composition of our samples.
4.  **Filtered Contaminants:** We filtered our data to remove non-bacterial sequences (Mitochondria, Chloroplasts, Unassigned), creating our final clean artifacts.

**Final Clean Artifacts for Analysis:**
* `results/table-filtered.qza` (The clean Feature Table)
* `results/rep-seqs-filtered.qza` (The clean Representative Sequences)

**Next Steps:**
With our clean, validated, and taxonomically assigned data ready, we can now move on to **Notebook 04**. In the next stage, we will:
1.  Build a **Phylogenetic Tree** from our `rep-seqs-filtered.qza`.
2.  Perform **Alpha and Beta Diversity analysis** to compare the microbial communities between different samples (e.g., Crohn's Disease vs. Healthy).