# Notebook 04: Phylogenetic Tree and Diversity Analysis

## Introduction

In Notebook 03, we assigned taxonomy and filtered our dataset. We now have two clean, validated artifacts ready for diversity analysis:
* `table-filtered.qza`: The final, filtered feature table.
* `rep-seqs-filtered.qza`: The final, filtered representative sequences.

In this notebook, we will perform the core diversity analyses to compare the microbial communities between our sample groups. This involves two major stages:
1.  **Phylogenetic Tree Construction:** We will build a phylogenetic tree to understand the evolutionary relationships between our ASVs.
2.  **Diversity Analysis:** We will use this tree (and our table) to calculate Alpha and Beta diversity metrics, allowing us to statistically compare the microbial composition of different experimental groups (e.g., Crohn's Disease vs. Healthy).

### Objectives:

1.  **Build Phylogenetic Tree:** Use the `rep-seqs-filtered.qza` to perform sequence alignment and construct a phylogenetic tree. This is essential for phylogenetic diversity metrics (Faith's PD, UniFrac).
2.  **Calculate Core Diversity Metrics:** Run the powerful `qiime diversity core-metrics-phylogenetic` command. This single command will automatically calculate:
    * Alpha diversity metrics (e.g., Shannon, Faith's PD, Observed Features).
    * Beta diversity metrics (e.g., Bray-Curtis, UniFrac).
    * PCoA (Principal Coordinates Analysis) results for visualization.
3.  **Analyze Alpha Diversity:** Statistically compare alpha diversity between experimental groups (Crohn's vs. Healthy).
4.  **Analyze Beta Diversity:** Statistically compare and visualize (PCoA) the beta diversity between experimental groups.

In [None]:
# ---  Imports, Settings, and Verification ---
import pandas as pd
import os

print("--- 1. Verification: Checking for input files from Notebook 03 ---")

# Define file paths
# These are the *clean* files from the end of Notebook 03
TABLE_FILTERED_QZA = "../results/table-filtered.qza"
REP_SEQS_FILTERED_QZA = "../results/rep-seqs-filtered.qza"
METADATA_TSV = "../data/metadata.tsv"

# Check if all required files exist
files_to_check = [TABLE_FILTERED_QZA, REP_SEQS_FILTERED_QZA, METADATA_TSV]
all_files_exist = True

for f in files_to_check:
    if not os.path.exists(f):
        print(f"!!! ERROR: Required file not found: {f}")
        all_files_exist = False
    else:
        print(f"Found: {f}")

if all_files_exist:
    print("\n--- All required input files are present. Ready to start Notebook 04. ---")
else:
    # Stop here so that later cells do not run against missing inputs
    raise FileNotFoundError(
        "Required input files are missing. Please ensure Notebook 03 ran successfully before proceeding."
    )

### 1. Build Phylogenetic Tree

Our first objective is to build a phylogenetic tree from our clean representative sequences (`rep-seqs-filtered.qza`). This tree is essential for calculating phylogenetic diversity metrics like Faith's PD and UniFrac.

This process involves four standard steps:
1.  **Alignment:** Align all sequences using MAFFT.
2.  **Masking:** Remove highly variable or "noisy" positions from the alignment.
3.  **Tree Building:** Construct an unrooted tree using FastTree.
4.  **Rooting:** Apply midpoint rooting to create the final rooted tree.
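
For reference, QIIME 2 also provides a pipeline, `qiime phylogeny align-to-tree-mafft-fasttree`, that chains these same four steps in a single call. The sketch below only assembles such a command as a Python list and does not run it. The helper name `align_to_tree_command` is hypothetical, and the Docker image and mount layout are assumed to mirror the cells in this notebook.

```python
from pathlib import Path


def align_to_tree_command(rep_seqs, out_dir, image="qiime2/core:latest"):
    """Build (but do not run) a one-call alternative to the four steps above.

    `align_to_tree_command` is a hypothetical helper; the image name and the
    /data mount are assumptions copied from this notebook's other cells.
    """
    project_root = Path.cwd().parent  # the notebook lives in <project>/notebooks
    return [
        "docker", "run", "--rm",
        "-v", f"{project_root}:/data",
        "-w", "/data/notebooks",
        image,
        "qiime", "phylogeny", "align-to-tree-mafft-fasttree",
        "--i-sequences", rep_seqs,
        "--o-alignment", f"{out_dir}/aligned-rep-seqs.qza",
        "--o-masked-alignment", f"{out_dir}/masked-aligned-rep-seqs.qza",
        "--o-tree", f"{out_dir}/unrooted-tree.qza",
        "--o-rooted-tree", f"{out_dir}/rooted-tree.qza",
    ]


cmd = align_to_tree_command("../results/rep-seqs-filtered.qza", "../results/08_tree")
print(" ".join(cmd))
```

Running the steps one at a time, as this notebook does, makes it easier to inspect each intermediate artifact; the pipeline is simply more compact.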

In [None]:
# --- Step 1: Sequence Alignment (MAFFT) ---

print("--- 1. Starting Step 1: Sequence Alignment (MAFFT) ---")
print("This may take a few minutes...")

# Define input path
REP_SEQS_FILTERED_QZA = "../results/rep-seqs-filtered.qza" 
# Define output path
ALIGNED_SEQS_QZA = "../results/08_tree/aligned-rep-seqs.qza"

# We must create the output directory first
!mkdir -p ../results/08_tree

# Run the alignment command
# NOTE: Using '--p-n-threads auto' to correctly use all available cores.
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime alignment mafft \
    --i-sequences {REP_SEQS_FILTERED_QZA} \
    --o-alignment {ALIGNED_SEQS_QZA} \
    --p-n-threads auto

print("\n--- 2. Alignment Finished. Verifying output file ---")
!ls -lh {ALIGNED_SEQS_QZA}

### 1.2 Step 2: Mask Alignment

Now that we have our aligned sequences, the next step is to "mask" the alignment: filter out columns that are phylogenetically uninformative, such as positions that are mostly gaps or poorly conserved across sequences. These noisy positions can introduce errors into the phylogenetic tree construction.
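
To make the idea of masking concrete, here is a toy sketch in plain Python. It is not the QIIME 2 implementation: the function `mask_alignment` and its threshold values are illustrative assumptions, loosely modeled on the gap-frequency and conservation filters that `qiime alignment mask` exposes (`--p-max-gap-frequency`, `--p-min-conservation`).

```python
from collections import Counter


def mask_alignment(seqs, max_gap_frequency=0.5, min_conservation=0.4):
    """Drop alignment columns that are too gappy or too poorly conserved.

    Toy illustration only; thresholds here are assumptions, not QIIME defaults.
    """
    n = len(seqs)
    keep = []
    for col in range(len(seqs[0])):
        column = [s[col] for s in seqs]
        gap_freq = column.count("-") / n
        residues = [c for c in column if c != "-"]
        # Conservation: share of sequences carrying the most common residue
        conservation = Counter(residues).most_common(1)[0][1] / n if residues else 0.0
        if gap_freq <= max_gap_frequency and conservation >= min_conservation:
            keep.append(col)
    return ["".join(s[c] for c in keep) for s in seqs]


toy_alignment = ["AC-GT", "AC-GA", "ACTGC", "AG-GG"]
masked = mask_alignment(toy_alignment)
print(masked)
```

In this toy alignment, the third column is removed because it is mostly gaps, and the fifth because no residue is shared by enough sequences.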

In [None]:
# ---  Step 2: Mask Alignment ---

print("--- 1. Starting Step 2: Masking noisy alignment positions ---")

# Define input path (from Cell 5)
ALIGNED_SEQS_QZA = "../results/08_tree/aligned-rep-seqs.qza"
# Define output path
MASKED_ALIGNED_SEQS_QZA = "../results/08_tree/masked-aligned-rep-seqs.qza"

# Run the mask command
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime alignment mask \
    --i-alignment {ALIGNED_SEQS_QZA} \
    --o-masked-alignment {MASKED_ALIGNED_SEQS_QZA}

print("\n--- 2. Masking Finished. Verifying output file ---")
!ls -lh {MASKED_ALIGNED_SEQS_QZA}

### 1.3 Step 3: Build Unrooted Tree

With our clean, masked alignment, we can now build the actual phylogenetic tree. We will use `FastTree`, a very fast algorithm, to generate an unrooted tree.

In [None]:
# ---  Step 3: Build Unrooted Tree (FastTree) ---

print("--- 1. Starting Step 3: Building unrooted tree with FastTree ---")
print("This may take a few minutes...")

# Define input path (from Cell 7)
MASKED_ALIGNED_SEQS_QZA = "../results/08_tree/masked-aligned-rep-seqs.qza"
# Define output path
UNROOTED_TREE_QZA = "../results/08_tree/unrooted-tree.qza"

# Run the FastTree command
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime phylogeny fasttree \
    --i-alignment {MASKED_ALIGNED_SEQS_QZA} \
    --o-tree {UNROOTED_TREE_QZA} \
    --p-n-threads auto # Use all available cores

print("\n--- 2. Tree Building Finished. Verifying output file ---")
!ls -lh {UNROOTED_TREE_QZA}

### 1.4 Step 4: Root the Tree

We now have an unrooted tree. For phylogenetic diversity metrics like UniFrac, we need a rooted tree, which provides an evolutionary direction (i.e., an oldest common ancestor).

We will use "midpoint rooting," a common method that places the root at the midpoint of the longest path between any two ASVs in the tree.
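
As a rough illustration of midpoint rooting, the sketch below finds the longest tip-to-tip path in a tiny hand-made tree and reports the branch on which the midpoint falls. The tree, the node names, and the helpers `path_lengths` and `midpoint` are all hypothetical; QIIME 2 does this work internally on the real tree.

```python
def path_lengths(tree, start):
    """Return {node: (distance, path)} from `start`; the tree has no cycles."""
    out = {start: (0.0, [start])}
    stack = [start]
    while stack:
        node = stack.pop()
        dist, path = out[node]
        for nbr, weight in tree[node].items():
            if nbr not in out:
                out[nbr] = (dist + weight, path + [nbr])
                stack.append(nbr)
    return out


def midpoint(tree):
    """Return (node_a, node_b, offset): the root sits `offset` along a->b."""
    tips = [n for n, nbrs in tree.items() if len(nbrs) == 1]
    best_dist, best_path = 0.0, None
    for tip in tips:
        for other, (dist, path) in path_lengths(tree, tip).items():
            if other in tips and dist > best_dist:
                best_dist, best_path = dist, path
    half = best_dist / 2
    walked = 0.0
    for a, b in zip(best_path, best_path[1:]):
        weight = tree[a][b]
        if walked + weight >= half:
            return (a, b, half - walked)
        walked += weight


# Toy unrooted tree as a weighted adjacency map (branch lengths are made up)
toy_tree = {
    "A": {"X": 1.0}, "B": {"X": 2.0}, "C": {"Y": 5.0}, "D": {"Y": 1.0},
    "X": {"A": 1.0, "B": 2.0, "Y": 1.0},
    "Y": {"X": 1.0, "C": 5.0, "D": 1.0},
}
print(midpoint(toy_tree))
```

Here the two most distant tips are B and C (total path length 8.0), so the root is placed 4.0 from each, which lands on the branch between Y and C.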

In [None]:
# --- Step 4: Root the Tree (Midpoint Rooting) ---

print("--- 1. Starting Step 4: Applying midpoint rooting ---")

# Define input path (from Cell 9)
UNROOTED_TREE_QZA = "../results/08_tree/unrooted-tree.qza"
# Define output path
ROOTED_TREE_QZA = "../results/08_tree/rooted-tree.qza"

# Run the rooting command
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime phylogeny midpoint-root \
    --i-tree {UNROOTED_TREE_QZA} \
    --o-rooted-tree {ROOTED_TREE_QZA}

print("\n--- 2. Rooting Finished. Verifying final tree file ---")
!ls -lh {ROOTED_TREE_QZA}

### 2. Calculate Core Diversity Metrics

We have successfully built our rooted phylogenetic tree. We now have all three required components for our main analysis: the filtered table, the rooted tree, and the sample metadata.

We will now run the `qiime diversity core-metrics-phylogenetic` pipeline. This powerful command will:
1.  **Rarefy** our feature table to a specified sampling depth (normalization).
2.  Calculate several **Alpha Diversity** metrics (e.g., Shannon, Faith's PD).
3.  Calculate several **Beta Diversity** metrics (e.g., Bray-Curtis, Jaccard, Weighted & Unweighted UniFrac).
4.  Generate **PCoA (Principal Coordinates Analysis)** results for all beta diversity metrics.

**Sampling Depth:**
Based on our analysis in Notebook 03 (Cell 8), our lowest-depth sample had **2,957** reads. We will use this value for `--p-sampling-depth`: every sample is randomly subsampled to 2,957 reads so that comparisons are made at equal depth, and because no sample falls below this threshold, none is discarded.
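
The effect of the sampling depth can be sketched with a toy rarefaction in plain Python. This is only an illustration of subsampling without replacement; the function `rarefy`, the sample names, and the counts are hypothetical, and a depth of 5 stands in for the real value.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample a {feature: count} dict to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads, mirroring how
    samples below the sampling depth are dropped.
    """
    if sum(counts.values()) < depth:
        return None
    # Expand counts into one entry per read, then draw `depth` of them
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    drawn = random.Random(seed).sample(reads, depth)
    rarefied = {}
    for feature in drawn:
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied


toy_samples = {
    "s1": {"asv1": 6, "asv2": 4},   # 10 reads: subsampled down to 5
    "s2": {"asv1": 3, "asv3": 2},   # exactly 5 reads: kept unchanged
    "s3": {"asv2": 1},              # 1 read: dropped
}
rarefied = {name: rarefy(counts, depth=5) for name, counts in toy_samples.items()}
print(rarefied)
```

Setting the depth to the size of the smallest sample, as we do here, corresponds to the `s2` case: that sample is kept as-is and nothing is discarded.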

In [None]:
# ---  Run the Core Diversity Metrics Pipeline ---

print("--- 1. Starting Core Diversity Metrics Pipeline ---")
print("This step will calculate all alpha/beta metrics and may take several minutes...")

# Define inputs
ROOTED_TREE_QZA = "../results/08_tree/rooted-tree.qza"
TABLE_FILTERED_QZA = "../results/table-filtered.qza"
METADATA_TSV = "../data/metadata.tsv"

# Define the sampling depth (from Notebook 03, Cell 8)
SAMPLING_DEPTH = 2957 

# Define the *output directory*
CORE_METRICS_DIR = "../results/09_core_metrics"

# Run the core-metrics-phylogenetic command
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime diversity core-metrics-phylogenetic \
    --i-table {TABLE_FILTERED_QZA} \
    --i-phylogeny {ROOTED_TREE_QZA} \
    --m-metadata-file {METADATA_TSV} \
    --p-sampling-depth {SAMPLING_DEPTH} \
    --output-dir {CORE_METRICS_DIR}

print("\n--- 2. Core Metrics Finished. Verifying output directory ---")
print(f"Contents of {CORE_METRICS_DIR}:")
!ls -lh {CORE_METRICS_DIR}

### 3. Analyze Alpha Diversity (Statistical Significance)

We have successfully generated all our core metrics. The folder `09_core_metrics` contains multiple alpha diversity vectors (e.g., `faith_pd_vector.qza`, `shannon_vector.qza`).

Our next objective is to answer our first key scientific question: **"Is there a statistically significant difference in alpha diversity between the experimental groups (Crohn's Disease vs. Healthy)?"**

We will use the `alpha-group-significance` command, which runs Kruskal-Wallis tests across groups, to find this out. We will test two important metrics:
1.  **Faith's PD:** A phylogenetic metric (uses the tree).
2.  **Shannon's Index:** A non-phylogenetic metric.

The command reports results for every categorical metadata column; we will focus on the `gastrointest_disord` column in our metadata.
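
For intuition about what a non-phylogenetic alpha metric measures, here is a minimal sketch of Shannon entropy computed from a vector of feature counts. The function name `shannon_entropy` and the toy counts are assumptions for illustration, as is the choice of log base 2; the values QIIME 2 reports come from its own implementation.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (log base 2) of a list of feature counts for one sample."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in proportions)


even_sample = [25, 25, 25, 25]   # four equally abundant features
skewed_sample = [97, 1, 1, 1]    # one dominant feature
print(shannon_entropy(even_sample), shannon_entropy(skewed_sample))
```

A community with evenly distributed features scores higher than one dominated by a single feature, even when both contain the same number of features.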

In [None]:
# ---  Alpha Diversity Significance (Faith's PD) [Corrected] ---

print("--- 1. Running Alpha Diversity Test (Faith's PD) ---")
# (v2020.8 Correction: Removed '--m-metadata-column' parameter.
# The command tests every *categorical* metadata column automatically.)

# Define input paths
ALPHA_FAITH_PD_QZA = "../results/09_core_metrics/faith_pd_vector.qza"
METADATA_TSV = "../data/metadata.tsv"
# Define output path
ALPHA_FAITH_PD_QZV = "../results/09_core_metrics/faith-pd-significance.qzv"

# Run the command (without the removed parameter)
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime diversity alpha-group-significance \
    --i-alpha-diversity {ALPHA_FAITH_PD_QZA} \
    --m-metadata-file {METADATA_TSV} \
    --o-visualization {ALPHA_FAITH_PD_QZV}

print("\n--- 2. Test Finished. Verifying output file ---")
!ls -lh {ALPHA_FAITH_PD_QZV}

In [None]:
# --- Alpha Diversity Significance (Shannon) [Corrected] ---

print("--- 1. Running Alpha Diversity Test (Shannon) ---")
# (v2020.8 Correction: Removed '--m-metadata-column' parameter.)

# Define input paths
ALPHA_SHANNON_QZA = "../results/09_core_metrics/shannon_vector.qza"
METADATA_TSV = "../data/metadata.tsv"
# Define output path
ALPHA_SHANNON_QZV = "../results/09_core_metrics/shannon-significance.qzv"

# Run the command (without the removed parameter)
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime diversity alpha-group-significance \
    --i-alpha-diversity {ALPHA_SHANNON_QZA} \
    --m-metadata-file {METADATA_TSV} \
    --o-visualization {ALPHA_SHANNON_QZV}

print("\n--- 2. Test Finished. Verifying output file ---")
!ls -lh {ALPHA_SHANNON_QZV}

### 4. Analyze Beta Diversity (Statistical Significance)

We have successfully analyzed alpha diversity. Now we will move to beta diversity to answer the question: **"Is the *overall community composition* significantly different between the Crohn's Disease and Healthy groups?"**

We already generated the distance matrices (e.g., `unweighted_unifrac_distance_matrix.qza`) and PCoA plots (e.g., `unweighted_unifrac_emperor.qzv`) in Cell 13.

The PCoA plots let us *visualize* the clustering, but we need a statistical test (PERMANOVA) to get a p-value. We will use the `beta-group-significance` command for this. We will test two important metrics:
1.  **Unweighted UniFrac:** A phylogenetic metric (uses the tree), sensitive to rare organisms.
2.  **Bray-Curtis:** A non-phylogenetic metric, based on abundance.
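
To show what an abundance-based distance looks like, the sketch below computes Bray-Curtis dissimilarity between two samples by hand. The function `bray_curtis` and the count vectors are hypothetical; the distance matrix used in this notebook is the one produced by the core metrics pipeline.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two equal-length count vectors.

    0.0 means identical abundances; 1.0 means no shared features.
    """
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator


sample_a = [6, 4, 0]
sample_b = [4, 6, 0]
sample_c = [0, 0, 10]
print(bray_curtis(sample_a, sample_b))  # similar communities
print(bray_curtis(sample_a, sample_c))  # no features in common
```

PERMANOVA then asks whether distances like these are, on average, larger between groups than within groups.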

In [None]:
# --- Beta Diversity Significance (Unweighted UniFrac) ---

print("--- 1. Running Beta Diversity Test (Unweighted UniFrac vs. gastrointest_disord) ---")

# Define input paths
BETA_UNWEIGHTED_UNIFRAC_QZA = "../results/09_core_metrics/unweighted_unifrac_distance_matrix.qza"
METADATA_TSV = "../data/metadata.tsv"
# Define output path
BETA_UNWEIGHTED_UNIFRAC_QZV = "../results/09_core_metrics/unweighted-unifrac-significance.qzv"

# Run the command
# NOTE: Unlike alpha-group-significance, this command *does* require '--m-metadata-column' in v2020.8
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime diversity beta-group-significance \
    --i-distance-matrix {BETA_UNWEIGHTED_UNIFRAC_QZA} \
    --m-metadata-file {METADATA_TSV} \
    --m-metadata-column gastrointest_disord \
    --o-visualization {BETA_UNWEIGHTED_UNIFRAC_QZV} \
    --p-permutations 9999 # Use a high number of permutations for a more accurate p-value

print("\n--- 2. Test Finished. Verifying output file ---")
!ls -lh {BETA_UNWEIGHTED_UNIFRAC_QZV}

In [None]:
# --- Beta Diversity Significance (Bray-Curtis) ---

print("--- 1. Running Beta Diversity Test (Bray-Curtis vs. gastrointest_disord) ---")

# Define input paths
BETA_BRAY_CURTIS_QZA = "../results/09_core_metrics/bray_curtis_distance_matrix.qza"
METADATA_TSV = "../data/metadata.tsv"
# Define output path
BETA_BRAY_CURTIS_QZV = "../results/09_core_metrics/bray-curtis-significance.qzv"

# Run the command
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime diversity beta-group-significance \
    --i-distance-matrix {BETA_BRAY_CURTIS_QZA} \
    --m-metadata-file {METADATA_TSV} \
    --m-metadata-column gastrointest_disord \
    --o-visualization {BETA_BRAY_CURTIS_QZV} \
    --p-permutations 9999

print("\n--- 2. Test Finished. Verifying output file ---")
!ls -lh {BETA_BRAY_CURTIS_QZV}

### 5. Conclusion & Next Steps

This notebook generated the core diversity results and the statistical tests for our analysis.

We have successfully:
1.  **Built a Phylogenetic Tree:** We generated a rooted phylogenetic tree (`rooted-tree.qza`) from our clean sequences, which is essential for phylogenetic-aware metrics.
2.  **Calculated Core Metrics:** We ran the `core-metrics-phylogenetic` pipeline, which rarefied our table to **2,957 reads** and generated all our key alpha and beta diversity artifacts (distance matrices, PCoA results, and alpha vectors).
3.  **Generated Statistical Tests:** We ran the `alpha-group-significance` and `beta-group-significance` commands.

**Final Generated Artifacts (The "Locked Boxes"):**
* `faith-pd-significance.qzv`
* `shannon-significance.qzv`
* `unweighted-unifrac-significance.qzv`
* `bray-curtis-significance.qzv`

**Next Steps:**
This notebook's job was to *generate* the results. Our next notebook, **Notebook 05**, will be dedicated entirely to **interpreting** them. We will "unlock" these `.qzv` files, extract the p-values and other data, and finally answer our scientific questions.
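
As a small preview of what "unlocking" involves: a `.qzv` file is a zip archive whose contents sit under a top-level directory named after the artifact's UUID, with the results in a `data/` subfolder. The sketch below lists those data files using only the standard library. The helper `list_data_files` is hypothetical, and the demo runs on a tiny stand-in archive rather than a real QIIME 2 result.

```python
import os
import tempfile
import zipfile


def list_data_files(qzv_path):
    """Return the entries stored under '<uuid>/data/' inside a .qzv archive."""
    with zipfile.ZipFile(qzv_path) as archive:
        names = archive.namelist()
    data_files = []
    for name in names:
        parts = name.split("/")
        if len(parts) > 2 and parts[1] == "data" and not name.endswith("/"):
            data_files.append(name)
    return data_files


# Demo on a stand-in archive (made-up UUID and contents, not a real result)
demo_dir = tempfile.mkdtemp()
demo_qzv = os.path.join(demo_dir, "demo.qzv")
with zipfile.ZipFile(demo_qzv, "w") as zf:
    zf.writestr("1234-uuid/metadata.yaml", "uuid: 1234-uuid\n")
    zf.writestr("1234-uuid/data/index.html", "<html></html>")

print(list_data_files(demo_qzv))
```

Pointing the same function at one of the `.qzv` files listed above would show which files are available to parse in Notebook 05.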