#  STEP 6: Phylogenetics R&D (Phase 8)

**Goal:** To convert our final VCF into an alignment, and then use that alignment to build our final, publication-ready Phylogenetic Tree.

**Why (The Final Goal):**
This is the scientific climax of the project.
1.  **Conversion (The "Recipe"):** Our "Golden File" (`ANALYSIS_READY.vcf.gz`) is a *list of differences*. Tree-builders like `IQ-TREE` need an *alignment* (a FASTA or PHYLIP file). We will test our new conversion script (`vcf2phylip.py`) to perform this conversion.
2.  **Tree Building (The "R&D"):** We will test our primary analysis tool (`iqtree`) on this new alignment to generate the final tree.
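
To make the "list of differences" versus "alignment" distinction concrete, here is a minimal sketch of the idea behind the conversion. It is not how `vcf2phylip.py` is implemented: the records, the sample names (`S1` to `S3`), and the helper `vcf_records_to_alignment` are hypothetical, and haploid calls are assumed.

```python
# Toy illustration only: three hypothetical SNP sites, three hypothetical samples.
# Each record is (REF allele, ALT allele, call per sample),
# where 0 = REF, 1 = ALT, None = missing genotype.
records = [
    ("A", "G", {"S1": 0, "S2": 1, "S3": 1}),
    ("C", "T", {"S1": 0, "S2": 0, "S3": 1}),
    ("G", "A", {"S1": 1, "S2": None, "S3": 0}),
]


def vcf_records_to_alignment(records):
    """Turn per-site genotype calls into one sequence per sample."""
    samples = sorted(records[0][2])
    alignment = {name: [] for name in samples}
    for ref, alt, calls in records:
        alleles = (ref, alt)
        for name in samples:
            call = calls[name]
            # Missing calls become 'N' so every sequence keeps the same length.
            alignment[name].append("N" if call is None else alleles[call])
    return {name: "".join(bases) for name, bases in alignment.items()}


alignment = vcf_records_to_alignment(records)
for name, seq in alignment.items():
    print(name, seq)
```

Each VCF row becomes one alignment column, and each sample becomes one row of the alignment, which is the shape a tree-builder expects.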

##  1: Handoff from "The Factory" (Phase 7 Success)

**The Handoff (Done):**
We successfully updated our `Snakefile` (V6.0) to automate the `merge` and `filter` "recipes" developed in `Notebook 06`. We then executed this final factory step from the terminal:

```bash
snakemake --cores 1 --rerun-triggers mtime
```

##  2: R&D Test (VCF-to-PHYLIP Conversion)

**Goal:** Test our newly installed local script (`vcf2phylip.py`) to convert our "Golden VCF" (`ANALYSIS_READY.vcf.gz`) into a PHYLIP alignment.

**Why (The "Recipe"):**
Our tree-building tool, `IQ-TREE`, cannot read a VCF file. It requires a multiple sequence alignment (MSA) format, such as PHYLIP or FASTA. This script will create that necessary input file.

**R&D Test:**
We will run the `vcf2phylip.py` script and verify that it produces a `.phy` file.
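
Checking that a `.phy` file exists does not show that its content is sensible. A PHYLIP file starts with a header line giving the number of sequences and the number of sites, so a small reader can confirm both. The sketch below is one possible check; the helper `read_phylip_header` and the demo file are hypothetical and stand in for the real alignment.

```python
import os
import tempfile


def read_phylip_header(path):
    """Return (n_taxa, n_sites) from the first line of a PHYLIP file."""
    with open(path) as handle:
        n_taxa, n_sites = handle.readline().split()[:2]
    return int(n_taxa), int(n_sites)


# Demo on a tiny, made-up PHYLIP file (3 sequences, 4 sites).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.phy")
    with open(demo_path, "w") as handle:
        handle.write("3 4\nS1 ACGT\nS2 ACGA\nS3 TCGA\n")
    header = read_phylip_header(demo_path)

print(header)
```

On the real output, the first number should match the number of samples in the VCF and the second should match the number of SNPs the script reports as passing its filters.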

In [3]:
import os
import sys

print("--- STEP 2: R&D Test (VCF to PHYLIP Conversion) ---")

# --- 1. Define Paths (Relative to 'notebooks/' CWD) ---
script_path = "../scripts/vcf2phylip.py"
input_vcf = "../results/variant_calling/ANALYSIS_READY.vcf.gz"
output_dir = "../results/phylogenetics"
os.makedirs(output_dir, exist_ok=True)

# --- [THE FIX (Rule 4)] ---
# The prefix should be *only* the base name, not the full path.
output_prefix = "ANALYSIS_READY"
# --- [END FIX] ---

# --- 2. Build the Conversion "Recipe" ---
# -i : Input (our "Golden VCF")
# --output-folder : Where to save the result
# --output-prefix : The base name for the new file
command = f"python3 {script_path} -i {input_vcf} --output-folder {output_dir} --output-prefix {output_prefix}"

print(f"Starting conversion... (This should be very fast for 15M VCF)")

# --- 3. Run the R&D Test ---
!{command}

print("\nConversion complete.")

# --- 4. Verification (The Proof) ---
# The script *automatically* adds ".min4.phy"
# We must build the *correct* verification path
output_file = f"{output_dir}/{output_prefix}.min4.phy"

print(f"\n--- Verification: Checking for PHYLIP alignment file ---")
!ls -lh {output_file}

--- STEP 2: R&D Test (VCF to PHYLIP Conversion) ---
Starting conversion... (This should be very fast for 15M VCF)

Converting file '../results/variant_calling/ANALYSIS_READY.vcf.gz':

Number of samples in VCF: 93
Total of genotypes processed: 196771
Genotypes excluded because they exceeded the amount of missing data allowed: 77638
Genotypes that passed missing data filter but were excluded for being MNPs: 0
SNPs that passed the filters: 119133

Sample 1 of 93, 'PA097', added to the nucleotide matrix(ces).
Sample 2 of 93, 'PA096', added to the nucleotide matrix(ces).
Sample 3 of 93, 'PA095', added to the nucleotide matrix(ces).
Sample 4 of 93, 'PA093', added to the nucleotide matrix(ces).
Sample 5 of 93, 'PA092', added to the nucleotide matrix(ces).
Sample 6 of 93, 'PA091', added to the nucleotide matrix(ces).
Sample 7 of 93, 'PA010', added to the nucleotide matrix(ces).
Sample 8 of 93, 'PA090', added to the nucleotide matrix(ces).
Sample 9 of 93, 'PA089', added to the nucleotide matrix
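
The counts in the conversion log can be cross-checked with simple arithmetic: the records that passed should equal the total minus the two excluded groups. The snippet below only reuses the numbers printed above; the variable names are ours.

```python
# Counts copied from the conversion log above.
total_genotypes = 196771
excluded_missing = 77638
excluded_mnp = 0
reported_snps = 119133

# Records retained = total minus both exclusion categories.
retained = total_genotypes - excluded_missing - excluded_mnp
fraction_retained = retained / total_genotypes

print(retained, retained == reported_snps)
print(f"{fraction_retained:.1%} of records retained")
```

The numbers agree, and roughly 60% of the records survive the missing-data filter. The `.min4` suffix in the output name appears to reflect the script's minimum-samples-per-locus threshold (4 in this run); we treat that as an assumption rather than a confirmed detail of the script.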

##  3: R&D Test (Building the Phylogenetic Tree)

**Goal:** Run `IQ-TREE` on our new PHYLIP alignment (`.phy`) to generate the final phylogenetic tree.

**Why (The Final Scientific Goal - Goal 1):**
This is the scientific "climax" of the project. We will use our newly installed `iqtree` tool to build the tree.

**R&D Test (The "Recipe" - Goal 2):**
We will use a professional-grade command:
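1.  **`-s`**: The input alignment (`.phy`) file (the 11M file we just made).
2.  **`-m MFP`**: "ModelFinder Plus". This is a key skill (Goal 2). Instead of "guessing" a model (like GTR), we ask `IQ-TREE` to *test a set of candidate substitution models*, *choose the best-fitting one* for our data, and then build the tree under that model.
3.  **`-b 1000`**: "Standard (nonparametric) Bootstrap" with 1000 replicates. This is critical. It runs 1000 "mini-experiments", each with its own full tree search, to test how "confident" we are in each branch of the tree. (e.g., "Are PA001 and PA002 *really* related?") Note that the much faster "Ultrafast Bootstrap" is the uppercase flag, `-B 1000`; lowercase `-b` is the slow, classical version.
4.  **`--prefix`**: "Prefix". `IQ-TREE` creates *many* output files. We will give it a "prefix" that includes a "directory" to keep our project clean (Rule 4). We use `--prefix` rather than `-p` because in IQ-TREE 2 and later `-p` specifies a partition file.
5.  **`-T 8`**: "Threads". This is a *very heavy* computation. We will tell it to use 8 vCPUs to be fast.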
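
To show what one bootstrap "mini-experiment" means, here is a conceptual sketch of nonparametric bootstrap resampling: alignment columns are drawn with replacement to build a pseudo-alignment of the same length, and a tree is then rebuilt from it. This is not IQ-TREE's code; the toy alignment and the helper `bootstrap_alignment` are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices; some columns repeat, others drop out.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


# Hypothetical toy alignment: 3 samples, 6 sites.
toy = {"S1": "ACGTAC", "S2": "ACGTTC", "S3": "TCGAAC"}
rng = random.Random(0)  # fixed seed so the demo is repeatable
replicate = bootstrap_alignment(toy, rng)

for name, seq in replicate.items():
    print(name, seq)
```

The support value for a branch is then the share of replicate trees that contain that branch. Because every replicate needs its own tree search, 1000 replicates on 93 samples is a long computation.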

In [None]:
import os
import sys

print("--- STEP 3: R&D Test (Building the Final Tree with IQ-TREE) ---")

# --- 1. Define Paths (Relative to 'notebooks/' CWD) ---
input_phy = "../results/phylogenetics/ANALYSIS_READY.min4.phy"
output_dir = "../results/phylogenetics/iqtree_analysis"
os.makedirs(output_dir, exist_ok=True)

# This will be the "prefix" for all of IQ-TREE's output files
output_prefix = f"{output_dir}/MDR_PA_TREE"

# --- 2. Build the Tree "Recipe" (THE CORRECTED COMMAND) ---
# --- [THE FIX (Rule 4)] ---
# We use the *explicit* flag "--prefix" instead of the
# *ambiguous* flag "-p" to define the output path.
command = f"iqtree -s {input_phy} --prefix {output_prefix} -m MFP -b 1000 -T 8"
# --- [END FIX] ---

print("Starting IQ-TREE... (This is the FINAL, HEAVY step)")
print(f"This *will* take a long time (10-30+ minutes). Do not stop it.")

# --- 3. Run the R&D Test ---
!{command}

print("\nIQ-TREE complete.")

# --- 4. Verification (The FINAL Proof) ---
# The *most important* output file ends in ".treefile"
final_tree_file = f"{output_prefix}.treefile"

print(f"\n--- Verification: Checking for the FINAL '.treefile' ---")
!ls -lh {final_tree_file}

--- STEP 3: R&D Test (Building the Final Tree with IQ-TREE) ---
Starting IQ-TREE... (This is the FINAL, HEAVY step)
This *will* take a long time (10-30+ minutes). Do not stop it.
IQ-TREE version 3.0.1 for Linux x86 64-bit built Jul  9 2025
Developed by Bui Quang Minh, Thomas Wong, Nhan Ly-Trong, Huaiyan Ren
Contributed by Lam-Tung Nguyen, Dominik Schrempf, Chris Bielow,
Olga Chernomor, Michael Woodhams, Diep Thi Hoang, Heiko Schmidt

Host:    bioinformatics (AVX512, FMA3, 31 GB RAM)
Command: iqtree -s ../results/phylogenetics/ANALYSIS_READY.min4.phy --prefix ../results/phylogenetics/iqtree_analysis/MDR_PA_TREE -m MFP -b 1000 -T 8
Seed:    345248 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Sun Nov  2 02:32:22 2025
Kernel:  AVX+FMA - 8 threads (8 CPU cores detected)

Reading alignment file ../results/phylogenetics/ANALYSIS_READY.min4.phy ... Phylip format detected
Alignment most likely contains DNA/RNA sequences
Constructing alignment: done in 0.216781 secs using 8

##  3.5: Verification of the "Pre-Bootstrap" Tree

**Analysis:**
We interrupted the `iqtree` command *during* the heavy `-b 1000` (Standard Bootstrap) phase.

**Verification (Answering Q1):**
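However, a `.treefile` was *already* on disk when we stopped the run. We assume it holds a tree *without* confidence values. The `IQ-TREE` `.log` file is the place to confirm which stage wrote it, because with `-b` the tool also runs a full tree search for every bootstrap replicate. We will now verify that this file exists.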

In [1]:
# --- Verification (The Proof of Q1) ---
# We are just checking if the main tree file was saved
# (This is the *same* verification line from the end of the failed Cell 6)
final_tree_file = "../results/phylogenetics/iqtree_analysis/MDR_PA_TREE.treefile"

print(f"\n--- Verification: Checking for the MAIN '.treefile' ---")
!ls -lh {final_tree_file}


--- Verification: Checking for the MAIN '.treefile' ---
-rw-rw-r-- 1 refm_youssef refm_youssef 3.1K Nov  2 04:41 ../results/phylogenetics/iqtree_analysis/MDR_PA_TREE.treefile
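
A file listing shows that the tree file exists, but not what it contains. A light content check is to count the leaves in the Newick string and look for internal-node labels, which is where support values would appear. The sketch below is an assumption-laden helper, not part of IQ-TREE: `newick_leaf_names` and `has_support_values` are hypothetical, and they only handle simple unquoted labels.

```python
import re


def newick_leaf_names(newick):
    """Extract leaf names: labels that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


def has_support_values(newick):
    """Internal-node labels (e.g. bootstrap support) sit right after ')'."""
    return re.search(r"\)\s*[^():,;\s]+", newick) is not None


# Hypothetical toy trees, with and without a support label.
plain_tree = "((A:0.1,B:0.2):0.05,C:0.3,D:0.4);"
supported_tree = "((A:0.1,B:0.2)95:0.05,C:0.3,D:0.4);"

leaves = newick_leaf_names(plain_tree)
print(len(leaves), leaves)
print(has_support_values(plain_tree), has_support_values(supported_tree))
```

Applied to the contents of `MDR_PA_TREE.treefile`, the leaf count would be expected to equal the 93 samples in the alignment, and the absence of internal labels would be consistent with a tree that carries no bootstrap support yet.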
