#  STEP 5: VCF Merging and Filtering (Phase 7 R&D)

**Goal:** Combine all 93 high-quality VCF files into a single "master" VCF, and then filter this file for high-confidence SNPs.

**Why (The "Handoff" to the Tree):**
This is the final step before building the phylogenetic tree (Goal 1).
1.  **Merging:** We cannot build one tree from 93 different files. We must merge them into a single "matrix" file using `bcftools merge`.
2.  **Filtering (R&D):** The "master" VCF will contain *all* variants (SNPs). We must develop an R&D recipe (`bcftools filter`) to select *only* the high-quality SNPs (e.g., `QUAL > 30`) that we trust for building our tree.

## 1: Handoff from "The Factory" (Phase 6 Success)

**The Handoff (Done):**
We successfully updated our `Snakefile` (V5.0) to use the 93-sample "clean cohort" (`metadata_final_cohort.csv`) and the correct haploid recipe (`bcftools call --ploidy 1`).

We then executed the high-throughput job from the terminal:

```bash
snakemake --cores 8 --rerun-triggers mtime

## STEP 2: R&D Test (Merging 93 VCFs)

**Goal:** Test the `bcftools merge` command to combine all 93 high-quality VCFs into one "Master VCF".

**Why (The "Recipe"):**
We cannot pass 93 file names to the command line. The "clean" (and fast) way to do this is a two-part recipe:
1.  **Generate a List:** We will use `pandas` to read our 93-sample "brain" (`metadata_final_cohort.csv`) and generate a simple text file that lists the *paths* to all 93 `.vcf.gz` files.
2.  **Run the Merge:** We will then use the `-l` flag (`bcftools merge -l [file_list]`) to merge them all in one efficient operation.

In [7]:
import pandas as pd
import os

print("--- STEP 2.A: Generating the file list (the 'input') ---")

# --- 1. Define Paths ---
cohort_file = "../results/metadata/metadata_final_cohort.csv"
vcf_dir_relative_to_notebook = "../results/variant_calling"
list_file_out = f"{vcf_dir_relative_to_notebook}/vcf_list_for_merge.txt"

# --- [UNCHANGED] The list paths *must* be relative to the PROJECT ROOT
vcf_dir_relative_to_root = "results/variant_calling"

# --- 2. Load the 93-sample "Brain" ---
print(f"Loading the 93-sample 'brain' from: {cohort_file}")
df_clean = pd.read_csv(cohort_file)

# --- 3. Create the list of *CORRECT* (Root-Relative) paths ---
vcf_files_list = [f"{vcf_dir_relative_to_root}/{sample_id}.vcf.gz" for sample_id in df_clean['sample_id']]
print(f"Generated {len(vcf_files_list)} paths for the merge list.")

# --- 4. Save the list to the text file (The "Handoff") ---
with open(list_file_out, 'w') as f:
    for path in vcf_files_list:
        f.write(f"{path}\n")

print(f"SUCCESS: Created (and OVERWROTE): {list_file_out}")
print("--- Verification (first 5 files in list - Correct for Factory): ---")
!head -n 5 {list_file_out}

print("\n--- STEP 2.B: Running the Merge (The R&D Simulation) ---")

# --- 5. Define the Merge "Recipe" ---
# (Paths *relative to the root* for the command)
list_file_for_root = "results/variant_calling/vcf_list_for_merge.txt"
master_vcf_for_root = "results/variant_calling/MASTER_MERGED.vcf.gz"

# --- [THE FIX (Rule 4)] ---
# We "simulate" the Factory's CWD by using (cd ../ && ...)
# This tells the shell: "1. Go to the root. 2. Run bcftools from there."
command = f"(cd ../ && bcftools merge -l {list_file_for_root} -O z -o {master_vcf_for_root})"

print("Starting merge... (This should be fast, ~1 min)")

# --- 6. Run the R&D Test ---
!{command}

print("\nMerge complete.")

# --- 7. Verification (The Proof) ---
# (We check the file using the *notebook's* relative path)
print(f"\n--- Verification: Checking for 'Master VCF' file ---")
!ls -lh {vcf_dir_relative_to_notebook}/MASTER_MERGED.vcf.gz

--- STEP 2.A: Generating the file list (the 'input') ---
Loading the 93-sample 'brain' from: ../results/metadata/metadata_final_cohort.csv
Generated 93 paths for the merge list.
SUCCESS: Created (and OVERWROTE): ../results/variant_calling/vcf_list_for_merge.txt
--- Verification (first 5 files in list - Correct for Factory): ---
results/variant_calling/PA097.vcf.gz
results/variant_calling/PA096.vcf.gz
results/variant_calling/PA095.vcf.gz
results/variant_calling/PA093.vcf.gz
results/variant_calling/PA092.vcf.gz

--- STEP 2.B: Running the Merge (The R&D Simulation) ---
Starting merge... (This should be fast, ~1 min)

Merge complete.

--- Verification: Checking for 'Master VCF' file ---
-rw-rw-r-- 1 refm_youssef refm_youssef 16M Nov  1 20:50 ../results/variant_calling/MASTER_MERGED.vcf.gz


##  3: R&D Test (Filtering the Master VCF)

**Goal:** Filter the `MASTER_MERGED.vcf.gz` to create a final, "analysis-ready" VCF.

**Why (The "Recipe"):**
The 16M Master VCF contains *all* variants (SNPs, INDELs, low-quality calls). For a clean phylogenetic analysis (Goal 1), we must create a final file that contains *only* high-confidence, bi-allelic SNPs.

**R&D Test:**
We will use `bcftools view` to apply two critical filters:
1.  `TYPE=="snp"`: This keeps *only* SNPs and removes all INDELs.
2.  `QUAL > 30`: This keeps *only* "High-Quality" calls (a standard threshold).

In [8]:
import os

print("--- STEP 3: Filtering the Master VCF for high-quality SNPs ---")

# --- 1. Define Paths (Relative to 'notebooks/' CWD) ---
vcf_dir = "../results/variant_calling"
input_vcf = f"{vcf_dir}/MASTER_MERGED.vcf.gz"

# This is our "FINAL" VCF file for the whole project!
output_vcf = f"{vcf_dir}/ANALYSIS_READY.vcf.gz"

# --- 2. Build the Filter "Recipe" ---
# -i : "include" only sites that match this expression
# 'TYPE=="snp" & QUAL > 30' : The filter logic
# -O z : Output compressed .gz
# -o : Output file
command = f"bcftools view -i 'TYPE==\"snp\" & QUAL > 30' -O z -o {output_vcf} {input_vcf}"

print(f"Starting filter... (This should be very fast)")

# --- 3. Run the R&D Test ---
!{command}

print("\nFilter complete.")

# --- 4. Verification (The Proof) ---
print(f"\n--- Verification: Checking for 'Analysis-Ready VCF' file ---")
!ls -lh {output_vcf}

--- STEP 3: Filtering the Master VCF for high-quality SNPs ---
Starting filter... (This should be very fast)

Filter complete.

--- Verification: Checking for 'Analysis-Ready VCF' file ---
-rw-rw-r-- 1 refm_youssef refm_youssef 15M Nov  1 20:54 ../results/variant_calling/ANALYSIS_READY.vcf.gz


## Conclusion & Handoff to "The Factory"

**Status:** Success. The R&D for the *entire* Variant Calling pipeline (Phase 1-7) is 100% complete.

**Our Achievements (The "Recipes"):**
We have now developed and tested *all* the recipes needed for our pipeline:
1.  **Merge Recipe:** We successfully tested the `bcftools merge -l [list_file]` command, creating `MASTER_MERGED.vcf.gz` (16M).
2.  **Filter Recipe:** We successfully tested the `bcftools view -i '...'` command, creating our final, clean `ANALYSIS_READY.vcf.gz` (15M).

**Next Step (The Handoff):**
This notebook (`06_...ipynb`) is now complete. We will save it, and our new recipe file (`vcf_list_for_merge.txt`), to GitHub (Rule 5).

We are ready to automate these new recipes in our main `Snakefile` (V6.0).