# MAJOR STEP 3: Mapping & Sorting (R&D)

**Goal:** Develop and test the "recipe" for Phase 5 (Mapping).

**Why:** Our data is clean (Phase 4). We now need to align (map) all 96 samples to the reference genome. Before automating this in the Snakefile, we must work through three steps here in the notebook (Rule 2: R&D Lab):
1.  Prepare (Index) the reference genome for `bwa-mem2`.
2.  Test the `bwa-mem2 mem` command on a *single sample*.
3.  Test the `samtools` commands (view, sort) to create a clean, sorted BAM file.

##  1: Download Reference Genome

**Goal:** Download the "gold standard" *P. aeruginosa* (PAO1) reference genome.

**Why:** Our "Mapping" (Phase 5) is impossible without a reference. We discovered the `data/reference_genome/` directory was missing. We will now create it and download the official RefSeq FASTA file (`NC_002516.2`).
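
The download-and-decompress step can also be sketched in pure Python so that re-running the cell does not fetch the file again or trip over an existing output. This is a minimal sketch, not the notebook's actual method: the helper names (`gunzip_to`, `looks_like_fasta`, `fetch_reference`) are hypothetical, and keeping the `.gz` archive after decompression is an assumption.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def gunzip_to(src_gz: Path, dest: Path) -> Path:
    """Decompress src_gz into dest, leaving the .gz archive in place."""
    with gzip.open(src_gz, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


def looks_like_fasta(path: Path) -> bool:
    """Cheap sanity check: a FASTA file starts with a '>' header line."""
    with open(path, "rb") as handle:
        return handle.read(1) == b">"


def fetch_reference(url: str, dest_dir: Path, name: str = "PAO1_reference.fna") -> Path:
    """Download and decompress the reference unless it is already present."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    fasta = dest_dir / name
    if fasta.exists() and looks_like_fasta(fasta):
        return fasta  # already downloaded: nothing to do on a re-run
    archive = dest_dir / (name + ".gz")
    urllib.request.urlretrieve(url, archive)  # network access happens here
    gunzip_to(archive, fasta)
    if not looks_like_fasta(fasta):
        raise ValueError(f"{fasta} does not start with a FASTA header")
    return fasta
```

Calling `fetch_reference(fasta_url, Path(output_dir))` with the URL and directory from the cell below would then return the path to the decompressed FASTA, whether it was downloaded just now or on an earlier run.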

In [None]:
# --- 1. Create the missing directory (Fixing the "No such file" error) ---
# (Using Python's os.makedirs to be safe, just like in Notebook 00)
import os
output_dir = "../data/reference_genome"
os.makedirs(output_dir, exist_ok=True)
print(f"Ensured directory exists: {output_dir}")

# --- 2. Download the Reference Genome ---
# We will use 'wget' to download it from the NCBI FTP server.
# -O : Specify the output file name (we rename it to be simple)
fasta_url = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/765/GCF_000006765.1_ASM676v1/GCF_000006765.1_ASM676v1_genomic.fna.gz"
output_file = f"{output_dir}/PAO1_reference.fna.gz"

print(f"Downloading Reference Genome from NCBI...")
# We run the command via the shell
!wget -O {output_file} {fasta_url}
print("Download complete.")

# --- 3. Verification (Rule 1) ---
# We decompress it and check the final file
# -f : overwrite an existing PAO1_reference.fna so this cell can be re-run safely
print("\nDecompressing...")
!gunzip -f {output_file}

print("\n--- Verification: Listing final reference file ---")
!ls -lh {output_dir}/PAO1_reference.fna

##  2: Index the Reference Genome

**Goal:** Create a `bwa-mem2` index from our new `PAO1_reference.fna` file.

**Why:** `bwa-mem2` cannot read the FASTA file directly for high-speed alignment. It needs this pre-computed "index" (like a phone book). This is a one-time setup step that the "Factory" (Snakefile) will eventually automate.

**R&D Test:** We will run the `bwa-mem2 index` command and verify that it creates a new set of index files in the same directory.
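
Rather than reading the `ls` output by eye, the verification can be made explicit. The sketch below assumes the suffix set that `bwa-mem2` 2.x writes next to the FASTA (`.0123`, `.amb`, `.ann`, `.bwt.2bit.64`, `.pac`); other versions may differ, and the function name `missing_index_files` is hypothetical.

```python
from pathlib import Path
from typing import List

# Suffixes written next to the FASTA by `bwa-mem2 index` (assumed: bwa-mem2 2.x).
BWA_MEM2_INDEX_SUFFIXES = (".0123", ".amb", ".ann", ".bwt.2bit.64", ".pac")


def missing_index_files(fasta: str) -> List[str]:
    """Return the expected index files that are absent or empty."""
    missing = []
    for suffix in BWA_MEM2_INDEX_SUFFIXES:
        candidate = Path(fasta + suffix)
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(str(candidate))
    return missing
```

An empty list from `missing_index_files(ref_genome_file)` means the index looks complete; a non-empty list names exactly which files to investigate.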

In [None]:
# --- 1. Define Paths ---
# (Relative to the 'notebooks/' directory)
ref_genome_file = "../data/reference_genome/PAO1_reference.fna"

print(f"Starting to index: {ref_genome_file}")
print("(This should be very fast for a ~6.3 Mbp genome)...")

# --- 2. Build and run the index command ---
# The command is simple: bwa-mem2 index [fasta_file]
# It will automatically create new files *in the same directory*
command = f"bwa-mem2 index {ref_genome_file}"

!{command}

print("\nIndexing complete.")

# --- 3. Verification (Rule 1) ---
print("\n--- Verification: Listing ALL files in the directory ---")
print("We should now see new index files (e.g., .0123, .amb, .ann, .bwt.2bit.64, .pac):")
!ls -lh ../data/reference_genome/

##  3: R&D Test (Mapping + Sorting)

**Goal:** Test the full alignment "recipe" on a *single sample*.

**Why (Rule 2: R&D Lab):**
We need to develop the exact "factory command" for our Snakefile. This "recipe" must do three things efficiently:
1.  **Map** the reads (R1 + R2) using `bwa-mem2 mem`.
2.  **Include "Read Group" info** (a "best practice" tag, `-R`, required for GATK later).
3.  **Pipe (`|`)** the output *directly* to `samtools` to convert the text (`.sam`) to binary (`.bam`) and sort it.

This avoids creating a huge, temporary `.sam` file and is the "Clean Protocol" (Rule 4) for high-throughput alignment.
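
One caveat with a plain shell pipe is that, by default, only the exit status of the last command is reported, so a crash in `bwa-mem2` could still leave a truncated BAM that looks fine to `ls`. The sketch below shows one way to build the same three-stage recipe as argument lists and check every stage. The function names (`read_group`, `mapping_commands`, `run_pipeline`) are hypothetical, and running the real pipeline assumes `bwa-mem2` and `samtools` are on the `PATH`.

```python
import subprocess
from typing import List


def read_group(sample_id: str, platform: str = "ILLUMINA") -> str:
    """Build the -R string; bwa-mem2 converts the literal "\\t" into tabs."""
    return f"@RG\\tID:{sample_id}\\tSM:{sample_id}\\tPL:{platform}"


def mapping_commands(ref: str, read_1: str, read_2: str,
                     sample_id: str, output_bam: str) -> List[List[str]]:
    """Return the three pipeline stages as argument lists (no shell quoting needed)."""
    bwa = ["bwa-mem2", "mem", "-R", read_group(sample_id), ref, read_1, read_2]
    view = ["samtools", "view", "-b", "-"]            # SAM on stdin -> BAM on stdout
    sort = ["samtools", "sort", "-o", output_bam, "-"]  # sort and write the final BAM
    return [bwa, view, sort]


def run_pipeline(commands: List[List[str]]) -> None:
    """Chain commands with pipes and raise if ANY stage exits non-zero."""
    procs = []
    for i, cmd in enumerate(commands):
        is_last = i == len(commands) - 1
        stdin = procs[-1].stdout if procs else None
        proc = subprocess.Popen(cmd, stdin=stdin,
                                stdout=None if is_last else subprocess.PIPE)
        if procs:
            # The parent drops its copy of the pipe so an early exit downstream
            # is seen by the upstream stage instead of hanging it.
            procs[-1].stdout.close()
        procs.append(proc)
    codes = [p.wait() for p in procs]
    failed = [(cmd[0], code) for cmd, code in zip(commands, codes) if code != 0]
    if failed:
        raise RuntimeError(f"pipeline stage(s) failed: {failed}")
```

Usage would look like `run_pipeline(mapping_commands(ref_genome, read_1, read_2, "PA001", output_bam))`. Passing argument lists instead of one shell string also sidesteps the quoting around the read group.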

In [None]:
import os

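# --- 1. Define Correct & Incorrect Paths ---
# (Based on our !pwd discovery: the notebook runs from the project root,
#  so the earlier "../data/..." paths resolved one level ABOVE the project)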
project_root = "/home/refm_youssef/project_popgen_mdr_pa"

# This is where the Reference Genome SHOULD be
correct_ref_dir = f"{project_root}/data/reference_genome"
os.makedirs(correct_ref_dir, exist_ok=True) # Ensure it exists

# This is where it *ACTUALLY* is (the "orphan")
orphan_ref_path = f"/home/refm_youssef/data/reference_genome/*" # Use * to grab all index files

print(f"Moving Reference Genome from (orphan) {orphan_ref_path}...")
print(f"Moving to (correct) {correct_ref_dir}...")

# --- 2. Move the files (The Fix) ---
# We use !mv to move the file + all its index files
!mv {orphan_ref_path} {correct_ref_dir}/

# --- 3. Verification (Rule 1) ---
print("\nMove complete. Verifying the correct project directory:")
!ls -lh {correct_ref_dir}
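
The orphaned files came from a relative path that depended on where the notebook happened to be running. One way to avoid both `../` guesses and a hard-coded home directory is to locate the project root by walking upwards until a known marker file is found. This is a sketch under assumptions: `find_project_root` is a hypothetical helper, and using `Snakefile` as the marker assumes that file sits at the project root.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "Snakefile") -> Path:
    """Walk upwards from start until a directory containing marker is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no '{marker}' found above {start}")
```

With `root = find_project_root(Path.cwd())`, paths such as `root / "data" / "reference_genome"` resolve to the same place whether the notebook is started from `notebooks/` or from the project root.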

In [None]:
import os

# --- 1. Define CORRECT Paths (No "../") ---
# (Based on our !pwd discovery)
ref_genome = "data/reference_genome/PAO1_reference.fna"

sample_id = "PA001"
read_1 = f"results/trimmed_reads/{sample_id}_1.fastq.gz"
read_2 = f"results/trimmed_reads/{sample_id}_2.fastq.gz"

# The output dir (relative to root)
output_dir = "results/mapped_reads"
os.makedirs(output_dir, exist_ok=True)
output_bam = f"{output_dir}/{sample_id}.sorted.bam"

# --- 2. Define "Read Group" (No changes here) ---
read_group = f"@RG\\tID:{sample_id}\\tSM:{sample_id}\\tPL:ILLUMINA"
print(f"Read Group string: {read_group}")

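# --- 3. Build the Pipe Command (THE CLEAN FIX) ---
# bwa-mem2 mem  : aligns R1 + R2 and writes SAM to stdout
# samtools view : -b converts SAM to BAM ("-" reads from stdin; the old -S flag is no longer needed)
# samtools sort : coordinate-sorts and writes the final BAM given by -o
command = f"bwa-mem2 mem -R '{read_group}' {ref_genome} {read_1} {read_2} | samtools view -b - | samtools sort -o {output_bam} -"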

print(f"\nStarting Alignment Pipe (Corrected Paths) for {sample_id}...")
print(f"This will take a minute or two...")

# --- 4. Run the R&D Test ---
!{command}

print("\nPipe complete.")

# --- 5. Verification (Rule 1) ---
print(f"\n--- Verification: Checking for final sorted BAM file ---")
# This time, it MUST work
!ls -lh {output_bam}

##  4: R&D Test (Indexing)

**Goal:** Create an "index" for our new sorted BAM file.

**Why (Rule 4: Clean Protocol):**
We now have a `PA001.sorted.bam` file. To use this file for downstream analysis (like Variant Calling in Phase 6), tools such as GATK and bcftools expect a "BAM Index" file (a `.bai` file) alongside it.

This index acts like a "Table of Contents" for the BAM file, allowing tools to jump to specific genomic coordinates (e.g., `NC_002516.2`, position 5000) instantly.

**R&D Test:** We will run the `samtools index` command and verify it creates the `.bam.bai` file.
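
Checking that the `.bai` exists is a good start, but an index left over from an earlier run can be older than a re-generated BAM. The sketch below compares modification times to catch that case. `bam_index_status` is a hypothetical helper, and it assumes the index sits at `<bam>.bai`, as in this notebook.

```python
import os
from pathlib import Path


def bam_index_status(bam: str) -> str:
    """Classify the BAM/index pair as ok, stale-index, missing-index or missing-bam."""
    bam_path = Path(bam)
    bai_path = Path(bam + ".bai")
    if not bam_path.is_file():
        return "missing-bam"
    if not bai_path.is_file() or bai_path.stat().st_size == 0:
        return "missing-index"
    # An index older than its BAM was probably built from a previous version.
    if os.path.getmtime(bai_path) < os.path.getmtime(bam_path):
        return "stale-index"
    return "ok"
```

A result other than `"ok"` for `bam_index_status(input_bam)` suggests re-running `samtools index` before moving on to variant calling.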

In [None]:
# --- 1. Define the Input File (the one we just made) ---
# (Paths are still relative to the project root)
input_bam = "results/mapped_reads/PA001.sorted.bam"

print(f"Starting to index: {input_bam}")
print("(This should be very fast)...")

# --- 2. Build and run the index command ---
# The command is: samtools index [sorted_bam_file]
command = f"samtools index {input_bam}"

!{command}

print("\nIndexing complete.")

# --- 3. Verification (Rule 1) ---
# The command AUTOMATICALLY creates a new file
# named exactly like the input, but with .bai added
output_index = f"{input_bam}.bai"

print(f"\n--- Verification: Checking for BAM Index file ---")
!ls -lh {output_index}

## Conclusion & Handoff to "The Factory"

**Status:** Success.
The "R&D Lab" for Phase 5 is complete. We have successfully developed and tested the *entire* recipe for high-throughput alignment:

1.  **Downloaded:** The reference genome (`PAO1_reference.fna`) (Section 1).
2.  **Tested `bwa-mem2 index`:** Created the genome index (Section 2).
3.  **Tested the pipe:** Successfully ran `bwa-mem2 mem | samtools view | samtools sort` on a single sample (`PA001`) to produce `PA001.sorted.bam` (Section 3).
4.  **Tested `samtools index`:** Successfully created the final `.bai` index (Section 4).

**Next Step (The Handoff):**
This notebook (`03_Mapping_...ipynb`) is now complete. We will save it to GitHub (Rule 5).

The "recipe" is now proven on a single sample. We are ready to automate it in our main `Snakefile` and run it on all 96 samples.