# 02. Mapping and BAM Processing

This notebook is the core of the pipeline. We will map our 14 trimmed FASTQ files (7 paired-end samples) against the reference genome.

**Workflow:**
1.  **Index Reference:** Create a `BWA` index for our reference genome (a one-time setup step).
2.  **Run Mapping:** Loop through all 7 sample pairs, run `bwa mem` to align the reads, and stream the `SAM` output straight into `samtools view` to write `BAM` files.
3.  **Process BAMs:** Sort and index the BAM files using `samtools`. This prepares them for GATK.

## 1. Index the Reference Genome (bwa index)

In [None]:
# Create a new directory for the index files to keep things clean
!mkdir -p ../references/bwa_index

# 1. Run bwa index on our reference FASTA
# We use -p to tell BWA where to *put* the new index files.
print("--- Running bwa index (this may take a minute)... ---")
!bwa index -p ../references/bwa_index/saureus_atcc_29213 ../references/saureus_atcc_29213.fasta

# 2. Verify the output
# Check that the five index files (.amb, .ann, .bwt, .pac, .sa) were created
# in the new directory.
print("\n--- Verification: BWA Index files ---")
!ls -lh ../references/bwa_index/
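
The `ls` listing above is checked by eye. As a minimal sketch, the same check could be automated in Python. The helper name `missing_bwa_index_files` is hypothetical (not part of BWA); the prefix is the one passed to `bwa index -p` above.

```python
from pathlib import Path

# The five file extensions that `bwa index` writes next to the prefix.
BWA_INDEX_EXTENSIONS = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(index_prefix):
    """Return the expected index files that do not exist for this prefix.

    Hypothetical helper for illustration; an empty list means the index
    looks complete.
    """
    return [
        index_prefix + ext
        for ext in BWA_INDEX_EXTENSIONS
        if not Path(index_prefix + ext).is_file()
    ]


if __name__ == "__main__":
    prefix = "../references/bwa_index/saureus_atcc_29213"
    missing = missing_bwa_index_files(prefix)
    if missing:
        print("Missing index files:", missing)
    else:
        print("All five BWA index files are present.")
```

If the list is not empty, rerunning the indexing cell before mapping is probably the simplest fix.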

## 2. Run Mapping (bwa mem) and SAM-to-BAM Conversion

This is the main step of this notebook. We will now align our 14 trimmed FASTQ files (7 read pairs) against the indexed reference genome.

**Workflow:**
We will use a `bash` loop to process all 7 samples. For each sample, we will:
1.  Run `bwa mem` with 8 threads to perform the alignment.
2.  Pipe (`|`) the output, which is in `SAM` format, *directly* into `samtools`.
3.  Use `samtools view -b` to convert the streaming `SAM` data into a compressed `BAM` file. (The `-S` in `-bS` is ignored by samtools 1.0 and later, which detect the input format automatically; it is kept only for compatibility with older versions.)
4.  Save the final `.bam` file into a new `../results/bam/` directory.

This on-the-fly `SAM-to-BAM` conversion avoids writing the large, uncompressed intermediate `SAM` file to disk, which saves both disk space and I/O time.
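
For readers who prefer to drive the tools from Python rather than `bash`, the pipe described above can be sketched with `subprocess`. This is an illustration under assumptions, not the notebook's method: the function names `build_mapping_commands` and `run_mapping` are hypothetical, and `bwa` and `samtools` are assumed to be on the `PATH`.

```python
import subprocess


def build_mapping_commands(index_prefix, r1, r2, out_bam, sample_id, threads=8):
    """Build the argv lists for `bwa mem` and `samtools view` (hypothetical helper)."""
    # bwa expects a literal backslash-t in the read group string and converts
    # it to a real tab itself, hence the escaped "\\t".
    read_group = f"@RG\\tID:{sample_id}\\tSM:{sample_id}\\tPL:ILLUMINA"
    bwa_cmd = ["bwa", "mem", "-t", str(threads), "-R", read_group,
               index_prefix, r1, r2]
    # "-" tells samtools to read the SAM stream from standard input.
    view_cmd = ["samtools", "view", "-b", "-@", str(threads), "-o", out_bam, "-"]
    return bwa_cmd, view_cmd


def run_mapping(bwa_cmd, view_cmd):
    """Run bwa and samtools connected by a pipe, without a SAM file on disk."""
    bwa = subprocess.Popen(bwa_cmd, stdout=subprocess.PIPE)
    view = subprocess.Popen(view_cmd, stdin=bwa.stdout)
    # Close our copy of the pipe so bwa sees a broken pipe if samtools exits early.
    bwa.stdout.close()
    view_rc = view.wait()
    bwa_rc = bwa.wait()
    if bwa_rc != 0 or view_rc != 0:
        raise RuntimeError(f"mapping failed (bwa={bwa_rc}, samtools={view_rc})")
```

Unlike a plain shell pipe, this version checks the exit status of both programs, so a failed alignment is not silently turned into a truncated BAM file.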

In [None]:
%%bash
# Tell Jupyter to run this entire cell as a bash script
# Stop on the first error, including a failure of bwa inside the pipe
set -euo pipefail

# 1. Create the output directory for our new BAM files
echo "--- Creating output directory ../results/bam/ ---"
mkdir -p ../results/bam/

# 2. Define the path to our BWA index
# This is the prefix we passed to `bwa index -p` earlier in this notebook
BWA_INDEX="../references/bwa_index/saureus_atcc_29213"

# 3. Loop through all *trimmed* Read 1 files
for r1_path in ../data/trimmed/*_1.trimmed.fastq.gz; do
    
    # --- Construct filenames ---
    r1_filename=$(basename "$r1_path")
    # Strip the known suffix so sample IDs that contain "_" are kept intact
    sample_id="${r1_filename%_1.trimmed.fastq.gz}"
    
    echo "--- Starting Mapping for sample: $sample_id ---"
    
    # Construct paths
    r2_path="../data/trimmed/${sample_id}_2.trimmed.fastq.gz"
    output_bam="../results/bam/${sample_id}.bam"
    
    # --- Define the Read Group (RG) string ---
    # GATK requires read groups. ID is the read group identifier, SM the
    # sample name, and PL the sequencing platform.
    READ_GROUP_STRING="@RG\tID:${sample_id}\tSM:${sample_id}\tPL:ILLUMINA"

    # --- Run the BWA -> Samtools pipeline ---
    # -R: Attaches the Read Group string to the BAM header and tags each read
    
    bwa mem -t 8 -R "$READ_GROUP_STRING" "$BWA_INDEX" "$r1_path" "$r2_path" | samtools view -bS -@ 8 -o "$output_bam" -
    
    echo "--- Finished Mapping for sample: $sample_id ---"
done

echo "--- All mapping is complete. ---"

In [None]:
# Verify the new BAM files
print("--- Verification: BAM files ---")
!ls -lh ../results/bam/
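
Because GATK depends on the read group set with `-R`, it can be worth confirming that each BAM header actually contains an `@RG` line. The sketch below parses header text such as the output of `samtools view -H`; both function names are hypothetical, and `samtools` is assumed to be on the `PATH`.

```python
import subprocess


def read_bam_header(bam_path):
    """Return the header of a BAM file as text (runs `samtools view -H`)."""
    result = subprocess.run(
        ["samtools", "view", "-H", bam_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_read_groups(header_text):
    """Return one dict per @RG header line, for example {"ID": ..., "SM": ...}."""
    read_groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG\t"):
            continue
        # Each field after the record type has the form TAG:value.
        fields = line.split("\t")[1:]
        read_groups.append(dict(field.split(":", 1) for field in fields))
    return read_groups
```

A call such as `parse_read_groups(read_bam_header("../results/bam/<sample>.bam"))`, with `<sample>` replaced by a real sample ID, should return a single entry whose `ID` and `SM` both equal that sample ID.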

## 3. Sort and Index BAM Files (Samtools)

The BAM files we just created are not coordinate-sorted: `bwa mem` writes alignments in the same order as the reads appear in the input FASTQ files. For GATK to work, our BAM files **must** be sorted by genomic coordinate.

**Workflow:**
We will use a `bash` loop to process all 7 BAM files:
1.  **Sort:** Use `samtools sort` to read each unsorted BAM file and write a new, sorted BAM file.
2.  **Index:** Use `samtools index` on the *new sorted BAM* file to create its index (`.bai` file). GATK needs this index for fast lookups.

We will store these final, analysis-ready files in a new directory: `../results/bam_sorted/`.

In [None]:
%%bash
# Tell Jupyter to run this entire cell as a bash script

# 1. Create the output directory for our new *sorted* BAM files
echo "--- Creating output directory ../results/bam_sorted/ ---"
mkdir -p ../results/bam_sorted/

# 2. Loop through all *unsorted* BAM files in ../results/bam/
for bam_path in ../results/bam/*.bam; do
    
    # --- Construct filenames ---
    bam_filename=$(basename "$bam_path")
    
    echo "--- Starting Sort & Index for: $bam_filename ---"
    
    # Define the path for the new *sorted* output file
    sorted_bam_path="../results/bam_sorted/${bam_filename}"
    
    # --- Run samtools sort ---
    # -@ 8: Use 8 threads for sorting
    # -o: Specify the output file name
    samtools sort -@ 8 -o "$sorted_bam_path" "$bam_path"
    
    # --- Run samtools index ---
    # -@ 8: Use 8 threads for indexing
    # This automatically creates the ".bai" file in the same directory
    samtools index -@ 8 "$sorted_bam_path"
    
    echo "--- Finished Sort & Index for: $bam_filename ---"
done

echo "--- All BAM sorting and indexing is complete. ---"

In [None]:
# Verify the new SORTED BAM and INDEX files
print("--- Verification: Sorted BAM and BAI (Index) files ---")
!ls -lh ../results/bam_sorted/
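
Beyond listing the files, the two requirements stated above (coordinate sorting and a `.bai` index) can be checked programmatically. This is a minimal sketch with hypothetical helper names. It assumes the header text was obtained with `samtools view -H` and that `samtools index` wrote its index next to the BAM file as `<file>.bam.bai`.

```python
from pathlib import Path


def sort_order(header_text):
    """Return the SO value from the @HD header line, or None if it is absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD\t"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


def has_bai_index(bam_path):
    """Return True if a .bai index sits next to the BAM file."""
    return Path(str(bam_path) + ".bai").is_file()


if __name__ == "__main__":
    # Example header text, written by hand for illustration only.
    example_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:contig1\tLN:1000\n"
    print(sort_order(example_header) == "coordinate")
```

A file written by `samtools sort` should report `coordinate`; any other value suggests that the unsorted file was picked up by mistake.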