# 03. Variant Calling with GATK HaplotypeCaller

This is the central notebook of our project. We will use GATK `HaplotypeCaller` to identify genetic variants (SNPs and INDELs) in each sample.

**Workflow:**
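1.  **Create Reference Indexes:** GATK requires two companion files for the reference genome: a `.fai` index, which allows fast random access and is created with `samtools faidx`, and a `.dict` sequence dictionary, created with `gatk CreateSequenceDictionary`. We will create both first.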
2.  **Run HaplotypeCaller:** We will loop through our 7 analysis-ready BAM files to find variants.
3.  **Output:** GATK will generate a `.vcf.gz` file for each sample, detailing all sites where that sample differs from the reference genome.
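
Before running the calling step, it can help to confirm which reference companion files already exist. The following is a minimal sketch, not part of the original pipeline: the helper name `missing_reference_files` is hypothetical, and it assumes the `.fai` sits at `<fasta>.fai` (where `samtools faidx` writes it) and the `.dict` path replaces the FASTA extension, matching the paths used in this notebook.

```python
from pathlib import Path


def missing_reference_files(fasta_path):
    """Return the GATK companion files that do not exist yet for a reference FASTA.

    Hypothetical helper: it only checks for file presence, not file validity.
    """
    fasta = Path(fasta_path)
    expected = [
        Path(str(fasta) + ".fai"),   # samtools faidx writes <fasta>.fai
        fasta.with_suffix(".dict"),  # sequence dictionary: extension replaced by .dict
    ]
    return [path for path in expected if not path.exists()]


if __name__ == "__main__":
    reference = "../references/saureus_atcc_29213.fasta"
    for path in missing_reference_files(reference):
        print(f"Missing: {path}")
```

An empty result means both files are present; anything listed still has to be created before `HaplotypeCaller` can run.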

In [None]:
%%bash
# Tell Jupyter to run this entire cell as a bash script.
# Stop at the first failing command so later steps never run on missing inputs.
set -euo pipefail

# 1. Create the output directory for our new VCF files
echo "--- Creating output directory ../results/vcf_files/ ---"
mkdir -p ../results/vcf_files/

# 2. Define the path to our Reference FASTA
REFERENCE_GENOME="../references/saureus_atcc_29213.fasta"

# 3. Build the two reference companion files that GATK requires
# 3a. FASTA index (.fai), written next to the FASTA as <fasta>.fai
echo "--- Creating FASTA index (.fai) using samtools... ---"
samtools faidx "$REFERENCE_GENOME"
echo "--- .fai index created successfully. ---"

# 3b. Sequence dictionary (.dict), which lists the contig names and lengths
DICTIONARY_FILE="../references/saureus_atcc_29213.dict"

# CreateSequenceDictionary stops with an error if the output file already exists,
# so skip this step when the cell is re-run.
if [ -f "$DICTIONARY_FILE" ]; then
    echo "--- Sequence dictionary already exists, skipping. ---"
else
    echo "--- Creating Sequence Dictionary (.dict) for GATK... ---"
    gatk CreateSequenceDictionary \
        -R "$REFERENCE_GENOME" \
        -O "$DICTIONARY_FILE"
    echo "--- .dict file created successfully. ---"
fi

# 4. Loop through all *sorted* BAM files
for sorted_bam_path in ../results/bam_sorted/*.bam; do
    
    bam_filename=$(basename "$sorted_bam_path")
    sample_id=$(echo "$bam_filename" | cut -d'.' -f1)
    
    echo "--- Starting GATK HaplotypeCaller for sample: $sample_id ---"
    
    output_vcf="../results/vcf_files/${sample_id}.vcf.gz"
    
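    # --- Run GATK HaplotypeCaller ---
    # -ploidy 1 makes GATK call haploid genotypes, which suits a bacterial genome.
    # Paths are quoted so that file names containing spaces do not break the command.
    gatk HaplotypeCaller \
        -R "$REFERENCE_GENOME" \
        -I "$sorted_bam_path" \
        -O "$output_vcf" \
        -ploidy 1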
    
    echo "--- Finished GATK for sample: $sample_id ---"
done

echo "--- All GATK HaplotypeCaller runs are complete. ---"
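
The loop above derives each sample ID by keeping the BAM file name up to its first dot. One consequence is that two BAM files sharing that prefix would write to the same VCF path, with the second run overwriting the first. The sketch below mirrors that naming rule in Python and reports any such collisions. The function names and the example file names in the usage note are hypothetical and are not taken from the project.

```python
from collections import defaultdict
from pathlib import Path


def sample_id_from_bam(bam_path):
    """Mirror `basename` + `cut -d'.' -f1`: keep the file name up to the first dot."""
    return Path(bam_path).name.split(".")[0]


def find_sample_id_collisions(bam_paths):
    """Group BAM paths by derived sample ID; return only IDs shared by several files."""
    groups = defaultdict(list)
    for path in bam_paths:
        groups[sample_id_from_bam(path)].append(str(path))
    return {sid: paths for sid, paths in groups.items() if len(paths) > 1}


if __name__ == "__main__":
    bams = sorted(Path("../results/bam_sorted").glob("*.bam"))
    for sid, paths in find_sample_id_collisions(bams).items():
        print(f"Sample ID {sid!r} is shared by: {paths}")
```

For example, hypothetical files `strain.v1.sorted.bam` and `strain.v2.sorted.bam` would both map to the sample ID `strain`, so only one of their VCFs would survive.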

In [None]:
# Verify the new VCF files
print("--- Verification: VCF files ---")
!ls -lh ../results/vcf_files/
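
Listing the files confirms that they exist, but not that they contain variants. As a further check, the sketch below counts SNP and INDEL records per file using only the Python standard library. It is an illustration under simplifying assumptions: the function name `count_variants` is hypothetical, a record counts as an INDEL if any ALT allele differs in length from REF, and symbolic or placeholder alleles (`<...>`, `*`, `.`) are ignored. A dedicated tool such as `bcftools stats` would give a more complete summary.

```python
import gzip
from pathlib import Path


def count_variants(vcf_gz_path):
    """Count SNP, INDEL, and other records in a gzip-compressed VCF."""
    counts = {"snp": 0, "indel": 0, "other": 0}
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            # Skip header lines and blank lines
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            ref = fields[3]
            # Keep only concrete ALT alleles (drop symbolic and placeholder ones)
            alts = [
                alt for alt in fields[4].split(",")
                if alt not in (".", "*") and not alt.startswith("<")
            ]
            if not alts:
                counts["other"] += 1
            elif any(len(alt) != len(ref) for alt in alts):
                counts["indel"] += 1
            elif len(ref) == 1:
                counts["snp"] += 1
            else:
                counts["other"] += 1  # e.g. multi-nucleotide substitutions
    return counts


if __name__ == "__main__":
    for vcf_path in sorted(Path("../results/vcf_files").glob("*.vcf.gz")):
        print(vcf_path.name, count_variants(vcf_path))
```

A sample with zero records in every category is worth a second look, since it may indicate a failed or truncated run.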