# 00. Project Setup: Dependencies and Data Download

This notebook prepares the entire project environment. It performs three main tasks:
1.  **Install Dependencies:** Installs all required bioinformatics tools (GATK, BWA, etc.) into our dedicated conda environment.
2.  **Download Reference Genome:** Fetches the RefSeq reference genome for *S. aureus* ATCC 29213.
3.  **Download SRA Data:** Downloads the 7 SRA sequencing runs for our experiment.

## 1. Install Dependencies

We install all tools with `mamba`. We pin the Java version (`openjdk=17`) explicitly: recent `GATK4` releases (4.4 and later) require Java 17, while the newest `SnpEff` releases expect a newer Java. Pinning Java 17 together with an older `SnpEff` release keeps every tool compatible within one environment.
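
Because the whole environment hinges on the Java major version, it can help to check it programmatically after installation. The following is a minimal sketch, not part of the original workflow; the helper names `parse_java_major` and `installed_java_major` are hypothetical and introduced only for illustration.

```python
import re
import subprocess


def parse_java_major(version_output: str) -> int:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a Java version string")
    major = int(match.group(1))
    # Legacy numbering: Java 8 reports itself as "1.8.0_xxx".
    if major == 1 and match.group(2) is not None:
        major = int(match.group(2))
    return major


def installed_java_major() -> int:
    """Run `java -version` in the active environment and return the major version."""
    # `java -version` writes its report to stderr, not stdout.
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return parse_java_major(result.stderr)
```

In a notebook cell, running `installed_java_major()` after the installation step should return `17` if the pin took effect.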

In [None]:
# 1. Install primary tools (GATK, BWA, etc.)
# GATK pulls in OpenJDK as a dependency; step 2 below pins it to version 17.
!mamba install -c conda-forge -c bioconda bwa samtools gatk4 fastqc multiqc -y

# 2. Install SnpEff with a pinned version
# Remove any existing SnpEff first so the pinned version installs cleanly.
# (If SnpEff is not installed yet, this step may report an error that can be ignored.)
!mamba remove snpeff -y --force
# Now install a specific version (5.1) known to be compatible with openjdk=17.
!mamba install -c conda-forge -c bioconda "snpeff=5.1" openjdk=17 -y

# 3. Verify all installations
print("--- BWA ---")
# bwa prints its usage text to stderr, so redirect it to stdout before piping to head.
!bwa 2>&1 | head -n 5
print("\n--- Samtools Version ---")
!samtools --version
print("\n--- GATK Version ---")
!gatk --version
print("\n--- FastQC Version ---")
!fastqc --version
print("\n--- MultiQC Version ---")
!multiqc --version
print("\n--- SnpEff Version (Fixed) ---")
!snpEff -version

## 2. Download Reference Genome (via Accession Number)

Variant calling compares our sample reads against a reference genome. We use the NCBI `datasets` command-line tool to download the RefSeq assembly for our strain directly by its accession number.

* **Strain:** *Staphylococcus aureus* ATCC 29213
* **RefSeq Accession:** `GCF_022832775.1`

The process will be:
1.  Install the `ncbi-datasets-cli` tool.
2.  Use the tool to download the genome (FASTA file) into our `../references/` directory.
3.  Clean up and rename the file to `saureus_atcc_29213.fasta`.
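
Step 3 can also be sketched in Python, with a guard in case the extracted package contains zero or several `.fna` files. This is an illustrative alternative to the shell commands in this notebook, not a replacement for them; the function name `collect_reference_fasta` is hypothetical.

```python
import shutil
from pathlib import Path


def collect_reference_fasta(extract_dir: Path, target: Path) -> Path:
    """Move the single .fna file found anywhere under extract_dir to target."""
    candidates = sorted(extract_dir.rglob("*.fna"))
    # Refuse to guess: several matches would otherwise overwrite each other.
    if len(candidates) != 1:
        raise RuntimeError(
            f"expected exactly one .fna file, found {len(candidates)}"
        )
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(candidates[0]), str(target))
    return target
```

With the layout described above, a call would look like `collect_reference_fasta(Path("../references/ncbi_dataset"), Path("../references/saureus_atcc_29213.fasta"))`.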

In [None]:
# 1. Install the NCBI datasets command-line tool
!mamba install -c conda-forge ncbi-datasets-cli -y

# 2. Download the genome using the RefSeq Accession Number
# Make sure the target directory exists, then request the genome FASTA
# ('--include genome') and save the package as a zip file in our references folder.
!mkdir -p ../references
!datasets download genome accession GCF_022832775.1 --filename ../references/reference.zip --include genome

# 3. Unzip the downloaded package
# The '-d' flag specifies the directory to extract to.
!unzip -o ../references/reference.zip -d ../references/

# 4. Clean up and Rename
# The package extracts into a nested folder structure ('ncbi_dataset/data/...').
# We find the .fna file inside it, then move and rename it in one step.
# Note: this assumes the package contains exactly one .fna file.
!find ../references/ -name "*.fna" -exec mv {} ../references/saureus_atcc_29213.fasta \;

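# 5. Remove the leftover zip, the extracted folder, and the package metadata files
# ('-f' keeps rm quiet if a metadata file is absent).
!rm ../references/reference.zip
!rm -rf ../references/ncbi_dataset
!rm -f ../references/README.md ../references/md5sum.txt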

# 6. Verify the final file
print("\n--- Verification: Reference Genome File ---")
!ls -lh ../references/

## 3. Download SRA Sequencing Data

Now we will download the raw sequencing reads for our 7 samples from the NCBI Sequence Read Archive (SRA).

We will use the `sra-tools` suite in a two-step workflow:
1.  **`prefetch`:** Downloads the compressed SRA-format files. It is designed to handle large downloads and recover from network interruptions.
2.  **`fastq-dump`:** Converts the downloaded `.sra` files into the standard FASTQ format that our QC and alignment tools (FastQC, BWA) can read.

**Our 7 Samples:**
* `SRR5100333` (Control Parent)
* `SRR11187850` (DAP P5)
* `SRR11187849` (DAP P20)
* `SRR5100329` (DAP Final)
* `SRR11187852` (VAN P5)
* `SRR11187851` (VAN P20)
* `SRR5100339` (VAN Final)
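
Once the download has run, the sample list above can double as a checklist. The sketch below assumes every run is paired-end, so that `--split-files` yields `_1.fastq.gz` and `_2.fastq.gz` per accession; the names `SAMPLES`, `expected_fastq_files`, and `missing_fastq_files` are hypothetical helpers, not part of `sra-tools`.

```python
from pathlib import Path

# Accession -> label, copied from the sample list above.
SAMPLES = {
    "SRR5100333": "Control Parent",
    "SRR11187850": "DAP P5",
    "SRR11187849": "DAP P20",
    "SRR5100329": "DAP Final",
    "SRR11187852": "VAN P5",
    "SRR11187851": "VAN P20",
    "SRR5100339": "VAN Final",
}


def expected_fastq_files(srr_ids, data_dir):
    """List the two mate files expected per accession (assumes paired-end runs)."""
    data_dir = Path(data_dir)
    return [
        data_dir / f"{srr}_{mate}.fastq.gz"
        for srr in srr_ids
        for mate in (1, 2)
    ]


def missing_fastq_files(srr_ids, data_dir):
    """Return expected files that are absent or empty."""
    return [
        path
        for path in expected_fastq_files(srr_ids, data_dir)
        if not path.is_file() or path.stat().st_size == 0
    ]
```

Calling `missing_fastq_files(SAMPLES, "../data")` after the conversion step returns an empty list when all 14 files are present and non-empty.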

In [None]:
# Create a list of our sample SRR accessions
srr_ids = [
    "SRR5100333", "SRR11187850", "SRR11187849",
    "SRR5100329", "SRR11187852", "SRR11187851",
    "SRR5100339"
]

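# --- Step 1: Download .sra files using prefetch ---
# Depending on the sra-tools version and configuration, prefetch stores files in the
# user repository (e.g. ~/ncbi/public/sra/) or in a per-accession folder in the
# current working directory.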
print("--- Starting SRA prefetch (this may take a while)... ---")
for srr in srr_ids:
    print(f"Prefetching {srr}...")
    !prefetch {srr} -v
print("--- Prefetch complete. ---")


# --- Step 2: Convert .sra to .fastq.gz using fastq-dump ---
# This finds the prefetched files and converts them, saving the output
# to our project's 'data' directory (created first in case it does not exist).
!mkdir -p ../data
print("\n--- Starting fastq-dump (conversion)... ---")
for srr in srr_ids:
    print(f"Dumping {srr} to ../data/ ...")
    # --gzip: Compresses the output (saves space)
    # --split-files: Creates _1.fastq.gz and _2.fastq.gz for paired-end
    # --outdir: Specifies our data directory
    !fastq-dump --gzip --split-files --outdir ../data/ {srr}
print("--- Fastq-dump complete. ---")


# --- Step 3: Verify the final FASTQ files ---
print("\n--- Verification: Downloaded FASTQ files ---")
!ls -lh ../data/