# Project: Unveiling Mechanisms of Antibiotic Tolerance in Bacterial Biofilms via scRNA-seq
### Multi-species Analysis: *P. aeruginosa*, *S. aureus*, and *E. coli* (GSE260458)

---

## 1. Biological Context
Bacterial biofilms are heterogeneous communities where "persister cells" survive antibiotic treatment. This project aims to investigate the transcriptomic heterogeneity of three major pathogenic bacteria at the single-cell level.
**Hypothesis:** The toxin-antitoxin system (specifically the *pdeI/hipH* axis) regulates persister cell formation via c-di-GMP signaling.

## 2. Study Objectives
1.  **Ingest & Process:** Raw scRNA-seq data from 14 samples across 3 species.
2.  **Analyze Heterogeneity:** Identify persister sub-populations in *P. aeruginosa*, *S. aureus*, and *E. coli*.
3.  **Comparative Analysis:** Compare Wild Type (WT) vs. Mutant ($\Delta\Delta$) strains to validate the role of *pdeI*.

## 3. Experimental Design (14 Samples)

| Species | Strain | Condition | Samples (Replicates) |
|:---|:---|:---|:---|
| **P. aeruginosa** | PAO1 | Biofilm (WT) | 2 Samples |
| **P. aeruginosa** | PAO1 | Biofilm ($\Delta\Delta$ Mutant) | 2 Samples |
| **S. aureus** | USA300 | Exponential (WT) | 2 Samples |
| **S. aureus** | USA300 | Exponential ($\Delta\Delta$ Mutant) | 2 Samples |
| **E. coli** | MG1655 | Exponential (WT) | 2 Samples |
| **E. coli** | MG1655 | Exponential ($\Delta\Delta$ Mutant) | 2 Samples |
| **E. coli** | MG1655 | Static/Biofilm ($\Delta\Delta$ Mutant) | 2 Samples |

---

## 4. Technical Strategy (Data Ingestion)
This notebook executes the data acquisition pipeline.
* **Source:** We retrieve raw sequencing data from the Sequence Read Archive (SRA) under BioProject **PRJNA999602**.
* **Tool:** We use the SRA Toolkit's `fasterq-dump` utility to download and extract FASTQ files directly.
* **Optimization:** Each FASTQ file is compressed with `gzip` right after download to reduce disk usage on the instance.
* **Scope:** The pipeline processes all **14 samples** (SRR25445867 to SRR25445880) encompassing *P. aeruginosa*, *S. aureus*, and *E. coli* datasets.

In [None]:
%%bash
# 2. Universal Data Ingestion Pipeline
# This script standardizes the download process for all 14 samples using fasterq-dump.

echo "=== Starting Data Ingestion for 14 Samples ==="

# Define the range of SRR Accessions (from SRR25445867 to SRR25445880)
# This covers all PA, SA, and EC samples (WT and Mutants)
START_NUM=25445867
END_NUM=25445880

for (( id=$START_NUM; id<=$END_NUM; id++ )); do
    SAMPLE="SRR${id}"
    
    # Check if the final compressed file already exists
    # This ensures idempotency: if we run the notebook again, it won't re-download existing data.
    if [ -f "raw_data/${SAMPLE}_1.fastq.gz" ]; then
        echo "[SKIP] ${SAMPLE} already exists and is compressed."
        
    elif [ -f "raw_data/${SAMPLE}_1.fastq" ]; then
        echo "[COMPRESS] ${SAMPLE} found uncompressed. Gzipping now..."
        gzip "raw_data/${SAMPLE}_1.fastq" "raw_data/${SAMPLE}_2.fastq"
        
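    else
        echo "[DOWNLOAD] Fetching ${SAMPLE} from SRA..."
        # The main download command; skip compression if the download fails
        if ! fasterq-dump --split-files "${SAMPLE}" --outdir raw_data --progress; then
            echo "[ERROR] fasterq-dump failed for ${SAMPLE}; skipping." >&2
            continue
        fi

        echo "[COMPRESS] Compressing ${SAMPLE}..."
        gzip "raw_data/${SAMPLE}_1.fastq" "raw_data/${SAMPLE}_2.fastq"
    fi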
done

echo "=== Pipeline Complete. All samples are accounted for. ==="
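
The skip check in the loop above only looks at the `_1` file, so a short integrity pass can confirm that both mates exist for every accession and that each archive is readable. The following is a minimal sketch, not part of the original pipeline: the helper name `check_sample` is hypothetical, and it assumes paired-end output named `<SRR>_1.fastq.gz` and `<SRR>_2.fastq.gz` in `raw_data/`.

```shell
# Hypothetical helper: succeed only if both mate files exist and pass gzip's integrity test.
check_sample() {
    cs_dir="$1"
    cs_sample="$2"
    for cs_mate in 1 2; do
        cs_file="${cs_dir}/${cs_sample}_${cs_mate}.fastq.gz"
        if [ ! -f "$cs_file" ]; then
            echo "[MISSING] $cs_file"
            return 1
        fi
        # gzip -t reads the whole archive and verifies it without writing output
        if ! gzip -t "$cs_file" 2>/dev/null; then
            echo "[CORRUPT] $cs_file"
            return 1
        fi
    done
    echo "[OK] ${cs_sample}"
}

# Assumed layout: raw_data/ as written by the ingestion loop.
if [ -d raw_data ]; then
    id=25445867
    while [ "$id" -le 25445880 ]; do
        check_sample raw_data "SRR${id}" || true
        id=$((id + 1))
    done
fi
```

Any accession reported as `[MISSING]` or `[CORRUPT]` can be re-fetched by deleting its partial files and re-running the ingestion cell.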

## 5. Reference Genome Acquisition
To map the raw reads and quantify gene expression using **Salmon**, we require the reference transcriptomes (CDS) for our three target species.
We will retrieve the standard **RefSeq** assemblies from NCBI for:
1.  ***P. aeruginosa* PAO1:** The standard lab strain for Pseudomonas research.
2.  ***S. aureus* USA300:** A major methicillin-resistant (MRSA) strain.
3.  ***E. coli* MG1655:** The model K-12 strain.

**Files to retrieve:**
* **CDS FASTA (`_cds_from_genomic.fna.gz`):** Contains the coding sequences (transcripts) required for Salmon indexing.

In [None]:
%%bash
# 3. Downloading Reference Transcriptomes
# We retrieve specific assemblies from NCBI RefSeq FTP.

mkdir -p references
cd references

echo "=== Starting Reference Download ==="

# --- 1. P. aeruginosa PAO1 ---
echo "[1/3] Downloading P. aeruginosa PAO1..."
# We download the CDS (Coding DNA Sequence) for Salmon
wget -q -O PAO1_cds.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/765/GCF_000006765.1_ASM676v1/GCF_000006765.1_ASM676v1_cds_from_genomic.fna.gz

# --- 2. S. aureus USA300 ---
echo "[2/3] Downloading S. aureus USA300..."
wget -q -O USA300_cds.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/013/465/GCF_000013465.1_ASM1346v1/GCF_000013465.1_ASM1346v1_cds_from_genomic.fna.gz

# --- 3. E. coli MG1655 ---
echo "[3/3] Downloading E. coli MG1655..."
wget -q -O MG1655_cds.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_cds_from_genomic.fna.gz

echo "=== Unzipping Reference Files ==="
# Unzip all downloaded files
gunzip -f *.gz

echo "Reference acquisition complete. Contents of references/:"
ls -lh
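
Because `wget -q` suppresses error output, an empty or truncated download could go unnoticed. One quick sanity check is to count the FASTA records in each reference file. This is a minimal sketch: the helper name `count_cds` is hypothetical, and it assumes the uncompressed file names produced by the cell above (`PAO1_cds.fna`, `USA300_cds.fna`, `MG1655_cds.fna`) under `references/`, relative to the notebook's working directory.

```shell
# Hypothetical helper: print the number of FASTA records (header lines starting with '>').
count_cds() {
    grep -c '^>' "$1"
}

for name in PAO1 USA300 MG1655; do
    fasta="references/${name}_cds.fna"
    # -s is true only if the file exists and is non-empty
    if [ -s "$fasta" ]; then
        echo "${name}: $(count_cds "$fasta") CDS records"
    else
        echo "[WARN] ${fasta} is missing or empty"
    fi
done
```

A count of zero, or a `[WARN]` line, suggests the corresponding download should be repeated.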

## 6. Conclusion
At the end of this notebook, our environment is fully prepared:
1.  **Project Structure:** The `raw_data/` and `references/` directories are in place.
2.  **Raw Data:** All 14 scRNA-seq samples are downloaded and compressed in `raw_data/`.
3.  **References:** Transcriptomes for *P. aeruginosa*, *S. aureus*, and *E. coli* are ready in `references/`.

**Next Step:** In `02_Alignment_and_Quantification.ipynb`, we will build the transcriptome indices with **Salmon** and perform the read alignment.
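
As a preview of that indexing step, the sketch below prints (rather than runs) one `salmon index` command per species, using Salmon's `-t` (transcript FASTA) and `-i` (output index directory) options. The function name `print_index_cmds` and the `indices/` directory names are assumptions made for illustration; the next notebook may choose different paths or additional parameters.

```shell
# Dry run: print, rather than execute, one `salmon index` command per species.
# Index directory names are assumptions made for this sketch.
print_index_cmds() {
    for name in PAO1 USA300 MG1655; do
        echo "salmon index -t references/${name}_cds.fna -i indices/${name}_index"
    done
}

print_index_cmds
```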