# 02. Alignment & Quantification Pipeline
### Generating Gene Counts using Salmon

---

## 1. Strategy: Hybrid Workflow
Alignment and quantification are computationally intensive (they need significant CPU time and RAM), so we do not run them inside this Jupyter Notebook kernel, where a long-running job risks timing out.

**The Workflow:**
1.  **Preparation (Here):** We will write a unified Bash script (`run_salmon.sh`) that handles both indexing and quantification.
2.  **Execution (Terminal):** We will execute this script in the background via the VM terminal.
3.  **Result:** The script will generate one quantification directory per sample in `processed_data/salmon_quant/`.
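
Once the script has finished, the result step can be checked from the notebook. The sketch below assumes a Python kernel and that each successful Salmon run leaves a `quant.sf` file in its sample directory; the helper name `missing_quant_files` is hypothetical and not part of the pipeline.

```python
from pathlib import Path


def missing_quant_files(quant_root="processed_data/salmon_quant"):
    """Return the sample directories under quant_root that lack a quant.sf file."""
    root = Path(quant_root)
    if not root.is_dir():
        # Nothing has been written yet (or the path is wrong): nothing to report.
        return []
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir() and not (d / "quant.sf").is_file()
    )


if __name__ == "__main__":
    incomplete = missing_quant_files()
    print("Samples without quant.sf:", incomplete or "none")
```

An empty list means every sample directory that exists contains a `quant.sf`; it does not confirm that every expected sample was processed.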

## 2. The Tool: Salmon
We use **Salmon** for transcript quantification. It works in two steps:
1.  **Indexing:** Converting the reference transcriptomes (CDS) into a searchable structure.
2.  **Quantification:** Mapping the raw FASTQ reads to the index to estimate the expression of each gene.
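
The quantification step writes its estimates to a tab-separated `quant.sf` file with the columns `Name`, `Length`, `EffectiveLength`, `TPM`, and `NumReads`. As a minimal sketch of how those estimates could be loaded later, assuming a Python kernel (the function name `read_num_reads` is hypothetical):

```python
import csv


def read_num_reads(quant_path):
    """Map each target name in a quant.sf file to its estimated read count."""
    counts = {}
    with open(quant_path, newline="") as handle:
        # quant.sf is tab-separated with a single header row.
        for row in csv.DictReader(handle, delimiter="\t"):
            counts[row["Name"]] = float(row["NumReads"])
    return counts
```

For example, `read_num_reads("processed_data/salmon_quant/SRR25445867/quant.sf")` would return one entry per CDS in the index used for that sample. `NumReads` is an estimate and is generally not an integer, which is why it is kept as a float here.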

In [None]:
%%writefile run_salmon.sh
#!/bin/bash

# ==========================================
# SALMON PIPELINE V2: Robust & Clean
# ==========================================

# 1. Setup Directories
mkdir -p processed_data/salmon_quant
mkdir -p references/indices

# --- Function to Build Index ---
build_index() {
    local species="$1"
    local fasta="$2"
    local index_path="references/indices/${species}_index"

    echo "üèóÔ∏è  Building Index for $species..."
    
    # Salmon Indexing Command
    salmon index -t $fasta -i $index_path -k 31
    
    # Verification
    if [ -f "$index_path/versionInfo.json" ]; then
        echo "‚úÖ Index for $species built successfully."
    else
        echo "‚ùå CRITICAL ERROR: Index for $species FAILED to build."
        exit 1
    fi
}

# --- STEP 1: FORCE BUILD INDICES ---
echo "=== STEP 1: Building Indices ==="
build_index "PAO1" "references/PAO1_cds.fna"
build_index "USA300" "references/USA300_cds.fna"
build_index "MG1655" "references/MG1655_cds.fna"


# --- STEP 2: QUANTIFICATION LOOP ---
echo "=== STEP 2: Starting Quantification ==="

for file in raw_data/*_1.fastq.gz; do
    filename=$(basename "$file")
    sample="${filename%_1.fastq.gz}"
    
    # Define Input/Output Files
    read1="raw_data/${sample}_1.fastq.gz"
    read2="raw_data/${sample}_2.fastq.gz"
    output="processed_data/salmon_quant/${sample}"

    # Select Index based on Sample ID (Experimental Design)
    if [[ "$sample" == "SRR25445867" || "$sample" == "SRR25445868" || "$sample" == "SRR25445869" || "$sample" == "SRR25445870" ]]; then
        INDEX="references/indices/PAO1_index"
    elif [[ "$sample" == "SRR25445871" || "$sample" == "SRR25445872" || "$sample" == "SRR25445873" || "$sample" == "SRR25445874" ]]; then
        INDEX="references/indices/USA300_index"
    else
        INDEX="references/indices/MG1655_index" # E. coli
    fi

    echo "------------------------------------------------"
    echo "üöÄ Processing $sample using $(basename $INDEX)"
    
    # Run Salmon Quant (Mapping)
    salmon quant -i $INDEX -l A \
        -1 $read1 -2 $read2 \
        -p 4 --validateMappings -o $output -q

    if [ $? -eq 0 ]; then
        echo "‚úÖ Success: $sample"
    else
        echo "‚ùå Error processing $sample"
    fi
done

echo "=== PIPELINE FINISHED ==="

## 3. Execution Instructions

The pipeline script `run_salmon.sh` has been generated. Because the run is computationally intensive, execute it from the terminal in a persistent session rather than from the notebook.
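
Before launching a long run, it may be worth confirming that the paired-end inputs are complete, since the script loops over `raw_data/*_1.fastq.gz` and assumes a matching `_2.fastq.gz` exists for each sample. A minimal pre-flight sketch, assuming a Python kernel (the helper name `samples_missing_mate` is hypothetical):

```python
from pathlib import Path


def samples_missing_mate(raw_dir="raw_data"):
    """Return sample IDs that have a *_1.fastq.gz file but no matching *_2.fastq.gz."""
    raw = Path(raw_dir)
    suffix = "_1.fastq.gz"
    missing = []
    for read1 in sorted(raw.glob("*" + suffix)):
        # Strip the mate-1 suffix to recover the sample ID, as the Bash script does.
        sample = read1.name[: -len(suffix)]
        if not (raw / f"{sample}_2.fastq.gz").is_file():
            missing.append(sample)
    return missing


if __name__ == "__main__":
    print("Samples missing read 2:", samples_missing_mate() or "none")
```

Any sample reported here would cause its `salmon quant` call to fail, so it is cheaper to catch it before starting the session.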

**Recommended Terminal Commands:**

```bash
# 1. Open a screen session (to keep it running if disconnected)
screen -S alignment

# 2. Make the script executable, then run it
chmod +x run_salmon.sh
./run_salmon.sh
