# RNA-Seq Pipeline: Step 1 - Download Raw SRA Data

This notebook is the first step in the analysis pipeline. Its purpose is to download the 6 raw RNA-Seq samples (FASTQ) associated with **BioProject PRJNA1281001** from the Sequence Read Archive (SRA).

**Tool:** `fasterq-dump` (from the `sra-tools` package)
**Output Directory:** `01_Raw_Data/`

## Experimental Design
The analysis compares two groups:
* **WT (Control):** 3 biological replicates (Wild-Type *K. pneumoniae* HS11286)
* **ΔcpxR (Mutant):** 3 biological replicates (cpxR deletion mutant)

In [None]:
import os

# Define the target output directory for FASTQ files
output_dir = "01_Raw_Data"

# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)

# Define the SRA accession numbers for all samples
# (Source: SRA data for PRJNA1281001)

# Group 1: Wild-Type (WT) Samples
wt_samples = ["SRR34134109", "SRR34134108", "SRR34134107"]

# Group 2: Mutant (delta-cpxR) Samples
mutant_samples = ["SRR34134106", "SRR34134105", "SRR34134104"]

# Combine lists for the download loop
all_samples = wt_samples + mutant_samples

print(f"Total samples to download: {len(all_samples)}")
print(f"Output directory: {output_dir}")
print(f"All sample accessions: {all_samples}")
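
Later steps usually need to know which accession belongs to which group. The following is a minimal sketch of a tab-separated sample sheet built from the two lists above. The column names (`accession`, `condition`) and the group labels (`WT`, `cpxR_KO`) are illustrative assumptions, not part of the original pipeline.

```python
import csv
import io

# Accessions as defined in the cell above
wt_samples = ["SRR34134109", "SRR34134108", "SRR34134107"]
mutant_samples = ["SRR34134106", "SRR34134105", "SRR34134104"]


def render_sample_sheet(wt, mutant):
    """Return a TSV string with one row per accession.

    The labels 'WT' and 'cpxR_KO' are hypothetical names chosen for this sketch.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["accession", "condition"])
    for srr in wt:
        writer.writerow([srr, "WT"])
    for srr in mutant:
        writer.writerow([srr, "cpxR_KO"])
    return buffer.getvalue()


sample_sheet = render_sample_sheet(wt_samples, mutant_samples)
print(sample_sheet)
```

The string can be written to a file outside `01_Raw_Data/` so that the file count check later in this notebook is not affected.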

In [None]:
%%bash -s "$output_dir" "$all_samples"
# This cell uses '%%bash' magic to run shell commands.
# '$1' (output_dir) and '$2' (all_samples) are passed in from Python.

# 1. Assign Python variables to Bash variables
output_dir=$1
samples_list=$2

# 2. Convert the Python list string "['SRR...', 'SRR...']"
#    into a real Bash array (SRR... SRR...).
#    tr deletes the list punctuation ( [ ] , ' ); the set is double-quoted
#    so that it can contain a single quote. 'read -a' then splits on whitespace.
read -r -a samples_array <<< "$(echo "$samples_list" | tr -d "[],'")"

echo "--- Starting SRA Download Loop ---"
echo "Target Directory: $output_dir"
echo "Processing samples: ${samples_array[*]}"

# 3. Loop through each SRR accession and download
for srr in "${samples_array[@]}"; do
    echo "--------------------------------------"
    echo "Fetching: $srr"

    # Use fasterq-dump to download
    # --split-files : Splits paired-end data into _1.fastq and _2.fastq
    # -e 8          : Use 8 threads for faster processing
    # -O $output_dir: Set the output directory
    # -p            : Show a progress bar

    # Report success only if fasterq-dump exits without error
    if fasterq-dump "$srr" --split-files -e 8 -O "$output_dir" -p; then
        echo "Finished processing $srr."
    else
        echo "ERROR: fasterq-dump failed for $srr." >&2
    fi
done

echo "--------------------------------------"
echo "All SRA downloads complete."
echo "Files are located in: $output_dir"
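
The Bash cell above receives the accession list as a stringified Python list and has to strip the punctuation. As an alternative, the same loop can be driven from Python, which avoids the string parsing. This is a sketch under assumptions: the helper names `build_fasterq_cmd` and `download_all` are hypothetical, and the flags simply mirror the ones used above. By default it only prints the commands (dry run).

```python
import shlex
import subprocess


def build_fasterq_cmd(srr, output_dir, threads=8):
    """Build the fasterq-dump command with the same flags as the Bash cell."""
    return ["fasterq-dump", srr, "--split-files",
            "-e", str(threads), "-O", output_dir, "-p"]


def download_all(samples, output_dir, dry_run=True):
    """Run (or just print) one fasterq-dump command per accession.

    Returns the accessions whose command exited with a non-zero status.
    In a dry run nothing is executed, so the list is always empty.
    """
    failed = []
    for srr in samples:
        cmd = build_fasterq_cmd(srr, output_dir)
        if dry_run:
            print(shlex.join(cmd))
            continue
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failed.append(srr)
    return failed


# Dry run: prints the commands without downloading anything
failed = download_all(["SRR34134109", "SRR34134108"], "01_Raw_Data", dry_run=True)
```

Passing `dry_run=False` would execute the commands and requires `fasterq-dump` to be on the `PATH`.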

### 4. Verify Downloaded Files

Let's check the contents of the `01_Raw_Data/` directory to ensure all 6 samples (12 files, Paired-End) were downloaded correctly.

In [None]:
%%bash
echo "--- Verifying files in 01_Raw_Data ---"

# List files with human-readable sizes
ls -lh 01_Raw_Data/

echo "--------------------------------------"
echo "Verifying file count..."
# Count only the FASTQ files, ignoring anything else in the directory
file_count=$(find 01_Raw_Data/ -maxdepth 1 -name "*.fastq" | wc -l)
echo "Total files found: $file_count"

if [ "$file_count" -eq 12 ]; then
    echo "SUCCESS: All 12 FASTQ files (6 samples x 2 reads) are present."
else
    echo "WARNING: Expected 12 files, but found $file_count."
fi
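
A file count alone does not show which sample is incomplete. The sketch below checks each accession for both mates, assuming the `<accession>_1.fastq` / `<accession>_2.fastq` naming described for `--split-files` above. The function name `missing_fastq_files` is hypothetical.

```python
import os

# Accessions as defined in the first code cell
all_samples = ["SRR34134109", "SRR34134108", "SRR34134107",
               "SRR34134106", "SRR34134105", "SRR34134104"]


def missing_fastq_files(samples, directory, suffixes=("_1.fastq", "_2.fastq")):
    """Return the expected FASTQ paths that are absent or empty."""
    missing = []
    for srr in samples:
        for suffix in suffixes:
            path = os.path.join(directory, srr + suffix)
            if not os.path.isfile(path) or os.path.getsize(path) == 0:
                missing.append(path)
    return missing


missing = missing_fastq_files(all_samples, "01_Raw_Data")
if missing:
    print("Missing or empty files:")
    for path in missing:
        print("  " + path)
else:
    print("All expected FASTQ files are present.")
```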

### 5. Compress FASTQ Files (Optimization)

Raw FASTQ files are large (several GB each). To save disk space, we compress them to `.fastq.gz`.

We use `pigz` (a parallel implementation of `gzip`) for fast, multi-threaded compression. Downstream tools such as `fastp` and `hisat2` can read `.gz` files directly.

In [None]:
%%bash -s "$output_dir"
# $1 is the output_dir variable from Python (Cell 2)

output_dir=$1
echo "--- Starting Compression ---"
echo "Target directory: $output_dir"
echo "Compressing all *.fastq files using pigz (8 threads)..."

# -p 8: Use 8 threads
# $output_dir/*.fastq: Compress all files ending in .fastq in that directory
pigz -p 8 "$output_dir"/*.fastq

echo "--- Compression complete. ---"
echo "Verifying compressed .gz files:"
ls -lh "$output_dir"/
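
Listing the files confirms they exist, but not that they decompress cleanly or that both mates of a sample contain the same number of reads. The following sketch reads each compressed file to the end and compares read counts. The function names are hypothetical, and the demo runs on tiny synthetic files rather than the real data. On multi-GB files this pure-Python pass may be slow.

```python
import gzip
import os
import tempfile


def count_fastq_reads(path):
    """Count records in a gzip-compressed FASTQ file (4 lines per record)."""
    with gzip.open(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    if line_count % 4 != 0:
        raise ValueError(f"{path}: line count {line_count} is not a multiple of 4")
    return line_count // 4


def mates_match(directory, srr):
    """Return True if <srr>_1.fastq.gz and <srr>_2.fastq.gz hold equal read counts."""
    reads_1 = count_fastq_reads(os.path.join(directory, f"{srr}_1.fastq.gz"))
    reads_2 = count_fastq_reads(os.path.join(directory, f"{srr}_2.fastq.gz"))
    return reads_1 == reads_2


# Demo on two tiny synthetic files (not real sequencing data)
record = "@read1\nACGT\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    for mate in ("1", "2"):
        demo_path = os.path.join(tmp, f"DEMO_{mate}.fastq.gz")
        with gzip.open(demo_path, "wt") as handle:
            handle.write(record * 3)
    demo_reads = count_fastq_reads(os.path.join(tmp, "DEMO_1.fastq.gz"))
    demo_ok = mates_match(tmp, "DEMO")

print(f"Reads in demo file: {demo_reads}, mates match: {demo_ok}")
```

For the real data, `mates_match("01_Raw_Data", srr)` could be called for each accession after compression.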