## MAJOR STEP 1: Metadata Preparation

**Goal:** Load, inspect, and clean the raw SRA metadata (`SraRunTable_raw.csv`).

**Why:** We need a clean mapping file that connects each **scientific sample ID** (e.g., "PA011") to its **technical run ID** (e.g., "SRR166..."). This file will be the "driver" for our entire Snakemake pipeline.

In [5]:
import pandas as pd

# --- 1. Define Paths ---
# (Using '../' to go up from 'notebooks/' directory)
file_in = '../data/SraRunTable_raw.csv'
file_out = '../results/metadata/metadata_clean.csv'

# --- 2. Define Columns ---
# (Based on our 'Inspection' step)
columns_to_keep = [
    'Run',
    'BioSample',
    'Sample Name'
]

# (New names for clarity in our pipeline)
new_column_names = {
    'Run': 'run_id',
    'BioSample': 'biosample_id',
    'Sample Name': 'sample_id'
}

# --- 3. Run the Main Step (Read, Clean, Rename) ---
print(f"Reading raw metadata from {file_in}...")

# usecols= : Reads only the columns we need (saves memory and parsing time)
df = pd.read_csv(file_in, usecols=columns_to_keep)

# Rename columns to our new standard
df = df.rename(columns=new_column_names)

# --- 4. Verification (Respecting Rule 1) ---
print(f"Cleaned {len(df)} records. Verifying first 5 rows:")
print(df.head())

# --- 5. Save the Clean File ---
# (to_csv does not create missing directories, so create the parent of file_out first)
import os
os.makedirs(os.path.dirname(file_out), exist_ok=True)  # No error if it already exists

df.to_csv(file_out, index=False)
print(f"\nSuccessfully saved clean metadata to {file_out}")

Reading raw metadata from ../data/SraRunTable_raw.csv...
Cleaned 96 records. Verifying first 5 rows:
        run_id  biosample_id sample_id
0  SRR16632095  SAMN22305148     PA097
1  SRR16632096  SAMN22305147     PA096
2  SRR16632097  SAMN22305146     PA095
3  SRR16632098  SAMN22305144     PA093
4  SRR16632099  SAMN22305143     PA092

Successfully saved clean metadata to ../results/metadata/metadata_clean.csv
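
Because this file drives every later step, a few sanity checks on it may be worthwhile before moving on. The following is a minimal sketch, not part of the original notebook: the function name `validate_metadata` is hypothetical, and the one-run-per-sample rule is an assumption suggested by the 96-row table rather than something the metadata guarantees.

```python
import pandas as pd


def validate_metadata(df):
    """Return a list of problems found; an empty list means no issue was detected."""
    required = ["run_id", "biosample_id", "sample_id"]
    missing = [col for col in required if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    # Every cell in the three ID columns should be filled in.
    n_null = int(df[required].isna().sum().sum())
    if n_null:
        problems.append(f"{n_null} empty cell(s)")
    # Each run ID becomes one download target, so it must be unique.
    dup_runs = df.loc[df["run_id"].duplicated(), "run_id"].tolist()
    if dup_runs:
        problems.append(f"duplicated run_id values: {dup_runs}")
    # Assumption: one sequencing run per sample.
    dup_samples = df.loc[df["sample_id"].duplicated(), "sample_id"].tolist()
    if dup_samples:
        problems.append(f"duplicated sample_id values: {dup_samples}")
    return problems


# Toy table built from the first three rows printed above.
example = pd.DataFrame({
    "run_id": ["SRR16632095", "SRR16632096", "SRR16632097"],
    "biosample_id": ["SAMN22305148", "SAMN22305147", "SAMN22305146"],
    "sample_id": ["PA097", "PA096", "PA095"],
})
print(validate_metadata(example))  # []
```

In the notebook, the same function could be called on `df` just before `df.to_csv(...)`.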


## STEP 2: Test Data Download (Single Sample)

**Goal:** Test our download pipeline on a *single* sample.

**Why:** Before automating all 96 downloads (the "High-Throughput" part), we must verify that our `fasterq-dump` command works correctly and that our `metadata_clean.csv` file provides the correct ID.
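
Outside IPython, where the `!` shell escape is not available, the same check can be made explicit with `subprocess`. This is a sketch under assumptions: the helper names `build_fasterq_dump_cmd` and `download_run` are hypothetical, and the flags are simply the ones used in this notebook.

```python
import subprocess


def build_fasterq_dump_cmd(run_id, output_dir):
    """Return the fasterq-dump call as an argument list (no shell quoting needed)."""
    return ["fasterq-dump", "--split-files", "-O", output_dir, "-p", run_id]


def download_run(run_id, output_dir):
    """Run fasterq-dump; raises subprocess.CalledProcessError on a non-zero exit code."""
    subprocess.run(build_fasterq_dump_cmd(run_id, output_dir), check=True)


# Print the command only; call download_run(...) to actually start the download.
print(" ".join(build_fasterq_dump_cmd("SRR16632095", "../data/raw_reads")))
```

With `check=True`, a failed download stops the script instead of silently continuing to the verification step.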

In [2]:
import pandas as pd
import os

# --- 1. Load our clean metadata map ---
metadata_file = '../results/metadata/metadata_clean.csv'
df_meta = pd.read_csv(metadata_file)

# --- 2. Select our single test sample ---
# We'll just pick the first one from the file
test_run_id = df_meta.loc[0, 'run_id']
test_sample_id = df_meta.loc[0, 'sample_id']

print("--- Preparing to test download for: ---")
print(f"Sample ID: {test_sample_id}")
print(f"Run ID: {test_run_id}")

# --- 3. Define and create output directory ---
output_dir = '../data/raw_reads'
os.makedirs(output_dir, exist_ok=True)
print(f"Ensured output directory exists: {output_dir}")

# --- 4. Build and run the fasterq-dump command ---
# We use:
# --split-files : To get R1 and R2 (this is Paired-End data)
# -O : Output directory
# -p : Show progress

command = f"fasterq-dump --split-files -O {output_dir} -p {test_run_id}"

print(f"\nExecuting command:\n{command}")

# This will run the command in the shell. It may take a minute.
!{command}

# --- 5. Verification (Rule 1) ---
# If this step succeeds, we should see the new FASTQ files
print(f"\n--- Verification: Listing files in {output_dir} ---")
!ls -lh {output_dir}

--- Preparing to test download for: ---
Sample ID: PA097
Run ID: SRR16632095
Ensured output directory exists: ../data/raw_reads

Executing command:
fasterq-dump --split-files -O ../data/raw_reads -p SRR16632095
join   :|-------------------------------------------------- 100%
concat :|-------------------------------------------------- 100%
spots read      : 1,775,470
reads read      : 3,550,940
reads written   : 3,550,940

--- Verification: Listing files in ../data/raw_reads ---
total 1.3G
-rw-rw-r-- 1 refm_youssef refm_youssef 626M Oct 31 03:51 SRR16632095_1.fastq
-rw-rw-r-- 1 refm_youssef refm_youssef 626M Oct 31 03:51 SRR16632095_2.fastq
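
Equal file sizes are a good sign, but a stricter check is that both mate files hold the same number of records. The sketch below assumes uncompressed, four-line-per-record FASTQ and the `<run_id>_1.fastq` / `<run_id>_2.fastq` naming seen in the listing above. The function names are hypothetical.

```python
import os


def count_fastq_records(path):
    """Count records in an uncompressed FASTQ file, assuming 4 lines per record."""
    n_lines = 0
    with open(path, "rb") as handle:
        for _ in handle:
            n_lines += 1
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def check_pair(run_id, reads_dir):
    """Return the record count shared by both mate files; raise if they differ."""
    r1 = count_fastq_records(os.path.join(reads_dir, f"{run_id}_1.fastq"))
    r2 = count_fastq_records(os.path.join(reads_dir, f"{run_id}_2.fastq"))
    if r1 != r2:
        raise ValueError(f"{run_id}: R1 has {r1} records but R2 has {r2}")
    return r1
```

For the test sample, `check_pair("SRR16632095", "../data/raw_reads")` should match the 1,775,470 spots reported by `fasterq-dump`, since the 3,550,940 reads written are split evenly across the two files.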


## Conclusion & Handoff to Automation

**Status:** Success. We have confirmed two critical things:
1.  Our metadata is clean and saved (`results/metadata/metadata_clean.csv`).
2.  Our download tool (`fasterq-dump`) and method are working correctly on a single test sample.

**Next Step (The Handoff):**
The R&D (Research & Development) phase in this notebook is complete. The "Production" phase (downloading all 96 samples) is now automated using the main project `Snakefile`.
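
The project `Snakefile` itself is not shown in this notebook. As a purely illustrative sketch of how `metadata_clean.csv` can act as the driver, the helper below derives the two FASTQ paths expected for each run. The function name `expected_fastq_paths` and the project-root-relative `data/raw_reads` directory are assumptions, not the author's actual rules.

```python
import pandas as pd


def expected_fastq_paths(df_meta, reads_dir="data/raw_reads"):
    """List the R1/R2 FASTQ paths that a full download would be expected to produce."""
    paths = []
    for run_id in df_meta["run_id"]:
        paths.append(f"{reads_dir}/{run_id}_1.fastq")
        paths.append(f"{reads_dir}/{run_id}_2.fastq")
    return paths


# Toy table with two run IDs taken from the metadata preview above.
example = pd.DataFrame({"run_id": ["SRR16632095", "SRR16632096"]})
for path in expected_fastq_paths(example):
    print(path)
```

A list like this is the kind of input a top-level Snakemake target rule typically needs, so the number of download jobs follows directly from the number of rows in the metadata.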

To execute the full, high-throughput download, run the following command **from a terminal in the main project directory**:

```bash
snakemake --cores 4