# Project 3: 16S Microbiome Analysis (Crohn's Disease)

## Notebook 00: Setup and Data Download

### Objective
The goal of this notebook is to prepare our environment and download the raw sequencing data for the project.

-   **Project:** `PRJNA450540`
-   **Scientific Question:** Compare the gut microbiome of Crohn's Disease (CD) patients in two disease states (Active, Remission) against Healthy Controls.

### Workflow
1.  **Inspect Metadata:** Load the `SraRunTable.csv` file (our "map") to identify the columns needed for analysis (e.g., Run ID, Disease State).
2.  **Install Tools:** Install NCBI `sra-tools` to allow programmatic download from the SRA database.
3.  **Download Data:** Use `sra-tools` (specifically `fasterq-dump`) to automatically download all 256 FASTQ files based on the Run IDs from our metadata file.

In [None]:
import pandas as pd

# Set a pandas option to display all columns (not just a few)
# This is helpful for wide tables like this one.
pd.set_option('display.max_columns', None)

# Define the path to our metadata file
# We use '../' to go UP one level (from 'notebooks/')
# and then DOWN into the 'data/' folder.
metadata_file = "../data/SraRunTable.csv"

# Load the CSV into a pandas DataFrame (df)
df = pd.read_csv(metadata_file)

# Display the first 5 rows to inspect the columns
df.head()
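
Before relying on specific columns, it can help to confirm they are actually present in the table. The following is a minimal sketch under stated assumptions: `missing_columns` is a hypothetical helper (not part of pandas), and the required names are simply the columns this notebook refers to (`Run`, `host_subject_id`, `gastrointest_disord`, `PGA`).

```python
from typing import Iterable, List

# Columns this notebook refers to in later steps
REQUIRED_COLUMNS = ("Run", "host_subject_id", "gastrointest_disord", "PGA")

def missing_columns(columns: Iterable[str],
                    required: Iterable[str] = REQUIRED_COLUMNS) -> List[str]:
    """Return the required column names that are absent, in their listed order."""
    present = set(columns)
    return [name for name in required if name not in present]

# Toy header for illustration only; with the real table, pass df.columns instead
example_header = ["Run", "PGA", "host_subject_id"]
print(missing_columns(example_header))
```

With the real metadata, `missing_columns(df.columns)` returning an empty list would indicate that every column used below exists under the expected name.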

### 2. Investigate Metadata Columns

`df.head()` showed us the structure. Now, we need to find all the unique categories within the key columns. This will confirm we have the groups we need (Healthy, Active, Remission) and tell us how they are named.

In [None]:
# Let's check the unique values in 'host_subject_id'
# .unique() returns a NumPy array of every distinct entry
# We convert it with list() so it prints as a plain Python list
subject_ids = list(df['host_subject_id'].unique())

print(f"Total unique subjects found: {len(subject_ids)}")
print("---------------------------------")
print(subject_ids)

### 3. Verify Group Definitions

Our hypothesis from `host_subject_id` is:
-   `Control_...` = Healthy
-   `patient_..._active` = Active CD
-   `patient_...` (no suffix) = Remission CD

We must verify this. Let's inspect other columns like `gastrointest_disord` and `PGA` to see what labels they contain. This will confirm our groupings.

In [None]:
# Let's check the unique values in other potential grouping columns
print("Unique values in 'gastrointest_disord':")
print(list(df['gastrointest_disord'].unique()))
print("---------------------------------")
print("Unique values in 'PGA':")
print(list(df['PGA'].unique()))
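
One way to test the naming hypothesis is to count how often each ID-derived label co-occurs with each `PGA` value. The sketch below is illustrative only: `label_from_subject_id` and `cross_count` are hypothetical helpers, and the toy IDs and `PGA` labels are invented, so the real labels may be spelled differently.

```python
from collections import Counter

def label_from_subject_id(subject_id: str) -> str:
    """Map a host_subject_id to a group using the naming hypothesis above."""
    sid = subject_id.strip().lower()
    if sid.startswith("control"):
        return "Healthy"
    if sid.endswith("_active"):
        return "Active CD"
    return "Remission CD"

def cross_count(subject_ids, pga_values) -> Counter:
    """Count (hypothesis label, PGA value) pairs across all rows."""
    return Counter(
        (label_from_subject_id(s), p) for s, p in zip(subject_ids, pga_values)
    )

# Toy rows for illustration only; not taken from PRJNA450540
toy_subjects = ["Control_01", "patient_07_active", "patient_07"]
toy_pga = ["healthy", "active", "remission"]

for (group, pga), n in sorted(cross_count(toy_subjects, toy_pga).items()):
    print(f"{group:<13} {pga:<10} {n}")
```

With the real table, `pd.crosstab(df['host_subject_id'].map(label_from_subject_id), df['PGA'])` should give the same counts in tabular form. Any ID-derived label that pairs with more than one `PGA` value would contradict the hypothesis.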

### 4. Conclusion from Metadata

The investigation was successful: the metadata contains all three of our target groups.

* The **`Run`** column will give us the SRA IDs for download.
* The **`PGA`** column will be our primary scientific variable (Healthy/Active/Remission).
* The **`host_subject_id`** column will be used to identify unique individuals (important for longitudinal analysis later).
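
If the naming hypothesis from section 3 holds, one individual can appear under two IDs (`patient_...` and `patient_..._active`), so grouping runs by individual requires removing the suffix first. The sketch below assumes that convention; `base_subject_id` and `runs_per_individual` are hypothetical helpers, and the run and subject IDs are invented.

```python
from collections import defaultdict

ACTIVE_SUFFIX = "_active"

def base_subject_id(subject_id: str) -> str:
    """Strip the '_active' suffix so both visits map to the same individual."""
    if subject_id.endswith(ACTIVE_SUFFIX):
        return subject_id[: -len(ACTIVE_SUFFIX)]
    return subject_id

def runs_per_individual(pairs) -> dict:
    """Group run IDs by individual, given (run_id, host_subject_id) pairs."""
    grouped = defaultdict(list)
    for run_id, subject_id in pairs:
        grouped[base_subject_id(subject_id)].append(run_id)
    return dict(grouped)

# Toy pairs for illustration only; not taken from PRJNA450540
toy_pairs = [
    ("SRR0000001", "patient_07"),
    ("SRR0000002", "patient_07_active"),
    ("SRR0000003", "Control_01"),
]
print(runs_per_individual(toy_pairs))
```

With the real table, `runs_per_individual(zip(df['Run'], df['host_subject_id']))` would list which individuals contribute more than one run.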

Now we can proceed to the main goal of this notebook: downloading the data.

### 5. Install SRA-Tools

To download the 256 files from NCBI SRA, we need the official `sra-tools` toolkit. We will install it into our `qiime2_env` environment using `mamba`.

We use `!` to run a shell command (like `mamba`) directly from inside our Jupyter notebook.

In [None]:
# Install sra-tools using mamba
# -n qiime2_env: ensures it installs in our specific environment
# -c conda-forge -c bioconda: sra-tools is on bioconda, whose packages
#   expect conda-forge to be available for their dependencies
# -y: automatically says 'yes' to the installation prompt

!mamba install -n qiime2_env -c conda-forge -c bioconda sra-tools -y
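
After the install finishes, it is worth checking that `fasterq-dump` can be found. This is a minimal sketch using the standard-library `shutil.which`; `tool_available` is a hypothetical helper. Note the assumption: the lookup uses the `PATH` of the running kernel, so it only finds the tool if the notebook was started from `qiime2_env`.

```python
import shutil

def tool_available(name: str) -> bool:
    """Return True if an executable with this name is on the current PATH."""
    return shutil.which(name) is not None

tool = "fasterq-dump"
if tool_available(tool):
    print(f"{tool} found at: {shutil.which(tool)}")
else:
    print(f"{tool} NOT found on PATH; is the kernel running inside qiime2_env?")
```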

### 6. Prepare for Data Download

Now that `sra-tools` is installed, we can download the data.

**Warning:** We will *not* run the download command directly in a notebook cell. Downloading 256 runs can take hours, and a cell that runs that long may time out or crash the notebook kernel.

**The Professional Workflow:**
1.  Extract the list of all 256 `Run` IDs from our `df` DataFrame.
2.  Save this list to a simple text file called `sra_run_list.txt` inside the `data` folder.
3.  Create a `bash` script (`download_data.sh`) to download the files.
4.  We will then run this script from our **Jupyter Terminal** (which is inside our `screen` session) to ensure it runs safely in the background.

In [None]:
# 1. Get the 'Run' column from our DataFrame as a list
run_ids = df['Run'].tolist()

# 2. Check how many IDs we have
print(f"Total Run IDs extracted: {len(run_ids)}")

# 3. Define the path for our new text file
output_list_file = "../data/sra_run_list.txt"

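# 4. Write this list to the text file
# We use '\n'.join(run_ids) to write one ID per line, and end the file with a
# final newline so that a shell `while read` loop also sees the last ID
with open(output_list_file, 'w') as f:
    f.write('\n'.join(run_ids) + '\n')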

print(f"Successfully saved list to: {output_list_file}")
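
Before starting a download that runs for hours, a few sanity checks on the ID list can catch problems early. The sketch below is an assumption-laden example: `check_run_ids` is a hypothetical helper, the pattern assumes run accessions are `SRR`, `ERR`, or `DRR` followed by digits, and the toy IDs are invented.

```python
import re
from collections import Counter

# Assumed accession format: SRR/ERR/DRR followed by digits
RUN_ID_PATTERN = re.compile(r"[SED]RR\d+")

def check_run_ids(run_ids, expected_count=None):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []
    malformed = [r for r in run_ids if not RUN_ID_PATTERN.fullmatch(str(r))]
    if malformed:
        problems.append(f"malformed IDs: {malformed}")
    duplicates = sorted(r for r, n in Counter(run_ids).items() if n > 1)
    if duplicates:
        problems.append(f"duplicate IDs: {duplicates}")
    if expected_count is not None and len(run_ids) != expected_count:
        problems.append(f"expected {expected_count} IDs, found {len(run_ids)}")
    return problems

# Toy list for illustration only
toy_ids = ["SRR0000001", "SRR0000002", "SRR0000002", "not_an_id"]
for problem in check_run_ids(toy_ids, expected_count=3):
    print(problem)
```

For this project, `check_run_ids(run_ids, expected_count=256)` should return an empty list if the metadata matches the expected 256 runs.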

### 7. Create the Download Script

We will use the `%%writefile` "magic" command to write the contents of this cell directly to a new file called `download_data.sh` inside our `data` folder.

In [None]:
%%writefile ../data/download_data.sh
#!/bin/bash

# This script will download SRA files using fasterq-dump

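# 1. Define the path to our list of IDs
# Relative paths resolve from the directory the script is launched in,
# so run it from 'notebooks/' or 'data/'
ID_LIST_FILE="../data/sra_run_list.txt"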

# 2. Define the output directory (where FASTQ files will go)
OUT_DIR="../data/raw_fastq/"

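# 3. Create the output directory if it doesn't exist
mkdir -p "$OUT_DIR"

# 4. Loop through each ID in the ID_LIST_FILE
# -r keeps backslashes literal; the || test still processes a final line
# that has no trailing newline
while read -r SRR_ID || [ -n "$SRR_ID" ]; do

    # Skip blank lines in the ID list
    [ -z "$SRR_ID" ] && continue
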
    echo "----------------------------------------"
    echo "Starting download for: $SRR_ID"
    
    # Run fasterq-dump
    # -e 8: Use 8 threads (matches our 8 CPUs)
    # -p: Show progress
    # -O "$OUT_DIR": Set the output directory
    # No split flag is passed: fasterq-dump defaults to --split-3, which
    # writes a single FASTQ file per run for single-end data like ours
    # Testing the command directly in `if` checks whether it succeeded
    if fasterq-dump "$SRR_ID" -e 8 -p -O "$OUT_DIR"; then
        echo "Successfully downloaded: $SRR_ID"
    else
        echo "ERROR: Failed to download $SRR_ID" >&2
    fi
    
done < "$ID_LIST_FILE"

echo "----------------------------------------"
echo "Finished processing all Run IDs. Check the output above for any ERROR lines."
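
Once the script has finished, it can be useful to list which runs still lack a FASTQ file so that only those are retried. The following is a sketch under stated assumptions: `find_missing_runs` is a hypothetical helper, and it assumes output files are named `<Run>.fastq` (or `<Run>_1.fastq` / `<Run>_2.fastq` for paired data). The demonstration uses a temporary directory and invented IDs.

```python
from pathlib import Path
import tempfile

def find_missing_runs(run_ids, fastq_dir):
    """Return the run IDs that have no matching .fastq file in fastq_dir."""
    fastq_dir = Path(fastq_dir)
    names = [p.name for p in fastq_dir.glob("*.fastq")] if fastq_dir.is_dir() else []
    missing = []
    for run_id in run_ids:
        # Accept <Run>.fastq or split files such as <Run>_1.fastq
        has_file = any(
            name == f"{run_id}.fastq" or name.startswith(f"{run_id}_")
            for name in names
        )
        if not has_file:
            missing.append(run_id)
    return missing

# Demonstration with an empty placeholder file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "SRR0000001.fastq").write_text("")
    print(find_missing_runs(["SRR0000001", "SRR0000002"], tmp))
```

With the real data, `find_missing_runs(run_ids, "../data/raw_fastq")` would return the IDs to write into a new list file for a second pass. A file that exists but is truncated would not be detected by this check.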