### MAJOR STEP 1: Setting up the Analysis Environment and Importing Data

The primary goal of this notebook is to analyze the biological impact of the mutations identified across the seven *S. aureus* samples (Control, DAP treatments, and VAN treatments).

Since the MultiQC aggregation step was abandoned due to software incompatibility, we will perform a **programmatic aggregation and data cleaning** using the Python **Pandas** library. This will allow us to create a single, clean comparison table, which is essential for biological interpretation.
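
As a minimal sketch of the cleaning step, the snippet below parses a made-up two-gene excerpt laid out like a SnpEff `*.genes.txt` report (one comment line, then a header line starting with `#`). The gene names, IDs, counts, and the `SAMPLE_X` label are placeholders for illustration, not values from this project.

```python
import io
import pandas as pd

# Hypothetical excerpt in the assumed layout of a SnpEff *.genes.txt report:
# line 1 is a comment, line 2 is the header (prefixed with '#').
demo_report = (
    "# The following table is formatted as tab separated values.\n"
    "#GeneName\tGeneId\tvariants_impact_HIGH\tvariants_impact_MODERATE\n"
    "geneA\tID_0001\t0\t2\n"
    "geneB\tID_0002\t1\t0\n"
)

# Skip the comment line and use the next line as the header
demo_df = pd.read_csv(io.StringIO(demo_report), sep="\t", skiprows=1, header=0)

# Remove the leading '#' from the first column name
demo_df.columns = demo_df.columns.str.lstrip("#")

# Tag every row with its (placeholder) sample name
demo_df["Sample"] = "SAMPLE_X"

print(demo_df)
```

Reading from an in-memory string makes it easy to check the `skiprows`/`header` combination before pointing the same call at the real report files.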

In [None]:
# Import necessary libraries
import pandas as pd
import glob
import os

print("--- 1. Verification: Pandas is ready for analysis ---")

# Define paths
REPORT_DIR = os.path.join("..", "results", "snpeff_reports")

# --------------------------------------------------------------------------
# Major Step 1.1: Load and Clean the SnpEff Reports
# --------------------------------------------------------------------------
print("\n--- 2. Loading and cleaning reports ---")

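# Find all the original .genes.txt files (sorted for a reproducible order)
report_files = sorted(glob.glob(os.path.join(REPORT_DIR, "*.snpEff_summary.genes.txt")))
if not report_files:
    # Fail early with a clear message instead of an opaque pd.concat error later
    raise FileNotFoundError(f"No SnpEff gene reports found in {REPORT_DIR}")
all_data = []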

# Loop through each file
for file_path in report_files:
    # Extract the sample name (e.g., SRR11187849)
    sample_name = os.path.basename(file_path).split('.')[0] 
    
    # Load the file, skipping the first line (a comment)
    # and treating the next line (the header) as the column names.
    df = pd.read_csv(
        file_path, 
        sep='\t', 
        skiprows=1, 
        header=0
    )
    
    # Clean the column names (remove the leading '#')
    df.columns = df.columns.str.lstrip('#')
    
    # Add the sample name as a column for easier comparison
    df['Sample'] = sample_name
    
    # Add to our list
    all_data.append(df)

# Concatenate all DataFrames into one master comparison table
comparison_df = pd.concat(all_data, ignore_index=True)

print(f"Successfully loaded and concatenated {len(all_data)} reports.")
print(f"Master comparison table has {comparison_df.shape[0]} rows and {comparison_df.shape[1]} columns.")

# Display the key data columns
print("\n--- 3. Clean Comparison Data (Head) ---")
print(comparison_df[['Sample', 'GeneName', 'variants_impact_HIGH', 'variants_effect_missense_variant']].head(10))

### MAJOR STEP 2: Data Cleaning and Defining Sample Roles

Before comparison, we must clean the dataset by assigning clear biological roles to each sample (Control vs. Treatment) and filtering out redundant information.
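
One thing worth checking when assigning roles is whether every sample actually received one. The sketch below uses a toy table with placeholder sample IDs (`S1` to `S3`) and a deliberately incomplete map to show how an unmapped sample surfaces as `NaN`; none of these names come from the project.

```python
import pandas as pd

# Toy data: three placeholder samples, only two of which appear in the map
toy = pd.DataFrame({"Sample": ["S1", "S2", "S3"]})
toy_map = {"S1": "Control_Parent", "S2": "DAP_P5"}  # S3 deliberately missing

# .map() leaves NaN wherever the key is not found
toy["Group"] = toy["Sample"].map(toy_map)

# The .str accessor propagates NaN rather than raising on missing values
toy["Treatment"] = toy["Group"].str.split("_").str[0]

# Collect any samples that did not receive a group
unmapped = toy.loc[toy["Group"].isna(), "Sample"].tolist()
print(unmapped)
```

An empty `unmapped` list is a quick confirmation that the mapping covers every report that was loaded.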

In [None]:
# ---
# Major Step 2.1: Assigning Biological Groups
#
# WHY: We need to replace the technical SRR Accessions with their
#      biological function (Control, DAP, VAN) for interpretability.
# ---

# Mapping SRR Accessions to their biological group (from Project Description)
sample_map = {
    'SRR5100333': 'Control_Parent',
    'SRR11187850': 'DAP_P5',
    'SRR11187849': 'DAP_P20',
    'SRR5100329': 'DAP_Final',
    'SRR11187852': 'VAN_P5',
    'SRR11187851': 'VAN_P20',
    'SRR5100339': 'VAN_Final'
}

# Create the new columns 'Group' and 'Treatment'
comparison_df['Group'] = comparison_df['Sample'].map(sample_map)
# Use the .str accessor so an unmapped sample yields NaN instead of raising an error
comparison_df['Treatment'] = comparison_df['Group'].str.split('_').str[0]

print("--- 2.1 Sample Group Assignment Complete ---")
print(comparison_df[['Sample', 'Group', 'Treatment']].drop_duplicates())

In [None]:
# ---
# Major Step 2.2: Focusing on HIGH and MODERATE Impact Variants
#
# WHY: MODIFIER (e.g., intergenic) and LOW impact variants are rarely
#      the primary cause of antibiotic resistance. We focus on MODERATE and HIGH.
# ---

# List every impact column SnpEff reports (HIGH, MODERATE, LOW, MODIFIER) for reference
IMPACT_COLS = [col for col in comparison_df.columns if col.startswith('variants_impact_')]

# Count the HIGH and MODERATE impact variants in each row
comparison_df['Total_Significant_Impact'] = comparison_df['variants_impact_HIGH'] + comparison_df['variants_impact_MODERATE']

# Filter the DataFrame: keep rows where Total_Significant_Impact > 0,
# plus every row from the Control sample (our baseline reference), whatever its impact.
filtered_df = comparison_df[
    (comparison_df['Total_Significant_Impact'] > 0) | 
    (comparison_df['Group'] == 'Control_Parent')
].copy()

print("--- 2.2 Data Cleaning Complete ---")
print(f"Original rows: {comparison_df.shape[0]}")
print(f"Filtered rows (Significant Impacts + Control): {filtered_df.shape[0]}")

# Show how many rows remain in each group
print("\nRemaining rows by Group:")
print(filtered_df.groupby('Group')['GeneName'].count().sort_values(ascending=False))

### MAJOR STEP 3: Isolating Treatment-Specific Variants (Set Difference)

Now that we have a clean list of 95 candidate gene-rows (Significant Impacts + Control), our primary goal is to perform a **set difference**.

We will create a "fingerprint" of all genes that were mutated in the `Control_Parent` sample. Then, we will filter our list of treated variants to find only those mutations in genes that are **NOT** in the control list.

These are our candidate *de novo* mutations: the most likely drivers of antibiotic resistance.
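
The set difference described above can be sketched on a toy table. The gene names (`geneA` to `geneD`) are placeholders; only the `Group` and `GeneName` column names follow the notebook.

```python
import pandas as pd

# Toy data: geneA is mutated in both Control and DAP_P5,
# geneC and geneD appear only in treated samples
toy = pd.DataFrame({
    "Group": ["Control_Parent", "Control_Parent", "DAP_P5", "DAP_P5", "VAN_P5"],
    "GeneName": ["geneA", "geneB", "geneA", "geneC", "geneD"],
})

# Control "fingerprint": every gene reported in the control sample
control_set = set(toy.loc[toy["Group"] == "Control_Parent", "GeneName"])

# Treated rows whose gene is NOT in the control fingerprint
treated = toy[toy["Group"] != "Control_Parent"]
novel = treated[~treated["GeneName"].isin(control_set)]

print(sorted(novel["GeneName"]))
```

Because the comparison is made on gene names, a gene is excluded as soon as the control carries any variant in it, even if a treated sample holds a different variant in the same gene. The result is therefore best read as a list of candidate genes, not of individual variants.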

In [None]:
# ---
# Major Step 3.1: Create the "Control Fingerprint"
#
# WHY: We need a definitive list of all genes that SnpEff
#      reported (regardless of impact) in the Control sample.
# ---

# We use the 'filtered_df', which retains every control row regardless of impact
control_genes = filtered_df[filtered_df['Group'] == 'Control_Parent']['GeneName'].unique()

print(f"--- 3.1 Control Fingerprint Created ---")
print(f"Found {len(control_genes)} unique mutated genes in the Control sample.")

# ---
# Major Step 3.2: Isolate Significant Treated Variants
#
# WHY: We now create a list of *only* the significant (HIGH/MODERATE)
#      variants from the treated (DAP/VAN) samples.
# ---

# We filter the 'filtered_df' again, this time *excluding* the control
treated_significant_variants_df = filtered_df[filtered_df['Group'] != 'Control_Parent'].copy()

print(f"Found {treated_significant_variants_df.shape[0]} significant variant-rows in treated samples.")

In [None]:
# ---
# Major Step 3.3: Perform the Set Difference
#
# WHY: We find rows in 'treated_significant_variants_df' where the GeneName
#      is *NOT IN* the 'control_genes' list.
# ---

# The '~' (tilde) symbol means NOT
de_novo_variants_df = treated_significant_variants_df[
    ~treated_significant_variants_df['GeneName'].isin(control_genes)
]

print(f"--- 3.3 Set Difference Complete ---")
print(f"Identified {de_novo_variants_df.shape[0]} *de novo* variant-rows (Prime Suspects).")

# ---
# Major Step 3.4: Final Interpretation
#
# WHY: Let's look at the final table of candidate genes.
# ---
print("\n--- FINAL CANDIDATE LIST (De Novo Mutations) ---")

# We select the most important columns for interpretation
final_columns = [
    'Group', 
    'GeneName', 
    'variants_impact_HIGH', 
    'variants_impact_MODERATE', 
    'variants_effect_missense_variant', # A key type of moderate mutation
    'variants_effect_frameshift_variant' # A key type of high impact mutation
]

# Display the results
# We sort by 'GeneName' to see if the same gene mutated in different samples
print(de_novo_variants_df[final_columns].sort_values(by=['GeneName', 'Group']))

In [None]:
# ---
# Major Step 3.5: Saving the Final Candidate List
#
# WHY: We save our final, clean "de_novo_variants_df" to a CSV file.
#      This allows the next notebook (06_Visualization) to load
#      this data directly, creating a clean separation of concerns
#      (Analysis vs. Visualization).
# ---

# Define the output directory and file path
output_dir = "../results/final_analysis_data/"
output_file = os.path.join(output_dir, "de_novo_candidates.csv")

# Create the directory if it doesn't exist (portable; no shell call needed)
os.makedirs(output_dir, exist_ok=True)

# Save the DataFrame (the 48 rows)
de_novo_variants_df.to_csv(output_file, index=False)

print(f"--- 3.5 Final Candidate List (Prime Suspects) saved to: ---")
print(output_file)

### MAJOR STEP 4: Biological Interpretation and Conclusion

The comparative analysis identified 48 *de novo* variant-rows: HIGH or MODERATE impact entries in treated samples for genes that the Control fingerprint does not contain. These "prime suspects" suggest a coherent biological story of adaptation:

**1. Daptomycin (DAP) Resistance Pathway (Membrane Adaptation):**
The DAP-treated samples (`DAP_P20`, `DAP_Final`) show a cluster of significant `missense` mutations in genes critical for cell membrane homeostasis:
* **`mprF`**: A well-known DAP resistance gene that alters membrane charge.
* **`agrA`**: The master regulator of virulence, linked to membrane properties.
* **`pgsA`** & **`cls`**: Key enzymes in membrane lipid biosynthesis.

**2. Vancomycin (VAN) Resistance Pathway (Cell Wall Stress Response):**
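The VAN-treated samples show `missense` mutations in genes tied to cell envelope stress sensing and response:
* **`vraG`**: The permease component of the VraFG ABC transporter, which works with the GraRS two-component system.
* **`vraT`**: A regulator encoded in the `vraTSR` operon, which senses cell wall damage.
* **`srrA`**: The response regulator of the SrrAB two-component system, another stress-response pathway.

This is consistent with the VAN samples adapting through altered cell wall stress signaling, although missense variants alone do not show whether a pathway is activated or impaired.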

**3. The "Hypermutator" Engine (`mutL`):**
Perhaps the most significant finding is the emergence of **high-impact `frameshift` mutations** in the DNA mismatch-repair gene **`mutL`** in the DAP lineage. Such loss-of-function mutations likely created a "hypermutator" state, raising the mutation rate and allowing genes like `mprF` to acquire resistance-conferring mutations rapidly.
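
To see which candidate genes recur across samples, a gene-by-group matrix can help. The sketch below runs on a toy table with placeholder gene names and counts; only the column names mirror the notebook's `de_novo_variants_df`.

```python
import pandas as pd

# Toy stand-in for the candidate table (placeholder genes and counts)
toy = pd.DataFrame({
    "Group": ["DAP_P20", "DAP_Final", "DAP_Final", "VAN_Final"],
    "GeneName": ["geneA", "geneA", "geneB", "geneC"],
    "variants_impact_HIGH": [0, 0, 1, 0],
    "variants_impact_MODERATE": [1, 1, 0, 2],
})
toy["Total_Significant_Impact"] = (
    toy["variants_impact_HIGH"] + toy["variants_impact_MODERATE"]
)

# One row per gene, one column per group, cells = significant variant count
gene_by_group = toy.pivot_table(
    index="GeneName",
    columns="Group",
    values="Total_Significant_Impact",
    aggfunc="sum",
    fill_value=0,
)

# Genes with a significant variant in more than one group
recurrent = gene_by_group[(gene_by_group > 0).sum(axis=1) > 1].index.tolist()

print(gene_by_group)
print(recurrent)
```

A gene that appears in several time points of one lineage, or in both lineages, stands out in this layout more readily than in a sorted long table.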

**Conclusion:**
The project has successfully built a pipeline to identify and interpret these key evolutionary pathways to antibiotic resistance.