# Test Prep Function
This notebook tests `prep_ip()`, the data loading and preparation step of the pipeline.

**Output:** Saves `data_after_prep.pkl` for use in subsequent notebooks.

# Cell 1: Imports
import sys
sys.path.append('..')

from ipms.prep import prep_ip
import pandas as pd

print("✓ All imports successful!")



# Cell 2: Run prep_ip()
# This will load data, filter, and automatically save to data_after_prep.pkl

data = prep_ip('../config/example_config.yaml')

print("\n✓ prep_ip() complete!")
print("\nData has been saved to: results/data_after_prep.pkl")
print("Next step: Run 02_test_qc.ipynb")



# Cell 3: Quick Inspection
print("="*60)
print("DATA SUMMARY")
print("="*60)

print(f"\nProteins: {data['metadata']['n_proteins']}")
print(f"Samples: {data['metadata']['n_samples']}")
print(f"Conditions: {data['metadata']['conditions']}")

print("\nIntensity columns per condition:")
for condition, cols in data['intensity_cols'].items():
    print(f"  {condition}: {len(cols)} replicates")

print("\nFirst few proteins:")
df = data['df']
print(df[['Accession', 'Gene_Symbol', '# Peptides']].head())




# Cell 4: Verify Saved File
import os

save_path = '../results/data_after_prep.pkl'

if os.path.exists(save_path):
    size_mb = os.path.getsize(save_path) / (1024 * 1024)
    print("✓ Data saved successfully!")
    print(f"  Location: {save_path}")
    print(f"  Size: {size_mb:.1f} MB")
    print("\n✓ Ready for next step: 02_test_qc.ipynb")
else:
    print(f"✗ Save file not found: {save_path}")

✓ All imports successful!

STEP 1: LOADING DATA AND CONFIGURATION

> Configuration loaded
  Experiment: AS3MT_IP-MS
  Control: EV
  Treatments: WT, d2d3

[1/8] Loading proteomics data...


Input sequence provided is already in string format. No operation performed
Input sequence provided is already in string format. No operation performed


  > Loaded 6527 proteins, 30 columns

[2/8] Identifying intensity columns...
  EV: 5 replicates
  WT: 5 replicates
  d2d3: 5 replicates
  > Total intensity columns: 15

[3/8] Filtering by minimum peptides...
  > Removed 1336 proteins with < 2 peptides
    Remaining: 5191 proteins

[4/8] Filtering proteins by missingness in treatments...
  WT: 3127 proteins have >=3/5 valid values
  d2d3: 3559 proteins have >=3/5 valid values

  > Removed 1490 low-quality proteins
    Remaining: 3701 proteins
    (Kept proteins present in >=50% of replicates in at least one treatment)

[5/8] Checking for gene symbols...

  Gene symbols missing - mapping from protein IDs using mygene...
  Querying mygene for 3701 proteins...
  Cleaned to 3688 unique base IDs
  Trying scope: uniprot...


14 input query terms found dup hits:	[('P04908', 2), ('Q96PK6', 2), ('P68431', 10), ('P59665', 2), ('Q9ULR0', 2), ('P62807', 6), ('Q9Y3E7
53 input query terms found no hit:	['P01857', 'Q5SS57', 'A0A0U1RRH7', 'A0A140TA69', 'F8WE88', 'A0A0G2JMZ8', 'A8MWD9', 'A0A994J749', 'A0
Input sequence provided is already in string format. No operation performed
Input sequence provided is already in string format. No operation performed


    > Mapped 3635 proteins
  Trying scope: accession,uniprot,refseq,ensembl.protein...


46 input query terms found no hit:	['Q5SS57', 'A0A0U1RRH7', 'A0A140TA69', 'F8WE88', 'A0A0G2JMZ8', 'A0A994J749', 'A0A590UK80', 'J3KR72',


    > Mapped 7 proteins

  > Successfully mapped 3655/3701 proteins to gene symbols

[6/8] Removing manual contaminants...
    Found 42 proteins matching 'KRT'
    Found 1 proteins matching 'Keratin'
    Found 5 proteins matching 'TRYP'
    Found 3 proteins matching 'ALB'
    Found 10 proteins matching 'Immunoglobulin'

  > Removed 61 manual contaminant proteins
    Remaining: 3640 proteins

[7/8] Filtering contaminants using CRAPome database...
  Skipping CRAPome filtering

[8/8] Saving filtered data as CSV for reference...
  > Saved: filtered_proteins_after_prep.csv
    Location: /Users/richard.cassidy/ipms_pipeline/results/tables
    3640 proteins x 31 columns
  > Saved: filtered_proteins_summary.csv
    (Protein info + intensities only)

Data quality summary...

  Missing values by condition:
    EV: 7.2% missing
    WT: 19.7% missing
    d2d3: 12.2% missing

  Proteins detected per condition:
    EV: 3629 proteins
    WT: 3542 proteins
    d2d3: 3635 proteins

> Output directories
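
The rule reported in step 4 of the log (keep a protein if at least one treatment has valid values in >=50% of its replicates, i.e. >=3 of 5) can be sketched with pandas as below. This is an illustration only, not the actual `prep_ip()` implementation: the helper name `filter_by_missingness`, the toy column names, and the choice to treat only NaN as "missing" are all assumptions.

```python
import math

import numpy as np
import pandas as pd


def filter_by_missingness(df, intensity_cols, treatments, min_fraction=0.5):
    """Keep rows with enough non-missing values in at least one treatment.

    Hypothetical helper: `intensity_cols` maps condition -> list of column
    names, mirroring data['intensity_cols'] in this notebook.
    """
    keep = pd.Series(False, index=df.index)
    for condition in treatments:
        cols = intensity_cols[condition]
        # Smallest replicate count meeting the fraction, e.g. ceil(0.5 * 5) = 3
        min_valid = math.ceil(min_fraction * len(cols))
        n_valid = df[cols].notna().sum(axis=1)
        keep |= n_valid >= min_valid
    return df[keep]


# Toy data (made-up values): 4 proteins, 2 treatments x 5 replicates
nan = np.nan
wt_cols = [f"WT_{i}" for i in range(1, 6)]
mut_cols = [f"d2d3_{i}" for i in range(1, 6)]
rows = {
    "P1": [1, 1, 1, nan, nan] + [nan] * 5,                 # 3/5 in WT: kept
    "P2": [1, 1, nan, nan, nan] + [1, 1, nan, nan, nan],   # 2/5 in both: removed
    "P3": [nan] * 5 + [1, 1, 1, 1, 1],                     # 5/5 in d2d3: kept
    "P4": [nan] * 10,                                      # never detected: removed
}
toy = pd.DataFrame.from_dict(rows, orient="index", columns=wt_cols + mut_cols)

filtered = filter_by_missingness(
    toy, {"WT": wt_cols, "d2d3": mut_cols}, treatments=["WT", "d2d3"]
)
print(list(filtered.index))
```

Note that the control (EV) is deliberately left out of `treatments` here, matching the log line "Filtering proteins by missingness in treatments".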

## Next Steps

Data is now saved and ready for:
- **02_test_qc.ipynb** - Quality control plots
- **03_test_norm.ipynb** - Normalization

You can close this notebook - the data is saved!
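
A later notebook could reload the prepared data along these lines. This is a sketch that assumes `data_after_prep.pkl` is a standard Python pickle of the dictionary returned by `prep_ip()`, with the `metadata`, `intensity_cols`, and `df` keys inspected in Cell 3. The helper name `load_prepared` is hypothetical and not part of `ipms`; the demo uses a stand-in dictionary rather than real pipeline output.

```python
import pickle
import tempfile
from pathlib import Path


def load_prepared(path):
    """Load a pickled data dictionary and check the keys this notebook inspects."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Run the prep notebook first; missing: {path}")
    with path.open("rb") as fh:
        data = pickle.load(fh)
    missing = {"metadata", "intensity_cols", "df"} - set(data)
    if missing:
        raise KeyError(f"Unexpected file contents; missing keys: {sorted(missing)}")
    return data


# Round-trip demo with a stand-in dictionary (not real pipeline output)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "data_after_prep.pkl"
    stand_in = {
        "metadata": {"n_proteins": 2},
        "intensity_cols": {"EV": ["EV_1"]},
        "df": None,
    }
    demo_path.write_bytes(pickle.dumps(stand_in))
    reloaded = load_prepared(demo_path)

print(reloaded["metadata"]["n_proteins"])
```

In the next notebook this would be called as `data = load_prepared('../results/data_after_prep.pkl')`. Pickle files can execute arbitrary code when loaded, so only open files produced by your own pipeline runs.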