# Test QC Function
This notebook tests `qc_ip()` and allows you to drop bad samples.

**Workflow:**
1. Load data from prep
2. Generate QC plots
3. Review plots
4. Drop problematic samples (optional)
5. Save cleaned data

# Run QC

In [None]:
# Cell 1: Imports
import sys
sys.path.append('..')


from ipms.analysis import load_data, qc_ip, drop_samples, save_data
import pandas as pd

print("✓ All imports successful!")




# Cell 2: Load Data
data = load_data('../results/data_after_prep.pkl')

print(f"\n✓ Loaded {data['metadata']['n_proteins']} proteins")
print(f"Samples: {data['metadata']['n_samples']}")


# Cell 3: Run QC Analysis
qc_ip(data)

print("\n✓ QC plots saved to: results/figures/qc/")
print("\n⚠ STOP HERE and review the plots before continuing!")
print("\nOpen the QC plots and check:")
print("  1. Correlation heatmap - Do replicates correlate well?")
print("  2. PCA plot - Do replicates cluster together?")
print("  3. Distributions - Are any samples very different?")
print("\nIf any samples look bad, note them and run the next cell.")





# Check missingness per sample
df = data['df']
intensity_cols = data['intensity_cols']

# First get all intensity columns
all_intensity = []
for cols in intensity_cols.values():
    all_intensity.extend(cols)

print("="*60)
print("MISSING VALUES PER SAMPLE")
print("="*60)

for condition, cols in intensity_cols.items():
    print(f"\n{condition}:")
    for col in cols:
        missing = df[col].isna().sum()
        present = df[col].notna().sum()
        total = len(df)
        pct_missing = (missing / total) * 100
        pct_present = (present / total) * 100
        
        # Extract sample name for readability
        sample_name = col.split(',')[-1].strip() if ',' in col else col[-10:]
        
        print(f"  {sample_name:15} {pct_missing:5.1f}% missing, {pct_present:5.1f}% present ({missing:4} missing, {present:4} present)")

print("\n" + "="*60)
print("MISSING VALUES BY CONDITION (AVERAGE)")
print("="*60)
for condition, cols in intensity_cols.items():
    total_values = len(df) * len(cols)
    missing = df[cols].isna().sum().sum()
    present = df[cols].notna().sum().sum()
    pct_missing = (missing / total_values) * 100
    pct_present = (present / total_values) * 100
    print(f"  {condition}: {pct_missing:.1f}% missing, {pct_present:.1f}% present")

overall_missing = df[all_intensity].isna().sum().sum()
overall_total = len(df) * len(all_intensity)
print(f"\nOverall: {(overall_missing/overall_total)*100:.1f}% missing, {((overall_total-overall_missing)/overall_total)*100:.1f}% present")

## 🛑 STOP AND REVIEW QC PLOTS

Before proceeding, open the QC plots in `results/figures/qc/` and review them.

### **Bad samples**

- d2d3 gel 3 -> clusters with ev (#13)
- d2d3 gel 2 -> outlier (#12)
- ev gel 4 -> clusters with wt and d2d3 (#5)
- wt gel 5 -> clusters with d2d3 (#10)


# Drop bad samples

In [None]:
# Cell 4: Drop Samples - INTERACTIVE MODE
# Run this if you want to interactively select samples to drop

data2 = drop_samples(data)

# This will:
# 1. Show all samples with numbers
# 2. Prompt you to enter numbers of samples to drop
# 3. Ask for confirmation
# 4. Remove those samples from the data

# Re-run QC and re-check missingness

In [None]:
qc_ip(data2, output_suffix='_rm_outliers')


# Check missingness per sample (after dropping samples)
df = data2['df']
intensity_cols = data2['intensity_cols']

# First get all intensity columns
all_intensity = []
for cols in intensity_cols.values():
    all_intensity.extend(cols)

print("="*60)
print("MISSING VALUES PER SAMPLE")
print("="*60)

for condition, cols in intensity_cols.items():
    print(f"\n{condition}:")
    for col in cols:
        missing = df[col].isna().sum()
        present = df[col].notna().sum()
        total = len(df)
        pct_missing = (missing / total) * 100
        pct_present = (present / total) * 100
        
        # Extract sample name for readability
        sample_name = col.split(',')[-1].strip() if ',' in col else col[-10:]
        
        print(f"  {sample_name:15} {pct_missing:5.1f}% missing, {pct_present:5.1f}% present ({missing:4} missing, {present:4} present)")

print("\n" + "="*60)
print("MISSING VALUES BY CONDITION (AVERAGE)")
print("="*60)
for condition, cols in intensity_cols.items():
    total_values = len(df) * len(cols)
    missing = df[cols].isna().sum().sum()
    present = df[cols].notna().sum().sum()
    pct_missing = (missing / total_values) * 100
    pct_present = (present / total_values) * 100
    print(f"  {condition}: {pct_missing:.1f}% missing, {pct_present:.1f}% present")

overall_missing = df[all_intensity].isna().sum().sum()
overall_total = len(df) * len(all_intensity)
print(f"\nOverall: {(overall_missing/overall_total)*100:.1f}% missing, {((overall_total-overall_missing)/overall_total)*100:.1f}% present")





# Cell 6: Save Cleaned Data
# Save the data (with or without dropped samples) for the next step
save_data(data2, '../results/data_after_qc.pkl')

print("\n✓ Data saved!")
print("\nNext step: 03_test_norm.ipynb for normalization")





# Cell 7: Final Summary
print("="*60)
print("QC SUMMARY")
print("="*60)

print("\nFinal dataset:")
print(f"  Proteins: {data2['metadata']['n_proteins']}")
print(f"  Samples: {data2['metadata']['n_samples']}")
print(f"  Conditions: {data2['metadata']['conditions']}")

print("\nReplicates per condition:")
for condition, n_reps in data2['metadata']['replicates_per_condition'].items():
    print(f"  {condition}: {n_reps}")

# .get() returns None when the key is absent, so one lookup covers both checks
dropped = data2['metadata'].get('samples_dropped')
if dropped:
    print(f"\nSamples dropped: {len(dropped)}")
    for sample in dropped:
        print(f"  - {sample}")
else:
    print("\nNo samples dropped - all samples passed QC")

print("\n" + "="*60)




## Notes

### When to Drop Samples:
- **Low correlation** with replicates (<0.7)
- **Outlier in PCA** - far from other replicates
- **Different distribution** - very different from siblings
- **Technical failure** - you know something went wrong during prep
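
The low-correlation criterion above can also be checked numerically rather than by eye. The following is a minimal sketch, not part of `ipms.analysis`: `flag_low_correlation` is a hypothetical helper, and it assumes the same layout used earlier in this notebook (a DataFrame plus a dict mapping each condition to its intensity columns). It uses the median correlation with the other replicates, which is an assumption made here so that one bad replicate does not drag down the scores of the good ones; `qc_ip()` may compute its heatmap differently.

```python
import numpy as np
import pandas as pd


def flag_low_correlation(df, intensity_cols, threshold=0.7):
    """Return {column: median correlation with its replicates} for columns below threshold.

    Hypothetical helper for illustration only.
    """
    flagged = {}
    for condition, cols in intensity_cols.items():
        if len(cols) < 2:
            continue  # nothing to compare against
        # Pairwise Pearson correlation; pandas ignores NaN pairs
        corr = df[cols].corr()
        for col in cols:
            # Median correlation with the other replicates (self excluded)
            med_r = corr.loc[col].drop(col).median()
            if med_r < threshold:
                flagged[col] = round(float(med_r), 3)
    return flagged


# Toy data with made-up column names: wt_4 is unrelated to the other three
rng = np.random.default_rng(0)
base = rng.normal(20, 2, size=200)
toy = pd.DataFrame({
    "wt_1": base + rng.normal(0, 0.3, size=200),
    "wt_2": base + rng.normal(0, 0.3, size=200),
    "wt_3": base + rng.normal(0, 0.3, size=200),
    "wt_4": rng.normal(20, 2, size=200),
})
flagged = flag_low_correlation(toy, {"wt": ["wt_1", "wt_2", "wt_3", "wt_4"]})
print(flagged)
```

On real data this would be called as `flag_low_correlation(data['df'], data['intensity_cols'])`. Treat the result as a prompt to look at the plots again, not as an automatic drop list.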

### When NOT to Drop:
- The sample differs slightly but still correlates well with its replicates (>0.8)
- The difference reflects biological variation (expected differences between conditions)
- Dropping would remove ALL replicates of a condition (keep at least 2!)
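
Before dropping anything, it can help to check how many replicates each condition would keep. This is a small sketch under the same assumption about the `intensity_cols` dict; `replicates_after_drop` and the column names are hypothetical, and the function only counts columns without modifying any data.

```python
def replicates_after_drop(intensity_cols, to_drop, min_reps=2):
    """Count replicates left per condition and list conditions below min_reps.

    Hypothetical helper for illustration only.
    """
    to_drop = set(to_drop)
    remaining = {
        condition: sum(col not in to_drop for col in cols)
        for condition, cols in intensity_cols.items()
    }
    too_few = [c for c, n in remaining.items() if n < min_reps]
    return remaining, too_few


# Toy layout with made-up column names
cols = {"ev": ["ev_1", "ev_2", "ev_3"], "wt": ["wt_1", "wt_2", "wt_3"]}
remaining, too_few = replicates_after_drop(cols, ["wt_1", "wt_3"])
print(remaining)
print(too_few)
```

If `too_few` is not empty, reconsider the drop list before running `drop_samples()`.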

### Best Practice:
- Keep at least 2-3 replicates per condition
- Document why you dropped samples
- Consider re-running QC after dropping to verify improvement

---

## Next Steps

Data saved to: `results/data_after_qc.pkl`

Continue to: **03_test_norm.ipynb** for normalization