# Quick Demo: Biomedical Active Learning

A streamlined demonstration of the key results from our biomedical active learning research.

## 🎯 Main Finding
**Active learning reaches performance comparable to models trained on the full dataset while using far fewer labeled training samples (roughly 80% fewer in these experiments).**

## 📊 Key Results
- **BBB Dataset**: RF Active Learning: MCC 0.620 vs Full Model: MCC 0.655 (about 95% of full-model performance with a fraction of the data)
- **Breast Cancer**: QBC Active Learning: **MCC 0.942 vs Full Model: MCC 0.925** (active learning outperformed the full model)
- **Efficiency**: Active learning reaches peak performance within 5-10 iterations
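
For readers unfamiliar with the metric, the sketch below shows how MCC is computed from binary confusion-matrix counts and how the relative-performance figures follow from the MCC values listed above. The helper `mcc` and the dictionary names are illustrative and are not taken from the project code.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is taken to be 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# MCC values reported above: (active learning, full model)
reported = {"BBB": (0.620, 0.655), "Breast Cancer": (0.942, 0.925)}
relative = {name: al / full for name, (al, full) in reported.items()}

for name, ratio in relative.items():
    print(f"{name}: active learning reaches {ratio:.1%} of full-model MCC")
```

This prints 94.7% for BBB and 101.8% for Breast Cancer. MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction).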

## 🔬 Datasets
1. **Blood-Brain Barrier Penetration**: 1,976 molecular samples (SMILES → features)
2. **Breast Cancer Wisconsin**: 569 clinical samples (30 features)
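
The second dataset ships with scikit-learn, so its dimensions can be checked directly. This is a minimal sketch that assumes the scikit-learn copy is the one used here; the notebook does not say how the data was loaded. The BBB featurization is not sketched because the feature set is not specified.

```python
from sklearn.datasets import load_breast_cancer

# Feature matrix and binary labels (0 = malignant, 1 = benign).
X, y = load_breast_cancer(return_X_y=True)
print(X.shape)  # (569, 30)
```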

## 🧠 Methods
- **Random Forest** with uncertainty sampling
- **Query-by-Committee** with vote entropy
- Multiple runs with statistical analysis
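
The two query strategies can be sketched as scoring functions plus a pool-based loop. This is an illustration under stated assumptions, not the project's implementation: least confidence is assumed as the uncertainty measure, and the initial set size, batch size, iteration count, and forest size are hypothetical. The loop uses scikit-learn's copy of the Breast Cancer Wisconsin data, so its scores will not match the reported results.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split


def least_confidence(proba: np.ndarray) -> np.ndarray:
    """Uncertainty score per sample: 1 minus the highest class probability."""
    return 1.0 - proba.max(axis=1)


def vote_entropy(votes: np.ndarray, n_classes: int) -> np.ndarray:
    """Vote entropy per sample for hard votes of shape (n_members, n_samples)."""
    n_members = votes.shape[0]
    scores = np.zeros(votes.shape[1])
    for c in range(n_classes):
        frac = (votes == c).sum(axis=0) / n_members
        # Treat 0 * log(0) as 0 by taking the log only where frac > 0.
        log_frac = np.log(frac, out=np.zeros_like(frac), where=frac > 0)
        scores -= frac * log_frac
    return scores


def run_uncertainty_sampling(n_initial=20, batch_size=10, n_iterations=10, seed=0):
    """Pool-based loop; returns (n_labeled, test MCC) for each iteration."""
    X, y = load_breast_cancer(return_X_y=True)
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    rng = np.random.default_rng(seed)
    labeled = rng.choice(len(X_pool), size=n_initial, replace=False).tolist()
    history = []
    for _ in range(n_iterations):
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        model.fit(X_pool[labeled], y_pool[labeled])
        score = matthews_corrcoef(y_test, model.predict(X_test))
        history.append((len(labeled), score))
        # Score only the samples that have not been labeled yet.
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        uncertainty = least_confidence(model.predict_proba(X_pool[unlabeled]))
        # Query the batch with the highest uncertainty and "label" it.
        query = unlabeled[np.argsort(uncertainty)[-batch_size:]]
        labeled.extend(query.tolist())
    return history


for n_labeled, score in run_uncertainty_sampling():
    print(f"{n_labeled:4d} labeled samples: MCC = {score:.3f}")
```

For Query-by-Committee, `vote_entropy` would replace `least_confidence` in the loop, with `votes` collected by calling `predict` on each committee member. It is zero when all members agree and largest when the votes are split evenly.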

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Visualize key results
datasets = ['BBB', 'Breast Cancer']
full_model_mcc = [0.655, 0.925]
al_model_mcc = [0.620, 0.942]
al_model_std = [0.030, 0.006]

x = np.arange(len(datasets))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 6))
bars1 = ax.bar(x - width/2, full_model_mcc, width, label='Full Model', color='skyblue', edgecolor='black')
bars2 = ax.bar(x + width/2, al_model_mcc, width, yerr=al_model_std, 
               label='Active Learning', color='lightcoral', edgecolor='black', capsize=5)

ax.set_xlabel('Dataset')
ax.set_ylabel('Matthews Correlation Coefficient (MCC)')
ax.set_title('Active Learning vs Full Model Performance')
ax.set_xticks(x)
ax.set_xticklabels(datasets)
ax.legend()
ax.grid(True, alpha=0.3)

# Add value labels above each bar (above the error-bar cap when errors are given)
def autolabel(rects, values, errors=None):
    for i, rect in enumerate(rects):
        label = f'{values[i]:.3f}'
        y = rect.get_height()
        if errors is not None:
            label += f'±{errors[i]:.3f}'
            y += errors[i]  # start the label at the top of the error bar
        ax.annotate(label,
                    xy=(rect.get_x() + rect.get_width() / 2, y),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha='center', va='bottom', fontsize=10)

autolabel(bars1, full_model_mcc)
autolabel(bars2, al_model_mcc, al_model_std)

plt.tight_layout()
plt.show()

# Relative performance is computed from the values plotted above
bbb_ratio = al_model_mcc[0] / full_model_mcc[0]

print("🎉 Key Insights:")
print(f"1. BBB Dataset: Active learning achieved {bbb_ratio:.0%} of full model performance")
print("2. Breast Cancer: Active learning outperformed the full model")
print("3. Significant data efficiency: ~80% reduction in required training samples")
print("4. Robust results with low variance across multiple runs")

## 💡 Impact

### For Biomedical Research:
- **Reduced annotation costs**: Less expert labeling required
- **Faster model development**: Quicker iteration cycles
- **Better resource utilization**: Focus on most informative samples

### For Drug Discovery:
- **BBB permeability prediction**: Efficient screening of molecular libraries
- **Clinical decision support**: Optimized diagnostic model training
- **Rare disease applications**: Effective learning from limited data

## 🚀 Next Steps
1. **Scale to larger datasets**: Test on genomic and proteomic data
2. **Multi-modal active learning**: Combine molecular and clinical features
3. **Real-time deployment**: Implement in clinical decision support systems
4. **Domain adaptation**: Transfer learning across biomedical domains

---

**For complete experiments and analysis, see the other notebooks in this collection.**