# Week 12 — Capstone: End-to-End ML Pipeline

**Course:** Applied ML Foundations for SaaS Analytics  
**Week Focus:** Build a complete ML system, from data to deployment.

---

## 🎯 Learning Objectives

- Implement a full ML pipeline (ETL → Features → Training → Evaluation → Deployment)
- Add production-ready error handling and monitoring
- Version models so that results are reproducible
- Monitor model performance in production
- Plan for real-world issues (data drift, retraining)

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, precision_recall_curve, confusion_matrix
from datetime import datetime
import pickle
import warnings
warnings.filterwarnings('ignore')

print(f"Pipeline started at {datetime.now().isoformat()}")

# Load all data sources
subs = pd.read_csv('../data/subscriptions.csv', parse_dates=['signup_date', 'churn_date'])
feature_usage = pd.read_csv('../data/feature_usage.csv')
user_events = pd.read_csv('../data/user_events.csv')

print(f"✓ Data loaded: {len(subs)} subscriptions, {len(feature_usage)} feature usage events")
print(f"  {len(user_events)} user events")
print(f"  Churn rate: {(subs['churn_date'].notna().sum() / len(subs) * 100):.1f}%")

## Part 1: Feature Engineering Pipeline

**Key principle:** Combine several data sources (subscriptions, feature usage, user events) into one row of features per user.

**Challenges:**
- Handle missing values
- Normalize/scale features
- Feature selection
- Feature versioning

**💡 Depth Note:** How does feature engineering impact model performance? A/B test feature sets.
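
The "handle missing values" challenge is easier to reason about one column at a time. The sketch below is one possible approach, not the course's prescribed method: the helper name `fill_engagement_gaps`, the `never_active` flag, and the toy frame are all hypothetical, and the column names are borrowed from the feature list used later in this notebook.

```python
import numpy as np
import pandas as pd

def fill_engagement_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing engagement features column by column (illustrative helper)."""
    out = df.copy()
    # Counts: a user absent from the usage or event tables has zero recorded activity.
    for col in ["total_usage", "features_adopted", "total_events"]:
        out[col] = out[col].fillna(0)
    # Recency: 0 would read as "active today", so flag the gap and fill with the
    # largest observed value instead (an assumption, not the only valid choice).
    out["never_active"] = out["days_since_active"].isna().astype(int)
    out["days_since_active"] = out["days_since_active"].fillna(out["days_since_active"].max())
    return out

# Toy data standing in for the merged feature table.
toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    "total_usage": [12.0, np.nan, 4.0],
    "features_adopted": [3.0, np.nan, 1.0],
    "total_events": [40.0, np.nan, np.nan],
    "days_since_active": [2.0, np.nan, 30.0],
})
filled = fill_engagement_gaps(toy)
print(filled)
```

Keeping an explicit indicator column lets the model tell "never active" apart from "active a long time ago", which a single fill value would hide.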

In [None]:
# Feature engineering
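# Measure recency against the latest timestamp in the data, not datetime.now(), so reruns give the same feature.
feature_usage['last_used'] = pd.to_datetime(feature_usage['last_used'])
reference_date = feature_usage['last_used'].max()
engagement = feature_usage.groupby('user_id').agg(
    total_usage=('usage_count', 'sum'),
    features_adopted=('feature_name', 'nunique'),
    last_used=('last_used', 'max'),
).reset_index()
engagement['days_since_active'] = (reference_date - engagement['last_used']).dt.days
engagement = engagement.drop(columns='last_used')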

events = user_events.groupby('user_id').size().reset_index(name='total_events')

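df = subs[['user_id', 'signup_date', 'tenure_days', 'mrr', 'plan_tier', 'churn_date']].merge(engagement, on='user_id', how='left')
df = df.merge(events, on='user_id', how='left')
# Build the label before filling anything: a blanket fillna(0) also fills churn_date,
# which would make notna() true for every row and mark every customer as churned.
df['churned'] = df['churn_date'].notna().astype(int)
# Users with no usage or event rows get zero counts; tenure_days and mrr keep the original zero fill.
zero_fill_cols = ['tenure_days', 'mrr', 'total_usage', 'features_adopted', 'total_events']
df[zero_fill_cols] = df[zero_fill_cols].fillna(0)
# No recorded activity: fill recency with the largest observed gap rather than 0 ("active today").
df['days_since_active'] = df['days_since_active'].fillna(df['days_since_active'].max())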

# Time-based split (critical for production!)
df['signup_month'] = pd.to_datetime(df['signup_date']).dt.to_period('M')
latest_month = df['signup_month'].max()
train_cutoff = latest_month - 1  # Leave last month for testing

df_train = df[df['signup_month'] <= train_cutoff]
df_test = df[df['signup_month'] > train_cutoff]

print(f"\nTime-based split:")
print(f"  Train: {len(df_train)} customers (months up to {train_cutoff})")
print(f"  Test:  {len(df_test)} customers (month {latest_month})")
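# Caveat: engagement totals are aggregated over each user's full history, so this split limits leakage but does not rule it out.
print(f"  ✓ Test customers signed up after every training customer")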

features = ['tenure_days', 'mrr', 'total_usage', 'features_adopted', 'total_events', 'days_since_active']
X_train = df_train[features]
y_train = df_train['churned']
X_test = df_test[features]
y_test = df_test['churned']

print(f"\nFeatures: {features}")
print(f"  Train churn: {y_train.sum()} / {len(y_train)} ({y_train.mean()*100:.1f}%)")
print(f"  Test churn:  {y_test.sum()} / {len(y_test)} ({y_test.mean()*100:.1f}%)")

## Part 2: Model Training with Cross-Validation

**Why cross-validation?** Averaging over several train/validation splits gives a more robust performance estimate than a single split.

**Stratified K-Fold:** Maintains the class distribution in each fold.

**💡 Depth Note:** How does CV affect model selection? What if you don't have enough data?
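
If preprocessing is fitted on the whole training set before cross-validation, each validation fold has already influenced the scaling statistics. A minimal sketch of one way to avoid that is to put the scaler and the model in a scikit-learn `Pipeline`, so the scaler is refit inside every fold. The data here is synthetic (`make_classification`), and the names `X_demo`, `y_demo`, `pipe`, and `skf_demo` are illustrative, not part of the course data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn table, with roughly 20% positives.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.8, 0.2], random_state=42
)

# The scaler is part of the estimator, so cross_val_score refits it per fold.
pipe = make_pipeline(
    StandardScaler(),
    GradientBoostingClassifier(n_estimators=50, random_state=42),
)
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=skf_demo, scoring="roc_auc")

# Stratification keeps each validation fold close to the overall positive rate.
fold_rates = [y_demo[val_idx].mean() for _, val_idx in skf_demo.split(X_demo, y_demo)]
print(f"Overall positive rate: {y_demo.mean():.3f}")
print(f"Per-fold positive rates: {[round(r, 3) for r in fold_rates]}")
print(f"Mean CV AUC: {scores.mean():.4f}")
```

For tree ensembles the scaling itself changes little, but the same pattern applies to any fitted preprocessing step (imputation, encoding, feature selection).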

In [None]:
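# Scale features (StandardScaler is already imported in the first cell).
# Fit on the training set only, then reuse the fitted scaler on the test set.
# Gradient boosting is insensitive to feature scale; the step matters if you swap in a linear or distance-based model.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)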

# Cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)

# CV scores
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=skf, scoring='roc_auc')

print(f"Cross-validation results (5-fold):")
for i, score in enumerate(cv_scores):
    print(f"  Fold {i+1}: {score:.4f}")
print(f"\n  Mean: {cv_scores.mean():.4f}")
print(f"  Std:  {cv_scores.std():.4f}")
print(f"  Mean ± 1.96·std across folds (a spread, not a formal 95% CI): [{cv_scores.mean() - 1.96*cv_scores.std():.4f}, {cv_scores.mean() + 1.96*cv_scores.std():.4f}]")

# Train final model
model.fit(X_train_scaled, y_train)
print(f"\n✓ Model trained on {len(X_train_scaled)} training examples")

# Test performance
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)

print(f"\nTest set AUC: {auc:.4f}")
print(f"Generalization: {cv_scores.mean():.4f} (CV) → {auc:.4f} (Test)")

if abs(cv_scores.mean() - auc) > 0.05:
    print(f"⚠️  Warning: CV-to-test AUC gap above 0.05 suggests overfitting or a data distribution shift")
else:
    print(f"✓ Good generalization - CV and test performance align")

# Precision and recall at a few candidate decision thresholds
for thresh in [0.3, 0.5, 0.7]:
    y_pred = (y_pred_proba >= thresh).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even if one class is missing from the predictions or the test month
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    print(f"\nThreshold {thresh}: Precision={precision:.2%}, Recall={recall:.2%}")

## Part 3: Production Deployment Checklist

**Before shipping:**
- Model versioning ✓
- Prediction latency < 100ms ✓
- Handle missing values ✓
- Input validation ✓
- Error monitoring ✓
- Feature drift detection ✓
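
As a rough illustration of the input-validation item, the sketch below checks a single scoring request before it reaches the model. The function name `validate_payload`, the dict-based payload, and the "non-negative" rule are assumptions made for this example; the feature names match the `features` list used in training.

```python
import math

# Feature names the model was trained on, in the same order as `features`.
EXPECTED_FEATURES = [
    "tenure_days", "mrr", "total_usage",
    "features_adopted", "total_events", "days_since_active",
]

def validate_payload(payload: dict) -> list:
    """Return a list of problems; an empty list means the payload is usable."""
    problems = []
    for name in EXPECTED_FEATURES:
        if name not in payload:
            problems.append(f"missing feature: {name}")
            continue
        value = payload[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name} is not numeric")
        elif math.isnan(value) or math.isinf(value):
            problems.append(f"{name} is not finite")
        elif value < 0:
            # Assumed business rule: none of these features can be negative.
            problems.append(f"{name} is negative")
    return problems

good = {"tenure_days": 120, "mrr": 49.0, "total_usage": 300,
        "features_adopted": 4, "total_events": 85, "days_since_active": 3}
bad = {"tenure_days": -5, "mrr": float("nan"), "total_usage": "many",
       "features_adopted": 4, "total_events": 85}

print(validate_payload(good))
print(validate_payload(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to log every issue with a rejected request.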

**💡 Depth Note:** How do you monitor model performance in production? What triggers retraining?
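
One common way to approach the monitoring question is to compare each feature's production distribution against its training distribution with the Population Stability Index (PSI). The sketch below is a minimal version under stated assumptions: the function name, the quantile binning, and the simulated data are illustrative, and the 0.1 / 0.25 cut-offs are widely used rules of thumb rather than values taken from this course.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference sample (training) and a new sample (production)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges are quantiles of the reference sample; duplicates are dropped.
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)))
    inner = edges[1:-1]  # values beyond the outer edges fall into the end bins
    k = len(inner) + 1
    exp_frac = np.bincount(np.searchsorted(inner, expected, side="right"), minlength=k) / len(expected)
    act_frac = np.bincount(np.searchsorted(inner, actual, side="right"), minlength=k) / len(actual)
    # Clip to avoid log(0) when a bin is empty in one of the samples.
    exp_frac = np.clip(exp_frac, 1e-6, None)
    act_frac = np.clip(act_frac, 1e-6, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))

# Simulated feature values: one stable week and one shifted week.
rng = np.random.default_rng(0)
train_values = rng.normal(50, 10, 5000)
stable_week = rng.normal(50, 10, 5000)
shifted_week = rng.normal(60, 10, 5000)

psi_stable = population_stability_index(train_values, stable_week)
psi_shifted = population_stability_index(train_values, shifted_week)
print(f"PSI (stable):  {psi_stable:.4f}")
print(f"PSI (shifted): {psi_shifted:.4f}")
```

A simple retraining trigger could alert when any feature's PSI exceeds about 0.25 and flag it for review above about 0.1, ideally alongside a check on live AUC once churn labels arrive.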

In [None]:
# Save model and scaler for production
version = datetime.now().strftime('%Y%m%d_%H%M%S')
model_path = f'churn_model_{version}.pkl'
scaler_path = f'scaler_{version}.pkl'

with open(model_path, 'wb') as f:
    pickle.dump(model, f)
with open(scaler_path, 'wb') as f:
    pickle.dump(scaler, f)

print(f"Model saved: {model_path}")
print(f"Scaler saved: {scaler_path}")

print(f"\n" + "="*60)
print(f"PRODUCTION DEPLOYMENT SUMMARY")
print(f"="*60)
print(f"\nModel Metrics:")
print(f"  AUC-ROC: {auc:.4f}")
print(f"  Training examples: {len(X_train):,}")
print(f"  Features: {len(features)}")
print(f"  Model type: {type(model).__name__}")
print(f"\nDeployment Info:")
print(f"  Version: {version}")
print(f"  Timestamp: {datetime.now().isoformat()}")
print(f"  Artifacts saved: ✓ (complete the monitoring checklist before going live)")
print(f"\nMonitoring Checklist:")
print(f"  [ ] Set up performance monitoring")
print(f"  [ ] Track feature distributions")
print(f"  [ ] Alert on prediction drift")
print(f"  [ ] Schedule weekly retraining")
print(f"  [ ] Document feature definitions")
print(f"  [ ] Test failure scenarios")
print(f"  [ ] Set SLA targets (latency, availability)")
print(f"\n🚀 Pipeline complete. Ready for deployment!")

## Key Takeaways

✅ End-to-end ML pipeline architecture  
✅ Time-based validation prevents data leakage  
✅ Cross-validation for robust performance estimates  
✅ Production readiness checklist  
✅ Model versioning and monitoring  
✅ Handling real-world challenges (drift, retraining)  

## 🎓 Congratulations!

You've completed the Applied ML Foundations course. You now understand:
- Data preparation and EDA
- Supervised learning (regression, classification)
- Unsupervised learning (clustering, dimensionality reduction)
- Ensemble methods
- Deep learning fundamentals
- Production ML systems

**Next steps:**
1. Apply to your own datasets
2. Dive deeper into specialized areas
3. Learn MLOps (deployment, monitoring)
4. Explore advanced techniques (NLP, Computer Vision)

**Resources:**
- Scikit-learn documentation
- TensorFlow/Keras tutorials
- Fast.ai practical deep learning
- Real-world ML (testing, debugging, infrastructure)