# Week 7 - Supervised Learning: Regression

**Course:** Applied ML Foundations for SaaS Analytics  
**Week Focus:** Predict continuous values (customer lifetime value, revenue, usage trends) with regression models.

---

## 🎯 Learning Objectives

By the end of this week, you will:
- Build and evaluate regression models (Linear, Ridge, Random Forest Regressor)
- Interpret regression coefficients and feature effects
- Handle heteroscedasticity and non-linear relationships
- Predict customer lifetime value (CLV) and monthly recurring revenue (MRR)
- Optimize for business metrics (MAE, RMSE, R² in business context)
- Understand when to use regression vs classification

## 📊 Real-World Context

At a SaaS company like CloudWave, you need to predict:
- **Customer Lifetime Value (CLV)**: How much revenue will this customer generate?
- **Monthly Recurring Revenue (MRR)**: What's our projected monthly revenue?
- **Usage Trends**: How many API calls will customers make next month?
- **Support Costs**: How much will customer support cost per account?

**Business Impact:**
- 💰 Better resource allocation based on predicted revenue
- 🎯 Target high-value customers for upselling
- 📊 Accurate financial forecasting for investors
- ⚠️ Identify customers with declining usage before they churn

In [None]:
from IPython.display import HTML
HTML('''
<style>
details {
  margin: 10px 0;
  padding: 8px 12px;
  border: 1px solid #d9e2ec;
  border-radius: 8px;
  background: #f9fbfd;
}
details summary {
  font-weight: 600;
  color: #0056b3;
  cursor: pointer;
}
details[open] {
  background: #f1f7ff;
  border-color: #c3d4f0;
}
details pre {
  background: #f8f9fa;
  padding: 8px;
  border-radius: 6px;
}
</style>
''')

## 🏢 Scenario: Predicting Customer Lifetime Value

Your CFO asks: **"How much revenue will each customer generate over their lifetime?"**

**Why CLV Matters:**
- Determine how much to spend on customer acquisition
- Prioritize high-value customers for retention efforts
- Set pricing and discount strategies
- Make data-driven investment decisions

**The Challenge:**
- Classification predicts categories (will churn: yes/no)
- **Regression predicts numbers** (will generate: $X revenue)
- Need to estimate a continuous value, not a discrete label

<details>
<summary>üí° Hint ‚Äî Regression vs Classification</summary>

**When to use Classification:**
- Binary outcome: Will churn? (Yes/No)
- Categories: Plan tier? (Free/Pro/Enterprise)
- Discrete labels: Support priority? (Low/Medium/High)

**When to use Regression:**
- Continuous values: Customer Lifetime Value? ($0 - $100,000)
- Counts: API calls next month? (0 - 1,000,000)
- Percentages: Engagement score? (0% - 100%)
- Time: Days until churn? (0 - 365)

**Key Difference:**
- Classification: Predict a category
- Regression: Predict a number

</details>

## 📚 Part 1: Building Regression Features

For predicting CLV, we need features that capture:
1. **Engagement**: How actively they use the product
2. **Tenure**: How long they've been a customer
3. **Plan**: Current pricing tier
4. **Growth**: Trend in usage over time

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime

# Load subscription data
subs = pd.read_csv('../data/subscriptions.csv', parse_dates=['signup_date', 'churn_date'])

print("=" * 70)
print("SUBSCRIPTION DATA OVERVIEW")
print("=" * 70)
print(f"Total customers: {len(subs):,}")
print(f"Active customers: {subs['churn_date'].isna().sum():,}")
print(f"Churned customers: {subs['churn_date'].notna().sum():,}")
print(f"\nPlan distribution:")
print(subs['plan_tier'].value_counts())
print(f"\nMRR statistics:")
print(subs['mrr'].describe())

In [None]:
# Feature Engineering for CLV Prediction

# 1. Calculate customer lifetime in days
today = pd.Timestamp.now()
subs['end_date'] = subs['churn_date'].fillna(today)
subs['lifetime_days'] = (subs['end_date'] - subs['signup_date']).dt.days

# 2. Calculate total revenue (CLV for churned customers, current for active)
subs['lifetime_months'] = subs['lifetime_days'] / 30.0
subs['lifetime_value'] = subs['mrr'] * subs['lifetime_months']

# 3. Load engagement data
feature_usage = pd.read_csv('../data/feature_usage.csv')
user_events = pd.read_csv('../data/user_events.csv')

# Aggregate engagement metrics
engagement = feature_usage.groupby('user_id').agg({
    'usage_count': 'sum',
    'feature_name': 'nunique'
}).rename(columns={
    'usage_count': 'total_usage',
    'feature_name': 'features_used'
}).reset_index()

# Event frequency
events = user_events.groupby('user_id').size().reset_index(name='total_events')

# 4. Merge features
df = subs.merge(engagement, on='user_id', how='left')
df = df.merge(events, on='user_id', how='left')

# Fill missing values
df['total_usage'] = df['total_usage'].fillna(0)
df['features_used'] = df['features_used'].fillna(0)
df['total_events'] = df['total_events'].fillna(0)

# 5. Create plan tier dummies
df = pd.get_dummies(df, columns=['plan_tier'], drop_first=True)

print("\n" + "=" * 70)
print("FEATURE ENGINEERING COMPLETE")
print("=" * 70)
print(f"Final dataset: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"\nFeatures for modeling:")
feature_cols = ['lifetime_days', 'mrr', 'total_usage', 'features_used', 'total_events']
feature_cols += [col for col in df.columns if col.startswith('plan_tier_')]
print(f"  {feature_cols}")
print(f"\nTarget variable: lifetime_value")
print(f"  Mean CLV: ${df['lifetime_value'].mean():.2f}")
print(f"  Median CLV: ${df['lifetime_value'].median():.2f}")
print(f"  Max CLV: ${df['lifetime_value'].max():.2f}")

## 📊 Part 2: Training Regression Models

We'll train three types of regression models:

1. **Linear Regression**: Simple, interpretable baseline
2. **Ridge Regression**: Linear with regularization (handles multicollinearity)
3. **Random Forest Regressor**: Non-linear, captures complex relationships
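
To make the multicollinearity point concrete, here is a minimal sketch on synthetic data (not the CloudWave dataset). The two features, the noise levels, and the `_demo` names are assumptions chosen for illustration: two almost identical columns are fed to both a plain linear model and Ridge, and the coefficients are compared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two nearly identical synthetic features (hypothetical stand-ins for
# metrics that move together, such as total usage and total events).
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-3, size=n)
X_demo = np.column_stack([x1, x2])

# The target depends only on the shared signal, plus noise.
y_demo = x1 + rng.normal(scale=1.0, size=n)

ols_demo = LinearRegression().fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

# OLS cannot tell the two columns apart, so its coefficients can swing to
# large offsetting values. Ridge's L2 penalty pulls them toward an even split.
print("OLS coefficients:  ", ols_demo.coef_)
print("Ridge coefficients:", ridge_demo.coef_)
```

Both models usually predict about equally well here; the difference is in how stable and readable the coefficients are, which matters when you want to interpret them.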

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

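# Prepare features and target
# Note: lifetime_value is computed as mrr * lifetime_months, so a model that sees
# both mrr and lifetime_days can nearly reconstruct the target; scores will look optimistic.
feature_cols = ['lifetime_days', 'mrr', 'total_usage', 'features_used', 'total_events']
feature_cols += [col for col in df.columns if col.startswith('plan_tier_')]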

X = df[feature_cols]
y = df['lifetime_value']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("=" * 70)
print("TRAINING REGRESSION MODELS")
print("=" * 70)
print(f"Training set: {len(X_train):,} customers")
print(f"Test set: {len(X_test):,} customers")
print(f"Features: {len(feature_cols)}")

In [None]:
# Model 1: Linear Regression
print("\n" + "=" * 70)
print("1. LINEAR REGRESSION")
print("=" * 70)

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

mae_lr = mean_absolute_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
r2_lr = r2_score(y_test, y_pred_lr)

print(f"MAE:  ${mae_lr:,.2f}")
print(f"RMSE: ${rmse_lr:,.2f}")
print(f"R²:   {r2_lr:.4f}")

print(f"\nTop 3 Feature Coefficients:")
coef_df = pd.DataFrame({
    'feature': feature_cols,
    'coefficient': lr.coef_
}).sort_values('coefficient', ascending=False, key=abs)

for idx, row in coef_df.head(3).iterrows():
    print(f"  {row['feature']:.<30} ${row['coefficient']:>10,.2f}")

In [None]:
# Model 2: Ridge Regression (with regularization)
print("\n" + "=" * 70)
print("2. RIDGE REGRESSION (L2 Regularization)")
print("=" * 70)

ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)

mae_ridge = mean_absolute_error(y_test, y_pred_ridge)
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
r2_ridge = r2_score(y_test, y_pred_ridge)

print(f"MAE:  ${mae_ridge:,.2f}")
print(f"RMSE: ${rmse_ridge:,.2f}")
print(f"R²:   {r2_ridge:.4f}")
print(f"\n💡 Ridge helps when features are correlated (multicollinearity)")

In [None]:
# Model 3: Random Forest Regressor (non-linear)
print("\n" + "=" * 70)
print("3. RANDOM FOREST REGRESSOR")
print("=" * 70)

rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

print(f"MAE:  ${mae_rf:,.2f}")
print(f"RMSE: ${rmse_rf:,.2f}")
print(f"R²:   {r2_rf:.4f}")

print(f"\nFeature Importance (Top 5):")
importance_df = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

for idx, row in importance_df.head(5).iterrows():
    print(f"  {row['feature']:.<30} {row['importance']:>6.2%}")

In [None]:
# Model Comparison
print("\n" + "=" * 70)
print("MODEL COMPARISON")
print("=" * 70)

comparison = pd.DataFrame({
    'Model': ['Linear Regression', 'Ridge Regression', 'Random Forest'],
    'MAE ($)': [mae_lr, mae_ridge, mae_rf],
    'RMSE ($)': [rmse_lr, rmse_ridge, rmse_rf],
    'R²': [r2_lr, r2_ridge, r2_rf]
})

print(comparison.to_string(index=False))
print(f"\n✅ Best model: {comparison.loc[comparison['R²'].idxmax(), 'Model']}")
print(f"   (Highest R² = {comparison['R²'].max():.4f})")

## 🎯 Part 3: Understanding Regression Metrics

### Key Metrics Explained

**1. MAE (Mean Absolute Error)**
- Average absolute prediction error
- Example: MAE = $500 means "on average, we're off by $500"
- ✅ Easy to interpret in business terms
- ✅ Less sensitive to outliers than RMSE

**2. RMSE (Root Mean Squared Error)**
- Square root of average squared error
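- ❌ Penalizes large errors more heavily, so a few big misses can dominate the score
- Example: RMSE = $800 alongside MAE = $500 suggests some predictions are far off (RMSE is never smaller than MAE, and the gap widens as errors become more uneven)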

**3. R² (R-squared / Coefficient of Determination)**
- Proportion of variance explained by the model
- Range: typically 0 to 1 (higher is better); it can go negative on test data when the model does worse than always predicting the mean
- R² = 0.75 means "model explains 75% of the variance in CLV"
- Remaining 25% is unexplained (randomness, missing features)
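
The three metrics can be checked by hand on a tiny example. The sketch below uses five made-up CLV values (not course data) with one large miss, computes each metric directly from its definition, and compares against scikit-learn. It also shows that R² drops below zero when predictions are worse than the mean.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted CLV values; the last prediction is a big miss.
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0, 10000.0])
y_hat = np.array([1200.0, 1800.0, 3100.0, 3900.0, 6000.0])

errors = y_true - y_hat  # [-200, 200, -100, 100, 4000]

mae = np.mean(np.abs(errors))         # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))  # squaring lets the 4000 miss dominate
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot              # share of variance explained

print(f"MAE:  ${mae:,.2f}")   # MAE:  $920.00
print(f"RMSE: ${rmse:,.2f}")  # RMSE: $1,794.44
print(f"R²:   {r2:.3f}")      # R²:   0.678

# Same values from scikit-learn
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_hat)))
assert np.isclose(r2, r2_score(y_true, y_hat))

# Predicting $0 for everyone is worse than predicting the mean, so R² < 0.
r2_zero = r2_score(y_true, np.zeros_like(y_true))
print(f"R² for all-zero predictions: {r2_zero:.1f}")  # -1.6
```

Four of the five errors are $200 or less, yet the single $4,000 miss pushes RMSE to nearly twice the MAE.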

### Business Interpretation

**Scenario:** Predicting CLV for customers with actual CLV ranging from $0 - $50,000

- **MAE = $2,000**: On average, predictions are within $2,000 of actual
- **R² = 0.80**: Model captures 80% of CLV variation
- **Business Value**: Good enough for segmentation (high/medium/low value customers)
- **Not Good Enough For**: Exact revenue forecasting (need MAE < $500)

<details>
<summary>üí° Hint ‚Äî When is R¬≤ Too Low?</summary>

**R¬≤ Interpretation Guidelines:**

- **R¬≤ > 0.90**: Excellent (rare in business data)
- **R¬≤ = 0.70-0.90**: Good (explains most variance)
- **R¬≤ = 0.50-0.70**: Moderate (useful for segmentation)
- **R¬≤ = 0.30-0.50**: Weak (better than random guessing)
- **R¬≤ < 0.30**: Poor (model adds little value)

**Context Matters:**
- Predicting physics: Expect R¬≤ > 0.95
- Predicting customer behavior: R¬≤ = 0.60 is often very good!
- Human decisions are inherently noisy

**When R¬≤ is Low:**
1. Missing important features
2. Non-linear relationship (try Random Forest)
3. High natural variance in target variable
4. Insufficient data

</details>

## 💼 Part 4: Business Application

### Use Case: Customer Segmentation by Predicted CLV

In [None]:
# Use the Random Forest (or whichever model scored best above) to predict CLV for all customers
df['predicted_clv'] = rf.predict(X)

# Segment customers by predicted CLV
# Open-ended outer bins keep predictions <= 0 or > 100,000 from becoming NaN
df['clv_segment'] = pd.cut(
    df['predicted_clv'],
    bins=[-np.inf, 5000, 15000, np.inf],
    labels=['Low Value', 'Medium Value', 'High Value']
)

print("=" * 70)
print("CUSTOMER SEGMENTATION BY PREDICTED CLV")
print("=" * 70)

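# observed=False keeps empty segments in the summary so the .loc lookups below still work
segment_summary = df.groupby('clv_segment', observed=False).agg({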
    'user_id': 'count',
    'predicted_clv': 'mean',
    'lifetime_value': 'mean',
    'mrr': 'mean',
    'features_used': 'mean'
}).round(2)

segment_summary.columns = ['Count', 'Predicted CLV', 'Actual CLV', 'Avg MRR', 'Avg Features']
print(segment_summary)

print(f"\n💡 Business Actions:")
print(f"   High Value ({segment_summary.loc['High Value', 'Count']:.0f} customers):")
print(f"      → Assign dedicated account managers")
print(f"      → Offer premium support and custom features")
print(f"   Medium Value ({segment_summary.loc['Medium Value', 'Count']:.0f} customers):")
print(f"      → Upsell campaigns to move to high value")
print(f"      → Feature adoption programs")
print(f"   Low Value ({segment_summary.loc['Low Value', 'Count']:.0f} customers):")
print(f"      → Automated onboarding and self-service support")
print(f"      → Monitor for churn signals")

## 🤔 Reflection & Application

**Question 1:** Why might Random Forest outperform Linear Regression for CLV prediction?

<details>
<summary>Click for answer</summary>

**Non-linear Relationships:**
- CLV doesn't increase linearly with features
- Example: First 5 features used → huge CLV boost. Next 5 → diminishing returns
- Linear models assume straight-line relationships
- Random Forest captures curves, thresholds, interactions

**Feature Interactions:**
- High MRR + high usage → very high CLV (multiplicative effect)
- Linear models: `CLV = a×MRR + b×usage` (additive only)
- Random Forest: Learns `CLV = f(MRR, usage)` where f can be any shape

</details>
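
The interaction argument can be tried on synthetic data. In this sketch the target is purely the product of two made-up features (hypothetical stand-ins for MRR and usage, not the CloudWave columns), and both model types are scored on a held-out split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000

# Synthetic features on an arbitrary 0-10 scale
mrr_demo = rng.uniform(0, 10, size=n)
usage_demo = rng.uniform(0, 10, size=n)
X_int = np.column_stack([mrr_demo, usage_demo])

# Purely multiplicative target: value grows with MRR *and* usage together
y_int = mrr_demo * usage_demo

X_tr, X_te, y_tr, y_te = train_test_split(X_int, y_int, test_size=0.25, random_state=42)

lin_int = LinearRegression().fit(X_tr, y_tr)
forest_int = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

r2_lin = r2_score(y_te, lin_int.predict(X_te))
r2_forest = r2_score(y_te, forest_int.predict(X_te))

print(f"Linear Regression R²: {r2_lin:.3f}")
print(f"Random Forest R²:     {r2_forest:.3f}")
```

The linear model still does respectably because each feature has a strong main effect, but it cannot represent the product term. Adding an explicit `mrr * usage` column would be one way to close the gap while keeping a linear model.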

**Question 2:** When should you use Linear Regression instead of Random Forest?

<details>
<summary>Click for answer</summary>

**Use Linear Regression When:**
1. **Interpretability is critical**: Need to explain "$1 increase in MRR → $30 increase in CLV"
2. **Small datasets**: < 1000 samples (Random Forest needs more data)
3. **Extrapolation needed**: Predicting values outside training range
4. **Regulatory requirements**: Finance/healthcare often require interpretable models
5. **Baseline**: Always start simple, add complexity only if needed

**Random Forest Advantages:**
- Better predictions (usually)
- Handles non-linearity
- Less feature engineering needed
- Built-in feature importance

</details>
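
The extrapolation point is easy to demonstrate. The sketch below trains both models on a noiseless straight line over the range 0 to 10 (synthetic values, chosen only for illustration) and then asks for a prediction at 20, well outside the training range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Training data: y = 3x for x between 0 and 10, so y never exceeds 30
X_line = np.linspace(0, 10, 21).reshape(-1, 1)
y_line = 3.0 * X_line.ravel()

lin_line = LinearRegression().fit(X_line, y_line)
forest_line = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_line, y_line)

# Ask both models about a point far outside the training range
X_new = np.array([[20.0]])
lin_pred = lin_line.predict(X_new)[0]
forest_pred = forest_line.predict(X_new)[0]

print(f"Linear Regression at x=20: {lin_pred:.1f}")   # continues the line to 60.0
print(f"Random Forest at x=20:     {forest_pred:.1f}")  # cannot exceed the training max of 30
```

A tree predicts the average of training targets in a leaf, so a forest can never predict above the largest value it saw. For CLV this matters when new customers are larger than anyone in the training data.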

**Question 3:** How do you know if your model is good enough for production?

<details>
<summary>Click for answer</summary>

**Compare to Baselines:**
1. **Naive baseline**: Predict the mean CLV for everyone
   - If MAE_model ≈ MAE_baseline → model adds no value!
2. **Business baseline**: Current method (manual estimates, rules)
   - Model should improve on existing process

**Business Value Check:**
- What's the cost of a $2,000 prediction error?
- If low impact → R² = 0.50 is fine
- If high stakes (financial decisions) → need R² > 0.80

**Test in Production:**
- Deploy to 10% of customers
- Compare predicted vs actual CLV after 3 months
- If predictions hold up → scale to 100%

</details>
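
The naive-baseline check can be sketched with scikit-learn's `DummyRegressor`, which always predicts the training mean. The data below is synthetic (a made-up tenure feature and a made-up CLV relationship); with the notebook's own `X_train` and `y_train`, the same pattern should apply.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 500

# Synthetic data: CLV grows with tenure (in months), plus noise
tenure_demo = rng.uniform(1, 36, size=n)
clv_demo = 100 * tenure_demo + rng.normal(scale=300, size=n)
X_base = tenure_demo.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X_base, clv_demo, test_size=0.2, random_state=7)

# Naive baseline: predict the mean training CLV for every customer
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
mae_model = mean_absolute_error(y_te, model.predict(X_te))
improvement = 1 - mae_model / mae_baseline

print(f"Baseline MAE: ${mae_baseline:,.2f}")
print(f"Model MAE:    ${mae_model:,.2f}")
print(f"Improvement over baseline: {improvement:.1%}")
```

Reporting the improvement over the baseline alongside the raw MAE makes it clearer to stakeholders how much the model actually contributes.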

## ✍️ Hands-On Exercises

### Exercise 1: Feature Engineering for Better Predictions

Current features might be too simple. Create these advanced features:

1. **Engagement Velocity**: Change in usage over last 30 days vs previous 30 days
2. **Feature Diversity Score**: (features_used / total_available_features)
3. **MRR per Event**: (mrr / total_events), an efficiency metric (guard against customers with zero events)
4. **Cohort Age**: Months since account creation (tenure in months)

Re-train your Random Forest and see if R² improves.

In [None]:
# Your solution here!

# TODO: Create new features
# df['engagement_velocity'] = ...
# df['feature_diversity'] = ...
# df['mrr_per_event'] = ...
# df['cohort_age_months'] = ...

# TODO: Retrain model with new features
# TODO: Compare R² before and after

### Exercise 2: Residual Analysis - Finding Model Weaknesses

Analyze prediction errors to find patterns:

1. Calculate residuals: `residuals = y_test - y_pred`
2. Plot residuals vs predicted values (should be random scatter)
3. Find customers with largest errors (top 10 over-predictions and under-predictions)
4. Analyze: What do these customers have in common?

**Goal**: Identify systematic errors to improve the model.

In [None]:
# Your solution here!

# TODO: Calculate residuals
# TODO: Plot residuals (bonus: use matplotlib/seaborn)
# TODO: Find top 10 worst predictions
# TODO: Investigate common patterns

### Exercise 3: Time-Based Validation

Our train/test split was random. For time-dependent data, we should train on the past and test on the future.

1. Split data by time: Train on customers who signed up before 2024-06-01
2. Test on customers who signed up on or after 2024-06-01
3. Compare R² to the random split. Is it lower? (It usually is!)
4. Why? Discuss the difference between in-sample and out-of-sample performance, and why a random hold-out can overstate how well the model will do on future customers.
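
As a starting point, the mechanics of a time-based split can be sketched on a toy frame. The six rows below are invented; only the `signup_date` column name and the 2024-06-01 cutoff come from this notebook. Adapting it to `df` and retraining the models is left as the exercise.

```python
import pandas as pd

# Toy frame with invented rows; signup_date mirrors the column in subscriptions.csv
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "signup_date": pd.to_datetime([
        "2023-11-15", "2024-01-20", "2024-04-02",
        "2024-06-01", "2024-07-10", "2024-09-30",
    ]),
    "lifetime_value": [5200.0, 3100.0, 1800.0, 900.0, 650.0, 200.0],
})

cutoff = pd.Timestamp("2024-06-01")

# Train on the past, test on the future; every row lands in exactly one set
train_toy = toy[toy["signup_date"] < cutoff]
test_toy = toy[toy["signup_date"] >= cutoff]

print(f"Train: {len(train_toy)} customers, up to {train_toy['signup_date'].max().date()}")
print(f"Test:  {len(test_toy)} customers, from {test_toy['signup_date'].min().date()}")
```

One thing to watch when interpreting the result: customers who signed up recently have had less time to accumulate revenue, so their observed `lifetime_value` is lower for reasons unrelated to model quality.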

In [None]:
# Your solution here!

# TODO: Create time-based train/test split
# TODO: Train model on past data only
# TODO: Test on future data
# TODO: Compare to random split results

## 📝 Practice Assignment

**Problem:** Predict Monthly Recurring Revenue (MRR) 30 days into the future.

**Dataset:** Use current engagement metrics to predict next month's MRR.

**Steps:**
1. Create target variable: `future_mrr` (MRR one month later)
2. Features: Current usage, features adopted, event frequency, historical MRR trend
3. Train Linear, Ridge, and Random Forest models
4. Evaluate with MAE (in dollars) and R²
5. Identify customers with predicted MRR decrease > 20% (churn risk!)

**Deliverable:** Notebook showing model comparison and business recommendations.

**Bonus:** Build a dashboard showing:
- Predicted vs Actual MRR for test set
- Feature importance
- Top 20 at-risk customers (declining MRR predictions)

## 🔗 Next Steps

In **Week 8**, we'll explore **Unsupervised Learning: Clustering** to segment customers without predefined labels. We'll discover natural groupings in customer behavior and create data-driven personas.