# Week 5: Feature Engineering & Data Preprocessing

**Course:** Applied ML Foundations for SaaS Analytics  
**Week Focus:** Transform raw data into powerful features that capture customer behavior and business signals.

---

## 🎯 Learning Objectives

By the end of this week, you will:
- Encode categorical variables (one-hot, target encoding)
- Handle missing data strategically
- Scale and normalize numerical features
- Create interaction and polynomial features
- Build domain-driven features from business knowledge
- Detect and handle outliers appropriately
- Validate feature quality and distribution

```python
from IPython.display import HTML
HTML('''
<style>
details {
  margin: 10px 0;
  padding: 8px 12px;
  border: 1px solid #d9e2ec;
  border-radius: 8px;
  background: #f9fbfd;
}
details summary {
  font-weight: 600;
  color: #0056b3;
  cursor: pointer;
}
details[open] {
  background: #f1f7ff;
  border-color: #c3d4f0;
}
details pre {
  background: #f8f9fa;
  padding: 8px;
  border-radius: 6px;
}
</style>
''')
```

## 🏢 Scenario: Build a Churn Prediction Dataset

You need to train a churn prediction model. Raw data has:
- Mixed types: plan_tier (categorical), signup_date (datetime), usage_count (numeric)
- Missing values: some users have no feature usage data
- Outliers: a few power users with 1000x normal usage
- Business knowledge: days_since_signup, recent_engagement_change, plan_change_count

Task: Transform into a clean, ML-ready dataset.

## ✍️ Hands-on Exercises

1. **Categorical Encoding**: One-hot encode plan_tier, region, and customer_segment
2. **Temporal Features**: From signup_date, create: days_active, months_active, signup_quarter, is_recent
3. **Scaling**: Normalize usage_count and revenue features with StandardScaler or MinMaxScaler
4. **Interactions**: Create plan_tier × region, recent_usage × lifetime features
5. **Missing Data Strategy**: Define whether to drop, fill with mean, or use indicator variables for each column
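
As a starting point for exercises 2 and 4, here is a minimal sketch on a hypothetical toy frame. The column names follow the exercise list; the fixed snapshot date, the 90-day cutoff for `is_recent`, and the 30-day month approximation are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names follow the exercise list.
toy = pd.DataFrame({
    "signup_date": pd.to_datetime(["2023-01-15", "2024-05-01", "2024-06-20"]),
    "plan_tier": ["basic", "pro", "pro"],
    "region": ["EU", "US", "EU"],
    "recent_usage": [5, 40, 12],
    "lifetime_usage": [500, 80, 12],
})

# Assumption: a fixed snapshot date keeps the features reproducible.
snapshot = pd.Timestamp("2024-07-01")

# Temporal features derived from signup_date
toy["days_active"] = (snapshot - toy["signup_date"]).dt.days
toy["months_active"] = toy["days_active"] // 30  # approximate months
toy["signup_quarter"] = toy["signup_date"].dt.quarter
toy["is_recent"] = (toy["days_active"] <= 90).astype(int)  # assumed 90-day cutoff

# Categorical x categorical interaction: one combined label per pair
toy["plan_region"] = toy["plan_tier"] + "_" + toy["region"]

# Numeric x numeric interaction: simple product of the two usage columns
toy["recent_x_lifetime"] = toy["recent_usage"] * toy["lifetime_usage"]

print(toy[["days_active", "is_recent", "plan_region", "recent_x_lifetime"]])
```

The combined `plan_region` column can then be one-hot encoded like any other categorical feature.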

<details>
<summary>💡 Hint: Feature Engineering Workflow</summary>

**Step 1: Understand Data Types**
```python
df.dtypes  # what are we working with?
df.isnull().sum()  # where are the gaps?
```

**Step 2: Missing Data Strategy**
- Numeric: mean/median imputation, optionally with a binary "was missing" indicator column
- Categorical: mode or an explicit "Unknown" category
- When to drop: as a rule of thumb, if more than 50% of a feature's values are missing
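
The rules above can be sketched as follows. The toy values are made up, and median imputation is just one of the options listed; the 0.5 threshold mirrors the rule of thumb.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a categorical column.
toy = pd.DataFrame({
    "usage_count": [10.0, np.nan, 30.0, np.nan, 1000.0],
    "region": ["EU", None, "US", "EU", None],
})

# Share of missing values per column; candidates to drop exceed 50%.
missing_share = toy.isna().mean()
to_drop = missing_share[missing_share > 0.5].index.tolist()

# Numeric: record the gap in an indicator column, then impute with the median.
toy["usage_count_missing"] = toy["usage_count"].isna().astype(int)
toy["usage_count"] = toy["usage_count"].fillna(toy["usage_count"].median())

# Categorical: make missingness an explicit category.
toy["region"] = toy["region"].fillna("Unknown")

print(missing_share)
print(toy)
```

The median is used here instead of the mean because a single heavy user (1000) would otherwise pull the imputed value far from typical usage.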

**Step 3: Encode Categoricals**
- Few categories (roughly < 5): one-hot encoding
- Many categories (roughly > 100): target encoding or embedding
- In between: either can work; compare on validation data
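
For the high-cardinality case, a minimal sketch of smoothed target encoding is shown below. The `industry` column, the helper name `fit_target_encoding`, and the smoothing value of 5 are all hypothetical; the key idea is that the mapping is learned from training rows only and unseen categories fall back to the global churn rate.

```python
import pandas as pd

# Hypothetical training and test frames.
train = pd.DataFrame({
    "industry": ["retail", "retail", "retail", "fintech", "fintech", "health"],
    "churned": [1, 0, 1, 0, 0, 1],
})
test = pd.DataFrame({"industry": ["retail", "health", "gaming"]})


def fit_target_encoding(frame, column, target, smoothing=5.0):
    """Return a category -> smoothed target mean mapping and the global mean."""
    global_mean = frame[target].mean()
    stats = frame.groupby(column)[target].agg(["mean", "count"])
    # Shrink each category mean toward the global mean; small groups shrink more.
    encoded = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return encoded, global_mean


mapping, prior = fit_target_encoding(train, "industry", "churned")
train["industry_te"] = train["industry"].map(mapping)
# Categories never seen in training ("gaming") get the global mean.
test["industry_te"] = test["industry"].map(mapping).fillna(prior)

print(mapping)
print(test)
```

Encoding the training rows with a mapping learned from those same rows still leaks a little label information; computing the encoding out-of-fold is a common refinement.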

**Step 4: Scale/Normalize**
- Tree models: no scaling needed
- Linear models, neural nets, KNN: use StandardScaler or MinMaxScaler
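
A short sketch of both scalers on synthetic, skewed data follows. The data is randomly generated for illustration, and fitting on the training split only is a leakage precaution added here rather than something stated in the steps above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic right-skewed data standing in for usage and revenue columns.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=3.0, sigma=1.0, size=(200, 2))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# StandardScaler: zero mean and unit variance, learned from the training split.
std_scaler = StandardScaler().fit(X_train)
X_train_std = std_scaler.transform(X_train)
X_test_std = std_scaler.transform(X_test)  # reuses the training statistics

# MinMaxScaler: rescales the training range to [0, 1].
minmax_scaler = MinMaxScaler().fit(X_train)
X_train_mm = minmax_scaler.transform(X_train)
X_test_mm = minmax_scaler.transform(X_test)  # may fall slightly outside [0, 1]

print("standardized train mean:", X_train_std.mean(axis=0).round(3))
print("min-max train range:", X_train_mm.min(axis=0), X_train_mm.max(axis=0))
```

MinMaxScaler is sensitive to extreme values because a single outlier defines the range, so StandardScaler (or capping outliers first) is often the safer default.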

**Step 5: Feature Selection**
- Remove highly correlated features
- Remove near-zero variance features
- Use domain knowledge to keep business-meaningful features
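
The first two selection rules can be sketched like this. The synthetic columns, the near-duplicate `total_usage_copy`, and the 0.95 correlation cutoff are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Synthetic features: one near-duplicate column and one constant column.
rng = np.random.default_rng(42)
n = 300
features = pd.DataFrame({
    "total_usage": rng.normal(100, 20, n),
    "days_active": rng.normal(365, 90, n),
})
features["total_usage_copy"] = features["total_usage"] * 1.01 + rng.normal(0, 0.1, n)
features["constant_flag"] = 1.0

# 1. Remove near-zero variance features.
selector = VarianceThreshold(threshold=1e-8)
selector.fit(features)
kept = features.columns[selector.get_support()]
features_var = features[kept]

# 2. Remove one feature from each highly correlated pair.
corr = features_var.corr().abs()
# Keep only the upper triangle so every pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
selected = features_var.drop(columns=to_drop)

print("dropped for low variance:", sorted(set(features.columns) - set(kept)))
print("dropped for correlation:", to_drop)
print("selected:", selected.columns.tolist())
```

Which member of a correlated pair to keep is a judgment call; domain knowledge usually favors the one that is easier to explain to stakeholders.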

</details>

<details>
<summary>✅ Solution: Complete Feature Engineering Pipeline</summary>

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Load and merge data
subs = pd.read_csv('../data/subscriptions.csv', parse_dates=['signup_date','churn_date'])
feature_usage = pd.read_csv('../data/feature_usage.csv')

# Aggregate features by user
user_features = feature_usage.groupby('user_id').agg({
    'usage_count': 'sum',
    'feature_name': 'nunique'
}).rename(columns={'usage_count': 'total_usage', 'feature_name': 'num_features'})

# Merge
df = subs.merge(user_features, left_on='user_id', right_index=True, how='left')

# FEATURE ENGINEERING
# 1. Create temporal features
today = pd.Timestamp.now()
df['days_active'] = (df['churn_date'].fillna(today) - df['signup_date']).dt.days
df['signup_month'] = df['signup_date'].dt.month
df['signup_quarter'] = df['signup_date'].dt.quarter

# 2. Handle missing engagement data
df['total_usage'] = df['total_usage'].fillna(0)
df['num_features'] = df['num_features'].fillna(0)

# 3. Create target: has the customer churned at all?
#    (A "churns in the next 30 days" label would need a fixed snapshot date.)
df['target'] = df['churn_date'].notna().astype(int)

# 4. Encode categorical
df_encoded = pd.get_dummies(df, columns=['plan_tier'], drop_first=True)

# 5. Scale numeric features
#    (For real modeling, fit the scaler on the training split only to avoid leakage.)
scaler = StandardScaler()
numeric_cols = ['days_active', 'total_usage', 'num_features']
df_encoded[numeric_cols] = scaler.fit_transform(df_encoded[numeric_cols])

# Result: ML-ready dataset
print(f"Feature matrix shape: {df_encoded.shape}")
print(f"Null values: {df_encoded.isnull().sum().sum()}")
print(f"Target distribution: {df_encoded['target'].value_counts().to_dict()}")
```

**Why this works:**
- Clear separation: raw data → aggregation → features → encoding → scaling
- Handles missing values explicitly
- Target clearly defined
- Ready for train/test split and modeling

</details>

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

print("=" * 70)
print("WEEK 5: FEATURE ENGINEERING DEMO")
print("=" * 70)

# Load data
subs = pd.read_csv('../data/subscriptions.csv', parse_dates=['signup_date','churn_date'])
feature_usage = pd.read_csv('../data/feature_usage.csv')

print("\n1. DATA PREPARATION")
print("-" * 70)
print(f"Subscriptions: {len(subs)} records")
print(f"Feature usage: {len(feature_usage)} records")

# Aggregate user metrics
user_features = feature_usage.groupby('user_id').agg({
    'usage_count': 'sum',
    'feature_name': 'nunique'
}).rename(columns={'usage_count': 'total_usage', 'feature_name': 'num_features'})

print(f"Unique users in feature data: {len(user_features)}")

# Merge (left join keeps subscriptions that have no usage rows)
df = subs.merge(user_features, left_on='user_id', right_index=True, how='left')
print(f"After merge: {len(df)} subscriptions (with or without feature data)")

print("\n2. FEATURE ENGINEERING")
print("-" * 70)

# Temporal features
today = pd.Timestamp.now()
df['days_active'] = (df['churn_date'].fillna(today) - df['signup_date']).dt.days
df['is_churned'] = df['churn_date'].notna()

# Handle missing engagement: no usage rows means zero usage
df['total_usage'] = df['total_usage'].fillna(0)
df['num_features'] = df['num_features'].fillna(0)

print(f"Days active: min={df['days_active'].min()}, max={df['days_active'].max()}")
print(f"Total usage: min={df['total_usage'].min():.0f}, max={df['total_usage'].max():.0f}")
print(f"Users with feature data: {(df['num_features'] > 0).sum()} / {len(df)}")

print("\n3. FEATURE SCALING")
print("-" * 70)

# Before scaling
print(f"Total usage (before): mean={df['total_usage'].mean():.1f}, std={df['total_usage'].std():.1f}")
print(f"Days active (before): mean={df['days_active'].mean():.1f}, std={df['days_active'].std():.1f}")

# Apply scaling
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[['total_usage', 'days_active']])

print(f"\nTotal usage (after): mean={scaled_features[:,0].mean():.3f}, std={scaled_features[:,0].std():.3f}")
print(f"Days active (after): mean={scaled_features[:,1].mean():.3f}, std={scaled_features[:,1].std():.3f}")

print("\n4. CHURN PREDICTION FEATURE SET")
print("-" * 70)
churn_with_usage = df[df['num_features'] > 0]['is_churned'].mean()
churn_without_usage = df[df['num_features'] == 0]['is_churned'].mean()
print(f"Churn rate overall: {df['is_churned'].mean():.1%}")
print(f"Churn rate (users with feature usage): {churn_with_usage:.1%}")
print(f"Churn rate (users with no feature usage): {churn_without_usage:.1%}")
# Only report the insight when the data actually supports it
if churn_with_usage < churn_without_usage:
    print("\nInsight: Feature adoption is associated with lower churn.")
print("=" * 70)
```

## 📚 Key Concepts: Feature Engineering Best Practices

### The Feature Engineering Hierarchy
1. **Data collection**: Ensure you have the right data
2. **Data cleaning**: Handle missing, duplicates, outliers
3. **Domain features**: Leverage business knowledge (recency, frequency, customer lifecycle)
4. **Statistical features**: Interactions, ratios, transformations
5. **Automated features**: Deep learning, AutoML (often unnecessary for tabular SaaS data)

### Feature Quality Checklist
- [ ] No missing values (or strategy documented)
- [ ] Appropriate for model type (tree vs linear)
- [ ] Interpretable to business stakeholders
- [ ] Not highly correlated with other features
- [ ] Not a data leak (information from the future)
- [ ] Distribution makes sense (outliers justified)
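
For the last checklist item, one way to flag and tame extreme values is sketched below. The usage numbers are invented, and the 1.5 × IQR fences are a common convention rather than a rule from this course.

```python
import numpy as np
import pandas as pd

# Hypothetical usage counts with one extreme power user.
usage = pd.Series([8, 10, 12, 9, 11, 10, 13, 9, 10_000], name="usage_count")

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = usage.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (usage < lower_fence) | (usage > upper_fence)

# Option A: cap (winsorize) values at the fences.
usage_capped = usage.clip(lower=lower_fence, upper=upper_fence)

# Option B: compress the scale with log(1 + x), keeping the ordering intact.
usage_log = np.log1p(usage)

print(f"fences: [{lower_fence}, {upper_fence}], outliers flagged: {is_outlier.sum()}")
print(f"max after capping: {usage_capped.max()}, max after log1p: {usage_log.max():.2f}")
```

Power users are often real and valuable customers, so capping or transforming is usually preferable to deleting their rows.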

### Data Leakage: The Silent Killer
```python
# BAD: Using future information (this feature is derived from the label)
df['has_churn_flag'] = df['churn_date'].notna()  # we're predicting this!

# GOOD: Using only historical information
today = pd.Timestamp.now()
df['days_since_signup'] = (today - df['signup_date']).dt.days
```

## 🤔 Reflection & Application

**Question 1:** Which single chart would you show a CEO in 30 seconds?
- Line chart: Trend (retention vs churn over time)
- Bar chart: Comparison (segment A vs B)
- Combination: Show both signal and uncertainty

**Question 2:** Should you include all features in your model?
- No! Too many features → overfitting → poor generalization
- Use correlation analysis, feature importance, or domain knowledge to select

**Question 3:** How do you avoid data leakage in production?
- Clearly timestamp each data point
- Use only features available at decision time
- Test on truly holdout (future) data
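
These three points can be combined into a snapshot-based dataset, sketched below on invented data. The snapshot date, the 30-day horizon, and the `event_date` column are assumptions; features use only events before the snapshot, and the label looks only at the window after it.

```python
import pandas as pd

# Hypothetical timestamped usage events and subscriptions.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "event_date": pd.to_datetime(
        ["2024-01-05", "2024-03-20", "2024-02-10", "2024-02-25", "2024-03-15"]
    ),
    "usage_count": [5, 7, 3, 4, 9],
})
subs = pd.DataFrame({
    "user_id": [1, 2, 3],
    "signup_date": pd.to_datetime(["2023-12-01", "2024-01-15", "2024-02-20"]),
    "churn_date": pd.to_datetime(["2024-03-25", None, None]),
})

snapshot = pd.Timestamp("2024-03-01")  # assumed decision time
horizon = pd.Timedelta(days=30)        # assumed prediction window

# Features: only events that happened before the snapshot.
past = events[events["event_date"] < snapshot]
usage = past.groupby("user_id")["usage_count"].sum().rename("usage_before_snapshot")

# Population: customers who were still active at the snapshot.
active = subs[
    (subs["signup_date"] < snapshot)
    & (subs["churn_date"].isna() | (subs["churn_date"] >= snapshot))
]
dataset = active.merge(usage, left_on="user_id", right_index=True, how="left")
dataset["usage_before_snapshot"] = dataset["usage_before_snapshot"].fillna(0)
dataset["days_since_signup"] = (snapshot - dataset["signup_date"]).dt.days

# Label: did the customer churn within the horizon after the snapshot?
dataset["target"] = (
    dataset["churn_date"].notna() & (dataset["churn_date"] < snapshot + horizon)
).astype(int)

print(dataset[["user_id", "usage_before_snapshot", "days_since_signup", "target"]])
```

Training on an earlier snapshot and evaluating on a later one gives a holdout set that is genuinely in the future relative to the training data.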

## 📝 Practice Assignment

**Problem:** Create a customer quality score (0-100) combining:
1. Engagement: Feature adoption and usage
2. Stability: Customer lifetime and churn risk
3. Value: Plan tier and payment health

**Steps:**
1. Engineer the 3 dimension features
2. Normalize each to 0-100 scale
3. Create composite score (weighted average)
4. Validate: does high score correlate with lower churn?
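
Steps 2 and 3 could look roughly like the sketch below. The three input columns, the helper `to_0_100`, and the 0.4 / 0.3 / 0.3 weights are placeholders; choosing real dimension features and weights is the point of the assignment.

```python
import pandas as pd

# Hypothetical per-customer inputs, one column per dimension.
customers = pd.DataFrame(
    {
        "num_features": [1, 4, 8],               # engagement proxy
        "days_active": [30, 400, 900],           # stability proxy
        "monthly_revenue": [20.0, 100.0, 500.0], # value proxy
    },
    index=["a", "b", "c"],
)


def to_0_100(series):
    """Min-max rescale a series to the 0-100 range."""
    span = series.max() - series.min()
    if span == 0:
        # All values identical: place everyone at the midpoint.
        return pd.Series(50.0, index=series.index)
    return (series - series.min()) / span * 100


dimensions = pd.DataFrame({
    "engagement": to_0_100(customers["num_features"]),
    "stability": to_0_100(customers["days_active"]),
    "value": to_0_100(customers["monthly_revenue"]),
})

# Assumed weights; they sum to 1 so the composite stays on the 0-100 scale.
weights = pd.Series({"engagement": 0.4, "stability": 0.3, "value": 0.3})
customers["quality_score"] = dimensions.mul(weights, axis=1).sum(axis=1)

print(customers["quality_score"].round(1))
```

For step 4, comparing churn rates across score bands (for example, low, medium, and high) tends to be easier to read than a single correlation coefficient.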

## 🔗 Next Steps

In Week 6, we'll use these engineered features to train classification models that predict which customers are at risk.