# Phase 4: Feature Engineering

This notebook creates derived features and defines feature sets for the diabetes prediction models.

## Objectives
1. Create derived features with clinical/scientific rationale
2. Validate feature calculations (distributions, edge cases)
3. Define feature sets (with_labs, without_labs)
4. Prepare final datasets for modeling

In [None]:
import sys
from pathlib import Path

# Add src to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / 'src'))

import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from features.builders import (
    create_all_derived_features,
    get_feature_sets,
    validate_feature_availability,
    prepare_modeling_data
)

# Visualization settings
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.dpi'] = 100
plt.rcParams['savefig.dpi'] = 300

# Paths
DATA_INTERIM = project_root / 'data' / 'interim'
DATA_PROCESSED = project_root / 'data' / 'processed'
FIGURES_DIR = project_root / 'reports' / 'figures'

DATA_PROCESSED.mkdir(parents=True, exist_ok=True)
FIGURES_DIR.mkdir(parents=True, exist_ok=True)  # plt.savefig fails if the directory is missing
print(f"Project root: {project_root}")

## 1. Load Cleaned Data

We load the minimally imputed dataset, which keeps NaN values so that tree-based models can handle missingness themselves.

In [None]:
# Load the cleaned data
df = pd.read_parquet(DATA_INTERIM / 'cleaned_minimal_impute.parquet')
print(f"Loaded data: {df.shape[0]:,} rows, {df.shape[1]:,} columns")
print(f"\nTarget distribution:")
print(df['DIABETES_STATUS'].value_counts().sort_index())

## 2. Create Derived Features

### 2.1 Feature Creation Overview

| Feature | Formula | Clinical Rationale |
|---------|---------|--------------------|
| AVG_SYS_BP | mean(BPXSY1-3) | Averages multiple readings for stability |
| AVG_DIA_BP | mean(BPXDI1-3) | Averages multiple readings for stability |
| TOTAL_WATER | sum(water columns) | Hydration affects glucose regulation |
| ACR_RATIO | albumin/creatinine | Early kidney damage marker |
| WEIGHT_CHANGE_10YR | current - 10yr ago | Weight gain trajectory |
| WEIGHT_CHANGE_25 | current - age 25 | Lifetime weight gain |
| WEIGHT_FROM_MAX | max - current | Weight loss from peak |
| WAKE_TIME_DIFF | weekend - weekday | Social jet lag affects metabolism |
| WAIST_HEIGHT_RATIO | waist / height | Central obesity measure |
| SAT_FAT_PCT | sat_fat / total_fat | Dietary fat quality |

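A few of the formulas in the table can be sketched in pandas as follows. This is a minimal illustration on toy data, not the code inside `create_all_derived_features`: apart from `BPXSY1`-`BPXSY3`, the input column names are hypothetical placeholders, and expressing `SAT_FAT_PCT` on a 0-100 scale is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data; every column except BPXSY1-3 uses a hypothetical name.
toy = pd.DataFrame({
    "BPXSY1": [120.0, 135.0, np.nan],
    "BPXSY2": [124.0, np.nan, np.nan],
    "BPXSY3": [122.0, 131.0, np.nan],
    "WAIST_CM": [90.0, 110.0, 80.0],
    "HEIGHT_CM": [180.0, 170.0, 0.0],   # 0 simulates a bad height entry
    "SAT_FAT_G": [20.0, 30.0, 10.0],
    "TOTAL_FAT_G": [60.0, 0.0, 40.0],   # 0 simulates a zero denominator
})

derived = pd.DataFrame(index=toy.index)

# Row-wise mean skips NaN, so one or two missing readings still give an average;
# a row with no readings at all stays NaN.
derived["AVG_SYS_BP"] = toy[["BPXSY1", "BPXSY2", "BPXSY3"]].mean(axis=1)

# Turn zero denominators into NaN so the ratios become NaN instead of inf.
derived["WAIST_HEIGHT_RATIO"] = toy["WAIST_CM"] / toy["HEIGHT_CM"].replace(0, np.nan)
derived["SAT_FAT_PCT"] = 100 * toy["SAT_FAT_G"] / toy["TOTAL_FAT_G"].replace(0, np.nan)

print(derived.round(3))
```

Guarding the denominators matters for the validation plots in Section 3, since a single `inf` value would distort histograms and group means.
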
In [None]:
# Create all derived features
df_feat, feature_stats = create_all_derived_features(df)

print(f"\nShape after feature engineering: {df_feat.shape}")
print(f"New columns: {df_feat.shape[1] - df.shape[1]}")

In [None]:
# Summary of created features
feature_summary = pd.DataFrame([
    {
        'Feature': name,
        'N Valid': stats.get('n_valid', 0),
        'Mean': stats.get('mean', np.nan),
        'Median': stats.get('median', np.nan),
        'Min': stats.get('range', [np.nan, np.nan])[0],
        'Max': stats.get('range', [np.nan, np.nan])[1]
    }
    for name, stats in feature_stats.items()
])

print("\nDerived Feature Summary:")
display(feature_summary)

## 3. Validate Derived Features

For each feature, we check:
- Distribution shape (histogram)
- Relationship with target
- Edge cases and outliers

### 3.1 Blood Pressure Averages

**Clinical Context:**
- Normal: <120/<80 mmHg
- Elevated: 120-129/<80
- Hypertension Stage 1: 130-139/80-89
- Hypertension Stage 2: ≥140/≥90

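The thresholds above can be turned into a categorical label. The sketch below assumes the 2017 ACC/AHA convention, in which Normal and Elevated require a diastolic value below 80 while Stage 1 and Stage 2 are triggered by either the systolic or the diastolic value; `bp_category` is a hypothetical helper, not part of `features.builders`.

```python
import numpy as np
import pandas as pd

def bp_category(sys_bp: pd.Series, dia_bp: pd.Series) -> pd.Series:
    """Label each systolic/diastolic pair with a blood pressure category."""
    # np.select returns the label of the first condition that is True,
    # so the conditions are ordered from most to least severe.
    conditions = [
        sys_bp.isna() | dia_bp.isna(),
        (sys_bp >= 140) | (dia_bp >= 90),
        (sys_bp >= 130) | (dia_bp >= 80),
        sys_bp >= 120,
    ]
    labels = ["Unknown", "Stage 2", "Stage 1", "Elevated"]
    return pd.Series(np.select(conditions, labels, default="Normal"),
                     index=sys_bp.index)

sys_toy = pd.Series([118, 125, 128, 134, 150, np.nan])
dia_toy = pd.Series([76, 78, 84, 79, 70, 80])
bp_labels = bp_category(sys_toy, dia_toy)
print(bp_labels.tolist())
```

The third toy reading (128/84) lands in Stage 1 because of its diastolic value alone, which is why the diastolic reference line at 80 mmHg marks a Stage 1 boundary.
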
In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Systolic BP distribution
ax = axes[0, 0]
df_feat['AVG_SYS_BP'].hist(bins=50, ax=ax, edgecolor='black', alpha=0.7)
ax.axvline(130, color='orange', linestyle='--', label='Stage 1 (130)')
ax.axvline(140, color='red', linestyle='--', label='Stage 2 (140)')
ax.set_xlabel('Average Systolic BP (mmHg)')
ax.set_ylabel('Count')
ax.set_title('Distribution of Average Systolic BP')
ax.legend()

# Diastolic BP distribution
ax = axes[0, 1]
df_feat['AVG_DIA_BP'].hist(bins=50, ax=ax, edgecolor='black', alpha=0.7)
ax.axvline(80, color='orange', linestyle='--', label='Stage 1 (80)')
ax.axvline(90, color='red', linestyle='--', label='Stage 2 (90)')
ax.set_xlabel('Average Diastolic BP (mmHg)')
ax.set_ylabel('Count')
ax.set_title('Distribution of Average Diastolic BP')
ax.legend()

# Systolic by diabetes status
ax = axes[1, 0]
df_feat.boxplot(column='AVG_SYS_BP', by='DIABETES_STATUS', ax=ax)
ax.set_xlabel('Diabetes Status (0=No, 1=Pre, 2=Diabetes)')
ax.set_ylabel('Avg Systolic BP (mmHg)')
ax.set_title('Systolic BP by Diabetes Status')
plt.suptitle('')

# Diastolic by diabetes status
ax = axes[1, 1]
df_feat.boxplot(column='AVG_DIA_BP', by='DIABETES_STATUS', ax=ax)
ax.set_xlabel('Diabetes Status (0=No, 1=Pre, 2=Diabetes)')
ax.set_ylabel('Avg Diastolic BP (mmHg)')
ax.set_title('Diastolic BP by Diabetes Status')
plt.suptitle('')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'derived_bp_validation.png', bbox_inches='tight')
plt.show()

# Statistics by diabetes status
print("\nMean BP by Diabetes Status:")
print(df_feat.groupby('DIABETES_STATUS')[['AVG_SYS_BP', 'AVG_DIA_BP']].mean().round(1))

### 3.2 ACR Ratio (Kidney Function)

**Clinical Context:**
- Normal: <30 mg/g
- Microalbuminuria (early kidney damage): 30-299 mg/g
- Macroalbuminuria (overt kidney disease): ≥300 mg/g

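Boundary handling matters when binning ACR, because 30 mg/g already counts as microalbuminuria and 300 mg/g as macroalbuminuria. A minimal sketch on toy values, using left-closed bins in `pd.cut`:

```python
import numpy as np
import pandas as pd

acr_toy = pd.Series([0.0, 29.9, 30.0, 299.9, 300.0, np.nan])

# right=False makes the bins [0, 30), [30, 300), [300, inf),
# so the boundary values fall into the higher category.
acr_category = pd.cut(
    acr_toy,
    bins=[0, 30, 300, np.inf],
    right=False,
    labels=["Normal (<30)", "Micro (30-299)", "Macro (≥300)"],
)
print(acr_category.tolist())
```

With the default `right=True`, the same bin edges would place 30.0 in the normal group, 300.0 in the micro group, and drop 0.0 entirely.
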
In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# ACR distribution (log scale due to skew)
ax = axes[0]
acr_valid = df_feat['ACR_RATIO'].dropna()
acr_valid[acr_valid > 0].apply(np.log10).hist(bins=50, ax=ax, edgecolor='black', alpha=0.7)
ax.axvline(np.log10(30), color='orange', linestyle='--', label='Microalbuminuria (30)')
ax.axvline(np.log10(300), color='red', linestyle='--', label='Macroalbuminuria (300)')
ax.set_xlabel('Log10(ACR Ratio)')
ax.set_ylabel('Count')
ax.set_title('Distribution of ACR Ratio (log scale)')
ax.legend()

# ACR by diabetes status
ax = axes[1]
acr_data = df_feat[['ACR_RATIO', 'DIABETES_STATUS']].dropna()
# Use log scale for visualization
for status in [0, 1, 2]:
    data = acr_data[acr_data['DIABETES_STATUS'] == status]['ACR_RATIO']
    data = data[data > 0]  # log10 is undefined for non-positive values
    ax.hist(np.log10(data), bins=30, alpha=0.5,
            label=f'Status {status} (n={len(data)})')
ax.axvline(np.log10(30), color='orange', linestyle='--')
ax.axvline(np.log10(300), color='red', linestyle='--')
ax.set_xlabel('Log10(ACR Ratio)')
ax.set_ylabel('Count')
ax.set_title('ACR by Diabetes Status')
ax.legend()

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'derived_acr_validation.png', bbox_inches='tight')
plt.show()

# ACR categories by diabetes status
print("\nACR Categories by Diabetes Status:")
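# right=False gives [0, 30), [30, 300), [300, inf) so the bin edges match the labels
acr_cat = pd.cut(df_feat['ACR_RATIO'], bins=[0, 30, 300, float('inf')], right=False,
                 labels=['Normal (<30)', 'Micro (30-299)', 'Macro (≥300)'])
# Row percentages; multiply before rounding to avoid floating-point noise in the output
print((pd.crosstab(df_feat['DIABETES_STATUS'], acr_cat, normalize='index') * 100).round(1))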

### 3.3 Weight Change Features

**Clinical Context:**
- Weight gain is a major modifiable risk factor for Type 2 diabetes
- Even modest weight loss (5-10%) can significantly reduce diabetes risk
- Weight trajectory matters: recent gain vs stable obesity

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

weight_features = ['WEIGHT_CHANGE_10YR', 'WEIGHT_CHANGE_25', 'WEIGHT_FROM_MAX']
titles = ['Weight Change (10 yrs)', 'Weight Change (from 25)', 'Weight From Maximum']

for i, (feat, title) in enumerate(zip(weight_features, titles)):
    # Distribution
    ax = axes[0, i]
    data = df_feat[feat].dropna()
    # Clip extreme values for visualization
    data_clipped = data.clip(-50, 50)
    data_clipped.hist(bins=50, ax=ax, edgecolor='black', alpha=0.7)
    ax.axvline(0, color='red', linestyle='--', label='No change')
    ax.set_xlabel(f'{title} (kg)')
    ax.set_ylabel('Count')
    ax.set_title(f'Distribution: {title}')
    ax.legend()
    
    # By diabetes status
    ax = axes[1, i]
    df_feat.boxplot(column=feat, by='DIABETES_STATUS', ax=ax)
    ax.set_xlabel('Diabetes Status')
    ax.set_ylabel(f'{title} (kg)')
    ax.set_title(f'{title} by Status')
    plt.suptitle('')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'derived_weight_change_validation.png', bbox_inches='tight')
plt.show()

# Statistics
print("\nWeight Change Features by Diabetes Status (mean kg):")
print(df_feat.groupby('DIABETES_STATUS')[weight_features].mean().round(1))

### 3.4 Waist-to-Height Ratio

**Clinical Context:**
- Simple rule: "Keep your waist less than half your height"
- Ratio >0.5: Increased cardiometabolic risk
- Ratio >0.6: Substantially increased risk
- Often reported to predict diabetes better than BMI in many populations

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Distribution
ax = axes[0]
df_feat['WAIST_HEIGHT_RATIO'].hist(bins=50, ax=ax, edgecolor='black', alpha=0.7)
ax.axvline(0.5, color='orange', linestyle='--', label='Elevated (0.5)')
ax.axvline(0.6, color='red', linestyle='--', label='High risk (0.6)')
ax.set_xlabel('Waist-to-Height Ratio')
ax.set_ylabel('Count')
ax.set_title('Distribution of Waist-to-Height Ratio')
ax.legend()

# By diabetes status
ax = axes[1]
df_feat.boxplot(column='WAIST_HEIGHT_RATIO', by='DIABETES_STATUS', ax=ax)
ax.axhline(0.5, color='orange', linestyle='--', alpha=0.7)
ax.axhline(0.6, color='red', linestyle='--', alpha=0.7)
ax.set_xlabel('Diabetes Status')
ax.set_ylabel('Waist-to-Height Ratio')
ax.set_title('Waist-Height Ratio by Status')
plt.suptitle('')

# Comparison with BMI
ax = axes[2]
sample = df_feat.sample(min(2000, len(df_feat)), random_state=42)  # fixed seed for a reproducible plot
colors = {0: 'green', 1: 'orange', 2: 'red'}
for status in [0, 1, 2]:
    mask = sample['DIABETES_STATUS'] == status
    ax.scatter(sample.loc[mask, 'BMXBMI'], 
               sample.loc[mask, 'WAIST_HEIGHT_RATIO'],
               c=colors[status], alpha=0.3, label=f'Status {status}')
ax.set_xlabel('BMI')
ax.set_ylabel('Waist-to-Height Ratio')
ax.set_title('Waist-Height Ratio vs BMI')
ax.legend()

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'derived_waist_height_validation.png', bbox_inches='tight')
plt.show()

print("\nWaist-Height Ratio by Diabetes Status:")
print(df_feat.groupby('DIABETES_STATUS')['WAIST_HEIGHT_RATIO'].describe().round(3))

### 3.5 Wake Time Difference (Social Jet Lag)

**Clinical Context:**
- "Social jet lag" disrupts circadian rhythms
- Associated with metabolic dysfunction and obesity
- Large differences (>2 hours) may indicate sleep debt

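One way to compute the weekend minus weekday difference is sketched below. It assumes wake times arrive as `HH:MM` strings, which may not match how the cleaned data stores them; both helper functions are hypothetical. The difference is wrapped into [-12, 12) hours so that times on either side of midnight do not produce spurious 22-hour gaps.

```python
import pandas as pd

def clock_to_hours(times: pd.Series) -> pd.Series:
    """Convert 'HH:MM' strings to decimal hours; non-matching values become NaN."""
    parts = times.str.extract(r"^(\d{1,2}):(\d{2})$")
    hours = pd.to_numeric(parts[0], errors="coerce")
    minutes = pd.to_numeric(parts[1], errors="coerce")
    return hours + minutes / 60

def wake_time_diff(weekday: pd.Series, weekend: pd.Series) -> pd.Series:
    """Weekend minus weekday wake time in hours, wrapped into [-12, 12)."""
    diff = clock_to_hours(weekend) - clock_to_hours(weekday)
    return (diff + 12) % 24 - 12

weekday_toy = pd.Series(["06:30", "23:00", "07:00", None])
weekend_toy = pd.Series(["08:30", "01:00", "06:15", "09:00"])
wake_diff_toy = wake_time_diff(weekday_toy, weekend_toy)
print(wake_diff_toy.tolist())
```

The second toy respondent wakes at 23:00 on weekdays and 01:00 on weekends; the wrap reports this as 2 hours later rather than 22 hours earlier.
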
In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Distribution
ax = axes[0]
wake_diff = df_feat['WAKE_TIME_DIFF'].dropna()
wake_diff = wake_diff[(wake_diff > -6) & (wake_diff < 6)]  # Filter extreme values
wake_diff.hist(bins=40, ax=ax, edgecolor='black', alpha=0.7)
ax.axvline(0, color='gray', linestyle='--', label='Same wake time')
ax.axvline(2, color='orange', linestyle='--', label='2 hrs later')
ax.set_xlabel('Wake Time Difference (hours)')
ax.set_ylabel('Count')
ax.set_title('Weekend - Weekday Wake Time')
ax.legend()

# By diabetes status
ax = axes[1]
df_feat.boxplot(column='WAKE_TIME_DIFF', by='DIABETES_STATUS', ax=ax)
ax.set_xlabel('Diabetes Status')
ax.set_ylabel('Wake Time Difference (hours)')
ax.set_title('Wake Time Diff by Status')
plt.suptitle('')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'derived_wake_time_validation.png', bbox_inches='tight')
plt.show()

print("\nWake Time Difference by Diabetes Status:")
print(df_feat.groupby('DIABETES_STATUS')['WAKE_TIME_DIFF'].describe().round(2))

### 3.6 Dietary Features (Water & Fat Quality)

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Total water distribution
ax = axes[0, 0]
water = df_feat['TOTAL_WATER'].dropna()
water = water[water < 5000]  # Filter extreme
water.hist(bins=50, ax=ax, edgecolor='black', alpha=0.7)
ax.set_xlabel('Total Water Intake (g)')
ax.set_ylabel('Count')
ax.set_title('Distribution of Total Water Intake')

# Water by status
ax = axes[0, 1]
df_feat.boxplot(column='TOTAL_WATER', by='DIABETES_STATUS', ax=ax)
ax.set_xlabel('Diabetes Status')
ax.set_ylabel('Total Water (g)')
ax.set_title('Water Intake by Status')
plt.suptitle('')

# Saturated fat percentage distribution
ax = axes[1, 0]
df_feat['SAT_FAT_PCT'].dropna().hist(bins=50, ax=ax, edgecolor='black', alpha=0.7)
ax.axvline(33, color='red', linestyle='--', label='~1/3 of fat (typical)')
ax.set_xlabel('Saturated Fat % of Total Fat')
ax.set_ylabel('Count')
ax.set_title('Distribution of Saturated Fat Percentage')
ax.legend()

# Sat fat by status
ax = axes[1, 1]
df_feat.boxplot(column='SAT_FAT_PCT', by='DIABETES_STATUS', ax=ax)
ax.set_xlabel('Diabetes Status')
ax.set_ylabel('Saturated Fat %')
ax.set_title('Sat Fat % by Status')
plt.suptitle('')

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'derived_dietary_validation.png', bbox_inches='tight')
plt.show()

## 4. Define Feature Sets

We create two feature sets:
1. **with_labs**: All features including laboratory values (requires blood draw)
2. **without_labs**: Excludes lab values (for screening without blood tests)

In [None]:
# Get feature set definitions
feature_sets = get_feature_sets()

print("Feature Set Summary:")
print("=" * 60)
for name, info in feature_sets.items():
    print(f"\n{name.upper()}:")
    print(f"  Description: {info['description']}")
    print(f"  Total features: {info['n_features']}")
    if 'categories' in info:
        print("  By category:")
        for cat, count in info['categories'].items():
            print(f"    - {cat}: {count}")

In [None]:
# Check feature availability in our data
for name in ['with_labs', 'without_labs']:
    features = feature_sets[name]['features']
    available, missing, stats = validate_feature_availability(df_feat, features)
    
    print(f"\n{name.upper()} Feature Availability:")
    print(f"  Requested: {stats['total_requested']}")
    print(f"  Available: {stats['available']}")
    print(f"  Missing: {stats['missing']}")
    if missing:
        print(f"  Missing features: {missing}")

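The availability check above is imported from `features.builders`, so its internals are not visible in this notebook. A minimal stand-in with the same return shape (two lists plus a stats dictionary with the keys printed above) might look like the following; `check_feature_availability` is a hypothetical name and the real implementation may differ.

```python
import pandas as pd

def check_feature_availability(df: pd.DataFrame, features: list) -> tuple:
    """Split requested features into those present in df and those absent."""
    columns = set(df.columns)
    available = [f for f in features if f in columns]
    missing = [f for f in features if f not in columns]
    stats = {
        "total_requested": len(features),
        "available": len(available),
        "missing": len(missing),
    }
    return available, missing, stats

toy = pd.DataFrame({"AVG_SYS_BP": [120.0], "ACR_RATIO": [12.0]})
available_toy, missing_toy, stats_toy = check_feature_availability(
    toy, ["AVG_SYS_BP", "ACR_RATIO", "WAIST_HEIGHT_RATIO"]
)
print(stats_toy)
```
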
## 5. Feature Availability Analysis

Check the missing rate of each feature in the `with_labs` set, which also covers every `without_labs` feature.

In [None]:
# Calculate missing rates for with_labs features
with_labs_features = feature_sets['with_labs']['features']
available_features = [f for f in with_labs_features if f in df_feat.columns]

missing_rates = df_feat[available_features].isna().mean().sort_values(ascending=False)

# Show features with >10% missing
high_missing = missing_rates[missing_rates > 0.10]
print(f"Features with >10% missing ({len(high_missing)}):")
print(high_missing.to_string())

In [None]:
# Visualize missing rates
fig, ax = plt.subplots(figsize=(12, 8))

# Show top 30 features by missing rate
top_missing = missing_rates.head(30)
colors = ['red' if x > 0.5 else 'orange' if x > 0.1 else 'green' for x in top_missing]
top_missing.plot(kind='barh', ax=ax, color=colors)
ax.axvline(0.5, color='red', linestyle='--', label='>50% missing')
ax.axvline(0.1, color='orange', linestyle='--', label='>10% missing')
ax.set_xlabel('Missing Rate')
ax.set_title('Missing Rates for Model Features')
ax.legend()

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'feature_missing_rates.png', bbox_inches='tight')
plt.show()

## 6. Prepare Final Datasets

Create modeling-ready datasets for both feature sets.

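The `include_missing_flags=True` option used below suggests that indicator columns are added for features with missing values. A minimal sketch of that idea follows, assuming one 0/1 flag per column that contains at least one NaN; the `_MISSING` suffix, the `AGE` column, and the `add_missing_flags` helper are hypothetical, and `prepare_modeling_data` may work differently.

```python
import numpy as np
import pandas as pd

def add_missing_flags(X: pd.DataFrame, suffix: str = "_MISSING") -> pd.DataFrame:
    """Append a 0/1 indicator for every column that contains at least one NaN."""
    cols_with_nan = [c for c in X.columns if X[c].isna().any()]
    flags = X[cols_with_nan].isna().astype("int8").add_suffix(suffix)
    return pd.concat([X, flags], axis=1)

X_toy = pd.DataFrame({"BMXBMI": [22.5, np.nan, 31.0], "AGE": [40, 55, 63]})
X_flagged = add_missing_flags(X_toy)
print(X_flagged)
```

Keeping the original NaN values alongside the flags lets tree-based models use their own missing-value handling, while the flags make missingness available to models that require imputed inputs.
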
In [None]:
# Prepare with_labs dataset
X_with_labs, y_with_labs, meta_with_labs = prepare_modeling_data(
    df_feat, 
    feature_set='with_labs',
    target_col='DIABETES_STATUS',
    include_missing_flags=True
)

print("WITH_LABS Dataset:")
print(f"  Shape: {X_with_labs.shape}")
print(f"  Base features: {meta_with_labs['n_base_features']}")
print(f"  Missing flags: {meta_with_labs['n_missing_flags']}")
print(f"  Samples: {meta_with_labs['n_samples']}")

In [None]:
# Prepare without_labs dataset
X_without_labs, y_without_labs, meta_without_labs = prepare_modeling_data(
    df_feat, 
    feature_set='without_labs',
    target_col='DIABETES_STATUS',
    include_missing_flags=True
)

print("\nWITHOUT_LABS Dataset:")
print(f"  Shape: {X_without_labs.shape}")
print(f"  Base features: {meta_without_labs['n_base_features']}")
print(f"  Missing flags: {meta_without_labs['n_missing_flags']}")
print(f"  Samples: {meta_without_labs['n_samples']}")

In [None]:
# Verify target distribution is preserved
print("\nTarget Distribution (y_with_labs):")
print(y_with_labs.value_counts().sort_index())
print(f"\nPercentages:")
print((y_with_labs.value_counts(normalize=True).sort_index() * 100).round(1))

## 7. Save Processed Data

Save the feature-engineered datasets and metadata.

In [None]:
# Save the full feature-engineered dataframe
df_feat.to_parquet(DATA_PROCESSED / 'features_engineered.parquet', index=False)
print(f"Saved: features_engineered.parquet ({df_feat.shape})")

# Save modeling datasets
X_with_labs.to_parquet(DATA_PROCESSED / 'X_with_labs.parquet', index=False)
y_with_labs.to_frame().to_parquet(DATA_PROCESSED / 'y_with_labs.parquet', index=False)
print(f"Saved: X_with_labs.parquet ({X_with_labs.shape})")
print(f"Saved: y_with_labs.parquet ({len(y_with_labs)})")

X_without_labs.to_parquet(DATA_PROCESSED / 'X_without_labs.parquet', index=False)
y_without_labs.to_frame().to_parquet(DATA_PROCESSED / 'y_without_labs.parquet', index=False)
print(f"Saved: X_without_labs.parquet ({X_without_labs.shape})")
print(f"Saved: y_without_labs.parquet ({len(y_without_labs)})")

In [None]:
# Save feature engineering report
feature_report = {
    'derived_features': feature_stats,
    'feature_sets': {
        'with_labs': {
            'n_features': meta_with_labs['n_features'],
            'n_base_features': meta_with_labs['n_base_features'],
            'n_missing_flags': meta_with_labs['n_missing_flags'],
            'n_samples': meta_with_labs['n_samples'],
            'missing_features': meta_with_labs['missing_features']
        },
        'without_labs': {
            'n_features': meta_without_labs['n_features'],
            'n_base_features': meta_without_labs['n_base_features'],
            'n_missing_flags': meta_without_labs['n_missing_flags'],
            'n_samples': meta_without_labs['n_samples'],
            'missing_features': meta_without_labs['missing_features']
        }
    },
    'high_missing_features': missing_rates[missing_rates > 0.10].to_dict()
}

with open(DATA_PROCESSED / 'feature_engineering_report.json', 'w') as f:
    json.dump(feature_report, f, indent=2, default=str)
print("\nSaved: feature_engineering_report.json")

## 8. Summary & Additional Feature Ideas

### Features Created

| Feature | Description | Clinical Value |
|---------|-------------|----------------|
| AVG_SYS_BP | Average systolic BP | More stable than single reading |
| AVG_DIA_BP | Average diastolic BP | More stable than single reading |
| TOTAL_WATER | Total water intake | Hydration affects glucose |
| ACR_RATIO | Albumin/creatinine ratio | Early kidney damage marker |
| WEIGHT_CHANGE_10YR | Weight change from 10 yrs ago | Recent weight trajectory |
| WEIGHT_CHANGE_25 | Weight change from age 25 | Lifetime weight gain |
| WEIGHT_FROM_MAX | Weight lost from maximum | Weight loss effort |
| WAKE_TIME_DIFF | Weekend-weekday wake difference | Social jet lag |
| WAIST_HEIGHT_RATIO | Waist/height ratio | Central obesity |
| SAT_FAT_PCT | Saturated fat % of total | Dietary fat quality |

### Additional Features to Consider (Future Work)

1. **Interaction features**: BMI × Age, Physical activity × Sedentary time
2. **PHQ-9 total score**: Sum of depression items
3. **Metabolic syndrome components**: Combine BP, waist, lipids
4. **Dietary quality scores**: HEI (Healthy Eating Index)
5. **Cardiovascular risk score**: Framingham-like composite

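As an illustration of item 2, a PHQ-9 total could be sketched as below. This assumes the nine depression items use the NHANES names `DPQ010` to `DPQ090`, are scored 0-3, and may still contain the raw codes 7 (refused) and 9 (don't know); if the cleaning phase already recoded them, the masking step is a no-op. `phq9_total` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# DPQ010, DPQ020, ..., DPQ090 (assumed NHANES item names)
PHQ9_ITEMS = [f"DPQ{i:03d}" for i in range(10, 100, 10)]

def phq9_total(df: pd.DataFrame) -> pd.Series:
    """Sum the nine items; rows with any invalid or missing item become NaN."""
    items = df[PHQ9_ITEMS]
    valid = items.where(items.isin([0, 1, 2, 3]))  # mask codes such as 7 and 9
    # min_count forces NaN unless all nine items are present.
    return valid.sum(axis=1, min_count=len(PHQ9_ITEMS))

phq_toy = pd.DataFrame(1.0, index=range(3), columns=PHQ9_ITEMS)
phq_toy.loc[1, "DPQ020"] = 7      # refused, so the total is undefined
phq_toy.loc[2, :] = 0.0
phq_toy.loc[2, "DPQ090"] = 3.0
phq_scores = phq9_total(phq_toy)
print(phq_scores.tolist())
```

Requiring all nine items is the strictest choice; a more lenient rule would need its own justification before being used as a model feature.
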
In [None]:
print("Phase 4: Feature Engineering Complete!")
print("="*60)
print(f"\nDerived features created: {len(feature_stats)}")
print(f"\nDatasets saved to {DATA_PROCESSED}:")
print(f"  - features_engineered.parquet")
print(f"  - X_with_labs.parquet ({X_with_labs.shape})")
print(f"  - X_without_labs.parquet ({X_without_labs.shape})")
print(f"  - y_with_labs.parquet / y_without_labs.parquet")
print(f"  - feature_engineering_report.json")
print(f"\nFigures saved to {FIGURES_DIR}:")
print(f"  - derived_bp_validation.png")
print(f"  - derived_acr_validation.png")
print(f"  - derived_weight_change_validation.png")
print(f"  - derived_waist_height_validation.png")
print(f"  - derived_wake_time_validation.png")
print(f"  - derived_dietary_validation.png")
print(f"  - feature_missing_rates.png")