# Diabetes Prediction: Enhanced Feature Engineering

## Goal
Break the 0.7 AUC barrier by adding domain-specific features before training AutoGluon.

## Strategy
1.  **Load Data**
2.  **Feature Engineering**:
    *   **BMI Interactions**: BMI * Age, BMI * Waist-to-Hip Ratio
    *   **Blood Pressure**: Pulse Pressure (Systolic - Diastolic), Mean Arterial Pressure
    *   **Cholesterol Ratios**: LDL/HDL, Total/HDL
    *   **Lifestyle Interaction**: Physical Activity * Diet Score (an explicit interaction rather than a raw additive score over Diet, Exercise, Sleep, Alcohol)
3.  **Train AutoGluon**: Fit on the engineered features with the same stacked-ensemble setup as before (the `best_quality` preset, which bags and stacks models).
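
As a quick worked example of the blood pressure and cholesterol formulas listed above, the sketch below evaluates them for a single made-up patient. The numbers and variable names are illustrative assumptions, not rows from the competition data.

```python
# Hypothetical single-patient values, chosen only to illustrate the formulas.
systolic_bp, diastolic_bp = 120.0, 80.0
cholesterol_total, hdl_cholesterol, ldl_cholesterol = 200.0, 50.0, 130.0

# Pulse pressure: gap between systolic and diastolic pressure.
pulse_pressure = systolic_bp - diastolic_bp

# Mean arterial pressure: diastolic plus one third of the pulse pressure.
mean_arterial_pressure = diastolic_bp + pulse_pressure / 3

# Cholesterol ratios, both taken relative to HDL.
total_hdl_ratio = cholesterol_total / hdl_cholesterol
ldl_hdl_ratio = ldl_cholesterol / hdl_cholesterol

print(f"Pulse pressure: {pulse_pressure:.1f}")
print(f"MAP: {mean_arterial_pressure:.2f}")
print(f"Total/HDL: {total_hdl_ratio:.2f}, LDL/HDL: {ldl_hdl_ratio:.2f}")
```

For this reading the pulse pressure is 40 and the mean arterial pressure is about 93.33, which matches the formula Diastolic + 1/3 * (Systolic - Diastolic).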

In [None]:
# Install AutoGluon (Fast Version)
!pip install -U pip
!pip install -U setuptools wheel
# Fix dependency conflicts: Force numpy < 2.0 and scikit-learn < 1.6
!pip install "numpy<2.0" "scikit-learn<1.6" autogluon.tabular

In [None]:
# VERIFICATION STEP
try:
    from autogluon.tabular import TabularPredictor
    print("\n✅ Success! AutoGluon is installed and working.")
except ImportError as e:
    print(f"\n❌ Installation Failed. Error: {e}")

In [None]:
import pandas as pd
import numpy as np
from autogluon.tabular import TabularPredictor
import os

# Load Data
# Use the Kaggle input directory when present; otherwise fall back to the working directory.
kaggle_dir = '/kaggle/input/playground-series-s5e12/'
data_path = kaggle_dir if os.path.exists(f"{kaggle_dir}train.csv") else './'

train_df = pd.read_csv(f"{data_path}train.csv")
test_df = pd.read_csv(f"{data_path}test.csv")
submission_df = pd.read_csv(f"{data_path}sample_submission.csv")

# Drop ID
if 'id' in train_df.columns:
    train_df = train_df.drop(columns=['id'])
if 'id' in test_df.columns:
    test_df = test_df.drop(columns=['id'])

## Feature Engineering Function

In [None]:
def engineer_features(df):
    df = df.copy()
    
    # 1. Blood Pressure Features
    # Pulse Pressure = Systolic - Diastolic (Indicator of arterial stiffness)
    df['pulse_pressure'] = df['systolic_bp'] - df['diastolic_bp']
    # Mean Arterial Pressure (MAP) = Diastolic + 1/3(Pulse Pressure)
    df['map'] = df['diastolic_bp'] + (df['pulse_pressure'] / 3)
    
    # 2. Cholesterol Ratios (Critical for cardiovascular risk)
    # A small epsilon guards against division by zero, although HDL should always be > 0.
    df['cholesterol_ratio'] = df['cholesterol_total'] / (df['hdl_cholesterol'] + 1e-5)
    df['ldl_hdl_ratio'] = df['ldl_cholesterol'] / (df['hdl_cholesterol'] + 1e-5)
    df['non_hdl_cholesterol'] = df['cholesterol_total'] - df['hdl_cholesterol']
    
    # 3. BMI Interactions
    df['bmi_age'] = df['bmi'] * df['age']
    df['bmi_waist'] = df['bmi'] * df['waist_to_hip_ratio']
    
    # 4. Lifestyle Interaction
    # Rather than a raw additive lifestyle score, build an explicit interaction and
    # let AutoGluon decide how to use it. Assumes higher diet_score and more
    # physical activity are both better. Tree models do not need the inputs rescaled.
    df['activity_diet'] = df['physical_activity_minutes_per_week'] * df['diet_score']
    
    # 5. Risk Factors Count
    # Sum the binary (0/1) risk flags. smoking_status may be categorical, so only
    # columns that exist and are numeric are counted; AutoGluon handles the rest.
    risk_cols = ['family_history_diabetes', 'hypertension_history', 'cardiovascular_history', 'smoking_status']
    numeric_risk_cols = [c for c in risk_cols if c in df.columns and pd.api.types.is_numeric_dtype(df[c])]
    if numeric_risk_cols:
        df['risk_factor_count'] = df[numeric_risk_cols].sum(axis=1)
    
    return df

print("Engineering features...")
train_df = engineer_features(train_df)
test_df = engineer_features(test_df)

print(f"New column count: {train_df.shape[1]}")
print(train_df.head())
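
Ratio and interaction features can introduce infinite or missing values if an input column contains zeros or NaNs. A minimal sketch of a check for this is shown below; `report_non_finite` is a hypothetical helper, not part of the notebook, and the toy frame stands in for the engineered training data.

```python
import numpy as np
import pandas as pd

def report_non_finite(df, columns):
    """Return the number of NaN or infinite values in each listed column."""
    values = df[columns].astype(float).to_numpy()
    counts = (~np.isfinite(values)).sum(axis=0)
    return pd.Series(counts, index=columns)

# Toy frame standing in for the engineered training data (values are made up).
toy = pd.DataFrame({
    "cholesterol_ratio": [4.0, np.inf, 3.5],
    "bmi_age": [1200.0, 950.0, np.nan],
})
counts = report_non_finite(toy, ["cholesterol_ratio", "bmi_age"])
print(counts)
```

On the real data the same helper could be called on `train_df` and `test_df` with the list of engineered column names, assuming those columns are numeric.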

## Train AutoGluon

In [None]:
predictor = TabularPredictor(
    label='diagnosed_diabetes',
    eval_metric='roc_auc',
    problem_type='binary'
).fit(
    train_df,
    presets='best_quality',
    time_limit=3600*2,  # 2 Hours
    ag_args_fit={'num_gpus': 1}  # Assumes a GPU runtime; drop this on CPU-only machines
)

In [None]:
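preds_proba = predictor.predict_proba(test_df)
# Select the positive-class column by its label rather than a hard-coded key
positive_class_probs = preds_proba[predictor.positive_class]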

submission_df['diagnosed_diabetes'] = positive_class_probs
submission_df.to_csv('submission_diabetes_enhanced.csv', index=False)

print("Saved submission_diabetes_enhanced.csv")
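
Before uploading, it can help to confirm that the submission has the expected number of rows and that every prediction is a valid probability. The sketch below shows one way to do that; `validate_submission` is a hypothetical helper and the toy frame is made up for illustration.

```python
import pandas as pd

def validate_submission(submission, expected_rows, target_col):
    """Raise ValueError if the submission frame looks malformed."""
    if len(submission) != expected_rows:
        raise ValueError(f"Expected {expected_rows} rows, got {len(submission)}")
    probs = submission[target_col]
    if probs.isna().any():
        raise ValueError("Submission contains missing probabilities")
    if not probs.between(0.0, 1.0).all():
        raise ValueError("Probabilities must lie in [0, 1]")
    return True

# Toy submission standing in for submission_df (values are made up).
toy_submission = pd.DataFrame({"id": [1, 2, 3], "diagnosed_diabetes": [0.12, 0.87, 0.5]})
is_valid = validate_submission(toy_submission, expected_rows=3, target_col="diagnosed_diabetes")
print("Submission OK:", is_valid)
```

In the notebook this would be called with `submission_df`, `len(test_df)`, and `'diagnosed_diabetes'`.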

In [None]:
predictor.leaderboard()
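
The leaderboard scores whole models. To get a rough sense of whether a single engineered feature carries signal on its own, one option is to compute its univariate ROC AUC, treating the raw feature as the score. The sketch below uses the rank-based (Mann-Whitney U) form; `univariate_auc` is a hypothetical helper, the toy data is made up, and the function assumes a 0/1 label with no missing feature values.

```python
import pandas as pd

def univariate_auc(feature, label):
    """ROC AUC of one feature used directly as a score (Mann-Whitney U form)."""
    feature = pd.Series(feature, dtype=float).reset_index(drop=True)
    label = pd.Series(label).reset_index(drop=True).astype(int)
    n_pos = int((label == 1).sum())
    n_neg = int((label == 0).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("Both classes are required to compute AUC")
    # Average ranks handle ties; assumes the feature has no NaN values.
    ranks = feature.rank(method="average")
    rank_sum_pos = ranks[label == 1].sum()
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)

# Toy example: two negatives and two positives.
toy_feature = [0.1, 0.4, 0.35, 0.8]
toy_label = [0, 0, 1, 1]
auc = univariate_auc(toy_feature, toy_label)
print(f"Univariate AUC: {auc:.2f}")
```

A value near 0.5 suggests little standalone signal, while values well below 0.5 indicate the feature is inversely related to the label. A feature can still be useful in combination with others, so this is a screening aid rather than a selection rule.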