# Diabetes Prediction with AutoGluon

## Goal
Predict the probability of `diagnosed_diabetes` for each row. Predictions are scored by the Area Under the ROC Curve (AUC).

## Dataset
*   **Train**: 700,000 rows
*   **Target**: `diagnosed_diabetes` (Binary Classification)
*   **Metric**: `roc_auc`

## Prerequisite
Run this on **Kaggle** with **GPU P100** enabled.
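
Before installing anything, it can help to confirm that the accelerator is actually attached. The sketch below assumes the `nvidia-smi` command-line tool is on the `PATH`, as it normally is on GPU-enabled images. The helper name `gpu_names` is illustrative and not part of Kaggle or AutoGluon.

```python
import shutil
import subprocess


def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    # Assumption: nvidia-smi is only present when an NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        # The tool exists but could not query a device (for example, no GPU attached).
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


names = gpu_names()
if names:
    print("GPUs visible:", names)
else:
    print("No GPU visible. Enable the accelerator in the notebook settings.")
```

An empty list means training would fall back to CPU, which is worth knowing before starting a two-hour run.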

In [None]:
# Install AutoGluon (tabular module only, which installs faster than the full package)
!pip install -U pip
!pip install -U setuptools wheel
# Fix dependency conflicts: Force numpy < 2.0 and scikit-learn < 1.6
!pip install "numpy<2.0" "scikit-learn<1.6" autogluon.tabular

In [None]:
# VERIFICATION STEP
# Run this cell. If it prints "Success!", then IGNORE the red errors above.
try:
    from autogluon.tabular import TabularPredictor
    print("\n✅ Success! AutoGluon is installed and working. You can ignore the pip errors.")
except ImportError as e:
    print(f"\n❌ Installation Failed. Error: {e}")
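
The import check confirms that AutoGluon loads, but it does not show which versions pip actually resolved. As a rough sketch, the installed versions can be compared against the upper bounds from the install cell (`numpy<2.0`, `scikit-learn<1.6`). The helpers `release_tuple` and `check_pins` are hypothetical names, and the version parsing is deliberately simplistic: it keeps only the leading numeric components.

```python
from importlib.metadata import PackageNotFoundError, version

# Exclusive upper bounds, copied from the pip install command.
PINS = {"numpy": (2, 0), "scikit-learn": (1, 6)}


def release_tuple(version_string):
    """Keep the leading numeric components, e.g. '1.5.2.post1' becomes (1, 5, 2)."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_pins(pins=PINS):
    """Map each package name to a short status string."""
    report = {}
    for package, upper in pins.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            report[package] = "not installed"
            continue
        ok = release_tuple(installed) < upper
        report[package] = f"{installed} ({'ok' if ok else 'violates pin'})"
    return report


for package, status in check_pins().items():
    print(f"{package}: {status}")
```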

In [None]:
import pandas as pd
from autogluon.tabular import TabularPredictor
import os

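# Load data: prefer the Kaggle input directory, fall back to the working directory
if os.path.exists('/kaggle/input/playground-series-s5e12/train.csv'):
    data_path = '/kaggle/input/playground-series-s5e12/'
elif os.path.exists('train.csv'):
    data_path = './'
else:
    # Fail here with a clear message instead of a confusing read_csv error below
    raise FileNotFoundError("Data not found! Please upload the dataset.")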

train_df = pd.read_csv(f"{data_path}train.csv")
test_df = pd.read_csv(f"{data_path}test.csv")
submission_df = pd.read_csv(f"{data_path}sample_submission.csv")

print(f"Train shape: {train_df.shape}")

# Drop ID column if present (it's not a feature)
if 'id' in train_df.columns:
    train_df = train_df.drop(columns=['id'])
if 'id' in test_df.columns:
    test_df = test_df.drop(columns=['id'])

## AutoGluon Training
We use `eval_metric='roc_auc'` because that is how the competition is scored.

In [None]:
predictor = TabularPredictor(
    label='diagnosed_diabetes',
    eval_metric='roc_auc',  # CRITICAL: Optimize for AUC
    problem_type='binary'   # It's a Yes/No prediction
).fit(
    train_df,
    presets='best_quality', # Stacking & Bagging for max performance
    time_limit=3600*2,      # Run for 2 hours (adjust as needed)
    ag_args_fit={'num_gpus': 1} # Use GPU
)

## Submission
We need to predict the **probability** of the positive class (1).
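
To see why probabilities matter, recall that AUC is the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counting as half. The toy sketch below is a plain-Python pairwise implementation with made-up numbers, not the competition's scorer. It shows that thresholding probabilities into hard 0/1 labels creates ties and lowers the score.

```python
def roc_auc(y_true, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 0, 1, 1]
probs = [0.10, 0.40, 0.45, 0.60, 0.70, 0.90]
hard_labels = [1 if p >= 0.5 else 0 for p in probs]

print(f"AUC with probabilities: {roc_auc(y_true, probs):.3f}")        # 0.889
print(f"AUC with hard labels:   {roc_auc(y_true, hard_labels):.3f}")  # 0.667
```

The pairwise loop is quadratic, so it only suits small examples. For real data, a library scorer is the practical choice.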

In [None]:
# Predict class probabilities; the result has one column per class label
preds_proba = predictor.predict_proba(test_df)
# Select the positive-class column by name rather than hard-coding 1
positive_class_probs = preds_proba[predictor.positive_class]

submission_df['diagnosed_diabetes'] = positive_class_probs
submission_df.to_csv('submission_diabetes_autogluon.csv', index=False)

print("Saved submission_diabetes_autogluon.csv")
print(submission_df.head())
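
A quick sanity check on the saved file can catch problems before uploading. The sketch below uses only the standard library. `check_submission` is a hypothetical helper, and the rules it enforces are assumptions about what a valid file looks like: the target column exists, every value is a number in [0, 1], and the row count matches if one is given.

```python
import csv
import math


def check_submission(path, target="diagnosed_diabetes", expected_rows=None):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    row_count = 0
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        if target not in (reader.fieldnames or []):
            return [f"missing column {target!r}"]
        for row_count, row in enumerate(reader, start=1):
            raw = row[target]
            try:
                value = float(raw)
            except (TypeError, ValueError):
                problems.append(f"row {row_count}: not a number: {raw!r}")
                continue
            if math.isnan(value) or not 0.0 <= value <= 1.0:
                problems.append(f"row {row_count}: outside [0, 1]: {raw!r}")
    if expected_rows is not None and row_count != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {row_count}")
    return problems
```

In this notebook it could be called as `check_submission('submission_diabetes_autogluon.csv', expected_rows=len(test_df))`.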

In [None]:
# Leaderboard: validation scores for every trained model
predictor.leaderboard()