# Linear Discriminant Analysis on Pima Indians Diabetes Dataset
## Health Analytics: Handling Medical Noise and Variance

**Dataset Overview:**
- 768 female patients of Pima Indian heritage
- 8 diagnostic measurements
- Binary classification: Diabetes diagnosis
- Features: Pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, age

**Focus:** Handling medical measurement noise and variance

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.metrics import (
    classification_report, confusion_matrix, accuracy_score,
    roc_auc_score, roc_curve, precision_recall_curve, f1_score
)
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 8)
print('Libraries imported!')

## 1. Data Loading

In [None]:
try:
    df = pd.read_csv('diabetes.csv')
    print('Real Pima dataset loaded!')
except FileNotFoundError:
    print('Creating synthetic Pima-like diabetes dataset...')
    np.random.seed(42)
    n = 768
    
    pregnancies = np.random.poisson(3.5, n)
    glucose = np.random.normal(121, 30, n)
    glucose = np.clip(glucose, 0, 200)
    blood_pressure = np.random.normal(70, 12, n)
    blood_pressure = np.clip(blood_pressure, 0, 125)
    skin_thickness = np.random.normal(29, 10, n)
    skin_thickness = np.clip(skin_thickness, 0, 100)
    insulin = np.random.lognormal(4.3, 0.7, n)
    bmi = np.random.normal(32, 6, n)
    bmi = np.clip(bmi, 18, 50)
    dpf = np.random.gamma(2, 0.25, n)
    age = np.random.gamma(5, 6, n).astype(int)
    age = np.clip(age, 21, 81)
    
    # Create outcome with dependencies and noise
    risk = (
        (glucose > 140) * 0.30 +
        (bmi > 35) * 0.15 +
        (age > 45) * 0.10 +
        (dpf > 0.5) * 0.10 +
        (pregnancies > 5) * 0.05 +
        np.random.normal(0, 0.15, n)  # Medical noise
    )
    outcome = (risk > 0.35).astype(int)
    
    df = pd.DataFrame({
        'Pregnancies': pregnancies,
        'Glucose': glucose,
        'BloodPressure': blood_pressure,
        'SkinThickness': skin_thickness,
        'Insulin': insulin,
        'BMI': bmi,
        'DiabetesPedigreeFunction': dpf,
        'Age': age,
        'Outcome': outcome
    })
    print('Synthetic dataset created!')

print(f'\nDataset shape: {df.shape}')
print(f'\nDiabetes prevalence: {df["Outcome"].mean()*100:.1f}%')
print(f'\nClass distribution:')
print(df['Outcome'].value_counts())

In [None]:
print('Dataset Information:')
print(df.info())
print('\nStatistical Summary:')
display(df.describe())

# Check for zeros (common data quality issue in Pima dataset)
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print('\nZero values (potential missing data):')
for col in zero_cols:
    zero_count = (df[col] == 0).sum()
    if zero_count > 0:
        print(f'{col}: {zero_count} ({zero_count/len(df)*100:.1f}%)')

## 2. Data Quality and Noise Analysis

In [None]:
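# Zeros in these measurements are physiologically implausible, so treat them as missing.
# Replace each zero with the median of the non-zero values in the same outcome group.
# Caveat: this uses the label and all rows, so it leaks target information into the features.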
df_clean = df.copy()
for col in zero_cols:
    if (df[col] == 0).any():
        for outcome in [0, 1]:
            mask = (df['Outcome'] == outcome) & (df[col] > 0)
            median_val = df.loc[mask, col].median()
            mask_zero = (df_clean['Outcome'] == outcome) & (df_clean[col] == 0)
            df_clean.loc[mask_zero, col] = median_val
        print(f'Imputed {col}')

print('\nData cleaning complete!')
print(f'Dataset after cleaning: {df_clean.shape}')
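
The group-wise median above depends on `Outcome`, which is not available for a new patient, and it is computed before the train/test split. One possible alternative, sketched here on a small hypothetical frame (`toy` and its values are made up for illustration, not taken from the dataset), is to mark zeros as `NaN` and fit scikit-learn's `SimpleImputer` on the training rows only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for df; zeros mark missing readings.
toy = pd.DataFrame({
    'Glucose': [148.0, 85.0, 0.0, 89.0, 137.0, 116.0, 78.0, 0.0],
    'BMI': [33.6, 26.6, 23.3, 0.0, 43.1, 25.6, 31.0, 35.3],
    'Outcome': [1, 0, 1, 0, 1, 0, 1, 0],
})
cols = ['Glucose', 'BMI']

# Recode zeros as NaN so the imputer recognises them as missing.
X_toy = toy[cols].replace(0, np.nan)
y_toy = toy['Outcome']

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# Medians are learned from training rows only and never look at the label.
imputer = SimpleImputer(strategy='median')
X_tr_imp = imputer.fit_transform(X_tr)
X_te_imp = imputer.transform(X_te)

print('Training medians:', dict(zip(cols, imputer.statistics_)))
```

The trade-off is that label-free medians are less tailored to each class, but the same transformation can be applied unchanged at prediction time.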

In [None]:
# Visualize distributions and outliers
fig, axes = plt.subplots(3, 3, figsize=(18, 15))
axes = axes.ravel()

features = df_clean.columns[:-1]

for idx, feature in enumerate(features):
    for outcome in [0, 1]:
        data = df_clean[df_clean['Outcome'] == outcome][feature]
        axes[idx].hist(data, alpha=0.6, label=['No Diabetes', 'Diabetes'][outcome], bins=25)
    axes[idx].set_title(f'{feature} Distribution')
    axes[idx].set_xlabel(feature)
    axes[idx].set_ylabel('Frequency')
    axes[idx].legend()

axes[-1].axis('off')
plt.tight_layout()
plt.show()

In [None]:
# Outlier detection
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.ravel()

for idx, feature in enumerate(features):
    df_clean.boxplot(column=feature, by='Outcome', ax=axes[idx])
    axes[idx].set_title(f'{feature} by Outcome')
    axes[idx].set_xlabel('Outcome (0=No, 1=Yes)')
    
    # Calculate outliers
    Q1 = df_clean[feature].quantile(0.25)
    Q3 = df_clean[feature].quantile(0.75)
    IQR = Q3 - Q1
    outliers = ((df_clean[feature] < Q1 - 1.5*IQR) | (df_clean[feature] > Q3 + 1.5*IQR)).sum()
    axes[idx].text(0.5, 0.95, f'Outliers: {outliers}', transform=axes[idx].transAxes, 
                  ha='center', va='top', bbox=dict(boxstyle='round', facecolor='wheat'))

plt.suptitle('Box Plots with Outlier Detection')
plt.tight_layout()
plt.show()

In [None]:
# Correlation analysis
plt.figure(figsize=(12, 10))
corr = df_clean.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, square=True, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

print('\nHighest correlations with Outcome:')
outcome_corr = corr['Outcome'].sort_values(ascending=False)
print(outcome_corr)

## 3. Data Preparation with Robust Scaling

In [None]:
# Prepare data
X = df_clean.drop('Outcome', axis=1).values
y = df_clean['Outcome'].values
feature_names = df_clean.columns[:-1].tolist()

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(f'Training: {X_train.shape[0]} samples')
print(f'Test: {X_test.shape[0]} samples')
print(f'Features: {X_train.shape[1]}')

In [None]:
# Compare StandardScaler vs RobustScaler (for handling outliers)
scaler_standard = StandardScaler()
scaler_robust = RobustScaler()

X_train_standard = scaler_standard.fit_transform(X_train)
X_test_standard = scaler_standard.transform(X_test)

X_train_robust = scaler_robust.fit_transform(X_train)
X_test_robust = scaler_robust.transform(X_test)

print('Both scaling methods applied!')
print('\nStandard Scaler: Uses mean and standard deviation')
print('Robust Scaler: Uses median and IQR (less sensitive to outliers)')
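
To make the difference between the two scalers concrete, the sketch below reproduces both transformations by hand on a single hypothetical column with one extreme reading (the values are illustrative only). It assumes the default settings of both scalers: population standard deviation for `StandardScaler` and the 25th to 75th percentile range for `RobustScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Hypothetical insulin-like column with one extreme reading.
x = np.array([[80.0], [90.0], [100.0], [110.0], [120.0], [600.0]])

# StandardScaler: subtract the mean, divide by the population std (ddof=0).
manual_standard = (x - x.mean()) / x.std()

# RobustScaler (defaults): subtract the median, divide by the IQR.
q1, med, q3 = np.percentile(x, [25, 50, 75])
manual_robust = (x - med) / (q3 - q1)

scaled_standard = StandardScaler().fit_transform(x)
scaled_robust = RobustScaler().fit_transform(x)

print('Standard:', scaled_standard.ravel().round(2))
print('Robust:  ', scaled_robust.ravel().round(2))
```

With the standard scaler the extreme value inflates the standard deviation and squeezes the typical readings together; with the robust scaler the typical readings keep their spread and the extreme value simply lands far from zero.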

## 4. LDA with Different Scaling Approaches

In [None]:
# LDA with Standard Scaling
lda_standard = LinearDiscriminantAnalysis()
lda_standard.fit(X_train_standard, y_train)
y_pred_standard = lda_standard.predict(X_test_standard)
y_proba_standard = lda_standard.predict_proba(X_test_standard)[:, 1]

print('LDA WITH STANDARD SCALING')
print('='*70)
print(f'Accuracy: {accuracy_score(y_test, y_pred_standard):.4f}')
print(f'ROC-AUC: {roc_auc_score(y_test, y_proba_standard):.4f}')
print(f'F1-Score: {f1_score(y_test, y_pred_standard):.4f}')

In [None]:
# LDA with Robust Scaling
lda_robust = LinearDiscriminantAnalysis()
lda_robust.fit(X_train_robust, y_train)
y_pred_robust = lda_robust.predict(X_test_robust)
y_proba_robust = lda_robust.predict_proba(X_test_robust)[:, 1]

print('\nLDA WITH ROBUST SCALING')
print('='*70)
print(f'Accuracy: {accuracy_score(y_test, y_pred_robust):.4f}')
print(f'ROC-AUC: {roc_auc_score(y_test, y_proba_robust):.4f}')
print(f'F1-Score: {f1_score(y_test, y_pred_robust):.4f}')

In [None]:
# Comparison
print('\nSCALING METHOD COMPARISON')
print('='*70)
print(f"{'Metric':<20} {'Standard':>15} {'Robust':>15} {'Difference':>15}")
print('-'*70)

metrics = [
    ('Accuracy', accuracy_score(y_test, y_pred_standard), accuracy_score(y_test, y_pred_robust)),
    ('ROC-AUC', roc_auc_score(y_test, y_proba_standard), roc_auc_score(y_test, y_proba_robust)),
    ('F1-Score', f1_score(y_test, y_pred_standard), f1_score(y_test, y_pred_robust))
]

for name, std_val, rob_val in metrics:
    diff = rob_val - std_val
    print(f'{name:<20} {std_val:>15.4f} {rob_val:>15.4f} {diff:>+15.4f}')

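# Ties go to Robust, matching the >= rule used to pick the final model in the next section
better_method = 'Robust' if roc_auc_score(y_test, y_proba_robust) >= roc_auc_score(y_test, y_proba_standard) else 'Standard'
print(f'\n→ {better_method} scaling scores at least as well on this test split')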
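
When reading the comparison table, it helps to know that LDA without shrinkage is invariant to per-feature shifts and rescalings: both scalers apply such a transformation, so the fitted decision rule should be the same up to floating-point error. The sketch below checks this on synthetic data (the feature scales, class shift, and injected outliers are arbitrary assumptions, not properties of the Pima data):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler, RobustScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic two-class data with very different feature scales.
y_demo = rng.integers(0, 2, n)
scales = np.array([30.0, 12.0, 6.0, 0.3])
offsets = np.array([120.0, 70.0, 32.0, 0.5])
X_demo = rng.normal(size=(n, 4)) * scales + offsets
X_demo[:, 0] += 25.0 * y_demo      # class signal in the first feature
X_demo[:10, 1] *= 4.0              # a handful of extreme readings

# Fit default LDA on each scaled version and compare predicted probabilities.
proba = {}
for name, scaler in [('standard', StandardScaler()), ('robust', RobustScaler())]:
    X_scaled = scaler.fit_transform(X_demo)
    lda = LinearDiscriminantAnalysis().fit(X_scaled, y_demo)
    proba[name] = lda.predict_proba(X_scaled)[:, 1]

max_gap = np.max(np.abs(proba['standard'] - proba['robust']))
print(f'Largest probability gap between scalers: {max_gap:.2e}')
```

A non-zero difference in the table is therefore more likely to reflect numerical noise than a real advantage of one scaler. Scaling starts to matter once a scale-sensitive step is added, for example shrinkage (`solver='lsqr'` with `shrinkage='auto'`).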

## 5. Model Evaluation and Diagnostics

In [None]:
# Use the better-performing model
# Note: choosing between models on the test set makes the reported test metrics optimistic
if roc_auc_score(y_test, y_proba_robust) >= roc_auc_score(y_test, y_proba_standard):
    lda_final = lda_robust
    y_pred_final = y_pred_robust
    y_proba_final = y_proba_robust
    scaling_method = 'Robust'
else:
    lda_final = lda_standard
    y_pred_final = y_pred_standard
    y_proba_final = y_proba_standard
    scaling_method = 'Standard'

print(f'Using LDA with {scaling_method} Scaling for final analysis')
print('\nClassification Report:')
print(classification_report(y_test, y_pred_final, target_names=['No Diabetes', 'Diabetes']))

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_final)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Raw counts
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
axes[0].set_title('Confusion Matrix')
axes[0].set_ylabel('True Label')
axes[0].set_xlabel('Predicted Label')

# Normalized
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_norm, annot=True, fmt='.2%', cmap='Blues', ax=axes[1],
            xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
axes[1].set_title('Normalized Confusion Matrix')
axes[1].set_ylabel('True Label')
axes[1].set_xlabel('Predicted Label')

plt.tight_layout()
plt.show()

tn, fp, fn, tp = cm.ravel()
print(f'\nDiagnostic Performance:')
print(f'Sensitivity (Recall): {tp/(tp+fn):.3f}')
print(f'Specificity: {tn/(tn+fp):.3f}')
print(f'Positive Predictive Value: {tp/(tp+fp):.3f}')
print(f'Negative Predictive Value: {tn/(tn+fn):.3f}')

In [None]:
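# Cross-validation on the training set
# Note: the scaler was fit on all training rows, so the folds share scaling statistics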
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
if scaling_method == 'Robust':
    cv_scores = cross_val_score(lda_robust, X_train_robust, y_train, cv=cv, scoring='roc_auc')
else:
    cv_scores = cross_val_score(lda_standard, X_train_standard, y_train, cv=cv, scoring='roc_auc')

print('\nCross-Validation Results (ROC-AUC):')
print(f'Scores: {cv_scores}')
print(f'Mean: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}')

plt.figure(figsize=(10, 6))
plt.plot(range(1, 6), cv_scores, 'bo-', linewidth=2, markersize=10)
plt.axhline(y=cv_scores.mean(), color='r', linestyle='--', label=f'Mean: {cv_scores.mean():.4f}')
plt.fill_between(range(1, 6), cv_scores.mean()-cv_scores.std(), 
                 cv_scores.mean()+cv_scores.std(), alpha=0.2)
plt.xlabel('Fold')
plt.ylabel('ROC-AUC Score')
plt.title('5-Fold Cross-Validation Performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
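
A common way to keep preprocessing inside each fold is to wrap it in a `Pipeline`, so that imputation and scaling statistics are re-estimated on the training part of every split. The following is a minimal sketch on synthetic data from `make_classification` (the sample size, missing-value rate, and variable names are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the 8-feature training data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

# Knock out about 5% of the entries to mimic missing measurements.
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Imputer and scaler are refit on the training part of every fold.
pipe = make_pipeline(
    SimpleImputer(strategy='median'),
    RobustScaler(),
    LinearDiscriminantAnalysis(),
)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv_demo, scoring='roc_auc')

print(f'Fold scores: {scores.round(4)}')
print(f'Mean ROC-AUC: {scores.mean():.4f} ± {scores.std():.4f}')
```

The same pipeline object can then be fit once on the full training set and applied to the test set, which keeps the cross-validated estimate and the final model consistent.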

## 6. Feature Importance and Medical Interpretation

In [None]:
# Feature coefficients
coefs = lda_final.coef_[0]
feature_importance = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefs,
    'Abs_Coefficient': np.abs(coefs)
}).sort_values('Abs_Coefficient', ascending=False)

print('Feature Importance:')
display(feature_importance)

plt.figure(figsize=(12, 8))
colors = ['red' if x < 0 else 'blue' for x in feature_importance['Coefficient'].values]
plt.barh(range(len(feature_importance)), feature_importance['Coefficient'].values[::-1], 
         color=colors[::-1], alpha=0.7)
plt.yticks(range(len(feature_importance)), feature_importance['Feature'].values[::-1])
plt.xlabel('LDA Coefficient')
plt.title('Feature Importance for Diabetes Prediction')
plt.axvline(x=0, color='black', linestyle='--')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

## 7. ROC Analysis and Threshold Selection

In [None]:
# ROC and PR curves
fpr, tpr, thresholds = roc_curve(y_test, y_proba_final)
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_proba_final)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

axes[0].plot(fpr, tpr, linewidth=2, label=f'ROC (AUC={roc_auc_score(y_test, y_proba_final):.3f})')
axes[0].plot([0, 1], [0, 1], 'k--', label='Random')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].plot(recall, precision, linewidth=2)
axes[1].set_xlabel('Recall (Sensitivity)')
axes[1].set_ylabel('Precision (PPV)')
axes[1].set_title('Precision-Recall Curve')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
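
The curves above show the trade-off but do not pick an operating point. Two simple rules are sketched below on a small hypothetical set of labels and scores (not model output): Youden's J statistic, which maximises sensitivity + specificity - 1, and a screening-style rule that takes the highest threshold reaching a target sensitivity. The target of 0.75 is an arbitrary assumption.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted probabilities for ten patients.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores_demo = np.array([0.05, 0.10, 0.20, 0.30, 0.45, 0.60,
                        0.40, 0.55, 0.70, 0.90])

fpr_d, tpr_d, thr_d = roc_curve(
    y_true_demo, scores_demo, drop_intermediate=False
)

# Rule 1: Youden's J = sensitivity + specificity - 1 = TPR - FPR.
j_stat = tpr_d - fpr_d
best = int(np.argmax(j_stat))
youden_threshold = thr_d[best]

# Rule 2: highest threshold whose sensitivity reaches the target.
# TPR is non-decreasing along the arrays, so the first hit has the lowest FPR.
target_sensitivity = 0.75
first_hit = int(np.argmax(tpr_d >= target_sensitivity))
screening_threshold = thr_d[first_hit]

print(f'Youden threshold: {youden_threshold:.2f} '
      f'(sensitivity={tpr_d[best]:.2f}, specificity={1 - fpr_d[best]:.2f})')
print(f'Threshold for sensitivity >= {target_sensitivity}: '
      f'{screening_threshold:.2f}')
```

Predictions at a chosen threshold follow from `(y_proba_final >= threshold).astype(int)`. To keep the test metrics honest, the threshold would ideally be chosen on validation data rather than on the test set.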

## 8. Comparison with QDA

In [None]:
# Train QDA on the same scaled features used by the final LDA model
if scaling_method == 'Robust':
    X_train_final, X_test_final = X_train_robust, X_test_robust
else:
    X_train_final, X_test_final = X_train_standard, X_test_standard

qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train_final, y_train)
y_pred_qda = qda.predict(X_test_final)
y_proba_qda = qda.predict_proba(X_test_final)[:, 1]

print('COMPARISON: LDA vs QDA')
print('='*70)
print(f"{'Metric':<20} {'LDA':>15} {'QDA':>15} {'Better':>15}")
print('-'*70)

comparisons = [
    ('Accuracy', accuracy_score(y_test, y_pred_final), accuracy_score(y_test, y_pred_qda)),
    ('ROC-AUC', roc_auc_score(y_test, y_proba_final), roc_auc_score(y_test, y_proba_qda)),
    ('F1-Score', f1_score(y_test, y_pred_final), f1_score(y_test, y_pred_qda))
]

for name, lda_val, qda_val in comparisons:
    better = 'LDA' if lda_val > qda_val else 'QDA' if qda_val > lda_val else 'Tie'
    print(f'{name:<20} {lda_val:>15.4f} {qda_val:>15.4f} {better:>15}')

## 9. Key Insights and Medical Recommendations

In [None]:
print('\n' + '='*70)
print('KEY INSIGHTS: DIABETES PREDICTION WITH LDA')
print('='*70)

print('\n1. DATA QUALITY MATTERS:')
print(f'   - Handled missing values (zeros) via imputation')
print(f'   - {scaling_method} scaling scored at least as well on the test split')
print('   - Medical measurements contain significant variance')

print('\n2. PREDICTIVE PERFORMANCE:')
print(f'   - ROC-AUC: {roc_auc_score(y_test, y_proba_final):.3f}')
print(f'   - Sensitivity: {tp/(tp+fn):.2%} (catch rate for diabetes)')
print(f'   - Specificity: {tn/(tn+fp):.2%}')
print(f'   - CV Score: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}')

print('\n3. TOP DIAGNOSTIC INDICATORS:')
for i, row in feature_importance.head(3).iterrows():
    print(f'   - {row["Feature"]}: coefficient = {row["Coefficient"]:.4f}')

print('\n4. MODEL SELECTION:')
lda_auc = roc_auc_score(y_test, y_proba_final)
qda_auc = roc_auc_score(y_test, y_proba_qda)
if lda_auc > qda_auc:
    print('   - LDA performs better (consistent with similar covariances across classes)')
else:
    print('   - QDA performs at least as well (consistent with class-specific covariances)')
print(f'   - Difference: {abs(lda_auc - qda_auc):.4f}')

print('\n5. CLINICAL RECOMMENDATIONS:')
print('   - Use robust preprocessing for medical data')
print('   - Monitor top predictive features in clinical practice')
print('   - Adjust decision threshold based on screening vs diagnosis context')
print('   - Consider ensemble with other methods for production use')

print('\n' + '='*70)

## Summary

### What We Learned:

1. **Handling Medical Data Quality**: Treated zeros in physiological measurements as missing values, imputed them with group medians, and counted outliers with the 1.5×IQR rule
2. **Robust Preprocessing**: Compared standard vs robust scaling for noise resilience
3. **Model Evaluation**: Used appropriate metrics for medical diagnostics
4. **Feature Interpretation**: Identified key clinical indicators
5. **Model Comparison**: Evaluated LDA vs QDA for this specific dataset

### Why Pima Diabetes is Good for LDA:
- Real medical measurements with inherent noise
- Missing values (zeros) teach data preprocessing
- Moderate class imbalance (realistic for screening)
- Well-studied benchmark for health analytics
- Demonstrates importance of robust methods
- Clear clinical interpretation of features

### Key Takeaways:
- Data quality preprocessing is crucial for medical ML
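- Robust scaling limits the influence of outliers on the scaled features, but LDA without shrinkage is invariant to per-feature rescaling, so expect near-identical scores from both scalers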
- Cross-validation provides reliable performance estimates
- LDA coefficients on scaled features indicate which measurements drive the prediction and can inform clinical review
- Balance sensitivity vs specificity based on use case

### Next Steps:
- Try other imputation methods (KNN, iterative)
- Experiment with feature engineering
- Ensemble LDA with tree-based models
- Cost-sensitive learning for asymmetric errors
- Temporal validation if follow-up data available