# XGBoost for Pathology Prediction

## Overview
This notebook implements XGBoost (Extreme Gradient Boosting) for multi-class pathology prediction using the DDxPlus dataset.

## Objectives
- Load preprocessed training, validation, and test data
- Train an XGBoost model for pathology classification
- Evaluate model performance with comprehensive metrics
- Visualize feature importance and predictions


In [2]:
pip install xgboost




In [3]:
pip install pandas numpy matplotlib seaborn scikit-learn 


## 1. Setup and Imports


In [5]:
import pandas as pd
import numpy as np
import pickle
import joblib
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import xgboost as xgb

# Set style for plots
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✓ Imports successful")
print(f"XGBoost version: {xgb.__version__}")


✓ Imports successful
XGBoost version: 3.1.1


## 2. Load Preprocessed Data


In [None]:
# Load preprocessed datasets from stratified directory (features already prepared)
train_df = pd.read_csv('../../DDxPlus Dataset/preprocessed_stratified/train_preprocessed.csv')
val_df = pd.read_csv('../../DDxPlus Dataset/preprocessed_stratified/validation_preprocessed.csv')
test_df = pd.read_csv('../../DDxPlus Dataset/preprocessed_stratified/test_preprocessed.csv')

# Load label encoder
with open('../../DDxPlus Dataset/pkl files/label_encoder.pkl', 'rb') as f:
    label_encoder = pickle.load(f)

print("✓ Data loaded successfully from preprocessed_stratified directory")
print(f"\nDataset shapes:")
print(f"  Training:   {train_df.shape}")
print(f"  Validation: {val_df.shape}")
print(f"  Test:       {test_df.shape}")
print(f"\nNumber of pathologies: {len(label_encoder.classes_)}")


UnpicklingError: invalid load key, 'v'.
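
One common cause of `invalid load key, 'v'` is that the `.pkl` file on disk is a Git LFS pointer (a small text file beginning with `version https://git-lfs...`) rather than the real binary. Whether that is the cause here is not confirmed. The sketch below shows one way to check; `is_git_lfs_pointer` is a hypothetical helper, and the temporary files only stand in for `label_encoder.pkl`.

```python
import pickle
import tempfile
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"

def is_git_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as f:
        return f.read(len(LFS_HEADER)) == LFS_HEADER

# Demonstrate on two throwaway files: a fake pointer and a real pickle.
with tempfile.TemporaryDirectory() as tmp:
    pointer = Path(tmp) / "pointer.pkl"
    pointer.write_bytes(LFS_HEADER + b"\noid sha256:0000\nsize 123\n")

    real = Path(tmp) / "real.pkl"
    real.write_bytes(pickle.dumps({"ok": True}))

    pointer_result = is_git_lfs_pointer(pointer)
    real_result = is_git_lfs_pointer(real)

print(f"pointer file detected: {pointer_result}, real pickle detected: {real_result}")
```

If the check returns `True` for the real file, fetching the LFS objects (for example with `git lfs pull`) and rerunning the cell would be the usual next step.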

## 3. Prepare Features and Labels


In [None]:
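# Identify feature columns (exclude the target, its encoded form, and other non-feature columns).
# PATHOLOGY_ENCODED must be excluded, otherwise the label leaks into the features.
exclude_cols = ['PATHOLOGY', 'PATHOLOGY_ENCODED', 'EVIDENCES', 'DIFFERENTIAL_DIAGNOSIS']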
feature_cols = [col for col in train_df.columns if col not in exclude_cols]

print(f"Total features: {len(feature_cols)}")
print(f"\nFeature categories:")
print(f"  - Demographics: {[col for col in feature_cols if col in ['AGE', 'SEX_ENCODED']]}")
print(f"  - Evidence features: {len([col for col in feature_cols if col.startswith('evidence_')])}")
print(f"  - Other features: {len([col for col in feature_cols if col not in ['AGE', 'SEX_ENCODED'] and not col.startswith('evidence_')])}")

# Separate features and labels
X_train = train_df[feature_cols]
y_train = train_df['PATHOLOGY_ENCODED']

X_val = val_df[feature_cols]
y_val = val_df['PATHOLOGY_ENCODED']

X_test = test_df[feature_cols]
y_test = test_df['PATHOLOGY_ENCODED']

print(f"\n✓ Data prepared:")
print(f"  X_train shape: {X_train.shape}")
print(f"  X_val shape: {X_val.shape}")
print(f"  X_test shape: {X_test.shape}")
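
Because the feature list is built by exclusion, a quick guard against label leakage is cheap insurance. The sketch below uses a tiny made-up frame; `evidence_fever` and all values are placeholders, and only the column-naming pattern follows the cell above.

```python
import pandas as pd

# Tiny stand-in frame; values and the evidence column name are made up.
df = pd.DataFrame({
    "AGE": [34, 61, 8],
    "SEX_ENCODED": [0, 1, 1],
    "evidence_fever": [1, 0, 1],
    "PATHOLOGY": ["A", "B", "A"],
    "PATHOLOGY_ENCODED": [0, 1, 0],
})

target_col = "PATHOLOGY_ENCODED"
non_feature_cols = {"PATHOLOGY", target_col}
feature_cols = [c for c in df.columns if c not in non_feature_cols]

# Guard: the target must never appear among the features.
assert target_col not in feature_cols

# Guard: list any non-numeric feature columns before they reach the model.
non_numeric = df[feature_cols].select_dtypes(exclude="number").columns.tolist()
print(f"features: {feature_cols}, non-numeric: {non_numeric}")
```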


## 4. XGBoost Model Configuration


In [None]:
# Number of classes
num_classes = len(label_encoder.classes_)

# XGBoost parameters
xgb_params = {
    'objective': 'multi:softprob',  # Multi-class classification with probability
    'num_class': num_classes,
    'eval_metric': 'mlogloss',
    'max_depth': 6,
    'learning_rate': 0.1,
    'n_estimators': 200,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 3,
    'gamma': 0.1,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'random_state': 42,
    'n_jobs': -1,
    'tree_method': 'hist'  # Faster training
}

print("XGBoost Configuration:")
for key, value in xgb_params.items():
    print(f"  {key}: {value}")
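
With `multi:softprob`, the model produces one probability per class for each sample, and the predicted class is the one with the highest probability. The NumPy sketch below illustrates that mapping on made-up scores; it is not XGBoost's internal code.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up raw scores for 2 samples and 3 classes.
raw_scores = np.array([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])

proba = softmax(raw_scores)       # each row sums to 1
predicted = proba.argmax(axis=1)  # index of the most probable class per sample
print(proba.round(3), predicted)
```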


## 5. Train XGBoost Model


In [None]:
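# Create XGBoost classifier.
# early_stopping_rounds is passed to the constructor so that early stopping actually takes effect;
# 20 rounds is an assumed patience value, not one taken from the original run.
xgb_model = xgb.XGBClassifier(**xgb_params, early_stopping_rounds=20)

print("Training XGBoost model...")
print("This may take several minutes depending on your system...")

# Train the model with early stopping, monitored on the last eval_set entry (the validation set)
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_val, y_val)],
    verbose=True
)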

print("\n✓ Model training completed!")
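
Early stopping follows a patience rule: training halts once the validation metric has gone a fixed number of rounds without improving. The pure-Python sketch below illustrates that rule on made-up loss values; `best_iteration` is a hypothetical helper, not part of XGBoost.

```python
def best_iteration(val_losses, patience):
    """Return (best_index, stop_index) for patience-based early stopping.

    Training stops at the first round where the validation loss has not
    improved for `patience` consecutive rounds; otherwise it runs to the end.
    """
    best_idx = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, len(val_losses) - 1

# Made-up validation losses, one per boosting round.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.59, 0.63]

short_patience = best_iteration(losses, patience=2)
long_patience = best_iteration(losses, patience=3)
print(short_patience, long_patience)
```

With a patience of 2 the loop stops before reaching the later, lower loss at round 5, while a patience of 3 survives long enough to find it. This is the trade-off to keep in mind when choosing the patience value.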


## 6. Model Evaluation


In [None]:
# Make predictions on validation and test sets
y_val_pred = xgb_model.predict(X_val)
y_test_pred = xgb_model.predict(X_test)

# Calculate metrics
val_accuracy = accuracy_score(y_val, y_val_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

val_f1 = f1_score(y_val, y_val_pred, average='weighted')
test_f1 = f1_score(y_test, y_test_pred, average='weighted')

print("=" * 80)
print("MODEL PERFORMANCE METRICS")
print("=" * 80)

print(f"\n📊 Validation Set:")
print(f"  Accuracy: {val_accuracy:.4f} ({val_accuracy*100:.2f}%)")
print(f"  F1-Score: {val_f1:.4f}")

print(f"\n📊 Test Set:")
print(f"  Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")
print(f"  F1-Score: {test_f1:.4f}")

print("\n" + "=" * 80)
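
The weighted F1 used above averages per-class F1 by class frequency, so it can look healthy even when rare pathologies are never predicted. The toy example below (made-up labels) contrasts it with macro F1, which weights every class equally.

```python
from sklearn.metrics import f1_score

# Imbalanced toy labels: the minority class (1) is never predicted.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"weighted F1: {weighted:.3f}, macro F1: {macro:.3f}")
```

Reporting both averages side by side is one way to spot weak performance on uncommon classes.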


In [None]:
# Detailed classification report for test set
print("\nDetailed Classification Report (Test Set):")
print("=" * 80)
report = classification_report(y_test, y_test_pred, 
                              target_names=label_encoder.classes_,
                              output_dict=True)

# Print report for a few pathologies
print("\nTop 10 Most Common Pathologies:")
pathology_counts = test_df['PATHOLOGY'].value_counts()
for patho in pathology_counts.head(10).index:
    # With target_names set, the report dict is keyed by pathology name, not by encoded label
    if patho in report:
        print(f"\n{patho}:")
        print(f"  Precision: {report[patho]['precision']:.4f}")
        print(f"  Recall:    {report[patho]['recall']:.4f}")
        print(f"  F1-Score:  {report[patho]['f1-score']:.4f}")
        print(f"  Support:   {int(report[patho]['support'])}")
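
When `classification_report` is called with both `target_names` and `output_dict=True`, the per-class entries are keyed by the names passed in, not by the integer labels. The small example below uses placeholder class names and made-up labels to show the key layout.

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 2]
y_pred = [0, 1, 0, 2]
class_names = ["Anemia", "Bronchitis", "GERD"]  # placeholder names

report = classification_report(
    y_true, y_pred,
    labels=[0, 1, 2],          # fixes the label order that target_names maps onto
    target_names=class_names,
    output_dict=True,
    zero_division=0,
)

# Per-class entries sit alongside summary keys such as 'macro avg'.
per_class_keys = [k for k in report if k in class_names]
print(per_class_keys, report["Bronchitis"]["recall"])
```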


## 8. Confusion Matrix


In [None]:
# Create confusion matrix for test set
cm = confusion_matrix(y_test, y_test_pred)

# Plot confusion matrix
plt.figure(figsize=(14, 12))
sns.heatmap(cm, fmt='d', cmap='Blues', cbar=True,
            xticklabels=False, yticklabels=False)
plt.title('Confusion Matrix - Test Set', fontsize=16, fontweight='bold')
plt.ylabel('True Pathology', fontsize=12)
plt.xlabel('Predicted Pathology', fontsize=12)
plt.tight_layout()
plt.show()

# Calculate per-class accuracy
class_accuracies = cm.diagonal() / cm.sum(axis=1)
print(f"\nBest performing pathology: {class_accuracies.max():.2%}")
print(f"Worst performing pathology: {class_accuracies.min():.2%}")
print(f"Mean per-class accuracy: {class_accuracies.mean():.2%}")
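
The per-class figure computed above is the diagonal of the confusion matrix divided by each row total. A row total of zero (a class that is predicted but never appears in the true labels) would trigger a division by zero. The sketch below, on a made-up 3-class matrix, shows one way to guard against that.

```python
import numpy as np

cm = np.array([[5, 1, 0],
               [2, 3, 0],
               [0, 0, 0]])  # third class has no true samples

row_totals = cm.sum(axis=1)

# Divide only where the row total is positive; leave NaN elsewhere.
per_class = np.divide(
    cm.diagonal(), row_totals,
    out=np.full(len(cm), np.nan),
    where=row_totals > 0,
)

# nanmean ignores the classes that had no true samples.
mean_per_class = np.nanmean(per_class)
print(per_class, mean_per_class)
```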


## 9. Save Model


In [None]:
# Save the trained model
joblib.dump(xgb_model, 'xgboost_pathology_model.pkl')
print("✓ Model saved as 'xgboost_pathology_model.pkl'")

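# Save feature importance.
# The table is built here from the fitted model because no earlier cell defines
# `feature_importance`; the column names 'feature' and 'importance' are assumed.
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False)
feature_importance.to_csv('xgboost_feature_importance.csv', index=False)
print("✓ Feature importance saved as 'xgboost_feature_importance.csv'")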

# Save evaluation metrics
metrics = {
    'val_accuracy': val_accuracy,
    'test_accuracy': test_accuracy,
    'val_f1': val_f1,
    'test_f1': test_f1
}

with open('xgboost_metrics.pkl', 'wb') as f:
    pickle.dump(metrics, f)
print("✓ Metrics saved as 'xgboost_metrics.pkl'")


## Summary

This notebook applies XGBoost to pathology prediction on the DDxPlus dataset. The pipeline covers:

- **Feature Engineering**: Using demographics, evidence counts, and binary evidence features
- **XGBoost Configuration**: Fixed hyperparameters for multi-class classification (`multi:softprob` objective, `hist` tree method)
- **Evaluation**: Accuracy, weighted F1-score, and per-class performance on the validation and test sets
- **Analysis**: Feature importance and confusion matrix visualization

### Saved Outputs
- Model saved as `xgboost_pathology_model.pkl`
- Feature importance saved as `xgboost_feature_importance.csv`
- Evaluation metrics saved as `xgboost_metrics.pkl`
