# XGBoost pathology prediction (stratified preprocessed data)

This notebook trains an XGBoost classifier to predict the pathology label (the `PATHOLOGY` column, integer-encoded as `PATHOLOGY_ENCODED`) using stratified, preprocessed datasets located in:

`DDxPlus Dataset/preprocessed_stratified/`

It includes: EDA, preprocessing pipelines, Stratified K-Fold training with early stopping, hyperparameter search, evaluation, SHAP explanations, and model saving.

How to run:
- Create a Python virtual environment, then uncomment and run the install cell below to install the requirements.
- Ensure the CSVs are present at `DDxPlus Dataset/preprocessed_stratified/train_preprocessed.csv`, `validation_preprocessed.csv`, `test_preprocessed.csv`.

Outputs:
- `models/xgb_pathology_stratified.joblib`
- `outputs/test_predictions.csv`

---

Notebook generated programmatically.

## Install required packages

If you don't already have the required packages installed in your Python environment, run the cell below. In many environments the `%pip` magic is preferred inside notebooks (it ensures the kernel environment is used).

Run this cell only once and skip if packages are already installed.

In [5]:
# Uncomment and run this if you need to install packages in the notebook kernel
# %pip install -q pandas numpy scikit-learn xgboost shap matplotlib seaborn joblib

# If you prefer to install from the terminal, run the same pip command there.

In [6]:
# Import libraries
import os
import warnings
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedKFold, train_test_split, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             classification_report, confusion_matrix, roc_auc_score,
                             log_loss, precision_recall_curve)

import xgboost as xgb
from xgboost import XGBClassifier
import joblib

# optional
try:
    import shap
except Exception:
    shap = None

warnings.filterwarnings('ignore')

print('libraries imported')

libraries imported


In [7]:
# Paths to stratified preprocessed data
BASE = Path('/Users/denizcii/Desktop/sdp-chronicles')
DATA_DIR = BASE / 'DDxPlus Dataset' / 'preprocessed_stratified'

train_path = DATA_DIR / 'train_preprocessed.csv'
val_path = DATA_DIR / 'validation_preprocessed.csv'
test_path = DATA_DIR / 'test_preprocessed.csv'

print('Looking for files:')
print(train_path.exists(), train_path)
print(val_path.exists(), val_path)
print(test_path.exists(), test_path)

# Load dataframes if present
train_df = pd.read_csv(train_path) if train_path.exists() else None
val_df = pd.read_csv(val_path) if val_path.exists() else None
test_df = pd.read_csv(test_path) if test_path.exists() else None

# Show quick heads
for name, df in [('train', train_df), ('val', val_df), ('test', test_df)]:
    if df is None:
        print(f'{name} dataframe not found')
    else:
        print(f"{name} shape:", df.shape)
        display(df.head(3))

Looking for files:
True /Users/denizcii/Desktop/sdp-chronicles/DDxPlus Dataset/preprocessed_stratified/train_preprocessed.csv
True /Users/denizcii/Desktop/sdp-chronicles/DDxPlus Dataset/preprocessed_stratified/validation_preprocessed.csv
True /Users/denizcii/Desktop/sdp-chronicles/DDxPlus Dataset/preprocessed_stratified/test_preprocessed.csv
train shape: (936888, 12)


Unnamed: 0,AGE,DIFFERENTIAL_DIAGNOSIS,SEX,PATHOLOGY,EVIDENCES,INITIAL_EVIDENCE,SEX_ENCODED,NUM_EVIDENCES,NUM_DIFFERENTIAL_DX,TOP_PROBABILITY,CONFIDENCE_GAP,PATHOLOGY_ENCODED
0,34,"[['TSVP', 0.2249801997605792], ['Fibrillation ...",M,TSVP,"['anxiete_s', 'cafe', 'douleurxx', 'douleurxx_...",douleurxx,0,17,6,0.22498,0.005229,44
1,73,"[['Anémie', 0.1800172485572441], ['Ebola', 0.1...",F,Anémie,"['Mauv_aliment', 'atcd_anem', 'atcd_fam_anem',...",rectorragie,1,21,9,0.180017,0.052969,3
2,52,"[['RGO', 0.20809497395808063], ['Bronchite', 0...",M,RGO,"['douleurxx', 'douleurxx_carac_@_sensible', 'd...",pyrosis,0,23,9,0.208095,0.04347,35


val shape: (129258, 12)


Unnamed: 0,AGE,DIFFERENTIAL_DIAGNOSIS,SEX,PATHOLOGY,EVIDENCES,INITIAL_EVIDENCE,SEX_ENCODED,NUM_EVIDENCES,NUM_DIFFERENTIAL_DX,TOP_PROBABILITY,CONFIDENCE_GAP,PATHOLOGY_ENCODED
0,52,"[['Bronchite', 0.33250589804545644], ['Rhinosi...",M,Rhinosinusite aigue,"['douleurxx', 'douleurxx_carac_@_une_brûlure_o...",hyponos,0,24,6,0.332506,0.138893,37
1,67,"[['Attaque de panique', 0.16960860783151496], ...",M,Attaque de panique,"['atcdpsyfam', 'diaph', 'douleurxx', 'douleurx...",paresthesies_bilat,0,26,12,0.169609,0.038329,5
2,62,"[['IVRS ou virémie', 0.1947788896958362], ['Po...",F,IVRS ou virémie,"['contact', 'dayc', 'diaph', 'douleurxx', 'dou...",rhino_clair,1,21,9,0.194779,0.013022,18


test shape: (142184, 12)


Unnamed: 0,AGE,DIFFERENTIAL_DIAGNOSIS,SEX,PATHOLOGY,EVIDENCES,INITIAL_EVIDENCE,SEX_ENCODED,NUM_EVIDENCES,NUM_DIFFERENTIAL_DX,TOP_PROBABILITY,CONFIDENCE_GAP,PATHOLOGY_ENCODED
0,7,"[['Tuberculose', 0.34186741055243264], ['Bronc...",F,Tuberculose,"['HIV', 'cortico', 'crach_sg', 'drogues_IV', '...",crach_sg,1,9,3,0.341867,0.001673,45
1,28,"[['Pneumothorax spontané', 0.12926816396128354...",F,Péricardite,"['B34.9', 'douleurxx', 'douleurxx_carac_@_un_c...",dyspn,1,13,15,0.129268,0.001353,34
2,6,[['Fibrillation auriculaire/Flutter auriculair...,F,Fibrillation auriculaire/Flutter auriculaire,"['ap_valve', 'dyspn', 'e10_e11', 'e66', 'etour...",ww_effort,1,11,14,0.1384,0.017558,15


## Quick Data Exploration

Let's check:
1. Missing values
2. Target (pathology) distribution
3. Feature types and basic statistics

In [None]:
# Check missing values (compute the per-column counts once)
missing_counts = train_df.isnull().sum()
print("Missing values in training set:")
display(missing_counts[missing_counts > 0])

# Check target distribution (the target column is named 'PATHOLOGY' in the CSVs)
plt.figure(figsize=(12, 5))
train_df['PATHOLOGY'].value_counts(normalize=True).plot(kind='bar')
plt.title('Distribution of Pathology Classes (Training Set)')
plt.xlabel('Pathology Class')
plt.ylabel('Proportion')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Basic feature info
print("\nFeature types:")
display(train_df.dtypes)

# Basic statistics for numeric columns
print("\nNumeric feature statistics:")
display(train_df.describe())

## Prepare Features and Labels

We'll:
1. Separate features and target
2. Create preprocessing pipeline for categorical and numeric features
3. Set up cross-validation strategy
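
The three steps above can be sketched on a small synthetic frame. This is a minimal illustration, not the notebook's actual pipeline: the `demo_` names and the generated values are invented, and the choice of `INITIAL_EVIDENCE` as the categorical column is an assumption made only for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a few columns of the preprocessed CSVs (values are made up)
rng = np.random.default_rng(42)
n_rows = 60
demo_df = pd.DataFrame({
    'AGE': rng.integers(1, 90, size=n_rows),
    'NUM_EVIDENCES': rng.integers(5, 30, size=n_rows),
    'INITIAL_EVIDENCE': rng.choice(['douleurxx', 'pyrosis', 'dyspn'], size=n_rows),
    'PATHOLOGY_ENCODED': np.repeat([0, 1, 2], n_rows // 3),
})

# 1. Separate features and target
X_demo = demo_df.drop(columns=['PATHOLOGY_ENCODED'])
y_demo = demo_df['PATHOLOGY_ENCODED']

# 2. Scale numeric columns and one-hot encode the categorical column
demo_preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['AGE', 'NUM_EVIDENCES']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['INITIAL_EVIDENCE']),
])

# 3. Stratified folds keep the class proportions the same in every split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_shapes = []
for train_idx, valid_idx in skf.split(X_demo, y_demo):
    # Fit the preprocessor on the training fold only to avoid leakage
    X_tr = demo_preprocessor.fit_transform(X_demo.iloc[train_idx])
    X_va = demo_preprocessor.transform(X_demo.iloc[valid_idx])
    fold_shapes.append((X_tr.shape, X_va.shape))

print(fold_shapes)
```

Fitting the preprocessor inside the loop, rather than once on the full frame, keeps statistics from the held-out fold out of the training data.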

## Train XGBoost Model

1. Separate features and target (pathology)
2. Train XGBoost classifier with early stopping on the validation set
3. Save the trained model

In [None]:
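# Separate features and target.
# The CSV columns are upper-case and XGBoost needs numeric inputs, so keep the
# numeric feature columns and use the integer-encoded pathology as the target.
numeric_features = ['AGE', 'SEX_ENCODED', 'NUM_EVIDENCES', 'NUM_DIFFERENTIAL_DX',
                    'TOP_PROBABILITY', 'CONFIDENCE_GAP']

X_train = train_df[numeric_features]
y_train = train_df['PATHOLOGY_ENCODED']

X_val = val_df[numeric_features]
y_val = val_df['PATHOLOGY_ENCODED']

print("Training features shape:", X_train.shape)
print("Validation features shape:", X_val.shape)

# Initialize XGBoost classifier.
# early_stopping_rounds is set on the estimator (supported from XGBoost 1.6);
# XGBoost 2.0+ no longer accepts it as a fit() argument.
xgb_clf = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42,
    eval_metric='mlogloss',  # for multiclass classification
    early_stopping_rounds=10
)

# Train with early stopping using validation set
xgb_clf.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=True
)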

print("\nBest iteration:", xgb_clf.best_iteration)

# Create output directories if they don't exist
os.makedirs(BASE / 'models', exist_ok=True)
os.makedirs(BASE / 'outputs', exist_ok=True)

# Save model
model_path = BASE / 'models' / 'xgb_pathology_stratified.joblib'
joblib.dump(xgb_clf, model_path)
print(f"\nModel saved to {model_path}")

## Model Evaluation

Retrain on the numeric features while monitoring validation loss, then evaluate model performance on the test set:
1. Make predictions
2. Calculate accuracy, precision, recall, F1 score
3. Save predictions to CSV
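
As a rough illustration of step 2, macro-averaged precision, recall, and F1 can be computed in one call. The labels below are toy values invented for the example, and macro averaging is an assumption here; weighted averaging may suit imbalanced classes better.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy encoded labels, invented for illustration only
demo_true = [0, 0, 1, 1, 2, 2]
demo_pred = [0, 1, 1, 1, 2, 0]

demo_accuracy = accuracy_score(demo_true, demo_pred)

# average='macro' gives every class equal weight regardless of its frequency;
# zero_division=0 avoids warnings for classes that are never predicted
demo_precision, demo_recall, demo_f1, _ = precision_recall_fscore_support(
    demo_true, demo_pred, average='macro', zero_division=0
)

print(f"accuracy={demo_accuracy:.3f} precision={demo_precision:.3f} "
      f"recall={demo_recall:.3f} f1={demo_f1:.3f}")
```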

In [13]:
# Training data - use only the numeric feature columns (text columns are left out here)
numeric_features = ['AGE', 'SEX_ENCODED', 'NUM_EVIDENCES', 'NUM_DIFFERENTIAL_DX', 'TOP_PROBABILITY', 'CONFIDENCE_GAP']

# Select only numeric features for training
X_train = train_df[numeric_features]
y_train = train_df['PATHOLOGY_ENCODED']  # use encoded pathology as target

# Prepare test set
X_test = test_df[numeric_features]
y_test_true = test_df['PATHOLOGY'].copy()  # save original labels for evaluation
y_test_encoded = test_df['PATHOLOGY_ENCODED'].copy()  # save encoded labels for model

# Prepare validation set
X_val = val_df[numeric_features]
y_val = val_df['PATHOLOGY_ENCODED']

print("Training features shape:", X_train.shape)
print("Validation features shape:", X_val.shape)
print("Test features shape:", X_test.shape)
print("\nFeatures used:", numeric_features)

# Initialize and train XGBoost classifier
xgb_clf = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42,
    eval_metric='mlogloss'  # for multiclass classification
)

# Train with validation monitoring
xgb_clf.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=True
)

# Make predictions on test set
y_pred_encoded = xgb_clf.predict(X_test)
y_pred_proba = xgb_clf.predict_proba(X_test)

# Create output directories if they don't exist
os.makedirs(BASE / 'models', exist_ok=True)
os.makedirs(BASE / 'outputs', exist_ok=True)

# Save model and predictions
model_path = BASE / 'models' / 'xgb_pathology_stratified.joblib'
preds_path = BASE / 'outputs' / 'test_predictions.csv'

joblib.dump(xgb_clf, model_path)
print(f"\nModel saved to {model_path}")

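# Create a mapping dictionary from encoded to original labels (one entry per class)
label_pairs = train_df[['PATHOLOGY_ENCODED', 'PATHOLOGY']].drop_duplicates()
pathology_mapping = dict(zip(label_pairs['PATHOLOGY_ENCODED'], label_pairs['PATHOLOGY']))
y_pred = [pathology_mapping[pred] for pred in y_pred_encoded]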

# Save predictions with probabilities
pred_df = pd.DataFrame({
    'true_pathology': y_test_true,
    'predicted_pathology': y_pred,
    'predicted_pathology_encoded': y_pred_encoded
})

# Add probability columns for each class
for i, col in enumerate(xgb_clf.classes_):
    pred_df[f'prob_class_{col}'] = y_pred_proba[:, i]

pred_df.to_csv(preds_path, index=False)
print(f"Predictions saved to {preds_path}")

# Print evaluation metrics
print("\nModel Performance on Test Set:")
print("Accuracy:", accuracy_score(y_test_encoded, y_pred_encoded))
print("\nClassification Report:")
print(classification_report(y_test_encoded, y_pred_encoded))

Training features shape: (936888, 6)
Validation features shape: (129258, 6)
Test features shape: (142184, 6)

Features used: ['AGE', 'SEX_ENCODED', 'NUM_EVIDENCES', 'NUM_DIFFERENTIAL_DX', 'TOP_PROBABILITY', 'CONFIDENCE_GAP']
[0]	validation_0-mlogloss:3.03081
[1]	validation_0-mlogloss:2.76426
[2]	validation_0-mlogloss:2.56765
[3]	validation_0-mlogloss:2.42171
[4]	validation_0-mlogloss:2.30142
[5]	validation_0-mlogloss:2.19968
[6]	validation_0-mlogloss:2.10952
[7]	validation_0-mlogloss:2.03075
[8]	validation_0-mlogloss:1.96291
[9]	validation_0-mlogloss:1.90054
[10]	validation_0-mlogloss:1.84561
[11]	validation_0-mlogloss:1.79637
[12]	validation_0-mlogloss:1.75042
[13]	validation_0-mlogloss:1.70891
[14]	validation_0-mlogloss:1.67017
[15]	validation_0-mlogloss:1.63309
[16]	validation_0-mlogloss:1.59937
[17]	validation_0-mlogloss:1.56909
[18]	validation_0-mlogloss:1.53965
[19]	validation_0-mlogloss:1.51337
[20]	validation_0-mlogloss:1.48825
[21]	validation_0-mlogloss:1.46500
[22]	validation

In [9]:
# Check column names
print("Training data columns:")
print(train_df.columns.tolist())
print("\nTest data columns:")
print(test_df.columns.tolist())

Training data columns:
['AGE', 'DIFFERENTIAL_DIAGNOSIS', 'SEX', 'PATHOLOGY', 'EVIDENCES', 'INITIAL_EVIDENCE', 'SEX_ENCODED', 'NUM_EVIDENCES', 'NUM_DIFFERENTIAL_DX', 'TOP_PROBABILITY', 'CONFIDENCE_GAP', 'PATHOLOGY_ENCODED']

Test data columns:
['AGE', 'DIFFERENTIAL_DIAGNOSIS', 'SEX', 'PATHOLOGY', 'EVIDENCES', 'INITIAL_EVIDENCE', 'SEX_ENCODED', 'NUM_EVIDENCES', 'NUM_DIFFERENTIAL_DX', 'TOP_PROBABILITY', 'CONFIDENCE_GAP', 'PATHOLOGY_ENCODED']


## Model Analysis

Let's analyze why accuracy may be lower than expected:
1. Confusion Matrix to see which classes are confused
2. Feature Importance to see if we're missing important features

In [None]:
# Plot confusion matrix
plt.figure(figsize=(10, 8))
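# Pass labels explicitly so rows/columns follow the order of xgb_clf.classes_
cm = confusion_matrix(y_test_encoded, y_pred_encoded, labels=xgb_clf.classes_)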
sns.heatmap(cm, cmap='Blues', fmt='d')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

# Plot feature importance
plt.figure(figsize=(10, 6))
importance_df = pd.DataFrame({
    'feature': numeric_features,
    'importance': xgb_clf.feature_importances_
})
importance_df = importance_df.sort_values('importance', ascending=False)

sns.barplot(x='importance', y='feature', data=importance_df)
plt.title('Feature Importance')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.show()

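# Print most confused pairs, largest first
# (assumes the rows/columns of cm follow the order of xgb_clf.classes_)
class_labels = xgb_clf.classes_
confused_pairs = []
for i in range(len(class_labels)):
    for j in range(i + 1, len(class_labels)):
        confusion_sum = cm[i][j] + cm[j][i]
        if confusion_sum > 0:  # if there are any confusions between these classes
            confused_pairs.append((confusion_sum,
                                   pathology_mapping[class_labels[i]],
                                   pathology_mapping[class_labels[j]]))

print("\nMost Confused Class Pairs:")
for confusion_sum, label_i, label_j in sorted(confused_pairs, reverse=True)[:10]:
    print(f"{label_i} ↔ {label_j}: {confusion_sum} cases")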

## Improved Model with Text Features

Let's improve the model by:
1. Including encoded text features (DIFFERENTIAL_DIAGNOSIS, EVIDENCES)
2. Using a proper preprocessing pipeline
3. Increasing model complexity slightly
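
One possible way to encode the `EVIDENCES` column, sketched under assumptions: the column stores each patient's evidences as a stringified Python list (as the previews above suggest), and a multi-hot encoding is a reasonable representation. The `demo_` names and the three sample rows are invented for illustration.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample rows in the same stringified-list format as the EVIDENCES column
demo_evidences = pd.Series([
    "['douleurxx', 'cafe']",
    "['dyspn', 'douleurxx']",
    "['cafe']",
])

# Parse each string into a real Python list without using eval()
demo_lists = demo_evidences.apply(ast.literal_eval)

# One binary column per distinct evidence code; sparse output keeps memory
# manageable when the vocabulary and row count are large
mlb = MultiLabelBinarizer(sparse_output=True)
demo_matrix = mlb.fit_transform(demo_lists)

print(list(mlb.classes_))
print(demo_matrix.toarray())
```

The resulting sparse matrix could then be stacked next to the numeric features (for example with `scipy.sparse.hstack`) before training; the encoder should be fitted on the training split only and reused for validation and test.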