# Patient Dropout Prediction Model

This notebook builds an XGBoost classifier that predicts whether a patient will drop out of a clinical trial, using data loaded from Snowflake with Snowpark for Python.

**Features and target:**
- **Age**: Patient age (0-99)
- **Gender**: Patient gender (MALE/FEMALE)
- **Target**: PATIENT_DROPOUT indicator (1 = dropout, 0 = completed)

**Tech Stack:**
- Snowpark for Python (data access)
- XGBoost (gradient boosting classifier)
- scikit-learn (preprocessing & metrics)
- pandas (data manipulation)


## 1. Import Libraries and Setup


In [None]:
### Import Required Libraries

# Snowpark for Python
from snowflake.snowpark.context import get_active_session
import snowflake.snowpark.functions as F

# Data science libraries
import pandas as pd
import numpy as np
from xgboost import XGBClassifier

# Scikit-learn for preprocessing and metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score,
    confusion_matrix,
    classification_report,
    roc_auc_score,
    roc_curve
)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Misc
import warnings
warnings.simplefilter('ignore')


In [None]:
# Establish Snowflake session
session = get_active_session()

# Add query tag for tracking
session.query_tag = {
    "origin": "patient_dropout_ml",
    "model": "xgboost_classifier",
    "version": {"major": 1, "minor": 0}
}

print("Session established successfully!")
session


## 2. Load Training Data

Load the patient dropout data from INFORMATICS_SANDBOX.ML_TEST.DOR_ANALYSIS_FF


In [None]:
# Load the training data from Snowflake table
patient_data_df = session.table("INFORMATICS_SANDBOX.ML_TEST.DOR_ANALYSIS_FF")

# Display basic info
print(f"Total records: {patient_data_df.count()}")
patient_data_df.show()


In [None]:
# Summary statistics (describe() returns a lazy DataFrame, so call show() to display it)
patient_data_df.describe().show()

# Show column names and types
for field in patient_data_df.schema.fields:
    print(f"{field.name}: {field.datatype}")


## 3. Exploratory Data Analysis


In [None]:
# Convert to pandas for analysis and visualization
patient_pd = patient_data_df.to_pandas()

# Display first few rows
print("Sample data:")
patient_pd.head(10)


In [None]:
# Check data distribution
print("=== Dataset Overview ===")
print(f"Total patients: {len(patient_pd)}")
print(f"Dropout count: {patient_pd['PATIENT_DROPOUT'].sum()}")
print(f"Dropout percentage: {patient_pd['PATIENT_DROPOUT'].mean() * 100:.2f}%")
print(f"\nAge statistics:")
print(f"  Mean age: {patient_pd['AGE'].mean():.2f}")
print(f"  Min age: {patient_pd['AGE'].min()}")
print(f"  Max age: {patient_pd['AGE'].max()}")
print(f"  Std age: {patient_pd['AGE'].std():.2f}")

# Check for missing values
print(f"\nMissing values:")
print(patient_pd.isnull().sum())


In [None]:
# Dropout rate by gender
print("=== Dropout Rate by Gender ===")
gender_stats = patient_pd.groupby('GENDER').agg({
    'PATIENT_DROPOUT': ['count', 'sum', 'mean']
}).round(4)
gender_stats.columns = ['Total_Patients', 'Dropout_Count', 'Dropout_Rate']
gender_stats['Dropout_Percentage'] = gender_stats['Dropout_Rate'] * 100
print(gender_stats)


In [None]:
# Dropout rate by age group
print("\n=== Dropout Rate by Age Group ===")
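# right=False makes each bin include its lower edge: [0, 30), [30, 50), [50, 70), [70, 100)
patient_pd['AGE_GROUP'] = pd.cut(
    patient_pd['AGE'], 
    bins=[0, 30, 50, 70, 100],
    labels=['0-29', '30-49', '50-69', '70+'],
    right=False
)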

age_stats = patient_pd.groupby('AGE_GROUP').agg({
    'PATIENT_DROPOUT': ['count', 'sum', 'mean']
}).round(4)
age_stats.columns = ['Total_Patients', 'Dropout_Count', 'Dropout_Rate']
age_stats['Dropout_Percentage'] = age_stats['Dropout_Rate'] * 100
print(age_stats)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Dropout rate by age group
age_stats['Dropout_Percentage'].plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('Dropout Rate by Age Group')
axes[0].set_ylabel('Dropout Percentage (%)')
axes[0].set_xlabel('Age Group')
axes[0].tick_params(axis='x', rotation=45)

# Dropout rate by gender
gender_stats['Dropout_Percentage'].plot(kind='bar', ax=axes[1], color='coral')
axes[1].set_title('Dropout Rate by Gender')
axes[1].set_ylabel('Dropout Percentage (%)')
axes[1].set_xlabel('Gender')
axes[1].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()
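
How `pd.cut` treats bin edges decides which group a boundary age such as 30 falls into. The sketch below compares the default right-closed bins with `right=False` on a few made-up ages. The sample values and label names are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Illustrative boundary ages (not taken from the trial data)
ages = pd.Series([0, 29, 30, 69, 70, 99])
bins = [0, 30, 50, 70, 100]
labels = ['0-29', '30-49', '50-69', '70+']

# Default (right=True): intervals are (0, 30], (30, 50], ...
# so age 0 is left unbinned and age 30 joins the first group
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [0, 30), [30, 50), ...
# so each label matches the range it names
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

comparison = pd.DataFrame({
    'AGE': ages,
    'right=True': right_closed,
    'right=False': left_closed,
})
print(comparison)
```

With labels written as inclusive ranges, `right=False` is the setting that keeps the label text and the actual bin contents in agreement.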


## 4. Data Preprocessing

Prepare features for XGBoost model training by encoding categorical variables.


In [None]:
# Define feature columns and target
FEATURE_COLUMNS = ['AGE', 'GENDER']
TARGET_COLUMN = 'PATIENT_DROPOUT'

# Create a clean dataframe with only required columns
df_clean = patient_pd[FEATURE_COLUMNS + [TARGET_COLUMN]].copy()

# Handle case variations in gender
df_clean['GENDER'] = df_clean['GENDER'].str.upper()

# Encode gender: MALE = 1, FEMALE = 0
# Note: any value other than 'MALE' (including missing values) is also encoded as 0
df_clean['GENDER_ENCODED'] = (df_clean['GENDER'] == 'MALE').astype(int)

# Drop the original gender column
df_clean = df_clean.drop('GENDER', axis=1)

# AGE_GROUP (created for EDA only) is never copied into df_clean,
# because only FEATURE_COLUMNS and TARGET_COLUMN were selected above

print("Preprocessed data shape:", df_clean.shape)
print("\nFirst few rows:")
print(df_clean.head())
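
A comparison against 'MALE' quietly turns every other value into 0. As a minimal sketch of a stricter alternative, the hypothetical helper `encode_gender_strict` below maps known values explicitly and raises on anything else. The helper name, the mapping dictionary, and the whitespace stripping are assumptions for illustration, not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical mapping that mirrors the notebook's convention: MALE = 1, FEMALE = 0
GENDER_MAP = {'MALE': 1, 'FEMALE': 0}

def encode_gender_strict(gender: pd.Series) -> pd.Series:
    """Encode gender as 0/1 and raise if any value is missing or unrecognized."""
    normalized = gender.str.strip().str.upper()
    encoded = normalized.map(GENDER_MAP)  # unknown or missing values become NaN
    if encoded.isna().any():
        bad_values = gender[encoded.isna()].unique().tolist()
        raise ValueError(f"Unexpected GENDER values: {bad_values}")
    return encoded.astype(int)

sample = pd.Series(['male', ' Female', 'MALE'])
print(encode_gender_strict(sample).tolist())
```

Failing loudly on unexpected values is one design choice; another would be to keep them as a separate category.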


In [None]:
# Verify no missing values and data types
print("Data Info:")
print(df_clean.info())
print("\nData Statistics:")
print(df_clean.describe())
print("\nClass distribution:")
print(df_clean[TARGET_COLUMN].value_counts())


## 5. Train/Test Split

Split the data into training (80%) and test (20%) sets, stratified on the target so that both sets keep the same dropout rate.


In [None]:
# Prepare features (X) and target (y)
X = df_clean[['AGE', 'GENDER_ENCODED']]
y = df_clean[TARGET_COLUMN]

# Split into train and test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42,
    stratify=y  # Maintain class distribution in both sets
)

print(f"Training set size: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"Test set size: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")
print(f"\nTraining set dropout rate: {y_train.mean()*100:.2f}%")
print(f"Test set dropout rate: {y_test.mean()*100:.2f}%")
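
To see what `stratify` contributes, the sketch below splits a synthetic, imbalanced target with and without it. The arrays `X_demo` and `y_demo` are made up for illustration and the 10% positive rate is an assumption, not the dataset's actual dropout rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 100 positives, 900 negatives (illustrative only)
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Without stratification the positive rate can drift between the two sets
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# With stratify=y_demo both sets keep the original class proportions
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Plain split:      train={y_tr_plain.mean():.3f}, test={y_te_plain.mean():.3f}")
print(f"Stratified split: train={y_tr_strat.mean():.3f}, test={y_te_strat.mean():.3f}")
```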


## 6. Train XGBoost Model

Train an XGBoost classifier following the pattern from MEDPACE_ML_HOL notebooks.


In [None]:
# Initialize XGBoost Classifier
xgb_model = XGBClassifier(
    n_estimators=100,        # Number of trees
    max_depth=6,             # Maximum tree depth
    learning_rate=0.1,       # Step size shrinkage
    random_state=42,
    eval_metric='logloss'    # Evaluation metric
)

# Train the model
print("Training XGBoost model...")
xgb_model.fit(X_train, y_train)
print("Training complete!")

# Display feature importance
feature_names = ['AGE', 'GENDER_ENCODED']
feature_importance = pd.DataFrame({
    'Feature': feature_names,
    'Importance': xgb_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nFeature Importance:")
print(feature_importance)


In [None]:
# Visualize feature importance
plt.figure(figsize=(10, 5))
plt.barh(feature_importance['Feature'], feature_importance['Importance'], color='skyblue')
plt.xlabel('Importance Score')
plt.title('XGBoost Feature Importance')
plt.tight_layout()
plt.show()


## 7. Make Predictions

Generate predictions on both training and test sets.


In [None]:
# Generate predictions on test set
y_test_pred = xgb_model.predict(X_test)
y_test_pred_proba = xgb_model.predict_proba(X_test)[:, 1]  # Probability of dropout

# Generate predictions on training set (to check for overfitting)
y_train_pred = xgb_model.predict(X_train)
y_train_pred_proba = xgb_model.predict_proba(X_train)[:, 1]

print("Predictions generated successfully!")
print(f"Test predictions shape: {y_test_pred.shape}")
print(f"Training predictions shape: {y_train_pred.shape}")


In [None]:
# Create a predictions dataframe for test set
test_predictions_df = pd.DataFrame({
    'AGE': X_test['AGE'].values,
    'GENDER_ENCODED': X_test['GENDER_ENCODED'].values,
    'Actual_Dropout': y_test.values,
    'Predicted_Dropout': y_test_pred,
    'Dropout_Probability': y_test_pred_proba
})

print("Sample predictions:")
print(test_predictions_df.head(10))


## 8. Model Evaluation - Test Set Performance

Calculate comprehensive evaluation metrics on the test set.


In [None]:
# Calculate test set metrics
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
test_auc = roc_auc_score(y_test, y_test_pred_proba)

print("=== TEST SET PERFORMANCE ===")
print(f"Accuracy:  {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")
print(f"Precision: {test_precision:.4f}")
print(f"Recall:    {test_recall:.4f}")
print(f"F1 Score:  {test_f1:.4f}")
print(f"ROC AUC:   {test_auc:.4f}")


In [None]:
# Calculate training set metrics (to check for overfitting)
train_accuracy = accuracy_score(y_train, y_train_pred)
train_precision = precision_score(y_train, y_train_pred)
train_recall = recall_score(y_train, y_train_pred)
train_f1 = f1_score(y_train, y_train_pred)
train_auc = roc_auc_score(y_train, y_train_pred_proba)

print("\n=== TRAINING SET PERFORMANCE ===")
print(f"Accuracy:  {train_accuracy:.4f} ({train_accuracy*100:.2f}%)")
print(f"Precision: {train_precision:.4f}")
print(f"Recall:    {train_recall:.4f}")
print(f"F1 Score:  {train_f1:.4f}")
print(f"ROC AUC:   {train_auc:.4f}")

# Check for overfitting (a large train-minus-test gap suggests the model memorized the training data)
print("\n=== OVERFITTING CHECK ===")
accuracy_gap = train_accuracy - test_accuracy
print(f"Accuracy gap (train - test): {accuracy_gap:.4f}")
if accuracy_gap < 0.05:
    print("✓ Model generalizes well (gap < 5 percentage points)")
else:
    print("⚠ Possible overfitting detected")


In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_test_pred)

print("\n=== CONFUSION MATRIX ===")
print(f"True Negatives:  {cm[0,0]}")
print(f"False Positives: {cm[0,1]}")
print(f"False Negatives: {cm[1,0]}")
print(f"True Positives:  {cm[1,1]}")

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['No Dropout', 'Dropout'],
            yticklabels=['No Dropout', 'Dropout'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix - Test Set')
plt.show()
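
The four confusion-matrix cells are enough to recompute the headline metrics by hand, which is a useful sanity check on the scikit-learn output. The counts below are made up for illustration and follow the same layout as `confusion_matrix` (rows are actual classes, columns are predicted classes).

```python
# Made-up counts, laid out like sklearn's confusion_matrix output:
# rows = actual class, columns = predicted class
cm_demo = [[50, 10],   # actual No Dropout: 50 true negatives, 10 false positives
           [5, 35]]    # actual Dropout:     5 false negatives, 35 true positives

tn, fp = cm_demo[0]
fn, tp = cm_demo[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted dropouts, the share that really dropped out
recall = tp / (tp + fn)      # of the real dropouts, the share the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```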


In [None]:
# Classification Report
print("\n=== CLASSIFICATION REPORT ===")
print(classification_report(y_test, y_test_pred, 
                          target_names=['No Dropout', 'Dropout']))


## 9. ROC Curve and AUC

Visualize model performance across different classification thresholds.


In [None]:
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_test_pred_proba)

# Plot ROC curve
plt.figure(figsize=(10, 7))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {test_auc:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(alpha=0.3)
plt.show()
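
The arrays returned by `roc_curve` can also be used to pick an operating threshold instead of the default 0.5. The sketch below uses Youden's J statistic (TPR minus FPR), which is one common heuristic and not something this notebook applies. The labels and scores are toy values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and scores (illustrative only)
y_true_demo = np.array([0, 0, 0, 1, 1, 1])
scores_demo = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])

fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_true_demo, scores_demo)

# Youden's J: the threshold that maximizes TPR - FPR
j_scores = tpr_demo - fpr_demo
best_idx = int(np.argmax(j_scores))
best_threshold = thresholds_demo[best_idx]

print(f"Best threshold by Youden's J: {best_threshold:.2f}")
print(f"TPR = {tpr_demo[best_idx]:.2f}, FPR = {fpr_demo[best_idx]:.2f}")
```

In a dropout setting, where a missed dropout may cost more than a false alarm, a cost-weighted criterion could be a better fit than J.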


## 10. Predict on New Patients

Apply the trained model to score new patients for dropout risk.


In [None]:
# Create sample new patients to score
new_patients_data = pd.DataFrame({
    'AGE': [25, 45, 65, 30, 75, 22, 55, 40, 80, 28],
    'GENDER': ['FEMALE', 'MALE', 'FEMALE', 'MALE', 'FEMALE', 
               'MALE', 'FEMALE', 'MALE', 'MALE', 'FEMALE']
})

print("New patients to score:")
print(new_patients_data)


In [None]:
# Preprocess new patients (same as training data)
new_patients_data['GENDER_ENCODED'] = (new_patients_data['GENDER'].str.upper() == 'MALE').astype(int)

# Prepare features for prediction
X_new = new_patients_data[['AGE', 'GENDER_ENCODED']]

print("\nPreprocessed features:")
print(X_new)


In [None]:
# Make predictions on new patients
new_predictions = xgb_model.predict(X_new)
new_predictions_proba = xgb_model.predict_proba(X_new)[:, 1]

# Create results dataframe
new_patients_results = new_patients_data.copy()
new_patients_results['Predicted_Dropout'] = new_predictions
new_patients_results['Dropout_Probability'] = new_predictions_proba

# Add risk category
def categorize_risk(prob):
    if prob >= 0.7:
        return 'High Risk'
    elif prob >= 0.4:
        return 'Medium Risk'
    else:
        return 'Low Risk'

new_patients_results['Risk_Category'] = new_patients_results['Dropout_Probability'].apply(categorize_risk)

print("\n=== NEW PATIENT PREDICTIONS ===")
print(new_patients_results[['AGE', 'GENDER', 'Dropout_Probability', 'Risk_Category']].sort_values('Dropout_Probability', ascending=False))
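
The same 0.4 and 0.7 cut-offs can be applied without a row-by-row `apply`. As a sketch, `pd.cut` with `right=False` reproduces the `>=` comparisons and returns an ordered categorical. The probabilities below are illustrative values, not model output.

```python
import numpy as np
import pandas as pd

# Illustrative probabilities, including both boundary values
probs = pd.Series([0.10, 0.40, 0.69, 0.70, 1.00])

# right=False gives [-inf, 0.4), [0.4, 0.7), [0.7, inf),
# matching "prob >= 0.7" and "prob >= 0.4" in categorize_risk
risk = pd.cut(
    probs,
    bins=[-np.inf, 0.4, 0.7, np.inf],
    labels=['Low Risk', 'Medium Risk', 'High Risk'],
    right=False,
)

print(pd.DataFrame({'Dropout_Probability': probs, 'Risk_Category': risk}))
```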


In [None]:
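# Visualize risk distribution for new patients
# Fix the category order so each bar gets its matching color
risk_order = ['High Risk', 'Medium Risk', 'Low Risk']
risk_counts = new_patients_results['Risk_Category'].value_counts().reindex(risk_order, fill_value=0)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Risk category distribution
risk_counts.plot(kind='bar', ax=axes[0], color=['red', 'orange', 'green'])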
axes[0].set_title('Risk Category Distribution - New Patients')
axes[0].set_ylabel('Count')
axes[0].set_xlabel('Risk Category')
axes[0].tick_params(axis='x', rotation=45)

# Dropout probability distribution
axes[1].hist(new_patients_results['Dropout_Probability'], bins=10, color='steelblue', edgecolor='black')
axes[1].set_title('Dropout Probability Distribution')
axes[1].set_xlabel('Dropout Probability')
axes[1].set_ylabel('Frequency')
axes[1].axvline(0.4, color='orange', linestyle='--', label='Medium Risk Threshold')
axes[1].axvline(0.7, color='red', linestyle='--', label='High Risk Threshold')
axes[1].legend()

plt.tight_layout()
plt.show()


## 11. Summary and Next Steps


### Model Summary

**Data Source:**
- Table: INFORMATICS_SANDBOX.ML_TEST.DOR_ANALYSIS_FF
- Features: Age (numeric), Gender (categorical)
- Target: Patient_Dropout (binary: 1=dropout, 0=completed)

**Model Details:**
- Algorithm: XGBoost Classifier
- Implementation: Python (xgboost library)
- Train/Test Split: 80/20 with stratification

**Model Performance:**
- Test Accuracy: see the Section 8 output
- Test AUC: see the Section 8 output
- Precision, Recall, F1: see the Section 8 output

**Workflow:**
1. ✅ Load data from Snowflake using Snowpark
2. ✅ Exploratory data analysis with visualizations
3. ✅ Feature encoding (Gender → binary)
4. ✅ Train/test split with stratification
5. ✅ XGBoost model training
6. ✅ Comprehensive evaluation (accuracy, precision, recall, F1, AUC, confusion matrix, ROC curve)
7. ✅ Example scoring of sample new patients with risk categorization

### Potential Improvements

1. **Feature Engineering:**
   - Add medical history features
   - Include trial duration and phase
   - Add previous trial participation data
   - Create age bins or polynomial features

2. **Model Enhancements:**
   - Hyperparameter tuning with GridSearchCV or RandomizedSearchCV
   - Try other ensemble methods (Random Forest, LightGBM)
   - Implement SMOTE or class weighting if imbalanced
   - Feature selection techniques

3. **MLOps:**
   - Integrate with Snowflake Model Registry
   - Set up model monitoring and drift detection
   - Create automated retraining pipeline
   - Deploy as Snowflake UDF for real-time scoring

4. **Validation:**
   - Implement k-fold cross-validation
   - Test on multiple clinical trial datasets
   - Perform temporal validation (train on old data, test on recent)

5. **Interpretability:**
   - Add SHAP values for model explainability
   - Create feature importance visualizations
   - Analyze misclassified cases

6. **Deployment:**
   - Create a production deployment pipeline
   - Integrate with Snowflake Feature Store and Model Registry
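
For the class-weighting idea above, XGBoost exposes a `scale_pos_weight` parameter, and its documentation suggests the ratio of negative to positive examples as a starting value. The sketch below computes that ratio on made-up labels. The 90/10 split is an assumption for illustration, not the dataset's real balance.

```python
import numpy as np

# Illustrative labels: 1 = dropout, 0 = completed (not real trial data)
y_demo = np.array([0] * 90 + [1] * 10)

n_negative = int((y_demo == 0).sum())
n_positive = int((y_demo == 1).sum())

# Starting value suggested by the XGBoost docs: sum(negative) / sum(positive)
scale_pos_weight = n_negative / n_positive
print(f"scale_pos_weight = {scale_pos_weight:.1f}")

# It could then be passed to the classifier, for example:
# XGBClassifier(n_estimators=100, scale_pos_weight=scale_pos_weight)
```

Reweighting changes the predicted probabilities, so any risk thresholds built on them would need to be revisited.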


## 12. Model Persistence (Optional)

Save the trained XGBoost model for future use.


In [None]:
# Optional: Save model to file for later use
# Uncomment to save the model

# import joblib
# import os

# # Create models directory if it doesn't exist
# os.makedirs('/tmp/models', exist_ok=True)

# # Save the model
# model_path = '/tmp/models/patient_dropout_xgboost.joblib'
# joblib.dump(xgb_model, model_path)
# print(f"Model saved to: {model_path}")

# # To load the model later:
# # loaded_model = joblib.load(model_path)
# # predictions = loaded_model.predict(X_new)

print("Model training and evaluation complete!")
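
As a runnable sketch of the save-and-reload round trip, the block below uses a temporary directory and a scikit-learn `LogisticRegression` as a stand-in for the trained XGBoost model. The toy data, the stand-in model, and the file name are all assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model on toy data with columns [AGE, GENDER_ENCODED] (illustrative only)
X_toy = np.array([[20, 0], [35, 1], [50, 0], [65, 1], [30, 1], [70, 0]])
y_toy = np.array([0, 0, 1, 1, 0, 1])
toy_model = LogisticRegression().fit(X_toy, y_toy)

# Save to a temporary directory, then load the model back
with tempfile.TemporaryDirectory() as model_dir:
    model_path = os.path.join(model_dir, 'toy_model.joblib')
    joblib.dump(toy_model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions
same_predictions = np.array_equal(toy_model.predict(X_toy), reloaded.predict(X_toy))
print(f"Predictions match after reload: {same_predictions}")
```

Pickle-based files such as joblib dumps are tied to the library versions that wrote them. For longer-term storage, XGBoost's own `save_model` method may be worth considering.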
