# XGBoost Anomaly Detection for CMS Open Payments

**Project:** AAI-540 Machine Learning Operations - Final Team Project  
**Context:** Continuation of notebook 02 - Data Exploration & Analysis  
**Objective:** Train an XGBoost model to detect anomalous payment patterns using gradient-boosted decision trees

---

## Table of Contents
1. [Setup & Data Loading](#setup)
2. [Load Data from Stored Variables](#loading)
3. [Data Preparation & Feature Engineering](#preparation)
4. [XGBoost Configuration](#configuration)
5. [Model Training](#training)
6. [Performance Evaluation](#evaluation)
7. [Anomaly Score Calculation & Validation](#scoring)
8. [Visualizations & Metrics](#visualizations)
9. [Summary & Outputs](#summary)

---

## 1. Setup & Data Loading

Load dependencies and restore configuration from notebook 02 (EDA).

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
import pickle
import sys
sys.path.append('..')

from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
import xgboost as xgb
from utils.visualizations import ModelVisualizer

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

model_viz = ModelVisualizer()

In [None]:
%store -r bucket
%store -r region
%store -r database_name
%store -r table_name_parquet
%store -r s3_parquet_path
%store -r s3_athena_staging
%store -r df

if 'df' not in dir() or df is None:
    raise NameError("Missing required variable 'df'. Run notebook 02 first.")
    
print(f"Region: {region} | Bucket: {bucket} | Database: {database_name}")
print(f"Dataset shape: {df.shape}")

## 2. Load Data from Stored Variables

Use the cleaned and processed dataset from notebook 02.

In [None]:
df_payments = df.copy()
display(df_payments.head(3))
print(f"Dataset loaded: {df_payments.shape}")

## 3. Data Preparation & Feature Engineering

Prepare features for XGBoost training: select numeric columns, drop sparse ones, clip extreme values, and impute missing values.

In [None]:
# Select numeric features for anomaly detection
numeric_cols = df_payments.select_dtypes(include=[np.number]).columns.tolist()

# Exclude identifier and non-relevant columns
cols_to_exclude = [
    'EventTime', 'covered_recipient_profile_id', 'index',
    'teaching_hospital_id', 'covered_recipient_npi',
    'recipient_zip_code', 'recipient_province', 'recipient_postal_code'
]

numeric_features = [col for col in numeric_cols 
                   if col not in cols_to_exclude 
                   and not any(x in col.lower() for x in ['_id', '_code', '_province', '_postal'])]

# Create feature matrix
X = df_payments[numeric_features].copy().astype(float)
X = X.replace([np.inf, -np.inf], np.nan)

# Remove columns with excessive missing values (>50%)
missing_pct = (X.isnull().sum() / len(X)) * 100
cols_to_keep = missing_pct[missing_pct <= 50].index.tolist()
X = X[cols_to_keep]

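# Clip extreme values to [Q1 - 3*IQR, Q3 + 3*IQR]
for col in X.columns:
    q1, q3 = X[col].quantile(0.25), X[col].quantile(0.75)
    iqr = q3 - q1
    if iqr == 0:
        # Zero IQR (e.g. binary flags): clipping would flatten the column to a constant
        continue
    X[col] = X[col].clip(lower=q1 - 3*iqr, upper=q3 + 3*iqr)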

# Fill remaining missing values with median
X = X.fillna(X.median())

print(f"Features prepared: {X.shape}")
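
To make the clipping step concrete, here is a small self-contained sketch on toy data. The helper name `clip_iqr` and the toy values are illustrative assumptions, not part of the notebook; the factor of 3 matches the cell above, and quantiles use the pandas default (linear interpolation).

```python
import pandas as pd

def clip_iqr(series: pd.Series, k: float = 3.0) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]; leave zero-IQR columns untouched."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    if iqr == 0:
        # A zero IQR would make both bounds equal and flatten the column
        return series.copy()
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy frame: one heavy-tailed amount column and one mostly-zero binary flag
toy = pd.DataFrame({
    "amount": [10.0, 12.0, 11.0, 13.0, 12.0, 5000.0],
    "is_weekend": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
})

clipped = toy.apply(clip_iqr)  # applied column by column
print(clipped)
```

For `amount`, Q1 is 11.25 and Q3 is 12.75, so the upper bound is 17.25 and the 5000.0 entry is pulled down to it. The `is_weekend` flag has an IQR of 0 and is returned unchanged.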

In [None]:
# XGBoost is a supervised learner, so it needs labels. None exist here, so we
# derive pseudo-labels from statistical thresholds (an alternative would be to
# reuse the Isolation Forest output from the previous notebook).

# Flag each feature value whose absolute z-score exceeds 3
outlier_indicators = []

for col in X.columns:
    std = X[col].std()
    if std == 0 or np.isnan(std):
        # Constant column: no value can be an outlier
        outlier_indicators.append(pd.Series(0, index=X.index, name=col))
        continue
    z_scores = np.abs((X[col] - X[col].mean()) / std)
    outlier_indicators.append((z_scores > 3).astype(int))

# Outlier score = number of features flagged for each row
outlier_score = pd.concat(outlier_indicators, axis=1).sum(axis=1)

# Label rows at or above the 95th percentile of the outlier score as anomalies.
# The score is a small integer with many ties, so the labeled share can differ from 5%;
# the floor of 1 prevents a 95th percentile of 0 from labeling every row.
threshold_percentile = 95
threshold_value = max(np.percentile(outlier_score, threshold_percentile), 1)
y_pseudo = (outlier_score >= threshold_value).astype(int)

anomaly_ratio = y_pseudo.sum() / len(y_pseudo)
print(f"Pseudo-labels created: {y_pseudo.sum():,} anomalies ({anomaly_ratio*100:.2f}%)")
print(f"Normal samples: {(y_pseudo==0).sum():,} ({(1-anomaly_ratio)*100:.2f}%)")
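
The percentile cutoff deserves a closer look because the outlier score is a count with many ties. The sketch below uses synthetic data (all names and values are assumptions for illustration) to show what happens when fewer than 5% of rows have any flagged feature: the 95th percentile is 0, and a plain `>=` comparison labels every row.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 4))
X_toy[:10, 0] = 8.0  # inject 10 rows with an extreme value in the first feature

# Per-feature absolute z-scores, then count flagged features per row
z = np.abs((X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0, ddof=1))
flag_counts = (z > 3).sum(axis=1)

cutoff = np.percentile(flag_counts, 95)

# Without a floor, a cutoff of 0 makes "count >= cutoff" true for every row
naive_labels = (flag_counts >= cutoff).astype(int)

# Requiring at least one flagged feature keeps the labels meaningful
guarded_labels = (flag_counts >= max(cutoff, 1)).astype(int)

print(f"cutoff={cutoff}, naive share={naive_labels.mean():.2%}, "
      f"guarded share={guarded_labels.mean():.2%}")
```

Printing the anomaly ratio, as the cell above does, is a cheap way to catch this case on real data.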

In [None]:
# Scale features with RobustScaler (centers on the median and scales by the IQR, so outliers have less influence)
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

# Split into train and test sets with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_pseudo, 
    test_size=0.2, 
    random_state=42,
    stratify=y_pseudo
)

print(f"Train: {len(X_train):,} ({y_train.sum()} anomalies) | Test: {len(X_test):,} ({y_test.sum()} anomalies)")
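
As a rough illustration of what robust scaling computes, the sketch below reproduces the median/IQR transform with NumPy and fits the statistics on the training rows only, so that no information from held-out rows enters the transform. The helper names and toy arrays are assumptions; the 25th to 75th percentile range mirrors the `RobustScaler` defaults.

```python
import numpy as np

def fit_robust_stats(train: np.ndarray):
    """Return per-column median and IQR computed from the training rows."""
    median = np.median(train, axis=0)
    q1, q3 = np.percentile(train, [25, 75], axis=0)
    iqr = q3 - q1
    iqr[iqr == 0] = 1.0  # avoid division by zero for constant columns
    return median, iqr

def apply_robust_stats(data: np.ndarray, median: np.ndarray, iqr: np.ndarray) -> np.ndarray:
    """Center on the median and scale by the IQR."""
    return (data - median) / iqr

train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
test = np.array([[3.0], [100.0]])

median, iqr = fit_robust_stats(train)             # statistics come from train only
train_scaled = apply_robust_stats(train, median, iqr)
test_scaled = apply_robust_stats(test, median, iqr)
print(train_scaled.ravel(), test_scaled.ravel())
```

Tree-based models such as XGBoost are largely insensitive to monotonic rescaling of individual features, so the main benefit here is keeping the scaled features reusable by other models in the project.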

## 4. XGBoost Configuration

Configure the XGBoost classifier with a fixed set of hyperparameters (not tuned in this notebook), using `scale_pos_weight` to compensate for class imbalance.

In [None]:
# Calculate scale_pos_weight for imbalanced dataset
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

# XGBoost hyperparameters
xgb_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 6,
    'learning_rate': 0.1,
    'n_estimators': 200,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 3,
    'gamma': 0.1,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'scale_pos_weight': scale_pos_weight,
    'random_state': 42,
    'n_jobs': -1,
    'verbosity': 0
}

print(f"XGBoost configured: depth={xgb_params['max_depth']}, lr={xgb_params['learning_rate']}, n_est={xgb_params['n_estimators']}")
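
The `scale_pos_weight` ratio is simple arithmetic: the number of negative samples divided by the number of positive samples. A minimal worked example, with toy counts chosen purely for illustration:

```python
import numpy as np

# Toy labels: 950 normal rows and 50 anomalies (a 5% positive rate)
y_toy = np.array([0] * 950 + [1] * 50)

n_neg = int((y_toy == 0).sum())
n_pos = int((y_toy == 1).sum())
if n_pos == 0:
    raise ValueError("No positive samples: scale_pos_weight is undefined")

toy_scale_pos_weight = n_neg / n_pos
print(toy_scale_pos_weight)  # 19.0

# With this weight, the positives carry the same total weight as the negatives
assert n_pos * toy_scale_pos_weight == n_neg
```

The guard matters in this notebook because the pseudo-labels come from a threshold; a split with no anomalies would otherwise raise a division-by-zero error.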

## 5. Model Training

Train the XGBoost model on the prepared dataset with early stopping.

In [None]:
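# Initialize the classifier. In XGBoost 2.0 and later, early_stopping_rounds is set
# on the estimator; passing it to fit() is no longer supported.
xgb_model = xgb.XGBClassifier(**xgb_params, early_stopping_rounds=20)

start_time = time.time()

# Fit with evaluation sets; early stopping monitors the last one (the test split),
# so the reported test AUC is optimistically biased
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    verbose=False
)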

training_time = time.time() - start_time
print(f"Training completed: {training_time:.2f}s | Best iteration: {xgb_model.best_iteration} | Best score: {xgb_model.best_score:.6f}")
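
Early stopping halts boosting once the monitored metric has not improved for a set number of rounds. The sketch below is a simplified, library-independent illustration of that rule; `simulate_early_stopping` and the AUC values are made up for the example, and XGBoost's internal bookkeeping may differ in detail.

```python
def simulate_early_stopping(scores, patience, maximize=True):
    """Return (best_iteration, last_iteration_run) for per-round validation scores.

    Training stops at the first round that is `patience` rounds past the best one.
    """
    best_iter, best_score = 0, scores[0]
    for i, score in enumerate(scores[1:], start=1):
        improved = score > best_score if maximize else score < best_score
        if improved:
            best_iter, best_score = i, score
        elif i - best_iter >= patience:
            return best_iter, i
    # Ran through every round without triggering the stop condition
    return best_iter, len(scores) - 1

# Hypothetical validation AUC per boosting round
val_auc = [0.80, 0.85, 0.88, 0.90, 0.895, 0.899, 0.898, 0.897]
best_iter, stopped_at = simulate_early_stopping(val_auc, patience=3)
print(best_iter, stopped_at)  # 3 6
```

Here the best score occurs at round 3, and training stops at round 6 after three rounds without improvement.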

### Training Curve Analysis

Visualize the learning curves from the training process.

In [None]:
# Extract evaluation results
results = xgb_model.evals_result()

# Plot training curves using ModelVisualizer
fig = model_viz.plot_xgboost_training_curves(
    results=results,
    best_iteration=xgb_model.best_iteration
)
plt.show()

# Print summary
train_auc = results['validation_0']['auc']
test_auc = results['validation_1']['auc']

print(f"Train AUC: {train_auc[0]:.4f} → {train_auc[-1]:.4f} | Test AUC: {test_auc[0]:.4f} → {test_auc[-1]:.4f} (best: {max(test_auc):.4f})")

## 6. Performance Evaluation

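Evaluate model performance and generate predictions on the train and test sets. Because the targets are pseudo-labels derived from the same features, these metrics measure how well the model reproduces the z-score rule, not how well it finds confirmed anomalies.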

In [None]:
# Generate predictions and probabilities
train_pred = xgb_model.predict(X_train)
train_proba = xgb_model.predict_proba(X_train)[:, 1]  # Probability of anomaly

test_pred = xgb_model.predict(X_test)
test_proba = xgb_model.predict_proba(X_test)[:, 1]

# Calculate AUC scores
train_auc_score = roc_auc_score(y_train, train_proba)
test_auc_score = roc_auc_score(y_test, test_proba)

print(f"Performance: Train AUC={train_auc_score:.6f} | Test AUC={test_auc_score:.6f}")
print(f"\nTest Classification Report:")
print(classification_report(y_test, test_pred, target_names=['Normal', 'Anomaly']))
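
For intuition about the AUC figures, ROC AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below computes it that way on a four-row toy example; `auc_from_scores` is a hypothetical helper, and its pairwise comparison is quadratic in memory, so it is meant for illustration rather than for the full dataset.

```python
import numpy as np

def auc_from_scores(y_true, scores) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    diff = pos[:, None] - neg[None, :]  # every positive score minus every negative score
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_from_scores(y_toy, p_toy)
print(toy_auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```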

In [None]:
# Plot confusion matrices using ModelVisualizer
fig = model_viz.plot_confusion_matrices(
    y_train=y_train,
    train_pred=train_pred,
    y_test=y_test,
    test_pred=test_pred
)
plt.show()

In [None]:
# Plot ROC curves using ModelVisualizer
fig = model_viz.plot_roc_curves(
    y_train=y_train,
    train_proba=train_proba,
    y_test=y_test,
    test_proba=test_proba,
    train_auc=train_auc_score,
    test_auc=test_auc_score
)
plt.show()

## 7. Anomaly Score Calculation & Validation

Calculate anomaly scores and validate detected anomalies with payment details.

In [None]:
# Score every row in the original row order. Concatenating X_train and X_test would
# keep the shuffled split order and misalign the scores with df_payments below.
all_data = X_scaled
all_labels = pd.Series(np.asarray(y_pseudo), index=X_scaled.index)

# Get predictions and probabilities for all data
all_pred = xgb_model.predict(all_data)
all_proba = xgb_model.predict_proba(all_data)[:, 1]

# Use probability as anomaly score (higher = more likely anomaly)
anomaly_scores = all_proba

# Set threshold at 0.5 (default) or optimize based on requirements
threshold = 0.5
anomaly_labels = (all_proba >= threshold).astype(int)

anomaly_count = anomaly_labels.sum()
anomaly_percentage = (anomaly_count / len(anomaly_labels)) * 100

print(f"Threshold: {threshold:.4f} | Anomalies: {anomaly_count:,}/{len(anomaly_labels):,} ({anomaly_percentage:.2f}%)")
print(f"Score range: [{anomaly_scores.min():.4f}, {anomaly_scores.max():.4f}] | Mean: {anomaly_scores.mean():.4f}")
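
If the default 0.5 cutoff is not appropriate, one common option is to pick the threshold that maximizes F1 on held-out labeled data. The sketch below shows the idea with NumPy; the helper name, candidate grid, and toy values are assumptions, and in practice the search would run on a validation split rather than on the test set.

```python
import numpy as np

def best_f1_threshold(y_true, proba, candidates):
    """Return the candidate threshold with the highest F1, plus that F1 value."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    best_t, best_f1 = None, -1.0
    for t in candidates:
        pred = proba >= t
        tp = int(np.sum(pred & (y_true == 1)))
        fp = int(np.sum(pred & (y_true == 0)))
        fn = int(np.sum(~pred & (y_true == 1)))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # strict comparison keeps the lowest threshold on ties
            best_t, best_f1 = float(t), f1
    return best_t, best_f1

y_val = [0, 0, 0, 0, 1, 1]
p_val = [0.05, 0.20, 0.55, 0.60, 0.70, 0.90]
t_star, f1_star = best_f1_threshold(y_val, p_val, candidates=[0.25, 0.5, 0.65, 0.8])
print(t_star, f1_star)  # 0.65 1.0
```

In this toy case, 0.5 still lets two normal rows through, while 0.65 separates the classes cleanly.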

### Validation: Inspect Detected Anomalies

Examine the payment details of detected anomalies to validate model performance.

In [None]:
# Create results dataframe with anomaly scores (assigned positionally, so the
# scores must be in the same row order as df_payments)
anomaly_results = df_payments.copy()
anomaly_results['anomaly_score'] = anomaly_scores
anomaly_results['is_anomaly'] = anomaly_labels
anomaly_results['anomaly_score_percentile'] = anomaly_results['anomaly_score'].rank(pct=True) * 100

# Filter anomalies and normal payments
anomalies_df = anomaly_results[anomaly_results['is_anomaly'] == 1].copy()
anomalies_df = anomalies_df.sort_values('anomaly_score', ascending=False)  # Higher scores = more anomalous
normal_df = anomaly_results[anomaly_results['is_anomaly'] == 0]

# Display top anomalies using ModelVisualizer
top_anomalies = model_viz.display_top_anomalies(
    anomalies_df=anomalies_df,
    score_col='anomaly_score',
    top_n=10
)
display(top_anomalies)

# Statistical comparison using ModelVisualizer
comparison_features = ['total_amount_of_payment_usdollars']
optional_comparison = ['amt_to_avg_ratio', 'hist_pay_avg']
for col in optional_comparison:
    if col in anomaly_results.columns:
        comparison_features.append(col)

comparison_stats = model_viz.print_anomaly_stats(
    normal_df=normal_df,
    anomaly_df=anomalies_df,
    score_col='anomaly_score',
    comparison_features=comparison_features
)
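
Attaching scores to the payments frame relies on the scores being in the same row order as the original data. The toy sketch below (all names and values are illustrative assumptions) shows how scores computed on a shuffled train/test concatenation end up on the wrong rows, and how restoring the original order fixes it.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"payment": [10.0, 20.0, 30.0, 40.0]}, index=[100, 101, 102, 103])
features = frame.reset_index(drop=True)  # RangeIndex 0..3, same row order as frame

# Simulate a shuffled split: train holds rows 2, 0, 3 and test holds row 1
train, test = features.iloc[[2, 0, 3]], features.iloc[[1]]

# Toy "score" derived from the payment, computed in the shuffled order
shuffled = pd.concat([train, test])
scores_shuffled = (shuffled["payment"] / 100).to_numpy()

# Sorting by the positional index restores the original row order first
restored = pd.concat([train, test]).sort_index()
scores_ordered = (restored["payment"] / 100).to_numpy()

# NumPy arrays are assigned by position, so order is all that matters here
misaligned = frame.assign(score=scores_shuffled)
aligned = frame.assign(score=scores_ordered)
print(misaligned, aligned, sep="\n\n")
```

In `misaligned`, the 10.0 payment carries the score that belongs to the 30.0 payment; in `aligned`, each score sits on its own row.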

In [None]:
# Anomaly distribution by key categories
categorical_cols = ['covered_recipient_type', 'nature_of_payment_or_transfer_of_value']
optional_categorical = ['is_high_risk_nature', 'is_weekend', 'is_new_recipient']

for col in optional_categorical:
    if col in anomalies_df.columns:
        categorical_cols.append(col)

categorical_cols = [col for col in categorical_cols if col in anomalies_df.columns]

if categorical_cols:
    print("Top Anomaly Categories:")
    for col in categorical_cols[:2]:  # Show only first 2
        print(f"\n{col}:")
        print(anomalies_df[col].value_counts().head(3))

In [None]:
# Visualize payment amount distribution: Normal vs Anomalies using ModelVisualizer
if 'total_amount_of_payment_usdollars' in anomaly_results.columns:
    fig = model_viz.plot_anomaly_comparison(
        normal_df=normal_df,
        anomaly_df=anomalies_df,
        amount_col='total_amount_of_payment_usdollars',
        score_col='anomaly_score'
    )
    plt.show()

## 8. Visualizations & Metrics

Visualize anomaly scores, feature importance, and model performance metrics.

In [None]:
# Plot feature importance using ModelVisualizer
fig, feature_importance = model_viz.plot_feature_importance(
    feature_names=X.columns,
    feature_importances=xgb_model.feature_importances_,
    top_n=15
)
plt.show()

print("\nTop 10 Most Important Features:")
display(feature_importance.head(10))
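
The internals of `ModelVisualizer.plot_feature_importance` are not shown in this notebook. As a hedged sketch, a ranked table with the `feature` and `importance` columns that the summary cell reads could be built as follows; the helper name and toy values are assumptions.

```python
import numpy as np
import pandas as pd

def importance_table(feature_names, importances, top_n=None) -> pd.DataFrame:
    """Build a table of features sorted by descending importance."""
    table = (
        pd.DataFrame({
            "feature": list(feature_names),
            "importance": np.asarray(importances, dtype=float),
        })
        .sort_values("importance", ascending=False)
        .reset_index(drop=True)
    )
    return table if top_n is None else table.head(top_n)

# Toy importances; real values would come from xgb_model.feature_importances_
toy_table = importance_table(
    ["total_amount", "hist_pay_avg", "is_weekend"], [0.2, 0.7, 0.1], top_n=2
)
print(toy_table)
```

With a table in this shape, `iloc[0]` is the top-ranked feature.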

In [None]:
# Plot comprehensive anomaly score analysis using ModelVisualizer
fig = model_viz.plot_anomaly_scores(
    train_scores=train_proba,
    test_scores=test_proba,
    threshold=threshold,
    model_name='XGBoost'
)
plt.show()

## 9. Summary & Outputs

Save model and anomaly detection results for downstream analysis.

In [None]:
# Save model artifacts
with open('cms_xgboost_model.pkl', 'wb') as f:
    pickle.dump(xgb_model, f)

with open('xgboost_scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

anomaly_results.to_csv('anomaly_results_xgboost.csv', index=False)
feature_importance.to_csv('xgboost_feature_importance.csv', index=False)

print("Saved: model, scaler, results, feature importance")
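
The pickled scaler and model both expect features in the training column order. One way to make that explicit for downstream notebooks is to save a small manifest next to the artifacts. This is a sketch under assumptions: the file name `xgboost_feature_manifest.json`, the helper names, and the toy feature names are hypothetical and do not appear elsewhere in the project.

```python
import json
import tempfile
from pathlib import Path

def save_feature_manifest(path, feature_names, threshold) -> dict:
    """Write the feature order and decision threshold to a JSON file."""
    manifest = {"features": list(feature_names), "threshold": float(threshold)}
    Path(path).write_text(json.dumps(manifest, indent=2))
    return manifest

def check_columns(path, incoming_columns) -> bool:
    """Return True only if incoming columns match the saved names and order."""
    expected = json.loads(Path(path).read_text())["features"]
    return list(incoming_columns) == expected

# Demonstrate the round trip in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = Path(tmp) / "xgboost_feature_manifest.json"
    save_feature_manifest(manifest_path, ["total_amount", "hist_pay_avg"], threshold=0.5)
    same_order = check_columns(manifest_path, ["total_amount", "hist_pay_avg"])
    wrong_order = check_columns(manifest_path, ["hist_pay_avg", "total_amount"])

print(same_order, wrong_order)  # True False
```

In this notebook the manifest would be written with `X.columns` and the `threshold` used for scoring.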

In [None]:
# Execution summary
results_summary = pd.DataFrame({
    'Metric': ['Total Records', 'Train/Test Split', 'Anomalies Detected', 
               'Train AUC', 'Test AUC', 'Training Time (s)', 'Top Feature'],
    'Value': [
        len(anomaly_labels), 
        f"{len(X_train):,}/{len(X_test):,}",
        f"{anomaly_count:,} ({anomaly_percentage:.2f}%)",
        f"{train_auc_score:.4f}",
        f"{test_auc_score:.4f}",
        f"{training_time:.2f}",
        f"{feature_importance.iloc[0]['feature']} ({feature_importance.iloc[0]['importance']:.4f})"
    ]
})

display(results_summary)

print(f"\nScore Separation: Normal={normal_df['anomaly_score'].mean():.4f} | Anomaly={anomalies_df['anomaly_score'].mean():.4f} | Gap={anomalies_df['anomaly_score'].mean() - normal_df['anomaly_score'].mean():.4f}")