
1. Setup and data loading
This section imports all required libraries for modeling and loads the provider-level dataset created in Notebook 1. The dataset contains one row per provider, with engineered features summarizing inpatient and outpatient behavior, and a binary PotentialFraud label indicating fraudulent versus legitimate providers.



In [2]:
# 1. Setup and Data Loading

import pandas as pd
import numpy as np

from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score
)

from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings("ignore")

print("Loaded libraries successfully.")

# Load provider-level dataset created by Person 1
provider_level = pd.read_csv("provider_level.csv")

print("Provider-level dataset shape:", provider_level.shape)
provider_level.head()


Loaded libraries successfully.
Provider-level dataset shape: (5410, 55)


Unnamed: 0,Provider,PotentialFraud,Inpatient_ClaimCount,Inpatient_UniquePatients,Inpatient_TotalReimbursed,Inpatient_AvgReimbursed,Inpatient_StdReimbursed,Inpatient_MaxReimbursed,Inpatient_TotalDeductible,Inpatient_AvgDeductible,...,Outpatient_ChronicCond_rheumatoidarthritis,Outpatient_ChronicCond_stroke,Total_ClaimCount,Total_UniquePatients,Total_Reimbursed,Total_DeceasedClaims,AvgReimbursed_PerPatient,AvgClaims_PerPatient,DeceasedClaim_Rate,Inpatient_Ratio
0,PRV51001,0,5.0,5.0,97000.0,19400.0,18352.111595,42000.0,5340.0,1068.0,...,5.0,4.0,25.0,19.0,104640.0,0.0,5507.368421,1.315789,0.0,0.2
1,PRV51003,1,62.0,53.0,573000.0,9241.935484,8513.606244,57000.0,66216.0,1068.0,...,19.0,5.0,132.0,66.0,605670.0,1.0,9176.818182,2.0,0.007576,0.469697
2,PRV51004,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,46.0,17.0,149.0,138.0,52170.0,1.0,378.043478,1.07971,0.006711,0.0
3,PRV51005,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,331.0,124.0,1165.0,495.0,280910.0,4.0,567.494949,2.353535,0.003433,0.0
4,PRV51007,0,3.0,3.0,19000.0,6333.333333,3511.884584,10000.0,3204.0,1068.0,...,21.0,10.0,72.0,56.0,33710.0,1.0,601.964286,1.285714,0.013889,0.041667


2. Define features and target
We define the modeling target as PotentialFraud (1 = fraudulent provider, 0 = legitimate) and drop the Provider identifier from the feature matrix so the model cannot memorize provider IDs. This creates X (features) and y (labels) for supervised learning. The class distribution confirms that the dataset is imbalanced: 506 of 5,410 providers (about 9.4%) are labeled fraudulent.



In [3]:
# 2. Define Features and Target

target_col = "PotentialFraud"
id_cols = ["Provider"]

# Separate features (X) and target (y)
X = provider_level.drop(columns=id_cols + [target_col])
y = provider_level[target_col]

print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("Class distribution (0=Legit, 1=Fraud):", Counter(y))


Features shape: (5410, 53)
Target shape: (5410,)
Class distribution (0=Legit, 1=Fraud): Counter({0: 4904, 1: 506})
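
The class counts printed above give the imbalance directly. A minimal sketch of the arithmetic, with the counts copied from that output rather than recomputed from provider_level.csv:

```python
from collections import Counter

# Class counts copied from the printed output (0 = legit, 1 = fraud)
class_counts = Counter({0: 4904, 1: 506})

total = sum(class_counts.values())
fraud_rate = class_counts[1] / total
imbalance_ratio = class_counts[0] / class_counts[1]

print(f"Providers: {total}")
print(f"Fraud rate: {fraud_rate:.2%}")                      # 9.35%
print(f"Legit-to-fraud ratio: {imbalance_ratio:.1f} : 1")   # 9.7 : 1
```

A classifier that always predicts "legitimate" would therefore be right on roughly 90% of providers, which is why accuracy is not used as a metric later in the notebook.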


3. Stratified train/validation/test split
To enable rigorous evaluation, the data is split 60/20/20 into training, validation, and test sets using stratified sampling (a 60/40 split first, then an even split of the held-out 40%). Stratification preserves the fraud ratio in each split, which is important given the class imbalance. The training set is used for fitting models, the validation set for model comparison and tuning, and the test set is reserved for final performance reporting.

In [4]:
# 3. Train / Validation / Test Split (Stratified)

# First split: train vs temp (val+test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    test_size=0.4,
    random_state=42,
    stratify=y
)

# Second split: validation vs test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.5,
    random_state=42,
    stratify=y_temp
)

print("Train size:", X_train.shape[0], "Class dist:", Counter(y_train))
print("Val size:", X_val.shape[0], "Class dist:", Counter(y_val))
print("Test size:", X_test.shape[0], "Class dist:", Counter(y_test))


Train size: 3246 Class dist: Counter({0: 2942, 1: 304})
Val size: 1082 Class dist: Counter({0: 981, 1: 101})
Test size: 1082 Class dist: Counter({0: 981, 1: 101})
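
To check that stratification did what it is meant to do, the sizes and fraud counts printed above can be turned into shares. This is only a sketch over numbers copied from the output; the dictionary layout is an illustrative choice, not part of the notebook:

```python
# Sizes and fraud counts copied from the split output
splits = {
    "train": {"n": 3246, "fraud": 304},
    "val": {"n": 1082, "fraud": 101},
    "test": {"n": 1082, "fraud": 101},
}
total_rows = sum(s["n"] for s in splits.values())

summary = {}
for name, s in splits.items():
    summary[name] = {
        "share_of_data": s["n"] / total_rows,  # fraction of all providers
        "fraud_rate": s["fraud"] / s["n"],     # fraction of fraud within the split
    }
    print(f"{name:>5}: {summary[name]['share_of_data']:.0%} of rows, "
          f"fraud rate {summary[name]['fraud_rate']:.2%}")
```

The three fraud rates differ only in the second decimal place of the percentage, which is the expected effect of passing stratify to train_test_split.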


4. Feature scaling for linear models
Standardization is applied to the feature matrix for use with Logistic Regression and other models that are sensitive to feature scale. The scaler is fit on the training set only and then applied to the validation and test sets, so no statistics from held-out data influence the transformation. Tree-based models such as Random Forest and Gradient Boosting do not require scaling, so we keep both scaled and unscaled versions of the data to use with different algorithms as appropriate.

In [5]:
# 4. Optional Scaling for Logistic Regression

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

print("Scaling completed for Logistic Regression.")


Scaling completed for Logistic Regression.
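
What fit_transform and transform do per feature can be sketched with the standard library. The values below are hypothetical, and the sketch assumes StandardScaler's default behavior of centering on the mean and dividing by the population standard deviation:

```python
from statistics import fmean, pstdev

# Hypothetical values for a single feature column (not taken from the dataset)
train_col = [5.0, 62.0, 0.0, 0.0, 3.0]
val_col = [10.0, 1.0]

# "fit": learn mean and population standard deviation from training data only
mu = fmean(train_col)
sigma = pstdev(train_col)

# "transform": reuse the training statistics for every split
train_scaled = [(v - mu) / sigma for v in train_col]
val_scaled = [(v - mu) / sigma for v in val_col]

print("train mean/std after scaling:",
      round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
print("val mean after scaling:", round(fmean(val_scaled), 6))
```

The scaled training column has mean 0 and standard deviation 1 by construction. The validation column generally does not, because it is transformed with the training statistics rather than its own.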


5. Evaluation helper function
A reusable helper function is defined to train a model, generate predictions, and compute key metrics on the validation set. The function reports Precision, Recall, F1-score, ROC-AUC, and PR-AUC, which are more informative than accuracy for imbalanced fraud detection and align with the project’s evaluation requirements.

In [6]:
# 5. Helper Function for Model Evaluation

def evaluate_model(name, model, X_tr, y_tr, X_v, y_v):
    """
    Fit model, predict on validation set, and return metrics + predictions.
    """
    # Fit
    model.fit(X_tr, y_tr)

    # Predictions
    y_pred = model.predict(X_v)

    # Probabilities or decision scores
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_v)[:, 1]
    else:
        y_proba = model.decision_function(X_v)

    # Metrics
    metrics = {
        "model": name,
        "precision": precision_score(y_v, y_pred),
        "recall": recall_score(y_v, y_pred),
        "f1": f1_score(y_v, y_pred),
        "roc_auc": roc_auc_score(y_v, y_proba),
        "pr_auc": average_precision_score(y_v, y_proba)
    }

    return metrics, y_pred, y_proba
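
The pr_auc entry comes from average_precision_score, which summarizes the precision-recall curve as a recall-weighted mean of precisions. A minimal pure-Python sketch of that computation follows, assuming no tied scores; the function name average_precision and the toy labels are illustrative and not part of the notebook:

```python
def average_precision(y_true, scores):
    """Average precision for binary labels, assuming all scores are distinct."""
    # Rank examples from highest to lowest score
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            # Precision at this rank; each positive adds a recall step of 1 / n_pos
            total += tp / rank
    return total / n_pos

# Toy example: two fraud cases among five providers
y_toy = [1, 0, 1, 0, 0]
s_toy = [0.9, 0.8, 0.7, 0.3, 0.1]
ap_toy = average_precision(y_toy, s_toy)
print(f"average precision: {ap_toy:.4f}")  # (1/1 + 2/3) / 2 = 0.8333
```

Unlike precision, recall, and F1, this score depends only on how the model ranks providers, not on the 0.5 cutoff used by predict.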


6. Baseline models with class weighting
We first train baseline models without resampling. Logistic Regression and Random Forest handle class imbalance through class_weight="balanced"; HistGradientBoosting is trained without class weighting and serves as an unweighted reference. Logistic Regression provides a simple, interpretable baseline, while Random Forest and HistGradientBoosting provide more powerful non-linear ensembles. Their performance on the validation set establishes reference metrics before applying SMOTE.

In [7]:
# 6. Baseline Models with Class Weights

results_class_weight = {}

# 6.1 Logistic Regression (with class_weight="balanced")
log_reg = LogisticRegression(
    class_weight="balanced",
    max_iter=500,
    n_jobs=-1,
    solver="lbfgs"
)

metrics_log, y_val_pred_log, y_val_proba_log = evaluate_model(
    "LogReg_class_weight",
    log_reg,
    X_train_scaled, y_train,
    X_val_scaled, y_val
)
results_class_weight["LogReg_class_weight"] = metrics_log
print("Logistic Regression (class_weight) metrics:", metrics_log)

# 6.2 Random Forest (with class_weight="balanced")
rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    min_samples_split=5,
    min_samples_leaf=2,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1
)

metrics_rf, y_val_pred_rf, y_val_proba_rf = evaluate_model(
    "RF_class_weight",
    rf,
    X_train, y_train,
    X_val, y_val
)
results_class_weight["RF_class_weight"] = metrics_rf
print("Random Forest (class_weight) metrics:", metrics_rf)

# 6.3 HistGradientBoosting (no explicit class_weight)
gb = HistGradientBoostingClassifier(
    max_depth=8,
    learning_rate=0.1,
    max_iter=200,
    random_state=42
)

metrics_gb, y_val_pred_gb, y_val_proba_gb = evaluate_model(
    "HGB_no_weight",
    gb,
    X_train, y_train,
    X_val, y_val
)
results_class_weight["HGB_no_weight"] = metrics_gb
print("HistGradientBoosting metrics:", metrics_gb)

# Show results in a table
pd.DataFrame(results_class_weight).T


Logistic Regression (class_weight) metrics: {'model': 'LogReg_class_weight', 'precision': 0.4200913242009132, 'recall': 0.9108910891089109, 'f1': 0.575, 'roc_auc': np.float64(0.9476085223201218), 'pr_auc': np.float64(0.7354874881173631)}
Random Forest (class_weight) metrics: {'model': 'RF_class_weight', 'precision': 0.7096774193548387, 'recall': 0.6534653465346535, 'f1': 0.6804123711340206, 'roc_auc': np.float64(0.9481232526922416), 'pr_auc': np.float64(0.7617527567817257)}
HistGradientBoosting metrics: {'model': 'HGB_no_weight', 'precision': 0.7571428571428571, 'recall': 0.5247524752475248, 'f1': 0.6198830409356725, 'roc_auc': np.float64(0.9415629636358132), 'pr_auc': np.float64(0.7305449433028568)}


Unnamed: 0,model,precision,recall,f1,roc_auc,pr_auc
LogReg_class_weight,LogReg_class_weight,0.420091,0.910891,0.575,0.947609,0.735487
RF_class_weight,RF_class_weight,0.709677,0.653465,0.680412,0.948123,0.761753
HGB_no_weight,HGB_no_weight,0.757143,0.524752,0.619883,0.941563,0.730545
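
The precision and recall figures are easier to interpret as counts. The notebook does not print confusion matrices, so the counts below are inferred: they are the integer values that reproduce the reported precision and recall given the 101 fraud cases in the validation set. Treat them as a reading aid rather than as notebook output:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

# Inferred counts (assumption): chosen to match the reported metrics
inferred = {
    "LogReg_class_weight": {"tp": 92, "fp": 127, "fn": 9},
    "RF_class_weight": {"tp": 66, "fp": 27, "fn": 35},
    "HGB_no_weight": {"tp": 53, "fp": 17, "fn": 48},
}

scores = {name: prf(**counts) for name, counts in inferred.items()}
for name, (p, r, f1) in scores.items():
    c = inferred[name]
    print(f"{name:<20} flagged={c['tp'] + c['fp']:>3} "
          f"precision={p:.6f} recall={r:.6f} f1={f1:.6f}")
```

Read this way, Logistic Regression catches 92 of 101 fraudulent providers but flags 219 providers to do so, while Random Forest flags 93 and catches 66.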


7. Handling class imbalance with SMOTE
To further address the minority class problem, we apply SMOTE (Synthetic Minority Oversampling Technique) to the training data only. SMOTE creates synthetic fraudulent providers to balance the classes, giving the models more exposure to minority patterns while keeping validation and test sets untouched, which avoids information leakage and maintains realistic evaluation.

In [8]:
# 7. Handling Imbalance with SMOTE (on training set only)

smote = SMOTE(random_state=42, k_neighbors=5)

X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

print("Before SMOTE:", Counter(y_train))
print("After SMOTE:", Counter(y_train_sm))


Before SMOTE: Counter({0: 2942, 1: 304})
After SMOTE: Counter({0: 2942, 1: 2942})
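
SMOTE builds each synthetic row by picking a minority-class sample, choosing one of its k nearest minority-class neighbors, and placing a new point at a random position on the line segment between them. The sketch below shows only that interpolation step with two hypothetical providers; it is a simplified illustration, not imbalanced-learn's implementation, and smote_like_point is a made-up helper name:

```python
import random

def smote_like_point(x, neighbor, rng):
    """Return a point at a random position on the segment from x to neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)

# Two hypothetical minority-class providers with two features each
x = [3.0, 19000.0]
neighbor = [5.0, 97000.0]

synthetic = smote_like_point(x, neighbor, rng)
print("synthetic point:", synthetic)

# Synthetic fraud rows needed to balance the training set (counts from the output)
n_synthetic = 2942 - 304
print("synthetic fraud rows added:", n_synthetic)
```

Because the distances are computed on raw features here (SMOTE is fit on X_train, not X_train_scaled), features with large ranges such as reimbursement totals dominate the neighbor search.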


8. Models trained on SMOTE-balanced data
Using the SMOTE-balanced training set, we retrain Random Forest and HistGradientBoosting models and evaluate them on the original validation set. This allows direct comparison between class-weight-based and SMOTE-based strategies and shows how oversampling impacts Precision, Recall, F1, ROC-AUC, and PR-AUC for fraud detection.

In [9]:
# 8. Models Trained on SMOTE Data

results_smote = {}

# 8.1 Random Forest on SMOTE data
rf_sm = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)

metrics_rf_sm, y_val_pred_rf_sm, y_val_proba_rf_sm = evaluate_model(
    "RF_SMOTE",
    rf_sm,
    X_train_sm, y_train_sm,
    X_val, y_val
)
results_smote["RF_SMOTE"] = metrics_rf_sm
print("Random Forest (SMOTE) metrics:", metrics_rf_sm)

# 8.2 HistGradientBoosting on SMOTE data
gb_sm = HistGradientBoostingClassifier(
    max_depth=8,
    learning_rate=0.1,
    max_iter=200,
    random_state=42
)

metrics_gb_sm, y_val_pred_gb_sm, y_val_proba_gb_sm = evaluate_model(
    "HGB_SMOTE",
    gb_sm,
    X_train_sm, y_train_sm,
    X_val, y_val
)
results_smote["HGB_SMOTE"] = metrics_gb_sm
print("HistGradientBoosting (SMOTE) metrics:", metrics_gb_sm)

# Show SMOTE-based results in a table
pd.DataFrame(results_smote).T

Random Forest (SMOTE) metrics: {'model': 'RF_SMOTE', 'precision': 0.5826771653543307, 'recall': 0.7326732673267327, 'f1': 0.6491228070175439, 'roc_auc': np.float64(0.9412551346877807), 'pr_auc': np.float64(0.7335651100815841)}
HistGradientBoosting (SMOTE) metrics: {'model': 'HGB_SMOTE', 'precision': 0.6559139784946236, 'recall': 0.6039603960396039, 'f1': 0.6288659793814433, 'roc_auc': np.float64(0.938504859660278), 'pr_auc': np.float64(0.7334079163921718)}


Unnamed: 0,model,precision,recall,f1,roc_auc,pr_auc
RF_SMOTE,RF_SMOTE,0.582677,0.732673,0.649123,0.941255,0.733565
HGB_SMOTE,HGB_SMOTE,0.655914,0.60396,0.628866,0.938505,0.733408


9. Model comparison on validation set
We aggregate the metrics from all model variants (class-weight baselines and SMOTE-trained models) into a single comparison table, sorted by PR-AUC in descending order. This table helps identify which combination of algorithm and imbalance strategy best balances Recall (catching fraud) and Precision (limiting false alarms), with particular attention to PR-AUC as a key metric for imbalanced data.

In [10]:
all_results = {}
all_results.update(results_class_weight)
all_results.update(results_smote)

results_df = pd.DataFrame(all_results).T.sort_values(by="pr_auc", ascending=False)
results_df

Unnamed: 0,model,precision,recall,f1,roc_auc,pr_auc
RF_class_weight,RF_class_weight,0.709677,0.653465,0.680412,0.948123,0.761753
LogReg_class_weight,LogReg_class_weight,0.420091,0.910891,0.575,0.947609,0.735487
RF_SMOTE,RF_SMOTE,0.582677,0.732673,0.649123,0.941255,0.733565
HGB_SMOTE,HGB_SMOTE,0.655914,0.60396,0.628866,0.938505,0.733408
HGB_no_weight,HGB_no_weight,0.757143,0.524752,0.619883,0.941563,0.730545


10. Hyperparameter tuning for RF with SMOTE
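To refine the best-performing SMOTE-based model, we perform lightweight hyperparameter tuning of Random Forest trained on SMOTE data using GridSearchCV. A small parameter grid over tree depth, number of trees, and minimum samples per split is searched with 3-fold cross-validation, optimizing F1-score to balance Precision and Recall. Because the folds are drawn from data that has already been oversampled, synthetic points in a held-out fold can be interpolated from real points in the training folds, so the cross-validated F1 is likely optimistic; the evaluation on the original validation set is the more reliable check. The tuned RF_SMOTE_tuned model is then evaluated on the validation set and compared to previous variants.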

In [11]:
# -------------------------------------------------------------
# Lightweight Random Forest Hyperparameter Tuning (RF + SMOTE)
# -------------------------------------------------------------

print("Starting lightweight RF_SMOTE tuning (GridSearch on SMOTE training data)...")

from sklearn.model_selection import GridSearchCV

# Base RF model (no class_weight, since SMOTE already balances classes)
rf_base_smote = RandomForestClassifier(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)

# Small grid to keep runtime reasonable
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
}

grid_smote = GridSearchCV(
    rf_base_smote,
    param_grid,
    scoring="f1",   # you can also try 'recall' or 'average_precision'
    cv=3,
    n_jobs=-1,
    verbose=1
)

# IMPORTANT: tuning uses SMOTE training data ONLY
grid_smote.fit(X_train_sm, y_train_sm)

best_rf_smote_tuned = grid_smote.best_estimator_
print("Best params (RF_SMOTE):", grid_smote.best_params_)

# -------------------------------------------------------------
# Evaluate tuned RF_SMOTE on the original validation set
# (evaluate_model refits the estimator on the SMOTE training data first)
# -------------------------------------------------------------
rf_smote_tuned_results, _, _ = evaluate_model(
    "RF_SMOTE_tuned",
    best_rf_smote_tuned,
    X_train_sm,   # training data (SMOTE)
    y_train_sm,   # training labels (SMOTE)
    X_val,        # validation data (original)
    y_val         # validation labels (original)
)

print("\nTUNED RF_SMOTE Results:")
print(rf_smote_tuned_results)


Starting lightweight RF_SMOTE tuning (GridSearch on SMOTE training data)...
Fitting 3 folds for each of 8 candidates, totalling 24 fits
Best params (RF_SMOTE): {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 100}

TUNED RF_SMOTE Results:
{'model': 'RF_SMOTE_tuned', 'precision': 0.6186440677966102, 'recall': 0.7227722772277227, 'f1': 0.6666666666666666, 'roc_auc': np.float64(0.9382929118599932), 'pr_auc': np.float64(0.7211265986760231)}
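
The "8 candidates, 24 fits" line in the log follows from the grid size. A small sketch that enumerates the same combinations with itertools, mirroring param_grid and cv=3 from the cell above (the enumeration is illustrative, GridSearchCV does this internally):

```python
from itertools import product

# Same grid and fold count as in the tuning cell
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
}
cv_folds = 3

# Cartesian product of all parameter values
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print("candidates:", len(candidates))  # 2 * 2 * 2 = 8
print("cross-validation fits:", total_fits)  # 8 * 3 = 24
```

The selected combination (max_depth=None, min_samples_split=2, n_estimators=100) is one of these eight, and it sits at the least constrained end of the depth and split settings.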


11. Final training and test evaluation
The final model, RF_SMOTE_tuned, is retrained on the combined training and validation data, with SMOTE applied again to create a balanced full training set. We then evaluate this model on the untouched test set to obtain final performance estimates in terms of Precision, Recall, F1, ROC-AUC, and PR-AUC, which represent how well the model can prioritize high-risk providers in a realistic setting.

In [12]:
# 11. Train Final Model on Train + Validation, Evaluate on Test

# Final chosen model: tuned SMOTE RF from GridSearch
final_model_name = "RF_SMOTE_tuned"
final_model = best_rf_smote_tuned

# Rebuild full train (train + val)
X_train_full = pd.concat([X_train, X_val], axis=0)
y_train_full = pd.concat([y_train, y_val], axis=0)

# Apply SMOTE again on full train because final model uses SMOTE
X_train_full_sm, y_train_full_sm = smote.fit_resample(X_train_full, y_train_full)
print("Full train after SMOTE:", Counter(y_train_full_sm))

# Fit final model
final_model.fit(X_train_full_sm, y_train_full_sm)

# Evaluate on test set (tree model → use original features)
X_test_used = X_test

y_test_pred = final_model.predict(X_test_used)

# Get probabilities
if hasattr(final_model, "predict_proba"):
    y_test_proba = final_model.predict_proba(X_test_used)[:, 1]
else:
    y_test_proba = final_model.decision_function(X_test_used)

# Compute test metrics
test_metrics = {
    "precision": precision_score(y_test, y_test_pred),
    "recall": recall_score(y_test, y_test_pred),
    "f1": f1_score(y_test, y_test_pred),
    "roc_auc": roc_auc_score(y_test, y_test_proba),
    "pr_auc": average_precision_score(y_test, y_test_proba)
}

print("Final model:", final_model_name)
print("Test metrics:", test_metrics)


Full train after SMOTE: Counter({0: 3923, 1: 3923})
Final model: RF_SMOTE_tuned
Test metrics: {'precision': 0.5037037037037037, 'recall': 0.6732673267326733, 'f1': 0.576271186440678, 'roc_auc': np.float64(0.9217660298139905), 'pr_auc': np.float64(0.6272543970482534)}


12. Saving model and test predictions
For downstream evaluation and error analysis in Notebook 3, the final tuned model is saved as best_model.pkl. We also generate and save test predictions (including provider IDs, true labels, predicted labels, and fraud probabilities) to test_predictions_for_evaluation.csv. This enables Person 3 to compute confusion matrices, ROC/PR curves, perform case studies on false positives and false negatives, and carry out cost-based analyses without retraining the model.

In [13]:
from joblib import dump
import pandas as pd

OUTPUT_PATH = "/content/test_predictions_for_evaluation.csv"
MODEL_PATH = "/content/best_model.pkl"

# Use the tuned SMOTE RF as final model
final_model = best_rf_smote_tuned

# Save tuned model
dump(final_model, MODEL_PATH)

# Save predictions on ORIGINAL test set
test_pred = final_model.predict(X_test)
test_pred_proba = final_model.predict_proba(X_test)[:, 1]

pd.DataFrame({
    # X_test.index holds index labels, so look providers up with .loc
    "Provider": provider_level.loc[X_test.index, "Provider"].values,
    "y_true": y_test.values,
    "y_pred": test_pred,
    "y_proba": test_pred_proba
}).to_csv(OUTPUT_PATH, index=False)

print("Saved to:", OUTPUT_PATH)
print("Saved model:", MODEL_PATH)


Saved to: /content/test_predictions_for_evaluation.csv
Saved model: /content/best_model.pkl


In [14]:
df = pd.read_csv("/content/test_predictions_for_evaluation.csv")
df.head(10)


Unnamed: 0,Provider,y_true,y_pred,y_proba
0,PRV55252,0,0,0.02
1,PRV56512,0,0,0.02
2,PRV57021,1,0,0.33
3,PRV52342,0,1,0.78
4,PRV53841,0,0,0.0
5,PRV56392,1,0,0.46
6,PRV56267,0,0,0.01
7,PRV55663,0,0,0.01
8,PRV52011,0,0,0.0
9,PRV54986,1,1,1.0
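
The saved y_proba column is what makes threshold and cost analysis possible without retraining. As a sketch, the ten rows shown above can be re-scored at different cutoffs; ten rows are far too few to draw conclusions from, and the helper confusion_at is a hypothetical name used only for illustration:

```python
# (y_true, y_proba) pairs copied from the ten rows shown above
rows = [
    (0, 0.02), (0, 0.02), (1, 0.33), (0, 0.78), (0, 0.0),
    (1, 0.46), (0, 0.01), (0, 0.01), (0, 0.0), (1, 1.0),
]

def confusion_at(threshold, rows):
    """Count TP, FP, FN, TN when flagging every row with y_proba >= threshold."""
    tp = sum(1 for y, p in rows if p >= threshold and y == 1)
    fp = sum(1 for y, p in rows if p >= threshold and y == 0)
    fn = sum(1 for y, p in rows if p < threshold and y == 1)
    tn = sum(1 for y, p in rows if p < threshold and y == 0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

for threshold in (0.5, 0.3):
    print(threshold, confusion_at(threshold, rows))
```

On these rows a cutoff of 0.5 reproduces the y_pred column (one fraud case caught, two missed), while lowering it to 0.3 catches all three fraud cases without adding a false positive. Whether that holds on the full test set is a question for the evaluation notebook.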
