### Algorithm Evaluation for Fraud Detection

We evaluate a set of widely used machine learning algorithms that are suitable for structured, tabular,
and highly imbalanced fraud-detection data. Each algorithm offers different strengths in terms of
interpretability, robustness, and detection capability.

Below are the selected models and the reasons for including them:

---

### **1) Logistic Regression**
A simple and highly interpretable baseline model.  
Useful for understanding linear relationships and providing transparent decision boundaries.

---

### **2) Decision Tree**
Provides clear, rule-based decisions that investigators can easily understand.  
Good for identifying important feature splits.
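
To make the "rule-based decisions" point concrete, here is a minimal sketch that prints the learned splits of a shallow tree as readable rules. It runs on synthetic data from `make_classification`, and the feature names `f0` to `f3` are placeholders rather than columns from the fraud dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the fraud data: roughly 10% positives, 4 numeric features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.9, 0.1], random_state=42,
)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

# A shallow tree keeps the rule list short enough to read.
demo_tree = DecisionTreeClassifier(
    max_depth=2, class_weight="balanced", random_state=42
)
demo_tree.fit(X_demo, y_demo)

# export_text renders every split and leaf as an indented if/else rule.
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)
```

Each printed branch is a threshold test on one feature, which is the kind of rule an investigator can check by hand.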

---

### **3) Random Forest**
An ensemble of many decision trees.  
Robust to outliers, handles non-linear relationships, and naturally supports class weighting.  
Commonly used in fraud detection due to strong performance.

---

### **4) Gradient Boosting (e.g., XGBoost / GradientBoostingClassifier)**
Builds trees sequentially, with each new tree correcting the errors of the ensemble so far.  
Often achieves high accuracy on complex, non-linear fraud patterns, although it is not guaranteed to outperform simpler models on every dataset.
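
The sequential error correction can be observed directly. The sketch below is illustrative only: it uses synthetic data and 50 trees, both chosen for this example, and tracks the training log loss after each added tree with `staged_predict_proba`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: not the fraud dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

gb_demo = GradientBoostingClassifier(n_estimators=50, random_state=42)
gb_demo.fit(X_tr, y_tr)

# staged_predict_proba yields the ensemble's probabilities after each tree,
# so the list holds one training-loss value per boosting stage.
train_losses = [log_loss(y_tr, p) for p in gb_demo.staged_predict_proba(X_tr)]
print(f"loss after 1 tree: {train_losses[0]:.4f}")
print(f"loss after {len(train_losses)} trees: {train_losses[-1]:.4f}")
```

Running the same loop on `X_te` instead of `X_tr` gives a quick way to see where extra trees stop helping on held-out data.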

---

### **5) Support Vector Machine (SVM)**
Effective in high-dimensional spaces and able to capture subtle, non-linear patterns through kernels.  
Works well when the minority class can be separated from the majority by a clear margin.

---

Evaluating these algorithms helps us compare interpretability vs. predictive power and select the most
appropriate model for fraud detection.


In [None]:
# ============================================================
#  ALGORITHM EVALUATION FOR FRAUD DETECTION
# ============================================================

# ---------- Imports ----------
from imblearn.pipeline import Pipeline        # imblearn's Pipeline applies samplers such as SMOTE during fit only
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report, precision_recall_curve, auc

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Each pipeline below builds its own SMOTE step with random_state=42.

# ============================================================
# Logistic Regression
# ============================================================
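log_reg = Pipeline([
    ('smote', SMOTE(random_state=42)),    # Oversampling (training data only)
    ('model', LogisticRegression(
        class_weight='balanced',          # near-uniform weights once SMOTE has balanced the classes
        max_iter=500                      # n_jobs removed: it has no effect on a single binary fit
    ))
])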

log_reg.fit(X_train, y_train)
y_pred_lr = log_reg.predict(X_test)
y_proba_lr = log_reg.predict_proba(X_test)[:, 1]


# ============================================================
# Decision Tree
# ============================================================
tree = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('model', DecisionTreeClassifier(
        class_weight='balanced',
        random_state=42
    ))
])

tree.fit(X_train, y_train)
y_pred_tree = tree.predict(X_test)
y_proba_tree = tree.predict_proba(X_test)[:, 1]


# ============================================================
# Random Forest
# ============================================================
rf = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('model', RandomForestClassifier(
        n_estimators=300,
        class_weight='balanced',
        random_state=42,
        n_jobs=-1
    ))
])

rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
y_proba_rf = rf.predict_proba(X_test)[:, 1]


# ============================================================
# Gradient Boosting
# ============================================================
gb = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('model', GradientBoostingClassifier(
        random_state=42
    ))
])

gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)
y_proba_gb = gb.predict_proba(X_test)[:, 1]


# ============================================================
# SVM
# ============================================================
svm_model = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('model', SVC(
        class_weight='balanced',
        probability=True,
        kernel='rbf',
        random_state=42
    ))
])

svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)
y_proba_svm = svm_model.predict_proba(X_test)[:, 1]


# ============================================================
# PRINT METRICS
# ============================================================

models = {
    "Logistic Regression": (y_pred_lr, y_proba_lr),
    "Decision Tree": (y_pred_tree, y_proba_tree),
    "Random Forest": (y_pred_rf, y_proba_rf),
    "Gradient Boosting": (y_pred_gb, y_proba_gb),
    "SVM": (y_pred_svm, y_proba_svm)
}

for name, (pred, proba) in models.items():
    print(f"\n==================== {name} ====================")
    print(classification_report(y_test, pred))

    precision, recall, thresholds = precision_recall_curve(y_test, proba)
    pr_auc = auc(recall, precision)
    print("PR-AUC:", pr_auc)



==================== Logistic Regression ====================
              precision    recall  f1-score   support

           0       0.98      0.89      0.94       981
           1       0.45      0.86      0.59       101

    accuracy                           0.89      1082
   macro avg       0.72      0.88      0.76      1082
weighted avg       0.93      0.89      0.90      1082

PR-AUC: 0.7429020451837107

==================== Decision Tree ====================
              precision    recall  f1-score   support

           0       0.96      0.91      0.93       981
           1       0.41      0.59      0.48       101

    accuracy                           0.88      1082
   macro avg       0.68      0.75      0.71      1082
weighted avg       0.90      0.88      0.89      1082

PR-AUC: 0.5200577311871291

==================== Random Forest ====================
              precision    recall  f1-score   support

           0       0.97      0.93      0.95       981
           1       0.50      0.72      0.59       101

    accuracy                           0.91      1082
   macro avg       0.74      0.82      0.77      1082
weighted avg     

### 2) Model Comparison by Interpretability, Computational Cost, Robustness to Imbalance, and Suitability for Mixed Data

To determine which algorithm best fits the fraud detection problem, we evaluated each model according
to four critical dimensions:

---

#### Interpretability
Interpretability is essential in healthcare fraud detection since investigators must explain why a
provider was flagged.

- **Highly interpretable:** Logistic Regression, Decision Tree  
- **Moderately interpretable:** Random Forest (via feature importance)  
- **Low interpretability:** Gradient Boosting, SVM  
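
As a rough illustration of the "feature importance" route to interpreting a Random Forest, the sketch below ranks features by impurity-based importance. The data is synthetic and the names `f0` to `f4` are placeholders. Impurity-based importances can favor high-cardinality features, so `sklearn.inspection.permutation_importance` is worth considering as a cross-check.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 2 informative features, 3 pure-noise features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)
feature_names = [f"f{i}" for i in range(5)]  # hypothetical feature names

rf_demo = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
rf_demo.fit(X_demo, y_demo)

# feature_importances_ is normalized, so the scores sum to 1.
importances = rf_demo.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda t: t[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```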

---

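#### Computational Feasibility
- **Fastest models:** Logistic Regression, Decision Tree  
- **Moderate cost:** Random Forest  
- **Heaviest models:** Gradient Boosting and SVM. Kernel SVM training scales at least quadratically with the number of samples, and `probability=True` adds an internal cross-validation pass.  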

Logistic Regression and Random Forest are computationally efficient enough for large-scale deployment.

---

#### Robustness to Class Imbalance
Given the strong imbalance in fraud detection datasets:

- **Most robust:** Logistic Regression & Random Forest  
  - Both support class weighting and work well with SMOTE  
- **Moderate:** Gradient Boosting (scikit-learn's `GradientBoostingClassifier` has no `class_weight` parameter, so it relies on SMOTE alone)  
- **Weak:** Decision Tree and SVM showed lower precision and PR-AUC  
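
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count_of_class)`. The sketch below checks that formula against scikit-learn's `compute_class_weight` on hypothetical labels (900 legitimate, 100 fraudulent). These counts are made up for the example and are not the training-set counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical training labels: 900 legitimate (0) and 100 fraudulent (1).
y_demo = np.array([0] * 900 + [1] * 100)
classes = np.unique(y_demo)

# What scikit-learn computes for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# The same formula written out by hand.
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes.tolist(), weights.tolist())))
```

With these counts the fraud class receives nine times the weight of the legitimate class. Once SMOTE has equalized the class counts, the same formula gives both classes a weight of 1, so combining SMOTE with `class_weight='balanced'` adds little beyond SMOTE alone.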

---

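#### Suitability for Mixed Data
The aggregated dataset combines numerical, categorical, and engineered features.

- **Tree-based models (RF, GB, DT)** need no feature scaling and handle features on different scales well, although scikit-learn's implementations still require categorical features to be encoded numerically  
- **Logistic Regression & SVM** require encoding and scaling; Logistic Regression nevertheless performed strongly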
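
One common way to handle the encoding and scaling step is a `ColumnTransformer` placed in front of the model. The sketch below assumes a small, made-up provider-level frame. The column names `total_claims`, `avg_claim_amount`, `provider_type`, and `is_fraud` are hypothetical and do not come from the actual dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical provider-level data (all values invented for illustration).
df_demo = pd.DataFrame({
    "total_claims": [12, 340, 25, 410, 18, 390],
    "avg_claim_amount": [150.0, 980.0, 210.0, 1200.0, 175.0, 1100.0],
    "provider_type": ["clinic", "hospital", "clinic", "hospital", "lab", "hospital"],
    "is_fraud": [0, 1, 0, 1, 0, 1],
})
numeric_cols = ["total_claims", "avg_claim_amount"]
categorical_cols = ["provider_type"]
feature_cols = numeric_cols + categorical_cols

# Scale numeric columns, one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=500)),
])
clf.fit(df_demo[feature_cols], df_demo["is_fraud"])

# Inspect the encoded design matrix: 2 scaled columns + 3 one-hot columns.
X_encoded = clf.named_steps["preprocess"].transform(df_demo[feature_cols])
print(X_encoded.shape)
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fitted on training data only, which avoids leaking test-set statistics.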

---

### Summary
Random Forest and Logistic Regression offer the best combination of interpretability, computational
efficiency, robustness to imbalance, and adaptability to mixed data. Logistic Regression achieved the
highest PR-AUC (0.743) despite its encoding and scaling requirements.


### 3) Justification of the Primary Model Choice

Based on the empirical performance metrics and alignment with dataset characteristics, **Logistic
Regression is selected as the primary fraud detection model**.

---

#### Best Performance on Imbalanced Metrics
Logistic Regression achieved the **highest PR-AUC (0.743)** among all models.  
PR-AUC summarizes precision and recall for the fraud class across all thresholds, so it is more
informative than accuracy on imbalanced data; the high value indicates superior ranking performance
for fraudulent cases.

It also achieved:
- **Recall = 0.86** (second highest overall)
- **F1-score = 0.59** (tied for highest)

This demonstrates a strong ability to detect fraud at the default threshold, although fraud-class
precision is modest (0.45), so more than half of the flagged cases are false positives.
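
To put the PR-AUC figure in context: the evaluation code computes it with `auc(recall, precision)`, which integrates the curve with the trapezoidal rule, while scikit-learn's `average_precision_score` is a step-wise alternative that avoids interpolating between points. A no-skill classifier scores roughly the fraud prevalence, which is about 101 / 1082 ≈ 0.09 on the test set above. The sketch below compares the two summaries against that kind of baseline on synthetic labels and scores, which are assumptions made for this example.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

rng = np.random.default_rng(42)

# Hypothetical labels (about 10% positives) and scores that are higher,
# on average, for the positive class.
y_true = (rng.random(1000) < 0.1).astype(int)
y_score = 0.3 * y_true + rng.normal(0.3, 0.15, size=1000)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

pr_auc_trapezoid = auc(recall, precision)                 # as in the evaluation loop
avg_precision = average_precision_score(y_true, y_score)  # step-wise summary
baseline = y_true.mean()                                  # no-skill reference level

print(f"trapezoidal PR-AUC: {pr_auc_trapezoid:.3f}")
print(f"average precision:  {avg_precision:.3f}")
print(f"prevalence:         {baseline:.3f}")
```

The two summaries are usually close; reporting the prevalence alongside either one makes it easier to judge how far a model is above chance.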

---

#### Interpretability (Critical for Medicare Compliance)
Fraud detection requires clear explanations for flagged providers.  
Logistic Regression provides:
- transparent coefficients  
- easy-to-understand feature contributions  
- simple threshold-based decisions  

This makes it ideal for auditing and regulatory environments.
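
The three properties listed above can be read straight off a fitted model. The sketch below uses synthetic, standardized data with placeholder feature names `f0` to `f3`. It converts coefficients to odds ratios, breaks one prediction into per-feature contributions to the log-odds, and applies a 0.5 threshold, which is the default and not necessarily the threshold an audit team would choose.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the fraud dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

# Standardizing puts coefficients on a comparable scale.
X_scaled = StandardScaler().fit_transform(X_demo)
lr_demo = LogisticRegression(class_weight="balanced", max_iter=500)
lr_demo.fit(X_scaled, y_demo)

# Transparent coefficients: exp(coef) is the odds ratio per one standard deviation.
coefs = lr_demo.coef_[0]
odds_ratios = np.exp(coefs)
for name, c, o in zip(feature_names, coefs, odds_ratios):
    print(f"{name}: coef={c:+.3f}, odds ratio={o:.3f}")

# Feature contributions for a single case: each term adds to the log-odds.
x0 = X_scaled[0]
contributions = coefs * x0
log_odds = contributions.sum() + lr_demo.intercept_[0]
prob = 1.0 / (1.0 + np.exp(-log_odds))

# Threshold-based decision.
flagged = prob >= 0.5
print(f"fraud probability={prob:.3f}, flagged={flagged}")
```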

---

#### Alignment with Dataset Characteristics
The results suggest that the engineered features are close to linearly separable, which Logistic
Regression can model effectively.  
It also works well with:
- SMOTE oversampling  
- class_weight balancing  
- aggregated numerical features  

This contributes to its strong PR-AUC and recall.

---

#### Final Statement
Given its superior PR-AUC, strong recall, interpretability, computational efficiency, and alignment
with dataset structure, **Logistic Regression is selected as the primary model**.  
Random Forest is used as the secondary comparison model because of its higher fraud-class precision
(0.50 versus 0.45) and robustness.


### 1) Implementation of Additional Comparison Models

To benchmark the performance of the primary model (Logistic Regression), we implemented two additional
models commonly used in fraud detection:

#### **Random Forest Classifier**
Selected for its robustness, ability to model complex interactions, and high precision on our dataset.
It serves as a strong ensemble-based baseline.

#### **Gradient Boosting Classifier**
Chosen for its ability to capture non-linear relationships and sequentially improve on weak learners.
It provides a high-recall alternative and strong ranking ability.

Both comparison models were trained under the same conditions as the primary model using:
- SMOTE oversampling  
- class weighting (when applicable)  
- identical train–test splits and evaluation metrics  

This ensures a fair and consistent comparison of model performance.
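
A minimal sketch of such a like-for-like comparison is shown below. It uses synthetic data and, to stay dependency-light, leaves out the SMOTE step; in the notebook each model would sit inside an `imblearn` pipeline instead. What it illustrates is the shared stratified split and the shared metric code, with PR-AUC computed via `auc(recall, precision)` as in the evaluation loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: not the fraud dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# One stratified split, reused by every model.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

candidates = {
    "Logistic Regression": LogisticRegression(class_weight="balanced", max_iter=500),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42
    ),
    # GradientBoostingClassifier has no class_weight parameter.
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    precision, recall, _ = precision_recall_curve(y_te, proba)
    results[name] = {
        "pr_auc": auc(recall, precision),
        "recall": recall_score(y_te, model.predict(X_te)),
    }

for name, metrics in results.items():
    print(f"{name}: PR-AUC={metrics['pr_auc']:.3f}, recall={metrics['recall']:.3f}")
```

Collecting the metrics in one dictionary keeps the comparison table and the evaluation code in sync when a model is added or removed.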
