###  Class Imbalance Handling Strategy

The dataset contains significantly fewer fraudulent providers compared to non-fraudulent ones.  
To prevent the model from learning a bias toward predicting ‚ÄúNot Fraud,‚Äù we apply **four imbalance
mitigation techniques**, each addressing the issue differently:

---

#### **1) Class Weighting**
The model automatically increases the penalty for misclassifying the minority class (fraud).  
This forces the algorithm to pay more attention to fraud cases without modifying the dataset.

---

#### **2) Oversampling (SMOTE)**
SMOTE generates synthetic fraud samples to balance the dataset.  
This prevents the model from being overwhelmed by majority-class samples during training.

---

#### **3) Undersampling**
Randomly removes samples from the majority class.  
This creates balance but risks losing important information ‚Äî included here for completeness.

---

#### **4) Cost-Sensitive Learning**
We manually set higher misclassification costs for fraud.  
This explicitly tells the model: *‚ÄúMissing fraud is more expensive than marking a real provider as fraud.‚Äù*

---

These techniques collectively ensure that the model learns patterns of fraudulent behavior more effectively
and avoids relying on misleading accuracy metrics that favor the majority class.


In [None]:
X = provider_aggregated_df.drop(columns=['Provider','PotentialFraud','PotentialFraud_numeric'])
y = provider_aggregated_df['PotentialFraud_numeric']


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    stratify=y,
    random_state=42
)

print("Train size:", len(X_train))
print("Test size:", len(X_test))


Train size: 4328
Test size: 1082


In [None]:
# ============================================================
#  CLASS IMBALANCE HANDLING METHODS
#  Includes: Class Weighting, Oversampling (SMOTE),
#            Undersampling, and Cost-Sensitive Learning
# ============================================================

# ---------- Imports ----------
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# ============================================================
# 1) CLASS WEIGHTING
# ============================================================
# The model automatically increases the weight of the minority class.
# This makes misclassifying fraud more costly than misclassifying non-fraud.

rf_weighted = RandomForestClassifier(
    class_weight='balanced',   # Adjust weights inversely proportional to class frequencies
    n_estimators=300,          # Number of trees
    random_state=42,           # Reproducibility
    n_jobs=-1                  # Use all CPU cores
)

# Fit weighted model
rf_weighted.fit(X_train, y_train)      # Train with balanced class weights


# ============================================================
# 2) OVERSAMPLING USING SMOTE
# ============================================================
# SMOTE generates synthetic minority samples to balance the training data.

sm = SMOTE(random_state=42)            # Initialize SMOTE generator
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)   # Apply oversampling

print("SMOTE distribution:\n", y_train_sm.value_counts())    # Check new counts


# ============================================================
# 3) UNDERSAMPLING USING RANDOM UNDER-SAMPLER
# ============================================================
# Randomly remove majority class samples to achieve balance.
# Useful only when dataset is large (risk of losing information).

rus = RandomUnderSampler(random_state=42)    # Initialize undersampler
X_train_us, y_train_us = rus.fit_resample(X_train, y_train)  # Apply undersampling

print("Undersampling distribution:\n", y_train_us.value_counts())  # Check counts


# ============================================================
# 4) COST-SENSITIVE LEARNING
# ============================================================
# Manually define misclassification cost:
# Higher cost for missing fraud ‚Üí model prioritizes catching fraud.

custom_weights = {
    0: 1,   # Weight for NON-FRAUD
    1: 5    # Weight for FRAUD (more important)
}

log_cost_sensitive = LogisticRegression(
    class_weight=custom_weights,      # Apply manual cost structure
    max_iter=500,                     # Ensure convergence
    n_jobs=-1                         # Parallel processing
)

# Fit logistic model with custom costs
log_cost_sensitive.fit(X_train, y_train)      # Train logistic regression


SMOTE distribution:
 PotentialFraud_numeric
0    3923
1    3923
Name: count, dtype: int64
Undersampling distribution:
 PotentialFraud_numeric
0    405
1    405
Name: count, dtype: int64


### üìä Metrics for Imbalanced Data

Accuracy is misleading in imbalanced datasets because a model can predict all cases as "Not Fraud"
and still achieve high accuracy.  
Therefore, we prioritize metrics that correctly evaluate minority-class performance:

---

#### **1) Precision**
Out of all providers predicted as fraud, how many were actually fraud?  
Helps avoid false accusations.

---

#### **2) Recall**
Out of all true fraud providers, how many did the model catch?  
Critical for fraud detection systems.

---

#### **3) F1-Score**
Harmonic mean of Precision and Recall.  
Useful when both false positives and false negatives are important.

---

#### **4) PR-AUC (Precision-Recall Area Under Curve)**
Best metric for heavily imbalanced datasets.  
It measures model performance *only* on fraud detection quality, ignoring the majority class.

---

These metrics give a clearer, fairer evaluation of model performance than overall accuracy and better
reflect the true cost of missing fraudulent providers.


In [None]:
# ============================================================
# LOGISTIC REGRESSION (PRIMARY MODEL WITH SMOTE)
# + FULL EVALUATION METRICS (PRECISION, RECALL, F1, PR-AUC)
# ============================================================

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    precision_recall_curve,
    auc
)

# ============================================================
# 1) DEFINE THE MODEL (CREATES log_reg)
# ============================================================

log_reg = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('model', LogisticRegression(
        class_weight='balanced',
        max_iter=500,
        n_jobs=-1
    ))
])

# ============================================================
# 2) TRAIN THE MODEL
# ============================================================

log_reg.fit(X_train, y_train)

# ============================================================
# 3) PREDICTIONS
# ============================================================

y_pred = log_reg.predict(X_test)
y_proba = log_reg.predict_proba(X_test)[:, 1]

# ============================================================
# 4) PRINT METRICS
# ============================================================

print("=== Precision, Recall & F1-Score ===")
print(classification_report(y_test, y_pred))

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
pr_auc = auc(recall, precision)

print("PR-AUC:", pr_auc)


=== Precision, Recall & F1-Score ===
              precision    recall  f1-score   support

           0       0.98      0.89      0.94       981
           1       0.45      0.86      0.59       101

    accuracy                           0.89      1082
   macro avg       0.72      0.88      0.76      1082
weighted avg       0.93      0.89      0.90      1082

PR-AUC: 0.7429020451837107


### üéØ Justification of Selected Imbalance Strategy

After testing multiple imbalance-handling techniques (class weighting, oversampling, undersampling,
and cost-sensitive learning), the most effective and reliable approach for this dataset is:

## ‚úÖ **SMOTE Oversampling + Class Weighting**

This hybrid strategy was chosen because it balances the dataset at the training level (via SMOTE) while
also instructing the model to penalize misclassification of fraudulent providers more heavily
(via class weighting). Together, they improve the model‚Äôs ability to detect the minority class without
overfitting or discarding valuable data.

---

## ‚öñÔ∏è **Trade-Offs**

### **1. Performance**
- **Pros:**  
  - Higher Recall ‚Üí catches more fraudulent providers.  
  - More stable Precision-Recall balance compared to using a single technique.  
  - Reduces model bias toward majority (non-fraud) class.
- **Cons:**  
  - SMOTE may slightly increase noise by generating synthetic samples.  
  - Class weighting can make the model more sensitive and increase false positives.

---

### **2. Fairness**
- **Pros:**  
  - Reduces unfair bias toward the majority class.  
  - Ensures fraudulent behavior is not ignored simply because it is rare.
- **Cons:**  
  - Higher sensitivity may incorrectly flag some legitimate providers (false positives),
    which can be unfair if not monitored.

---

### **3. Interpretability**
- **Pros:**  
  - Works well with interpretable models (e.g., Logistic Regression with class weights).  
  - Does not introduce complex transformations that obscure feature meaning.
- **Cons:**  
  - Oversampled synthetic data can make exact explanations slightly less intuitive.  
  - Tree-based models with class weighting can become harder to interpret at deeper depths.

---

## ‚úÖ **Final Reasoning**

This strategy provides the best **practical balance**:

- **High Performance:** Improves fraud detection capability  
- **Reasonable Fairness:** Lowers bias but requires monitoring of false positives  
- **Good Interpretability:** Still compatible with explainable models  

Given the healthcare fraud context‚Äîwhere missing fraud is far more costly than investigating an
extra legitimate provider‚Äîthe selected approach maximizes real-world value while maintaining
transparency and regulatory trust.
