# **Enhancing Healthcare Insurance Fraud Detection with Machine Learning & AI**
#### By: Soham Kaje

## **Introduction**


The **Healthcare Insurance Fraud Detection** project explores the critical issue of fraudulent behavior within the U.S. healthcare system, particularly among Medicare providers. Fraudulent claims not only burden insurance companies with excessive costs but also contribute to rising premiums and reduced access to affordable care for patients. Despite existing regulatory efforts, many fraudulent activities remain undetected due to the complexity and scale of healthcare data.

This project aims to improve fraud detection using machine learning and artificial intelligence. By analyzing patterns across inpatient, outpatient, and beneficiary data, we seek to identify behavioral indicators and anomalies that suggest potential fraud, and to develop a predictive model that flags suspicious providers more accurately and efficiently.

Ultimately, our goal is to support insurers, regulators, and healthcare administrators in proactively identifying fraud, thereby reducing costs, improving compliance, and safeguarding the integrity of the healthcare system.

### **Hypotheses**

- Providers with higher average reimbursement amounts and more frequent high-cost procedures are more likely to be flagged as fraudulent. This hypothesis explores the assumption that fraud often involves inflating claim costs through unnecessary or misrepresented procedures.
- Patient demographics and chronic condition patterns are associated with the likelihood of provider fraud. For example, providers who disproportionately treat older patients with multiple chronic conditions may exploit ambiguities in diagnosis coding, increasing fraud risk.
- Providers with a higher number of claims per unique beneficiary are more likely to engage in fraudulent behavior. This aims to test if over-utilization, such as excessive testing or repeated procedures, is a common fraud indicator.

By testing these hypotheses, this project aims to uncover systemic patterns that distinguish legitimate care from manipulative billing practices. These insights can refine fraud detection models, reduce false positives, and enhance the accuracy of insurance audits.

---

## **Data Sources**

The **Healthcare Insurance Fraud Detection** project leverages four interrelated datasets from a Medicare claims case study hosted on Kaggle. Together, they offer a comprehensive view of medical billing patterns, patient demographics, and provider behaviors essential for uncovering fraudulent activity.

### **1. Provider Labels Dataset**
This dataset serves as the **ground truth** for supervised machine learning.
- **Key Columns**:  
  - `Provider`: Unique provider identifier.  
  - `PotentialFraud`: Binary label indicating if the provider is suspected of fraud (`Yes`/`No`).

This dataset is used to train and evaluate predictive models.

### **2. Beneficiary Data**
This file includes individual-level **KYC (Know Your Customer)** details for Medicare recipients.
- **Demographics**: Gender, race, state, and county.  
- **Health Coverage**: Part A/B coverage months, renal disease indicator.  
- **Chronic Conditions**: Flags for conditions such as diabetes, stroke, heart failure, and cancer.  
- **Financial Data**: Annual inpatient and outpatient reimbursement/deductible amounts.

This dataset provides critical context for understanding the types of patients served by providers.

### **3. Inpatient Claims Data**
Captures claims related to **hospital admissions**.
- **Key Variables**:  
  - Claim dates (start, end, admission, discharge)  
  - Procedure and diagnosis codes (ICD)  
  - Reimbursement and deductible amounts  
  - Physician identifiers (attending, operating, other)

This data helps uncover billing behaviors for more intensive care scenarios.

### **4. Outpatient Claims Data**
Details claims for **non-admitted** patients, such as checkups, minor treatments, and tests.
- **Similar Columns to Inpatient Data**, but generally for lower-cost procedures.
- Includes diagnosis and procedure codes, reimbursement, deductible amounts, and claim dates.

Together, the inpatient and outpatient datasets are used to calculate provider behavior metrics such as average claim cost, procedure frequency, and patient load.
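As a rough illustration of how such provider-level metrics could be derived, the sketch below aggregates claim rows with pandas. It is not the project's actual feature code: `BeneID` is an assumed name for the beneficiary identifier column, the output column names are hypothetical, and the toy rows are invented for the example.

```python
import pandas as pd


def provider_metrics(claims: pd.DataFrame) -> pd.DataFrame:
    """Aggregate claim-level rows into one row of behavior metrics per provider."""
    metrics = claims.groupby("Provider").agg(
        claim_count=("ClaimID", "count"),
        avg_claim_cost=("InscClaimAmtReimbursed", "mean"),
        patient_load=("BeneID", "nunique"),  # BeneID is an assumed column name
    )
    # Claims per unique beneficiary: a simple over-utilization signal.
    metrics["claims_per_beneficiary"] = (
        metrics["claim_count"] / metrics["patient_load"]
    )
    return metrics.reset_index()


# Toy rows standing in for the combined inpatient and outpatient claims.
claims = pd.DataFrame({
    "Provider": ["PRV1", "PRV1", "PRV1", "PRV2"],
    "ClaimID": ["C1", "C2", "C3", "C4"],
    "BeneID": ["B1", "B1", "B2", "B3"],
    "InscClaimAmtReimbursed": [100.0, 300.0, 200.0, 50.0],
})
metrics = provider_metrics(claims)
print(metrics)
```

The `claims_per_beneficiary` ratio corresponds to the third hypothesis (claims per unique beneficiary as an over-utilization indicator).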

These datasets collectively form the backbone of our fraud detection framework. By integrating and analyzing these sources, the project aims to identify patterns indicative of fraud and build predictive models to flag high-risk providers with greater accuracy.

---

## **Data Cleaning Process**

The datasets used in this project were preprocessed and saved in the following folder:

- **Cleaned Training Data**: Located in the `Dataset_Cleaned` folder.

### **Key Cleaning Steps**

#### **1. Claims Data (Provider Labels)**
- Source: `Train-1542865627584.csv`
- Selected only the `Provider` and `PotentialFraud` columns to serve as the **target labels** for supervised machine learning.
- Removed any unnecessary identifiers and metadata.
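
Most classifiers expect a numeric target, so the `Yes`/`No` labels usually need to be encoded at some point. The sketch below shows one way this might be done; the `is_fraud` column name is hypothetical, and the toy rows are invented.

```python
import pandas as pd


def encode_labels(labels: pd.DataFrame) -> pd.DataFrame:
    """Keep the two label columns and add a 0/1 target column."""
    out = labels[["Provider", "PotentialFraud"]].copy()
    # is_fraud is a hypothetical name for the encoded target.
    out["is_fraud"] = out["PotentialFraud"].map({"Yes": 1, "No": 0})
    if out["is_fraud"].isna().any():
        raise ValueError("Unexpected value in PotentialFraud")
    out["is_fraud"] = out["is_fraud"].astype(int)
    return out


# Toy label rows; in the project these would come from Train-1542865627584.csv.
labels = pd.DataFrame({
    "Provider": ["PRV1", "PRV2", "PRV3"],
    "PotentialFraud": ["Yes", "No", "Yes"],
})
encoded = encode_labels(labels)
print(encoded)
```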

#### **2. Beneficiary Data**
- Source: `Train_Beneficiarydata-1542865627584.csv`
- Retained key demographic features (e.g., `Gender`, `Race`, `RenalDiseaseIndicator`) and insurance coverage duration.
- Included binary flags for multiple chronic conditions to help capture patient health patterns.
- Excluded columns like date of birth and death for simplicity and privacy.
- Cleaned version saved as `Cleaned_Train_Beneficiary.csv`.

#### **3. Inpatient Claims Data**
- Source: `Train_Inpatientdata-1542865627584.csv`
- Kept relevant columns such as `ClaimID`, `Provider`, `ClaimStartDt`, `ClaimEndDt`, `InscClaimAmtReimbursed`, and `DeductibleAmtPaid`.
- Dropped procedural and diagnostic codes to streamline initial model development.
- Cleaned version saved as `Cleaned_Train_Inpatient.csv`.

#### **4. Outpatient Claims Data**
- Originally split across three files:  
  - `Train_Outpatientdata_Part1.csv`  
  - `Train_Outpatientdata_Part2.csv`  
  - `Train_Outpatientdata_Part3.csv`  
  located in the `Dataset` folder.
- The files were **merged** into a single DataFrame, and only essential billing and identification fields were retained.
- Cleaned version saved as `Cleaned_Train_Outpatient.csv`.
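
The merge step can be sketched with `pandas.concat`. This is an illustration rather than the repository's cleaning script: the list of retained columns is assumed to mirror the inpatient selection above, and the toy frames stand in for the three part files.

```python
import pandas as pd

# Assumed to match the columns kept for the inpatient data.
KEEP_COLUMNS = [
    "ClaimID", "Provider", "ClaimStartDt", "ClaimEndDt",
    "InscClaimAmtReimbursed", "DeductibleAmtPaid",
]


def merge_parts(parts, keep=KEEP_COLUMNS):
    """Stack the part files row-wise and keep only the selected columns."""
    merged = pd.concat(parts, ignore_index=True)
    return merged[keep]


def toy_part(claim_ids):
    """Build a small frame with the same shape as one part file (invented data)."""
    n = len(claim_ids)
    return pd.DataFrame({
        "ClaimID": claim_ids,
        "Provider": ["PRV1"] * n,
        "ClaimStartDt": ["2009-01-01"] * n,
        "ClaimEndDt": ["2009-01-01"] * n,
        "InscClaimAmtReimbursed": [100] * n,
        "DeductibleAmtPaid": [0] * n,
        "AttendingPhysician": ["PHY1"] * n,  # extra column that gets dropped
    })


# With real files: parts = [pd.read_csv(path) for path in part_paths]
merged = merge_parts([toy_part(["C1", "C2"]), toy_part(["C3"])])
print(merged.shape)
```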

### **General Cleaning Actions**
- All cleaned datasets were saved in the `Dataset_Cleaned` directory for consistent access.
- Irrelevant columns were dropped to reduce noise in model training.
- Column names were left in their original format for compatibility with existing notebooks and scripts.
- The data cleaning scripts are included in the project repository to ensure transparency and reproducibility.

These steps prepared the data for machine learning by focusing on key indicators of fraud while simplifying the feature space for better interpretability and performance.

---

## **Exploratory Data Analysis (EDA)**

To understand provider behavior, claim patterns, and patient health profiles, we performed exploratory data analysis (EDA) on the four cleaned datasets. These visualizations helped identify important trends and outliers that guided our feature engineering and model design.

### **1. Fraud Label Distribution (Claims Data)**  
![Fraud Label Distribution](EDA_Images/fraud_label_distribution.png)

**Description**:  
This bar chart shows the distribution of providers flagged as potentially fraudulent. As expected, the data is imbalanced, with the majority of providers labeled as non-fraudulent. This class imbalance is a crucial consideration for model selection and evaluation.
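
To put a number on the imbalance, one could compute the fraud share and the negative-to-positive ratio; the latter is the value commonly suggested as a starting point for XGBoost's `scale_pos_weight`. The sketch below uses invented toy labels, not the project's actual counts.

```python
import pandas as pd


def imbalance_summary(labels: pd.Series) -> dict:
    """Return the fraud share and the negative-to-positive class ratio."""
    counts = labels.value_counts()
    n_pos = int(counts.get("Yes", 0))
    n_neg = int(counts.get("No", 0))
    if n_pos == 0:
        raise ValueError("No positive (fraud) labels found")
    return {
        "fraud_share": n_pos / (n_pos + n_neg),
        "neg_to_pos_ratio": n_neg / n_pos,
    }


# Toy labels: 1 fraudulent provider out of 10 (illustrative only).
toy_labels = pd.Series(["No"] * 9 + ["Yes"])
summary = imbalance_summary(toy_labels)
print(summary)
```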

### **2. Chronic Conditions Among Patients (Beneficiary Data)**  
![Chronic Conditions Distribution](EDA_Images/chronic_conditions_distribution.png)

**Description**:  
The bar chart summarizes the prevalence of various chronic conditions across Medicare beneficiaries. Common issues include ischemic heart disease, diabetes, and heart failure. These health conditions may correlate with the types and frequency of claims submitted by providers.
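
A prevalence summary like this can be computed as the column means of the chronic condition flags. The sketch below assumes the flags are stored as 0/1 and share a `ChronicCond_` prefix; both are assumptions, as are the toy values. If the flags use another coding, they would need to be recoded to 0/1 first.

```python
import pandas as pd


def chronic_prevalence(beneficiaries: pd.DataFrame,
                       prefix: str = "ChronicCond_") -> pd.Series:
    """Share of beneficiaries with each condition, most common first."""
    flag_cols = [c for c in beneficiaries.columns if c.startswith(prefix)]
    # The mean of a 0/1 flag is the fraction of rows where the flag is set.
    return beneficiaries[flag_cols].mean().sort_values(ascending=False)


# Toy beneficiary rows with hypothetical column names.
beneficiaries = pd.DataFrame({
    "ChronicCond_Diabetes": [1, 1, 0, 1],
    "ChronicCond_Stroke": [0, 0, 0, 1],
})
prevalence = chronic_prevalence(beneficiaries)
print(prevalence)
```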

### **3. Inpatient Reimbursement Distribution**  
![Inpatient Reimbursement Histogram](EDA_Images/inpatient_reimbursement_distribution.png)

**Description**:  
This histogram displays the distribution of reimbursement amounts for inpatient claims. While the majority of claims fall within a moderate range, a noticeable number of high-reimbursement outliers could indicate potential overbilling, an important signal in fraud detection.

### **4. Outpatient Claim Volume per Provider**  
![Outpatient Claims per Provider](EDA_Images/outpatient_claims_per_provider.png)

**Description**:  
This visualization highlights the top 20 providers by outpatient claim volume. A small number of providers are responsible for a disproportionately high number of claims, which may warrant further investigation. High-volume billing patterns can be red flags for fraudulent behavior.

These EDA findings provided a strong foundation for selecting features that capture anomalies, trends, and behavior patterns in healthcare claims. All visualizations are reproducible using the code in `EDA.ipynb` and were generated from the cleaned files located in the `Dataset_Cleaned` folder.

---

## **Visualizations**

To support our hypotheses and guide feature selection, we created eight visualizations — two from each dataset — that explore fraud-related patterns in healthcare claims. These visualizations examine trends in claim volume, patient health characteristics, reimbursement behavior, and potential fraud signals.

All images are saved in the `Visualization_Images` folder and are generated using `.ipynb` notebooks stored in the `Visualizations` folder.

### **1. Claims Data Visualizations**

#### **Claim Volume Distribution by Fraud Status**
This boxplot (log-scaled) compares simulated claim volumes between providers labeled as fraudulent and non-fraudulent. Fraudulent providers show a broader spread, suggesting overbilling behavior may be common in that group.

![Claim Volume vs Fraud](Visualization_Images/claim_volume_vs_fraud_status.png)

#### **Fraud Count by Provider Claim Volume Group**
Providers were grouped into low, medium, and high claim volume categories. The medium and high-volume groups contain disproportionately more fraud cases, supporting the hypothesis that high-volume billing behavior correlates with fraud risk.

![Fraud by Volume Group](Visualization_Images/fraud_by_claim_volume_group.png)
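
The grouping could be reproduced along the following lines. The source does not say how the cut points were chosen, so this sketch assumes equal-sized tertiles via `pandas.qcut`; `claim_count` and `is_fraud` are hypothetical column names and the rows are invented.

```python
import pandas as pd


def fraud_rate_by_volume(providers: pd.DataFrame) -> pd.DataFrame:
    """Bucket providers into volume tertiles and report the fraud rate per bucket."""
    out = providers.copy()
    out["volume_group"] = pd.qcut(
        out["claim_count"], q=3, labels=["low", "medium", "high"]
    )
    return (
        out.groupby("volume_group", observed=True)["is_fraud"]
        .agg(providers="count", fraud_rate="mean")
        .reset_index()
    )


# Toy provider-level rows (illustrative only).
providers = pd.DataFrame({
    "claim_count": [1, 2, 3, 4, 5, 6],
    "is_fraud": [0, 0, 0, 1, 1, 1],
})
volume_summary = fraud_rate_by_volume(providers)
print(volume_summary)
```

Comparing fraud rates rather than raw fraud counts makes the groups comparable even when they contain different numbers of providers.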

### **2. Beneficiary Data Visualizations**

#### **Gender Distribution by Fraud Label**
This bar chart breaks down patient gender served by fraudulent vs. non-fraudulent providers. While both genders are relatively balanced, fraudulent providers see slightly more female patients — potentially indicating gender-based utilization trends worth further study.

![Gender vs Fraud](Visualization_Images/gender_vs_fraud.png)

#### **Chronic Condition Correlation Heatmap**
This heatmap visualizes the co-occurrence of chronic conditions across beneficiaries. Notably, conditions like diabetes, heart failure, and ischemic heart disease often appear together, which may suggest billing patterns tied to comorbidity clusters.

![Chronic Conditions Correlation](Visualization_Images/chronic_conditions_correlation.png)

### **3. Inpatient Data Visualizations**

#### **Length of Stay vs. Reimbursement (Scatter Plot)**
This scatter plot shows that longer hospital stays generally result in higher reimbursements — but fraudulent providers often exhibit higher reimbursements even for shorter stays, suggesting upcoding or inflated charges.

![Length of Stay vs Reimbursement](Visualization_Images/length_of_stay_vs_reimbursement.png)
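
A length-of-stay feature can be derived from the date columns kept in the cleaned inpatient file. The sketch below uses `ClaimStartDt` and `ClaimEndDt` as a proxy for admission and discharge and counts both endpoints as days; these choices, the derived column names, and the toy rows are assumptions.

```python
import pandas as pd


def add_length_of_stay(inpatient: pd.DataFrame) -> pd.DataFrame:
    """Add stay length in days and reimbursement per day to each claim."""
    out = inpatient.copy()
    start = pd.to_datetime(out["ClaimStartDt"])
    end = pd.to_datetime(out["ClaimEndDt"])
    # +1 so that a same-day claim counts as a one-day stay (assumed convention).
    out["length_of_stay_days"] = (end - start).dt.days + 1
    out["reimbursement_per_day"] = (
        out["InscClaimAmtReimbursed"] / out["length_of_stay_days"]
    )
    return out


# Toy inpatient claims (illustrative only).
inpatient = pd.DataFrame({
    "ClaimID": ["C1", "C2"],
    "ClaimStartDt": ["2009-01-01", "2009-02-10"],
    "ClaimEndDt": ["2009-01-03", "2009-02-10"],
    "InscClaimAmtReimbursed": [3000.0, 5000.0],
})
stays = add_length_of_stay(inpatient)
print(stays[["ClaimID", "length_of_stay_days", "reimbursement_per_day"]])
```

A high `reimbursement_per_day` on a short stay is the pattern the scatter plot highlights.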

#### **Top 10 Providers by Inpatient Reimbursement**
This bar chart identifies providers with the highest total inpatient reimbursements. Several providers show outlier behavior, and cross-referencing with fraud labels can help highlight high-risk billing entities.

![Top Inpatient Providers](Visualization_Images/top_inpatient_providers.png)

### **4. Outpatient Data Visualizations**

#### **Monthly Outpatient Claim Volume**
This plot shows monthly claim patterns across providers. Slight end-of-year spikes are visible, which may point to seasonal or quota-driven overutilization — a known red flag for potential fraud.

![Monthly Outpatient Claims](Visualization_Images/monthly_outpatient_claims.png)
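
Monthly volumes of this kind can be obtained by bucketing claims by calendar month. The sketch below groups on `ClaimStartDt`; whether the original plot used the start or end date is not stated, so that choice is an assumption, and the rows are invented.

```python
import pandas as pd


def monthly_claim_volume(outpatient: pd.DataFrame) -> pd.Series:
    """Count claims per calendar month of the claim start date."""
    months = pd.to_datetime(outpatient["ClaimStartDt"]).dt.to_period("M")
    return outpatient.groupby(months)["ClaimID"].count().rename("claim_count")


# Toy outpatient claims (illustrative only).
outpatient = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3"],
    "ClaimStartDt": ["2009-11-15", "2009-12-01", "2009-12-20"],
})
monthly = monthly_claim_volume(outpatient)
print(monthly)
```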

#### **Average vs. Median Reimbursement by Fraud Status**
This grouped bar chart shows that both the average and median outpatient reimbursement amounts are higher for fraudulent providers. This supports the hypothesis that higher per-claim costs are associated with fraud.

![Reimbursement per Claim by Fraud](Visualization_Images/reimbursement_per_claim_by_fraud.png)


---

## Machine Learning Analysis

To uncover patterns that indicate potential healthcare fraud, we developed supervised machine learning models using claims, inpatient, outpatient, and beneficiary data. Below is a summary of our modeling approach, results, and key findings.

### 1. Model Objective

#### Hypothesis  
Certain provider behaviors — such as high claim volume, inflated reimbursements, or patient health patterns — are predictive of fraudulent activity.

#### Methodology  
- **Input Features**:  
  - Claims volume, average reimbursement, deductible amounts  
  - Inpatient length of stay, comorbidities (chronic condition ratios)  
  - Gender ratio, renal disease rate, and other provider-level aggregates  
- **Target Variable**:  
  - `PotentialFraud` (binary: **Yes** or **No**)  
- **Models Used**:  
  - **Random Forest Classifier**  
  - **Logistic Regression**  
  - **XGBoost Classifier**  
- **Data Preprocessing**:  
  - Merged and aggregated patient-claim data into provider-level features  
  - Handled missing values using median imputation  
  - Applied class balancing techniques (e.g., `class_weight='balanced'` and `scale_pos_weight`)
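
The preprocessing and class-weighting steps might be wired together as in the sketch below, shown for the Logistic Regression model only. It is a minimal scikit-learn pipeline on synthetic data, not the project's training code; `max_iter=1000` and the synthetic features are assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_baseline():
    """Median imputation followed by a class-weighted logistic regression."""
    return make_pipeline(
        SimpleImputer(strategy="median"),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )


# Synthetic provider-level features with a minority positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 1.0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # inject some missing values

model = build_baseline().fit(X, y)
fraud_prob = model.predict_proba(X)[:, 1]  # estimated probability of fraud
print(fraud_prob[:5])
```

Keeping the imputer inside the pipeline means the medians are learned from the training data only, which avoids leaking information when the pipeline is used with cross-validation.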

### 2. Model Performance

#### Summary of Results  

| Model                   | ROC AUC     | Precision (Fraud) | Recall (Fraud) | F1 Score (Fraud) |
|-------------------------|-------------|-------------------|----------------|------------------|
| **Random Forest**       | 0.73        | **0.74** ✅       | 0.49           | 0.59             |
| **Logistic Regression** | **0.88** ✅ | 0.41              | **0.89** ✅    | 0.56             |
| **XGBoost**             | 0.79        | 0.63              | 0.62           | **0.63** ✅      |

#### Interpretation  
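- **Random Forest** was the most precise model (0.74) but caught only about half of the fraudulent providers (recall 0.49).  
- **Logistic Regression** had the highest recall (0.89) and ROC AUC (0.88), but its precision of 0.41 means that more than half of the providers it flagged were false positives.  
- **XGBoost** delivered the most **balanced results** (precision 0.63, recall 0.62) and the highest F1 score, making it a strong candidate for production deployment.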
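
The F1 score is the harmonic mean of precision and recall, so the F1 column can be sanity-checked from the other two. The short sketch below recomputes it from the rounded values in the table; the function name is hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for the fraud class, as reported in the table above.
reported = {
    "Random Forest": (0.74, 0.49),
    "Logistic Regression": (0.41, 0.89),
    "XGBoost": (0.63, 0.62),
}

recomputed = {name: f1_from(p, r) for name, (p, r) in reported.items()}
for name, f1 in recomputed.items():
    print(f"{name}: F1 = {f1:.2f}")
```

Because the inputs are already rounded to two decimals, a recomputed value can differ from the reported one in the last digit.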

### 3. Visual Model Comparison

We created a side-by-side bar chart comparing model metrics to visualize their trade-offs across ROC AUC, precision, recall, and F1 score.

![Model Comparison](ML_Images/model_comparison_plot.png)

### 4. Key Findings

#### Insights from Model Behavior  
- **High Reimbursement + Chronic Comorbidities** → Strong fraud indicators  
- **Volume-Heavy Providers** showed elevated fraud risk  
- **XGBoost** achieved the best F1 score on the minority (fraud) class, while Logistic Regression achieved the highest recall at the cost of precision

#### Practical Implications  
- Machine learning models can effectively support claim audits by **flagging high-risk providers** based on their billing and patient profile data.  
- Fine-tuning thresholds or deploying ensembles could further boost real-world accuracy while maintaining fairness.
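
One way to approach threshold tuning is to pick the cutoff that maximizes recall subject to a minimum acceptable precision. The sketch below does this with scikit-learn's `precision_recall_curve`; the helper name, the 0.7 precision floor, and the toy scores are all assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def pick_threshold(y_true, scores, min_precision):
    """Return (threshold, precision, recall) with the best recall
    among thresholds whose precision is at least min_precision."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The curve has one more point than there are thresholds; drop the last.
    precision, recall = precision[:-1], recall[:-1]
    eligible = precision >= min_precision
    if not eligible.any():
        raise ValueError("No threshold reaches the requested precision")
    best = int(np.argmax(np.where(eligible, recall, -1.0)))
    return float(thresholds[best]), float(precision[best]), float(recall[best])


# Toy labels and model scores (illustrative only).
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.5, 0.7, 0.9])
threshold, prec, rec = pick_threshold(y_true, scores, min_precision=0.7)
print(threshold, prec, rec)
```

In practice the threshold should be chosen on a validation set, not on the data used to report final metrics.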

This analysis demonstrates how structured data and predictive modeling can significantly enhance healthcare fraud detection. By combining claims analytics with machine learning, we can reduce investigation costs and improve the integrity of healthcare systems.

---

## **Explainable AI with SHAP**

In this project, we utilized **SHAP (SHapley Additive exPlanations)**, a state-of-the-art method in **Explainable AI (XAI)**, to understand how our machine learning models make predictions. SHAP offers both **global** and **local** interpretability, ensuring that our models are not just “black-box” classifiers, but transparent systems that can be trusted and audited.

### **Why SHAP?**
SHAP helps us understand the influence of each feature on the prediction for both individual predictions (local) and the entire model (global). This is crucial for ensuring that the fraud detection system is explainable, especially in sensitive fields like healthcare. The **Beeswarm Plot** and **Waterfall Plot** are two key visualizations that allow us to see these impacts clearly.

### **SHAP Visualizations**:
1. **Beeswarm Plot (Global Feature Importance)**  
   The **beeswarm plot** shows the impact of each feature on the model's predictions across all test data. The plots label features by index position rather than by column name; **Feature 3** and **Feature 0** emerged as the most influential, indicating their importance in predicting fraud cases.

   ![SHAP Beeswarm Plot](ML_Images/shap_beeswarm_plot.png)

2. **Waterfall Plot (Local Explanation for a Single Prediction)**  
   The **waterfall plot** explains the model's prediction for a single provider. For this provider, **Feature 3** contributed most to increasing the fraud score, while **Feature 19** had a smaller, negative contribution that reduced the overall score.

   ![SHAP Waterfall Plot](ML_Images/shap_waterfall_plot.png)

### **Key Stats from the SHAP Plots**:

- **Global Insights**:  
  From the **Beeswarm Plot**, we observed that:
  - **Feature 3** has a strong positive impact on predicting fraud, meaning providers with higher values for **Feature 3** are more likely to be flagged as fraudulent.
  - **Feature 19** was consistently associated with a negative impact, meaning it tends to lower the predicted fraud score.

- **Local Explanation**:  
  The **Waterfall Plot** for a specific provider revealed:
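  - **Feature 3** pushed the fraud score higher by +2.89, while **Feature 19** lowered it by 0.64.
  - The final model output (`f(x) = -0.729`) is the model's raw score for this provider, with negative values corresponding to a non-fraudulent prediction. Since the output is negative despite the +2.89 from **Feature 3**, the baseline value and the remaining features together contribute about -2.98.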
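
SHAP explanations for tree-based classifiers are commonly expressed in log-odds units. The source does not state the output scale, so the sketch below assumes log-odds and converts the waterfall output to a probability with the logistic function; the function name is hypothetical.

```python
import math


def log_odds_to_probability(log_odds: float) -> float:
    """Apply the logistic (sigmoid) function to a log-odds score."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# f(x) from the waterfall plot, assumed to be in log-odds units.
fraud_probability = log_odds_to_probability(-0.729)
print(f"{fraud_probability:.3f}")
```

Under that assumption, a score of zero corresponds to a probability of 0.5, which is why negative outputs fall on the non-fraudulent side of the default cutoff.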

### **Takeaways**:
By using SHAP, we not only gained **model transparency** but also improved the **trustworthiness** of our fraud detection system. This enables auditors and healthcare providers to confidently understand and challenge model decisions.

---

## Conclusion

The **Healthcare Insurance Fraud Detection** project provides a comprehensive analysis of patterns in healthcare claims data to predict and understand fraudulent behavior. By leveraging multiple datasets (claims, inpatient, outpatient, and beneficiary details) and applying a combination of exploratory data analysis, machine learning, and explainable AI (SHAP), we derived several important insights:

### 1. **Fraud Risk Factors Identified**
   - **High reimbursement amounts**, **length of stay**, and the **presence of chronic conditions** were strong indicators of fraudulent behavior, highlighting key features that contribute to fraud risk.
   - Providers with **higher claim volumes** were more likely to exhibit fraudulent behavior, supporting the hypothesis that higher billing rates correlate with an increased risk of fraud.

### 2. **Machine Learning Model Performance**
   - **XGBoost** emerged as the most balanced model, with the best trade-off between **precision** and **recall** for fraud detection and the highest F1 score (0.63). **Random Forest** was more precise and **Logistic Regression** had higher recall and ROC AUC, but each gave up more on the other metric.
   - **Logistic Regression** identified most fraudulent providers (recall 0.89) but struggled with precision (0.41), showing that **class imbalance** remained an issue that requires fine-tuning of thresholds and weighting.

### 3. **Explainability with SHAP**
   - We implemented **SHAP (SHapley Additive exPlanations)** to make the machine learning models more **transparent** and **trustworthy**. SHAP visualizations helped us understand which features were most influential in predicting fraud, thus providing insights into model decisions.
   - Both **global** and **local** SHAP plots showed that features like **reimbursement amounts** and **chronic conditions** were pivotal in classifying a provider as fraudulent or non-fraudulent.

### 4. **Fraud Risk Scoring**
   - By using the trained models to calculate a **fraud risk score** for each provider, we can prioritize investigations and direct resources where they are most needed. This risk score can help insurance companies and auditors focus on high-risk providers, potentially saving time and reducing costs.
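
A risk-scoring step of this kind could be as simple as ranking providers by the model's predicted fraud probability. The sketch below is illustrative only: the function, the `fraud_risk_score` column name, and the toy probabilities are assumptions.

```python
import pandas as pd


def rank_providers(provider_ids, fraud_probabilities, top_k):
    """Return the top_k providers ordered by descending fraud risk score."""
    scores = pd.DataFrame({
        "Provider": list(provider_ids),
        "fraud_risk_score": list(fraud_probabilities),
    })
    ranked = scores.sort_values(
        "fraud_risk_score", ascending=False
    ).reset_index(drop=True)
    ranked["rank"] = ranked.index + 1
    return ranked.head(top_k)


# Toy scores; in practice these would come from model.predict_proba(...)[:, 1].
top = rank_providers(["PRV1", "PRV2", "PRV3"], [0.62, 0.10, 0.91], top_k=2)
print(top)
```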

### **Broader Implications**

The insights gained from this analysis have direct implications for improving healthcare fraud detection:
- **Proactive Fraud Detection**: The models could be deployed in real time to identify high-risk providers and claims, enabling early intervention.
- **Cost Reduction**: By using AI to flag fraudulent claims, insurance companies can reduce the financial burden of fraud and improve their operational efficiency.
- **Trust and Transparency**: The use of **explainable AI** helps foster trust in the system, as stakeholders can clearly understand how fraud predictions are made.

### **Impact on the Community**

The impact of this project extends beyond just improving fraud detection within the insurance industry. By reducing healthcare fraud, we directly contribute to:
- **Lower healthcare costs**: Reducing fraud prevents unnecessary charges, leading to more affordable healthcare for individuals and communities.
- **Improved resource allocation**: By flagging fraudulent providers, resources can be better allocated to those who need them, ensuring that genuine cases receive timely attention.
- **Increased trust in the healthcare system**: The transparency and explainability of AI models help build trust among both providers and patients, fostering a more cooperative healthcare environment.
- **Support for regulatory bodies**: This system can also support healthcare regulators in enforcing policies more effectively, ensuring that fraud is detected before it becomes widespread.

This project demonstrates how AI and machine learning can not only streamline fraud detection but also play a critical role in creating a more equitable and efficient healthcare system. The ability to detect and mitigate fraud contributes to long-term sustainability, ensuring that resources are allocated fairly and transparently.

### **Limitations and Future Work**

While the project has provided valuable insights, it has certain limitations:
- **Class Imbalance**: Despite class weighting, fraud remains a minority class, which makes it difficult to achieve high precision and high recall at the same time.
- **Data Gaps**: We were unable to incorporate external data (e.g., audit findings or external fraud reports) that could provide a more comprehensive view of fraud detection.
- **Model Generalization**: The model may need adjustments when deployed in a real-world setting with constantly changing fraud strategies.

Future research could:
- Incorporate **additional datasets**, such as audit data or claims review outcomes.
- Experiment with more advanced **ensemble models** or **deep learning techniques** to improve model accuracy.
- **Expand the scope** by incorporating natural language processing (NLP) to process unstructured data from provider notes or appeals.

By combining **AI-driven fraud detection** with **explainable AI**, this project lays the foundation for more effective, transparent, and scalable healthcare fraud prevention systems, with a direct positive impact on both the healthcare industry and the communities it serves.