Healthcare Fraud Detection

Introduction

In recent years, the rate at which doctors and hospitals have conducted fraudulent activities, scams, and schemes has troubled authorities. The Department of Justice (DOJ) recovered over $3 billion from False Claims cases in the 2019 fiscal year, with $2.6 billion coming from healthcare fraud schemes. The DOJ also reported that the billions of dollars stemming from healthcare fraud cases involved a wide range of stakeholders, including drug and medical device manufacturers, as well as care providers, hospitals, pharmacies, hospice organizations, laboratories, and physicians.

Fraud Investigation

Healthcare/Medicare fraud is more prevalent among medical providers and usually results in higher healthcare costs, insurance premiums, and taxes for the general population. Medical providers try to maximize Medicare reimbursements to which they are not entitled through illegitimate activities such as submitting false claims. This capstone project focuses on fraud committed by doctors and hospitals. Using real-life Medicare claims data, I have attempted to identify key healthcare fraud indicators and fraudulent provider characteristics that could be used in Medicare fraud investigations via supervised machine learning. Machine learning classification algorithms are used in an attempt to classify providers as fraud or non-fraud.

Healthcare Fraud Overview


As per the FBI, health care fraud can be committed by medical providers, patients, and others who intentionally deceive the health care system to receive unlawful benefits or payments. Some of the common ways that medical providers deceive patients/insurance providers through claims procedures are listed below:

  • Billing for care not rendered.
  • Submitting duplicate claims.
  • Falsifying claim/patient info.
  • Disguising non-covered services as covered services.
  • Using incorrect diagnosis/procedure codes.
  • Stealing a Medicare number or card and using it to submit fraudulent claims.

Let us also look at some of the common terminology associated with healthcare fraud, which includes some of the offenses described above. As per a Medicare Advantage article, some of the common ways in which illegitimate Medicare spending may be carried out are as follows:

  • Double Billing:
    • This type of Medicare fraud involves deliberately charging twice for a service or product that was only performed or supplied once.
  • Phantom Billing:
    • This involves billing for a test, procedure, or other medical service that was never actually performed. This is one of the most common forms of Medicare fraud.
  • Upcoding:
    • Upcoding is altering the codes assigned to specific billable services to reflect a higher-level service than what was actually performed. This type of scam is carried out to receive a fraudulently higher Medicare reimbursement than is warranted.
  • Unbundling:
    • This involves taking a comprehensive service and separating it into several specific services in order to bill for each one independently. This leads to a higher reimbursement total.
  • Kickbacks:
    • Kickbacks occur when a provider accepts payment from a pharmaceutical company or medical device supplier in exchange for recommending or prescribing the company's product to patients.

Medicare Claims Dataset

The Medicare claims data used in this project comes from data uploaded to Kaggle - Healthcare Provider Fraud Detection Analysis by Rohit Anand Gupta. The data comprises three sub-datasets; their details are listed below.

In the Beneficiary dataset, we get patient-level information such as age, race, gender, geographical location, chronic conditions, deductibles paid, reimbursements received, etc. The Inpatient and Outpatient datasets contain claim-level information for those patients, including the associated hospital, associated physicians, claim start/end dates, admission/discharge dates, and the diagnosis/procedure codes associated with each claim.

Another key piece of information included in this dataset is the fraud labels. These labels are placed on the medical providers/hospitals and indicate whether each provider is possibly fraud or non-fraud. Based on the initial review, we can see that the labels provided are highly imbalanced; almost all providers are labeled as non-fraud. Such an imbalance is detrimental to data modeling, especially for classification tasks, as we run the risk of labeling all our providers as non-fraud. The data was balanced using upsampling techniques, which are discussed further in this blog.

Fraud Data Preprocessing

Before getting into the details of the extensive data analysis done on the claims data, I would like to discuss how the data was preprocessed. First, the missingness in the data was handled. The data included many missing values, such as a missing date of death when the patient is alive and a missing operating physician when no surgical operation was performed. Missing information was imputed accordingly. Also, for uniform and efficient preprocessing, all categorical data was label encoded.
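
As a rough illustration of these steps, here is a minimal sketch of the imputation and label-encoding logic. The file and column names (OperatingPhysician, etc.) are assumptions for illustration and may differ from the actual dataset:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

claims = pd.read_csv("inpatient_claims.csv")  # hypothetical file name

# A missing operating physician means no surgical operation was
# performed, so impute a sentinel category before encoding.
claims["OperatingPhysician"] = claims["OperatingPhysician"].fillna("None")

# Label-encode every categorical column for uniform preprocessing.
for col in claims.select_dtypes(include="object").columns:
    claims[col] = LabelEncoder().fit_transform(claims[col].astype(str))
```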

I also decided to keep the outliers in the data, as they could provide key fraud indicator information; these could very well be transactions where actual fraud is being committed. This is also why the data was robust scaled before modeling. Another important preprocessing step was upsampling the data to reduce the imbalance and bring the fraud label ratio to 1:1. The data was processed via two upsampling techniques: SMOTE (creates synthetic data randomly between two data points) and BorderlineSMOTE (creates synthetic data along the decision boundary between the two classes), and performance was compared between the two.
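
A minimal sketch of the scaling and upsampling steps, assuming scikit-learn and imbalanced-learn are available, with X and y standing in for the preprocessed features and fraud labels:

```python
from sklearn.preprocessing import RobustScaler
from imblearn.over_sampling import SMOTE, BorderlineSMOTE

# RobustScaler centers on the median and scales by the IQR, so the
# retained outliers do not dominate the feature ranges.
X_scaled = RobustScaler().fit_transform(X)

# Upsample the minority (fraud) class to a 1:1 ratio two different
# ways, then compare downstream model performance.
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_scaled, y)
X_bsmote, y_bsmote = BorderlineSMOTE(random_state=42).fit_resample(X_scaled, y)
```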

Next, I also created new features and dropped some redundant ones. New features were created that indicate whether the patient is deceased, the duration of the hospital stay/claim, the number of associated doctors/claims, the number of chronic conditions the patient has, etc. Features with many null values, or from which other features had been derived, were dropped. After all of the preprocessing, the three datasets were combined to create one training and testing dataset.
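
For example, features like claim duration and chronic-condition counts might be derived as in the sketch below; the column names are assumptions, not necessarily the dataset's exact names:

```python
import pandas as pd

df = pd.read_csv("combined_claims.csv")  # hypothetical merged dataset

# Claim duration in days from the claim start/end dates.
df["ClaimDuration"] = (
    pd.to_datetime(df["ClaimEndDt"]) - pd.to_datetime(df["ClaimStartDt"])
).dt.days

# Whether the patient is deceased (a missing date of death means alive).
df["IsDeceased"] = df["DOD"].notna().astype(int)

# Total chronic conditions per patient from the indicator columns.
chronic_cols = [c for c in df.columns if c.startswith("ChronicCond")]
df["ChronicCondCount"] = df[chronic_cols].sum(axis=1)

# Number of claims associated with each provider.
df["ClaimsPerProvider"] = df.groupby("Provider")["ClaimID"].transform("count")
```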

Beneficiary Information Analysis

Before we proceed, let us look at who our patients are. From the graphs above, we see that the majority of our patients belong to the race encoded as 0 and the gender encoded as 1. Most patients fall between the ages of 68 and 82; however, there are some outliers as well. Almost all our patients are alive. I also studied the top beneficiaries who paid the highest deductibles and for whom the highest total reimbursements were received. Several beneficiaries are common to both groups, as we can see from the two graphs below.

Fraud vs Non-Fraud Providers Study:

To understand what the key fraud provider characteristics are, I extensively studied the inpatient/outpatient data based on the fraud labels provided, attempting to uncover what sets fraud providers apart from non-fraud providers. The following are some of the findings uncovered through the study (comparisons are between the inpatient/outpatient datasets and between fraud/non-fraud providers):

Maximum Reimbursement Amounts

The graphs below detail the distribution of the maximum total reimbursement amount received by fraud and non-fraud providers in the inpatient and outpatient claims. There is a difference in the average maximum reimbursement amounts received by the two types of providers in the inpatient dataset. A similar difference is not seen between the outpatient fraud and non-fraud providers; however, we can see that fraud providers claimed some of the highest reimbursements.

The bar graphs below show the top providers by maximum reimbursement amount (in both the inpatient and outpatient datasets) and how many of them were fraud vs non-fraud. Among the top inpatient providers, all but one are labeled as fraud. Among the top outpatient providers, there is a 50:50 split; however, the highest reimbursements were claimed by fraud providers.

Number of Claims

The graphs below detail the distribution of the total number of claims submitted by fraud and non-fraud providers in the inpatient and outpatient claims. In both datasets, fraud providers submitted far more claims than non-fraud providers.

The bar graphs below show the top providers by total number of claims submitted (in both the inpatient and outpatient datasets) and how many of them were fraud vs non-fraud. All the top providers in both datasets are labeled as fraud.

Diagnosis Code Counts

The graphs below detail the distribution of the total number of diagnosis codes listed on claims for fraud and non-fraud providers in the inpatient and outpatient claims. For inpatient providers, the average code counts are higher for non-fraud providers than for fraud providers; the exact opposite is true for outpatient providers.

The bar graphs below show the top providers by total number of diagnosis codes listed on claims (in both the inpatient and outpatient datasets) and how many of them were fraud vs non-fraud. Among the top inpatient providers, all but one are labeled as fraud, whereas among the top outpatient providers only a few carry the fraud label.

Average Patient Age/Chronic Condition Counts

Next, I also looked at average patient age and chronic condition counts for both types of providers in the inpatient and outpatient datasets. From the graphs below, it appears that for both inpatient and outpatient providers, the range of patient ages is narrower for fraud providers than for non-fraud providers. Likewise, the range of patient chronic condition counts is also narrower for fraud providers.

Patients per state - Fraud Providers

The last avenue I explored as part of this study was where the majority of patients reside for fraud providers in the inpatient and outpatient datasets.

From these graphs, it appears that most patients of fraud providers in both the inpatient and outpatient datasets come from a few common states. The states encoded as 5, 30, and 33 have the highest numbers of patients associated with a fraud-labeled medical provider.

Classification Task - Data Modeling

After all the in-depth data analysis, I moved on to data modeling using Python and machine learning classification algorithms. For modeling, I used the training dataset, performed a 70:30 train-test split, and evaluated the results for the SMOTE and BorderlineSMOTE upsampled data. Model performance was evaluated based on the F1 score, the harmonic mean of precision and recall.
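
A minimal sketch of this modeling setup, assuming the SMOTE-upsampled data from the earlier snippet (X_smote, y_smote) and the lightgbm package; the hyperparameters are illustrative defaults, not the project's tuned values:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from lightgbm import LGBMClassifier

# 70:30 train-test split of the upsampled data.
X_train, X_test, y_train, y_test = train_test_split(
    X_smote, y_smote, test_size=0.3, random_state=42, stratify=y_smote
)

model = LGBMClassifier(random_state=42)
model.fit(X_train, y_train)

# F1, the harmonic mean of precision and recall, penalizes a model
# that simply predicts the majority (non-fraud) class for everything.
print("F1 score:", f1_score(y_test, model.predict(X_test)))
```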

Based on these graphs, we can see that there is not much difference in performance between the two upsampling techniques. The Linear SVC model performs the worst in this case and misclassifies the majority of the providers. However, the XGBoost and LightGBM models (with final features selected via Recursive Feature Elimination) perform much better at classifying between the two classes. In the next section, we will look at the performance evaluation of our better-performing LightGBM model.
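
The feature selection step might look like the following sketch using scikit-learn's RFE, assuming X_train is a DataFrame; the number of features to keep (20) is an assumed value for illustration:

```python
from sklearn.feature_selection import RFE
from lightgbm import LGBMClassifier

# Recursively drop the weakest features, ranked by the model's
# feature_importances_, until the requested number remain.
selector = RFE(LGBMClassifier(random_state=42), n_features_to_select=20)
selector.fit(X_train, y_train)
selected_features = X_train.columns[selector.support_]
```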

LightGBM Performance Evaluation

One of the best-performing models of all the model types attempted was the LightGBM model. We can see why it performed well through some of the classification performance metrics. First, we will look at the confusion matrix, which shows correct classifications vs misclassifications as predicted by the model. We can see that this model does a good job of classifying non-fraud providers as non-fraud and possibly-fraud providers as such. I was also able to achieve better possibly-fraud detection by slightly adjusting the model's classification threshold.

Confusion matrix
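
A minimal sketch of the threshold adjustment mentioned above; the 0.4 cutoff is an assumed value for illustration, not the project's actual setting:

```python
# Predicted fraud probabilities from the fitted LightGBM model.
proba = model.predict_proba(X_test)[:, 1]

# Lowering the default 0.5 cutoff flags more possibly-fraud providers,
# trading some precision for higher recall.
y_pred = (proba >= 0.4).astype(int)
```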

AUC/ROC and Precision-Recall Curve:

Next, we will look at the ROC and precision-recall curves for the LightGBM classifier. The ROC curve allows us to visualize the tradeoff between a model's sensitivity and specificity. Ideally, the true-positive rate should be close to one and the false-positive rate close to zero. Additionally, the higher the area under the curve (which summarizes the relationship between the false-positive and true-positive rates), the better the model. Our LightGBM model does well in this regard; it has a high, steep ROC curve with an AUC score of 0.94.

Next, we will look at the precision-recall curve for this model. The precision-recall curve shows the tradeoff between result relevancy (precision) and completeness (recall). The goal is always to maximize both precision and recall and to have a high area under the curve; a high average precision is also highly desirable. The LightGBM model achieves both in this case.
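
Both curves can be generated directly from a fitted classifier with scikit-learn's display helpers, as in this sketch:

```python
from sklearn.metrics import (
    RocCurveDisplay, PrecisionRecallDisplay, roc_auc_score
)

RocCurveDisplay.from_estimator(model, X_test, y_test)
PrecisionRecallDisplay.from_estimator(model, X_test, y_test)

# Area under the ROC curve from the predicted fraud probabilities.
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```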

Class Prediction Error

Lastly, we will also look at the Class Prediction Error plot for the LightGBM model, which allows us to see how effective our classifier is at predicting the correct classes. For our model, this plot shows that the model predicts the majority of classes correctly and that the misclassification rate is relatively low.
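
This kind of plot matches Yellowbrick's ClassPredictionError visualizer; assuming that library was used (an assumption, since the source does not name it), a sketch would look like:

```python
from lightgbm import LGBMClassifier
from yellowbrick.classifier import ClassPredictionError

# Stacked bars of predicted class composition for each actual class.
viz = ClassPredictionError(LGBMClassifier(random_state=42))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show()
```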

Model Feature Importances

One way to understand which factors matter when distinguishing between fraud and non-fraud providers is to look at model feature importances. These scores let us examine which attributes contribute most to the model's classification ability. In this section, we will look at feature importances for three models: XGBoost, LightGBM, and Random Forest.

If we look at the results from the XGBoost and LightGBM models, we can see that the same top four features contribute the most to decision-making: the attending physician (the primary doctor), the county, the state the patient is from, and the other physicians listed on the claim. If we look at the top features by weight for the Random Forest model, we again see the attending physician, county, and state.
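
Tree-based models expose these scores directly; a sketch for the fitted LightGBM model, assuming feature_names holds the training columns:

```python
import pandas as pd

# Rank features by the model's built-in importance scores.
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```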

SHAP Value Analysis

The last thing I looked at in terms of feature evaluation was the SHAP values of the important features in the model. SHAP values are calculated using a game-theoretic approach in which weight is assigned to each feature based on how much it contributes to the model's predictions. The first graph below shows the top features by SHAP value and their overall effect on the model. We will look at the results for the XGBoost and LightGBM models.

Both models share the same top five features: the patient's birth year, age, the state/county they belong to, and their primary physician listed on the submitted claim. On the graph below, the x-axis shows the SHAP values plotted for each feature and whether they impact the model positively or negatively, while the colors (from red to blue) indicate whether the feature value is high or low.
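
This kind of summary plot can be produced with the shap package's TreeExplainer, which computes exact SHAP values for tree ensembles such as XGBoost and LightGBM; a sketch:

```python
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# For binary classifiers, some shap versions return one array per
# class; take the fraud-class (index 1) values in that case.
if isinstance(shap_values, list):
    shap_values = shap_values[1]

# Beeswarm summary: x-axis is the SHAP value (impact on the
# prediction), color encodes whether the feature value is high or low.
shap.summary_plot(shap_values, X_test)
```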

We will also look at the top features ordered by mean SHAP value for the linear models, which tell a different story. For these models, the total claim amount, the insurance claim amount reimbursed, and the deductible paid by the patient were the top influencers on prediction decisions.

Conclusions:

When I first started looking at this data, I looked at the beneficiaries/patients first. Some key insights that I gathered through this exploration were:

  • Certain beneficiaries listed below could be actively experiencing fraud or could be more susceptible to fraud.
    • Patients for whom high reimbursements were received.
    • Patients who have paid high deductibles.
    • Some of the aforementioned patients who have high chronic condition counts.

Next, I studied the fraud and non-fraud providers in the inpatient and outpatient datasets and came up with the following distinguishing characteristics between the two:

  • Fraud providers submitted far more claims than non-fraud providers in both datasets.
  • Inpatient fraud providers received higher maximum reimbursement amounts on average, and fraud providers claimed some of the highest outpatient reimbursements.
  • Average diagnosis code counts were higher for non-fraud inpatient providers but higher for fraud outpatient providers.
  • The ranges of patient age and chronic condition counts were narrower for fraud providers.

One other thing to note is that possibly-fraud providers could be more active in certain states and counties. A patient's age falling in a certain range, the state/county they are from, their total claim amount, and who their primary doctor is could, in certain cases, make them more vulnerable to fraud. These features could also help investigators differentiate between fraud and non-fraud providers.

Future work:

When I started working on this project, I came to realize that I was only scratching the surface of the black box of healthcare fraud detection. The possibilities are limitless in terms of the type of work we can do and the areas we can focus on to zero in on fraud providers. Given more time, some things I would love to try are:

  • Duplicate claim investigation.
  • Doctor-Hospital Network Analysis.
  • Studying patterns in beneficiaries.
  • Conducting a market basket analysis.
