# Medical claims anomaly detection
* Healthcare fraud and abuse take many forms. Some of the most common types of provider fraud are:

1. Billing for services that were not provided
2. Duplicate submission of a claim for the same service
3. Misrepresenting the service provided
4. Billing for a covered service when the service actually provided was not covered
5. **Charging for a more complex or expensive service than was actually provided**

### This notebook analyzes ONLY #5

# Goals

1. EDA
2. Dimensionality reduction
3. Use 3 **unsupervised** anomaly detection algorithms: Elliptic Envelope, Isolation Forest, Local Outlier Factor
4. Create 2 ensembles from the 3 models by fusing labels: ALL and ANY
5. Label the anomalies found by these ensembles
6. Use a **supervised** model and its feature importance as a way to explain the model decision making process

# Data
**Medicare Claims Synthetic Public Use Files (SynPUFs)**
Medicare Claims Synthetic Public Use Files (SynPUFs) were created to allow interested parties to gain familiarity using Medicare claims data while protecting beneficiary privacy.
The data structure of the Medicare SynPUFs is very similar to the CMS Limited Data Sets, but with a smaller number of variables. They provide data analysts and software developers the opportunity to develop programs and products utilizing the identical formats and variable names as those which appear in the actual CMS data files.

* I've joined the CLAIMS and BENEFICIARIES tables, while reducing the huge original files to a mere 1M rows. 
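
The join itself is not shown in this notebook. A minimal sketch of how such a join might look with pandas, assuming both tables share the `DESYNPUF_ID` beneficiary key and that there is one beneficiary row per ID (the toy frames and all values below are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the two SynPUF tables; every value here is invented.
claims = pd.DataFrame({
    "DESYNPUF_ID": ["A1", "A1", "B2"],
    "CLM_ID": [101, 102, 103],
    "HCPCS_CD_1": ["99213", "99214", "99213"],
})
beneficiaries = pd.DataFrame({
    "DESYNPUF_ID": ["A1", "B2"],
    "BENE_SEX_IDENT_CD": [1, 2],
    "BENE_RACE_CD": [1, 2],
})

# Left join: keep every claim and attach the attributes of its beneficiary.
# validate= raises if a beneficiary ID appears more than once on the right side.
joined = claims.merge(beneficiaries, on="DESYNPUF_ID", how="left",
                      validate="many_to_one")
print(joined.shape)
```

A left join keeps claims whose beneficiary is missing (their beneficiary columns become NaN), which makes such gaps visible instead of silently dropping rows.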

* Acknowledgements
  * Source: https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs
  * Dictionary of the columns: https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/Downloads/SynPUF_DUG.pdf

In [None]:
import numpy as np
from numpy import ma
import pandas as pd
import math
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import gc
gc.enable()

%matplotlib inline
from matplotlib import ticker, cm
from matplotlib.pyplot import figure
from matplotlib import pyplot
import seaborn as sns
import plotly

from scipy.stats import multivariate_normal
from sklearn.metrics import f1_score, confusion_matrix, classification_report, precision_recall_fscore_support
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

#import sklearn
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, average_precision_score, auc
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.datasets import make_classification

from sklearn import preprocessing
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter  

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

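# Use the full option name; the short form 'max_columns' is ambiguous in recent pandas versions
pd.set_option('display.max_columns', None)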

import os
os.listdir('/kaggle/input')

# Data

In [None]:
%%time

RawClaimBenef = pd.read_csv('/kaggle/input/medicalclaimssynthetic1m/MedicalClaimsSynthetic1M.csv')
RawClaimBenef.drop(['DESYNPUF_ID', 'PPPYMT_IP', 'PRF_PHYSN_NPI_1', 'CLM_ID'], axis=1, inplace=True)
print(RawClaimBenef.shape)
RawClaimBenef.sample(3)

### One Hot Encoding the categoricals

In [None]:
# Categorical data

print(RawClaimBenef.shape)
cols4OHE = ['BENE_RACE_CD', 'SP_STATE_CODE', 'BENE_COUNTY_CD', 'LINE_PRCSG_IND_CD_1', 'HCPCS_CD_1']

RawClaimBenef = pd.get_dummies(RawClaimBenef, columns = cols4OHE)
print(RawClaimBenef.shape)

In [None]:
myCol = 'BENE_DEATH_DT'

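print(RawClaimBenef[myCol].dtype)
# A missing death date (NaN) casts to True under astype(bool), so test for a present, non-zero value instead
RawClaimBenef[myCol] = RawClaimBenef[myCol].notna() & (RawClaimBenef[myCol] != 0)
print(RawClaimBenef[myCol].dtype)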
RawClaimBenef.groupby(myCol).size().plot.bar()
plt.show()

In [None]:
myCol = 'BENE_SEX_IDENT_CD'

print(RawClaimBenef[myCol].dtype)
# Assuming codes 1/2: subtracting 1 maps code 1 -> False and code 2 -> True
RawClaimBenef[myCol] = (RawClaimBenef[myCol] - 1).astype(bool)
print(RawClaimBenef[myCol].dtype)
RawClaimBenef.groupby(myCol).size().plot.bar()
plt.show()

In [None]:
myCol = 'BENE_ESRD_IND'

print(RawClaimBenef[myCol].dtype)
RawClaimBenef[myCol] = RawClaimBenef[myCol].apply(lambda x: 1 if x=='Y'  else 0)
RawClaimBenef[myCol] = RawClaimBenef[myCol].astype(bool)
print(RawClaimBenef[myCol].dtype)
RawClaimBenef.groupby(myCol).size().plot.bar()
plt.show()

In [None]:
myColumns = ['SP_ALZHDMTA',
 'SP_CHF',
 'SP_CHRNKIDN',
 'SP_CNCR',
 'SP_COPD',
 'SP_DEPRESSN',
 'SP_DIABETES',
 'SP_ISCHMCHT',
 'SP_OSTEOPRS',
 'SP_RA_OA',
 'SP_STRKETIA']

for myCol in myColumns:
    print(RawClaimBenef[myCol].dtype)
    # Assuming flags coded 1/2: subtracting 1 maps code 1 -> False and code 2 -> True
    RawClaimBenef[myCol] = (RawClaimBenef[myCol] - 1).astype(bool)
    print(RawClaimBenef[myCol].dtype)
    RawClaimBenef.groupby(myCol).size().plot.bar()
    plt.show()

### Tokenize the ICD9 Diagnosis code

In [None]:
texts = RawClaimBenef['ICD9_DGNS_CD_1'].tolist()
print(len(texts))

In [None]:
%%time

# Map each ICD-9 code to an integer index. The filters below omit '_', '.' and '-' so those characters stay inside a token

maxlen = 1 # each text is a single code (one word), so every sequence has length 1
max_words = 12000 # keep only the most frequent codes
max_features = max_words

tokenizer = Tokenizer(num_words=max_words, 
                     filters='!"#$%&()*+,/:;<=>?@[\\]^`{|}~\t\n',)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen, padding='post', truncating='post', value=0.0)
data = pd.DataFrame(data)
data.columns = ['TokenICD9']

print(data.shape)
data.tail()
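
For readers without Keras, the same idea can be sketched with pandas alone: rank each code by frequency, map it to an integer, and send codes outside the cap to 0. This only approximates what the Tokenizer cell above does and is not the code used in this notebook; the toy codes and the small cap are made up.

```python
import pandas as pd

# Toy ICD-9 codes (invented); the real column has one code per claim.
codes = pd.Series(["4019", "25000", "4019", "V5869", "4019", "25000"])
max_words = 3  # hypothetical cap for the toy data; the notebook uses 12000

# Rank codes by frequency: the most frequent code gets index 1.
counts = codes.value_counts()
index_of = {code: rank for rank, code in enumerate(counts.index, start=1)}

# As with Tokenizer(num_words=...), only indices below the cap are kept;
# anything else becomes 0, the same value used for padding.
ranks = codes.map(index_of)
token = ranks.where(ranks < max_words, 0).astype(int)
print(token.tolist())  # [1, 2, 1, 0, 1, 2]
```

Note that these integers are arbitrary identifiers: their order reflects frequency, not clinical similarity, so distance-based models treat neighboring indices as "close" even when the diagnoses are unrelated.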

In [None]:
CleanData = pd.concat([RawClaimBenef,data],axis=1)
CleanData.drop(['ICD9_DGNS_CD_1', 'LINE_ICD9_DGNS_CD_1'], axis=1, inplace=True)
print(CleanData.shape)
CleanData.tail()

In [None]:
# Normalize data

x = CleanData.values  
scaler = preprocessing.StandardScaler()
x_scaled = scaler.fit_transform(x)
CleanData = pd.DataFrame(x_scaled, columns=CleanData.columns)
print(CleanData.shape)
CleanData.tail()

In [None]:
del RawClaimBenef
del x_scaled
del x
del data
del texts
del sequences

gc.collect()


# Dimensionality reduction w PCA

In [None]:
%%time

# PCA to 2 dims

Xpca = PCA(n_components=2).fit_transform(CleanData)
Xpca = pd.DataFrame(Xpca)
Xpca.columns = ['Dim0', 'Dim1']
print(Xpca.shape)
Xpca.head()
Dim0 = np.array(Xpca['Dim0'])
Dim1 = np.array(Xpca['Dim1'])

# Plot the instances in 2D
plt.figure(figsize=(12,8))
plt.title("Scatter Plot of the 2D post PCA Dataset")
# cmap has no effect without c, and there are no labeled artists for a legend, so both are dropped
plt.scatter(Dim0, Dim1, linewidths=1)
plt.show()

In [None]:
%%time

del Xpca
gc.collect()

# PCA to 3 dims

Xpca = PCA(n_components=3).fit_transform(CleanData)
Xpca = pd.DataFrame(Xpca)
Xpca.columns = ['Dim0', 'Dim1', 'Dim2']
print(Xpca.shape)
Xpca.head()
Dim0 = np.array(Xpca['Dim0'])
Dim1 = np.array(Xpca['Dim1'])
Dim2 = np.array(Xpca['Dim2'])

#Generate and plot the instances on 3D 
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(projection='3d')
ax.scatter(Dim0, Dim1, Dim2, marker='o')
plt.show()

In [None]:
CleanData.head(1)

In [None]:
Xpca.head(1)

In [None]:
#CleanData = Xpca.copy()
X = np.array(Xpca)
X.shape

# Unsupervised anomaly detection 

### 0.001 contamination level

* The `contamination` hyper-parameter is set to 0.001 for Elliptic Envelope and Isolation Forest
* For LOF, I tuned the number of neighbors until the share of flagged outliers matched the contamination level of the two models above
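
How the neighbor count was tuned is not shown. One way to do it, sketched here on synthetic data, is to fit LOF for a few candidate `n_neighbors` values and print the fraction of points labeled `-1`. The candidate values and the toy data are assumptions, not the search actually run for this dataset.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(2000, 3))  # synthetic stand-in for the 3 PCA dimensions

target = 0.001  # the contamination level used for the other two models

# Option 1: scan candidate neighbor counts and inspect the outlier fraction.
for k in (10, 20, 35, 50):  # hypothetical candidates
    labels = LocalOutlierFactor(n_neighbors=k).fit_predict(X_toy)
    frac = np.mean(labels == -1)
    print(f"n_neighbors={k}: outlier fraction={frac:.4f}")

# Option 2: LocalOutlierFactor also accepts `contamination` directly,
# which sets the score threshold so that roughly that share is flagged.
labels_fixed = LocalOutlierFactor(n_neighbors=20, contamination=target).fit_predict(X_toy)
print("flagged with contamination=0.001:", int(np.sum(labels_fixed == -1)))
```

Passing `contamination` keeps the three detectors directly comparable, while scanning `n_neighbors` changes the density estimate itself, so the two options do not flag the same points.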

### Elliptic Envelope

In [None]:
%%time
Preds = EllipticEnvelope(random_state=0, contamination=0.001).fit_predict(X)
# contamination=0.1 by default ... hyperparam that needs to be set
Preds.shape

# 1=inlier, -1=outlier
Xpca['EllipticEnvelope'] = Preds
print(Xpca.shape)

Xpca.EllipticEnvelope.value_counts(normalize=True)

In [None]:
%%time

normalDF = Xpca[Xpca.EllipticEnvelope== 1]
normalDF = normalDF[['Dim0','Dim1', 'Dim2']]
Normal = np.array(normalDF)
print(Normal.shape)

anomalDF = Xpca[Xpca.EllipticEnvelope== -1]
anomalDF = anomalDF[['Dim0','Dim1', 'Dim2']]
Anomal = np.array(anomalDF)
print(Anomal.shape)

# Plot 3D with Anomalies
fig = plt.figure(figsize=(16,16))
ax = fig.add_subplot(projection='3d')

xs = Normal[:,0]
ys = Normal[:,1]
zs = Normal[:,2]
ax.scatter(xs, ys, zs, marker='.', c='g')

xsA = Anomal[:,0]
ysA = Anomal[:,1]
zsA = Anomal[:,2]
ax.scatter(xsA, ysA, zsA, marker='o', c='r')
plt.title('Elliptic Envelope contamination = 0.001')
plt.show()

### Isolation Forest

In [None]:
%%time

Preds = IsolationForest(random_state=0, contamination=0.001).fit_predict(X)

# 1=inlier, -1=outlier
Xpca['IsolationForest'] = Preds
print(Xpca.shape)

Xpca.IsolationForest.value_counts(normalize=True)

In [None]:
%%time

normalDF = Xpca[Xpca.IsolationForest== 1]
normalDF = normalDF[['Dim0','Dim1', 'Dim2']]
Normal = np.array(normalDF)
print(Normal.shape)

anomalDF = Xpca[Xpca.IsolationForest== -1]
anomalDF = anomalDF[['Dim0','Dim1', 'Dim2']]
Anomal = np.array(anomalDF)
print(Anomal.shape)

# Plot 3D with Anomalies
fig = plt.figure(figsize=(16,16))
ax = fig.add_subplot(projection='3d')

xs = Normal[:,0]
ys = Normal[:,1]
zs = Normal[:,2]
ax.scatter(xs, ys, zs, marker='.', c='g')

xsA = Anomal[:,0]
ysA = Anomal[:,1]
zsA = Anomal[:,2]
ax.scatter(xsA, ysA, zsA, marker='o', c='r')
plt.title('Isolation Forest contamination = 0.001')
plt.show()

### Local Outlier Factor

In [None]:
%%time
Preds = LocalOutlierFactor(n_neighbors= 35).fit_predict(X)
# n_neighbors=20 by default ... hyperparam that needs be set
Preds.shape

# 1=inlier, -1=outlier
Xpca['LocalOutlierFactor'] = Preds
print(Xpca.shape)

Xpca.LocalOutlierFactor.value_counts(normalize=True)

In [None]:
%%time

normalDF = Xpca[Xpca.LocalOutlierFactor== 1]
normalDF = normalDF[['Dim0','Dim1', 'Dim2']]
Normal = np.array(normalDF)
print(Normal.shape)

anomalDF = Xpca[Xpca.LocalOutlierFactor== -1]
anomalDF = anomalDF[['Dim0','Dim1', 'Dim2']]
Anomal = np.array(anomalDF)
print(Anomal.shape)

# Plot 3D with Anomalies

fig = plt.figure(figsize=(16,16))
ax = fig.add_subplot(projection='3d')

xs = Normal[:,0]
ys = Normal[:,1]
zs = Normal[:,2]
ax.scatter(xs, ys, zs, marker='.', c='g')

xsA = Anomal[:,0]
ysA = Anomal[:,1]
zsA = Anomal[:,2]
ax.scatter(xsA, ysA, zsA, marker='o', c='r')
plt.title('Local Outlier Factor n_neighbors = 35')
plt.show()

# Ensemble by fusion of labels
* If ANY of the 3 models flags an instance as an outlier, then the instance = anomaly (model1 OR model2 OR model3)
* If ALL 3 models flag an instance as an outlier, then the instance = anomaly (model1 AND model2 AND model3)
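
The two fusion rules can be sketched on a handful of toy predictions (the arrays below are invented; they follow scikit-learn's convention of `1` = inlier, `-1` = outlier):

```python
import numpy as np

# Toy predictions from three detectors: 1 = inlier, -1 = outlier.
ee  = np.array([ 1, -1, -1,  1, -1])
iso = np.array([ 1, -1,  1,  1, -1])
lof = np.array([ 1, -1,  1, -1, -1])

# Boolean vote matrix: one row per model, True where that model says "outlier".
votes = np.stack([ee, iso, lof]) == -1

any_anomal = votes.any(axis=0).astype(int)  # OR across models
all_anomal = votes.all(axis=0).astype(int)  # AND across models

print(any_anomal)  # [0 1 1 1 1]
print(all_anomal)  # [0 1 0 0 1]
```

ANY is the more sensitive rule (it flags everything at least one model flags), while ALL is the more conservative one, so the ALL set is always a subset of the ANY set.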

In [None]:
my0 = np.zeros(len(Xpca))
Xpca['AnyAnomal'] = my0
Xpca['AllAnomal'] = my0

In [None]:
# Anomaly = if ANY of the 3 models above predicted it as anomaly

Xpca.loc[(Xpca.LocalOutlierFactor==-1) | 
         (Xpca.EllipticEnvelope==-1) |
         (Xpca.IsolationForest==-1), 'AnyAnomal'] = 1

Xpca.AnyAnomal.value_counts(normalize=True)

In [None]:
%%time

normalDF = Xpca[Xpca.AnyAnomal== 0]
normalDF = normalDF[['Dim0','Dim1', 'Dim2']]
Normal = np.array(normalDF)
print(Normal.shape)

anomalDF = Xpca[Xpca.AnyAnomal== 1]
anomalDF = anomalDF[['Dim0','Dim1', 'Dim2']]
Anomal = np.array(anomalDF)
print(Anomal.shape)

# Plot 3D with Anomalies

fig = plt.figure(figsize=(16,16))
ax = fig.add_subplot(projection='3d')

xs = Normal[:,0]
ys = Normal[:,1]
zs = Normal[:,2]
ax.scatter(xs, ys, zs, marker='.', c='g')

xsA = Anomal[:,0]
ysA = Anomal[:,1]
zsA = Anomal[:,2]
ax.scatter(xsA, ysA, zsA, marker='o', c='r')
plt.title('Ensemble of ANY of the 3 models')
plt.show()

In [None]:
# Anomaly = if ALL of the 3 models above predicted it as anomaly

Xpca.loc[(Xpca.LocalOutlierFactor==-1) & 
         (Xpca.EllipticEnvelope==-1) &
         (Xpca.IsolationForest==-1), 'AllAnomal'] = 1

Xpca.AllAnomal.value_counts(normalize=True)

In [None]:
%%time

normalDF = Xpca[Xpca.AllAnomal== 0]
normalDF = normalDF[['Dim0','Dim1', 'Dim2']]
Normal = np.array(normalDF)
print(Normal.shape)

anomalDF = Xpca[Xpca.AllAnomal== 1]
anomalDF = anomalDF[['Dim0','Dim1', 'Dim2']]
Anomal = np.array(anomalDF)
print(Anomal.shape)

# Plot 3D with Anomalies

fig = plt.figure(figsize=(16,16))
ax = fig.add_subplot(projection='3d')

xs = Normal[:,0]
ys = Normal[:,1]
zs = Normal[:,2]
ax.scatter(xs, ys, zs, marker='.', c='g')

xsA = Anomal[:,0]
ysA = Anomal[:,1]
zsA = Anomal[:,2]
ax.scatter(xsA, ysA, zsA, marker='o', c='r')
plt.title('Ensemble of ALL of the 3 models')
plt.show()

In [None]:
%%time

# PCA variance loss: how many components are needed to retain 90% of the variance (vs. the 3 used above)

pcaVar = PCA(n_components=0.9).fit_transform(CleanData)
pcaVar.shape
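
A more direct way to quantify what the 3-component projection loses is to read `explained_variance_ratio_` from a fitted PCA. The sketch below uses a synthetic matrix as a stand-in for the scaled claims data, so the printed numbers say nothing about this dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled feature matrix: 10 columns with decreasing spread.
X_toy = rng.normal(size=(500, 10)) * np.linspace(3.0, 0.5, 10)

# Share of total variance kept by a 3-component projection.
pca3 = PCA(n_components=3).fit(X_toy)
retained = pca3.explained_variance_ratio_.sum()
print(f"Variance retained by 3 components: {retained:.1%}")

# A float n_components in (0, 1) keeps as many components as needed to reach that share.
pca90 = PCA(n_components=0.9).fit(X_toy)
print("Components needed for 90% variance:", pca90.n_components_)
```

If the second number is much larger than 3, the anomaly detectors above are working on a projection that discards a large part of the signal, which is worth keeping in mind when interpreting the flagged points.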

# Downsample 1M to 250k

In [None]:
nrows = 250000

SmallCleanData = Xpca[['AnyAnomal','AllAnomal']][:nrows]
print(SmallCleanData.shape)


In [None]:
del Xpca
del CleanData
del pcaVar
gc.collect()

In [None]:
RawClaimBenef = pd.read_csv('/kaggle/input/claims-data-prep-2/ClaimsCleanNorm.csv', nrows=nrows)
RawClaimBenef.drop(['Unnamed: 0'], axis=1, inplace=True)
print(RawClaimBenef.shape)
RawClaimBenef.sample()

In [None]:
SmallCleanData.AnyAnomal.value_counts()

# Feature Importance of a Supervised model

In [None]:
# Split into X and y 

X = RawClaimBenef.copy()
#X = np.array(X)
print(X.shape)

y = SmallCleanData.AnyAnomal
y.shape

In [None]:
# Apply SMOTE ONLY to the training split; resampling before the split would leak information about test rows into training

# Split the data into training and testing set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True, random_state=101)

print('Original dataset shape %s' % Counter(y_train))
random_state = 42

smote = SMOTE(random_state=random_state)
X_res, y_res = smote.fit_resample(X_train, y_train)
print('Resampled dataset shape %s' % Counter(y_res))

X_train = X_res
y_train = y_res

print("X_train - ",X_train.shape)
print("y_train - ",y_train.shape)
print("X_test - ",X_test.shape)
print("y_test - ",y_test.shape)

In [None]:
Xcols = X.columns
del RawClaimBenef
del X
del y
gc.collect()

## Logistic Regression

In [None]:
%%time
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train) 
y_pred = logreg.predict(X_test)

print('Accuracy :{0:0.5f}'.format(metrics.accuracy_score(y_pred , y_test))) 
print('AUC : {0:0.5f}'.format(metrics.roc_auc_score(y_test , y_pred)))
print('Precision : {0:0.5f}'.format(metrics.precision_score(y_test , y_pred)))
print('Recall : {0:0.5f}'.format(metrics.recall_score(y_test , y_pred)))
print('F1 : {0:0.5f}'.format(metrics.f1_score(y_test , y_pred)))

In [None]:
# Feature importance by LogReg (signed coefficients)

importance = logreg.coef_[0]
FIdf = pd.DataFrame()
FIdf['Feature'] = Xcols  # X was deleted above; its column names were saved in Xcols
FIdf['Importance'] = importance
FIdf.sort_values(by=['Importance'], ascending=True, inplace=True)
FIdf.head(30)

In [None]:
figure(figsize=(12, 100), dpi=80)

pyplot.barh(FIdf['Feature'], FIdf['Importance'])
pyplot.show()

## Random Forest Classifier

In [None]:
%%time
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train) 
y_pred = rfc.predict(X_test)

print('Accuracy :{0:0.5f}'.format(metrics.accuracy_score(y_pred , y_test))) 
print('AUC : {0:0.5f}'.format(metrics.roc_auc_score(y_test , y_pred)))
print('Precision : {0:0.5f}'.format(metrics.precision_score(y_test , y_pred)))
print('Recall : {0:0.5f}'.format(metrics.recall_score(y_test , y_pred)))
print('F1 : {0:0.5f}'.format(metrics.f1_score(y_test , y_pred)))

In [None]:
# Feature importance

importance = rfc.feature_importances_
FIdf = pd.DataFrame()
FIdf['Feature'] = Xcols
FIdf['Importance'] = importance

FIdf.sort_values(by=['Importance'], ascending=True, inplace=True)
FIdf.head(30)

In [None]:
figure(figsize=(12, 100), dpi=80)

pyplot.barh(FIdf['Feature'], FIdf['Importance'])
pyplot.show()

### Future improvements - Add SHAP (SHapley Additive exPlanations) 

"...There is a big difference between both importance measures: **Permutation feature importance is based on the decrease in model performance**.

**SHAP is based on magnitude of feature attributions**..."

* Ref https://christophm.github.io/interpretable-ml-book/shap.html#shap-feature-importance
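
As a first step in that direction, the permutation importance mentioned in the quote is available in scikit-learn as `sklearn.inspection.permutation_importance`. A minimal sketch on synthetic data follows; the toy dataset and model settings are assumptions, and nothing here is a result from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Toy data: with shuffle=False the 2 informative features are columns 0 and 1.
X_toy, y_toy = make_classification(n_samples=1000, n_features=6, n_informative=2,
                                   n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          stratify=y_toy, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the drop in F1.
result = permutation_importance(model, X_te, y_te, scoring="f1",
                                n_repeats=5, random_state=0)

for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

Unlike the impurity-based `feature_importances_` used above, this measure is computed on held-out data and is tied to a chosen metric, so it reflects what the model actually needs to predict well rather than how often a feature was used for splits.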