## Detect Medicare Fraud in the US

Medical identity fraud is prevalent in the US but rarely discussed. Forbes estimates that fraudulent medical accounts make up 3-10% of the multi-billion-dollar US healthcare system every year.

- Combine the CMS Part B data (2012-2017 available; 2012, 2013, 2015 and 2016 are used below) with the LEIE data (01/2018-12/2019).

- Follow directions from Johnson & Khoshgoftaar (2019) to clean both sets.

- Classify each row as fraudulent (1) or not (0) based on exclusion dates.

(The COS API did not work reliably, so all files were written locally and then uploaded to COS.)

<a id='menu'></a>
### Menu

- <a href='#cms'>1. Pull CMS Data</a>

- <a href='#leie'>2. Pull LEIE Data</a>

- <a href='#combined'>3. Combine and Label Fraud</a>

- <a href='#visualization'>4. Build Visualizations</a>
    - <a href='#usmap'>US Medicare Fraud Map</a>
    - <a href='#histogram1'>Histograms of Submitted Charge/Medicare Reimbursements Comparing Fraud and Non-Fraud</a>
    - <a href='#correlation1'>Correlation Heatmap for Continuous Variables</a>
    - <a href='#correlation2'>Table of High Correlations</a>
    - <a href='#barchart1'>Top 15 Fraudulent Medicare Category by Service Count</a>
    - <a href='#barchart2'>Top 15 Fraudulent Medicare Category by Service Count and Gender</a>

- <a href='#train'>5. Prep Data for Training</a>
    - <a href='#norm'>One-hot Encoding and Normalization</a>
    - <a href='#ros'>Random Undersampling</a>

In [2]:
import os
import pandas as pd

In [8]:
# Set up data directory
CWD = os.getcwd()
cms_data_dir = os.path.join(CWD, 'CMSData')

In [9]:
# Column names are uppercase in some years and lowercase in others:
capitalization_dict = {
    '2012': str.upper,
    '2013': str.upper,
    '2014': str.lower,
    '2015': str.lower,
    '2016': str.upper,
    '2017': str.lower,
}

<a href='#menu'>[Menu]</a>
<a id='cms'></a>

### 1. CMS Part B dataset

In [10]:
# Set dtypes based on https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/...
#Medicare-Provider-Charge-Data/Physician-and-Other-Supplier2017
partB_dtypes = {
    'npi': 'str',
    'nppes_provider_last_org_name': 'str',
    'nppes_provider_first_name': 'str',
    'nppes_provider_mi': 'str',
    'nppes_credentials': 'str',
    'nppes_provider_gender': 'str',
    'nppes_entity_code': 'str',
    'nppes_provider_street1': 'str',
    'nppes_provider_street2': 'str',
    'nppes_provider_city': 'str',
    'nppes_provider_zip': 'str',
    'nppes_provider_state': 'str',
    'nppes_provider_country': 'str',
    'provider_type': 'str',
    'medicare_participation_indicator': 'str',
    'place_of_service': 'str',
    'hcpcs_code': 'str',
    'hcpcs_description': 'str',
    'hcpcs_drug_indicator': 'str',
    'line_srvc_cnt': 'float64',
    'bene_unique_cnt': 'float64',    
    'bene_day_srvc_cnt': 'float64',
    'average_medicare_allowed_amt': 'float64',
    'average_submitted_chrg_amt': 'float64',
    'average_medicare_payment_amt': 'float64',
    'average_medicare_standard_amt': 'float64',
}

In [11]:
# Load one DataFrame per year (takes a few minutes)
years = ['2012', '2013', '2015', '2016']
dfs = []

for year in years:
    file = os.path.join(cms_data_dir, f'cms{year}.txt')
    # Re-case the dtype keys to match this year's column capitalization
    recase = capitalization_dict[year]
    dtypes = {recase(col): dtype for col, dtype in partB_dtypes.items()}
    df = pd.read_csv(file, delimiter='\t', dtype=dtypes)
    df.columns = map(str.lower, df.columns)  # make all column names lowercase
    df['year'] = year  # add year column
    dfs.append(df)
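
The key re-casing step in the loading loop can be tried in isolation. The following is a minimal sketch using shortened, hypothetical copies of the two dictionaries (`partB_dtypes_demo`, `capitalization_demo`) and a hypothetical helper `dtypes_for_year`; it does not read any CMS file.

```python
# Hypothetical, shortened stand-ins for partB_dtypes and capitalization_dict.
partB_dtypes_demo = {'npi': 'str', 'hcpcs_code': 'str', 'line_srvc_cnt': 'float64'}
capitalization_demo = {'2012': str.upper, '2014': str.lower}

def dtypes_for_year(year):
    """Return the dtype mapping with keys re-cased to match that year's header."""
    recase = capitalization_demo[year]
    return {recase(col): dtype for col, dtype in partB_dtypes_demo.items()}

print(dtypes_for_year('2012'))  # keys in uppercase
print(dtypes_for_year('2014'))  # keys in lowercase
```

Passing the re-cased mapping to `pd.read_csv(..., dtype=...)` matters because dtype keys are matched against the header exactly as it appears in the file.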

In [13]:
# Concatenate
partB_df = pd.concat(dfs, axis=0, ignore_index=True, sort=False)
partB_df.shape

(37653939, 30)

In [15]:
# Remove rows corresponding to drugs because LINE_SRVC_CNT for them is not a desirable count
partB_df = partB_df[(partB_df['hcpcs_drug_indicator'] == 'N')]
partB_df.shape

(35368294, 30)

In [16]:
# Drop missing NPI and HCPCS - "Medicare fraud detection using neural networks" (Johnson, Khoshgoftaar 2019)
# This is why 2014 and 2017 are excluded from `years` above - neither had an HCPCS code
partB_df = partB_df.dropna(subset = ['npi','hcpcs_code'])
partB_df.shape

(35368294, 30)

In [17]:
# Keep variables based on "Medicare fraud detection using neural networks" (Johnson, Khoshgoftaar 2019)
partB_variables_to_keep = [
    'npi',
    'provider_type',
    'nppes_provider_city', # keep
    'nppes_provider_zip', # keep
    'nppes_provider_state', # keep
    'nppes_provider_country', # keep
    'hcpcs_code',  # not in paper but kept
    'hcpcs_description',  # not in paper but kept
    'hcpcs_drug_indicator',  # not in paper but kept
    'place_of_service',  # not in paper but kept
    'nppes_provider_gender',
    'line_srvc_cnt',
    'bene_unique_cnt',
    'bene_day_srvc_cnt',
    'average_submitted_chrg_amt',
    'average_medicare_payment_amt',
    'year' # need Year for labeling
]
partB_df = partB_df[partB_variables_to_keep]

In [18]:
partB_df.head()

Unnamed: 0,npi,provider_type,nppes_provider_city,nppes_provider_zip,nppes_provider_state,nppes_provider_country,hcpcs_code,hcpcs_description,hcpcs_drug_indicator,place_of_service,nppes_provider_gender,line_srvc_cnt,bene_unique_cnt,bene_day_srvc_cnt,average_submitted_chrg_amt,average_medicare_payment_amt,year
1,1003000126,Internal Medicine,CUMBERLAND,215021854,MD,US,99222,"Initial hospital inpatient care, typically 50 ...",N,F,M,115.0,112.0,115.0,199.0,108.115652,2012
2,1003000126,Internal Medicine,CUMBERLAND,215021854,MD,US,99223,"Initial hospital inpatient care, typically 70 ...",N,F,M,93.0,88.0,93.0,291.0,158.87,2012
3,1003000126,Internal Medicine,CUMBERLAND,215021854,MD,US,99231,"Subsequent hospital inpatient care, typically ...",N,F,M,111.0,83.0,111.0,58.0,30.720721,2012
4,1003000126,Internal Medicine,CUMBERLAND,215021854,MD,US,99232,"Subsequent hospital inpatient care, typically ...",N,F,M,544.0,295.0,544.0,105.0,56.655662,2012
5,1003000126,Internal Medicine,CUMBERLAND,215021854,MD,US,99233,"Subsequent hospital inpatient care, typically ...",N,F,M,75.0,55.0,75.0,150.0,81.39,2012


In [21]:
partB_df.loc[partB_df['npi'] == '1003000142'][['npi',
                                             'provider_type',
                                             'place_of_service',
                                             'line_srvc_cnt',
                                             'average_submitted_chrg_amt',
                                             'year']][:5]

Unnamed: 0,npi,provider_type,place_of_service,line_srvc_cnt,average_submitted_chrg_amt,year
16,1003000142,Anesthesiology,O,28.0,216.571429,2012
17,1003000142,Anesthesiology,O,24.0,111.0,2012
9153288,1003000142,Anesthesiology,F,56.0,483.0,2013
9153289,1003000142,Anesthesiology,F,16.0,1105.0,2013
9153290,1003000142,Anesthesiology,F,22.0,709.136364,2013


In [20]:
partB_df['year'].value_counts()

2016    9104337
2015    8904316
2013    8732934
2012    8626707
Name: year, dtype: int64

In [22]:
# Write all combined CMS to csv
#partB_df.to_csv('combined-partB-data-v2')

<a href='#menu'>[Menu]</a>
<a id='leie'></a>

### 2. LEIE Dataset

In [23]:
leie_data_dir = os.path.join(CWD, 'LEIEData')

In [24]:
leie_dtypes = {
    'LASTNAME': 'str',
    'FIRSTNAME': 'str',
    'MIDNAME': 'str',
    'BUSNAME' : 'str',
    'GENERAL': 'str',
    'SPECIALTY': 'str',
    'UPIN': 'str',
    'NPI': 'int64',
    'DOB': 'str',
    'ADDRESS': 'str',
    'CITY': 'str',
    'STATE': 'str',
    'ZIP': 'str',
    'EXCLTYPE': 'str',
    'EXCLDATE': 'int64',
    'REINDATE': 'int64',
    'WAIVERDATE': 'int64',
    'WVRSTATE': 'str',
}

In [25]:
#LEIE data is monthly between 01/2018 (1801) - 12/2019 (1912)
year_months = ['1801','1802','1803','1804','1805','1806','1807','1808','1809','1810','1811','1812',
            '1901','1902','1903','1904','1905','1906','1907','1908','1909','1910','1911','1912']
dfs = []

for year_month in year_months:
    file = os.path.join(leie_data_dir, f'leie{year_month}-excl.csv')
    df   = pd.read_csv(file, dtype=leie_dtypes)
    df.columns = map(str.lower, df.columns)
    dfs.append(df)

In [26]:
# Concatenate
leie_df = pd.concat(dfs, axis=0, ignore_index=True, sort=False)
leie_df.shape

(4983, 18)

In [27]:
leie_df.head()

Unnamed: 0,lastname,firstname,midname,busname,general,specialty,upin,npi,dob,address,city,state,zip,excltype,excldate,reindate,waiverdate,wvrstate
0,,,,CHANGING STEPS TREATMENT CENTE,OTHER BUSINESS,COMMUNITY HLTH CTR (,,1477704351,,"14540 HAMLIN ST, STE B",VAN NUYS,CA,91411,1128a1,20180220,0,0,
1,,,,OLIVE TREE FOSTER HOME,OTHER BUSINESS,ADULT HOME,,0,,94-245 PUPUKOAE STREET,HONOLULU,HI,96797,1128a1,20180220,0,0,
2,AIRHART,LAURA,PAULINE,,IND- LIC HC SERV PRO,NURSE/NURSES AIDE,,0,19770704.0,7304 FULLER CIRCLE,FORT WORTH,TX,76133,1128b4,20180220,0,0,
3,ALBERT,AMY,,,IND- LIC HC SERV PRO,PHYSICIAN'S ASSISTAN,,1679639397,19770818.0,1124 GAINSBORO ROAD,LOWER MERION,PA,19004,1128b4,20180220,0,0,
4,ALLEN,HEATHER,ANITRA,,IND- LIC HC SERV PRO,NURSE/NURSES AIDE,,0,19740326.0,1004 BINGHAM AVE,ROWAN,IA,50470,1128a3,20180220,0,0,


In [28]:
# Drop NPI = 0, which means missing - A LOT ARE MISSING, which is a problem for the data
leie_df = leie_df[leie_df['npi'] != 0]
leie_df.shape

(1009, 18)

In [29]:
# Keep exclusions most related to Fraud
exclusions_to_keep = [
    '1128a1',
    '1128a2',
    '1128a3',
    '1128b4',
    '1128b7',
    '1128c3Gi',
    '1128c3gii',
]
leie_df = leie_df[leie_df['excltype'].isin(exclusions_to_keep)]
leie_df.shape

(669, 18)

In [30]:
leie_df['excltype'].value_counts()

1128a1    362
1128b4    154
1128a3     77
1128a2     44
1128b7     32
Name: excltype, dtype: int64

- 1128a1: Conviction of program-related crimes
- 1128a2: Conviction relating to patient abuse or neglect
- 1128a3: Felony conviction relating to healthcare fraud
- 1128b4: License revocation, suspension, or surrender
- 1128b7: Fraud, kickbacks, other prohibited activities

In [31]:
# Write all combined LEIE to csv
#leie_df.to_csv('combined-leie-data')

<a href='#menu'>[Menu]</a>
<a id='combined'></a>

### 3. Combine/Label Data

In [32]:
from datetime import datetime, timedelta
import numpy as np

In [33]:
# Convert to datetime
leie_df['excldate'] = pd.to_datetime(leie_df['excldate'], format='%Y%m%d', errors ='ignore')

In [34]:
# Round exclusion date to a year, following Johnson & Khoshgoftaar (2019):
# dates in June or later roll forward to January 1 of the next year
def round_to_year(dt):
    year = dt.year
    if dt.month >= 6:
        year = year + 1
    return datetime(year=year, month=1, day=1)

leie_df['excl_year'] = leie_df.excldate.apply(round_to_year)
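
To see what the rounding rule does to individual dates, here is a small self-contained sketch. The function name `round_to_year_demo` and the sample dates are illustrative only; the rule itself (month 6 or later rolls forward) is copied from the cell above.

```python
from datetime import datetime

def round_to_year_demo(dt):
    """Map a date to January 1 of its year, or of the next year from June onward."""
    year = dt.year + 1 if dt.month >= 6 else dt.year
    return datetime(year=year, month=1, day=1)

examples = [datetime(2018, 2, 20), datetime(2018, 6, 1), datetime(2019, 12, 19)]
rounded = [round_to_year_demo(d) for d in examples]

for before, after in zip(examples, rounded):
    print(before.date(), '->', after.date())
```

A February 2018 exclusion stays in 2018, while June 2018 and December 2019 roll forward to 2019 and 2020 respectively.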

In [35]:
# Make exclusion dict 
# 1215053665 has 2 exclusions, so sort df so that the latest year wins
# Keys are cast to str because partB_df['npi'] is read as str while the LEIE 'npi' is int64
sorted_leie_df = leie_df.sort_values(by='excl_year')
excl_year_dict = dict(zip(sorted_leie_df.npi.astype(str), sorted_leie_df.excl_year))

In [36]:
# Get label as 0 or 1
partB_df['excl_year'] = partB_df['npi'].map(excl_year_dict)
partB_df['excl_year'] = partB_df['excl_year'].fillna(datetime(year=1900,month=1,day=1)) # fill NaN, physicians without exclusion, with year 1900

partB_df['year'] = pd.to_datetime(partB_df['year'].astype(str), format='%Y', errors ='ignore')
partB_df['fraudulent'] = np.where(partB_df['year'] < partB_df['excl_year'], 1, 0) # compare year vs. exclusion year to get Fraudulent
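
The labeling rule can be checked on a toy table. This is a sketch under assumptions: the NPIs, years and the `excl_year_demo` dictionary are made up, and only the comparison logic (claim year before exclusion year means fraudulent, with 1900 as the sentinel for providers without an exclusion) mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Made-up claims: provider '111' is excluded in 2015, provider '222' never is.
claims = pd.DataFrame({
    'npi': ['111', '111', '222'],
    'year': pd.to_datetime(['2012', '2016', '2015'], format='%Y'),
})
excl_year_demo = {'111': pd.Timestamp(2015, 1, 1)}

# Providers missing from the dict get a sentinel year that precedes every claim.
sentinel = pd.Timestamp(1900, 1, 1)
claims['excl_year'] = claims['npi'].map(excl_year_demo).fillna(sentinel)

# A claim is labeled fraudulent when it was filed before the exclusion year.
claims['fraudulent'] = np.where(claims['year'] < claims['excl_year'], 1, 0)
print(claims)
```

Only the 2012 claim of provider '111' is labeled 1; the 2016 claim falls after the exclusion year and the never-excluded provider always compares against 1900.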

In [37]:
print("partB_df is our combined dataset with shape: {0}".format(partB_df.shape))

partB_df is our combined dataset with shape: (35368294, 19)


<a href='#menu'>[Menu]</a>
<a id='visualization'></a>

### 4. Draw Visualizations

In [38]:
%matplotlib inline

import seaborn as sns
import matplotlib.pyplot as plt

import plotly.figure_factory as ff
import plotly.graph_objects  as go
from plotly.subplots import make_subplots

In [39]:
# Get number and amount of fraudulent services
partB_df['fraud_line_srvc_cnt'] = partB_df['line_srvc_cnt']*partB_df['fraudulent']
partB_df['fraud_average_submitted_chrg_amt'] = partB_df['average_submitted_chrg_amt']*partB_df['fraudulent']
partB_df['fraud_average_medicare_payment_amt'] = partB_df['average_medicare_payment_amt']*partB_df['fraudulent']

In [72]:
# Aggregate by state
state_df = partB_df.groupby('nppes_provider_state').agg({
    'line_srvc_cnt':[('total_services_count','sum')],
    'fraud_line_srvc_cnt':[('total_fraud_services_count','sum')]
}).reset_index()

# Drop multi-index
state_df.columns = ['_'.join(col) for col in state_df.columns]
state_df.columns = ['provider_state', 'total_services_count', 'fraud_services_count']

In [73]:
# Get % fraud
state_df['fraud_services_pct'] = state_df['fraud_services_count']/state_df['total_services_count']
state_df.head()

Unnamed: 0,provider_state,total_services_count,fraud_services_count,fraud_services_pct
0,AA,26875.0,0.0,0.0
1,AE,45118.0,0.0,0.0
2,AK,7096804.1,0.0,0.0
3,AL,145179172.2,0.0,0.0
4,AP,35607.0,0.0,0.0


<a id='usmap'></a>
<a href='#menu'>[Menu]</a>

In [74]:
np.seterr(divide = 'ignore') 

fig = go.Figure(data=go.Choropleth(
    locations=state_df['provider_state'],
    z = np.log(state_df['fraud_services_pct'].astype(float) + 0.000001), #log-scale; offset added before the log so zero-fraud states stay finite
    locationmode = 'USA-states',
    colorscale = 'Reds',
    colorbar_title = "Logged %",
    marker_line_color='white'
))

fig.update_layout(
    title_text = 'Logged Percentage of Medicare Fraudulent Service by US State (2012-2016)',
    geo_scope='usa',
)

fig.show()

The choropleth map shows a high concentration of fraudulent Medicare claims in the Northeast and South of the US, with Texas leading all states by a wide margin.
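
For the log colour scale used in the map, one way to keep states with zero fraudulent services finite is to add a small offset before taking the logarithm. The sketch below uses made-up state codes and fractions; `state_demo` and `EPS` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up fractions of fraudulent services for three placeholder states.
state_demo = pd.DataFrame({
    'provider_state': ['AA', 'BB', 'CC'],
    'fraud_services_pct': [0.0, 0.0001, 0.01],
})

EPS = 1e-6  # small offset so that log(0) is never evaluated
state_demo['log_pct'] = np.log(state_demo['fraud_services_pct'] + EPS)
print(state_demo)
```

With the offset inside the logarithm, a state with no fraud gets the lowest finite value on the scale rather than negative infinity.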

In [None]:
# Get dummy variables for gender
partB_df2 = pd.concat([partB_df, pd.get_dummies(partB_df['nppes_provider_gender'])], axis=1)

In [None]:
# Aggregate by provider type
type_df = partB_df2.groupby('provider_type').agg({
    'line_srvc_cnt':['sum'],
    'fraud_line_srvc_cnt':['sum'],
    'M':['sum'],
    'F':['sum'],
    'average_submitted_chrg_amt':['median'], #since distribution skewed right
    'average_medicare_payment_amt':['median'],
    'fraud_average_submitted_chrg_amt':['max'],
    'fraud_average_medicare_payment_amt':['max']
}).reset_index()

# Drop multi-index
type_df.columns = ['_'.join(col) for col in type_df.columns]
type_df.columns = ['provider_type', 'total_services_count', 'fraud_services_count','male_count','female_count',
                   'avg_submitted_chrg_amt','avg_medicare_payment_amt', 'fraud_avg_submitted_chrg_amt','fraud_avg_medicare_payment_amt']

In [None]:
# Sorting
type_df = type_df.sort_values('fraud_services_count',ascending=False)[:15] #get top 15 fraudulent types
type_df = type_df.sort_values('total_services_count',ascending=True).reset_index(drop=True) #re-sort by total services

# Add some fields
type_df['non_fraud_services_count'] = type_df['total_services_count'] - type_df['fraud_services_count']
type_df['fraud_services_pct'] = (type_df['fraud_services_count']/type_df['total_services_count'])*100
type_df.head()

<a id='histogram1'></a>
<a href='#menu'>[Menu]</a>

In [None]:
# Get 2015 data only for speed
partB_2015_df = partB_df2[partB_df2['year'] == datetime(year=2015,month=1,day=1)]
top_7_types  = type_df['provider_type'][:7].tolist()

for p_type in top_7_types:
    type_rows = partB_2015_df[partB_2015_df.provider_type == p_type]

    # Drop zero charges, split by label, then take the log
    x = np.log(type_rows[(type_rows.fraudulent == 0) & (type_rows.average_submitted_chrg_amt != 0)]['average_submitted_chrg_amt'])
    fraud_x = np.log(type_rows[type_rows.fraud_average_submitted_chrg_amt != 0]['fraud_average_submitted_chrg_amt'])

    fig, ax = plt.subplots(figsize=(12,5))
    # hist=False draws a kernel density estimate, so the y-axis is a density
    sns.distplot(x, label='Non-fraudulent', hist=False, rug=False)
    sns.distplot(fraud_x, label='Fraudulent', hist=False, rug=False)

    ax.set(
        title  = 'Distribution of Submitted Charge Amount - Fraud vs. Non-fraud - ' + p_type,
        xlabel = 'Log of USD Submitted Charge Amount',
        ylabel = 'Density'
    )
    ax.legend()
    plt.show()

The distribution comparisons show no obvious evidence that fraudulent submitted charges are higher than non-fraudulent ones across the provider types plotted.

In [None]:
for p_type in top_7_types:
    type_rows = partB_2015_df[partB_2015_df.provider_type == p_type]

    # Drop zero payments and split by label (no log transform here)
    x = type_rows[(type_rows.fraudulent == 0) & (type_rows.average_medicare_payment_amt != 0)]['average_medicare_payment_amt']
    fraud_x = type_rows[type_rows.fraud_average_medicare_payment_amt != 0]['fraud_average_medicare_payment_amt']

    fig, ax = plt.subplots(figsize=(12,5))
    sns.distplot(x, label='Non-fraudulent', hist=False, rug=False)
    sns.distplot(fraud_x, label='Fraudulent', hist=False, rug=False)
    ax.set(
        title  = 'Distribution of Medicare Payment Amount - Fraud vs. Non-fraud - ' + p_type,
        xlabel = 'USD Payment Amount',
        ylabel = 'Density'
    )
    ax.legend()
    plt.show()

The same conclusion holds for the Medicare payment amount.

<a id='correlation1'></a>
<a href='#menu'>[Menu]</a>

In [None]:
partB_2015_df = partB_df[partB_df['year'] == datetime(year=2015,month=1,day=1)]

fig, ax = plt.subplots(figsize=(15,7))

# Restrict to numeric columns so that corr() does not fail on string columns
sns.heatmap(partB_2015_df.select_dtypes(include='number').corr(method='pearson'), annot=True, fmt='.4f', 
            cmap=plt.get_cmap('coolwarm'), cbar=False, robust=True)
ax.set_yticklabels(ax.get_yticklabels(), rotation="horizontal")
ax.set(title='Correlation Heatmap in 2015 (only Continuous Variables)')
plt.show()

The heatmap shows nothing particularly interesting beyond the correlations we would normally expect.

In [None]:
# Get categorical variables (except location)
for col in ['nppes_provider_gender','provider_type']:
    partB_2015_df = pd.concat([partB_2015_df, pd.get_dummies(partB_2015_df[col], drop_first= True)], axis=1)
    partB_2015_df = partB_2015_df.drop(columns=col)  # drop the column that has been encoded

In [None]:
def get_redundant_pairs(df):
    '''Get duplicate pairs to drop in correlation matrix after unstacking'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_correlations(df):
    '''Get biggest correlations'''
    au_corr = df.corr().unstack() #unstack
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop)
    return au_corr

#get relevant columns
cols = partB_2015_df.columns[20:].tolist() + ['line_srvc_cnt','bene_unique_cnt','average_submitted_chrg_amt','fraudulent']
corr = get_correlations(partB_2015_df[cols])
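
An alternative way to keep each variable pair once, sketched here on a toy frame, is to mask the correlation matrix with its strict upper triangle instead of building a set of labels to drop. This is a swapped-in technique, not the notebook's method; `demo`, `mask` and `pairs` are hypothetical names.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 9], 'c': [4, 3, 2, 1]})
corr_matrix = demo.corr()

# True strictly above the diagonal (k=1): drops self-pairs and mirrored pairs.
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)

# Masked-out cells become NaN; stack to a pair-indexed Series and drop them.
pairs = corr_matrix.where(mask).stack().dropna()
print(pairs)
```

For three columns this leaves exactly three pairs, (a, b), (a, c) and (b, c), which is the same upper-triangle selection that `get_redundant_pairs` produces.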

<a id='correlation2'></a>
<a href='#menu'>[Menu]</a>

In [None]:
corr = corr.to_frame().reset_index() #to frame
corr.columns = ['Variable A', 'Variable B', 'Correlation']

def color_df(value):
    '''Color red if negative, green if positive, black if zero'''
    if value < 0:
        color = 'red'
    elif value > 0:
        color = 'green'
    else:
        color = 'black'
    return 'color: %s' % color

corr = corr.reindex(corr['Correlation'].abs().sort_values(ascending=False).index).reset_index(drop=True)
corr[:15].style.applymap(color_df, subset=['Correlation'])

<a id='barchart1'></a>
<a href='#menu'>[Menu]</a>

In [None]:
fig = go.Figure()

col_layout_dict = {'non_fraud_services_count': ['Non-fraud Service','rgba(50, 171, 96, 0.6)'],
                 'fraud_services_count': ['Fraud Service','rgb(255, 0, 0)']} #dict for layout

for col in ['non_fraud_services_count','fraud_services_count']:
    fig.add_trace(go.Bar(
        y=type_df['provider_type'],
        x=type_df[col],
        name=col_layout_dict[col][0],
        marker=dict(
            color=col_layout_dict[col][1],
        ),
        orientation='h',
    ))

fig.update_layout(
    barmode = 'stack',
    title = 'Top 15 Fraudulent Medicare Category by Service Count',
    paper_bgcolor='white',
    plot_bgcolor='white',
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=True,
        domain=[0, 0.95],
    ),
    xaxis=dict(
        zeroline=False,
        showline=False,
        showticklabels=True,
        showgrid=True,
        domain=[0, 0.95],
    ),
    xaxis_title_text='Count',
)

annotations = [] #annotate with %

# Place each label to the right of its stacked bar
label_x   = type_df['fraud_services_count']+type_df['non_fraud_services_count']+100000000
label_pct = np.round(type_df['fraud_services_pct'].tolist(), decimals=2)

for pct, x, y in zip(label_pct, label_x, type_df['provider_type']):
    annotations.append(dict(xref='x1', yref='y1',
                            y=y, x=x,
                            text=str(pct) + '%',
                            font=dict(family='Arial', size=12,
                                      color='rgb(255, 0, 0)'),
                            showarrow=False))

annotations.append(dict(xref='paper', yref='paper',
                        x=-0.2, y=-0.209,
                        text='Combined CMS and LEIE data ' +
                             'to label the leading Fraudulent physician categories (15 Feb 2020)',
                        font=dict(family='Arial', size=10, color='rgb(150,150,150)'),
                        showarrow=False))

fig.update_layout(annotations=annotations)

fig.show()

<a id='barchart2'></a>
<a href='#menu'>[Menu]</a>

In [None]:
fig = go.Figure()

#dict for layout
col_layout_dict = {'female_count': ['Female Physicians ','#ffcdd2'],
                 'male_count': ['Male Physicians','#A2D5F2']}

for col in ['female_count','male_count']:
    fig.add_trace(go.Bar(
        y=type_df['provider_type'],
        x=type_df[col],
        name=col_layout_dict[col][0],
        marker=dict(
            color=col_layout_dict[col][1],
        ),
        orientation='h',
    ))

fig.update_layout(
    barmode = 'stack',
    title = 'Top 15 Fraudulent Medicare Category by Service Count and Gender',
    paper_bgcolor='white',
    plot_bgcolor='white',
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=True,
        domain=[0, 0.90],
    ),
    xaxis=dict(
        zeroline=False,
        showline=False,
        showticklabels=True,
        showgrid=True,
        domain=[0, 0.90],
    ),
    xaxis_title_text='Count',
)

fig.show()

Clinical Laboratory and Ambulance Service Provider have a lot of line services but seem to require fewer doctors?

<a href='#menu'>[Menu]</a>
<a id='train'></a>

### 5. Prep Data for Training

- Aggregate data following paper's method
- Normalize predictors
- One hot encoding
- Write data for training

In [188]:
new_variables_to_keep = [
    'year',
    'npi',
    'provider_type',
    'nppes_provider_city',  
    'nppes_provider_state',
    'nppes_provider_country',
    'nppes_provider_gender',
    'line_srvc_cnt',
    'bene_unique_cnt',
    'bene_day_srvc_cnt',
    'average_submitted_chrg_amt',
    'average_medicare_payment_amt',
    'fraudulent'
]

In [189]:
#group by
temp_df = partB_df[new_variables_to_keep]

#agg by npi - provider_type and get sum stats
agg_partB_df = temp_df.groupby(by=['year','npi','provider_type','nppes_provider_city',
                      'nppes_provider_state','nppes_provider_country','nppes_provider_gender']).agg(
                    {
                    'line_srvc_cnt':["mean","median","std","min","max","sum"],
                    'bene_unique_cnt':["mean","median","std","min","max","sum"],
                    'bene_day_srvc_cnt':["mean","median","std","min","max","sum"],
                    'average_submitted_chrg_amt':["mean","median","std","min","max","sum"],
                    'average_medicare_payment_amt':["mean","median","std","min","max","sum"],
                    'fraudulent':["mean"],
                    }).reset_index()

agg_partB_df.columns = ["_".join(x) for x in agg_partB_df.columns] #flatten multi-index column names
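
The aggregate-then-flatten pattern explains the trailing underscores in names such as `npi_`. Here is a minimal sketch on made-up rows; `rows` and `agg` are hypothetical names and only a subset of the statistics is used.

```python
import pandas as pd

rows = pd.DataFrame({
    'npi': ['111', '111', '222'],
    'line_srvc_cnt': [10.0, 30.0, 5.0],
    'fraudulent': [0, 0, 1],
})

agg = rows.groupby('npi').agg({
    'line_srvc_cnt': ['mean', 'max', 'sum'],
    'fraudulent': ['mean'],
}).reset_index()

# Columns are (name, statistic) tuples; group keys carry an empty second level,
# so joining with '_' yields 'npi_' for the key and 'line_srvc_cnt_mean' etc.
agg.columns = ['_'.join(col) for col in agg.columns]
print(agg)
```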

In [190]:
agg_partB_df.head()

Unnamed: 0,year_,npi_,provider_type_,nppes_provider_city_,nppes_provider_state_,nppes_provider_country_,nppes_provider_gender_,line_srvc_cnt_mean,line_srvc_cnt_median,line_srvc_cnt_std,...,average_submitted_chrg_amt_min,average_submitted_chrg_amt_max,average_submitted_chrg_amt_sum,average_medicare_payment_amt_mean,average_medicare_payment_amt_median,average_medicare_payment_amt_std,average_medicare_payment_amt_min,average_medicare_payment_amt_max,average_medicare_payment_amt_sum,fraudulent_mean
0,2012-01-01,1003000126,Internal Medicine,CUMBERLAND,MD,US,M,174.857143,111.0,166.951518,...,58.0,291.0,1060.0,82.218697,81.39,41.94257,30.720721,158.87,575.530877,0
1,2012-01-01,1003000134,Pathology,EVANSTON,IL,US,M,959.125,223.0,2076.546889,...,39.0,263.0,1065.0,26.053306,25.187934,18.836154,7.815385,64.015735,208.426446,0
2,2012-01-01,1003000142,Anesthesiology,TOLEDO,OH,US,M,26.0,26.0,2.828427,...,111.0,216.571429,327.571429,88.93,88.93,48.295393,54.78,123.08,177.86,0
3,2012-01-01,1003000381,Physical Therapist,LADY LAKE,FL,US,M,166.8,137.0,159.703475,...,35.0,96.956522,274.965652,24.315331,19.488837,19.937615,8.767883,58.643913,121.576654,0
4,2012-01-01,1003000407,Family Practice,PATTON,PA,US,M,154.4375,63.5,180.288276,...,55.214521,510.0,2343.821257,77.469027,77.795,35.981435,29.93,153.106164,1239.504432,0


In [191]:
a = agg_partB_df.fraudulent_mean.value_counts()
display(a)
# Share of fraudulent rows among all rows: fraud / (fraud + non-fraud)
print('Fraudulent physicians are: {0}% of all data'.format(str(np.round((a[1]/a.sum())*100,decimals = 6)))) 

0    3521673
1        703
Name: fraudulent_mean, dtype: int64

Fraudulent physicians are: 0.019958% of all data


<a href='#menu'>[Menu]</a>
<a id='norm'></a>

In [197]:
# Normalize predictors to [0,1] min-max scale
from sklearn import preprocessing

def scale_predictors(df):
    '''Takes in df and returns normalized data [0,1] '''
    x  = df.values #numpy array
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    return pd.DataFrame(x_scaled, columns=df.columns, index=df.index)

scl_predict_df = scale_predictors(agg_partB_df.iloc[:,7:(agg_partB_df.shape[1]-1)])
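
Min-max scaling maps each column to [0, 1] via x' = (x - min) / (max - min). The sketch below writes that formula out in plain pandas on a toy frame, as an illustration rather than a replacement for `MinMaxScaler`; the helper `min_max_scale` is hypothetical, and sending constant columns to 0 is a choice made for this sketch.

```python
import pandas as pd

def min_max_scale(df):
    """Scale each column to [0, 1]; constant columns map to 0."""
    col_min = df.min()
    # Replace a zero range with 1 to avoid dividing by zero on constant columns.
    col_range = (df.max() - col_min).replace(0, 1)
    return (df - col_min) / col_range

demo = pd.DataFrame({'cnt': [10.0, 20.0, 30.0], 'amt': [5.0, 5.0, 5.0]})
scaled = min_max_scale(demo)
print(scaled)
```

Because the minimum and maximum come from the data being scaled, a single extreme provider can compress all other values toward 0, which is worth keeping in mind for skewed charge amounts.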

In [198]:
# merge back in with npi and label
agg_partB_df['npi_']  = agg_partB_df['npi_'].astype(str)
agg_partB_df['year_'] = agg_partB_df['year_'].astype(str).str[:4]

agg_partB_df = pd.concat([ agg_partB_df.iloc[:,[0,1,2,6,37]], scl_predict_df], axis=1)

In [199]:
# One-hot encoding
for col in ['year_','nppes_provider_gender_','provider_type_']:
    agg_partB_df = pd.concat([agg_partB_df, pd.get_dummies(agg_partB_df[col], drop_first= True)], axis=1)
    agg_partB_df = agg_partB_df.drop(columns=col) #drop old column that's been encoded
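
The effect of `drop_first=True` is easiest to see on a two-level column. In this sketch with made-up data (`demo`, `dummies` and `encoded` are hypothetical names), the first category becomes the implicit baseline and only one indicator column remains.

```python
import pandas as pd

demo = pd.DataFrame({'gender': ['F', 'M', 'F'], 'cnt': [1, 2, 3]})

# Categories are sorted, so 'F' is dropped and 'M' is kept as the indicator.
dummies = pd.get_dummies(demo['gender'], drop_first=True)
encoded = pd.concat([demo.drop(columns='gender'), dummies], axis=1)
print(encoded)
```

Dropping one level per variable avoids perfectly collinear indicator columns, at the cost of making the dropped level the reference category.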

In [200]:
display(agg_partB_df.shape)
print('There are {0} predictors.'.format(agg_partB_df.shape[1]-1)) 

(3522376, 144)

There are 143 predictors.


In [11]:
# Write data for training
# %timeit agg_partB_df.to_csv('labeled-data-training-v1')

6min 30s ± 6.28 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


<a href='#menu'>[Menu]</a>
<a id='ros'></a>

In [None]:
# Split data into fraud and not fraud  
agg_partB_df_fraud = agg_partB_df.loc[agg_partB_df['fraudulent_mean'] == 1]
agg_partB_df_fraud_filtered = agg_partB_df_fraud[~agg_partB_df_fraud.isnull().any(axis=1)]  # drop rows with any missing values
agg_partB_df_not_fraud = agg_partB_df.loc[agg_partB_df['fraudulent_mean'] == 0]
agg_partB_df_not_fraud_filtered = agg_partB_df_not_fraud[~agg_partB_df_not_fraud.isnull().any(axis=1)]  # drop rows with any missing values

In [None]:
# Sample data 
num_rows = 3200  # --> 20000 rows ~ 100 MB (limit on AutoAI)
percent_fraud = 0.2  # TODO: Change this value to increase the percentage of AutoAI sample that is fraud 
n_fraud = min(int(num_rows * percent_fraud), agg_partB_df_fraud_filtered['fraudulent_mean'].count())
random_state = np.random.RandomState(seed=0)
auto_ai_df = pd.concat([agg_partB_df_fraud_filtered.sample(n=n_fraud, random_state=random_state),
                       agg_partB_df_not_fraud_filtered.sample(n=(num_rows-n_fraud), random_state=random_state)])
print(f'Sample breakdown: {n_fraud} ({100*(n_fraud/num_rows):.2f}%) Fraud & {num_rows-n_fraud} ({100*((num_rows-n_fraud)/num_rows):.2f}%) Not Fraud')
auto_ai_df.head()
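
The sampling step above amounts to random undersampling of the majority class. The following sketch wraps the same idea in a hypothetical helper, `undersample`, and runs it on a made-up frame with 5 positive and 95 negative rows.

```python
import numpy as np
import pandas as pd

def undersample(df, label_col, n_rows, fraud_share, seed=0):
    """Draw n_rows rows with at most fraud_share of them from the positive class."""
    fraud = df[df[label_col] == 1]
    legit = df[df[label_col] == 0]
    # Cap the positive count by what is actually available.
    n_fraud = min(int(n_rows * fraud_share), len(fraud))
    rng = np.random.RandomState(seed)
    return pd.concat([
        fraud.sample(n=n_fraud, random_state=rng),
        legit.sample(n=n_rows - n_fraud, random_state=rng),
    ])

demo = pd.DataFrame({'label': [1] * 5 + [0] * 95, 'x': range(100)})
sample = undersample(demo, 'label', n_rows=20, fraud_share=0.2)
print(sample['label'].value_counts())
```

If fewer positive rows exist than the requested share, the remainder is filled with negative rows, so the realized fraud percentage can be lower than `fraud_share`.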

In [None]:
# Save auto_ai sample to project
from project_lib import Project
project = Project('healthlockai-donotdelete-pr-ogi4ydhjozwgjy', "8c9fc83d-f281-4e14-b055-cfd23187bcd2", "p-e8118b231f6c482c8860583a28f33308fe6de9a8")

project.save_data(file_name = f'auto_ai_sample_rows-{num_rows}_nfraud-{n_fraud}.csv',
                  data=auto_ai_df.to_csv(index=False, header=True), overwrite=True)