<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Load Data</a></span></li><li><span><a href="#Set-up-relevant-features" data-toc-modified-id="Set-up-relevant-features-0.2"><span class="toc-item-num">0.2&nbsp;&nbsp;</span>Set up relevant features</a></span></li></ul></li><li><span><a href="#List-of-features-to-create-based-on-EDA" data-toc-modified-id="List-of-features-to-create-based-on-EDA-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>List of features to create based on EDA</a></span><ul class="toc-item"><li><span><a href="#Set-up-a-provider-oriented-data-frame" data-toc-modified-id="Set-up-a-provider-oriented-data-frame-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Set up a provider-oriented data frame</a></span></li><li><span><a href="#Create-new-features-for-providers" data-toc-modified-id="Create-new-features-for-providers-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Create new features for providers</a></span></li></ul></li></ul></div>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
sys.path.insert(0, '..')
from joblib import load
import Functions as fxns
from hashable_df import hashable_df
%matplotlib inline
plt.rcParams['figure.figsize'] = (9, 6)
sns.set(style = "whitegrid")
sns.set_palette("icefire")
pd.set_option('display.max_columns', 500)

## Load Data

In [2]:
# # CREATES A .PKL FILE IN THE MAIN FOLDER - ONLY NEEDS TO BE RUN ONCE/IF PRE-PROCESSING IS UPDATED.
# !python ../Preprocessing.py # REMOVE OR COMMENT OUT AFTER PRE-PROCESSING
claims = load('../claims.pkl')

## Set up relevant features

In [3]:
# Create variables for convenience 
diag_code = claims.columns[claims.columns.str.contains('DiagnosisCode')].tolist()
proc_code = claims.columns[claims.columns.str.contains('ProcedureCode')].tolist()
codes = diag_code + proc_code
chronic = claims.columns[claims.columns.str.contains("Chronic")].tolist()

In [4]:
claims["ClaimDuration"] = claims["ClaimEndDt"] - claims["ClaimStartDt"]
claims["ClaimDuration"] = claims["ClaimDuration"].dt.days + 1
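# NoPhy: no attending, operating, or other physician is recorded on the claim
claims["NoPhy"] = claims[['AttendingPhysician', 'OperatingPhysician', 'OtherPhysician']].isna().all(axis =1)
# AllPhy: both an attending and an operating physician are recorded (OtherPhysician is not checked)
claims['AllPhy'] = claims[['AttendingPhysician', 'OperatingPhysician']].notna().all(axis =1)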
claims['SameAttOper'] = claims['AttendingPhysician'] == claims['OperatingPhysician']
claims["AdmisDuration"] = claims["DischargeDt"] - claims["AdmissionDt"]
claims["AdmisDuration"] = claims["AdmisDuration"].dt.days
claims["AgeAtClm"] = round((claims["ClaimStartDt"] - claims["DOB"]).dt.days/365,0).astype(int)
claims["TotalRev"] = claims['InscClaimAmtReimbursed'] + claims['DeductibleAmtPaid']
claims['ClmYear'] = claims['ClaimStartDt'].dt.year
claims['ClmMonth'] = claims['ClaimStartDt'].dt.month
# Series.dt.week was removed in pandas 2.0; isocalendar() gives the ISO week number
claims['ClmWeek'] = claims['ClaimStartDt'].dt.isocalendar().week
claims['InsCovRatio'] = claims['InscClaimAmtReimbursed']/(claims['InscClaimAmtReimbursed'] + claims["DeductibleAmtPaid"])
# ClaimDuration is already inclusive (days + 1), so it is at least 1 for valid date pairs
claims['RevPerDay'] = claims["TotalRev"]/claims['ClaimDuration']
claims['Chronic_Sum'] = claims[chronic].sum(axis = 1)
claims['No_Diag_Code'] = claims[diag_code].isna().all(axis = 1)
claims['No_Proc_Code'] = claims[proc_code].isna().all(axis = 1)
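
The date-derived columns can be sanity-checked on a small frame. The sketch below uses two made-up claims (the dates are illustrative, not taken from the data set) and assumes pandas 1.1 or later for `dt.isocalendar()`.

```python
import pandas as pd

# Hypothetical two-row frame; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "ClaimStartDt": pd.to_datetime(["2009-01-01", "2009-03-10"]),
    "ClaimEndDt": pd.to_datetime(["2009-01-01", "2009-03-14"]),
})

# Inclusive duration: a claim that starts and ends on the same day counts as 1 day.
toy["ClaimDuration"] = (toy["ClaimEndDt"] - toy["ClaimStartDt"]).dt.days + 1

# ISO week number of the claim start date.
toy["ClmWeek"] = toy["ClaimStartDt"].dt.isocalendar().week.astype(int)

print(toy)
```

The same-day claim gets a duration of 1 rather than 0, which keeps any later per-day division well defined.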

In [5]:
# Create variables for convenience 
inclaims = claims[claims['IsOutpatient'] == 0]
outclaims = claims[claims['IsOutpatient'] == 1]

# List of features to create based on EDA

* Patient/Physician Ratio

* Number of unique inpatient beneficiaries
* Number of unique outpatient beneficiaries
* Number of unique states for inpatient
* Number of unique states for outpatient
* Total deductible paid for inpatients
* Total deductible paid for outpatients
* Whether the provider serves both in/out patients
* Percentage of inpatient claims
* Percentage of claims that had all physicians involved
* Percentage of inpatients with the top 5 most frequent chronic diseases (from PotentialFraud)
* Percentage of outpatients with the top 5 most frequent chronic diseases (from PotentialFraud)
* Average number of claims per patient
* Average number of claims per physician

* Average claim duration for inpatients
* Average claim duration for outpatients
* Average amount of reimbursed claims for inpatients
* Average amount of reimbursed claims for outpatients
* Average admission duration for inpatients
* Average age of patients
* Average number of chronic conditions for inpatients
* Average number of chronic conditions for outpatients
* Average insurance coverage ratio for inpatients (Reimbursement/(Reimbursement+Deductible))
* Average insurance coverage ratio for outpatients
* Average revenue per day for inpatients
* Average revenue per day for outpatients

* Percentage of attending physicians serving different hospitals (75% threshold)
* Percentage of operating physicians serving different hospitals (75% threshold)
* Percentage of other physicians serving different hospitals (75% threshold)
* Percentage of inpatients going to different hospitals (75% threshold)
* Percentage of outpatients going to different hospitals (75% threshold)
* Percentage of inpatients that receive both in/out patient service
* Percentage of outpatients that receive both in/out patient service
* Percentage of claims that didn’t have any physician involved
* Percentage of inpatient claims with top 5 admtcode (from PotentialFraud)
* Percentage of outpatient claims with top 5 admtcode (from PotentialFraud)

* Percentage of duplicate inpatient claims
* Percentage of duplicate outpatient claims
* Average claim duration of duplicate inpatient claims
* Average claim duration of duplicate outpatient claims
* Percentage of outpatient claims with no diagnosis codes
* Percentage of inpatient claims with no procedure codes
* Percentage of claims from top 5 fraudulent states per provider
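
Many of the percentage features listed above reduce to the same pattern: take the per-provider mean of a boolean claim-level column, then left-merge it onto the provider frame. A minimal sketch on made-up data follows; the helper `add_provider_mean` and the toy frames are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Made-up claims; column names mirror the notebook.
toy_claims = pd.DataFrame({
    "Provider": ["PRV1", "PRV1", "PRV2", "PRV2", "PRV2"],
    "IsOutpatient": [0, 1, 1, 1, 1],
    "NoPhy": [False, True, False, False, True],
})
toy_providers = pd.DataFrame({"Provider": ["PRV1", "PRV2"]})

def add_provider_mean(providers, claims, column, new_name):
    """Left-merge the per-provider mean of `column` onto `providers` (hypothetical helper)."""
    feature = (claims.groupby("Provider")[column]
                     .mean()
                     .rename(new_name)
                     .reset_index())
    return providers.merge(feature, how="left", on="Provider")

# Share of all claims with no physician recorded.
toy_providers = add_provider_mean(toy_providers, toy_claims, "NoPhy", "NoPhy_Perc")

# Same feature restricted to inpatient claims; PRV2 has none, so it gets NaN.
inpatient = toy_claims[toy_claims["IsOutpatient"] == 0]
toy_providers = add_provider_mean(toy_providers, inpatient, "NoPhy", "IP_NoPhy_Perc")

print(toy_providers)
```

With a left merge, a provider that has no rows in the filtered subset ends up with NaN, which stays distinguishable from a true 0.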

## Set up a provider-oriented data frame

In [6]:
# One row per provider, sorted by ID; the feature columns below are merged onto this frame
providers = claims[['Provider']].drop_duplicates().sort_values('Provider').reset_index(drop = True)

## Create new features for providers

In [7]:
# Patient/Physician Ratio: unique beneficiaries per unique physician, by physician role
PP_Ratio = claims.groupby('Provider')[['BeneID','AttendingPhysician','OperatingPhysician','OtherPhysician']].nunique()
# A provider with no physician in a given role gets NaN instead of a division by zero
PP_Ratio['P_Attphy_Ratio'] = PP_Ratio['BeneID'] / PP_Ratio['AttendingPhysician'].replace(0, np.nan)
PP_Ratio['P_Operphy_Ratio'] = PP_Ratio['BeneID'] / PP_Ratio['OperatingPhysician'].replace(0, np.nan)
PP_Ratio['P_Otherphy_Ratio'] = PP_Ratio['BeneID'] / PP_Ratio['OtherPhysician'].replace(0, np.nan)
providers = providers.merge(PP_Ratio.iloc[:,-3:].reset_index(), how = 'left', on = 'Provider')
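
Read literally, a patient/physician ratio divides the number of unique beneficiaries by the number of unique physicians for each provider. The sketch below works this through on made-up data; the IDs are illustrative, and masking zero denominators with NaN is an assumption about how missing physician roles should be handled.

```python
import numpy as np
import pandas as pd

# Made-up claims; PRV2 never records an operating physician.
toy = pd.DataFrame({
    "Provider": ["PRV1", "PRV1", "PRV1", "PRV2"],
    "BeneID": ["B1", "B2", "B3", "B4"],
    "AttendingPhysician": ["PHY1", "PHY1", "PHY2", "PHY3"],
    "OperatingPhysician": ["PHY1", None, None, None],
})

# nunique() ignores missing values, so a role that is never filled counts as 0.
counts = toy.groupby("Provider")[
    ["BeneID", "AttendingPhysician", "OperatingPhysician"]
].nunique()

# Replace zero denominators with NaN so the ratio is NaN rather than inf.
ratio_att = counts["BeneID"] / counts["AttendingPhysician"].replace(0, np.nan)
ratio_oper = counts["BeneID"] / counts["OperatingPhysician"].replace(0, np.nan)

print(ratio_att)
print(ratio_oper)
```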

In [8]:
# Flag claims whose combination of diagnosis and procedure codes also appears on another claim
claims['code_all_nan'] = claims[codes].isna().all(axis = 1)
# .copy() makes an independent frame, so the assignments below do not trigger SettingWithCopyWarning
claims_withcode = claims[~claims['code_all_nan']].copy()
dup_combination = claims_withcode[codes].values.tolist()
# Keep only the codes that are present on each claim
dup_combination = [[code for code in row if pd.notna(code)] for row in dup_combination]
claims_withcode['Dup_Combo'] = dup_combination
# hashable_df makes the list-valued column hashable so that duplicated() can compare it
claims_withcode['Duplicate_Bool'] = hashable_df(claims_withcode).duplicated(subset = ['Dup_Combo'], keep = False)
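
The duplicate flag marks a claim when another claim carries exactly the same set of non-missing codes. The sketch below shows the idea on made-up data. It stores each combination as a tuple, which is hashable, as a swapped-in alternative to the `hashable_df` wrapper used in the notebook; the column names and code values are illustrative.

```python
import pandas as pd

# Made-up claims: rows 0 and 1 share the same codes, row 3 has no codes at all.
toy = pd.DataFrame({
    "ClmDiagnosisCode_1": ["4019", "4019", "2500", None],
    "ClmDiagnosisCode_2": ["2724", "2724", None, None],
    "ClmProcedureCode_1": [None, None, "9904", None],
})
code_cols = toy.columns.tolist()

# Drop claims with no codes; .copy() gives an independent frame to assign into.
with_codes = toy[toy[code_cols].notna().any(axis=1)].copy()

# One tuple of the non-missing codes per claim, in column order.
with_codes["Dup_Combo"] = [
    tuple(code for code in row if pd.notna(code))
    for row in with_codes[code_cols].itertuples(index=False, name=None)
]

# keep=False marks every member of a duplicated group, not just the repeats.
with_codes["Duplicate_Bool"] = with_codes.duplicated(subset=["Dup_Combo"], keep=False)

print(with_codes[["Dup_Combo", "Duplicate_Bool"]])
```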


In [9]:
# Inpatient duplicate percentage
inp_dup_perc = (claims_withcode[claims_withcode['IsOutpatient'] == 0]
                .groupby('Provider')['Duplicate_Bool'].mean()
                .rename('IP_Dup_Perc').reset_index())
providers = providers.merge(inp_dup_perc, how = 'left', on = 'Provider')
# 0 means the provider has no duplicate inpatient claims; NaN means it has no inpatient claims with codes

In [10]:
# Outpatient duplicate percentage
outp_dup_perc = (claims_withcode[claims_withcode['IsOutpatient'] == 1]
                 .groupby('Provider')['Duplicate_Bool'].mean()
                 .rename('OP_Dup_Perc').reset_index())
providers = providers.merge(outp_dup_perc, how = 'left', on = 'Provider')
# 0 means the provider has no duplicate outpatient claims; NaN means it has no outpatient claims with codes

In [11]:
# Percentage of outpatient claims with no diagnosis code
no_diag_code = outclaims.groupby('Provider')['No_Diag_Code'].mean().rename('OP_No_Diag_Perc').reset_index()
providers = providers.merge(no_diag_code, how = 'left', on = 'Provider')

In [12]:
# Percentage of inpatient claims with no procedure code
no_proc_code = inclaims.groupby('Provider')['No_Proc_Code'].mean().rename('IP_No_Proc_Perc').reset_index()
providers = providers.merge(no_proc_code, how = 'left', on = 'Provider')

In [13]:
# Percentage of each provider's claims that come from the 5 states with the highest fraud rate
claims['PotentialFraud'] = claims['PotentialFraud'].astype(int)
top_five = claims.groupby('State')[['PotentialFraud']].mean().sort_values(
                                    by = 'PotentialFraud', ascending = False).index[:5]
claims['In_Top5_State'] = claims['State'].isin(top_five)
top_five_states = claims.groupby('Provider')['In_Top5_State'].mean().rename('In_Top5_St_Perc').reset_index()
providers = providers.merge(top_five_states, how = 'left', on = 'Provider')
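
The state feature is built in two steps: rank states by the mean of the fraud label across their claims, then measure what share of each provider's claims fall in the top-ranked states. A minimal sketch on made-up data, using the top 1 state instead of the top 5 to keep the frame small:

```python
import pandas as pd

# Made-up claims; State codes and labels are illustrative.
toy = pd.DataFrame({
    "Provider": ["PRV1", "PRV1", "PRV2", "PRV2"],
    "State": [1, 2, 2, 3],
    "PotentialFraud": [1, 1, 0, 0],
})

# Step 1: fraud rate per state, then keep the highest-ranked state(s).
state_rate = toy.groupby("State")["PotentialFraud"].mean()
top_states = state_rate.sort_values(ascending=False).index[:1]

# Step 2: share of each provider's claims that fall in those states.
toy["In_Top_State"] = toy["State"].isin(top_states)
share = (toy.groupby("Provider")["In_Top_State"]
            .mean()
            .rename("In_Top_St_Perc")
            .reset_index())

print(share)
```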