<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Load Data</a></span></li><li><span><a href="#Set-up-relevant-features" data-toc-modified-id="Set-up-relevant-features-0.2"><span class="toc-item-num">0.2&nbsp;&nbsp;</span>Set up relevant features</a></span></li></ul></li><li><span><a href="#List-of-features-to-create-based-on-EDA" data-toc-modified-id="List-of-features-to-create-based-on-EDA-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>List of features to create based on EDA</a></span><ul class="toc-item"><li><span><a href="#Set-up-a-provider-oriented-data-frame" data-toc-modified-id="Set-up-a-provider-oriented-data-frame-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Set up a provider-oriented data frame</a></span></li><li><span><a href="#Create-new-features-for-providers" data-toc-modified-id="Create-new-features-for-providers-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Create new features for providers</a></span></li></ul></li><li><span><a href="#Merge-with-target-variable" data-toc-modified-id="Merge-with-target-variable-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Merge with target variable</a></span></li><li><span><a href="#Second-Iter-of-Feature-Engineering-Based-on-Initial-Modeling-Results" data-toc-modified-id="Second-Iter-of-Feature-Engineering-Based-on-Initial-Modeling-Results-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Second Iter of Feature Engineering Based on Initial Modeling Results</a></span></li></ul></div>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
sys.path.insert(0, '..')
from joblib import load
import Functions as fxns
from hashable_df import hashable_df
%matplotlib inline
plt.rcParams['figure.figsize'] = (9, 6)
sns.set(style = "whitegrid")
sns.set_palette("icefire")
pd.set_option('display.max_columns', 500)
import warnings
warnings.filterwarnings(action="ignore")

## Load Data

In [None]:
# # CREATES A .PKL FILE IN THE MAIN FOLDER - ONLY NEEDS TO BE RUN ONCE/IF PRE-PROCESSING IS UPDATED.
# !python ../Preprocessing.py # REMOVE OR COMMENT OUT AFTER PRE-PROCESSING
claims = load('../claims.pkl')

## Set up relevant features

In [None]:
# Create variables for convenience 
diag_code = claims.columns[claims.columns.str.contains('DiagnosisCode')].tolist()
proc_code = claims.columns[claims.columns.str.contains('ProcedureCode')].tolist()
codes = diag_code + proc_code
chronic = claims.columns[claims.columns.str.contains("Chronic")].tolist()

In [None]:
claims["ClaimDuration"] = claims["ClaimEndDt"] - claims["ClaimStartDt"]
claims["ClaimDuration"] = claims["ClaimDuration"].dt.days + 1
claims["NoPhy"] = claims[['AttendingPhysician', 'OperatingPhysician', 'OtherPhysician']].isna().all(axis =1)
claims['AllPhy'] = claims[['AttendingPhysician', 'OperatingPhysician','OtherPhysician']].notnull().all(axis =1)
claims['SameAttOper'] = claims['AttendingPhysician'] == claims['OperatingPhysician']
claims["AdmisDuration"] = claims["DischargeDt"] - claims["AdmissionDt"]
claims["AdmisDuration"] = claims["AdmisDuration"].dt.days
claims["AgeAtClm"] = round((claims["ClaimStartDt"] - claims["DOB"]).dt.days/365,0).astype(int)
claims["TotalRev"] = claims['InscClaimAmtReimbursed'] + claims['DeductibleAmtPaid']
claims['ClmYear'] = claims.ClaimStartDt.dt.year
claims['ClmMonth'] = claims.ClaimStartDt.dt.month
# Series.dt.week was removed in pandas 2.0; isocalendar().week gives the ISO week number
claims['ClmWeek'] = claims.ClaimStartDt.dt.isocalendar().week.astype(int)
claims['InsCovRatio'] = claims['InscClaimAmtReimbursed']/(claims['InscClaimAmtReimbursed'] + claims["DeductibleAmtPaid"])
# ClaimDuration already counts both endpoint days (days + 1), so no extra +1 here
claims['RevPerDay'] = claims["TotalRev"]/claims['ClaimDuration']
claims['Chronic_Sum'] = claims[chronic].sum(axis = 1)
claims['No_Diag_Code'] = claims[diag_code].isna().all(axis = 1)
claims['No_Proc_Code'] = claims[proc_code].isna().all(axis = 1)

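The row-level features above can be sanity-checked on a toy frame. This is a minimal sketch with invented values (not taken from `claims.pkl`); only the column names follow the cell above. It shows that `ClaimDuration` counts both endpoint days and that `InsCovRatio` becomes NaN when nothing was reimbursed or paid.

```python
import pandas as pd

# Toy claims with invented values; column names mirror the notebook.
toy = pd.DataFrame({
    "ClaimStartDt": pd.to_datetime(["2009-01-01", "2009-03-10", "2009-05-05"]),
    "ClaimEndDt": pd.to_datetime(["2009-01-01", "2009-03-12", "2009-05-06"]),
    "InscClaimAmtReimbursed": [900.0, 0.0, 300.0],
    "DeductibleAmtPaid": [100.0, 0.0, 100.0],
})

# A same-day claim lasts 1 day, not 0.
toy["ClaimDuration"] = (toy["ClaimEndDt"] - toy["ClaimStartDt"]).dt.days + 1
toy["TotalRev"] = toy["InscClaimAmtReimbursed"] + toy["DeductibleAmtPaid"]
# Float division 0/0 yields NaN rather than raising.
toy["InsCovRatio"] = toy["InscClaimAmtReimbursed"] / toy["TotalRev"]
```

Because `mean()` skips NaN by default, claims with zero total revenue simply drop out of the per-provider `InsCovRatio` averages computed later.
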
In [None]:
# Create variables for convenience 
inclaims = claims[claims['IsOutpatient'] == 0]
outclaims = claims[claims['IsOutpatient'] == 1]

# List of features to create based on EDA

* Patient/Physician Ratio
* Average number of claims per patient
* Average number of claims per physician
* Percentage of inpatients going to different hospitals 
* Percentage of outpatients going to different hospitals
* Percentage of patients that receive both in/out patient service
* Whether the provider serves both in/out patients
* Percentage of attending physicians serving for different hospitals
* Percentage of operating physicians serving for different hospitals
* Percentage of other physicians serving for different hospitals
* Number of unique inpatient beneficiaries
* Number of unique outpatient beneficiaries
* Number of unique states for inpatients
* Number of unique states for outpatients
* Percentage of outpatient claims
* Percentage of claims that had all physicians involved
* Percentage of claims that had no physicians involved
* Average claim duration for inpatients
* Average claim duration for outpatients
* Average amount of reimbursed claims for inpatients
* Average amount of reimbursed claims for outpatients
* Average deductible paid for inpatients
* Average deductible paid for outpatients
* Average admission duration for inpatients
* Average age of inpatients
* Average age of outpatients
* Average number of chronic conditions for inpatients
* Average number of chronic conditions for outpatients
* Average insurance-covered ratio for inpatients (Reimbursement/(Reimbursement+Deductible))
* Average insurance-covered ratio for outpatients
* Average revenue per day for inpatients
* Average revenue per day for outpatients
* Percentage of duplicate inpatient claims
* Percentage of duplicate outpatient claims
* Average claim duration of duplicate inpatient claims
* Average claim duration of duplicate outpatient claims
* Percentage of outpatient claims with no diagnosis codes
* Percentage of inpatient claims with no procedure codes
* Percentage of claims from top 5 fraudulent states per provider

* Percentage of inpatients with one of the top 5 most frequent chronic diseases (from PotentialFraud)
* Percentage of outpatients with one of the top 5 most frequent chronic diseases (from PotentialFraud)
* Percentage of inpatient claims with top 5 admtcode (from PotentialFraud)
* Percentage of outpatient claims with top 5 admtcode (from PotentialFraud)


## Set up a provider-oriented data frame

In [None]:
# Create Provider-oriented data frame
providers = pd.DataFrame(claims.groupby('Provider')['ClaimID'].size().index)

## Create new features for providers

In [None]:
# Patient/Physician Ratio
PP_Ratio = claims.groupby('Provider')[[
            'BeneID','AttendingPhysician',
            'OperatingPhysician','OtherPhysician']].nunique().reset_index()
PP_Ratio['Patient_Attphy_Ratio'] = PP_Ratio['BeneID']/PP_Ratio['AttendingPhysician']
PP_Ratio['Patient_Operphy_Ratio'] = PP_Ratio['BeneID']/ PP_Ratio['OperatingPhysician']
PP_Ratio['Patient_Otherphy_Ratio'] = PP_Ratio['BeneID']/ PP_Ratio['OtherPhysician']

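# Positional axis arguments to drop() were removed in pandas 2.0; name the columns explicitly
PP_Ratio.drop(columns=['BeneID','AttendingPhysician','OperatingPhysician','OtherPhysician'], inplace=True)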
providers = providers.merge(PP_Ratio, how = 'left', on = 'Provider')

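The ratios above divide by a `nunique` count, which is 0 for a provider whose claims never name that kind of physician; pandas then returns `inf`. A minimal sketch of one way to guard against that, assuming NaN is the preferred encoding for "no such physician". The helper `safe_ratio` is hypothetical and the values are invented.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Element-wise division that returns NaN where the denominator is 0."""
    return numerator / denominator.replace(0, np.nan)

# Toy per-provider unique counts (invented values).
counts = pd.DataFrame({
    "Provider": ["P1", "P2"],
    "BeneID": [10, 4],
    "OperatingPhysician": [5, 0],  # P2 never lists an operating physician
})
counts["Patient_Operphy_Ratio"] = safe_ratio(
    counts["BeneID"], counts["OperatingPhysician"])
```

NaN keeps these providers compatible with a later `fillna`, whereas `inf` would need separate handling before modeling.
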
In [None]:
# Average number of claims per patient
claim_bene = claims.groupby('Provider')[[
    'ClaimID','BeneID']].agg({'ClaimID':'count','BeneID':'nunique'}).reset_index()
claim_bene['Claim_Patient_Ratio'] = claim_bene['ClaimID']/claim_bene['BeneID']
claim_bene.drop(columns=['ClaimID','BeneID'], inplace=True)
providers = providers.merge(claim_bene, how = 'left', on = 'Provider')

# Average number of claims per attending physician
claim_attphy = claims.groupby('Provider')[[
    'ClaimID','AttendingPhysician']].agg({
    'ClaimID':'count','AttendingPhysician':'nunique'}).reset_index()
claim_attphy['Claim_AttPhy_Ratio'] = claim_attphy['ClaimID']/claim_attphy['AttendingPhysician']
claim_attphy.drop(columns=['ClaimID','AttendingPhysician'], inplace=True)
providers = providers.merge(claim_attphy, how = 'left', on = 'Provider')

In [None]:
# Percentage of outpatient claims
OP_Perc = claims.groupby('Provider')[['IsOutpatient']].mean().add_suffix('_Perc').reset_index()
providers = providers.merge(OP_Perc, how = 'left', on = 'Provider')

In [None]:
# Number of unique inpatient beneficiaries
# Number of unique states for inpatients
IP_nunique = inclaims.groupby('Provider')[[
    'BeneID','State']].nunique().add_suffix('_Nunique_IP').reset_index()

# Number of unique outpatient beneficiaries
# Number of unique states for outpatients
OP_nunique = outclaims.groupby('Provider')[[
    'BeneID','State']].nunique().add_suffix('_Nunique_OP').reset_index()

providers = providers.merge(IP_nunique, how = 'left', on = 'Provider').\
                        merge(OP_nunique, how = 'left', on = 'Provider')

In [None]:
# Percentage of claims that had all physicians involved
# Percentage of claims that had no physicians involved
# Average claim duration for inpatients
# Average claim duration for outpatients
# Average amount of reimbursed claims for inpatients
# Average amount of reimbursed claims for outpatients
# Average admission duration for inpatients
# Average age of inpatients
# Average age of outpatients
# Average number of chronic condition for inpatients
# Average number of chronic condition for outpatients
# Average Insurance covered Ratio for inpatients (Reimbursement/(Reimbursement+Deductible) 
# Average Insurance covered Ratio for outpatients
# Average revenue per day for inpatients
# Average revenue per day for outpatients
# Average deductible paid for inpatients
# Average deductible paid for outpatients
ip_mean = inclaims.groupby('Provider')[['AllPhy','NoPhy',
                                        'ClaimDuration','InscClaimAmtReimbursed',
                                        'AdmisDuration','AgeAtClm','DeductibleAmtPaid',
                                        'Chronic_Sum','InsCovRatio','RevPerDay'
                                       ]].mean().add_suffix('_mean_IP').reset_index()

# AdmisDuration is omitted here: admission duration is only listed as an inpatient feature
op_mean = outclaims.groupby('Provider')[['AllPhy','NoPhy',
                                         'ClaimDuration','InscClaimAmtReimbursed',
                                         'AgeAtClm','DeductibleAmtPaid',
                                         'Chronic_Sum','InsCovRatio','RevPerDay'
                                        ]].mean().add_suffix('_mean_OP').reset_index()

providers = providers.merge(ip_mean, how = 'left', on = 'Provider').merge(op_mean, how = 'left', on = 'Provider')

In [None]:
# Percentage of attending physicians serving for different hospitals
nuniq_prov = claims.groupby('AttendingPhysician')["Provider"].nunique().reset_index()
phy_more = nuniq_prov[nuniq_prov.Provider > 1].AttendingPhysician.tolist()
claims.loc[claims["AttendingPhysician"].isin(phy_more),"Att_Phy_Mult"] = 1
claims.loc[~claims["AttendingPhysician"].isin(phy_more),"Att_Phy_Mult"] = 0

# Percentage of operating physicians serving for different hospitals
nuniq_prov = claims.groupby('OperatingPhysician')["Provider"].nunique().reset_index()
phy_more = nuniq_prov[nuniq_prov.Provider > 1].OperatingPhysician.tolist()
claims.loc[claims["OperatingPhysician"].isin(phy_more),"Oper_Phy_Mult"] = 1
claims.loc[~claims["OperatingPhysician"].isin(phy_more),"Oper_Phy_Mult"] = 0

# Percentage of other physicians serving for different hospitals
nuniq_prov = claims.groupby('OtherPhysician')["Provider"].nunique().reset_index()
phy_more = nuniq_prov[nuniq_prov.Provider > 1].OtherPhysician.tolist()
claims.loc[claims["OtherPhysician"].isin(phy_more),"Other_Phy_Mult"] = 1
claims.loc[~claims["OtherPhysician"].isin(phy_more),"Other_Phy_Mult"] = 0

physician_mult_prov = claims.groupby('Provider')[[
                'Att_Phy_Mult','Oper_Phy_Mult','Other_Phy_Mult'
                ]].mean().add_suffix('_Perc').reset_index()
providers = providers.merge(physician_mult_prov, how = 'left', on = 'Provider')

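The flagging logic in the cell above can be written as a single mapping step. The sketch below is an illustrative alternative on invented data, not the cell's own code: it counts providers per physician, maps that count back onto each claim, and averages the flag per provider. Claims with a missing physician map to NaN and therefore get flag 0, matching the `~isin` branch above.

```python
import numpy as np
import pandas as pd

# Toy claims (invented); D1 works for two providers, one claim has no physician.
toy = pd.DataFrame({
    "Provider": ["P1", "P1", "P2", "P2", "P3"],
    "AttendingPhysician": ["D1", "D2", "D1", np.nan, "D3"],
})

# Number of distinct providers each physician appears under (NaN keys are dropped).
prov_per_phy = toy.groupby("AttendingPhysician")["Provider"].nunique()

# Map the count back to every claim and flag physicians seen at more than one provider.
toy["Att_Phy_Mult"] = toy["AttendingPhysician"].map(prov_per_phy).gt(1).astype(int)

# Share of each provider's claims handled by a multi-provider physician.
att_mult_share = toy.groupby("Provider")["Att_Phy_Mult"].mean()
```
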
In [None]:
# Percentage of inpatients going to different hospitals 
nuniq_prov = inclaims.groupby('BeneID')["Provider"].nunique().reset_index()
bene_more = nuniq_prov[nuniq_prov['Provider'] > 1]['BeneID'].tolist()
claims.loc[claims["BeneID"].isin(bene_more),"IP_Multiple_Hospital"] = 1
claims.loc[~claims["BeneID"].isin(bene_more),"IP_Multiple_Hospital"] = 0

# Percentage of outpatients going to different hospitals
nuniq_prov = outclaims.groupby('BeneID')["Provider"].nunique().reset_index()
bene_more = nuniq_prov[nuniq_prov['Provider'] > 1]['BeneID'].tolist()
claims.loc[claims["BeneID"].isin(bene_more),"OP_Multiple_Hospital"] = 1
claims.loc[~claims["BeneID"].isin(bene_more),"OP_Multiple_Hospital"] = 0

patients_mult_hospital = claims.groupby('Provider')[[
    'IP_Multiple_Hospital','OP_Multiple_Hospital']].mean().add_suffix('_Perc').reset_index()
providers = providers.merge(patients_mult_hospital, how = 'left', on = 'Provider')

In [None]:
# Percentage of patients that receive both in/out patient service
bene_inp = inclaims['BeneID'].unique().tolist()
bene_both = outclaims[outclaims['BeneID'].isin(bene_inp)]['BeneID'].tolist()
claims.loc[claims["BeneID"].isin(bene_both),"Bene_Receive_Both_IO"] = 1
claims.loc[~claims["BeneID"].isin(bene_both),"Bene_Receive_Both_IO"] = 0

bene_receive_both = claims.groupby('Provider')[[
            'Bene_Receive_Both_IO']].mean().add_suffix('_Perc').reset_index()
providers = providers.merge(bene_receive_both, how = 'left', on = 'Provider')

In [None]:
# Whether the provider serves both in/out patients
prov_inp = inclaims['Provider'].unique().tolist()
prov_both = outclaims[outclaims['Provider'].isin(prov_inp)]['Provider'].tolist()
claims.loc[claims["Provider"].isin(prov_both),"Provider_Serve_BothIO"] = 1
claims.loc[~claims["Provider"].isin(prov_both),"Provider_Serve_BothIO"] = 0

provider_serve_both = claims.groupby('Provider')[['Provider_Serve_BothIO']].mean().reset_index()
providers = providers.merge(provider_serve_both, how = 'left', on = 'Provider')

In [None]:
# Create duplicate boolean column
claims['code_all_nan'] = claims[diag_code + proc_code].isna().all(axis = 1)
# .copy() so the column assignments below do not write into a slice of claims
claims_withcode = claims[claims['code_all_nan'] == False].copy()
dup_combination = claims_withcode[diag_code + proc_code].values.tolist()
dup_combination = list(
    map(lambda x: [code for code in x if str(code) != "nan"], dup_combination))
claims_withcode['Dup_Combo'] = dup_combination
claims_withcode['Duplicate_Bool'] = hashable_df(
    claims_withcode).duplicated(subset = ['Dup_Combo'], keep = False)

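The duplicate flag marks claims whose full set of diagnosis and procedure codes also appears on at least one other claim. As a minimal sketch of the same idea without `hashable_df`, the combination can be stored as a tuple, which is hashable and so works with `duplicated` directly. The column names and code values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical code columns and invented code values.
code_cols = ["DiagnosisCode_1", "DiagnosisCode_2", "ProcedureCode_1"]
toy = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3", "C4"],
    "DiagnosisCode_1": ["4019", "4019", "2724", np.nan],
    "DiagnosisCode_2": ["2724", "2724", np.nan, np.nan],
    "ProcedureCode_1": [np.nan, np.nan, np.nan, np.nan],
})

# Keep only claims that carry at least one code.
with_code = toy[toy[code_cols].notna().any(axis=1)].copy()

# Ordered tuple of the non-missing codes on each claim.
with_code["Dup_Combo"] = [
    tuple(code for code in row if pd.notna(code))
    for row in with_code[code_cols].itertuples(index=False, name=None)
]

# keep=False marks every member of a duplicated group, not just the repeats.
with_code["Duplicate_Bool"] = with_code["Dup_Combo"].duplicated(keep=False)
```
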
In [None]:
# Inpatient duplicate percentage
inp_dup_perc = claims_withcode[claims_withcode[
    'IsOutpatient'] == 0].groupby('Provider')[['Duplicate_Bool']].mean().reset_index()
inp_dup_perc = inp_dup_perc.rename(columns={'Duplicate_Bool': 'IP_Dup_Perc'})
providers = providers.merge(inp_dup_perc, how = 'left', on = 'Provider')
# 0 means none of the provider's inpatient claims are duplicates; NaN means it has no inpatient claims with codes

In [None]:
# Outpatient duplicate percentage
outp_dup_perc = claims_withcode[claims_withcode[
    'IsOutpatient'] == 1].groupby('Provider')[['Duplicate_Bool']].mean().reset_index()
outp_dup_perc = outp_dup_perc.rename(columns={'Duplicate_Bool': 'OP_Dup_Perc'})
providers = providers.merge(outp_dup_perc, how = 'left', on = 'Provider')
# 0 means none of the provider's outpatient claims are duplicates; NaN means it has no outpatient claims with codes

In [None]:
# Percentage of outpatient with no diagnosis code
no_diag_code = outclaims.groupby('Provider')[['No_Diag_Code']].mean().reset_index()
no_diag_code = no_diag_code.rename(columns={'No_Diag_Code': 'OP_No_Diag_Perc'})
providers = providers.merge(no_diag_code, how = 'left', on = 'Provider')

In [None]:
# Percentage of inpatient with no procedure code
no_proc_code = inclaims.groupby('Provider')[['No_Proc_Code']].mean().reset_index()
no_proc_code = no_proc_code.rename(columns={'No_Proc_Code': 'IP_No_Proc_Perc'})
providers = providers.merge(no_proc_code, how = 'left', on = 'Provider')

In [None]:
# Percentage of claims from top 5 fraudulent states per provider
claims.PotentialFraud = claims.PotentialFraud.astype(int)
top_five = claims.groupby('State')[['PotentialFraud']].mean().sort_values(
                                    by = 'PotentialFraud', ascending = False).index[:5]
claims['In_Top5_State'] = claims['State'].isin(top_five)

top_five_states = claims.groupby('Provider')[[
    'In_Top5_State']].mean().add_suffix('_Perc').reset_index()
providers = providers.merge(top_five_states, how = 'left', on = 'Provider')

# Merge with target variable

In [None]:
# target = pd.read_csv('./data/Train-1542865627584.csv')
# target['PotentialFraud'] = target['PotentialFraud'].apply(lambda x: np.where(x == "Yes",1,0))
# providers_final = providers.merge(target, how = 'left', on = 'Provider')

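The commented-out merge above can be sketched on toy data. This example assumes the target file has one row per provider and labels `PotentialFraud` as "Yes" or "No" (only "Yes" appears in the cell above; "No" is an assumption). `validate='one_to_one'` makes pandas raise if either side repeats a provider, so the merge cannot silently multiply rows.

```python
import pandas as pd

# Invented stand-ins for the providers frame and the target file.
providers_toy = pd.DataFrame({
    "Provider": ["P1", "P2"],
    "Claim_Patient_Ratio": [1.5, 2.0],
})
target_toy = pd.DataFrame({
    "Provider": ["P1", "P2"],
    "PotentialFraud": ["Yes", "No"],
})

# Vectorized label encoding: "Yes" -> 1, anything else -> 0.
target_toy["PotentialFraud"] = target_toy["PotentialFraud"].eq("Yes").astype(int)

merged = providers_toy.merge(
    target_toy, how="left", on="Provider", validate="one_to_one")
```
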
In [None]:
# providers_final.to_csv('providers_final.csv')

# Second Iter of Feature Engineering Based on Initial Modeling Results

* Percentage of duplicates from different states for inpatients
* Percentage of duplicates from different states for outpatients
* Percentage of duplicates from different providers for inpatients
* Percentage of duplicates from different providers for outpatients
* Mean duplicates per patient for inpatients
* Mean duplicates per patient for outpatients
* Mean duplicates per physician for inpatients
* Mean duplicates per physician for outpatients

* Mean Cost per unique patient
* Percentage of claims that have the same attending and operating physician for outpatient


In [None]:
# Creating boolean columns for Percentage of duplicates from different states
dup_same_state = claims_withcode[['State'] + diag_code + proc_code].values.tolist()
dup_same_state = list(map(lambda x: [code for code in x if str(code) != "nan"], dup_same_state))
claims_withcode['dup_same_state'] = dup_same_state
claims_withcode['duplicate_bool_st'] = hashable_df(claims_withcode).duplicated(subset = ['dup_same_state'], keep = False)
from_same_state = claims_withcode[claims_withcode['duplicate_bool_st'] == 1].index
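# Drop same-state duplicates. This overwrites claims_withcode, so later cells only see the remaining rows.
claims_withcode = claims_withcode.loc[~claims_withcode.index.isin(from_same_state)].copy()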
dup_diff_state = claims_withcode[diag_code + proc_code].values.tolist()
dup_diff_state = list(map(lambda x: [code for code in x if str(code) != "nan"], dup_diff_state))
claims_withcode['dup_diff_state'] = dup_diff_state
claims_withcode['dup_diff_state_bool'] = \
                hashable_df(claims_withcode).duplicated(subset = ['dup_diff_state'], keep = False)

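The two-step logic in the cell above is easier to follow on a toy frame. The sketch below mirrors it with plain `duplicated` calls; `Dup_Key` is a hypothetical single-column stand-in for the code combination, and all values are invented. Step one removes claims whose combination repeats within the same state; step two flags combinations that still repeat among the remaining claims.

```python
import pandas as pd

toy = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3", "C4", "C5"],
    "State": [1, 1, 2, 3, 3],
    "Dup_Key": ["A", "A", "A", "A", "B"],  # stand-in for the code combination
})

# Step 1: claims whose (State, combination) pair occurs more than once.
same_state = toy.duplicated(subset=["State", "Dup_Key"], keep=False)

# Step 2: among the remaining claims, any repeat must come from a different state.
remaining = toy[~same_state].copy()
remaining["dup_diff_state_bool"] = remaining.duplicated(
    subset=["Dup_Key"], keep=False)
```

Note that C1 and C2 are removed in step one even though combination "A" also occurs in states 2 and 3, so they do not count toward the different-state percentage under this scheme.
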
In [None]:
# Percentage of duplicates from different states for inpatients
inp_dup_diff_state = claims_withcode[claims_withcode[
    'IsOutpatient'] == 0].groupby('Provider')[['dup_diff_state_bool']].mean().reset_index()
inp_dup_diff_state = inp_dup_diff_state.rename(columns={'dup_diff_state_bool': 'IP_Perc_Dup_Diff_State'})
providers = providers.merge(inp_dup_diff_state, how = 'left', on = 'Provider')
# 0 means no cross-state duplicates among the provider's remaining inpatient claims; NaN means it has none left

In [None]:
# Percentage of duplicates from different states for outpatients
outp_dup_diff_state = claims_withcode[claims_withcode[
    'IsOutpatient'] == 1].groupby('Provider')[['dup_diff_state_bool']].mean().reset_index()
outp_dup_diff_state = outp_dup_diff_state.rename(columns={'dup_diff_state_bool': 'OP_Perc_Dup_Diff_State'})
providers = providers.merge(outp_dup_diff_state, how = 'left', on = 'Provider')
# 0 means no cross-state duplicates among the provider's remaining outpatient claims; NaN means it has none left

In [None]:
# Creating boolean columns for Percentage of duplicates from different provider
dup_same_provider = claims_withcode[['Provider'] + diag_code + proc_code].values.tolist()
dup_same_provider = list(map(lambda x: [code for code in x if str(code) != "nan"], dup_same_provider))
claims_withcode['dup_same_provider'] = dup_same_provider
claims_withcode['duplicate_bool_pr'] = hashable_df(claims_withcode).duplicated(subset = ['dup_same_provider'], keep = False)
from_same_provider = claims_withcode[claims_withcode['duplicate_bool_pr'] == 1].index
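# Drop same-provider duplicates. This overwrites claims_withcode again, so later cells only see the remaining rows.
claims_withcode = claims_withcode.loc[~claims_withcode.index.isin(from_same_provider)].copy()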
dup_diff_provider = claims_withcode[diag_code + proc_code].values.tolist()
dup_diff_provider = list(map(lambda x: [code for code in x if str(code) != "nan"], dup_diff_provider))
claims_withcode['dup_diff_provider'] = dup_diff_provider
claims_withcode['dup_diff_provider_bool'] = \
                hashable_df(claims_withcode).duplicated(subset = ['dup_diff_provider'], keep = False)

In [None]:
# Percentage of duplicates from different providers for inpatients
inp_dup_diff_provider = claims_withcode[claims_withcode[
    'IsOutpatient'] == 0].groupby('Provider')[['dup_diff_provider_bool']].mean().reset_index()
inp_dup_diff_provider = inp_dup_diff_provider.rename(columns={'dup_diff_provider_bool': 'IP_Perc_Dup_Diff_Provider'})
providers = providers.merge(inp_dup_diff_provider, how = 'left', on = 'Provider')
# 0 means no cross-provider duplicates among the provider's remaining inpatient claims; NaN means it has none left

In [None]:
# Percentage of duplicates from different providers for outpatients
outp_dup_diff_provider = claims_withcode[claims_withcode[
    'IsOutpatient'] == 1].groupby('Provider')[['dup_diff_provider_bool']].mean().reset_index()
outp_dup_diff_provider = outp_dup_diff_provider.rename(columns={'dup_diff_provider_bool': 'OP_Perc_Dup_Diff_Provider'})
providers = providers.merge(outp_dup_diff_provider, how = 'left', on = 'Provider')
# 0 means no cross-provider duplicates among the provider's remaining outpatient claims; NaN means it has none left

In [None]:
# Mean count of duplicates per inpatient
ip_dup_count = claims_withcode[(claims_withcode['IsOutpatient'] == 0) & (claims_withcode['Duplicate_Bool'] == 1)
                              ].groupby(['Provider','BeneID'])[['ClaimID']].count().reset_index()
ip_dup_count = ip_dup_count.groupby('Provider')[['ClaimID']].mean()
ip_dup_count = ip_dup_count.rename(columns={'ClaimID': 'IP_Mean_Duplicate_per_Patient'})
ip_dup_count = ip_dup_count.reset_index()
providers = providers.merge(ip_dup_count, how = 'left', on = 'Provider')

In [None]:
# Mean count of duplicates per outpatient
op_dup_count = claims_withcode[(claims_withcode['IsOutpatient'] == 1) & (claims_withcode['Duplicate_Bool'] == 1)
                              ].groupby(['Provider','BeneID'])[['ClaimID']].count().reset_index()
op_dup_count = op_dup_count.groupby('Provider')[['ClaimID']].mean()
op_dup_count = op_dup_count.rename(columns={'ClaimID': 'OP_Mean_Duplicate_per_Patient'})
op_dup_count = op_dup_count.reset_index()
providers = providers.merge(op_dup_count, how = 'left', on = 'Provider')

In [None]:
# Mean duplicates per physician for inpatient
ip_dup_count_phy = claims_withcode[(claims_withcode['IsOutpatient'] == 0) & (claims_withcode['Duplicate_Bool'] == 1)
                              ].groupby(['Provider','AttendingPhysician'])[['ClaimID']].count().reset_index()
ip_dup_count_phy = ip_dup_count_phy.groupby('Provider')[['ClaimID']].mean()
ip_dup_count_phy = ip_dup_count_phy.rename(columns={'ClaimID': 'IP_Mean_Duplicate_per_AttPhy'})
ip_dup_count_phy = ip_dup_count_phy.reset_index()
providers = providers.merge(ip_dup_count_phy, how = 'left', on = 'Provider')

In [None]:
# Mean duplicates per physician for outpatient
op_dup_count_phy = claims_withcode[(claims_withcode['IsOutpatient'] == 1) & (claims_withcode['Duplicate_Bool'] == 1)
                              ].groupby(['Provider','AttendingPhysician'])[['ClaimID']].count().reset_index()
op_dup_count_phy = op_dup_count_phy.groupby('Provider')[['ClaimID']].mean()
op_dup_count_phy = op_dup_count_phy.rename(columns={'ClaimID': 'OP_Mean_Duplicate_per_AttPhy'})
op_dup_count_phy = op_dup_count_phy.reset_index()
providers = providers.merge(op_dup_count_phy, how = 'left', on = 'Provider')

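The four cells above share one pattern: count duplicate claims per (provider, entity) pair, then average those counts per provider. A minimal sketch of that pattern as a reusable function, on invented data; `mean_dups_per_entity` is a hypothetical helper, not part of `Functions.py`.

```python
import pandas as pd

def mean_dups_per_entity(dups: pd.DataFrame, entity: str, name: str) -> pd.DataFrame:
    """Average number of duplicate claims per `entity` within each provider."""
    # Stage 1: duplicate claims per (Provider, entity) pair.
    per_entity = dups.groupby(["Provider", entity])["ClaimID"].count()
    # Stage 2: mean of those counts for each provider.
    return per_entity.groupby(level="Provider").mean().rename(name).reset_index()

# Toy frame standing in for the already-filtered duplicate claims (invented values).
toy = pd.DataFrame({
    "Provider": ["P1", "P1", "P1", "P2"],
    "BeneID": ["B1", "B1", "B2", "B3"],
    "ClaimID": ["C1", "C2", "C3", "C4"],
})
result = mean_dups_per_entity(toy, "BeneID", "IP_Mean_Duplicate_per_Patient")
```

Passing `"AttendingPhysician"` as `entity` would give the per-physician variant with the same two stages.
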
In [None]:
# providers = providers.fillna(0)

In [None]:
# providers.to_csv('new_features_ryan.csv')