## Exploratory Data Analysis and Visualization
Here, we check that every NaN in the data belongs to a claim the pharmacy approved, not to a rejected claim. First, we import packages and read the CSV file.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
cmm = pd.read_csv("Data/CMM.csv")

In [3]:
#What does it look like?
cmm.head()

Unnamed: 0,dim_claim_id,dim_pa_id,dim_date_id,bin,drug,reject_code,pharmacy_claim_approved,date_val,calendar_year,calendar_month,calendar_day,day_of_week,is_weekday,is_workday,is_holiday,correct_diagnosis,tried_and_failed,contraindication,pa_approved
0,1,1.0,1,417380,A,75.0,0,2017-01-01,2017,1,1,1,0,0,1,1.0,1.0,0.0,1.0
1,2,,1,999001,A,,1,2017-01-01,2017,1,1,1,0,0,1,,,,
2,3,2.0,1,417740,A,76.0,0,2017-01-01,2017,1,1,1,0,0,1,1.0,0.0,0.0,1.0
3,4,,1,999001,A,,1,2017-01-01,2017,1,1,1,0,0,1,,,,
4,5,,1,417740,A,,1,2017-01-01,2017,1,1,1,0,0,1,,,,


We begin by making sure no claim is duplicated in this dataset. To do this, we compare the number of records to the number of unique claim ids.

In [4]:
datalength = len(cmm)
print("We have",datalength,"records")

We have 1335576 records


In [5]:
#Count the distinct claim ids once and compare against the number of records
n_unique = cmm['dim_claim_id'].nunique()
if n_unique==datalength:
    print("There are",n_unique,"unique records, the same as the total number of records.")

There are 1335576 unique records, the same as the total number of records.
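
If the two counts had differed, the next step would be to find which claim ids repeat. The following is a minimal sketch of that follow-up on a made-up toy table (the `toy` frame and its values are illustrative assumptions, not part of the CMM data); it relies on the pandas `Series.is_unique` property and `Series.duplicated`.

```python
import pandas as pd

# Toy stand-in for the claims table; the values are invented for illustration.
toy = pd.DataFrame({
    "dim_claim_id": [1, 2, 3, 3],
    "drug": ["A", "B", "A", "A"],
})

# True only if every claim id occurs exactly once.
ids_unique = toy["dim_claim_id"].is_unique

# keep=False flags every copy of a repeated id, not just the later ones.
repeated = toy[toy["dim_claim_id"].duplicated(keep=False)]

print("All claim ids unique:", ids_unique)
print(repeated)
```

On the real data, the same two lines applied to `cmm['dim_claim_id']` should report no repeated rows, given the result above.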


Next, we count the unique payer BINs and the unique drugs. The dataset represents 4 payers and 3 drugs.

In [6]:
print("There are",len(cmm['bin'].unique()),"unique payer BINs in the dataset.")

There are 4 unique payer BINs in the dataset.


In [7]:
print("There are",len(cmm['drug'].unique()),"unique drugs in the dataset.")

There are 3 unique drugs in the dataset.
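
Both counts can also be read off in one call, and `value_counts` shows how the records are spread across categories. This is a sketch on an invented toy frame (the `toy` name and its values are assumptions for illustration). Note that `nunique` skips NaN by default, whereas `len(Series.unique())` counts NaN as a value, so the two can differ on columns with missing entries.

```python
import pandas as pd

# Toy stand-in with the same column names as the claims table.
toy = pd.DataFrame({
    "bin": [417380, 999001, 417740, 999001, 417740],
    "drug": ["A", "A", "A", "B", "C"],
})

# Number of distinct values per column (NaN is excluded by default).
unique_counts = toy[["bin", "drug"]].nunique()

# Number of records per drug, most frequent first.
drug_counts = toy["drug"].value_counts()

print(unique_counts)
print(drug_counts)
```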


Now, we check that the only NaNs in the PA fields belong to claims that did not need a PA form (that is, claims the pharmacy approved). The check covers reject_code, pa_approved, correct_diagnosis, tried_and_failed, and contraindication.

In [8]:
#Count each group once; pa_count is set up front because the later cells rely on it
pa_count = cmm['dim_pa_id'].notna().sum()
rejected_count = (cmm['pharmacy_claim_approved']==0).sum()
print("The number of claims with a PA form is",pa_count,"and the number of claims",
      "that were rejected is",rejected_count,".")
if pa_count==rejected_count:
    print("These are the same, so we can continue.")
else:
    print("These differ, look back at claims that have claim not approved (pharmacy_claim_approved=0) but no corresponding PA id.")

The number of claims with a PA form is 555951 and the number of claims that were rejected is 555951 .
These are the same, so we can continue.
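
Equal counts do not by themselves guarantee that the PA forms sit on the rejected claims: the two groups could be the same size and still cover different rows. A stricter check compares the two conditions row by row. The sketch below uses an invented toy frame (the `toy` name and values are assumptions) in which the counts agree but the rows do not.

```python
import numpy as np
import pandas as pd

# Toy stand-in: two PA forms and two rejections, but on different rows.
toy = pd.DataFrame({
    "dim_pa_id": [1.0, np.nan, 2.0, np.nan],
    "pharmacy_claim_approved": [0, 1, 1, 0],
})

has_pa = toy["dim_pa_id"].notna()
rejected = toy["pharmacy_claim_approved"] == 0

# The count comparison passes here (2 == 2) ...
counts_match = has_pa.sum() == rejected.sum()

# ... but the row-by-row comparison fails, which is the check we actually want.
rows_match = bool((has_pa == rejected).all())

# Rows where exactly one of the two conditions holds.
mismatched = toy[has_pa != rejected]

print("Counts match:", counts_match)
print("Rows match:", rows_match)
print(mismatched)
```

Running `(cmm['dim_pa_id'].notna() == (cmm['pharmacy_claim_approved']==0)).all()` on the real data would confirm whether every rejected claim, and only a rejected claim, carries a PA id.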


In [9]:
if np.sum(cmm[cmm['dim_pa_id'].notna()]['reject_code'].notna())==pa_count:
    print("There are",np.sum(cmm[cmm['dim_pa_id'].notna()]['reject_code'].notna()),"claims with a PA form and a reject_code, so no reject codes are missing.")

There are 555951 claims with a PA form and a reject_code, so no reject codes are missing.


In [10]:
if np.sum(cmm[cmm['dim_pa_id'].notna()]['pa_approved'].notna())==pa_count:
    print("There are",np.sum(cmm[cmm['dim_pa_id'].notna()]['pa_approved'].notna()),"claims with a PA form and an approval flag, so no results of the form (approved/denied) are missing.")

There are 555951 claims with a PA form and an approval flag, so no results of the form (approved/denied) are missing.


In [11]:
if np.sum(cmm[cmm['dim_pa_id'].notna()]['correct_diagnosis'].notna())==pa_count:
    print("There are",np.sum(cmm[cmm['dim_pa_id'].notna()]['correct_diagnosis'].notna()),"claims with a PA form and a correct_diagnosis flag, so no information on correct diagnosis is missing.")

There are 555951 claims with a PA form and a correct_diagnosis flag, so no information on correct diagnosis is missing.


In [12]:
if np.sum(cmm[cmm['dim_pa_id'].notna()]['tried_and_failed'].notna())==pa_count:
    print("There are",np.sum(cmm[cmm['dim_pa_id'].notna()]['tried_and_failed'].notna()),"claims with a PA form and a tried_and_failed flag, so no information on if patients tried and failed the generic alternatives is missing.")

There are 555951 claims with a PA form and a tried_and_failed flag, so no information on if patients tried and failed the generic alternatives is missing.


In [13]:
if np.sum(cmm[cmm['dim_pa_id'].notna()]['contraindication'].notna())==pa_count:
    print("There are",np.sum(cmm[cmm['dim_pa_id'].notna()]['contraindication'].notna()),"claims with a PA form and a contraindication flag, so no information on if patients have a contraindication to the requested drug is missing.")

There are 555951 claims with a PA form and a contraindication flag, so no information on if patients have a contraindication to the requested drug is missing.
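
The five per-column checks above follow one pattern, so they could be collapsed into a single pass that also tests the reverse direction (no PA values on claims without a PA form). This is a sketch under assumptions: the `toy` frame and its values are invented, with one missing `contraindication` entry planted to show what a failure looks like.

```python
import numpy as np
import pandas as pd

# Toy stand-in: rows 0 and 2 have a PA form, row 1 does not.
toy = pd.DataFrame({
    "dim_pa_id": [1.0, np.nan, 2.0],
    "reject_code": [75.0, np.nan, 76.0],
    "pa_approved": [1.0, np.nan, 1.0],
    "correct_diagnosis": [1.0, np.nan, 1.0],
    "tried_and_failed": [1.0, np.nan, 0.0],
    "contraindication": [0.0, np.nan, np.nan],  # planted gap on a PA row
})

pa_columns = ["reject_code", "pa_approved", "correct_diagnosis",
              "tried_and_failed", "contraindication"]
has_pa = toy["dim_pa_id"].notna()

# Missing values among claims that DO have a PA form (should all be 0).
missing_per_column = toy.loc[has_pa, pa_columns].isna().sum()

# Filled values among claims that do NOT have a PA form (should all be 0).
stray_per_column = toy.loc[~has_pa, pa_columns].notna().sum()

print(missing_per_column)
print(stray_per_column)
```

Any nonzero entry in either result names the column that needs a closer look.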
