# Part 2: Data Wrangling

# Matching Provider Groups with Hospitals

## Purpose:
In Part 1 of this notebook series, we worked with the [United Healthcare Insurance Dataset](https://transparency-in-coverage.uhc.com/?_gl=1*5it7ok*_ga*NjMzOTkzMDA0LjE2NzI3OTc4MjA.*_ga_HZQWR2GYM4*MTY3Mjc5NzgyMC4xLjAuMTY3Mjc5NzgyMC4wLjAuMA). Specifically, we processed the in-network UHC JSON files from this dataset that were less than 100 MB in size (this threshold may vary depending on one's computer and internet connection). For each JSON file we created two tables, stored as CSV files: one contains provider group information and the other contains billing information. The two tables are linked by a reference number, and both are stored in a folder named after the JSON file they were extracted from.

The purpose of this notebook is to clean the provider group CSV files. Specifically, we will try to link a hospital to each provider group. This is important because the metrics we are going to be working with from the CMS database are all hospital metrics, not provider group metrics. I will provide more context as we journey through this notebook.

Let's first import some of the basic packages we will be using. All of these packages are available with !pip install or conda install.

In [9]:
import pandas as pd
import numpy as np
import os
from dotenv import load_dotenv
from ast import literal_eval
from collections import Counter

### Files to obtain

As stated above, we need to obtain the paths of the provider group CSV files we created with the prior notebook. The parent directory is locally stored in a dotenv file. You may replace the location as you see fit. The following code should print out the location of the first CSV provider group file if executed correctly.

In [10]:
load_dotenv()

hyperlink_path = 'json_completed_hyperlinks_update.csv'
parent_dir = os.getenv('dir')
data_dir = os.path.join(parent_dir,'data_update')

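# The hyperlink CSV has no header row, so name the columns on read.
df = pd.read_csv(hyperlink_path, header=None, names=['ParseID','Hyperlink'])
hyperlinks = df['Hyperlink'].tolist()

def foldername(hyperlink):
    # Keep the file name and drop its last 8 characters (e.g. a '.json.gz' suffix).
    hyperlink = hyperlink.split('/')[-1]
    return hyperlink[0:-8]
def providers_path(folder):
    # Each folder holds a '<folder>_providers.csv' file written by the Part 1 notebook.
    return os.path.join(data_dir,folder,folder+'_providers.csv')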

folder_names= [foldername(hyperlink) for hyperlink in hyperlinks]
provider_files = [providers_path(folder_name) for folder_name in folder_names]

print('The first provider file is located at:\n'+ provider_files[0])

The first provider file is located at:
D:/Vignesh/Capstone\data_update\2023-01-01_ALL-SAVERS-INSURANCE-COMPANY_Insurer_PPO---NDC_PPO-NDC_in-network-rates\2023-01-01_ALL-SAVERS-INSURANCE-COMPANY_Insurer_PPO---NDC_PPO-NDC_in-network-rates_providers.csv
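To make the folder naming concrete, here is a minimal sketch of what `foldername` does to a single hyperlink. The URL below is hypothetical and only used for illustration; the sketch assumes the hyperlinks end in `.json.gz`, which is exactly 8 characters. `foldername_by_suffix` is not part of the notebook, just an alternative that fails more visibly if a link has a different ending.

```python
def foldername(hyperlink):
    """Return the file name from a URL with its last 8 characters removed."""
    return hyperlink.split('/')[-1][0:-8]

def foldername_by_suffix(hyperlink, suffix='.json.gz'):
    """Hypothetical alternative: strip an explicit suffix instead of a fixed length."""
    name = hyperlink.split('/')[-1]
    if not name.endswith(suffix):
        raise ValueError(f'unexpected file ending in {name!r}')
    return name[:-len(suffix)]

# Hypothetical URL, used only for illustration.
example_url = 'https://example.com/files/2023-01-01_EXAMPLE-PLAN_in-network-rates.json.gz'

print(foldername(example_url))
print(foldername_by_suffix(example_url))
```

Both calls give the same folder name for this URL. The fixed-length slice is shorter, while the suffix check guards against silently truncating a file name that ends differently.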


### Let's read one:

Let us read one file to see how the files are formatted and become familiar with the values within our provider group CSV files.

In [11]:
df = pd.read_csv(provider_files[0], usecols=['tin','npi_provider_groups'], converters={'npi_provider_groups': literal_eval})
df.dtypes

tin                     int64
npi_provider_groups    object
dtype: object

As expected, <font color=blue>tin</font> is an integer and <font color=blue>npi_provider_groups</font> is an object; specifically, each value is a list of integers.

In [12]:
df.head()

Unnamed: 0,tin,npi_provider_groups
0,593582520,[1225090087]
1,272050459,[1639508567]
2,160743209,[1609314343]
3,561844651,"[1215134309, 1043649635, 1851721047, 1285715185]"
4,371756970,[1174710636]
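The `converters` argument is what turns the NPI column into real Python lists. As a small sketch (using an in-memory CSV with two rows adapted from the output above, not the actual file), compare reading the column with and without `literal_eval`:

```python
import io
from ast import literal_eval

import pandas as pd

# Two rows adapted from the output above, written the way the CSV stores them.
csv_text = (
    'tin,npi_provider_groups\n'
    '593582520,[1225090087]\n'
    '561844651,"[1215134309, 1043649635]"\n'
)

# Without a converter, each cell is just the text "[...]".
as_text = pd.read_csv(io.StringIO(csv_text))

# With literal_eval, each cell is parsed into a Python list of integers.
as_list = pd.read_csv(io.StringIO(csv_text),
                      converters={'npi_provider_groups': literal_eval})

print(type(as_text.loc[1, 'npi_provider_groups']))
print(type(as_list.loc[1, 'npi_provider_groups']))
print(as_list.loc[1, 'npi_provider_groups'])
```

Without the converter we would have to parse the strings ourselves before iterating over the NPIs in a group.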


### Combining Provider Group Files

Our provider files are currently separate, each one matched to billing information via reference values. Our goal in this notebook is to match provider groups to hospitals. A document provided by the CMS matches NPIs to their affiliated hospital, identified by a CCN value that is unique to each hospital. If we can combine our provider group files, work out which NPIs are associated with which hospitals, and merge the result back to our original provider group files, we should be able to match our CMS metrics in future notebooks. Since the provider files are relatively small, we will combine all of them into one dataframe with the following code. This may take several minutes to execute depending on how many files were parsed.

In [13]:
df = pd.concat((pd.read_csv(f, usecols=['tin','npi_provider_groups'], converters={'npi_provider_groups': literal_eval}) for f in provider_files), ignore_index=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17332460 entries, 0 to 17332459
Data columns (total 2 columns):
 #   Column               Dtype  
---  ------               -----  
 0   tin                  float64
 1   npi_provider_groups  object 
dtypes: float64(1), object(1)
memory usage: 264.5+ MB
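Note that `tin` is now `float64` even though the single file we read earlier gave `int64`. A likely explanation is that some files contain missing TINs, and pandas stores a numeric column with missing values as floats. The sketch below reproduces this with toy frames; converting to the nullable `Int64` dtype is shown only as an option, not something the notebook does.

```python
import numpy as np
import pandas as pd

part_a = pd.DataFrame({'tin': [593582520, 272050459]})  # no missing values -> int64
part_b = pd.DataFrame({'tin': [160743209, np.nan]})     # one missing value -> float64

combined = pd.concat([part_a, part_b], ignore_index=True)
print(combined['tin'].dtype)

# Optional: the nullable integer dtype keeps whole numbers and still allows missing values.
as_nullable = combined['tin'].astype('Int64')
print(as_nullable.dtype)
```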


Let's check to see if there are any NaN values.

In [14]:
df[df.isna().any(axis=1)]

Unnamed: 0,tin,npi_provider_groups
1993,,"[1487195756, 1942589122]"
21891,,[]
62098,,[]
101913,,[]
141728,,[]
...,...,...
17151887,,[]
17191702,,[]
17231517,,[]
17271332,,[]
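Many of the rows above pair a missing TIN with an empty NPI list. An empty list is not NaN, so `isna()` does not flag it on its own. A minimal sketch for counting both kinds of gaps, on a toy frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the three kinds of rows seen above.
toy = pd.DataFrame({
    'tin': [593582520, np.nan, np.nan],
    'npi_provider_groups': [[1225090087], [1487195756, 1942589122], []],
})

# Missing values per column (empty lists are not counted here).
missing_per_column = toy.isna().sum()

# Rows whose NPI list is empty.
empty_groups = int(toy['npi_provider_groups'].map(len).eq(0).sum())

print(missing_per_column)
print(empty_groups)
```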


A TIN is a tax identification number, and it gives us a way to organize provider groups independently of their reference numbers. This is useful because reference numbers are specific to the JSON file they were derived from and do not carry over to other JSON files, whereas TINs are universal.

For example, the first JSON file may have reference number 1 = NPI 11111 while the second JSON file has reference number 1 = NPI 22222. We do not want to treat these as equal. However, if the first JSON file also has reference number 3 = NPI 22222, that entry and reference number 1 in the second file describe the same group: their TINs should match, and they should carry the same list of NPIs. There are instances where an NPI will match to multiple TINs, for example when a physician works for a hospital and also has a private practice. For our purposes, we want to match each list of NPIs to a hospital. I will explain how we do this later in the notebook.

For now, let us drop NaN values in either column as there will be no way to figure out which hospital these groups belong to, or if they belong to any group at all. In addition, we will drop duplicates along TIN values, as they should have matching lists of NPIs.

In [15]:
df.dropna(inplace=True)
df.drop_duplicates(subset=['tin'], inplace=True, ignore_index=True)
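Dropping duplicates on `tin` keeps the first NPI list seen for each TIN, which relies on the assumption that repeated TINs carry the same list. One way to check that assumption before dropping is sketched below on toy data; the TIN and NPI values are made up for illustration.

```python
import pandas as pd

# Toy data: TIN 111 repeats with the same NPIs (in a different order),
# TIN 222 repeats with a different list.
toy = pd.DataFrame({
    'tin': [111, 111, 222, 222],
    'npi_provider_groups': [[1, 2], [2, 1], [3], [3, 4]],
})

# Lists are not hashable, so compare them as sorted tuples.
as_key = toy['npi_provider_groups'].map(lambda npis: tuple(sorted(npis)))

# Count how many distinct NPI lists each TIN has.
variants_per_tin = as_key.groupby(toy['tin']).nunique()
conflicting_tins = variants_per_tin[variants_per_tin > 1].index.tolist()

print(conflicting_tins)
```

If the list of conflicting TINs is empty on the real data, keeping the first row per TIN loses nothing. Otherwise those TINs may deserve a closer look, for example by taking the union of their NPI lists.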

Let's take a look at the data.

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46854 entries, 0 to 46853
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   tin                  46854 non-null  float64
 1   npi_provider_groups  46854 non-null  object 
dtypes: float64(1), object(1)
memory usage: 732.2+ KB


## Matching Providers with Hospitals

The following file is provided by the CMS and is publicly available: _Facility_Affiliation.csv_. It is a CSV table that matches each NPI with an associated hospital/facility. A CCN is a unique number that refers to a hospital/facility where an NPI either works or is contracted. Hospitals/facilities can have parent CCNs, which may be the case for large hospital systems. A CCN does not necessarily refer to a hospital; it can belong to any facility, such as a nursing home or pharmacy.

In [17]:
npi_path = 'Facility_Affiliation.csv'
npi = pd.read_csv(npi_path,usecols=['NPI','facility_afl_ccn','parent_ccn'], encoding='windows-1252')

In [21]:
npi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1576038 entries, 0 to 1576037
Data columns (total 3 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   NPI               1576038 non-null  int64  
 1   facility_afl_ccn  1576038 non-null  object 
 2   parent_ccn        5635 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 36.1+ MB


### Parent vs Facility CCN

Let's create a new column that records which CCN to use for each NPI. If parent_ccn is NaN, we use the facility_afl_ccn value. If a parent CCN is present, we use the parent facility's CCN instead. Since the CMS metrics we will be using come from large hospitals, a parent CCN is more likely to match than the CCN of a smaller facility. Below is a lambda function that is applied row by row to the dataframe to make this choice.

In [26]:
npi['ccn'] = npi.apply(lambda row: row['facility_afl_ccn'] if np.isnan(row['parent_ccn']) else str(int(row['parent_ccn'])), axis=1)

The CCN is cast as a string since it may contain letters.

In [27]:
npi=npi.astype({'ccn':'str'})
npi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1576038 entries, 0 to 1576037
Data columns (total 4 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   NPI               1576038 non-null  int64  
 1   facility_afl_ccn  1576038 non-null  object 
 2   parent_ccn        5635 non-null     float64
 3   ccn               1576038 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 48.1+ MB
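The row-wise `apply` above works, but the same choice can be expressed column-wise, which tends to be faster on 1.5 million rows. The sketch below is an alternative under stated assumptions, not the notebook's method. The toy values are made up, and the zero-padding to six characters assumes the downstream CMS tables store CCNs as six-character strings, which should be verified against those files before relying on it.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the CMS affiliation table (values are illustrative only).
toy = pd.DataFrame({
    'NPI': [1, 2],
    'facility_afl_ccn': ['29A007', '060006'],
    'parent_ccn': [np.nan, 30101.0],
})

# Convert parent CCNs from float to text, leaving missing values untouched.
# zfill(6) is an assumption: it restores a leading zero lost in the float column.
parent_as_str = toy['parent_ccn'].map(lambda v: str(int(v)).zfill(6),
                                      na_action='ignore')

# Use the parent CCN where present, otherwise fall back to the facility CCN.
toy['ccn'] = parent_as_str.fillna(toy['facility_afl_ccn'])

print(toy[['NPI', 'ccn']])
```

Reading the CCN columns as text in the first place (for example with `dtype=str` in `pd.read_csv`) would avoid losing leading zeros at all, at the cost of changing the `np.isnan` check used above.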


### Matching NPI groups with their hospital

We are now tasked with taking NPI groups and matching them with their affiliated hospital. As stated earlier, an individual NPI can be associated with many hospitals/facilities. There are multiple ways to match a list of NPIs to a hospital; the approach I use here is a majority vote. For each NPI in a group's list, collect the CCNs that the CMS dataframe lists for that NPI; combine these CCN lists for the NPI group/TIN; count which CCN appears most often; and assign that CCN to the TIN.

The function below does exactly that, and we apply it to our dataframe row by row. This may take several minutes.

In [28]:
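def find_npi(x):
    # Collect every CCN that the CMS table lists for any NPI in this group.
    ccns = []
    for npi_value in x['npi_provider_groups']:
        queried = npi[npi['NPI']==npi_value]
        ccns.extend(queried['ccn'].to_list())
    if ccns:
        # Majority vote: keep the CCN that appears most often.
        count = Counter(ccns)
        x['ccn'] = count.most_common(1)[0][0]
    else:
        # None of the NPIs in this group appear in the CMS table.
        x['ccn'] = None
    return x

# Apply row by row; each row is one TIN with its list of NPIs.
df = df.apply(find_npi, axis=1)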
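Most of the running time above goes into filtering the full CMS table once per NPI. A possible speed-up, sketched here on toy data, is to build a dictionary from NPI to its CCNs once and then do the majority vote with plain lookups. The names `ccn_lookup` and `majority_ccn` are hypothetical and not part of the notebook.

```python
from collections import Counter

import pandas as pd

# Toy stand-ins for the CMS table and the provider group table.
npi_toy = pd.DataFrame({'NPI': [1, 1, 2, 3],
                        'ccn': ['A', 'B', 'A', 'C']})
groups = pd.DataFrame({'tin': [10, 20, 30],
                       'npi_provider_groups': [[1, 2], [3], [99]]})

# Build the NPI -> list of CCNs mapping once.
ccn_lookup = npi_toy.groupby('NPI')['ccn'].agg(list).to_dict()

def majority_ccn(npi_list):
    """Return the most common CCN across the NPIs in a group, or None if none match."""
    counts = Counter(ccn for n in npi_list for ccn in ccn_lookup.get(n, []))
    return counts.most_common(1)[0][0] if counts else None

groups['ccn'] = groups['npi_provider_groups'].apply(majority_ccn)
print(groups)
```

In this toy example the first group collects A, B, A and is assigned A, the second is assigned C, and the third has no match. As with the original function, ties go to the CCN that was encountered first, since `Counter.most_common` orders equal counts by first occurrence.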

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46854 entries, 0 to 46853
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   tin                  46854 non-null  float64
 1   npi_provider_groups  46854 non-null  object 
 2   ccn                  3895 non-null   object 
dtypes: float64(1), object(2)
memory usage: 1.1+ MB


As one might notice, many NPI provider groups do not have an assigned CCN: only 3,895 of the 46,854 groups were matched. This is to be expected, as most NPIs within the United States are not associated with hospitals, and the CMS file only lists NPIs that are affiliated with a hospital/facility.

Let's store this data. I used to_parquet here (which needs a parquet engine such as pyarrow or fastparquet installed), but you can just as easily store this as a CSV file.

In [17]:
df.to_parquet('tin_to_ccn.parquet')

Let's drop the rows with a NaN CCN, that is, the NPI groups not associated with a hospital, and store the result as a separate file.

In [18]:
df.dropna(subset=['ccn'],inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3895 entries, 96 to 46853
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   tin                  3895 non-null   float64
 1   npi_provider_groups  3895 non-null   object 
 2   ccn                  3895 non-null   object 
dtypes: float64(1), object(2)
memory usage: 121.7+ KB


In [19]:
df.head()

Unnamed: 0,tin,npi_provider_groups,ccn
96,1356639000.0,[1356638811],297112
97,1285698000.0,[1285698381],30101
106,1508899000.0,[1508899253],290007
129,1932107000.0,[1932106853],30055
131,1215107000.0,[1215107347],60006


Let's store this data separately, as this is the form we will actually use.

In [20]:
df.to_parquet('tin_to_ccn_nonan.parquet')

## Conclusion

In this notebook we took all the NPI groups and, where possible, assigned them a hospital/facility from the CMS dataset. We created two parquet files, one that keeps the groups with no matched CCN (NaN values) and one without them. Our next step will be to match the billing information created in the prior notebook with the proper CCN from the parquet file we created.