# Part 1: Data Wrangling
# Extracting JSON Files

This project is split across a series of notebooks for convenience, and the data wrangling stage is itself split into multiple parts.

## Datasets

There are two datasets we are working with for this project. The first is the United Healthcare Insurance Claims dataset, which was released in December 2022. This dataset is divided into multiple files and will be the main source of files we work with in this notebook. Unfortunately for us, these are large JSON files that are not in a format we can use directly. In this notebook I hope to extract the negotiated rates that we need for all procedures available within this dataset.

The second dataset comes from the Centers for Medicare & Medicaid Services (CMS). It includes important metrics and identifiers for hospitals that the US government pays for services. We will explore this dataset further in other notebooks.

The focus of this notebook is to organize the United Healthcare Insurance Claims dataset into CSV files that we can explore further.

### United Healthcare Insurance Claims Dataset

As stated before, these are large JSON files. This data contains information about all claims made by United Healthcare Insurance clients. The files contain information about the procedures done (e.g., anything from drug purchases to surgery) and which health care provider performed the service. I will explain in more detail how these files are organized.


## Import packages

Here I import the packages that I need. All of them can be installed with conda or pip.

In [1]:
import requests # used to download files from the UHC website
import shutil
import ijson.backends.python as ijson # a streaming JSON parser (pure-Python backend)
import json
import gzip 
from tqdm.auto import tqdm
from csv import writer
import os
from io import StringIO
import pandas as pd
import numpy as np
from dotenv import load_dotenv

### Data Organization

To get started, it's important to note how the data is made available to us. As stated earlier, the data is available online at:

[United Healthcare Insurance Dataset](https://transparency-in-coverage.uhc.com/?_gl=1*5it7ok*_ga*NjMzOTkzMDA0LjE2NzI3OTc4MjA.*_ga_HZQWR2GYM4*MTY3Mjc5NzgyMC4xLjAuMTY3Mjc5NzgyMC4wLjAuMA)

You can get some information about how the data is organized from this git repository:

[Git Repository](https://github.com/CMSgov/price-transparency-guide)


In summary, some basic facts about the organization of the UHC dataset:

<img src="Claimsprocess2noterms.jpeg" alt="Drawing" style="width: 300px;"/>

Reference: Claimsprocess2noterms.jpeg. Blue Cross NC. (n.d.). Retrieved March 12, 2023, from https://www.bluecrossnc.com/file/claimsprocess2notermsjpeg

**I.  In-Network vs Out-of-Network**  
UHC has several files for download on their site. These files are organized into in-network and out-of-network files. Since we are interested in negotiated rates between hospitals and insurance companies, we are only going to work with the in-network datasets. UHC would not negotiate with providers for out-of-network rates, and these rates are most likely standardized by the provider (e.g., any customer who buys a drug without insurance from a local drugstore pays a price set by the drugstore, not a negotiated rate). These JSON files are stored on an Azure server and need to be downloaded and parsed individually. Since there are hundreds of these files, I created an Excel spreadsheet called <font color=green>_json_files_hyperlinks.xlsx_</font> with hyperlinks to each of the JSON files we are interested in.  

**II.  Insured Groups**  
The JSON files are separated by insured group. For example, say you work for company X, and company X insures all of its employees through United Healthcare. There would be a specific JSON file for company X in this Azure database. However, some insured groups are exceptionally large (think of a large company like Amazon or Google) and are therefore split across multiple JSON files.

**III. Negotiated Rates**  
Negotiated rates are often repeated in these files because different insured groups purchase the same procedure from the same provider at a negotiated rate; however, different insurance plans may exist. For example, Mr. Smith and Mrs. Nancy both work for company X and both live in the same area. They may both purchase insulin from the same drugstore through their United Healthcare plan. If they have the same plan through their employer, the negotiated price would most likely be the same; however, if Mr. Smith purchased a plan with a higher level of coverage through United Healthcare, his negotiated price may be lower.

**IV. JSON Files**  
Each JSON file is divided into two sections that refer to each other:
1. *Provider groups*  
These refer to hospitals, pharmacies, or private practice physicians:
    - Reference Number: A number used to refer to a group of providers (see the next sub-bullet). This number is specific to one JSON file and does not carry over to another JSON file.
    - NPI Provider Groups: An array of National Provider Identifiers (NPIs). NPIs are unique identification numbers for covered health care providers. Each NPI refers to a provider and can be looked up through the CMS database. Sometimes one provider has several NPIs associated with them. For example, a doctor who practices at two hospitals may have two separate identifiers.
2. *Billing Information*  
    - Billing Type and Billing Value: In the US there are several standardized ways to bill a service/procedure/drug. The billing type refers to one of these standardized methods, and the value identifies a specific service/procedure/drug within it. Examples of standardized billing types include CPT, HCPCS and MS-DRG. For example, code type CPT (Current Procedural Terminology, created by the American Medical Association) with code value 36415 refers to a Routine Venipuncture, which can be looked up in the American Medical Association codebook. These different billing methods exist because sometimes an insurance company may pay for a drug individually, in which case they would use the CPT code for that drug, or they may pay a bulk sum for a surgery that includes several items (drugs, labor, hospital stay, etc.), where they would use an MS-DRG or HCPCS code, which refers more to procedures. In addition, providers may be more comfortable with one billing method than another. This is complex, but the important point is that there are different billing methods and the billing type says which one is in use.
    - Reference Numbers: The provider groups that provided this service to this insured group.
    - Negotiated Rates: The negotiated rate with the provider group.

_Billing code types and values make up the majority of these files and are essentially a transaction history for United Healthcare for each company it covers._
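
To make the two sections concrete, here is a minimal sketch of the layout as a Python dictionary. The key names are the ones the parser later in this notebook matches on; every value (NPIs, TIN, rate, reference number) is invented for illustration, and real files carry additional fields that are left out here.

```python
# A hand-written miniature of one in-network file; all values are invented.
sample = {
    "provider_references": [
        {
            "provider_group_id": 7,  # reference number, local to this file
            "provider_groups": [
                {"npi": [1111111111, 2222222222], "tin": {"value": "00-0000000"}}
            ],
        }
    ],
    "in_network": [
        {
            "billing_code_type": "CPT",
            "billing_code": "36415",
            "negotiated_rates": [
                {
                    "provider_references": [7],
                    "negotiated_prices": [{"negotiated_rate": 3.0}],
                }
            ],
        }
    ],
}

# Index the provider groups by their file-local reference number.
npis_by_reference = {
    ref["provider_group_id"]: [
        npi for group in ref["provider_groups"] for npi in group["npi"]
    ]
    for ref in sample["provider_references"]
}

# Resolve each negotiated rate back to the NPIs it applies to.
resolved = []
for item in sample["in_network"]:
    for negotiated in item["negotiated_rates"]:
        for reference in negotiated["provider_references"]:
            for price in negotiated["negotiated_prices"]:
                resolved.append(
                    (
                        item["billing_code_type"],
                        item["billing_code"],
                        npis_by_reference[reference],
                        price["negotiated_rate"],
                    )
                )

print(resolved)
```

The reference number is the only link between the two sections, which is why it is stored in both of the CSV files produced later.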
  

I will break this down further as we move through this notebook. For now, I am using dotenv to set the directory where I read and store files.

In [2]:
# Define paths to important files and places where you want to store files
load_dotenv()

# This is the file with the hyperlinks of json files we need to parse
hyperlink_path = 'json_files_hyperlinks.xlsx'
hyperlinks = pd.read_excel(hyperlink_path)['Hyperlinks'].tolist()

# I intend to store the downloaded JSON files and create folders for the parsed data under one parent directory. These are large files, so this is done on an external hard drive.
parent_dir = os.getenv('dir')
dir_json = os.path.join(parent_dir, 'JSON')
dir_data = os.path.join(parent_dir, 'data_update')
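
If the `dir` variable is missing from the `.env` file, `os.getenv` returns `None` and `os.path.join` fails with a less obvious error. Below is a minimal sketch of a guard around the same lookup. The helper name `resolve_dirs` is hypothetical, and a plain dictionary stands in for the environment so the example runs without a `.env` file.

```python
import os


def resolve_dirs(env=os.environ, var="dir"):
    """Return (json_dir, data_dir) under the parent directory named by `var`.

    `resolve_dirs` is an illustrative helper, not part of the notebook.
    """
    parent = env.get(var)
    if not parent:
        raise RuntimeError(
            f"Environment variable {var!r} is not set; add it to your .env file"
        )
    return os.path.join(parent, "JSON"), os.path.join(parent, "data_update")


# Demo with a stand-in environment; the path is a placeholder.
json_dir, data_dir = resolve_dirs({"dir": "/mnt/external"})
print(json_dir, data_dir)
```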

## Downloading and Parsing JSON files

The following sections of code are used to parse the dataset. I have predefined some functions, not all of which I explicitly use in this notebook, but they can be useful for your own work with this dataset depending on how many files you want to download and parse.

### Downloading JSON files
  

Here is a list of functions for downloading files from the [United Healthcare Insurance Dataset](https://transparency-in-coverage.uhc.com/?_gl=1*5it7ok*_ga*NjMzOTkzMDA0LjE2NzI3OTc4MjA.*_ga_HZQWR2GYM4*MTY3Mjc5NzgyMC4xLjAuMTY3Mjc5NzgyMC4wLjAuMA). Again, we will only be working with in-network files.

In [None]:
# This downloads one file given a URL and the path where it should be stored. It returns the file name and the location it was downloaded to.
def download_file(url, path):
    local_filename = url.split('/')[-1]
    download_path = os.path.join(path, local_filename)
    if os.path.exists(download_path):
        print(download_path + '\nFile already Exists!')
        return (local_filename, download_path)
    with requests.get(url, stream=True) as r:
        # Fail early on HTTP errors instead of saving an error page to disk
        r.raise_for_status()
        # Content-Length can be missing; tqdm accepts total=None in that case
        content_length = r.headers.get('content-length')
        total_length = int(content_length) if content_length else None
        with tqdm.wrapattr(r.raw, "read", total=total_length, desc="") as raw:
            with open(download_path, 'wb') as output:
                shutil.copyfileobj(raw, output)
    print(local_filename + '\nDownload Complete')
    return (local_filename, download_path)

# This downloads several files given a list of URLs and the path where they should be stored. It returns the file names and the locations they were downloaded to.
def download_multiple_files(urls, path):
    local_filenames = []
    download_paths = []
    for url in urls:
        # download_file already skips files that exist on disk
        local_filename, download_path = download_file(url, path)
        local_filenames.append(local_filename)
        download_paths.append(download_path)
    return (local_filenames, download_paths)

# This checks the file size before the file is downloaded. This is useful if you want to avoid large files.
def check_file_size(url):
    with requests.get(url, stream=True) as r:
        return int(r.headers.get('content-length'))


# This deletes a file selected from a path. If you want to work with one file at a time, this is handy.
def delete_file(path):
    os.remove(path)
    print(path + '\nFile Deleted')
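
One thing to watch for: `int(r.headers.get('content-length'))` raises a `TypeError` if a server omits the header. Below is a minimal sketch of a more defensive size check. The helper names `content_length` and `is_small_enough` are hypothetical, and a plain dictionary with lowercase keys stands in for the response headers (the real `requests` header object is case-insensitive, a plain dictionary is not).

```python
def content_length(headers):
    """Return Content-Length as an int, or None when absent or malformed."""
    raw = headers.get("content-length")
    try:
        return int(raw)
    except (TypeError, ValueError):
        return None


def is_small_enough(headers, limit):
    """Treat files of unknown size as too large, so they are skipped, not crashed on."""
    size = content_length(headers)
    return size is not None and size < limit


# Demo with stand-in headers
print(is_small_enough({"content-length": "1024"}, 100_000_000))
print(is_small_enough({}, 100_000_000))
```

Treating an unknown size as "too large" is a design choice: the file ends up in the skipped list and can be revisited later.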

### Folder Path
This function creates a folder, named after the downloaded file, to hold the CSV files we produce from one downloaded JSON file after it is parsed.

In [None]:
def make_paths_folders(filename, json_file):
    # Drop the 8-character '.json.gz' suffix to get the folder name
    folder_name = filename[0:-8]
    path = os.path.join(dir_data, folder_name)
    if not os.path.exists(path):
        os.mkdir(path)
    else:
        print(path + '\nFolder Already Exists')
    return path
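
As a quick illustration of the naming, here is how the folder name falls out of one of the hyperlinks used in this notebook. The `[0:-8]` slice relies on every file ending in `.json.gz`; the last line is a hedged alternative that checks the suffix instead of assuming it.

```python
url = (
    "https://uhc-tic-mrf.azureedge.net/public-mrf/2023-01-01/"
    "2023-01-01_UnitedHealthcare-Insurance-Company_Insurer_D0015336_UHC-Dental"
    "_in-network-rates.json.gz"
)

# Same logic as download_file: the file name is the last URL segment
local_filename = url.split("/")[-1]

# Same logic as make_paths_folders: strip the 8 characters of '.json.gz'
folder_name = local_filename[0:-8]

# Alternative that only strips the suffix when it is actually present
suffix = ".json.gz"
folder_name_checked = (
    local_filename[: -len(suffix)] if local_filename.endswith(suffix) else local_filename
)

print(folder_name)
```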

### CSVs for completed/skipped files


The following code keeps track of which files we have parsed. Since this is a large dataset, it might be worth your time to split up the work as the data is parsed. Some of the files might be too large, in which case skipping them may be worth it. I found it useful to create separate CSVs for the files I skipped and the files I completed, in case I wanted to go back and parse some of the larger files.

In [None]:
# If the file was too large, it is recorded in the large-file CSV
def write_large_file(filename, index, hyperlink):
    # newline='' stops the csv module from writing blank rows on Windows
    with open(filename, 'a', newline='') as f:
        writer_object = writer(f)
        writer_object.writerow([index, hyperlink])
    print('File Too Large, written to Large File CSV')

# If the file was parsed, it is recorded in a separate completed-file CSV
def write_completed_file(filename, index, hyperlink):
    with open(filename, 'a', newline='') as f:
        writer_object = writer(f)
        writer_object.writerow([index, hyperlink])
    print('File has been completed')
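
These tracking CSVs also make it possible to resume a run without redoing finished files. Here is a minimal sketch, assuming the rows have the `[index, hyperlink]` shape written above with no header row. The helper name `completed_indices` is hypothetical, and the demo uses a temporary file with placeholder URLs.

```python
import csv
import os
import tempfile


def completed_indices(csv_path):
    """Return the set of hyperlink indices already recorded in a tracking CSV."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="") as f:
        # Each row is [index, hyperlink]; skip blank rows defensively
        return {int(row[0]) for row in csv.reader(f) if row}


# Demo with a throwaway file and placeholder URLs
with tempfile.TemporaryDirectory() as tmp:
    tracking_csv = os.path.join(tmp, "completed.csv")
    with open(tracking_csv, "a", newline="") as f:
        csv.writer(f).writerows(
            [
                [2949, "https://example.com/a.json.gz"],
                [2951, "https://example.com/c.json.gz"],
            ]
        )
    done = completed_indices(tracking_csv)
    remaining = [i for i in range(2949, 2953) if i not in done]

print(remaining)
```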

### Storing information

The following functions store the parsed information. They take in lists of values and write them to CSV files. I decided to create two separate CSVs to keep the provider group information apart from the billing information, since they are separate tables. However, since the reference number links the two tables together, I usually store these files in the same location with similar names.

In [3]:
def write_provider_csv(filename, reference, tin, npi_provider_groups):
    # newline='' stops the csv module from writing blank rows on Windows
    with open(filename, 'a', newline='') as f:
        writer_object = writer(f)
        # One row per provider reference; the three lists are aligned by position
        for i, r in enumerate(reference):
            writer_object.writerow([r, tin[i], npi_provider_groups[i]])


def write_rates_csv(filename, billing_type, billing_code, provider_reference, rate):
    with open(filename, 'a', newline='') as f:
        writer_object = writer(f)
        # One row per negotiated rate; the four lists are aligned by position
        for i, ref in enumerate(provider_reference):
            writer_object.writerow([billing_type[i], billing_code[i], ref, rate[i]])
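
One consequence of this layout is that each NPI list lands in a single CSV cell as its string form, so it needs converting back when the file is read later. Below is a small sketch of that round trip, assuming the NPIs were parsed as plain integers; the row values are invented, and an in-memory buffer stands in for the file.

```python
import ast
import csv
import io

# Write one provider row the same way write_provider_csv does
buffer = io.StringIO()
csv.writer(buffer).writerow([7, "00-0000000", [1111111111, 2222222222]])

# Read it back: every cell is now a string, including the list
buffer.seek(0)
row = next(csv.reader(buffer))

# literal_eval safely turns the "[..., ...]" text back into a list of ints
npis = ast.literal_eval(row[2])

print(row)
print(npis)
```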

### Parsing JSON File

The following code is responsible for parsing the JSON file. It streams through the JSON one parser event at a time, matches the values we are interested in, and stores them. The parse_file function contains two main sections: provider groups and billing information. As stated above, the files are divided into these two sections and we are interested in both. This function uses some of the predefined functions from above.

_There are additional values in these files not explained in the summary section. For example, 'tin' refers to a business's tax identification number and is difficult to look up. Storing this value may come in handy, since it is a unique identifier for an NPI Provider Group. Other values were ignored._
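
The parser below matches on `(prefix, event, value)` triples. To show what those triples look like, here is a simplified stand-in that produces the same style of events from an already-loaded object. It is an imitation for illustration only (the function name `iter_events` is hypothetical): it uses the standard `json` module, does not stream, and is not ijson itself. The sample values are invented.

```python
import json


def iter_events(node, prefix=""):
    """Yield (prefix, event, value) triples in the style of ijson.parse."""
    if isinstance(node, dict):
        yield prefix, "start_map", None
        for key, child in node.items():
            yield prefix, "map_key", key
            yield from iter_events(child, f"{prefix}.{key}" if prefix else key)
        yield prefix, "end_map", None
    elif isinstance(node, list):
        yield prefix, "start_array", None
        for child in node:
            # Array elements share the same '.item' prefix
            yield from iter_events(child, f"{prefix}.item" if prefix else "item")
        yield prefix, "end_array", None
    elif isinstance(node, bool):  # checked before int, since bool is an int subclass
        yield prefix, "boolean", node
    elif isinstance(node, (int, float)):
        yield prefix, "number", node
    elif isinstance(node, str):
        yield prefix, "string", node
    else:
        yield prefix, "null", None


doc = json.loads(
    '{"provider_references": [{"provider_groups": [{"npi": [1111111111]}],'
    ' "provider_group_id": 7}]}'
)
events = list(iter_events(doc))
for event in events:
    print(event)
```

Because every array element gets the same `.item` prefix, the parser cannot tell elements apart by prefix alone, which is why it relies on the order in which events arrive.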

In [None]:
def parse_file(filename, json_file):
    # creates a folder to store the two CSV files we intend to create
    path = make_paths_folders(filename, json_file)

    # initializes the two CSV files we intend to write by writing their header rows
    providers_csv = os.path.join(path, filename[0:-8]) + '_providers.csv'
    rates_csv = os.path.join(path, filename[0:-8]) + '_rates.csv'
    write_provider_csv(providers_csv, ['provider_reference'], ['tin'], ['npi_provider_groups'])
    write_rates_csv(rates_csv, ['billing_type'], ['billing_code'], ['provider_reference'], ['negotiated_rates'])

    # initializing lists to store Provider Group values
    npi_provider_groups = []
    tin = []
    reference = []
    
    # initializing lists to store Billing Information values
    billing_type = []
    billing_code = []
    ref_group = []
    rates = []
    
    # opens the files to parse and parses the values
    with gzip.open(json_file, mode="rt") as f:
        parser = ijson.parse(f)

        for prefix, event, value in tqdm(parser):
            # if you have limited RAM, this ensures that your lists do not get too long before you write them to your hard drive
            if len(npi_provider_groups) >= 1000:
                write_provider_csv(providers_csv, reference, tin, npi_provider_groups)
                npi_provider_groups = []
                tin = []
                reference = []
            if len(rates) >= 10000:
                write_rates_csv(rates_csv, billing_type, billing_code, ref_group, rates)
                billing_type = []
                billing_code = []
                ref_group = []
                rates = []

            # This is for parsing relevant provider group information
            if prefix == 'provider_references.item.provider_groups.item.npi' and event == 'start_array' and value is None:
                temp_npi = []
            elif prefix =='provider_references.item.provider_groups.item.npi.item' and event =='number':
                temp_npi.append(value)
            elif prefix =='provider_references.item.provider_groups.item.tin.value' and event =='string':
                temp_tin =value
            elif prefix =='provider_references.item.provider_group_id' and event =='number':
                npi_provider_groups.append(temp_npi)
                tin.append(temp_tin)
                reference.append(value)
            # This informs us we have reached the end of the provider group information
            elif prefix =='provider_references' and event =='end_array':
                write_provider_csv(providers_csv, reference, tin, npi_provider_groups)
                npi_provider_groups = []
                tin = []
                reference = []
            # This is for storing the billing information values
            elif prefix =='in_network.item.billing_code_type' and event =='string':
                temp_type = value
            elif prefix =='in_network.item.billing_code' and event =='string':
                temp_code = value
            elif prefix =='in_network.item.negotiated_rates.item.provider_references.item' and event =='number':
                temp_ref = value
            elif prefix =='in_network.item.negotiated_rates.item.negotiated_prices.item.negotiated_rate' and event =='number':
                billing_type.append(temp_type)
                billing_code.append(temp_code)
                ref_group.append(temp_ref)
                rates.append(value)
            # This informs us we have reached the end of the billing information values
            elif prefix =='in_network' and event =='end_array':
                write_rates_csv(rates_csv, billing_type, billing_code, ref_group, rates)
                billing_type = []
                billing_code = []
                ref_group = []
                rates = []
    print(json_file + '\nParse Complete')    
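
The buffering idea in `parse_file` (collect rows in memory, write them out once a threshold is reached, then clear the lists) can be isolated into a small helper. This is a sketch of the pattern rather than the code used above: the class name `BufferedCsvWriter` is hypothetical, and unlike the notebook it opens the file in write mode for the header, so rerunning it replaces the file instead of appending a second header.

```python
import csv
import os
import tempfile


class BufferedCsvWriter:
    """Collect rows in memory and append them to a CSV in batches."""

    def __init__(self, path, header, flush_every=10_000):
        self.path = path
        self.flush_every = flush_every
        self.rows = []
        # Start a fresh file containing only the header row
        with open(path, "w", newline="") as f:
            csv.writer(f).writerow(header)

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.flush_every:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerows(self.rows)
        self.rows = []


# Demo with a tiny threshold and invented rows
header = ["billing_type", "billing_code", "provider_reference", "negotiated_rates"]
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "rates.csv")
    rates_writer = BufferedCsvWriter(out, header, flush_every=2)
    for ref in (1, 2, 3):
        rates_writer.add(["CPT", "36415", ref, 3.0])
    rates_writer.flush()  # write whatever is left at the end
    with open(out, newline="") as f:
        written = list(csv.reader(f))

print(len(written))
```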

### Combination functions

These functions combine several of the functions above.

In [None]:
def download_parse(url, path):
    (filename, json_file) = download_file(url, path)
    parse_file(filename, json_file)


def download_parse_delete(url, path):
    (filename, json_file) = download_file(url, path)
    parse_file(filename, json_file)
    delete_file(json_file)
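
If parsing fails partway through, `download_parse_delete` never reaches the delete step and the large file stays on disk. One possible variation, sketched below, uses `try`/`finally` so the cleanup always runs. The function name `download_parse_delete_safely` is hypothetical, the three steps are passed in as arguments so the example runs without a network, and the stubs are stand-ins for `download_file`, `parse_file`, and `delete_file`.

```python
def download_parse_delete_safely(url, path, download, parse, delete):
    """Download, parse, and always delete the downloaded file afterwards."""
    filename, json_file = download(url, path)
    try:
        parse(filename, json_file)
    finally:
        # Runs whether parse succeeded or raised
        delete(json_file)


# Demo with stubs that record the order of calls
calls = []


def fake_download(url, path):
    calls.append("download")
    return ("file.json.gz", path + "/file.json.gz")


def failing_parse(filename, json_file):
    calls.append("parse")
    raise ValueError("simulated parse failure")


def fake_delete(json_file):
    calls.append("delete")


try:
    download_parse_delete_safely(
        "https://example.com/file.json.gz", "/tmp", fake_download, failing_parse, fake_delete
    )
except ValueError:
    calls.append("error surfaced")

print(calls)
```

Whether to delete after a failed parse is a trade-off: it frees disk space, but the file has to be downloaded again before retrying.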

## Parsing

Here you can note the file you want to come back to later, since the following steps may take a significant amount of time.

In [16]:
hyperlinks[2949]

'https://uhc-tic-mrf.azureedge.net/public-mrf/2023-01-01/2023-01-01_UnitedHealthcare-Insurance-Company_Insurer_D0015336_UHC-Dental_in-network-rates.json.gz'

### Running the Parse

Feel free to change the following variables:
- <font color=blue>start</font>: variable to determine where you want to start your current run.
- <font color=blue>number_to_attempt_to_parse</font>: variable to determine how many files you attempt to parse for this current run. 
- <font color=blue>file_size_max</font>: determines which file sizes you are comfortable parsing. This can be affected by your CPU speed, RAM, and internet connection.

In [None]:
number_to_attempt_to_parse = 1000
start = 2949
file_size_max = 100_000_000

large_files = 'json_large_hyperlinks_update.csv'
completed = 'json_completed_hyperlinks_update.csv'

# Stop at the end of the hyperlink list so the loop cannot index past it
for i in range(start, min(start + number_to_attempt_to_parse, len(hyperlinks))):
    if check_file_size(hyperlinks[i]) < file_size_max: 
        print('Hyperlink File: ' + str(i) + ' Started!')
        download_parse_delete(hyperlinks[i], dir_json)
        write_completed_file(completed, i, hyperlinks[i])
    else:
        write_large_file(large_files,i,hyperlinks[i])
        print('File number: ' + str(i))


## Conclusion:

We now have CSV files of Provider Groups and Billing Information. Our next objective is to combine these files in a way that makes sense and determine which NPI provider groups are associated with a certain hospital.

Some compromises we made in this notebook include skipping large files and not being able to parse all the data. One way to make this notebook better would be to parse a random sample of files rather than a consecutive range.
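
As a rough sketch of that idea, the indices to parse could be drawn at random instead of taken in order. The helper name `sample_indices` and the numbers in the demo are hypothetical; a fixed seed keeps the selection repeatable between runs.

```python
import random


def sample_indices(total, k, seed=0):
    """Pick up to k distinct indices from range(total), reproducibly."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(total), min(k, total)))


# Demo: choose 5 files out of a hypothetical list of 3000 hyperlinks
picked = sample_indices(3000, 5)
print(picked)
```

The resulting list could replace the consecutive `range` in the parsing loop, with the file-size check applied to each sampled index as before.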