# Uncovering Healthcare Inefficiencies - Data Retrieval via API

This notebook is dedicated to retrieving data from the Centers for Medicare & Medicaid Services (CMS) via their API. The specific dataset of interest is the [Market Saturation and Utilization Data](https://data.cms.gov/summary-statistics-on-use-and-payments/program-integrity-market-saturation-by-type-of-service/market-saturation-utilization-state-county). 

The goal is to retrieve the complete dataset, which is derived from Medicare Fee-for-Service (FFS) claims in the CMS Integrated Data Repository (IDR). The dataset is updated quarterly, and each release covers a 12-month reference period, giving a view of market saturation and utilization at the state and county levels.

---

## Import libraries

In [2]:
# import necessary libraries
import os
import requests
import json 
import pandas as pd
from datetime import datetime
from tqdm import tqdm
import time
import random
import openpyxl
from requests.exceptions import SSLError

import warnings
warnings.filterwarnings("ignore")  # suppress warnings

## Define API Endpoint and Import Data

In [26]:
# define base URL for the API
base_url = "https://data.cms.gov/data-api/v1/dataset/8900b9c5-50b7-43de-9bdd-0d7113a8355e/data"

# send the GET request to the API
response = requests.get(base_url)

# check if request was successful
if response.status_code == 200:
    # parse the JSON data from the response
    data = response.json()
    # print data to check response
    print(data)
else:
    # handle the error
    print(f"Error: {response.status_code}")

[{'reference_period': '2019-01-01 to 2019-12-31', 'type_of_service': 'Ambulance (Emergency & Non-Emergency)', 'aggregation_level': 'NATION + TERRITORIES', 'state': '--ALL--', 'county': '--ALL--', 'state_fips': ' ', 'county_fips': ' ', 'number_of_fee_for_service_beneficiaries': '36,122,263', 'number_of_providers': '8,814', 'average_number_of_users_per_provider': '495.69', 'percentage_of_users_out_of_ffs_beneficiaries': '12.09%', 'number_of_users': '4,368,976', 'average_number_of_providers_per_county': '39.85', 'number_of_dual_eligible_users': '1,348,463', 'percentage_of_dual_eligible_users_out_of_total_users': '30.86%', 'percentage_of_dual_eligible_users_out_of_dual_eligible_ffs_beneficiaries': '20.52%', 'total_payment': '$4,037,494,106.32', 'moratorium': ' ', 'number_of_fee_for_service_beneficiaries_dual_color': ' ', 'number_of_fee_for_service_beneficiaries_description': ' ', 'number_of_providers_dual_color': ' ', 'number_of_providers_description': ' ', 'average_number_of_users_per_pro

### Pagination 

Given the large size of the dataset, we implement pagination to retrieve data in manageable chunks. We define a page size and initialize an offset. A function is created to fetch data for each page with error handling for SSL errors and retries. We use a progress bar to monitor the fetching process and introduce random delays to avoid overwhelming the server.
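The offset-based pattern described above can be sketched without touching the network. This is a minimal illustration, not the notebook's actual loop: `paginate`, `fake_fetch`, and `fake_rows` are hypothetical names, and the in-memory list stands in for the CMS API.

```python
from typing import Callable, Iterator, List, Optional


def paginate(
    fetch_page: Callable[[int, int], Optional[List[dict]]],
    page_size: int,
) -> Iterator[List[dict]]:
    """Yield pages until a short, empty, or failed (None) page is returned."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:  # None (failed request) or empty list (no more rows)
            return
        yield page
        if len(page) < page_size:  # a short page is the last one
            return
        offset += page_size


# Hypothetical stand-in for the API: 2,500 fake records held in memory.
fake_rows = [{"id": i} for i in range(2500)]


def fake_fetch(offset: int, size: int) -> List[dict]:
    """Return the slice of fake_rows that a real API page would contain."""
    return fake_rows[offset:offset + size]


page_lengths = [len(page) for page in paginate(fake_fetch, 1000)]
all_rows = [row for page in paginate(fake_fetch, 1000) for row in page]
print(page_lengths, len(all_rows))
```

Passing the fetch function in as an argument keeps the stopping logic separate from the HTTP call, which makes the loop easy to test before running a multi-hour download.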

In [27]:
# set the size of each page
page_size = 1000  # number of records per page 

# initialize a list to store the data
all_data = []

# function to get data from a single page
def get_page_data(url, offset, page_size, retries=3):
    attempt = 0
    while attempt < retries:
        try:
            response = requests.get(url, params={"offset": offset, "size": page_size})
            if response.status_code == 200:
                return response.json()
            else:
                print(f"Error: {response.status_code}")
                return None
        except SSLError as e:
            print(f"SSL Error: {e}. Retrying...")
            attempt += 1
            time.sleep(5)  # Wait before retrying
    print(f"Failed to fetch data after {retries} attempts.")
    return None

# initialize offset
offset = 0

# get the initial page of data
data = get_page_data(base_url, offset, page_size)

# the total number of records is not known up front (the length of the first
# page only reflects page_size), so the progress bar counts pages as they arrive
with tqdm(desc="Fetching Data", unit="page") as pbar:
    # check if data is returned
    while data:
        all_data.extend(data)  # add data to the list
        # if number of records returned is less than the page size, we're done
        if len(data) < page_size:
            break
        # update offset for the next page
        offset += page_size
        
        # add a random delay to avoid overwhelming the server
        time.sleep(5 + 10 * random.random())
        
        # get the next page of data
        try:
            data = get_page_data(base_url, offset, page_size)
            pbar.update(1)  # Update progress bar
        except KeyboardInterrupt:
            print("Data fetching interrupted.")
            break

# create DataFrame from the collected data
df = pd.DataFrame(all_data)


Fetching Data: 1044page [3:44:38, 12.91s/page]                


Data saved to cms_data.xlsx
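`get_page_data` waits a fixed 5 seconds between SSL retries. One possible variation is exponential backoff with jitter, sketched below under stated assumptions: `retry_with_backoff` and `flaky` are hypothetical names, and the `sleep` parameter is injected only so the example runs instantly.

```python
import random
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call func up to `retries` times, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return func()
        except exceptions as exc:
            if attempt == retries - 1:
                break  # no attempts left, give up
            # wait base_delay * 2^attempt plus random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
    return None


# Hypothetical flaky call: fails twice, then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []  # collects the delays instead of actually sleeping
result = retry_with_backoff(
    flaky, retries=3, exceptions=(ConnectionError,), sleep=waits.append
)
print(result, len(waits))
```

In the notebook, `func` could be a small wrapper around the `requests.get` call and `exceptions` could be `(SSLError,)`. That wiring is an assumption and is not shown here.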


### Save data to CSV and Excel files

To preserve our fetched data, we save it to both CSV and Excel formats. We first create a directory named `data` if it does not already exist. We then save the DataFrame to a CSV file and an Excel file. 

In [32]:
# create directory to save data in new folder 'data'
output_dir = 'data'
os.makedirs(output_dir, exist_ok=True)

# save df to CSV file
output_file = 'data/cms_data.csv'
df.to_csv(output_file, index=False)

In [34]:
# save df to excel file 
output_file = "data/cms_data.xlsx"
df.to_excel(output_file, index=False)
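An `.xlsx` worksheet holds at most 1,048,576 rows and 16,384 columns, and this dataset is close to the row limit. The check below is a small sketch of how the size could be verified before writing; `fits_in_excel_sheet` is a hypothetical helper, and the row and column counts are taken from the `df.info()` output in this notebook.

```python
EXCEL_MAX_ROWS = 1_048_576  # rows per worksheet in the .xlsx format
EXCEL_MAX_COLS = 16_384     # columns per worksheet in the .xlsx format


def fits_in_excel_sheet(n_rows: int, n_cols: int, header: bool = True) -> bool:
    """Return True if an n_rows x n_cols table fits on a single worksheet."""
    needed_rows = n_rows + (1 if header else 0)  # the header occupies one row
    return needed_rows <= EXCEL_MAX_ROWS and n_cols <= EXCEL_MAX_COLS


# 1,044,711 rows and 48 columns, as reported by df.info()
print(fits_in_excel_sheet(1_044_711, 48))
```

With a DataFrame in hand, the call would be `fits_in_excel_sheet(len(df), df.shape[1])`. If a later quarterly release pushes the row count past the limit, the CSV file remains the complete copy.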

#### Data Summary

In [28]:
df.sample(20) #ensure data was imported and saved correctly 

Unnamed: 0,reference_period,type_of_service,aggregation_level,state,county,state_fips,county_fips,number_of_fee_for_service_beneficiaries,number_of_providers,average_number_of_users_per_provider,...,number_of_fee_for_service_beneficiaries_change,number_of_providers_change,average_number_of_users_per_provider_change,percentage_of_users_out_of_ffs_beneficiaries_change,number_of_users_change,average_number_of_providers_per_county_change,number_of_dual_eligible_users_change,percentage_of_dual_eligible_users_out_of_total_users_change,percentage_of_dual_eligible_users_out_of_dual_eligible_ffs_beneficiaries_change,total_payment_change
865263,2022-07-01 to 2023-06-30,Ambulance (Emergency),COUNTY,IA,HANCOCK,19,81.0,2468,4,59.75,...,,,,,,,,,,
454,2019-01-01 to 2019-12-31,Ambulance (Emergency & Non-Emergency),COUNTY,GA,CANDLER,13,43.0,1367,7,34.29,...,,,,,,,,,,
113238,2019-04-01 to 2020-03-31,Psychotherapy,COUNTY,TX,CROCKETT,48,105.0,513,0,,...,,,,,,,,,,
679181,2021-10-01 to 2022-09-30,Ambulance (Emergency & Non-Emergency),COUNTY,NC,JACKSON,37,99.0,6068,7,45.43,...,,,,,,,,,,
91962,2019-04-01 to 2020-03-31,Independent Diagnostic Testing Facility Pt A,COUNTY,MI,BARRY,26,15.0,6281,13,214.54,...,,,,,,,,,,
45791,2019-01-01 to 2019-12-31,Podiatry Services,COUNTY,TX,KINNEY,48,271.0,563,0,,...,,,,,,,,,,
916971,2022-07-01 to 2023-06-30,Skilled Nursing Facility,COUNTY,GA,TAYLOR,13,269.0,877,0,,...,,,,,,,,,,
895189,2022-07-01 to 2023-06-30,Independent Diagnostic Testing Facility Pt B,COUNTY,NE,BOX BUTTE,31,13.0,2520,5,39.4,...,,,,,,,,,,
971433,2022-10-01 to 2023-09-30,Preventive Health Services,COUNTY,WI,GREEN,55,45.0,6517,90,40.96,...,,,,,,,,,,
127111,2019-07-01 to 2020-06-30,Ambulance (Emergency),COUNTY,KS,CRAWFORD,20,37.0,6104,1,628.0,...,,,,,,,,,,


In [29]:
# check last modified date
last_modified_header = response.headers.get('Last-Modified')
if last_modified_header:
    last_modified = datetime.strptime(last_modified_header, '%a, %d %b %Y %H:%M:%S GMT')
    print(f"Data was last modified: {last_modified}")

print(f"Total rows fetched: {len(df)}")

Data was last modified: 2024-07-08 14:03:45
Total rows fetched: 1044711
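The `strptime` format above works when the header matches that pattern exactly. As an alternative sketch, the standard library's `email.utils.parsedate_to_datetime` parses HTTP-style dates and returns a timezone-aware value. The header string below is reconstructed from the printed timestamp, so the weekday in it is an assumption, and `parse_http_date` is a hypothetical wrapper.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_http_date(value: str) -> datetime:
    """Parse an HTTP date header (e.g. Last-Modified) into a datetime."""
    return parsedate_to_datetime(value)


# Assumed header value, rebuilt from the timestamp printed above.
header_value = "Mon, 08 Jul 2024 14:03:45 GMT"
last_modified = parse_http_date(header_value)
print(last_modified.isoformat())
```

Keeping the timezone attached avoids ambiguity when the timestamp is later compared with local times.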


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1044711 entries, 0 to 1044710
Data columns (total 48 columns):
 #   Column                                                                                Non-Null Count    Dtype 
---  ------                                                                                --------------    ----- 
 0   reference_period                                                                      1044711 non-null  object
 1   type_of_service                                                                       1044711 non-null  object
 2   aggregation_level                                                                     1044711 non-null  object
 3   state                                                                                 1044711 non-null  object
 4   county                                                                                1044711 non-null  object
 5   state_fips                                                            

### Create separate CSV file for SQL database

We will be uploading our data to a SQL database, `fwa_healthcare`.

Using a SQL database helps us efficiently manage and analyze our data, ensuring that it remains consistent and readily accessible for reporting and decision-making. SQL databases are well-suited for handling large datasets, provide robust querying capabilities, and integrate seamlessly with Business Intelligence (BI) tools for advanced analysis and visualization.

Here's how we adjust column names in the DataFrame:

In [3]:
# read in data - might take a while because there are 1M+ rows
df = pd.read_csv('data/cms_data.csv', low_memory=False)  # low_memory=False reads the file in one pass (using more memory) so pandas does not infer mixed data types

In [4]:
# adjust column names for the SQL database - column names cannot exceed 64 characters (the identifier limit in MySQL, for example)
df.columns = df.columns.str.replace('percentage', 'pct').str.replace('average', 'avg').str.replace('number', 'num')
df.columns

Index(['reference_period', 'type_of_service', 'aggregation_level', 'state',
       'county', 'state_fips', 'county_fips',
       'num_of_fee_for_service_beneficiaries', 'num_of_providers',
       'avg_num_of_users_per_provider',
       'pct_of_users_out_of_ffs_beneficiaries', 'num_of_users',
       'avg_num_of_providers_per_county', 'num_of_dual_eligible_users',
       'pct_of_dual_eligible_users_out_of_total_users',
       'percent_dual_elig_ffs', 'total_payment', 'moratorium',
       'num_of_fee_for_service_beneficiaries_dual_color',
       'num_of_fee_for_service_beneficiaries_description',
       'num_of_providers_dual_color', 'num_of_providers_description',
       'avg_num_of_users_per_provider_dual_color',
       'avg_num_of_users_per_provider_description',
       'pct_of_users_out_of_ffs_beneficiaries_dual_color',
       'pct_of_users_out_of_ffs_beneficiaries_description',
       'num_of_users_dual_color', 'num_of_users_description',
       'avg_num_of_providers_per_county_dual_

In [6]:
# check if all column names are under 64 characters
for col in df.columns:
    if len(col) > 64:
        print(f"Column name '{col}' is too long: {len(col)} characters")
    else:
        print(f"Column name '{col}' is {len(col)} characters")

# check if any column name exceeds 64 characters
exceeds_64 = any(len(col) > 64 for col in df.columns)
if exceeds_64:
    print("Some column names exceed 64 characters.")
else:
    print("All column names are within 64 characters.")

Column name 'reference_period' is 16 characters
Column name 'type_of_service' is 15 characters
Column name 'aggregation_level' is 17 characters
Column name 'state' is 5 characters
Column name 'county' is 6 characters
Column name 'state_fips' is 10 characters
Column name 'county_fips' is 11 characters
Column name 'num_of_fee_for_service_beneficiaries' is 36 characters
Column name 'num_of_providers' is 16 characters
Column name 'avg_num_of_users_per_provider' is 29 characters
Column name 'pct_of_users_out_of_ffs_beneficiaries' is 37 characters
Column name 'num_of_users' is 12 characters
Column name 'avg_num_of_providers_per_county' is 31 characters
Column name 'num_of_dual_eligible_users' is 26 characters
Column name 'pct_of_dual_eligible_users_out_of_total_users' is 45 characters
Column name 'percent_dual_elig_ffs' is 21 characters
Column name 'total_payment' is 13 characters
Column name 'moratorium' is 10 characters
Column name 'num_of_fee_for_service_beneficiaries_dual_color' is 47 ch
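The abbreviation and length check can also be expressed as two small pure functions, which makes it easy to confirm that shortening did not create duplicate names. This is a sketch, not the notebook's code: `abbreviate` and `check_names` are hypothetical helpers, and the three raw names are taken from the API response printed earlier.

```python
from typing import Dict, Iterable, List

MAX_IDENTIFIER_LENGTH = 64  # limit used in this notebook

ABBREVIATIONS = {"percentage": "pct", "average": "avg", "number": "num"}


def abbreviate(name: str, abbreviations: Dict[str, str] = ABBREVIATIONS) -> str:
    """Apply each long-form -> short-form substitution to a column name."""
    for long_form, short_form in abbreviations.items():
        name = name.replace(long_form, short_form)
    return name


def check_names(
    names: Iterable[str], limit: int = MAX_IDENTIFIER_LENGTH
) -> Dict[str, List[str]]:
    """Report names longer than `limit` and names that occur more than once."""
    names = list(names)
    too_long = [n for n in names if len(n) > limit]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    return {"too_long": too_long, "duplicates": duplicates}


raw = [
    "number_of_providers",
    "average_number_of_users_per_provider",
    "percentage_of_dual_eligible_users_out_of_dual_eligible_ffs_beneficiaries",
]
short = [abbreviate(n) for n in raw]
report = check_names(short)
print(short)
print(report)
```

The third name is still over the limit after abbreviation, which is why a manual rename step is needed for the longest columns.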

In [7]:
# adjust identified columns to be under 64 characters to meet the SQL requirement
# create a dictionary to rename columns (each key appears once; a repeated key
# in a dict literal silently overrides the earlier entry)
rename_dict = {
    'pct_of_dual_eligible_users_out_of_dual_eligible_ffs_beneficiaries': 'pct_dual_eligible_ffs_beneficiaries',
    'pct_of_dual_eligible_users_out_of_total_users_dual_color': 'pct_dual_eligible_total_users_dual_color',
    'pct_of_dual_eligible_users_out_of_total_users_description': 'pct_dual_eligible_total_users_desc',
    'pct_of_dual_eligible_users_out_of_dual_eligible_ffs_beneficiaries_dual_color': 'pct_dual_eligible_ffs_dual_color',
    'pct_of_dual_eligible_users_out_of_dual_eligible_ffs_beneficiaries_description': 'pct_dual_eligible_ffs_desc',
    'pct_of_dual_eligible_users_out_of_dual_eligible_ffs_beneficiaries_change': 'pct_dual_eligible_ffs_change',
}

# rename columns in the DataFrame
df.rename(columns=rename_dict, inplace=True)

# check the new column names and their lengths
for col in df.columns:
    print(f"Column name '{col}' is {len(col)} characters")

Column name 'reference_period' is 16 characters
Column name 'type_of_service' is 15 characters
Column name 'aggregation_level' is 17 characters
Column name 'state' is 5 characters
Column name 'county' is 6 characters
Column name 'state_fips' is 10 characters
Column name 'county_fips' is 11 characters
Column name 'num_of_fee_for_service_beneficiaries' is 36 characters
Column name 'num_of_providers' is 16 characters
Column name 'avg_num_of_users_per_provider' is 29 characters
Column name 'pct_of_users_out_of_ffs_beneficiaries' is 37 characters
Column name 'num_of_users' is 12 characters
Column name 'avg_num_of_providers_per_county' is 31 characters
Column name 'num_of_dual_eligible_users' is 26 characters
Column name 'pct_of_dual_eligible_users_out_of_total_users' is 45 characters
Column name 'percent_dual_elig_ffs' is 21 characters
Column name 'total_payment' is 13 characters
Column name 'moratorium' is 10 characters
Column name 'num_of_fee_for_service_beneficiaries_dual_color' is 47 ch

In [8]:
output_dir = 'data'
output_file = os.path.join(output_dir, 'cms_data_sql.csv')

# create the directory if it does not exist
os.makedirs(output_dir, exist_ok=True)

# save DataFrame to CSV
df.to_csv(output_file, index=False)

print(f"DataFrame saved to {output_file}")

DataFrame saved to data/cms_data_sql.csv
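The upload step itself is not shown in this notebook. As a rough sketch of what a batched load could look like, the example below uses the standard library's `sqlite3` with an in-memory database as a stand-in for `fwa_healthcare`. The function `load_csv_to_sqlite`, the table name `cms_data`, and the all-`TEXT` schema are assumptions, and the two sample rows are abbreviated from the `df.sample` output above.

```python
import csv
import io
import sqlite3
from typing import TextIO


def load_csv_to_sqlite(
    csv_file: TextIO, conn: sqlite3.Connection, table: str, batch_size: int = 10_000
) -> int:
    """Create `table` from the CSV header and insert the rows in batches."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)  # every column as TEXT
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    insert_sql = f'INSERT INTO "{table}" VALUES ({placeholders})'

    total = 0
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(insert_sql, batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany(insert_sql, batch)
        total += len(batch)
    conn.commit()
    return total


# Two rows abbreviated from the sample output, with the shortened column names.
sample_csv = io.StringIO(
    "reference_period,type_of_service,state,county,num_of_providers\n"
    "2019-04-01 to 2020-03-31,Psychotherapy,TX,CROCKETT,0\n"
    "2022-07-01 to 2023-06-30,Skilled Nursing Facility,GA,TAYLOR,0\n"
)
conn = sqlite3.connect(":memory:")
rows_loaded = load_csv_to_sqlite(sample_csv, conn, "cms_data", batch_size=1)
row_count = conn.execute('SELECT COUNT(*) FROM "cms_data"').fetchone()[0]
print(rows_loaded, row_count)
```

For the real file, `csv_file` would be `open('data/cms_data_sql.csv', newline='')`, and a production database would need its own driver and proper column types.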
