## A1. Data Preprocessing – Hospital Data

**Description**  
This section preprocesses hospital-level data from the American Hospital Association (AHA) dataset.  
It includes steps such as data loading, filtering, recoding, and cleaning to prepare the dataset for analysis of hospital characteristics.

**Data Source**  
- American Hospital Association (AHA) Annual Survey and IT Supplement (Year: 2023)  
- Catherine E Strawley, Julia Adler-Milstein, A Jay Holmgren, Jordan Everson, New indices to track interoperability among US hospitals, Journal of the American Medical Informatics Association, Volume 32, Issue 2, February 2025, Pages 318–327

**Purpose**  
To create a cleaned, analysis-ready hospital-level dataset that links the annual survey, geocodes, and IT Supplement for downstream analyses.

**Disclaimer**  
This codebase was partially cleaned and annotated using OpenAI’s ChatGPT-4o.  
Please review and validate before using for critical applications.

**Notebook Workflow**
1. load necessary libraries
2. load AHA annual survey data 
3. load AHA hospital geocodes 
4. load AHA IT data 
5. create master dataframe linking annual survey, geocodes, and IT data 
6. calculate interoperability features 
7. save master dataframe

### 1. load necessary libraries

In [1]:
## load libraries
# Import standard Python libraries
import getpass
import json
import os
import re
import sys
from datetime import datetime, timedelta

# Import data analysis and visualization libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns



### 2. load annual survey data 

In [2]:
# Define the file path to the AHA dataset (last 5 years)
AHA_AS_path = "../../../data/AHA/AHA_recent5yrs.csv"

# Load the dataset into a DataFrame
AHA_AS_df = pd.read_csv(AHA_AS_path, low_memory=False)

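# Subset the data to include only records from the year 2023.
# .copy() makes the subset an independent DataFrame, so the later in-place
# edit of the ID column does not trigger pandas' SettingWithCopyWarning.
AHA2023 = AHA_AS_df[AHA_AS_df.YEAR == 2023].copy()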

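A quick sanity check after the year filter can catch an empty subset or duplicated hospital IDs before they propagate into the merges. The sketch below uses a small made-up frame in place of `AHA_AS_df`; the toy values and the `summarize_year_subset` helper are illustrative assumptions, not part of the AHA files or the original notebook.

```python
import pandas as pd

# Toy stand-in for AHA_AS_df (values are made up for illustration)
toy_as = pd.DataFrame({
    "ID": ["6010001", "6010001", "6010002", "6010003"],
    "YEAR": [2022, 2023, 2023, 2023],
})


def summarize_year_subset(df, year, id_col="ID", year_col="YEAR"):
    """Return the row count and number of duplicated IDs for one survey year."""
    subset = df[df[year_col] == year].copy()
    return {
        "rows": len(subset),
        "duplicate_ids": int(subset[id_col].duplicated().sum()),
    }


summary = summarize_year_subset(toy_as, 2023)
print(summary)
```

A non-zero `duplicate_ids` count would suggest that the later left joins could multiply rows.
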
### 3. load geocodes 

In [None]:
# Define the file path to the geocoded AHA address dataset
AHA_address_path = "../../../data/AHA/AHA_address_geocode.csv"

# Load the address dataset
AHA_address_df = pd.read_csv(AHA_address_path, low_memory=False)

# Keep only the columns needed for geospatial merging
AHA_address_df = AHA_address_df[['ID', 'latitude_address', 'longitude_address']]

# Count hospitals missing either latitude or longitude
missing_geocodes = AHA_address_df[
    (AHA_address_df['latitude_address'].isnull()) | 
    (AHA_address_df['longitude_address'].isnull())
].shape[0]

print(f"Total of {missing_geocodes} hospitals with missing geocodes (latitude or longitude)")

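The count in this step treats a hospital as missing when either coordinate is null. The sketch below, on made-up coordinates, shows how that count can differ from a stricter "both missing" count; the toy frame is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for AHA_address_df (coordinates are made up)
toy_geo = pd.DataFrame({
    "ID": ["1", "2", "3", "4"],
    "latitude_address": [40.1, np.nan, np.nan, 35.2],
    "longitude_address": [-75.3, -80.0, np.nan, -97.5],
})

coords = toy_geo[["latitude_address", "longitude_address"]]

# any(axis=1): at least one coordinate is null; all(axis=1): both are null
missing_either = int(coords.isnull().any(axis=1).sum())
missing_both = int(coords.isnull().all(axis=1).sum())

print(f"missing either: {missing_either}, missing both: {missing_both}")
```
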

### 4. load IT data 

In [5]:
# Define the file path to the 2023 AHA IT Supplement dataset
AHA_IT_path = "../../../data/AHA/2023AHAIT.csv"

# Load the IT dataset
AHA_IT_df = pd.read_csv(AHA_IT_path, low_memory=False)


### 5. create master dataframe linking annual survey, geocodes, and IT data 

In [None]:
# 5.1 Standardize ID columns by converting to string and removing whitespace
AHA2023['ID'] = AHA2023['ID'].astype(str).str.replace(r"\s+", "", regex=True)
AHA_address_df['ID'] = AHA_address_df['ID'].astype(str).str.replace(r"\s+", "", regex=True)
AHA_IT_df['id'] = AHA_IT_df['id'].astype(str).str.replace(r"\s+", "", regex=True)


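One caveat with `astype(str)`: if pandas parsed an ID column as floats (for example because it contains missing values), the string form carries a trailing `.0` and will not match IDs from the other files. The sketch below shows one possible way to guard against that; the `normalize_id` helper and the toy values are hypothetical, not part of the original notebook.

```python
import pandas as pd


def normalize_id(series):
    """Convert IDs to strings, remove all whitespace, and strip a trailing '.0'."""
    as_text = series.astype(str).str.replace(r"\s+", "", regex=True)
    # Note: missing values become the literal string 'nan' after astype(str)
    return as_text.str.replace(r"\.0$", "", regex=True)


# Toy IDs: padded strings plus one value that was parsed as a float
toy_ids = pd.Series([" 6010001", "601 0002 ", 6010003.0])
clean_ids = normalize_id(toy_ids)
print(clean_ids.tolist())
```

An alternative is to read the ID column as text from the start, for example with `pd.read_csv(path, dtype={"ID": str})`, which avoids the float conversion altogether.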

In [7]:
# 5.2 standardize column names for annual survey data 

# Make a copy of the original AHA2023 DataFrame
AHA2023_2 = AHA2023.copy()


# Rename columns: lowercase + `_as` suffix to indicate Annual Survey source
AHA2023_2.columns = [col.lower() + '_as' for col in AHA2023_2.columns]


In [8]:
# 5.3 standardize column names for IT data 

# Make a copy of the AHA IT dataset
AHA_IT_2 = AHA_IT_df.copy()

# Rename columns: lowercase + `_it` suffix to indicate IT Supplement source
AHA_IT_2.columns = [col.lower() + '_it' for col in AHA_IT_2.columns]

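The lowercase-plus-suffix renaming in steps 5.2 and 5.3 can also be written with pandas' built-in `rename` and `add_suffix`, which return a new DataFrame and so make the explicit `.copy()` unnecessary. A minimal sketch on a made-up frame (the column names here are illustrative):

```python
import pandas as pd

# Toy frame with uppercase column names (names are made up)
toy = pd.DataFrame({"ID": ["1"], "HOSPITAL_NAME": ["General Hospital"]})

# Lowercase every column name, then append the source suffix
toy_as = toy.rename(columns=str.lower).add_suffix("_as")

print(list(toy_as.columns))
```
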

In [9]:
## 5.4 Merge AHA Annual Survey, IT Supplement, and Address Data

# This step merges:
# 1. AHA Annual Survey (`AHA2023_2`) with the IT Supplement (`AHA_IT_2`) using hospital ID.
# 2. The result with geocoded address data (`AHA_address_df`), again using hospital ID.

# The merged dataset is used for further analysis. The original `ID` column from the address file is dropped to avoid redundancy.


In [10]:
# Merge AHA Annual Survey with IT Supplement data
AHA_AS_IT_joined = AHA2023_2.merge(AHA_IT_2, left_on='id_as', right_on='id_it', how='left')

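# Merge with geocoded address data
# (how='left' keeps every row of AHA_address_df, including any hospital
# that has no matching record in the 2023 survey)
AHA_AS_IT_address_joined = AHA_address_df.merge(AHA_AS_IT_joined, left_on='ID', right_on='id_as', how='left')

# Check the shape of the merged dataset
# (printed explicitly: a bare expression in the middle of a cell is not displayed)
print(AHA_AS_IT_address_joined.shape)

# Drop redundant ID column from address data
AHA_AS_IT_address_joined = AHA_AS_IT_address_joined.drop('ID', axis=1)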

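Left joins can silently add rows when the key is duplicated, or leave rows unmatched. A possible check, sketched here on toy frames with made-up columns, uses the `validate` and `indicator` options of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for the survey and IT Supplement frames (values are made up)
survey = pd.DataFrame({"id_as": ["1", "2", "3"], "beds_as": [100, 250, 40]})
it_supp = pd.DataFrame({"id_it": ["1", "3"], "ehr_it": ["A", "B"]})

merged = survey.merge(
    it_supp,
    left_on="id_as",
    right_on="id_it",
    how="left",
    validate="one_to_one",  # raises MergeError if either key column has duplicates
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

matched = int((merged["_merge"] == "both").sum())
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"{matched} matched, {unmatched} without IT Supplement data")
```

Applied to the real frames, the unmatched count would show how many hospitals lack IT Supplement responses.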

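### 6. calculate interoperability features

In [11]:
# Import the project module that defines the index functions
# (assumption: calculate_interoperability.py is importable from the notebook's working directory)
import calculate_interoperability

# Compute the core and friction interoperability indices row by row
AHA_AS_IT_address_joined['core_index'] = AHA_AS_IT_address_joined.apply(calculate_interoperability.calculate_core_index, axis=1)
AHA_AS_IT_address_joined['friction_index'] = AHA_AS_IT_address_joined.apply(calculate_interoperability.calculate_friction_index, axis=1)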

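For readers unfamiliar with the row-wise pattern, `DataFrame.apply(func, axis=1)` passes each hospital row to `func` as a Series and collects the return values into a new column. The sketch below illustrates only that mechanism: `toy_row_score` and its columns are hypothetical and do not reproduce the core or friction index definitions from Strawley et al. or from the `calculate_interoperability` module.

```python
import pandas as pd

# Hypothetical indicator columns; NOT the published index definitions
TOY_COLUMNS = ["send_it", "receive_it", "find_it"]


def toy_row_score(row):
    """Fraction of toy indicator columns equal to 1 for a single hospital row."""
    values = row[TOY_COLUMNS]
    if values.isnull().all():
        # No survey response at all: return NaN rather than 0
        return float("nan")
    return float((values == 1).sum()) / len(TOY_COLUMNS)


toy = pd.DataFrame({
    "send_it": [1, 0, None],
    "receive_it": [1, 1, None],
    "find_it": [0, 0, None],
})

# axis=1 calls toy_row_score once per row
toy["toy_score"] = toy.apply(toy_row_score, axis=1)
print(toy)
```
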
### 7. save master dataframe

In [12]:
# Save the final merged dataset to CSV (excluding the index column)
AHA_AS_IT_address_joined.to_csv('./data/AHA2023_master.csv', index=False)

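When the saved CSV is read back, pandas infers column types again, so string IDs that look numeric are converted to integers and any leading zeros are lost. The sketch below shows the effect using an in-memory buffer and made-up values; whether the real AHA IDs contain leading zeros is not established here.

```python
import io

import pandas as pd

# Toy master frame with string IDs (values are made up)
toy_master = pd.DataFrame({
    "id_as": ["0012345", "6010002"],
    "core_index": [0.5, 0.75],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_master.to_csv(buffer, index=False)

# Default read: id_as is inferred as an integer column
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Explicit dtype: id_as stays a string column
buffer.seek(0)
string_read = pd.read_csv(buffer, dtype={"id_as": str})

print(default_read["id_as"].tolist())
print(string_read["id_as"].tolist())
```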