## A1. Data Preprocessing – Hospital Data

**Description**  
This section preprocesses hospital-level data from the American Hospital Association (AHA) Annual Survey and IT Supplement.  
It covers data loading, filtering to the survey year, recoding, and cleaning, then merges the two sources on hospital ID to prepare the dataset for analysis of hospital characteristics.

**Data Source**  
- American Hospital Association (AHA) Annual Survey and IT Supplement (Year: 2023, 2024)  
- Strawley CE, Adler-Milstein J, Holmgren AJ, Everson J. New indices to track interoperability among US hospitals. Journal of the American Medical Informatics Association. 2025;32(2):318–327.

**Purpose**  
To create a cleaned, analysis-ready hospital-level dataset, including the core and friction interoperability indices, for downstream analysis.

**Input**  
- AHA Annual Survey (`AHA_recent5yrs.csv`, subset to 2023 as `AHA2023`) and AHA IT Supplement (`aha_it_survey_3years.csv`, 2023 and 2024 responses)

**Output**
- `AHA20232024_master.csv`


### 1. load necessary libraries

In [None]:
# Import standard Python libraries
import getpass
import json
import os
import re
import sys
from datetime import datetime, timedelta

# Import data analysis and visualization libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns



### 2. load annual survey data 

In [None]:
# Define the file path to the AHA dataset (last 5 years)
AHA_AS_path = "../../../data/AHA/AHA_recent5yrs.csv"

# Load the dataset into a DataFrame
AHA_AS_df = pd.read_csv(AHA_AS_path, low_memory=False)

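# Subset the data to include only records from the year 2023.
# .copy() makes AHA2023 an independent DataFrame, so later column
# assignments do not trigger SettingWithCopyWarning.
AHA2023 = AHA_AS_df[AHA_AS_df.YEAR == 2023].copy()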

### 3. load AHA IT supplement data 

In [None]:
# Define the file path to the multi-year AHA IT Supplement dataset
AHA_IT_path = "../../../data/AHA/aha_it_survey_3years.csv"

# Load the IT dataset
AHA_IT_recent = pd.read_csv(AHA_IT_path, low_memory=False)


In [None]:
# Split by year
df24 = AHA_IT_recent.query("year == 2024").copy()
df23 = AHA_IT_recent.query("year == 2023").copy()
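
The merge in the next cell joins the two yearly subsets on `id`, which implicitly assumes one row per hospital per year. A minimal sketch of how that assumption could be checked, using toy data with made-up IDs rather than the real AHA file:

```python
import pandas as pd

# Toy stand-in for AHA_IT_recent (hypothetical IDs; hospital "B" appears twice in 2024)
toy = pd.DataFrame({
    "id": ["A", "A", "B", "B", "B"],
    "year": [2023, 2024, 2023, 2024, 2024],
})

# Flag every row whose (id, year) pair occurs more than once
dup_mask = toy.duplicated(subset=["id", "year"], keep=False)
n_dups = int(dup_mask.sum())

# Number of distinct hospitals responding in each year
counts = toy.groupby("year")["id"].nunique()

print(f"rows with a duplicated (id, year): {n_dups}")
print(counts)
```

If any duplicates were found in the real data, the outer merge would produce more than one row for that hospital, so they would need to be resolved first.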

In [None]:
# Merge so each hospital's 2024 row lines up with its 2023 row
merged = df24.merge(df23, on='id', how='outer', suffixes=('_2024', '_2023'))
# Add indicators for whether each hospital responded in 2024 or 2023
merged['responded_2024'] = merged['year_2024'].notna().astype(int)
merged['responded_2023'] = merged['year_2023'].notna().astype(int)

# Start the prioritized DataFrame from the hospital IDs (it shares merged's index)
AHA_IT_prioritized = merged[['id']].copy()

# Add a data source indicator: 2024 if available, otherwise 2023
AHA_IT_prioritized['data_source_year'] = merged.apply(
    lambda row: '2024' if pd.notna(row['year_2024']) else '2023' if pd.notna(row['year_2023']) else 'None', 
    axis=1
)
# Copy the response indicators to the prioritized DataFrame
AHA_IT_prioritized['responded_2024'] = merged['responded_2024']
AHA_IT_prioritized['responded_2023'] = merged['responded_2023']

# For every survey column, take the 2024 value and fall back to 2023 where 2024 is missing
for col in AHA_IT_recent.columns:
    if col in ['id', 'year']:
        continue
    AHA_IT_prioritized[col] = merged[f'{col}_2024'].combine_first(merged[f'{col}_2023'])
AHA_IT_df = AHA_IT_prioritized.copy()
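
The year-prioritization logic above can be illustrated on a toy example. The sketch below uses made-up hospital IDs and a hypothetical column `ehr_vendor` (not an actual AHA field) to show how `combine_first` keeps the 2024 value and falls back to 2023 only where 2024 is missing.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical IDs and a hypothetical column, not real AHA fields
toy = pd.DataFrame({
    "id": ["A", "A", "B", "C"],
    "year": [2023, 2024, 2023, 2024],
    "ehr_vendor": ["old", "new", "only23", np.nan],
})

toy24 = toy.query("year == 2024")
toy23 = toy.query("year == 2023")
toy_merged = toy24.merge(toy23, on="id", how="outer", suffixes=("_2024", "_2023"))

# Prefer the 2024 value; use the 2023 value only where 2024 is missing
toy_merged["ehr_vendor"] = toy_merged["ehr_vendor_2024"].combine_first(
    toy_merged["ehr_vendor_2023"]
)
result = toy_merged.set_index("id")["ehr_vendor"]
print(result)
```

Hospital A keeps its 2024 answer, hospital B falls back to 2023 because it has no 2024 row, and hospital C stays missing because its only response left the item blank.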

### 4. create master dataframe linking annual survey, geocodes, and IT data 

In [None]:
# 5.1 Standardize ID columns by converting to string and removing whitespace
AHA2023['ID'] = AHA2023['ID'].astype(str).str.replace(r"\s+", "", regex=True)
AHA_IT_df['id'] = AHA_IT_df['id'].astype(str).str.replace(r"\s+", "", regex=True)
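
One caveat with `astype(str)` is that it cannot undo numeric parsing that already happened when the CSV was read. The sketch below, using made-up ID values, shows that an ID stored as a float keeps a trailing `.0`, and that an ID with leading zeros loses them unless the column is read as a string. Whether the real AHA files are affected is an assumption to verify, not something the source states.

```python
import io

import pandas as pd

# Hypothetical raw IDs: stray whitespace, a clean string, and a float from numeric parsing
raw_ids = pd.Series([" 6430 012 ", "6430012", 6430012.0], dtype=object)
clean = raw_ids.astype(str).str.replace(r"\s+", "", regex=True)
print(clean.tolist())  # the float-derived ID does not match the other two

# Reading the ID column as a string preserves it exactly as written in the file
csv_text = "ID,YEAR\n0012345,2023\n"
as_default = pd.read_csv(io.StringIO(csv_text))
as_string = pd.read_csv(io.StringIO(csv_text), dtype={"ID": str})
print(as_default["ID"].iloc[0], as_string["ID"].iloc[0])
```

If either situation applies, passing `dtype={"ID": str}` (or `{"id": str}` for the IT file) to `pd.read_csv` would keep both ID columns comparable before the merge.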



In [None]:
# 5.2 standardize column names for annual survey data 

# Make a copy of the original AHA2023 DataFrame
AHA2023_2 = AHA2023.copy()


# Rename columns: lowercase + `_as` suffix to indicate Annual Survey source
AHA2023_2.columns = [col.lower() + '_as' for col in AHA2023_2.columns]


In [None]:
# 5.3 standardize column names for IT data 

# Make a copy of the AHA IT dataset
AHA_IT_2 = AHA_IT_df.copy()

# Rename columns: lowercase + `_it` suffix to indicate IT Supplement source
AHA_IT_2.columns = [col.lower() + '_it' for col in AHA_IT_2.columns]


In [None]:
## 5.4 Merge AHA Annual Survey and IT Supplement

# This step left-joins the AHA Annual Survey (`AHA2023_2`) with the
# IT Supplement (`AHA_IT_2`) on hospital ID.


In [None]:
# Merge AHA Annual Survey with IT Supplement data
AHA_AS_IT_joined = AHA2023_2.merge(AHA_IT_2, left_on='id_as', right_on='id_it', how='left')
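
A left join silently keeps unmatched survey rows and silently duplicates rows if an ID repeats. As a sketch of how the merge could be audited, the toy example below (hypothetical values and a hypothetical `beds_as` column) uses the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for AHA2023_2 and AHA_IT_2 (hypothetical values)
survey = pd.DataFrame({"id_as": ["1", "2", "3"], "beds_as": [100, 250, 40]})
it_supp = pd.DataFrame({"id_it": ["1", "3"], "responded_2024_it": [1, 0]})

checked = survey.merge(
    it_supp,
    left_on="id_as",
    right_on="id_it",
    how="left",
    validate="one_to_one",  # raises MergeError if either ID column has duplicates
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Share of survey hospitals that found a matching IT Supplement record
match_rate = (checked["_merge"] == "both").mean()
print(checked["_merge"].value_counts())
print(f"match rate: {match_rate:.2f}")
```

Rows marked `left_only` are hospitals in the Annual Survey with no IT Supplement response in either year; their `_it` columns will be missing in the master file.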


In [None]:
# Calculate the core and friction indices with the helper functions (refer to the cited source).
# Assumes `calculate_interoperability` is a local module available on the import path.
import calculate_interoperability

AHA_AS_IT_joined['core_index'] = AHA_AS_IT_joined.apply(calculate_interoperability.calculate_core_index, axis=1)
AHA_AS_IT_joined['friction_index'] = AHA_AS_IT_joined.apply(calculate_interoperability.calculate_friction_index, axis=1)

In [None]:
# 5.5 Save the final merged dataset to CSV (excluding the index column)
AHA_AS_IT_joined.to_csv('./data/AHA20232024_master.csv', index=False)
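
`to_csv` fails if the target directory does not exist, and a later `read_csv` will re-infer column types. The sketch below shows one way to guard against both. It writes to a temporary directory with a hypothetical file name and made-up values, so it does not touch the real output path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical output location inside a temporary directory
out_dir = os.path.join(tempfile.mkdtemp(), "data")
os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
out_path = os.path.join(out_dir, "example_master.csv")

# Toy stand-in for AHA_AS_IT_joined (made-up values)
toy_master = pd.DataFrame({"id_as": ["0012345"], "core_index": [0.5]})
toy_master.to_csv(out_path, index=False)

# Read the ID back as a string so it is not re-parsed as a number
reloaded = pd.read_csv(out_path, dtype={"id_as": str})
print(reloaded)
```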
