## A0_data_collection

This notebook documents the data collection steps for all raw datasets used in the project.  
- Most datasets were downloaded manually.
- AHA Annual Survey: requires manual download through the Wharton Research Data Services (WRDS) portal.
- AHA IT Supplement: downloaded via the WRDS Cloud.
- Other data (e.g., CMS, geospatial): downloaded from public repositories.
- Preprocessing steps are explained in subsequent notebooks. 


#### AHA Annual Survey Data Collection 
- AHA access was requested through WRDS.
- No software is needed. The Annual Survey data was not available in the WRDS Cloud, so it was downloaded manually from the WRDS portal.

#### AHA IT Supplement Data Collection 
- AHA access was requested through WRDS.
- Requires initial setup following the WRDS instructions: https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-python/python-wrds-cloud/

In [None]:
# AHA IT Supplement Data Collection
import wrds

# Open a WRDS connection (prompts for WRDS credentials unless they are already configured)
db = wrds.Connection()
data = db.raw_sql('SELECT * FROM aha_it_survey_3years.it_survey_3years')
db.close()

# Save the most recent 3 years of IT survey data
data.to_csv('aha_it_survey_3years.csv', index=False)

#### Dartmouth Atlas Data 
- Used the most recent ZIP Code Crosswalk (https://data.dartmouthatlas.org/downloads/geography/ZipHsaHrr19.csv.zip)
- HSA geographic boundary data (https://data.dartmouthatlas.org/downloads/geography/HSA_Bdry__AK_HI_unmodified.zip)
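
The crosswalk download can also be scripted. The following is a minimal sketch, not the method used in the project: the helper names `read_zipped_csv` and `download_crosswalk` are hypothetical, and it assumes the archive contains a single CSV file. All columns are read as strings so that ZIP codes keep their leading zeros.

```python
import io
import zipfile

import pandas as pd

CROSSWALK_URL = "https://data.dartmouthatlas.org/downloads/geography/ZipHsaHrr19.csv.zip"


def read_zipped_csv(zip_bytes: bytes) -> pd.DataFrame:
    """Read the first CSV found inside a ZIP archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no CSV file found in the archive")
        with archive.open(csv_names[0]) as handle:
            # Read everything as strings so ZIP codes keep their leading zeros
            return pd.read_csv(handle, dtype=str)


def download_crosswalk(url: str = CROSSWALK_URL) -> pd.DataFrame:
    """Download the crosswalk archive and return its CSV as a DataFrame."""
    import requests  # imported lazily so read_zipped_csv works without it

    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return read_zipped_csv(response.content)
```

Calling `download_crosswalk()` would fetch the file from the URL listed above; `read_zipped_csv` can also be applied to a manually downloaded archive by passing its bytes.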

#### Area Deprivation Index 
- Requires login to the Neighborhood Atlas (https://www.neighborhoodatlas.medicine.wisc.edu/)
- Downloaded the 2023 ZIP code ADI files for each state individually
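
Because the ADI files arrive one per state, they need to be stacked before use. A minimal sketch of that step follows, assuming the state files have been extracted as CSVs into a single folder; the function name `combine_state_files`, the `*.csv` pattern, and the added `source_file` column are illustrative assumptions, not part of the project code.

```python
from pathlib import Path

import pandas as pd


def combine_state_files(folder: str, pattern: str = "*.csv") -> pd.DataFrame:
    """Stack every per-state file in `folder` into one DataFrame."""
    paths = sorted(Path(folder).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder!r}")
    frames = []
    for path in paths:
        # dtype=str keeps ZIP codes with leading zeros intact
        frame = pd.read_csv(path, dtype=str)
        # Record which state file each row came from
        frame["source_file"] = path.name
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```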

#### Social Vulnerability Index 
- Year: 2022, Geography: United States, Geography Type: ZIP Code Tabulation Areas (https://www.atsdr.cdc.gov/place-health/php/svi/svi-data-documentation-download.html)

#### Health Professional Shortage Area and Medically Underserved Area
- Primary Care Area HPSA Designation Boundaries SHP (https://data.hrsa.gov//DataDownload/DD_Files/HPSA_PLYPC_SHP.zip)
- Dental Care Area HPSA Designation Boundaries SHP (https://data.hrsa.gov//DataDownload/DD_Files/HPSA_PLYDH_SHP.zip)
- Mental Health Care Area HPSA Designation Boundaries SHP (https://data.hrsa.gov//DataDownload/DD_Files/HPSA_PLYMH_SHP.zip)
- Medically Underserved Area Designation Boundaries SHP (https://data.hrsa.gov//DataDownload/DD_Files/MUA_SHP.zip)
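
The four HRSA archives can be fetched in one loop. This is a sketch under assumptions: the dictionary keys, the `hrsa_raw` destination folder, and the helper names `local_name` and `download_all` are hypothetical; only the URLs come from the list above.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

# URLs copied from the list above; the keys are illustrative labels
HRSA_URLS = {
    "hpsa_primary_care": "https://data.hrsa.gov//DataDownload/DD_Files/HPSA_PLYPC_SHP.zip",
    "hpsa_dental": "https://data.hrsa.gov//DataDownload/DD_Files/HPSA_PLYDH_SHP.zip",
    "hpsa_mental_health": "https://data.hrsa.gov//DataDownload/DD_Files/HPSA_PLYMH_SHP.zip",
    "mua": "https://data.hrsa.gov//DataDownload/DD_Files/MUA_SHP.zip",
}


def local_name(url: str) -> str:
    """Return the file name at the end of a download URL."""
    return PurePosixPath(urlparse(url).path).name


def download_all(dest_dir: str = "hrsa_raw") -> list:
    """Stream each archive to dest_dir and return the saved paths."""
    import requests  # imported lazily so local_name works without it

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in HRSA_URLS.values():
        target = dest / local_name(url)
        with requests.get(url, stream=True, timeout=120) as response:
            response.raise_for_status()
            with open(target, "wb") as out:
                # Write in 1 MiB chunks to avoid holding a whole archive in memory
                for chunk in response.iter_content(chunk_size=1 << 20):
                    out.write(chunk)
        saved.append(target)
    return saved
```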

#### County Digital Infrastructure 
- Retrieved from the Census API (2023 ACS 5-year estimates, tables B28001 and B28002)

In [None]:
import requests
import pandas as pd

# Set up (replace the placeholder with your own Census API key)
API_KEY = "YOUR_API_KEY"
url = "https://api.census.gov/data/2023/acs/acs5"
params = {
    "get": "NAME,B28001_001E,B28001_002E,B28001_008E,B28002_001E,B28002_008E,B28002_004E,B28002_002E,B28002_013E",
    "for": "county:*",
    "in": "state:*",
    "key": API_KEY
}

# Request data and stop early if the API returns an HTTP error
response = requests.get(url, params=params)
response.raise_for_status()
data = response.json()

# Convert to DataFrame
internet_access = pd.DataFrame(data[1:], columns=data[0])

# Rename columns for clarity
internet_access = internet_access.rename(columns={
    "NAME": "County",
    "B28001_001E": "Total_Computer_Households",
    "B28001_002E": "at_least_1_device",
    "B28001_008E": "no_device",
    "B28002_001E": "Total_Internet_Households",
    "B28002_004E": "With_Broadband",
    "B28002_002E": "Internet_Access",
    "B28002_013E": "No_Internet",
    "state": "State_Code",
    "county": "County_Code"
})

# Convert numeric columns (the API returns every value as a string)
numeric_cols = [
    "Total_Computer_Households", "at_least_1_device", "no_device",
    "Total_Internet_Households", "With_Broadband", "Internet_Access", "No_Internet",
]
for col in numeric_cols:
    internet_access[col] = pd.to_numeric(internet_access[col])

# Calculate device, broadband, and internet access percentages
internet_access["Device_Percent"] = (internet_access["at_least_1_device"] / internet_access["Total_Computer_Households"] * 100).round(2)
internet_access["Broadband_Percent"] = (internet_access["With_Broadband"] / internet_access["Total_Internet_Households"] * 100).round(2)
internet_access["Internet_Percent"] = (
    (1 - internet_access["No_Internet"] / internet_access["Total_Internet_Households"]) * 100
).round(2)

# Preview
print(internet_access.head())
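
The conversion step above relies on the shape of the Census API response: a JSON list of rows whose first row is the header. The sketch below illustrates that shape with a made-up two-row payload; the values and the helper name `census_json_to_frame` are hypothetical and are not real ACS figures.

```python
import pandas as pd


def census_json_to_frame(payload: list, numeric_cols: list) -> pd.DataFrame:
    """Turn a Census API payload (header row + data rows) into a DataFrame."""
    header, *rows = payload
    frame = pd.DataFrame(rows, columns=header)
    for col in numeric_cols:
        # Values arrive as strings; missing values become NaN instead of raising
        frame[col] = pd.to_numeric(frame[col], errors="coerce")
    return frame


# Made-up payload for illustration only
sample_payload = [
    ["NAME", "B28002_001E", "B28002_004E", "state", "county"],
    ["Example County, Example State", "100", "80", "01", "001"],
]
sample = census_json_to_frame(sample_payload, ["B28002_001E", "B28002_004E"])
```

The `state` and `county` columns are left as strings so that their leading zeros are preserved.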


#### CMS Hospital Quality 
- https://data.cms.gov/provider-data/archived-data/hospitals
- 14 quarterly timepoints from 2022 through Q2 2025
- There are 5 data snapshots for 2023. To maintain consistency, we used the 10/06/2023 snapshot for Q4.
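
When a year has more snapshots than quarters, it helps to see which quarter each snapshot date falls in before choosing one. The sketch below groups snapshot dates by calendar quarter; the function name `snapshots_by_quarter` is hypothetical, labeling snapshots by the calendar quarter of their date is an assumption, and every date in the example other than 10/06/2023 is made up.

```python
import pandas as pd


def snapshots_by_quarter(dates: list) -> dict:
    """Group MM/DD/YYYY snapshot dates by calendar quarter (e.g. '2023Q4')."""
    stamps = pd.to_datetime(pd.Series(dates), format="%m/%d/%Y")
    quarters = stamps.dt.to_period("Q").astype(str)
    grouped = {}
    for quarter, date in zip(quarters, dates):
        grouped.setdefault(quarter, []).append(date)
    return grouped


# Only 10/06/2023 comes from the notes above; the other dates are illustrative
example = snapshots_by_quarter(["01/15/2023", "10/06/2023", "10/20/2023"])
# Quarters with more than one snapshot need a manual choice
needs_choice = {q: d for q, d in example.items() if len(d) > 1}
```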