## C5. Geographically Weighted Regression 

**Description**  
This section fits a geographically weighted regression (GWR) of AI implementation level on hospital and community characteristics.

**Purpose**  
To model and analyze how the relationships between hospital/community characteristics and AI implementation level vary across space.



### 1 Load necessary libraries, functions, and pre-processed data 

In [47]:
# Import necessary libraries
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import contextily as ctx
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import pdist, squareform

In [None]:
# load preprocessed dataframe 
AHA_master = pd.read_csv('./data/AHA_master_external_data.csv', low_memory=False)
AHA_IT = AHA_master[~AHA_master['id_it'].isnull()]

In [49]:
import os
os.environ['SHAPE_RESTORE_SHX'] = 'YES'
states = gpd.read_file('../../../data/map_data/state_boundary.shp')

In [None]:

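# apply_ai_scores_to_dataframe must be defined or imported before this cell runs
# (its definition is not shown in this section)
AHA_master2 = apply_ai_scores_to_dataframe(AHA_IT)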

### 2 Data engineering 

These hospital characteristics were selected by investigator consensus. We also used LASSO regression to explore and identify additional variables that predict AI/ML implementation and reflect hospital resource levels.

- **rural_urban_type** : collected from the AHA survey; categorized as {1: rural, 2: micro, 3: metro} based on the location of the hospital ('CBSATYPE')
- **system_member** : hospital belongs to a corporate body that owns or manages health provider facilities or health-related subsidiaries ('MHSMEMB')
- **delivery_system** : delivery system identified using existing theory and AHA Annual Survey data {1: Centralized Health System, 2: Centralized Physician/Insurance Health System, 3: Moderately Centralized Health System, 4: Decentralized Health System, 5: Independent Hospital System, 6/Missing: Insufficient data to determine} ('CLUSTER')
- **community_hospital** : all nonfederal, short-term general, and special hospitals whose facilities and services are available to the public {0: No, 1: Yes} ('CHC')
- **subsidary_hospital** : hospital itself operates a subsidiary corporation {0: No, 1: Yes} ('SUBS')
- **frontline_hospital** : frontline facility {0: No, 1: Yes} ('FRTLN')
- **joint_commission_accreditation** : accreditation by The Joint Commission {0: No, 1: Yes} ('MAPP1')
- **center_quality** : Center for Improvement in Healthcare Quality accreditation {0: No, 1: Yes} ('MAPP22')
- **teaching_hospital** : major teaching hospital ('MAPP8'), minor teaching hospital ('MAPP3' or 'MAPP5')
- **critical_access** : critical access hospital {0: No, 1: Yes} ('MAPP18')
- **rural_referral** : rural referral center {0: No, 1: Yes} ('MAPP19')
- **ownership_type** : type of organization responsible for establishing policy concerning overall operation {nonfederal_government, non_profit_nongovernment, for_profit, federal_government, other} ('CNTRL')
- **bedsize** : bed-size category, ordinal variable ('BSC')
- **medicare_ipd_percentage** : Medicare inpatient days / total inpatient days × 100. Proxy for the proportion of Medicare patients
- **medicaid_ipd_percentage** : Medicaid inpatient days / total inpatient days × 100. Proxy for the proportion of Medicaid patients
- **core_index** : summary measure to track the interoperability of US hospitals (https://doi.org/10.1093/jamia/ocae289)
- **friction_index** : summary measures to track the barrier or difficulty in interoperability between hospitals (https://doi.org/10.1093/jamia/ocae289)
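
The LASSO screening mentioned above is not shown in this notebook. As a rough illustration of the idea only, the sketch below runs a small coordinate-descent LASSO on synthetic data. The function name `lasso_coordinate_descent`, the penalty `alpha=0.2`, and the columns `x0`..`x4` are assumptions made for this example, not the settings or variables used in the analysis.

```python
import numpy as np

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + alpha*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            # Soft-thresholding sets weak predictors exactly to zero
            beta[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_scale[j]
    return beta

# Synthetic data (assumption): only x0 and x2 truly drive the outcome
rng = np.random.default_rng(0)
names = ['x0', 'x1', 'x2', 'x3', 'x4']
X = rng.normal(size=(300, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize predictors
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=300)
y = y - y.mean()                              # center the outcome

beta = lasso_coordinate_descent(X, y, alpha=0.2)
selected = [name for name, b in zip(names, beta) if b != 0.0]
print("Selected variables:", selected)
```

Variables whose coefficients are shrunk exactly to zero drop out of `selected`, which is the property that makes LASSO usable as a screening step.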


In [None]:
## rural_urban_type
# Continue with CBSA type and other variables
AHA_master2['rural_urban_type'] = AHA_master2['cbsatype_as'].map({
    'Rural': 1,      # Rural = 1 (lowest)
    'Micro': 2,      # Micropolitan = 2 (middle)
    'Metro': 3       # Metropolitan = 3 (highest)
})

## system_member
# Create new column 'system_member' based on the conditions
AHA_master2['system_member'] = AHA_master2['mhsmemb_as'].copy()
# Set to 1 where sysid_as is not null and mhsmemb_as is null
AHA_master2.loc[(AHA_master2['sysid_as'].notna()) & (AHA_master2['mhsmemb_as'].isna()), 'system_member'] = 1
# Convert all remaining null values to 0
AHA_master2['system_member'] = AHA_master2['system_member'].fillna(0)

## AHA System Cluster Code - delivery_system
AHA_master2['delivery_system'] = AHA_master2['cluster_as']

## community_hospital
AHA_master2['community_hospital'] = AHA_master2['chc_as'].replace(2, 0)

## subsidary_hospital
AHA_master2['subsidary_hospital'] = AHA_master2['subs_as']

## frontline_hospital
AHA_master2['frontline_hospital'] = AHA_master2['frtln_as'].replace('.', 0)

## joint_commission_accreditation
AHA_master2['joint_commission_accreditation'] = AHA_master2['mapp1_as'].replace(2,0)

## center_quality
AHA_master2['center_quality'] = AHA_master2['mapp22_as'].replace(2,0)

# teaching hospitals 
AHA_master2['teaching_hospital'] = ((AHA_master2['mapp5_as'] == 1) | (AHA_master2['mapp3_as'] == 1) | (AHA_master2['mapp8_as'] == 1)).astype(int)
AHA_master2['major_teaching_hospital'] = ((AHA_master2['mapp8_as'] == 1)).astype(int)
AHA_master2['minor_teaching_hospital'] = (((AHA_master2['mapp5_as'] == 1) | (AHA_master2['mapp3_as'] == 1))&~(AHA_master2['mapp8_as'] == 1)).astype(int)

# critical access hospital
AHA_master2['critical_access'] = (AHA_master2['mapp18_as'] == 1).astype(int)


# rural referral center 
AHA_master2['rural_referral'] = (AHA_master2['mapp19_as'] == 1).astype(int)

# medicare medicaid percentage
AHA_master2['medicare_ipd_percentage'] = AHA_master2['mcripd_as'] / AHA_master2['ipdtot_as'] * 100
AHA_master2['medicaid_ipd_percentage'] = AHA_master2['mcdipd_as'] / AHA_master2['ipdtot_as'] * 100

# bed size 
AHA_master2['bedsize'] = AHA_master2['bsc_as'].astype(int)


In [None]:
# hospital ownership type 

# Ownership indicator columns derived from the AHA control code ('CNTRL').
# All comparisons use AHA_master2 so the indicators align with the rows being modified.
AHA_master2['nonfederal_governement'] = AHA_master2['cntrl_as'].isin([12, 13, 14, 15, 16]).astype(int)
AHA_master2['non_profit_nongovernment'] = AHA_master2['cntrl_as'].isin([21, 23]).astype(int)
AHA_master2['for_profit'] = AHA_master2['cntrl_as'].isin([31, 32, 33]).astype(int)
AHA_master2['federal_governement'] = AHA_master2['cntrl_as'].isin([40, 44, 45, 46, 47, 48]).astype(int)
# Create a categorical column for hospital ownership types
def create_ownership_category(row):
    if row['cntrl_as'] in [12, 13, 14, 15, 16]:
        return 'nonfederal_government'
    elif row['cntrl_as'] in [21, 23]:
        return 'non_profit_nongovernment'
    elif row['cntrl_as'] in [31, 32, 33]:
        return 'for_profit'
    elif row['cntrl_as'] in [40, 44, 45, 46, 47, 48]:
        return 'federal_government'
    else:
        return 'other'

# Create the categorical column
AHA_master2['ownership_type'] = AHA_master2.apply(create_ownership_category, axis=1)


In [None]:
valid_hospitals = AHA_master2.dropna(subset=['long_as', 'lat_as', 'ai_base_score'])
valid_geo_hospitals = AHA_master2.dropna(subset=['long_as', 'lat_as'])
# Create a GeoDataFrame
hospitals_gdf = gpd.GeoDataFrame(
    valid_hospitals, 
    geometry=gpd.points_from_xy(valid_hospitals.long_as, valid_hospitals.lat_as),
    crs="EPSG:4326" #geographic coordinate system using latitude and longitude
)

In [None]:
# First, let's check what we have in valid_hospitals
print("\nChecking valid_hospitals:")
print(f"valid_hospitals exists: {'valid_hospitals' in locals()}")
if 'valid_hospitals' in locals():
    print(f"valid_hospitals shape: {valid_hospitals.shape}")
    print("\nFirst few rows of valid_hospitals:")
    print(valid_hospitals.head())
    print("\nColumns in valid_hospitals:")
    print(valid_hospitals.columns.tolist())
    print("\nData types of columns:")
    print(valid_hospitals.dtypes)
    print("\nMissing values in each column:")
    print(valid_hospitals.isnull().sum())


### 3 Run GWR 
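
For reference, the estimator implemented by `simple_gwr_robust` in this section can be written as follows. The notation is introduced here for illustration: $d_{ij}$ is the Euclidean distance between hospitals $i$ and $j$ in coordinate units, $h$ is the bandwidth, and $p$ is the number of columns of $X$ including the intercept. Observations with $w_{ij} < 0.001$ are dropped from the local fit.

$$
w_{ij} = \exp\!\left(-\tfrac{1}{2}\left(\frac{d_{ij}}{h}\right)^{2}\right), \qquad W_i = \operatorname{diag}(w_{i1}, \dots, w_{in})
$$

$$
\hat{\beta}_i = \left(X^{\top} W_i X + \lambda_i I_p\right)^{-1} X^{\top} W_i\, y, \qquad \lambda_i = 10^{-6}\,\frac{\operatorname{tr}\!\left(X^{\top} W_i X\right)}{p}
$$

Each hospital $i$ therefore gets its own coefficient vector $\hat{\beta}_i$, estimated mostly from nearby hospitals; the small ridge term $\lambda_i$ is there for numerical stability only.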

In [None]:

# 1. Prepare coordinates and target variable
coords = np.array(valid_hospitals[['long_as', 'lat_as']])
y = valid_hospitals['ai_base_score'].values
    
# 2. Prepare feature matrix
print("Preparing feature matrix...")
X_data = pd.DataFrame(index=valid_hospitals.index)
    
# Hospital characteristics missing value imputation 
X_data['subsidary_hospital'] = valid_hospitals['subsidary_hospital'].fillna(0)
X_data['frontline_hospital'] = valid_hospitals['frontline_hospital'].fillna(0)
X_data['joint_commission_accreditation'] = valid_hospitals['joint_commission_accreditation'].astype(float).fillna(0)
X_data['delivery_system'] = valid_hospitals['delivery_system'].astype(float).fillna(0)
X_data['teaching_hospital'] = valid_hospitals['teaching_hospital'].astype(float)
X_data['critical_access'] = valid_hospitals['critical_access'].astype(float)
X_data['rural_referral'] = valid_hospitals['rural_referral'].astype(float)
X_data['for_profit'] = (valid_hospitals['ownership_type'] == 'for_profit').astype(float)
X_data['bedsize'] = valid_hospitals['bedsize'].astype(float).fillna(valid_hospitals['bedsize'].median())
X_data['medicare_ipd_percentage'] = valid_hospitals['medicare_ipd_percentage'].astype(float).fillna(valid_hospitals['medicare_ipd_percentage'].median())
X_data['medicaid_ipd_percentage'] = valid_hospitals['medicaid_ipd_percentage'].astype(float).fillna(valid_hospitals['medicaid_ipd_percentage'].median())
X_data['core_index'] = valid_hospitals['core_index'].astype(float).fillna(valid_hospitals['core_index'].median())
X_data['friction_index'] = valid_hospitals['friction_index'].astype(float).fillna(valid_hospitals['friction_index'].median())
    
# Continuous/ordinal variables to standardize inside run_gwr (binary indicators are left as-is)
continuous_vars = ['delivery_system', 'bedsize', 'medicare_ipd_percentage', 'medicaid_ipd_percentage', 
                      'core_index', 'friction_index']
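
The coordinates prepared above are longitude/latitude in degrees, and the GWR function in this section takes plain Euclidean distances on them. The bandwidth is therefore expressed in degrees, and one degree of longitude covers less ground at higher latitudes. If a bandwidth in kilometers is preferred, one option is a great-circle distance. The sketch below is a hypothetical helper (`haversine_km` and the demo points are assumptions for illustration, not part of the original analysis).

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(coords, origin):
    """Great-circle distance in km from `origin` to each row of `coords`.

    Both are given as (longitude, latitude) in degrees, matching the
    column order of the `coords` array used for the GWR.
    """
    lon1, lat1 = np.radians(origin)
    lon2 = np.radians(coords[:, 0])
    lat2 = np.radians(coords[:, 1])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Demo points (assumption): one degree north vs. one degree east of the origin
demo_coords = np.array([[-100.0, 30.0], [-100.0, 31.0], [-99.0, 30.0]])
dist_km = haversine_km(demo_coords, demo_coords[0])
print(np.round(dist_km, 1))
```

In the demo, the point one degree east is closer in kilometers than the point one degree north, which is the distortion a degree-based bandwidth ignores.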

In [56]:
def simple_gwr_robust(coords, y, X, bandwidth):
    """Simple GWR with numerical stability"""
    n = len(coords)
    p = X.shape[1]
    results = np.zeros((n, p))
    
    for i in range(n):
        # Euclidean distances in coordinate units (degrees for lon/lat input)
        dists = np.sqrt(np.sum((coords - coords[i])**2, axis=1))
        
        # Calculate Gaussian weights
        weights = np.exp(-0.5 * (dists / bandwidth)**2)
        
        # Filter observations with meaningful weights
        threshold = 0.001
        mask = weights >= threshold
        
        if np.sum(mask) <= p:  # Need p+1 observations minimum
            results[i] = np.nan
            continue
        
        # Subset data
        X_local = X[mask]
        y_local = y[mask]
        w_local = weights[mask]
        
        # Weighted least squares: scale rows by sqrt(weight) so that
        # XW.T @ XW equals X' W X (avoids building a dense diagonal matrix)
        sqrt_w = np.sqrt(w_local)
        XW = X_local * sqrt_w[:, None]
        yW = y_local * sqrt_w
        
        # Normal equations
        XtWX = XW.T @ XW
        XtWy = XW.T @ yW
        
        # Always add a small ridge term (scaled to the matrix) for numerical stability
        reg_param = 1e-6 * np.trace(XtWX) / p
        XtWX_reg = XtWX + reg_param * np.eye(p)
        
        try:
            beta = np.linalg.solve(XtWX_reg, XtWy)
            results[i] = beta
        except np.linalg.LinAlgError:
            results[i] = np.nan
    
    return results

def run_gwr(X_data, y, coords, continuous_vars, bandwidth):
    """Complete GWR workflow with data preparation"""
    
    # Prepare data
    X_processed = X_data.copy()
    
    # Standardize continuous variables
    if continuous_vars:
        scaler = StandardScaler()
        X_processed[continuous_vars] = scaler.fit_transform(X_data[continuous_vars])
    else:
        scaler = None
    
    # Add intercept
    X_processed['const'] = 1.0
    
    # Convert to array
    X_array = X_processed.values.astype(float)
    
    # Run GWR
    print(f"Running GWR with bandwidth: {bandwidth:.3f}")
    final_params = simple_gwr_robust(coords, y, X_array, bandwidth)
    
    # Create results DataFrame
    results_df = pd.DataFrame()
    
    # Add coefficients (exclude intercept from main results)
    for i, col in enumerate(X_processed.columns):
        if col != 'const':
            results_df[f'{col}_coeff'] = final_params[:, i]
    
    # Add intercept separately
    const_idx = list(X_processed.columns).index('const')
    results_df['intercept'] = final_params[:, const_idx]
    
    # Add coordinates
    results_df['longitude'] = coords[:, 0]
    results_df['latitude'] = coords[:, 1]
    
    # Summary statistics
    total_valid = np.sum(np.all(np.isfinite(final_params), axis=1))
    total_obs = len(coords)
    
    print(f"Results Summary:")
    print(f"  Valid observations: {total_valid}/{total_obs} ({100*total_valid/total_obs:.1f}%)")
    
    # Variable summaries
    for col in X_data.columns:
        coeff_col = f'{col}_coeff'
        if coeff_col in results_df.columns:
            coeffs = results_df[coeff_col].dropna()
            if len(coeffs) > 0:
                print(f"  {col}: Range=[{coeffs.min():.4f}, {coeffs.max():.4f}], SD={coeffs.std():.4f}")
    
    return results_df, scaler
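
A quick way to sanity-check a GWR routine like the one above is to compare it with ordinary least squares: with a very large bandwidth all weights approach 1, so the local coefficients should collapse to the global OLS fit, while a small bandwidth should recover spatial variation. The sketch below uses a simplified stand-in, `local_wls` (a hypothetical helper without the ridge term or weight threshold), on synthetic data; none of it comes from the AHA dataset.

```python
import numpy as np

def local_wls(coords, y, X, bandwidth):
    """Gaussian-kernel weighted least squares fitted at every observation."""
    betas = np.empty((len(coords), X.shape[1]))
    for i in range(len(coords)):
        d = np.sqrt(((coords - coords[i]) ** 2).sum(axis=1))
        sw = np.sqrt(np.exp(-0.5 * (d / bandwidth) ** 2))
        # Row-scaling by sqrt(weight) turns WLS into an ordinary lstsq problem
        betas[i] = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return betas

# Synthetic data (assumption): the slope on x increases from west to east
rng = np.random.default_rng(1)
n = 150
coords = rng.uniform(0, 10, size=(n, 2))
x = rng.normal(size=n)
X = np.column_stack([x, np.ones(n)])
y = (1.0 + 0.3 * coords[:, 0]) * x + 2.0 + rng.normal(scale=0.1, size=n)

ols = np.linalg.lstsq(X, y, rcond=None)[0]
wide = local_wls(coords, y, X, bandwidth=1e6)    # effectively global
narrow = local_wls(coords, y, X, bandwidth=1.5)  # genuinely local

print("max |wide - OLS|:", np.abs(wide - ols).max())
print("local slope range:", narrow[:, 0].min(), narrow[:, 0].max())
```

If the wide-bandwidth coefficients do not match OLS, the weighting or the normal equations are likely wrong.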


In [None]:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import TwoSlopeNorm
import matplotlib.patches as mpatches

def plot_gwr_results(var_coeffs, states_gdf, figsize=(20, 15)):
    """
    Plot GWR spatial coefficients on continental US map
    
    Parameters:
    -----------
    var_coeffs : DataFrame
        GWR results with coefficient columns + longitude/latitude
    states_gdf : GeoDataFrame  
        US state boundaries
    figsize : tuple
        Figure size
    """
    
    # Filter for continental US
    exclude_states = ['AK', 'HI', 'PR', 'VI', 'GU', 'AS', 'MP']
    
    # Find state column
    state_col = None
    for col in ['STUSPS', 'STATE_ABBR', 'NAME', 'STATE']:
        if col in states_gdf.columns:
            state_col = col
            break
    
    if state_col:
        # Exact match on the abbreviation avoids accidental substring hits
        continental_states = states_gdf[~states_gdf[state_col].isin(exclude_states)]
    else:
        continental_states = states_gdf.cx[-130:-65, 20:50]
    
    # Get variable columns (exclude coordinates)
    variables = [col for col in var_coeffs.columns if col not in ['longitude', 'latitude']]
    plot_data = var_coeffs.dropna()
    
    # Set up subplot grid
    n_vars = len(variables)
    n_cols = 2 if n_vars > 1 else 1
    n_rows = (n_vars + n_cols - 1) // n_cols
    
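    fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize)
    if n_vars == 1:
        axes = [axes]  # single Axes: wrap it so axes[0] works below
    # With one row and two columns plt.subplots already returns a 1-D array,
    # which is what the axes[i % n_cols] indexing below expects.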
    
    fig.suptitle('GWR Spatial Coefficients', fontsize=20, fontweight='bold')
    
    for i, var in enumerate(variables):
        if n_rows > 1:
            ax = axes[i // n_cols, i % n_cols]
        else:
            ax = axes[i % n_cols] if n_cols > 1 else axes[0]
        
        # Get coefficient data
        coeffs = plot_data[var].values
        x = plot_data['longitude'].values
        y = plot_data['latitude'].values
        
        # Plot state boundaries
        continental_states.boundary.plot(ax=ax, color='black', linewidth=0.5, alpha=0.7)
        
        # Set map extent
        ax.set_xlim(-130, -65)
        ax.set_ylim(20, 50)
        
        # Color normalization (centered at 0)
        vmax = np.abs(coeffs).max() if np.abs(coeffs).max() > 0 else 1
        norm = TwoSlopeNorm(vmin=-vmax, vcenter=0, vmax=vmax)
        
        # Scatter plot
        scatter = ax.scatter(x, y, c=coeffs, cmap='RdBu', norm=norm, 
                           s=60, alpha=0.8, edgecolors='black', linewidth=0.5)
        
        # Formatting
        ax.set_title(f'{var}', fontweight='bold', fontsize=14)
        ax.set_xlabel('Longitude', fontsize=12)
        ax.set_ylabel('Latitude', fontsize=12)
        ax.grid(True, alpha=0.3)
        
        # Colorbar
        cbar = plt.colorbar(scatter, ax=ax, shrink=0.8)
        cbar.set_label('Coefficient', fontsize=10)
        
        # Simple legend for positive/negative effects
        if coeffs.max() > 0 and coeffs.min() < 0:
            legend_elements = [
                mpatches.Patch(color='darkblue', label='Positive'),
                mpatches.Patch(color='darkred', label='Negative')
            ]
            ax.legend(handles=legend_elements, loc='lower right', fontsize=9)
    
    # Hide empty subplots
    total_subplots = n_rows * n_cols
    if n_vars < total_subplots:
        for i in range(n_vars, total_subplots):
            if n_rows > 1:
                ax = axes[i // n_cols, i % n_cols]
            else:
                ax = axes[i % n_cols] if n_cols > 1 else axes[0]
            ax.set_visible(False)
    
    plt.tight_layout()
    return fig

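# Simple usage (run after `var_coeffs` has been produced by run_gwr;
# `states` is the boundary GeoDataFrame loaded in section 1):
fig = plot_gwr_results(var_coeffs, states)
plt.show()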

####  3.2 Conduct GWR for geospatial/community characteristics 

In [None]:

# 1. Prepare coordinates and target variable
coords = np.array(valid_hospitals[['long_as', 'lat_as']])
y2 = valid_hospitals['ai_base_score'].values
    
# 2. Prepare feature matrix 
print("Preparing feature matrix...")
X_data2 = pd.DataFrame(index=valid_hospitals.index)
    
# Community characteristics missing value imputation 
X_data2['svi_themes_median'] = valid_hospitals['svi_themes_median'].fillna(valid_hospitals['svi_themes_median'].median())
X_data2['svi_theme1_median'] = valid_hospitals['svi_theme1_median'].fillna(valid_hospitals['svi_theme1_median'].median())
X_data2['svi_theme2_median'] = valid_hospitals['svi_theme2_median'].fillna(valid_hospitals['svi_theme2_median'].median())    
X_data2['svi_theme3_median'] = valid_hospitals['svi_theme3_median'].fillna(valid_hospitals['svi_theme3_median'].median())
X_data2['svi_theme4_median'] = valid_hospitals['svi_theme4_median'].fillna(valid_hospitals['svi_theme4_median'].median())
X_data2['national_adi_median'] = valid_hospitals['national_adi_median'].fillna(valid_hospitals['national_adi_median'].median())
X_data2['mean_primary_hpss'] = valid_hospitals['mean_primary_hpss'].fillna(0)
X_data2['mean_dental_hpss'] = valid_hospitals['mean_dental_hpss'].fillna(0)
X_data2['mean_mental_hpss'] = valid_hospitals['mean_mental_hpss'].fillna(0)
X_data2['mean_mua_score'] = valid_hospitals['mean_mua_score'].fillna(0)
X_data2['mean_mua_elder_score'] = valid_hospitals['mean_mua_elder_score'].fillna(0)
X_data2['mean_mua_infant_score'] = valid_hospitals['mean_mua_infant_score'].fillna(0)
X_data2['rural_urban_type'] = valid_hospitals['rural_urban_type'].fillna(valid_hospitals['rural_urban_type'].median())
X_data2['Device_Percent'] = valid_hospitals['Device_Percent'].fillna(valid_hospitals['Device_Percent'].median())
X_data2['Broadband_Percent'] = valid_hospitals['Broadband_Percent'].fillna(valid_hospitals['Broadband_Percent'].median())
X_data2['Internet_Percent'] = valid_hospitals['Internet_Percent'].fillna(valid_hospitals['Internet_Percent'].median())
    
# Variables to standardize inside run_gwr (duplicate entries removed)
continuous_vars = [
    'svi_themes_median', 'svi_theme1_median', 'svi_theme2_median', 'svi_theme3_median',
    'svi_theme4_median', 'national_adi_median', 'mean_primary_hpss', 'mean_dental_hpss',
    'mean_mental_hpss', 'mean_mua_score', 'mean_mua_elder_score', 'mean_mua_infant_score',
    'rural_urban_type', 'Device_Percent', 'Broadband_Percent', 'Internet_Percent',
]

In [None]:

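# run_gwr expects (X_data, y, coords, continuous_vars, bandwidth) and returns (results_df, scaler)
var_coeffs, scaler = run_gwr(X_data2, y2, coords, continuous_vars, bandwidth=2)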

# Then plot results
fig = plot_gwr_results(var_coeffs, states, figsize=(25, 30))
plt.show()
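
The GWR above is run with a fixed bandwidth of 2 (in coordinate units). One common way to choose the bandwidth from the data is leave-one-out cross-validation: for each candidate, every observation is predicted from a local fit that excludes it, and the bandwidth with the smallest squared prediction error is kept. The sketch below illustrates this on synthetic data. The helper `loo_cv_score`, the candidate grid, and the ridge value are assumptions for illustration and were not part of the original analysis.

```python
import numpy as np

def loo_cv_score(coords, y, X, bandwidth, ridge=1e-8):
    """Sum of squared leave-one-out prediction errors for a Gaussian-kernel GWR."""
    n, p = X.shape
    sse = 0.0
    for i in range(n):
        d = np.sqrt(((coords - coords[i]) ** 2).sum(axis=1))
        w = np.exp(-0.5 * (d / bandwidth) ** 2)
        w[i] = 0.0                      # leave observation i out of its own fit
        XtW = X.T * w                   # X' W without building diag(w)
        beta = np.linalg.solve(XtW @ X + ridge * np.eye(p), XtW @ y)
        sse += (y[i] - X[i] @ beta) ** 2
    return sse

# Synthetic data (assumption): the slope on x varies smoothly with longitude
rng = np.random.default_rng(2)
n = 200
coords = rng.uniform(0, 10, size=(n, 2))
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = (1.0 + 0.5 * coords[:, 0]) * x + rng.normal(scale=0.1, size=n)

candidates = [1.0, 2.0, 5.0, 50.0]
scores = {h: loo_cv_score(coords, y, X, h) for h in candidates}
best_bandwidth = min(scores, key=scores.get)

for h in candidates:
    print(f"bandwidth={h:>5}: CV score={scores[h]:.2f}")
print("Selected bandwidth:", best_bandwidth)
```

Because the synthetic relationship varies over space, the near-global bandwidth (50) scores worse than the local ones. On real data the grid would need to reflect the coordinate units in use.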