## D1. Feature importance analysis 

**Description**  
This section fits a random forest model to rank hospital and area-level features by how strongly they predict the AI/ML implementation level.

**Purpose**  
To identify the features that contribute most to predicting the AI/ML implementation level.
 


### 1 Load necessary libraries, functions, and pre-processed data 

In [27]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Machine learning and model evaluation
from sklearn.model_selection import cross_val_score, train_test_split, KFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import (
    accuracy_score, 
    classification_report,
    r2_score, 
    mean_squared_error, 
    mean_absolute_error,
    mean_absolute_percentage_error
)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import shap

In [None]:
# load preprocessed dataframe 
AHA_master = pd.read_csv('./data/AHA_master_external_data.csv', low_memory=False)
AHA_IT = AHA_master[~AHA_master['id_it'].isnull()]
AHA_IT.shape

In [None]:

# Import the scoring module and compute AI scores for each hospital
import calculate_ai_scores
AHA_master2 = calculate_ai_scores.apply_ai_scores_to_dataframe(AHA_IT)

### 2 Data engineering 

The hospital characteristics below were selected by investigator consensus. We also used LASSO regression to explore additional variables that predict AI/ML implementation and reflect hospital resource levels.

- **rural_urban_type** : collected from the AHA survey; categorized as {1: rural, 2: micro, 3: metro} based on the hospital's location ('CBSATYPE')
- **system_member** : hospital belongs to a corporate body that owns or manages health provider facilities or health-related subsidiaries ('MHSMEMB')
- **delivery_system** : delivery system identified using existing theory and AHA Annual Survey data {1: Centralized Health System, 2: Centralized Physician/Insurance Health System, 3: Moderately Centralized Health System, 4: Decentralized Health System, 5: Independent Hospital System, 6/Missing: Insufficient data to determine} ('CLUSTER')
- **community_hospital** : all nonfederal, short-term general, and special hospitals whose facilities and services are available to the public {0: No, 1: Yes} ('CHC')
- **subsidary_hospital** : hospital itself operates a subsidiary corporation {0: No, 1: Yes} ('SUBS')
- **frontline_hospital** : frontline facility {0: No, 1: Yes} ('FRTLN')
- **joint_commission_accreditation** : accreditation by the Joint Commission {0: No, 1: Yes} ('MAPP1')
- **center_quality** : Center for Improvement in Healthcare Quality accreditation {0: No, 1: Yes} ('MAPP22')
- **teaching_hospital** : major teaching hospital ('MAPP8') or minor teaching hospital ('MAPP3' or 'MAPP5')
- **critical_access** : critical access hospital {0: No, 1: Yes} ('MAPP18')
- **rural_referral** : rural referral center {0: No, 1: Yes} ('MAPP19')
- **ownership_type** : type of organization responsible for establishing policy concerning overall operation {nonfederal_government, non_profit_nongovernment, for_profit, federal_government, other} ('CNTRL')
- **bedsize** : bed-size category, ordinal variable ('BSC')
- **medicare_ipd_percentage** : Medicare inpatient days / total inpatient days, expressed as a percentage. Proxy for the proportion of Medicare patients
- **medicaid_ipd_percentage** : Medicaid inpatient days / total inpatient days, expressed as a percentage. Proxy for the proportion of Medicaid patients
- **core_index** : summary measure to track the interoperability of US hospitals (https://doi.org/10.1093/jamia/ocae289)
- **friction_index** : summary measure to track barriers or difficulty in interoperability between hospitals (https://doi.org/10.1093/jamia/ocae289)
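
The LASSO screening mentioned above is not shown in this notebook. As a rough sketch of how such a screen could look, the block below fits `LassoCV` on synthetic data and lists the candidates whose coefficients are not shrunk to zero. The data, the `candidate_*` column names, and the coefficient threshold are illustrative assumptions, not the study's variables or settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: hypothetical column names, not the AHA variables.
rng = np.random.default_rng(42)
n = 400
X_demo = pd.DataFrame(
    rng.normal(size=(n, 6)),
    columns=[f"candidate_{i}" for i in range(6)],
)
# Only candidate_0 and candidate_1 drive the toy outcome.
y_demo = (
    3.0 * X_demo["candidate_0"]
    - 2.0 * X_demo["candidate_1"]
    + rng.normal(scale=0.5, size=n)
)

# Standardize first so the L1 penalty treats all candidates on the same scale.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=42))
lasso.fit(X_demo, y_demo)

# Candidates with a nonzero coefficient survive the screen.
coefs = pd.Series(lasso.named_steps["lassocv"].coef_, index=X_demo.columns)
selected = coefs[coefs.abs() > 1e-8].index.tolist()
print(coefs.round(3))
print("Retained candidates:", selected)
```

With real data, the retained set depends on the penalty chosen by cross-validation, so it is worth checking how stable the selection is across random seeds or resamples.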


In [None]:
## rural_urban_type
# Ordinal encoding of the CBSA type
AHA_master2['rural_urban_type'] = AHA_master2['cbsatype_as'].map({
    'Rural': 1,      # Rural = 1 (lowest)
    'Micro': 2,      # Micropolitan = 2 (middle)
    'Metro': 3       # Metropolitan = 3 (highest)
})

## system_member
# Create new column 'system_member' based on the conditions
AHA_master2['system_member'] = AHA_master2['mhsmemb_as'].copy()
# Set to 1 where sysid_as is not null and mhsmemb_as is null
AHA_master2.loc[(AHA_master2['sysid_as'].notna()) & (AHA_master2['mhsmemb_as'].isna()), 'system_member'] = 1
# Convert all remaining null values to 0
AHA_master2['system_member'] = AHA_master2['system_member'].fillna(0)

## AHA System Cluster Code - delivery_system
AHA_master2['delivery_system'] = AHA_master2['cluster_as']

## community_hospital
AHA_master2['community_hospital'] = AHA_master2['chc_as'].replace(2, 0)

## subsidary_hospital
AHA_master2['subsidary_hospital'] = AHA_master2['subs_as']

## frontline_hospital
AHA_master2['frontline_hospital'] = AHA_master2['frtln_as'].replace('.', 0)

## joint_commission_accreditation
AHA_master2['joint_commission_accreditation'] = AHA_master2['mapp1_as'].replace(2,0)

## center_quality
AHA_master2['center_quality'] = AHA_master2['mapp22_as'].replace(2,0)

# teaching hospitals 
AHA_master2['teaching_hospital'] = ((AHA_master2['mapp5_as'] == 1) | (AHA_master2['mapp3_as'] == 1) | (AHA_master2['mapp8_as'] == 1)).astype(int)
AHA_master2['major_teaching_hospital'] = ((AHA_master2['mapp8_as'] == 1)).astype(int)
AHA_master2['minor_teaching_hospital'] = (((AHA_master2['mapp5_as'] == 1) | (AHA_master2['mapp3_as'] == 1))&~(AHA_master2['mapp8_as'] == 1)).astype(int)

# critical access hospital
AHA_master2['critical_access'] = (AHA_master2['mapp18_as'] == 1).astype(int)


# rural referral center 
AHA_master2['rural_referral'] = (AHA_master2['mapp19_as'] == 1).astype(int)

# medicare medicaid percentage
AHA_master2['medicare_ipd_percentage'] = AHA_master2['mcripd_as'] / AHA_master2['ipdtot_as'] * 100
AHA_master2['medicaid_ipd_percentage'] = AHA_master2['mcdipd_as'] / AHA_master2['ipdtot_as'] * 100

# bed size 
AHA_master2['bedsize'] = AHA_master2['bsc_as'].astype(int)

# hospital ownership type 

# Indicator columns by CNTRL code group (all comparisons use AHA_master2, not AHA_master)
AHA_master2['nonfederal_governement'] = AHA_master2['cntrl_as'].isin([12, 13, 14, 15, 16]).astype(int)
AHA_master2['non_profit_nongovernment'] = AHA_master2['cntrl_as'].isin([21, 23]).astype(int)
AHA_master2['for_profit'] = AHA_master2['cntrl_as'].isin([31, 32, 33]).astype(int)
AHA_master2['federal_government'] = AHA_master2['cntrl_as'].isin([40, 44, 45, 46, 47, 48]).astype(int)
# Create a categorical column for hospital ownership types
def create_ownership_category(row):
    if row['cntrl_as'] in [12, 13, 14, 15, 16]:
        return 'nonfederal_government'
    elif row['cntrl_as'] in [21, 23]:
        return 'non_profit_nongovernment'
    elif row['cntrl_as'] in [31, 32, 33]:
        return 'for_profit'
    elif row['cntrl_as'] in [40, 44, 45, 46, 47, 48]:
        return 'federal_government'
    else:
        return 'other'

# Create the categorical column
AHA_master2['ownership_type'] = AHA_master2.apply(create_ownership_category, axis=1)
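
The CNTRL grouping above can be sanity-checked on a toy frame. The sketch below is one possible vectorized formulation, assuming the same code groups as the cell above; the `OWNERSHIP_CODES` dictionary and the toy rows are hypothetical and exist only for illustration. It derives both the categorical column and the indicator columns from a single lookup, so the two cannot drift apart.

```python
import pandas as pd

# CNTRL code groups as used in the cell above.
OWNERSHIP_CODES = {
    "nonfederal_government": [12, 13, 14, 15, 16],
    "non_profit_nongovernment": [21, 23],
    "for_profit": [31, 32, 33],
    "federal_government": [40, 44, 45, 46, 47, 48],
}
# Invert to a flat code -> category lookup.
code_to_category = {
    code: category
    for category, codes in OWNERSHIP_CODES.items()
    for code in codes
}

# Toy frame with made-up rows; code 99 is not in any group and maps to 'other'.
toy = pd.DataFrame({"cntrl_as": [12, 23, 33, 47, 99]})
toy["ownership_type"] = toy["cntrl_as"].map(code_to_category).fillna("other")

# One indicator column per category, derived from the same lookup.
for category, codes in OWNERSHIP_CODES.items():
    toy[category] = toy["cntrl_as"].isin(codes).astype(int)

print(toy)
```

Each row should have at most one indicator set to 1, and rows labeled 'other' should have none.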



In [33]:
hospital_features = ["teaching_hospital",
"nonfederal_governement",
"non_profit_nongovernment",
"for_profit",
"federal_government",
"critical_access",
"rural_referral",
"medicare_ipd_percentage",
"medicaid_ipd_percentage",
"bedsize",
"delivery_system",
"community_hospital",
"subsidary_hospital",
"frontline_hospital",
"joint_commission_accreditation",
"center_quality",
"system_member"]
coordinates_features = ["latitude_address",
"longitude_address"]
geo_features = ["rural_urban_type",
"national_adi_median",
"svi_themes_median",
"svi_theme1_median",
"svi_theme2_median",
"svi_theme3_median",
"svi_theme4_median",
"Device_Percent",
"Broadband_Percent",
"Internet_Percent",
"mean_primary_hpss",
"mean_dental_hpss",
"mean_mental_hpss",
"mean_mua_score",
"mean_mua_elders_score",
"mean_mua_infant_score"]
interoperability_features = ['core_index', "friction_index"]


In [34]:
feature_columns = hospital_features + geo_features + interoperability_features 

In [None]:
AHA_master2["ai_base_score"] = AHA_master2["ai_base_score"].astype(float).fillna(0)
y = AHA_master2['ai_base_score'].values  # Target: AI base score (missing values filled with 0 above)
X = AHA_master2[feature_columns]

In [None]:
# Initialize cross-validation
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

# Lists to store metrics for each fold
r2_scores = []
rmse_scores = []
mae_scores = []
mape_scores = []
feature_importances = []

print("Starting cross-validation with Random Forest feature importance...")

# Perform cross-validation with proper scaling
for fold, (train_idx, val_idx) in enumerate(kf.split(X), 1):
    print(f"Processing fold {fold}/{n_splits}...")
    
    # Split data
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    
    # Fit the scaler on the training fold only to avoid leaking validation data
    scaler = StandardScaler()
    X_train_scaled = pd.DataFrame(
        scaler.fit_transform(X_train), 
        columns=X.columns,
        index=X_train.index
    )
    X_val_scaled = pd.DataFrame(
        scaler.transform(X_val), 
        columns=X.columns,
        index=X_val.index
    )
    
    # Train model
    rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
    rf_model.fit(X_train_scaled, y_train)
    

    feature_importances.append(rf_model.feature_importances_)
    
    # Make predictions
    y_pred = rf_model.predict(X_val_scaled)
    
    # Calculate metrics
    r2_scores.append(r2_score(y_val, y_pred))
    rmse_scores.append(np.sqrt(mean_squared_error(y_val, y_pred)))
    mae_scores.append(mean_absolute_error(y_val, y_pred))
    # MAPE is unstable when y_val contains zeros (missing scores were filled with 0 earlier)
    mape_scores.append(mean_absolute_percentage_error(y_val, y_pred))



# Calculate average feature importance across folds
avg_feature_importance = np.mean(feature_importances, axis=0)
std_feature_importance = np.std(feature_importances, axis=0)

# Create feature importance DataFrame
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': avg_feature_importance,
    'std_importance': std_feature_importance,
    'cv_stability': 1 - (std_feature_importance / (avg_feature_importance + 1e-10))  # Stability metric
})
feature_importance = feature_importance.sort_values('importance', ascending=False)

# Print cross-validation results
print("\n" + "="*60)
print("MODEL PERFORMANCE METRICS")
print("="*60)
print(f"R² Score: {np.mean(r2_scores):.4f} (±{np.std(r2_scores):.4f})")
print(f"Root Mean Squared Error: {np.mean(rmse_scores):.4f} (±{np.std(rmse_scores):.4f})")
print(f"Mean Absolute Error: {np.mean(mae_scores):.4f} (±{np.std(mae_scores):.4f})")
print(f"Mean Absolute Percentage Error: {np.mean(mape_scores)*100:.2f}% (±{np.std(mape_scores)*100:.2f}%)")

# Print top 10 most important features with their standard deviations
print(f"\n" + "="*60)
print("TOP 10 MOST IMPORTANT FEATURES")
print("="*60)
print("Note: Using Random Forest built-in feature importance (impurity-based, mean decrease in impurity)")
top_10 = feature_importance.head(10)[['feature', 'importance', 'std_importance', 'cv_stability']]
print(top_10.to_string(index=False, float_format='%.4f'))

# Create the figure with specific size and DPI
fig, ax = plt.subplots(figsize=(10, 8), dpi=300)

# Create the horizontal bar plot
feature_importance_plot = feature_importance.head(10)
bars = ax.barh(feature_importance_plot['feature'], 
               feature_importance_plot['importance'],
               xerr=feature_importance_plot['std_importance'],
               capsize=5,
               color='#2E86C1',  # Professional blue color
               alpha=0.8,
               edgecolor='black',
               linewidth=0.5)

# Customize the plot
ax.set_xlabel('Feature Importance (Mean Decrease in Impurity)', fontweight='bold', labelpad=10)
ax.set_title('Top 10 Most Important Features', fontweight='bold', pad=20)
ax.invert_yaxis()  # Show the most important feature at the top


# Adjust layout
plt.tight_layout()


# Show the plot
plt.show()
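
Impurity-based importances from a random forest are computed on the training data and can favor continuous or high-cardinality features. One common cross-check is permutation importance on held-out data, using `sklearn.inspection.permutation_importance` (already imported at the top of this notebook). The following is a minimal sketch on synthetic data; the column names and settings are illustrative assumptions and do not reproduce the analysis above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: one informative feature and two pure-noise features.
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_continuous": rng.normal(size=n),
    "noise_binary": rng.integers(0, 2, size=n),
})
y_demo = 2.0 * X_demo["signal"] + rng.normal(scale=0.5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Permutation importance: drop in held-out R^2 when one column is shuffled.
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

comparison = pd.DataFrame(
    {
        "impurity_importance": model.feature_importances_,
        "permutation_importance": perm.importances_mean,
        "permutation_std": perm.importances_std,
    },
    index=X_demo.columns,
).sort_values("permutation_importance", ascending=False)
print(comparison.round(4))
```

If the two rankings disagree noticeably for a feature, that feature's impurity-based importance deserves a closer look before it is reported.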


In [None]:

def create_shap_plot_cv_approach(X, y, n_folds=5, title='SHAP Feature Importance for AI Base Score'):
    """
    Create SHAP plot using cross-validation approach for more robust results
    """
    from sklearn.model_selection import KFold
    
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    all_shap_values = []
    all_X_scaled = []
    
    # Collect SHAP values from multiple CV folds
    for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
        print(f"Processing fold {fold+1}/{n_folds} for SHAP...")
        
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        
        # Proper scaling within fold
        scaler = StandardScaler()
        X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
        X_val_scaled = pd.DataFrame(scaler.transform(X_val), columns=X.columns)
        
        # Train model
        model = RandomForestRegressor(n_estimators=100, random_state=42)
        model.fit(X_train_scaled, y_train)
        
        # Calculate SHAP values for validation set
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X_val_scaled)
        
        all_shap_values.append(shap_values)
        all_X_scaled.append(X_val_scaled)
    
    # Combine all SHAP values and features
    combined_shap = np.vstack(all_shap_values)
    combined_X = pd.concat(all_X_scaled, ignore_index=True)
    
   
    plt.figure(figsize=(12, 8), dpi=300)
    shap.summary_plot(combined_shap, combined_X, show=False, max_display=10)
    
    # Customize
    plt.title(title, fontweight='bold', pad=20, fontsize=16)
    ax = plt.gca()
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    
    plt.tight_layout()
    plt.show()
    
    return combined_shap, combined_X

shap_values, X_combined = create_shap_plot_cv_approach(X, y, n_folds=5)
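
The summary plot orders features visually, but a numeric ranking is often easier to report. A common convention is to rank features by their mean absolute SHAP value across rows. The helper below is a hypothetical sketch of that calculation (its name and the toy matrix are assumptions for illustration); it expects a 2-D array shaped like `shap_values` returned above, with one row per hospital and one column per feature.

```python
import numpy as np
import pandas as pd


def rank_features_by_mean_abs_shap(shap_matrix, feature_names):
    """Rank features by mean absolute SHAP value (hypothetical helper)."""
    shap_matrix = np.asarray(shap_matrix)
    return (
        pd.DataFrame({
            "feature": list(feature_names),
            # Magnitude of the contribution, ignoring direction.
            "mean_abs_shap": np.abs(shap_matrix).mean(axis=0),
            # Signed average, useful to see the typical direction of the effect.
            "mean_shap": shap_matrix.mean(axis=0),
        })
        .sort_values("mean_abs_shap", ascending=False)
        .reset_index(drop=True)
    )


# Made-up SHAP matrix: 4 rows x 3 features.
toy_shap = np.array([
    [0.5, -0.1, 0.0],
    [-0.4, 0.2, 0.1],
    [0.6, -0.2, 0.0],
    [-0.5, 0.1, -0.1],
])
ranking = rank_features_by_mean_abs_shap(toy_shap, ["a", "b", "c"])
print(ranking)
```

In this notebook the call would presumably be `rank_features_by_mean_abs_shap(shap_values, X_combined.columns)`.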