## D1. Feature importance analysis 

**Description**  
This section trains a random forest regressor with 5-fold cross-validation and ranks features by impurity-based importance and SHAP values to show which ones best predict the AI/ML implementation level.

**Purpose**  
To identify which hospital, geographic, and interoperability features most strongly predict the AI/ML implementation level.
 


### 1 Load necessary libraries, functions, and pre-processed data 

In [None]:
# Standard library (os is used later to create the figures directory)
import os

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Machine learning and model evaluation
# IterativeImputer is experimental: the enable_* import must come first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import cross_val_score, train_test_split, KFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import (
    accuracy_score, 
    classification_report,
    r2_score, 
    mean_squared_error, 
    mean_absolute_error,
    mean_absolute_percentage_error
)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import shap

In [None]:
# Import functions
import sys
sys.path.append('../')
from calculate_scores import create_union_aipred_row, apply_ai_scores_to_dataframe

# Load data
AHA_master = pd.read_csv('../../data/AHA_master_external_data.csv', low_memory=False)

# Create aipred_it_union separately, row by row
AHA_master['aipred_it_union'] = AHA_master.apply(create_union_aipred_row, axis=1)

# Use the apply function for all other scores
AHA_IT = apply_ai_scores_to_dataframe(AHA_master)
# Keep hospitals with an IT survey ID; copy so later column assignments do not hit a view
AHA_IT = AHA_IT[AHA_IT['id_it'].notna()].copy()


### 2 Data engineering 

The hospital characteristics below were selected by investigator consensus. We also used LASSO regression to explore additional variables that predict AI/ML implementation and reflect hospital resource levels.

- **rural_urban_type** : collected from the AHA survey. Categorized as {1: rural, 2: micro, 3: metro} based on the location of the hospital ('CBSATYPE')
- **system_member** : hospital belonging to a corporate body that owns or manages health provider facilities or health-related subsidiaries ('MHSMEMB')
- **delivery_system** : delivery system identified using existing theory and AHA Annual Survey data {1: Centralized Health System, 2: Centralized Physician/Insurance Health System, 3: Moderately Centralized Health System, 4: Decentralized Health System, 5: Independent Hospital System, 6/Missing: Insufficient data to determine} ('CLUSTER')
- **community_hospital** : all nonfederal, short-term general, and special hospitals whose facilities and services are available to the public {0: No, 1: Yes} ('CHC')
- **subsidary_hospital** : hospital itself operates a subsidiary corporation {0: No, 1: Yes} ('SUBS')
- **frontline_hospital** : frontline facility {0: No, 1: Yes} ('FRTLN')
- **joint_commission_accreditation** : accreditation by the Joint Commission {0: No, 1: Yes} ('MAPP1')
- **center_quality** : Center for Improvement in Healthcare Quality accreditation {0: No, 1: Yes} ('MAPP22')
- **teaching_hospital** : major teaching hospital ('MAPP8') or minor teaching hospital ('MAPP3' or 'MAPP5')
- **critical_access** : critical access hospital {0: No, 1: Yes} ('MAPP18')
- **rural_referral** : rural referral center {0: No, 1: Yes} ('MAPP19')
- **ownership_type** : type of organization responsible for establishing policy concerning overall operation {nonfederal_government, non_profit_nongovernment, for_profit, federal_government, other} ('CNTRL')
- **bedsize** : bed-size category, ordinal variable ('BSC')
- **medicare_ipd_percentage** : Medicare inpatient days / total inpatient days, as a percentage. Proxy for the proportion of Medicare patients
- **medicaid_ipd_percentage** : Medicaid inpatient days / total inpatient days, as a percentage. Proxy for the proportion of Medicaid patients
- **core_index** : summary measure to track the interoperability of US hospitals (https://doi.org/10.1093/jamia/ocae289)
- **friction_index** : summary measure to track the barriers or difficulty in interoperability between hospitals (https://doi.org/10.1093/jamia/ocae289)
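
The LASSO screening step mentioned above is not shown in this notebook. The following is a minimal sketch of how such a screen could look, assuming standardized inputs and scikit-learn's `LassoCV` wrapped in `SelectFromModel`. The data are synthetic and the names `X_demo`, `y_demo`, and `screen` are hypothetical, so this illustrates the technique rather than the selection actually performed.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): only x0 and x1 drive the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 6)), columns=[f"x{i}" for i in range(6)])
y_demo = 3.0 * X_demo["x0"] - 2.0 * X_demo["x1"] + rng.normal(scale=0.5, size=300)

# Standardize first so the L1 penalty treats all features on the same scale.
screen = make_pipeline(
    StandardScaler(),
    SelectFromModel(LassoCV(cv=5, random_state=0)),
)
screen.fit(X_demo, y_demo)

# Keep the features whose absolute LASSO coefficient exceeds the selector's threshold.
mask = screen.named_steps["selectfrommodel"].get_support()
selected = X_demo.columns[mask].tolist()
print(selected)
```

Weak noise features can still pass the default threshold, so the selected set is best read as a list of candidates, not a final model.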


In [None]:
## children_hospital
# Service codes in serv_as that are treated as children's hospitals
children = [50, 51, 52, 53, 55, 56, 57, 58, 59, 90, 91]
AHA_IT['children_hospital'] = AHA_IT['serv_as'].isin(children)
print(AHA_IT['children_hospital'].value_counts())

## rural_urban_type
# Ordinal encoding of the CBSA type
AHA_IT['rural_urban_type'] = AHA_IT['cbsatype_as'].map({
    'Rural': 1,      # Rural = 1 (lowest)
    'Micro': 2,      # Micropolitan = 2 (middle)
    'Metro': 3       # Metropolitan = 3 (highest)
})

## system_member
# Create new column 'system_member' based on the conditions
AHA_IT['system_member'] = AHA_IT['mhsmemb_as'].copy()
# Set to 1 where sysid_as is not null and mhsmemb_as is null
AHA_IT.loc[(AHA_IT['sysid_as'].notna()) & (AHA_IT['mhsmemb_as'].isna()), 'system_member'] = 1
# Convert all remaining null values to 0
AHA_IT['system_member'] = AHA_IT['system_member'].fillna(0)

## AHA System Cluster Code - delivery_system
AHA_IT['delivery_system'] = AHA_IT['cluster_as']

## community_hospital
AHA_IT['community_hospital'] = AHA_IT['chc_as'].replace(2, 0)

## subsidary_hospital
AHA_IT['subsidary_hospital'] = AHA_IT['subs_as']

## frontline_hospital
AHA_IT['frontline_hospital'] = AHA_IT['frtln_as'].replace('.', 0)

## joint_commission_accreditation
AHA_IT['joint_commission_accreditation'] = AHA_IT['mapp1_as'].replace(2,0)

## center_quality
AHA_IT['center_quality'] = AHA_IT['mapp22_as'].replace(2,0)

# teaching hospitals 
AHA_IT['teaching_hospital'] = ((AHA_IT['mapp5_as'] == 1) | (AHA_IT['mapp3_as'] == 1) | (AHA_IT['mapp8_as'] == 1)).astype(int)
AHA_IT['major_teaching_hospital'] = ((AHA_IT['mapp8_as'] == 1)).astype(int)
AHA_IT['minor_teaching_hospital'] = (((AHA_IT['mapp5_as'] == 1) | (AHA_IT['mapp3_as'] == 1))&~(AHA_IT['mapp8_as'] == 1)).astype(int)

# critical access hospital
AHA_IT['critical_access'] = (AHA_IT['mapp18_as'] == 1).astype(int)


# rural referral center 
AHA_IT['rural_referral'] = (AHA_IT['mapp19_as'] == 1).astype(int)

# medicare medicaid percentage
AHA_IT['medicare_ipd_percentage'] = AHA_IT['mcripd_as'] / AHA_IT['ipdtot_as'] * 100
AHA_IT['medicaid_ipd_percentage'] = AHA_IT['mcdipd_as'] / AHA_IT['ipdtot_as'] * 100

# bed size 
AHA_IT['bedsize'] = AHA_IT['bsc_as'].astype(int)

# hospital ownership type 

# One indicator column per ownership group, based on the control code
AHA_IT['nonfederal_governement'] = AHA_IT['cntrl_as'].isin([12, 13, 14, 15, 16]).astype(int)
AHA_IT['non_profit_nongovernment'] = AHA_IT['cntrl_as'].isin([21, 23]).astype(int)
AHA_IT['for_profit'] = AHA_IT['cntrl_as'].isin([31, 32, 33]).astype(int)
AHA_IT['federal_government'] = AHA_IT['cntrl_as'].isin([40, 44, 45, 46, 47, 48]).astype(int)
# Create a categorical column for hospital ownership types
def create_ownership_category(row):
    if row['cntrl_as'] in [12, 13, 14, 15, 16]:
        return 'nonfederal_government'
    elif row['cntrl_as'] in [21, 23]:
        return 'non_profit_nongovernment'
    elif row['cntrl_as'] in [31, 32, 33]:
        return 'for_profit'
    elif row['cntrl_as'] in [40, 44, 45, 46, 47, 48]:
        return 'federal_government'
    else:
        return 'other'

# Create the categorical column
AHA_IT['ownership_type'] = AHA_IT.apply(create_ownership_category, axis=1)



In [None]:
# Replace invalid SVI values with median of valid values
svi_columns = ['svi_themes_median', 'svi_theme1_median', 'svi_theme2_median', 
              'svi_theme3_median', 'svi_theme4_median']

for col in svi_columns:
    # Identify valid values (0-1 range); NaN fails both comparisons, so missing values are replaced too
    valid_mask = (AHA_IT[col] >= 0) & (AHA_IT[col] <= 1)
    
    # Calculate median of valid values
    valid_median = AHA_IT.loc[valid_mask, col].median()
    
    # Replace invalid values with the median
    invalid_mask = ~valid_mask
    AHA_IT.loc[invalid_mask, col] = valid_median
    
    print(f"Fixed {col}:")
    print(f"  Replaced {invalid_mask.sum()} invalid values with median {valid_median:.4f}")
    print(f"  New range: {AHA_IT[col].min():.4f} to {AHA_IT[col].max():.4f}")

In [None]:
hospital_features = ["teaching_hospital",
"nonfederal_governement",
"non_profit_nongovernment",
"for_profit",
"federal_government",
"critical_access",
"rural_referral",
"medicare_ipd_percentage",
"medicaid_ipd_percentage",
"bedsize",
"delivery_system",
"community_hospital",
"subsidary_hospital",
"frontline_hospital",
"joint_commission_accreditation",
"center_quality",
"system_member"]
coordinates_features = ["latitude_address",
"longitude_address"]
geo_features = ["rural_urban_type",
"national_adi_median",
"svi_themes_median",
"svi_theme1_median",
"svi_theme2_median",
"svi_theme3_median",
"svi_theme4_median",
"Device_Percent",
"Broadband_Percent",
"Internet_Percent",
"mean_primary_hpss",
"mean_dental_hpss",
"mean_mental_hpss",
"mean_mua_score",
"mean_mua_elders_score",
"mean_mua_infant_score"]
interoperability_features = ['core_index', "friction_index"]


In [None]:
feature_columns = hospital_features + geo_features + interoperability_features 

In [None]:
# Target: imputed AI base score; remaining missing scores are treated as 0
AHA_IT["ai_base_score_imputed"] = AHA_IT["ai_base_score_imputed"].astype(float).fillna(0)
y = AHA_IT['ai_base_score_imputed'].values
X = AHA_IT[feature_columns]

In [None]:
# === CV setup ===
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

# === Storage ===
r2_scores = []
rmse_scores = []
mae_scores = []
mape_scores = []
feature_importances = []

print("Starting cross-validation with Random Forest feature importance...")

for fold, (train_idx, val_idx) in enumerate(kf.split(X), 1):
    print(f"Processing fold {fold}/{n_splits}...")

    # Split
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    # Impute missing values with MICE (fit on train only)
    imputer = IterativeImputer(
        max_iter=10,            # number of MICE iterations
        random_state=42,
        sample_posterior=False  # True -> Bayesian, adds randomness
    )
    X_train_imp = pd.DataFrame(
        imputer.fit_transform(X_train),
        columns=X.columns, index=X_train.index
    )
    X_val_imp = pd.DataFrame(
        imputer.transform(X_val),
        columns=X.columns, index=X_val.index
    )

    # Scale after imputation (fit on train only)
    scaler = StandardScaler()
    X_train_scaled = pd.DataFrame(
        scaler.fit_transform(X_train_imp),
        columns=X.columns, index=X_train.index
    )
    X_val_scaled = pd.DataFrame(
        scaler.transform(X_val_imp),
        columns=X.columns, index=X_val.index
    )

    # Model
    rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
    rf_model.fit(X_train_scaled, y_train)

    # Importances
    feature_importances.append(rf_model.feature_importances_)

    # Predict and metrics
    y_pred = rf_model.predict(X_val_scaled)
    r2_scores.append(r2_score(y_val, y_pred))
    rmse_scores.append(np.sqrt(mean_squared_error(y_val, y_pred)))
    mae_scores.append(mean_absolute_error(y_val, y_pred))
    # Note: y contains zeros (missing scores were filled with 0), so MAPE can be very large
    mape_scores.append(mean_absolute_percentage_error(y_val, y_pred))

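print("Cross-validation completed!")

# Report mean and standard deviation of each metric across folds
for name, scores in [("R2", r2_scores), ("RMSE", rmse_scores),
                     ("MAE", mae_scores), ("MAPE", mape_scores)]:
    print(f"{name}: {np.mean(scores):.4f} ± {np.std(scores):.4f}")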

# === Aggregate feature importance across folds ===
feature_importances_arr = np.array(feature_importances)  # (n_folds, n_features)
avg_feature_importance = np.mean(feature_importances_arr, axis=0)
std_feature_importance = np.std(feature_importances_arr, axis=0)

feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': avg_feature_importance,
    'std_importance': std_feature_importance,
    'cv_stability': 1 - (std_feature_importance / (avg_feature_importance + 1e-10))
}).sort_values('importance', ascending=False)
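
The `cv_stability` column above is one minus the coefficient of variation of a feature's importance across folds. Values near 1 mean the importance barely changes from fold to fold, and the value turns negative when the standard deviation exceeds the mean. A small worked example with made-up per-fold importances (the numbers are hypothetical, not results from this analysis):

```python
import numpy as np

# Hypothetical per-fold importances: rows are folds, columns are two features.
fold_importances = np.array([
    [0.30, 0.02],
    [0.32, 0.10],
    [0.28, 0.01],
    [0.31, 0.06],
    [0.29, 0.01],
])

mean_imp = fold_importances.mean(axis=0)
std_imp = fold_importances.std(axis=0)  # population std (ddof=0), the np.std default
cv_stability = 1 - std_imp / (mean_imp + 1e-10)

# The first feature is stable across folds; the second fluctuates heavily.
print(np.round(cv_stability, 3))
```

Because low-importance features have small means, their stability score is sensitive to small absolute fluctuations and is most informative for the top-ranked features.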


In [None]:
# === Top 10 features table ===
print("\n" + "="*60)
print("TOP 10 MOST IMPORTANT FEATURES")
print("="*60)
print("Note: Using Random Forest built-in impurity-based feature importance (mean decrease in impurity)")
top_10 = feature_importance.head(10)[['feature', 'importance', 'std_importance', 'cv_stability']]
print(top_10.to_string(index=False, float_format='%.4f'))
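
Impurity-based importances are computed from the training data and can favor continuous or high-cardinality features. One complementary check is permutation importance on held-out data, which `sklearn.inspection.permutation_importance` provides. The sketch below uses synthetic data and hypothetical names (`X_demo`, `rf_demo`, `perm_table`); it is not part of the analysis above and only shows the shape of such a check.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): one informative column, three noise columns.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.normal(size=(400, 4)),
    columns=["signal", "noise_a", "noise_b", "noise_c"],
)
y_demo = 2.0 * X_demo["signal"] + rng.normal(scale=0.3, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
rf_demo = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Importance = drop in held-out R^2 when a single column is shuffled.
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=42)
perm_table = pd.DataFrame({
    "feature": X_demo.columns,
    "perm_importance": perm.importances_mean,
    "perm_std": perm.importances_std,
}).sort_values("perm_importance", ascending=False)
print(perm_table.round(4).to_string(index=False))
```

In a cross-validated setting, the same call could be made on each fold's validation split and averaged, mirroring how the impurity-based importances are aggregated.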


In [None]:


# === Plot top-10 feature importances with std error bars ===
fig, ax = plt.subplots(figsize=(10, 8), dpi=300)

feature_importance_plot = feature_importance.head(10)
bars = ax.barh(
    feature_importance_plot['feature'],
    feature_importance_plot['importance'],
    xerr=feature_importance_plot['std_importance'],
    capsize=5,
    color='#2E86C1',
    alpha=0.8,
    edgecolor='black',
    linewidth=0.5
)

ax.set_xlabel('Feature Importance (Mean Decrease in Impurity)', fontweight='bold', labelpad=10)
ax.set_title('Top 10 Most Important Features', fontweight='bold', pad=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_linewidth(1.5)
ax.spines['bottom'].set_linewidth(1.5)
ax.grid(True, axis='x', linestyle='--', alpha=0.3)
ax.invert_yaxis()

for bar in bars:
    width = bar.get_width()
    ax.text(width + 0.005, bar.get_y() + bar.get_height()/2,
            f'{width:.3f}',
            ha='left', va='center', fontsize=10)

plt.tight_layout()
plt.show()



In [None]:
def create_shap_plot_cv_approach(X, y, n_folds=5, title='SHAP Feature Importance for AI Base Score'):
    """
    Cross-validated SHAP for Random Forest with MICE imputation and scaling each fold.
    Uses interventional background and disables additivity check to avoid ExplainerError.
    """
    os.makedirs('figures', exist_ok=True)

    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    all_shap = []
    all_X_val = []

    for fold, (train_idx, val_idx) in enumerate(kf.split(X), 1):
        print(f"Processing fold {fold}/{n_folds} for SHAP...")

        # Split
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train = y[train_idx]

        # Impute (MICE) on train, transform val
        imputer = IterativeImputer(max_iter=10, random_state=42, sample_posterior=False)
        X_train_imp = pd.DataFrame(imputer.fit_transform(X_train), columns=X.columns, index=X_train.index)
        X_val_imp   = pd.DataFrame(imputer.transform(X_val),      columns=X.columns, index=X_val.index)

        # Scale on train, transform val
        scaler = StandardScaler()
        X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train_imp), columns=X.columns, index=X_train.index)
        X_val_scaled   = pd.DataFrame(scaler.transform(X_val_imp),       columns=X.columns, index=X_val.index)

        # Train model
        model = RandomForestRegressor(n_estimators=100, random_state=42)
        model.fit(X_train_scaled, y_train)

        # SHAP explainer: interventional background + disable additivity check
        masker = shap.maskers.Independent(X_train_scaled, max_samples=200)  # subsample background for speed
        explainer = shap.TreeExplainer(model, data=masker, feature_perturbation="interventional", model_output="raw")

        # Disable additivity check at call to avoid ExplainerError across folds
        sv = explainer.shap_values(X_val_scaled, check_additivity=False)  # shape (n_val, n_features)

        all_shap.append(sv)
        all_X_val.append(X_val_scaled)

    # Combine across folds
    combined_shap = np.vstack(all_shap)
    combined_X = pd.concat(all_X_val, axis=0, ignore_index=True)

    # Plot
    plt.rcParams['font.family'] = 'Helvetica'
    plt.rcParams['font.size'] = 12

    plt.figure(figsize=(12, 8), dpi=300)
    shap.summary_plot(combined_shap, combined_X, show=False, max_display=10)
    plt.title(title, fontweight='bold', pad=20, fontsize=16)
    ax = plt.gca()
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    plt.tight_layout()
    plt.savefig('figures/shap_summary_plot_cv.pdf',
                bbox_inches='tight', dpi=300, format='pdf',
                facecolor='white', edgecolor='none')
    plt.show()

    return combined_shap, combined_X

# Usage:
shap_values, X_combined = create_shap_plot_cv_approach(X, y, n_folds=5)
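
The summary plot orders features by mean absolute SHAP value. To get that ranking as a table, the returned matrix can be aggregated directly. The sketch below uses a random stand-in matrix and hypothetical feature names, so the printed numbers carry no meaning beyond illustrating the computation.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for an (n_samples, n_features) SHAP matrix.
rng = np.random.default_rng(0)
demo_features = ["feature_a", "feature_b", "feature_c"]
demo_shap = rng.normal(scale=[0.5, 0.1, 0.3], size=(200, 3))

# Global importance per feature: mean of |SHAP| over all samples.
mean_abs_shap = pd.Series(np.abs(demo_shap).mean(axis=0), index=demo_features)
ranking = mean_abs_shap.sort_values(ascending=False)
print(ranking.round(3))
```

With the function's outputs, the same aggregation would be `pd.Series(np.abs(shap_values).mean(axis=0), index=X_combined.columns)`.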