## D2. Longitudinal health outcome analysis

**Description**  
This section investigates whether a hospital's AI implementation level is associated with changes in its care quality over time. We take this longitudinal approach because the AHA dataset does not record when AI was implemented, so the exact implementation timing cannot be specified.

**Purpose**  
To investigate whether AI implementation level affects the change in hospital care quality over time.

**Disclaimer**  
- This codebase was partially cleaned and annotated using OpenAI’s ChatGPT-4o. Please review and validate before using for critical purposes.  
- AHA data is subscription-based and not publicly shareable. All reported results are aggregated at the state or census division level.
- All publicly available data should also be independently downloaded from the source.  

**Notebook Workflow**  

0. Load necessary libraries, functions, and pre-processed data 
1. Feature engineering for hospital characteristics 
2. Assess missingness of the care quality metric 
3. Conduct ML 
4. Conduct feature importance analysis
5. Conduct longitudinal analysis  
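
Because the AHA survey gives an implementation level but not a start date, one way to frame step 5 is to summarize each hospital's quality trajectory as a slope across release dates and then relate that slope to its AI score. The sketch below is only an illustration under that assumption; the helper `per_hospital_slope`, the column names (`hospital_id`, `time_index`, `value`), and the toy data are hypothetical and not taken from this notebook.

```python
import numpy as np
import pandas as pd

def per_hospital_slope(long_df, id_col="hospital_id", time_col="time_index", value_col="value"):
    """Least-squares slope of `value_col` over `time_col` for each hospital."""
    slopes = {}
    for hospital_id, group in long_df.groupby(id_col):
        g = group.dropna(subset=[time_col, value_col])
        if g[time_col].nunique() < 2:
            slopes[hospital_id] = np.nan  # a slope needs at least two time points
            continue
        # np.polyfit with degree 1 returns [slope, intercept]
        slope, _intercept = np.polyfit(
            g[time_col].to_numpy(dtype=float),
            g[value_col].to_numpy(dtype=float),
            1,
        )
        slopes[hospital_id] = slope
    return pd.Series(slopes, name="quality_slope")

# Hypothetical long-format data: one row per hospital per release
toy = pd.DataFrame({
    "hospital_id": ["A"] * 4 + ["B"] * 4 + ["C"],
    "time_index": [0, 1, 2, 3, 0, 1, 2, 3, 0],
    "value": [10.0, 11.0, 12.0, 13.0, 5.0, 5.0, 5.0, 5.0, 7.0],
})
slopes = per_hospital_slope(toy)
print(slopes)
```

The resulting per-hospital slopes could then be joined to the AI scores by hospital ID; hospitals with a single observation stay missing rather than being assigned a zero slope.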

### D2_0 load necessary libraries, functions, and preprocessed data 

In [383]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
from libpysal.weights import KNN, DistanceBand
from esda.moran import Moran
from spreg import ML_Error, OLS
from shapely.geometry import Point
from scipy import stats

In [None]:
# load preprocessed dataframe 
AHA_master = pd.read_csv('./data/AHA_master_external_data.csv', low_memory=False)
AHA_IT = AHA_master[~AHA_master['id_it'].isnull()]
AHA_IT.shape

In [None]:
# Import and if needed, reload the module
import calculate_ai_scores
AHA_master2 = calculate_ai_scores.apply_ai_scores_to_dataframe(AHA_IT)

### D2_1 Feature Engineering 

These hospital characteristics were selected based on investigator consensus, and we used LASSO regression analysis to explore and identify additional variables that predict AI/ML implementation and reflect hospital resource levels.

- **rural_urban_type** : collected from the AHA survey; categorized into {1: rural, 2: micro, 3: metro} based on the location of the hospital ('CBSATYPE')
- **system_member** : hospital belongs to a corporate body that owns or manages health provider facilities or health-related subsidiaries ('MHSMEMB')
- **delivery_system** : delivery system identified using existing theory and AHA Annual Survey data {1: Centralized Health System, 2: Centralized Physician/Insurance Health System, 3: Moderately Centralized Health System, 4: Decentralized Health System, 5: Independent Hospital System, 6/Missing: Insufficient data to determine} ('CLUSTER')
- **community_hospital** : all nonfederal, short-term general, and special hospitals whose facilities and services are available to the public {0: No, 1: Yes}('CHC')
- **subsidary_hospital** : Hospital itself operates subsidiary corporation {0: No, 1: Yes} ('SUBS')
- **frontline_hospital** : Frontline facility {0: No, 1: Yes} ('FRTLN')
- **joint_commission_accreditation** : Accreditation by the Joint Commission {0: No, 1: Yes} ('MAPP1')
- **center_quality** : Center for Improvement in Healthcare Quality Accreditation {0: No, 1: Yes} ('MAPP22')
- **teaching_hospital** : major teaching hospital ('MAPP8'), minor teaching hospital ('MAPP3' or 'MAPP5')
- **critical_access** : critical access hospital {0: No, 1: Yes} ('MAPP18')
- **rural_referral** : rural referral center {0: No, 1: Yes} ('MAPP19')
- **ownership_type** : type of organization responsible for establishing policy concerning overall operation {federal_government, nonfederal_government, non_profit_nongovernment, for_profit, other} ('CNTRL')
- **bedsize** : bed-size category, ordinal variable ('BSC')
- **medicare_ipd_percentage** : Medicare inpatient days / total inpatient days. Proxy variable for the proportion of Medicare patients 
- **medicaid_ipd_percentage** : Medicaid inpatient days / total inpatient days. Proxy variable for the proportion of Medicaid patients 
- **core_index** : summary measure to track the interoperability of US hospitals (https://doi.org/10.1093/jamia/ocae289)
- **friction_index** : summary measures to track the barrier or difficulty in interoperability between hospitals (https://doi.org/10.1093/jamia/ocae289)
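
The two payer-mix proxies are simple ratios, so the main subtlety is what to do when total inpatient days are zero or missing. The sketch below shows one option that leaves such hospitals as NaN rather than 0, so that later imputation can treat them as missing. It is an illustrative alternative and not necessarily the choice made in this notebook; `payer_share_pct` is a hypothetical helper name and the toy values are made up.

```python
import numpy as np
import pandas as pd

def payer_share_pct(payer_days, total_days):
    """Payer inpatient days as a percentage of total inpatient days."""
    # Zero or missing totals become NaN so the ratio is NaN instead of inf or 0
    total = total_days.where(total_days > 0)
    return payer_days / total * 100

toy = pd.DataFrame({
    "mcripd_as": [50.0, 0.0, 30.0, np.nan],
    "ipdtot_as": [200.0, 0.0, np.nan, 100.0],
})
toy["medicare_ipd_percentage"] = payer_share_pct(toy["mcripd_as"], toy["ipdtot_as"])
print(toy)
```

Only the first row has both values present and a positive denominator, so it is the only row with a defined percentage.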


In [None]:
# Add numpy import at the top if not already imported
import pandas as pd
import numpy as np

## rural_urban_type
# Continue with CBSA type and other variables
AHA_master2['rural_urban_type'] = AHA_master2['cbsatype_as'].map({
    'Rural': 1,      # Rural = 1 (lowest)
    'Micro': 2,      # Micropolitan = 2 (middle)
    'Metro': 3       # Metropolitan = 3 (highest)
})

## system_member
# Create new column 'system_member' based on the conditions
AHA_master2['system_member'] = AHA_master2['mhsmemb_as'].copy()
# Set to 1 where sysid_as is not null and mhsmemb_as is null
AHA_master2.loc[(AHA_master2['sysid_as'].notna()) & (AHA_master2['mhsmemb_as'].isna()), 'system_member'] = 1
# Convert all remaining null values to 0
AHA_master2['system_member'] = AHA_master2['system_member'].fillna(0).astype(int)

## AHA System Cluster Code - delivery_system
# Handle NaN values before converting to int
AHA_master2['delivery_system'] = AHA_master2['cluster_as'].fillna(0).astype(int)

## community_hospital
# Handle NaN values before converting to int
AHA_master2['community_hospital'] = AHA_master2['chc_as'].fillna(0).replace(2, 0).astype(int)

## subsidary_hospital
# Handle NaN values before converting to int
AHA_master2['subsidary_hospital'] = AHA_master2['subs_as'].fillna(0).astype(int)

## frontline_hospital
# Handle both '.' and NaN values before converting to int
AHA_master2['frontline_hospital'] = AHA_master2['frtln_as'].replace('.', 0).fillna(0).astype(int)

## joint_commission_accreditation
# Handle NaN values before converting to int
AHA_master2['joint_commission_accreditation'] = AHA_master2['mapp1_as'].fillna(0).replace(2, 0).astype(int)

## center_quality
# Handle NaN values before converting to int
AHA_master2['center_quality'] = AHA_master2['mapp22_as'].fillna(0).replace(2, 0).astype(int)

# teaching hospitals - Handle NaN values in the underlying columns
AHA_master2['teaching_hospital'] = ((AHA_master2['mapp5_as'].fillna(0) == 1) | 
                                   (AHA_master2['mapp3_as'].fillna(0) == 1) | 
                                   (AHA_master2['mapp8_as'].fillna(0) == 1)).astype(int)

AHA_master2['major_teaching_hospital'] = (AHA_master2['mapp8_as'].fillna(0) == 1).astype(int)

AHA_master2['minor_teaching_hospital'] = (((AHA_master2['mapp5_as'].fillna(0) == 1) | 
                                          (AHA_master2['mapp3_as'].fillna(0) == 1)) & 
                                         ~(AHA_master2['mapp8_as'].fillna(0) == 1)).astype(int)

# critical access hospital
AHA_master2['critical_access'] = (AHA_master2['mapp18_as'].fillna(0) == 1).astype(int)

# rural referral center 
AHA_master2['rural_referral'] = (AHA_master2['mapp19_as'].fillna(0) == 1).astype(int)

# medicare medicaid percentage - Handle division by zero and NaN values
# Replace inf and NaN with 0 for percentages
AHA_master2['medicare_ipd_percentage'] = (AHA_master2['mcripd_as'].fillna(0) / 
                                         AHA_master2['ipdtot_as'].replace(0, 1).fillna(1) * 100)
AHA_master2['medicare_ipd_percentage'] = AHA_master2['medicare_ipd_percentage'].replace([np.inf, -np.inf], 0).fillna(0)

AHA_master2['medicaid_ipd_percentage'] = (AHA_master2['mcdipd_as'].fillna(0) / 
                                         AHA_master2['ipdtot_as'].replace(0, 1).fillna(1) * 100)
AHA_master2['medicaid_ipd_percentage'] = AHA_master2['medicaid_ipd_percentage'].replace([np.inf, -np.inf], 0).fillna(0)

# bed size - Handle NaN values
AHA_master2['bedsize'] = AHA_master2['bsc_as'].fillna(0).astype(int)

# hospital ownership type - Handle NaN values before comparison
AHA_master2['nonfederal_government'] = ((AHA_master2['cntrl_as'].fillna(0) == 12) | 
                                       (AHA_master2['cntrl_as'].fillna(0) == 13) |
                                       (AHA_master2['cntrl_as'].fillna(0) == 14) | 
                                       (AHA_master2['cntrl_as'].fillna(0) == 15) | 
                                       (AHA_master2['cntrl_as'].fillna(0) == 16)).astype(int)

AHA_master2['non_profit_nongovernment'] = ((AHA_master2['cntrl_as'].fillna(0) == 21) | 
                                          (AHA_master2['cntrl_as'].fillna(0) == 23)).astype(int)

AHA_master2['for_profit'] = ((AHA_master2['cntrl_as'].fillna(0) == 31) | 
                            (AHA_master2['cntrl_as'].fillna(0) == 32) | 
                            (AHA_master2['cntrl_as'].fillna(0) == 33)).astype(int)

AHA_master2['federal_government'] = ((AHA_master2['cntrl_as'].fillna(0) == 40) | 
                                    (AHA_master2['cntrl_as'].fillna(0) == 44) | 
                                    (AHA_master2['cntrl_as'].fillna(0) == 45) | 
                                    (AHA_master2['cntrl_as'].fillna(0) == 46) | 
                                    (AHA_master2['cntrl_as'].fillna(0) == 47) | 
                                    (AHA_master2['cntrl_as'].fillna(0) == 48)).astype(int)

# Create a categorical column for hospital ownership types
def create_ownership_category(row):
    cntrl_val = row['cntrl_as'] if pd.notna(row['cntrl_as']) else 0
    if cntrl_val in [12, 13, 14, 15, 16]:
        return 'nonfederal_government'
    elif cntrl_val in [21, 23]:
        return 'non_profit_nongovernment'
    elif cntrl_val in [31, 32, 33]:
        return 'for_profit'
    elif cntrl_val in [40, 44, 45, 46, 47, 48]:
        return 'federal_government'
    else:
        return 'other'

# Create the categorical column
AHA_master2['ownership_type'] = AHA_master2.apply(create_ownership_category, axis=1)

# Optional: Print some diagnostics to check your data
print("Data processing completed!")
print(f"Data shape: {AHA_master2.shape}")
print(f"Ownership type distribution:")
print(AHA_master2['ownership_type'].value_counts())
print(f"Rural/Urban distribution:")
print(AHA_master2['rural_urban_type'].value_counts())

In [387]:
ai_exposures = ['ai_base_score', 'ai_base_breadth_score', 'ai_base_dev_score', 'ai_base_eval_score']
AHA_imputed = AHA_master2.copy()
# For a simple approach using a fixed value (e.g., 0)
AHA_imputed[ai_exposures] = AHA_imputed[ai_exposures].fillna(0)

In [388]:
hospital_features = ["teaching_hospital",
"nonfederal_government",
"non_profit_nongovernment",
"for_profit",
"federal_government",
"critical_access",
"rural_referral",
"medicare_ipd_percentage",
"medicaid_ipd_percentage",
"bedsize",
"delivery_system",
"community_hospital",
"subsidary_hospital",
"frontline_hospital",
"joint_commission_accreditation",
"system_member"]
coordinates_features = ["latitude_address",
"longitude_address"]
geo_features = ["rural_urban_type",
"national_adi_median",
"svi_themes_median",
"svi_theme1_median",
"svi_theme2_median",
"svi_theme3_median",
"svi_theme4_median",
"Device_Percent",
"Broadband_Percent",
"Internet_Percent",
"mean_primary_hpss",
"mean_dental_hpss",
"mean_mental_hpss",
"mean_mua_score",
"mean_mua_elders_score",
"mean_mua_infant_score"]
interoperability_features = ['core_index', "friction_index"]
all_covariates = hospital_features + geo_features 

In [389]:
hospital_quality_outcomes = ["COMP_HIP_KNEE",
"MORT_30_AMI",
"MORT_30_CABG",
"MORT_30_COPD",
"MORT_30_HF",
"MORT_30_PN",
"MORT_30_STK",
"PSI_03",
"PSI_04",
"PSI_06",
"PSI_08",
"PSI_09",
"PSI_10",
"PSI_11",
"PSI_12",
"PSI_13",
"PSI_14",
"PSI_15",
"PSI_90",
"Total HAC Score",
"READM-30-AMI-HRRP",
"READM-30-CABG-HRRP",
"READM-30-COPD-HRRP",
"READM-30-HF-HRRP",
"READM-30-HIP-KNEE-HRRP",
"READM-30-PN-HRRP",
"MSPB-1",
"EDV",
"ED_2_Strata_1",
"ED_2_Strata_2",
"HCP_COVID_19",
"HH_01",
"HH_02",
"IMM_3",
"OP_18b",
"OP_18c",
"OP_22",
"OP_23",
"OP_29",
"OP_31",
"OP_40",
"SAFE_USE_OF_OPIOIDS",
"SEP_1",
"SEP_SH_3HR",
"SEP_SH_6HR",
"SEV_SEP_3HR",
"SEV_SEP_6HR",
"STK_02",
"STK_03",
"STK_05",
"STK_06",
"VTE_1",
"VTE_2",
"EDAC_30_AMI",
"EDAC_30_HF",
"EDAC_30_PN",
"OP_32",
"OP_35_ADM",
"OP_35_ED",
"OP_36",
"READM_30_AMI",
"READM_30_CABG",
"READM_30_COPD",
"READM_30_HF",
"READM_30_HIP_KNEE",
"READM_30_HOSP_WIDE",
"READM_30_PN"]
# List of columns to process
outcome_columns = hospital_quality_outcomes.copy()  # Make a copy to safely modify


In [None]:
import pandas as pd
import numpy as np
import os

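# Setup: normalize the Medicare provider number to a string ID
aha_df = AHA_imputed.copy()
# Strip only a trailing ".0" left by float parsing; an unanchored '.0' pattern
# can be treated as a regex (any character followed by 0) and remove digits inside the ID
aha_df['mcrnum_as'] = aha_df['mcrnum_as'].astype(str).str.replace(r'\.0$', '', regex=True)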
full_hospital_list = aha_df['mcrnum_as'].unique()
print(f"Total number of hospitals in AHA: {len(full_hospital_list)}")
print(f"Sample AHA hospital IDs: {full_hospital_list[:5]}")

data_order = ['01_2022', '04_2022', '07_2022', '10_2022', 
              '01_2023', '04_2023', '07_2023', '10_2023', 
              '01_2024', '04_2024', '07_2024', '10_2024', 
              '02_2025', '04_2025']

outcome_files = {
    'General Hospital Info': "data/outcomes/merged_general_hospital_info.csv",
    'Death Complication': "data/outcomes/merged_death_complication.csv",
    'HAC Reduction': "data/outcomes/merged_HAC_reduction.csv",
    'Readmission': "data/outcomes/merged_readmission.csv",
    'Medicare Spending': "data/outcomes/merged_Medicare_Hospital_Spending_Per_Patient.csv",
    'Timely Care': "data/outcomes/merged_Timely_and_Effective_Care.csv",
    'Unplanned Visits': "data/outcomes/merged_Unplanned_Hospital_Visits.csv"
}
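
Matching AHA records to the CMS outcome files depends on the Medicare provider number having the same string form on both sides. A minimal sketch of one normalization is shown below. It assumes the identifier is a six-character CMS Certification Number that may have lost leading zeros or gained a `.0` suffix when read as a float; `normalize_ccn` is a hypothetical helper, not part of this notebook, and the example IDs are made up.

```python
import re
import pandas as pd

def normalize_ccn(value):
    """Return a zero-padded six-character ID string, or None if missing."""
    if pd.isna(value):
        return None
    text = str(value).strip()
    text = re.sub(r"\.0$", "", text)  # drop a trailing ".0" from float parsing only
    return text.zfill(6)              # restore leading zeros lost in numeric parsing

raw = pd.Series([10001.0, "050454", "450358", None], dtype="object")
clean = raw.map(normalize_ccn)
print(clean.tolist())
```

Keeping missing IDs as missing (instead of the strings `"nan"` or `"None"`) avoids accidental matches during the merge.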



### D2_2 Assess missingness

In [None]:
merged_df = pd.read_csv('merged_df_april_2025.csv')
print(f"merged_df shape: {merged_df.shape}")

# Find outcome columns (columns with prefixes from merging)
outcome_cols = [col for col in merged_df.columns 
               if any(prefix in col for prefix in ['Timely_Care_', 'Death_Complication_', 'HAC_Reduction_', 
                                                  'Readmission_', 'Medicare_Spending_', 'Unplanned_Visits_', 
                                                  'General_Hospital_Info_'])]

print(f"Found {len(outcome_cols)} outcome columns")

# Calculate missingness for each outcome column
results = []
for col in outcome_cols:
    missing_pct = (merged_df[col].isnull().sum() / len(merged_df)) * 100
    results.append({'column': col, 'missing_pct': missing_pct})

# Sort by missingness
results_df = pd.DataFrame(results).sort_values('missing_pct')

# Show results
print(f"\nColumns with < 50% missing (good for LASSO):")
good_cols = results_df[results_df['missing_pct'] < 50]
for _, row in good_cols.iterrows():
    print(f"  {row['column']}: {row['missing_pct']:.1f}% missing")

print(f"\nFound {len(good_cols)} usable columns for LASSO analysis")
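
The per-column loop above can also be written as a single vectorized expression, since the mean of a boolean missingness mask is the missing fraction. The sketch below uses a small made-up frame and only two of the prefixes. Note that it matches prefixes with `str.startswith`, which is stricter than the substring test used in the cell above; that difference is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for merged_df
toy = pd.DataFrame({
    "Readmission_A": [1.0, np.nan, 3.0, 4.0],
    "Timely_Care_B": [np.nan, np.nan, np.nan, 1.0],
    "other": [1, 2, 3, 4],
})

prefixes = ("Readmission_", "Timely_Care_")
outcome_cols = [c for c in toy.columns if c.startswith(prefixes)]

# Percent missing per outcome column, sorted from most to least complete
missing_pct = toy[outcome_cols].isna().mean().mul(100).sort_values()
usable = missing_pct[missing_pct < 50].index.tolist()
print(missing_pct)
print(usable)
```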

In [None]:
# Simple LASSO analysis on your April 2025 merged_df

# Find outcome columns (from your merged data)
outcome_cols = [col for col in merged_df.columns 
               if any(prefix in col for prefix in ['Timely_Care_', 'Death_Complication_', 'HAC_Reduction_', 
                                                  'Readmission_', 'Medicare_Spending_', 'Unplanned_Visits_'])]

print(f"Found {len(outcome_cols)} outcome columns")

# Filter to columns with < 50% missing data
good_outcomes = []
for col in outcome_cols:
    missing_pct = (merged_df[col].isnull().sum() / len(merged_df)) * 100
    if missing_pct < 50:
        good_outcomes.append(col)

print(f"Using {len(good_outcomes)} outcomes with < 50% missing data")

predictor_cols = all_covariates 

# Filter predictors that exist and have < 75% missing
good_predictors = []
for col in predictor_cols:
    if col in merged_df.columns:
        missing_pct = (merged_df[col].isnull().sum() / len(merged_df)) * 100
        if missing_pct < 75:
            good_predictors.append(col)

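# Run LASSO analysis
# Note: lasso_covariate_table must already be defined or imported before this cell runs;
# the way its outputs are used here matches run_lasso_feature_selection, defined in a later cell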
if good_outcomes and good_predictors:
    print("\nRunning LASSO...")
    try:
        table, feature_importance_dict = lasso_covariate_table(
            merged_df, 
            good_outcomes, 
            predictor_columns=good_predictors
        )
        
        print("✓ LASSO completed successfully!")
        print(f"Results: {len(table)} outcomes analyzed")
        
        # Show summary results
        if len(table) > 0:
            print(f"\nSummary:")
            print(f"  Average features selected: {table['num_selected'].mean():.1f}")
            print(f"  Range: {table['num_selected'].min()}-{table['num_selected'].max()} features")
            
            # Show top results
            print(f"\nTop 5 outcomes by number of features selected:")
            top_outcomes = table.nlargest(5, 'num_selected')
            for _, row in top_outcomes.iterrows():
                print(f"  {row['outcome']}: {row['num_selected']} features")
        
    except Exception as e:
        print(f"❌ LASSO failed: {e}")
        import traceback
        traceback.print_exc()
        
else:
    print("❌ No suitable outcomes or predictors found")
    print(f"Outcomes: {len(good_outcomes)}, Predictors: {len(good_predictors)}")

In [None]:
# Simple LASSO analysis on your April 2025 merged_df
print("Running LASSO analysis on April 2025 data...")
print(f"merged_df shape: {merged_df.shape}")

# Find outcome columns (from your merged data)
outcome_cols = [col for col in merged_df.columns 
               if any(prefix in col for prefix in ['Timely_Care_', 'Death_Complication_', 'HAC_Reduction_', 
                                                  'Readmission_', 'Medicare_Spending_', 'Unplanned_Visits_'])]

print(f"Found {len(outcome_cols)} outcome columns")

# Filter to columns with < 50% missing data
good_outcomes = []
for col in outcome_cols:
    missing_pct = (merged_df[col].isnull().sum() / len(merged_df)) * 100
    if missing_pct < 50:
        good_outcomes.append(col)

print(f"Using {len(good_outcomes)} outcomes with < 50% missing data")

# Use your predefined covariates (from your notebook)
predictor_cols = all_covariates  # Your hospital + geographic features

# Filter predictors that exist and have < 75% missing
good_predictors = []
for col in predictor_cols:
    if col in merged_df.columns:
        missing_pct = (merged_df[col].isnull().sum() / len(merged_df)) * 100
        if missing_pct < 75:
            good_predictors.append(col)

print(f"Using {len(good_predictors)} predictors with < 75% missing data")

# Run LASSO analysis
if good_outcomes and good_predictors:
    print("\nRunning LASSO...")
    try:
        table, feature_importance_dict = lasso_covariate_table(
            merged_df, 
            good_outcomes, 
            predictor_columns=good_predictors
        )
        
        print("✓ LASSO completed successfully!")
        print(f"Results: {len(table)} outcomes analyzed")
        
        # Show 5 most important features for each outcome in table format
        if len(table) > 0:
            print(f"\nTop 5 most important features for each outcome:")
            print("="*60)
            
            # Create a summary table
            summary_results = []
            
            for outcome, importance_df in feature_importance_dict.items():
                # Get top 5 features by absolute coefficient value
                top_5_features = importance_df.head(5)  # Already sorted by Abs_Coefficient
                
                for i, (_, row) in enumerate(top_5_features.iterrows(), 1):
                    summary_results.append({
                        'Outcome': outcome,
                        'Rank': i,
                        'Feature': row['Feature'],
                        'Coefficient': row['Coefficient'],
                        'Abs_Coefficient': row['Abs_Coefficient']
                    })
            
            # Convert to DataFrame and display
            summary_df = pd.DataFrame(summary_results)
            
            # Display table for each outcome
            for outcome in summary_df['Outcome'].unique():
                outcome_data = summary_df[summary_df['Outcome'] == outcome]
                print(f"\n{outcome}:")
                print(outcome_data[['Rank', 'Feature', 'Coefficient']].to_string(index=False))
            
            # Save the complete results table (uncomment both lines to write the CSV)
            #summary_df.to_csv('lasso_top5_features_april2025.csv', index=False)
            #print(f"\n✓ Saved complete results to 'lasso_top5_features_april2025.csv'")
            
            # Overall summary
            print(f"\n" + "="*60)
            print(f"OVERALL SUMMARY")
            print(f"="*60)
            print(f"Total outcomes analyzed: {len(feature_importance_dict)}")
            print(f"Average features selected per outcome: {table['num_selected'].mean():.1f}")
            
            # Features that appear most often among the per-outcome top 5
            all_features = summary_df['Feature'].value_counts()
            if len(all_features) > 0:
                print(f"\nFeatures most frequently in the top 5 (top 10):")
                for feature, count in all_features.head(10).items():
                    print(f"  {feature}: in the top 5 for {count} outcomes")
        
    except Exception as e:
        print(f"❌ LASSO failed: {e}")
        import traceback
        traceback.print_exc()
        
else:
    print("❌ No suitable outcomes or predictors found")
    print(f"Outcomes: {len(good_outcomes)}, Predictors: {len(good_predictors)}")

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

def run_lasso_feature_selection(df, outcome_columns, predictor_columns=None, cv=5):
    """
    Simple LASSO feature selection for multiple outcomes.
    
    Args:
        df: DataFrame with outcome and predictor columns
        outcome_columns: List of outcome column names
        predictor_columns: List of predictor column names (optional)
        cv: Number of cross-validation folds
    
    Returns:
        results_df: DataFrame with outcomes and their selected features
        feature_importance_dict: Dictionary with feature importance for each outcome
    """
    
    # Set default predictors if not provided
    if predictor_columns is None:
        exclude_cols = outcome_columns + ['time_point', 'time_point2', 'ai_base_score', 'id_it']
        predictor_columns = [col for col in df.columns if col not in exclude_cols]
    
    # Filter to existing columns
    predictor_columns = [col for col in predictor_columns if col in df.columns]
    print(f"Using {len(predictor_columns)} predictor columns")
    
    results = []
    feature_importance_dict = {}
    
    for outcome in outcome_columns:
        print(f"\nProcessing: {outcome}")
        
        if outcome not in df.columns:
            print(f"  Skipping - not found in dataframe")
            continue
        
        # Prepare data
        data = df.dropna(subset=[outcome]).copy()
        
        if len(data) < 50:
            print(f"  Skipping - only {len(data)} observations")
            continue
        
        # Convert outcome to numeric; unparseable values become NaN and are dropped
        try:
            data[outcome] = pd.to_numeric(data[outcome], errors='coerce')
            data = data.dropna(subset=[outcome])
        except (TypeError, ValueError):
            print(f"  Skipping - cannot convert to numeric")
            continue
        
        if len(data) < 50:
            print(f"  Skipping - only {len(data)} observations after conversion")
            continue
        
        # Prepare features and target
        X = data[predictor_columns]
        y = data[outcome]
        
        # Remove constant columns
        constant_cols = X.columns[X.nunique() <= 1]
        if len(constant_cols) > 0:
            X = X.drop(columns=constant_cols)
            print(f"  Removed {len(constant_cols)} constant columns")
        
        # Remove completely empty columns
        empty_cols = X.columns[X.isnull().all()]
        if len(empty_cols) > 0:
            X = X.drop(columns=empty_cols)
            print(f"  Removed {len(empty_cols)} empty columns")
        
        if X.shape[1] == 0:
            print(f"  Skipping - no valid predictors")
            continue
        
        # Create simple pipeline
        pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler()),
            ('lasso', LassoCV(cv=cv, random_state=0, max_iter=2000))
        ])
        
        # Fit model
        try:
            pipeline.fit(X, y)
            
            # Get selected features
            coefs = pipeline.named_steps['lasso'].coef_
            selected_features = X.columns[abs(coefs) > 1e-10].tolist()
            
            # Create feature importance DataFrame
            feature_importance = pd.DataFrame({
                'Feature': X.columns,
                'Coefficient': coefs,
                'Abs_Coefficient': abs(coefs)
            }).sort_values('Abs_Coefficient', ascending=False)
            
            # Store results
            results.append({
                'outcome': outcome,
                'selected_features': ', '.join(selected_features),
                'num_selected': len(selected_features),
                'observations': len(data),
                'best_alpha': pipeline.named_steps['lasso'].alpha_
            })
            
            feature_importance_dict[outcome] = feature_importance
            
            print(f"  ✓ Selected {len(selected_features)} features from {len(data)} observations")
            
        except Exception as e:
            print(f"  ✗ Error: {e}")
            continue
    
    # Create results DataFrame
    results_df = pd.DataFrame(results)
    
    print(f"\n✓ Successfully processed {len(results)} outcomes")
    
    return results_df, feature_importance_dict

def analyze_results(results_df, feature_importance_dict, top_n=10):
    """
    Simple analysis of LASSO results.
    """
    print(f"\n{'='*50}")
    print("LASSO RESULTS SUMMARY")
    print(f"{'='*50}")
    
    if len(results_df) == 0:
        print("No outcomes were successfully processed")
        return
    
    # Summary statistics
    print(f"Total outcomes processed: {len(results_df)}")
    print(f"Average features selected: {results_df['num_selected'].mean():.1f}")
    print(f"Range of features selected: {results_df['num_selected'].min()}-{results_df['num_selected'].max()}")
    
    # Most frequently selected features
    all_features = []
    for outcome, importance_df in feature_importance_dict.items():
        selected = importance_df[importance_df['Abs_Coefficient'] > 1e-10]['Feature'].tolist()
        all_features.extend(selected)
    
    if all_features:
        feature_counts = pd.Series(all_features).value_counts()
        print(f"\nTop {top_n} most frequently selected features:")
        for feature, count in feature_counts.head(top_n).items():
            print(f"  {feature}: selected in {count} outcomes")
    
    # Show outcomes with most features selected
    print(f"\nOutcomes with most features selected:")
    top_outcomes = results_df.nlargest(5, 'num_selected')
    for _, row in top_outcomes.iterrows():
        print(f"  {row['outcome']}: {row['num_selected']} features")
    
    return feature_counts if all_features else pd.Series(dtype='int64')
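
The impute, scale, and `LassoCV` pipeline used inside `run_lasso_feature_selection` can be exercised on synthetic data to see how "selected" features are read off the fitted coefficients. The sketch below is a stand-alone illustration with made-up data in which only `x0` drives the outcome; the variable names and the data-generating process are assumptions for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x0", "x1", "x2", "x3"])
y = 3.0 * X["x0"] + rng.normal(scale=0.5, size=n)  # only x0 carries signal
X.iloc[::10, 1] = np.nan  # inject missing values so the imputer has work to do

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("lasso", LassoCV(cv=5, random_state=0, max_iter=2000)),
])
pipe.fit(X, y)

# Coefficients are on the standardized scale; nonzero entries count as selected
coefs = pd.Series(pipe.named_steps["lasso"].coef_, index=X.columns)
selected = coefs[coefs.abs() > 1e-10].index.tolist()
print(coefs.sort_values(key=np.abs, ascending=False))
print(selected)
```

Because the predictors are standardized before the LASSO step, coefficient magnitudes are comparable across features, which is what makes ranking by absolute coefficient meaningful.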



In [None]:
# Simple LASSO analysis with table output
print("Running LASSO analysis...")
print(f"merged_df shape: {merged_df.shape}")

# Find outcome columns
outcome_cols = [col for col in merged_df.columns 
               if any(prefix in col for prefix in ['Timely_Care_', 'Death_Complication_', 'HAC_Reduction_', 
                                                  'Readmission_', 'Medicare_Spending_', 'Unplanned_Visits_'])]

# Filter to good outcomes (< 50% missing)
good_outcomes = [col for col in outcome_cols 
                if (merged_df[col].isnull().sum() / len(merged_df)) * 100 < 50]

# Filter to good predictors (< 75% missing)  
good_predictors = [col for col in all_covariates 
                  if col in merged_df.columns and 
                  (merged_df[col].isnull().sum() / len(merged_df)) * 100 < 75]

print(f"Using {len(good_outcomes)} outcomes and {len(good_predictors)} predictors")

# Run LASSO
if good_outcomes and good_predictors:
    table, feature_importance_dict = lasso_covariate_table(
        merged_df, good_outcomes, predictor_columns=good_predictors
    )
    
    # Create simple dataframe with top 5 features as columns
    summary_data = []
    for outcome, importance_df in feature_importance_dict.items():
        top_5 = importance_df.head(5)['Feature'].tolist()
        
        # Pad with empty strings if less than 5 features
        while len(top_5) < 5:
            top_5.append('')
            
        summary_data.append({
            'Outcome': outcome,
            'Feature_1': top_5[0],
            'Feature_2': top_5[1], 
            'Feature_3': top_5[2],
            'Feature_4': top_5[3],
            'Feature_5': top_5[4]
        })
    
    # Create and display dataframe
    results_df = pd.DataFrame(summary_data)
    print("\nTop 5 Features for Each Outcome:")
    print("="*80)
    print(results_df.to_string(index=False))
    
    # Save results (uncomment both lines to write the CSV)
    #results_df.to_csv('lasso_top5_simple.csv', index=False)
    #print(f"\n✓ Saved to lasso_top5_simple.csv")
    
else:
    print("❌ No suitable data found")

In [399]:
import pandas as pd

def calculate_base_ai_implementation_row(row):
    """
    Calculate base AI implementation score for a single row (hospital).
    
    Args:
        row: A pandas Series representing a single hospital row
        
    Returns:
        float: Base AI implementation score
    """
    # Base AI implementation score (continuous)
    # Return None if the input value is null
    if pd.isna(row['aipred_it']):
        return None
    elif row['aipred_it'] == 1:  # Machine Learning
        return 2
    elif row['aipred_it'] == 2:  # Other Non-Machine Learning Predictive Models
        return 1
    else:  # Neither (3) or Do not know (4)
        return 0

def calculate_ai_implementation_breadth_row(row):
    """
    Calculate AI implementation breadth score for a single row (hospital).
    
    Args:
        row: A pandas Series representing a single hospital row
        
    Returns:
        float: AI implementation breadth score
    """
    # Start with base score
    base_score = calculate_base_ai_implementation_row(row)
    if base_score is None:
        return None
    elif base_score == 0:
        return 0
    else:
        breadth_score = base_score
        # Implementation Breadth Score - count use cases
        use_case_cols = ['aitraj_it', 'airfol_it', 'aimhea_it', 'airect_it', 
                     'aibill_it', 'aische_it', 'aipoth_it', 'aicloth_it']
        for col in use_case_cols:
            # Missing survey answers arrive as NaN, not None; skip them so the
            # score is not turned into NaN
            if pd.isna(row[col]):
                breadth_score += 0
            else:
                breadth_score += row[col] * 0.25  # 0.25 points per use case
        return breadth_score

def calculate_ai_development_row(row):
    """
    Calculate AI development score for a single row (hospital).
    
    Args:
        row: A pandas Series representing a single hospital row
        
    Returns:
        float: AI development score
    """
    # Start with base score
    base_score = calculate_base_ai_implementation_row(row)
    if base_score is None:
        return None
    elif base_score == 0:
        return 0 
    else:
        dev_score = base_score
        if 'mlsed_it' in row and pd.notna(row['mlsed_it']):
            dev_score += row['mlsed_it'] * 2  # Self-developed
        if 'mldev_it' in row and pd.notna(row['mldev_it']):
            dev_score += row['mldev_it']  # EHR developer
        if 'mlthd_it' in row and pd.notna(row['mlthd_it']):
            dev_score += row['mlthd_it']  # Third-party
        if 'mlpubd_it' in row and pd.notna(row['mlpubd_it']):
            dev_score += row['mlpubd_it'] * 0.5  # Public domain
        return dev_score

def calculate_ai_evaluation_row(row):
    """
    Calculate AI evaluation score for a single row (hospital).
    
    Args:
        row: A pandas Series representing a single hospital row
        
    Returns:
        float: AI evaluation score
    """
    # Start with base score
    base_score = calculate_base_ai_implementation_row(row)
    if base_score is None:
        return None
    elif base_score == 0:
        return 0
    else:
        eval_score = base_score
        # For model accuracy (MLACCU)
        if pd.isna(row['mlaccu_it']):
            eval_score += 0
        elif row['mlaccu_it'] == 1:  # All models
            eval_score += 1
        elif row['mlaccu_it'] == 2:  # Most models
            eval_score += 0.75
        elif row['mlaccu_it'] == 3:  # Some models
            eval_score += 0.5
        elif row['mlaccu_it'] == 4:  # Few models
            eval_score += 0.25
        # For None (5) or Do not know (6), no points added
    
        # For model bias (MLBIAS)
        if pd.isna(row['mlbias_it']):
            eval_score += 0
        elif row['mlbias_it'] == 1:  # All models
            eval_score += 1
        elif row['mlbias_it'] == 2:  # Most models
            eval_score += 0.75
        elif row['mlbias_it'] == 3:  # Some models
            eval_score += 0.5
        elif row['mlbias_it'] == 4:  # Few models
            eval_score += 0.25
        # For None (5) or Do not know (6), no points added
    
        return eval_score

def calculate_all_ai_scores_row(row):
    """
    Calculate all AI/ML implementation scores as continuous measures for a single row.
    
    Args:
        row: A pandas Series representing a single hospital row
        
    Returns:
        dict: Dictionary with all calculated scores
    """
    # Calculate all scores
    base_score = calculate_base_ai_implementation_row(row)
    breadth_score = calculate_ai_implementation_breadth_row(row)
    dev_score = calculate_ai_development_row(row)
    eval_score = calculate_ai_evaluation_row(row)
    
    return {
        'ai_base_score': base_score,
        'ai_base_breadth_score': breadth_score,
        'ai_base_dev_score': dev_score,
        'ai_base_eval_score': eval_score
    }

def apply_ai_scores_to_dataframe(df):
    """
    Apply all AI score calculations row by row to a dataframe.
    
    Args:
        df: A pandas DataFrame with hospital data
        
    Returns:
        pandas.DataFrame: DataFrame with added AI score columns
    """
    # Work on a copy so a filtered slice (e.g. AHA_IT) is not modified in place
    df = df.copy()

    # Initialize empty columns for scores
    df['ai_base_score'] = float('nan')
    df['ai_base_breadth_score'] = float('nan')
    df['ai_base_dev_score'] = float('nan')
    df['ai_base_eval_score'] = float('nan')
    
    # Apply row by row calculations
    for index, row in df.iterrows():
        scores = calculate_all_ai_scores_row(row)
        for score_name, score_value in scores.items():
            df.at[index, score_name] = score_value
    
    return df
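
Survey columns read from a CSV hold unanswered items as NaN rather than None, which matters for the missing-value checks in the scoring functions above. The sketch below illustrates the difference on a hypothetical three-column subset of the use-case flags; `breadth_increment`, `USE_CASE_COLS`, and `toy_row` are illustrative names, not part of `calculate_ai_scores`.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the use-case columns; values are 0/1 flags or NaN.
USE_CASE_COLS = ['aitraj_it', 'airfol_it', 'aimhea_it']

def breadth_increment(row, cols=USE_CASE_COLS, weight=0.25):
    """Add `weight` points per reported use case, skipping missing answers."""
    return sum(row[c] * weight for c in cols if pd.notna(row[c]))

# Toy hospital row with one unanswered item (NaN).
toy_row = pd.Series({'aitraj_it': 1, 'airfol_it': np.nan, 'aimhea_it': 1})

print(toy_row['airfol_it'] is None)   # False: NaN is not None
print(pd.isna(toy_row['airfol_it']))  # True
print(breadth_increment(toy_row))
```

An `is None` test lets NaN fall through to the arithmetic branch, and any sum that includes NaN becomes NaN, so `pd.isna` / `pd.notna` is the safer check here.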


In [None]:
AHA_master2 = apply_ai_scores_to_dataframe(AHA_IT)

In [None]:
AHA_master2['delivery_system'] = AHA_master2['cluster_as']
# Create new column 'system_member' based on the conditions
AHA_master2['system_member'] = AHA_master2['mhsmemb_as'].copy()

# Set to 1 where sysid_as is not null and mhsmemb_as is null
AHA_master2.loc[(AHA_master2['sysid_as'].notna()) & (AHA_master2['mhsmemb_as'].isna()), 'system_member'] = 1

# Convert all remaining null values to 0
AHA_master2['system_member'] = AHA_master2['system_member'].fillna(0)
# bed size 
AHA_master2['bedsize'] = AHA_master2['bsc_as'].astype(int)


In [402]:
ai_ml = ['ai_base_score', 'ai_base_breadth_score', 'ai_base_dev_score', 'ai_base_eval_score']
hospital_resource = ['delivery_system', 'system_member', 'bedsize']

In [None]:
general_hospital = pd.read_csv("./data/outcomes/merged_general_hospital_info.csv", low_memory=False)
general_hospital = general_hospital.replace('Not Available', np.nan)
AHA_master3 = AHA_master2[ai_ml + all_covariates + ['id_it', 'mcrnum_as']].copy()  # copy avoids SettingWithCopyWarning
AHA_master3['mcrnum_as'] = AHA_master3['mcrnum_as'].astype(str).str.zfill(6)
for col in ai_ml+hospital_resource:
    AHA_master3[col] = AHA_master3[col].astype(float).fillna(0)

In [None]:
## rural_urban_type
# Continue with CBSA type and other variables
AHA_master2['rural_urban_type'] = AHA_master2['cbsatype_as'].map({
    'Rural': 1,      # Rural = 1 (lowest)
    'Micro': 2,      # Micropolitan = 2 (middle)
    'Metro': 3       # Metropolitan = 3 (highest)
})

## system_member
# Create new column 'system_member' based on the conditions
AHA_master2['system_member'] = AHA_master2['mhsmemb_as'].copy()
# Set to 1 where sysid_as is not null and mhsmemb_as is null
AHA_master2.loc[(AHA_master2['sysid_as'].notna()) & (AHA_master2['mhsmemb_as'].isna()), 'system_member'] = 1
# Convert all remaining null values to 0
AHA_master2['system_member'] = AHA_master2['system_member'].fillna(0)

## AHA System Cluster Code - delivery_system
AHA_master2['delivery_system'] = AHA_master2['cluster_as']

## community_hospital
AHA_master2['community_hospital'] = AHA_master2['chc_as'].replace(2, 0)

## subsidary_hospital
AHA_master2['subsidary_hospital'] = AHA_master2['subs_as']

## frontline_hospital
AHA_master2['frontline_hospital'] = AHA_master2['frtln_as'].replace('.', 0)

## joint_commission_accreditation
AHA_master2['joint_commission_accreditation'] = AHA_master2['mapp1_as'].replace(2,0)

## center_quality
AHA_master2['center_quality'] = AHA_master2['mapp22_as'].replace(2,0)

# teaching hospitals 
AHA_master2['teaching_hospital'] = ((AHA_master2['mapp5_as'] == 1) | (AHA_master2['mapp3_as'] == 1) | (AHA_master2['mapp8_as'] == 1)).astype(int)
AHA_master2['major_teaching_hospital'] = ((AHA_master2['mapp8_as'] == 1)).astype(int)
AHA_master2['minor_teaching_hospital'] = (((AHA_master2['mapp5_as'] == 1) | (AHA_master2['mapp3_as'] == 1))&~(AHA_master2['mapp8_as'] == 1)).astype(int)

# critical access hospital
AHA_master2['critical_access'] = (AHA_master2['mapp18_as'] == 1).astype(int)


# rural referral center 
AHA_master2['rural_referral'] = (AHA_master2['mapp19_as'] == 1).astype(int)

# medicare medicaid percentage
AHA_master2['medicare_ipd_percentage'] = AHA_master2['mcripd_as'] / AHA_master2['ipdtot_as'] * 100
AHA_master2['medicaid_ipd_percentage'] = AHA_master2['mcdipd_as'] / AHA_master2['ipdtot_as'] * 100

# bed size 
AHA_master2['bedsize'] = AHA_master2['bsc_as'].astype(int)

# hospital ownership type 

# NOTE: the 'nonfederal_governement' column keeps its original spelling
AHA_master2['nonfederal_governement'] = AHA_master2['cntrl_as'].isin([12, 13, 14, 15, 16]).astype(int)
AHA_master2['non_profit_nongovernment'] = AHA_master2['cntrl_as'].isin([21, 23]).astype(int)
AHA_master2['for_profit'] = AHA_master2['cntrl_as'].isin([31, 32, 33]).astype(int)
AHA_master2['federal_government'] = AHA_master2['cntrl_as'].isin([40, 44, 45, 46, 47, 48]).astype(int)
# Create a categorical column for hospital ownership types
def create_ownership_category(row):
    if row['cntrl_as'] in [12, 13, 14, 15, 16]:
        return 'nonfederal_government'
    elif row['cntrl_as'] in [21, 23]:
        return 'non_profit_nongovernment'
    elif row['cntrl_as'] in [31, 32, 33]:
        return 'for_profit'
    elif row['cntrl_as'] in [40, 44, 45, 46, 47, 48]:
        return 'federal_government'
    else:
        return 'other'

# Create the categorical column
AHA_master2['ownership_type'] = AHA_master2.apply(create_ownership_category, axis=1)
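
The row-wise `apply` above can also be expressed as a dictionary lookup, which keeps the code groups in one place. This is a minimal sketch on toy data, assuming the same control-code groups as `create_ownership_category`; `OWNERSHIP_GROUPS`, `CODE_TO_OWNERSHIP`, and `toy` are illustrative names.

```python
import pandas as pd

# Control codes grouped as in create_ownership_category above.
OWNERSHIP_GROUPS = {
    'nonfederal_government': [12, 13, 14, 15, 16],
    'non_profit_nongovernment': [21, 23],
    'for_profit': [31, 32, 33],
    'federal_government': [40, 44, 45, 46, 47, 48],
}

# Invert to a code -> category lookup.
CODE_TO_OWNERSHIP = {
    code: name for name, codes in OWNERSHIP_GROUPS.items() for code in codes
}

# Toy data; 99 stands in for any code outside the listed groups.
toy = pd.DataFrame({'cntrl_as': [12, 23, 33, 45, 99]})
toy['ownership_type'] = toy['cntrl_as'].map(CODE_TO_OWNERSHIP).fillna('other')

# One indicator column per category, if 0/1 flags are also needed.
ownership_dummies = pd.get_dummies(toy['ownership_type'], dtype=int)
print(toy)
print(ownership_dummies)
```

Deriving the indicator columns from the single categorical column means the two representations cannot drift apart.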



In [None]:
AHA_master3 = AHA_master2[ai_ml + all_covariates + ['id_it', 'mcrnum_as']].copy()  # copy avoids SettingWithCopyWarning
AHA_master3['mcrnum_as'] = AHA_master3['mcrnum_as'].astype(str).str.zfill(6)
for col in ai_ml+hospital_resource:
    AHA_master3[col] = AHA_master3[col].astype(float).fillna(0)

In [None]:
# First, let's see what we're dealing with
print("Sample of mcrnum_as values:")
print(AHA_master3['mcrnum_as'].head(10))

# Then try this step-by-step conversion
# 1. First replace any 'nan' strings with actual NaN
AHA_master3['mcrnum_as'] = AHA_master3['mcrnum_as'].replace('nan', np.nan)

# 2. Convert to numeric, coercing errors to NaN
AHA_master3['mcrnum_as'] = pd.to_numeric(AHA_master3['mcrnum_as'], errors='coerce')

# 3. Convert to integer (this will handle the decimal points)
AHA_master3['mcrnum_as'] = AHA_master3['mcrnum_as'].fillna(-1).astype(int)

# 4. Convert to string and pad with zeros
AHA_master3['mcrnum_as'] = AHA_master3['mcrnum_as'].apply(lambda x: str(x).zfill(6) if x != -1 else np.nan)

# Check the result
print("\nResult after conversion:")
print(AHA_master3['mcrnum_as'].head(10))
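
The four conversion steps above can be folded into one helper that is easy to test on its own. The following is a sketch, not the notebook's method: `normalize_ccn` is a hypothetical name, and unlike the cell above it keeps identifiers that contain letters instead of coercing them to missing, which is a design choice to confirm against the data.

```python
import math

def normalize_ccn(value, width=6):
    """Return a zero-padded provider-number string, or None when missing."""
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return None
        value = int(value)          # 10001.0 -> 10001
    text = str(value).strip()
    if text.lower() in ('', 'nan'):
        return None
    if text.endswith('.0'):         # string form of a float, e.g. '10001.0'
        text = text[:-2]
    return text.zfill(width)

for raw in [10001.0, '10001.0', 450023, 'nan', float('nan')]:
    print(repr(raw), '->', repr(normalize_ccn(raw)))
```

On a DataFrame this could be applied with `AHA_master3['mcrnum_as'].map(normalize_ccn)`.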

In [407]:
death_complication = pd.read_csv("./data/outcomes/merged_death_complication.csv", low_memory=False)
death_complication = death_complication.replace('Not Available', np.nan)
death_complication_AHA = AHA_master3.merge(death_complication, left_on = 'mcrnum_as', right_on = 'Facility ID', how = 'left') 
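
Because this is a left merge on a string key, it can be worth checking how many hospitals actually found a match in the outcome file. A minimal sketch on toy frames, using the `indicator` option of `DataFrame.merge`; the data values and the `match_rate` name are illustrative.

```python
import pandas as pd

# Toy stand-ins for AHA_master3 and the outcome file.
left = pd.DataFrame({
    'mcrnum_as': ['010001', '010005', None],
    'id_it': [1, 2, 3],
})
right = pd.DataFrame({
    'Facility ID': ['010001', '010001', '020002'],
    'Score': [1.2, 1.4, 0.9],
})

merged = left.merge(
    right, left_on='mcrnum_as', right_on='Facility ID',
    how='left', indicator=True,   # adds a '_merge' column
)

# Share of hospitals (not rows) with at least one matching outcome record.
per_hospital = merged.drop_duplicates('id_it')
match_rate = (per_hospital['_merge'] == 'both').mean()
print(merged[['id_it', 'mcrnum_as', 'Score', '_merge']])
print(f"Hospitals matched: {match_rate:.1%}")
```

The match rate is computed per hospital because the outcome file holds several rows per facility (one per time point and measure), so row counts alone overstate coverage.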

In [408]:
# Convert a 'MM_YYYY' time point into a sortable 'YYYY_MM' string
def get_sort_index(tp):
    try:
        if pd.isna(tp) or tp == 'nan':
            return np.nan
        month, year = tp.split('_')
        return f"{year}_{month.zfill(2)}"  # Returns "2022_01" format
    except (AttributeError, ValueError):
        # Non-string input, or a value without exactly one underscore
        return np.nan

In [409]:
import statsmodels.formula.api as smf  # needed for smf.mixedlm below

def visualize_hospital_quality_over_time(df, outcome_column):
    # Set publication-quality settings
    plt.style.use('seaborn-v0_8-whitegrid')
    plt.rcParams.update({
        'font.family': 'Helvetica',
        'font.size': 12,
        'axes.labelsize': 12,
        'axes.titlesize': 14,
        'xtick.labelsize': 10,
        'ytick.labelsize': 10,
        'legend.fontsize': 10,
        'figure.dpi': 300,
        'savefig.dpi': 300,
        'pdf.fonttype': 42,
        'ps.fonttype': 42,
        'axes.linewidth': 1,
        'axes.grid': True,
        'grid.alpha': 0.3,
        'grid.linestyle': '--',
        'grid.linewidth': 0.5
    })

    # Data preparation
    df[outcome_column] = pd.to_numeric(df[outcome_column], errors='coerce')
    df['time_point'] = df['time_point'].astype(str).replace('nan', np.nan)
    
    # Apply the function to transform date format within this function
    df['time_point2'] = df['time_point'].apply(get_sort_index)
    model_df = df[[outcome_column, 'time_point2', 'ai_base_score', 'id_it']].dropna()
    model_df['ai_cat'] = model_df['ai_base_score'].astype('category')
    
    # Fit model
    model = smf.mixedlm(f'{outcome_column} ~ time_point2 + ai_cat + time_point2:ai_cat',
                    model_df, groups='id_it').fit(reml=False)
    
    # Define ALL quarters with numeric indices for proper ordering
    quarter_data = {
        '2022_01': {'idx': 0, 'label': 'Q1 2022'},
        '2022_04': {'idx': 1, 'label': 'Q2 2022'},
        '2022_07': {'idx': 2, 'label': 'Q3 2022'}, 
        '2022_10': {'idx': 3, 'label': 'Q4 2022'},
        '2023_01': {'idx': 4, 'label': 'Q1 2023'},
        '2023_04': {'idx': 5, 'label': 'Q2 2023'},
        '2023_07': {'idx': 6, 'label': 'Q3 2023'},
        '2023_10': {'idx': 7, 'label': 'Q4 2023'},
        '2024_01': {'idx': 8, 'label': 'Q1 2024'},
        '2024_04': {'idx': 9, 'label': 'Q2 2024'}, 
        '2024_07': {'idx': 10, 'label': 'Q3 2024'}, 
        '2024_10': {'idx': 11, 'label': 'Q4 2024'}, 
        '2025_02': {'idx': 12, 'label': 'Q1 2025'},
        '2025_04': {'idx': 13, 'label': 'Q2 2025'},
    }
    
    # Calculate summary statistics
    summary = (model_df
           .groupby(['time_point2', 'ai_cat'])[outcome_column]
           .agg(['mean', 'sem'])
           .reset_index())
    
    # Add numeric indices for proper sorting
    summary['x_idx'] = summary['time_point2'].map(lambda x: quarter_data.get(x, {}).get('idx', -1))
    
    # Create figure with specific size and DPI
    fig, ax = plt.subplots(figsize=(10, 6), dpi=300)
    
    # Define custom color palette
    colors = {
        0: '#808080',  # Gray for AI 0
        1: '#2ca02c',  # Green for AI 1
        2: '#1f77b4'   # Blue for AI 2
    }
    
    # Plot for each AI level using numeric indices for x-axis
    for ai in sorted(summary['ai_cat'].unique()):
        subset = summary[summary['ai_cat'] == ai].sort_values('x_idx')
        
        # Plot lines connecting available data points
        ax.plot(subset['x_idx'], subset['mean'], 
                label=f'AI {int(ai)}', 
                marker='o',
                color=colors[int(ai)],
                linewidth=2,
                markersize=8,
                markeredgecolor='white',
                markeredgewidth=1)
        
        # Add error bands
        ax.fill_between(subset['x_idx'],
                     subset['mean'] - subset['sem'],
                     subset['mean'] + subset['sem'],
                     alpha=0.2,
                     color=colors[int(ai)])

    # Set x-ticks at all quarter positions
    x_ticks = [v['idx'] for v in quarter_data.values()]
    x_labels = [v['label'] for v in quarter_data.values()]
    
    # Ensure x-axis shows ALL quarters in proper order
    ax.set_xticks(x_ticks)
    ax.set_xticklabels(x_labels, rotation=45, ha='right')
    
    # Set x-axis limits to ensure we start at Q1 2022
    ax.set_xlim(-0.5, max(x_ticks) + 0.5)
    
    # Calculate appropriate y-axis limits
    y_min = summary['mean'].min() - 2 * summary['sem'].max()
    y_max = summary['mean'].max() + 2 * summary['sem'].max()
    
    # Add some padding
    y_range = y_max - y_min
    y_min = max(0, y_min - 0.05 * y_range)  # Start at 0 if data allows
    y_max = y_max + 0.05 * y_range
    
    # Set y-axis limits
    ax.set_ylim(y_min, y_max)
    
    # Customize the plot
    ax.set_xlabel("Quarter", fontweight='bold', labelpad=10)
    ax.set_ylabel("Quality Metric (Mean ± SEM)", fontweight='bold', labelpad=10)
    ax.set_title(f"{outcome_column} Over Time by AI Level", 
                 fontweight='bold', pad=20)
    
    # Add legend with custom formatting
    legend = ax.legend(title="AI Level", 
                      title_fontsize=12,
                      frameon=True,
                      framealpha=0.95,
                      edgecolor='black',
                      loc='best')
    
    # Add grid with custom styling
    ax.grid(True, linestyle='--', alpha=0.3)
    
    # Remove top and right spines
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    
    # Adjust layout
    plt.tight_layout()
    
    # Print the number of observations behind each plotted point
    print("Data points by time and AI level:")
    crosstab = pd.crosstab(model_df['time_point2'], model_df['ai_cat'])
    print(crosstab)
    
    plt.show()

    # Print model summary statistics
    print("\nModel Summary Statistics:")
    print(f"Number of observations: {len(model_df)}")
    print(f"Number of hospitals: {model_df['id_it'].nunique()}")
    print("\nFixed Effects:")
    print(model.summary().tables[1])
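
The plotted bands are the group mean plus or minus one standard error of the mean (SEM), computed per time point and AI level. A small sketch on toy numbers shows the aggregation and why the per-cell count should accompany it; the `score` column and its values are made up for illustration.

```python
import pandas as pd

# Toy long-format data: two hospitals per (time point, AI level) cell.
toy = pd.DataFrame({
    'time_point2': ['2022_01'] * 4 + ['2022_04'] * 4,
    'ai_cat': [0, 0, 1, 1] * 2,
    'score': [10.0, 12.0, 9.0, 11.0, 11.0, 13.0, 8.0, 10.0],
})

# Mean, standard error of the mean, and number of observations per cell.
summary = (toy
           .groupby(['time_point2', 'ai_cat'])['score']
           .agg(['mean', 'sem', 'count'])
           .reset_index())

# Bounds of the shaded band drawn by fill_between.
summary['lower'] = summary['mean'] - summary['sem']
summary['upper'] = summary['mean'] + summary['sem']
print(summary)
```

A cell with very few observations yields a wide or undefined SEM, so the `count` column helps when reading the bands.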

In [224]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import patsy

def visualize_adjusted_hospital_quality_over_time(
    df, outcome_column, additional_covariates=None
):
    """
    Plot adjusted trends of a quality metric over time by AI level,
    adjusting for optional selected covariates.
    
    Parameters:
    - df: DataFrame with columns:
        * outcome_column (continuous)
        * time_point (str in 'MM_YYYY' form, e.g., '01_2022'; converted to 'YYYY_MM' by get_sort_index)
        * ai_base_score (numeric or categorical)
        * id_it (hospital identifier)
        * any additional covariates (columns in df) if provided
    - outcome_column: str name of the quality metric column
    - additional_covariates: list of column names to include as fixed effects
    """
    # Publication-style settings
    plt.style.use('seaborn-v0_8-whitegrid')
    plt.rcParams.update({
        'font.family': 'Helvetica',
        'font.size': 12,
        'axes.labelsize': 12,
        'axes.titlesize': 14,
        'xtick.labelsize': 10,
        'ytick.labelsize': 10,
        'legend.fontsize': 10,
        'figure.dpi': 300,
        'axes.linewidth': 1,
    })
    
    # Copy & core prep
    df = df.copy()
    df[outcome_column] = pd.to_numeric(df[outcome_column], errors='coerce')
    df['time_point2'] = df['time_point'].apply(get_sort_index)
    df['ai_cat'] = df['ai_base_score'].astype('category')
    
    # Build model DataFrame
    cols = [outcome_column, 'time_point2', 'ai_cat', 'id_it']
    if additional_covariates:
        cols += additional_covariates
    model_df = df[cols].dropna()
    model_df['time_num'] = model_df['time_point2'].apply(convert_time_to_numeric)
    
    # Construct formula
    base_terms = ["time_num * ai_cat"]
    cov_terms = additional_covariates if additional_covariates else []
    formula = f"{outcome_column} ~ " + " + ".join(base_terms + cov_terms)
    
    # Fit mixed-effects model
    model = smf.mixedlm(formula, model_df, groups='id_it', re_formula="~time_num")
    result = model.fit(reml=False)
    
    # Quarter mapping
    quarter_data = {
        '2022_01': (0, 'Q1 2022'), '2022_04': (1, 'Q2 2022'),
        '2022_07': (2, 'Q3 2022'), '2022_10': (3, 'Q4 2022'),
        '2023_01': (4, 'Q1 2023'), '2023_04': (5, 'Q2 2023'),
        '2023_07': (6, 'Q3 2023'), '2023_10': (7, 'Q4 2023'),
        '2024_01': (8, 'Q1 2024'), '2024_04': (9, 'Q2 2024'),
        '2024_07': (10,'Q3 2024'), '2024_10': (11,'Q4 2024'),
        '2025_02': (12,'Q1 2025'), '2025_04': (13,'Q2 2025')
    }
    
    # Get typical values for covariates
    typicals = {}
    if additional_covariates:
        for cov in additional_covariates:
            if pd.api.types.is_numeric_dtype(model_df[cov]):
                typicals[cov] = model_df[cov].median()
            else:
                typicals[cov] = model_df[cov].mode()[0]
                
    # Build prediction grid
    times = sorted(model_df['time_point2'].unique(), key=lambda x: quarter_data[x][0])
    ai_levels = sorted(model_df['ai_cat'].cat.categories)
    pred_rows = []
    for t in times:
        for a in ai_levels:
            row = {
                'time_point2': t,
                'time_num': convert_time_to_numeric(t),
                'ai_cat': a
            }
            for cov, val in typicals.items():
                row[cov] = val
            pred_rows.append(row)
    pred_df = pd.DataFrame(pred_rows)
    
    # Predict fixed-effects trends; statsmodels rebuilds the design matrix
    # from the fitted formula, so only the right-hand-side columns are needed
    pred_df['predicted'] = result.predict(pred_df)
    pred_df['x_idx'] = pred_df['time_point2'].map(lambda x: quarter_data[x][0])
    pred_df['label'] = pred_df['time_point2'].map(lambda x: quarter_data[x][1])
    
    # Plot
    fig, ax = plt.subplots(figsize=(10, 6), dpi=300)
    colors = {0:'#808080',1:'#2ca02c',2:'#1f77b4'}
    for a in ai_levels:
        sub = pred_df[pred_df['ai_cat']==a].sort_values('x_idx')
        ax.plot(sub['x_idx'], sub['predicted'], marker='o', color=colors[int(a)], label=f"AI {a}")
    ax.set_xticks([v[0] for v in quarter_data.values()])
    ax.set_xticklabels([v[1] for v in quarter_data.values()], rotation=45, ha='right')
    ax.set_xlim(-0.5, max(v[0] for v in quarter_data.values())+0.5)
    ax.set_xlabel("Quarter", fontweight='bold', labelpad=10)
    ax.set_ylabel(f"Adjusted {outcome_column}", fontweight='bold', labelpad=10)
    title = f"{outcome_column} Over Time by AI Level (Adjusted)"
    if additional_covariates:
        title += "\n(adjusted for " + ", ".join(additional_covariates) + ")"
    ax.set_title(title, fontweight='bold', pad=20)
    ax.legend(title="AI Level", frameon=True, edgecolor='black')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    plt.tight_layout()
    plt.show()
    
    # Print summary
    print("MixedLM Results:\n", result.summary())



In [410]:
# Convert a 'YYYY_MM' time point (time_point2) to a decimal year,
# e.g. '2022_04' -> 2022.25
def convert_time_to_numeric(time_str):
    year, month = time_str.split('_')
    return float(year) + (float(month) - 1) / 12  # This will give us years as decimal
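
The two time helpers chain together: a raw 'MM_YYYY' label is first made sortable, then turned into a decimal year. The sketch below uses compact stand-ins (`to_sortable`, `to_decimal_year`) that mirror `get_sort_index` and `convert_time_to_numeric` without the missing-value handling, so it can run on its own.

```python
def to_sortable(tp):
    """'1_2022' -> '2022_01' (stand-in for get_sort_index)."""
    month, year = tp.split('_')
    return f"{year}_{month.zfill(2)}"

def to_decimal_year(time_str):
    """'2022_04' -> 2022.25 (stand-in for convert_time_to_numeric)."""
    year, month = time_str.split('_')
    return float(year) + (float(month) - 1) / 12

for raw in ['1_2022', '4_2022', '10_2023']:
    key = to_sortable(raw)
    print(raw, '->', key, '->', to_decimal_year(key))
```

Since one unit of the numeric time variable is one calendar year, a slope estimated on it is a change per year, and one quarter corresponds to 0.25.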

In [411]:
def calculate_slopes_by_ai_level(df, outcome_column):
    df[outcome_column] = pd.to_numeric(df[outcome_column], errors='coerce')
    df['time_point2'] = df['time_point'].apply(get_sort_index)
    model_df = df[[outcome_column, 'time_point2', 'ai_base_score', 'id_it']].dropna()
    model_df['time_point2_numeric'] = model_df['time_point2'].apply(convert_time_to_numeric)
    model_df['ai_cat'] = model_df['ai_base_score'].astype('category')
    # Now fit the model with the numeric time variable
    model = smf.mixedlm(f'{outcome_column} ~ time_point2_numeric + ai_cat + time_point2_numeric:ai_cat',
                    model_df, groups='id_it')
    result = model.fit(reml=False)
    # Extract coefficients
    coefs = result.params

    # Calculate slopes per group
    slopes = {
    "AI 0 (reference)": coefs["time_point2_numeric"],
    "AI 1": coefs["time_point2_numeric"] + coefs.get("time_point2_numeric:ai_cat[T.1.0]", 0),
    "AI 2": coefs["time_point2_numeric"] + coefs.get("time_point2_numeric:ai_cat[T.2.0]", 0)
    }

    # Format as a DataFrame
    slope_df = pd.DataFrame.from_dict(slopes, orient='index', columns=['Estimated Slope (per year)'])
    slope_df.index.name = "AI Level"

    # Display the results
    print("\nEstimated Slopes by AI Level:")
    print(slope_df)

    # Show the full model summary
    print("\nModel Summary:")
    print(result.summary())
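
The slope for each non-reference AI level is the reference slope plus that level's interaction coefficient. The lookup above hard-codes the coefficient labels and falls back to 0 when a label is absent, which would silently hide a naming mismatch. A sketch of a label-driven alternative follows; `slopes_from_params` and the numbers in `toy_params` are hypothetical, and the label pattern is assumed to match the one used in the cell above.

```python
import pandas as pd

def slopes_from_params(params, time_var='time_point2_numeric', factor='ai_cat'):
    """Per-level slopes: reference slope plus each interaction coefficient."""
    base = params[time_var]
    prefix = f'{time_var}:{factor}[T.'
    slopes = {'reference': base}
    for name, value in params.items():
        if name.startswith(prefix) and name.endswith(']'):
            level = name[len(prefix):-1]   # e.g. '1.0'
            slopes[level] = base + value
    return pd.Series(slopes, name='slope_per_year')

# Made-up coefficients, laid out like result.params from the model above.
toy_params = pd.Series({
    'Intercept': 10.0,
    'ai_cat[T.1.0]': 0.3,
    'ai_cat[T.2.0]': 0.5,
    'time_point2_numeric': -0.20,
    'time_point2_numeric:ai_cat[T.1.0]': 0.05,
    'time_point2_numeric:ai_cat[T.2.0]': -0.10,
})

toy_slopes = slopes_from_params(toy_params)
print(toy_slopes)
```

If no interaction labels are found, the returned Series contains only the reference slope, which makes a mismatch visible rather than reporting identical slopes for every level.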

In [412]:
def calculate_adjusted_slopes_by_ai_level(df, outcome_column, additional_covariates=None):
    """
    Fit a mixed-effects model for outcome over time by AI level,
    adjusting for any selected covariates, and compute slopes per AI group.

    Parameters:
    - df: pandas.DataFrame with required columns
    - outcome_column: name of the outcome variable
    - additional_covariates: list of column names to include as fixed effects

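    Returns:
    - result: fitted mixedlm result object
    - slope_df: pandas.DataFrame with estimated slopes per AI level
    - results_table: pandas.DataFrame of coefficients, standard errors, p-values, and confidence bounds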
    """
    df = df.copy()
    df[outcome_column] = pd.to_numeric(df[outcome_column], errors='coerce')
    df['time_point2'] = df['time_point'].apply(get_sort_index)
    cols = [outcome_column, 'time_point2', 'ai_base_score', 'id_it']
    if additional_covariates:
        cols += additional_covariates
    model_df = df[cols].dropna()
    
    # Numeric time and categorical AI
    model_df['time_point2_numeric'] = model_df['time_point2'].apply(convert_time_to_numeric)
    model_df['ai_cat'] = model_df['ai_base_score'].astype('category')
    
    # Build formula string
    base_terms = ['time_point2_numeric', 'ai_cat', 'time_point2_numeric:ai_cat']
    cov_terms = additional_covariates or []
    formula = f"{outcome_column} ~ " + " + ".join(base_terms + cov_terms)
    
    # Fit mixed-effects model
    model = smf.mixedlm(formula, model_df, groups='id_it')
    result = model.fit(reml=False)
    
    # Extract slopes for each AI category
    coefs = result.params
    slopes = {
        "AI 0 (reference)": coefs["time_point2_numeric"],
        "AI 1": coefs["time_point2_numeric"] + coefs.get("time_point2_numeric:ai_cat[T.1.0]", 0),
        "AI 2": coefs["time_point2_numeric"] + coefs.get("time_point2_numeric:ai_cat[T.2.0]", 0)
    }
    
    # Create detailed results table
    results_table = pd.DataFrame({
        'Coefficient': result.params,
        'Std Error': result.bse,
        'P-value': result.pvalues,
        'Lower CI': result.conf_int()[0],
        'Upper CI': result.conf_int()[1]
    }).round(5)
    
    # Create slope dataframe with more information
    slope_df = pd.DataFrame.from_dict(slopes, orient='index', columns=['Estimated Slope'])
    slope_df.index.name = "AI Level"
    slope_df = slope_df.round(5)
    
    # Display results
    print("\nSelected covariates:", additional_covariates)
    print("\nEstimated Slopes by AI Level:")
    print(slope_df)
    
    print("\nDetailed Model Results:")
    print(results_table)
    
    print("\nModel Summary:")
    print(result.summary())
    
    # Reset display options
    pd.reset_option('display.float_format')
    
    return result, slope_df, results_table

In [413]:
def get_list_covariate(result, outcome):
    """
    Extract the top 5 features for a specific outcome from lasso results.
    Allows partial matching of outcome names.
    
    Args:
        result (pd.DataFrame): DataFrame containing lasso results
        outcome (str): The outcome name to filter for (can be partial)
    
    Returns:
        list: List of top 5 features for the specified outcome
    """
    # Find all outcomes that contain the search string
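    # regex=False treats the search string literally (outcome names may contain parentheses)
    matching_outcomes = result[result['Outcome'].str.contains(outcome, case=False, regex=False, na=False)]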
    
    if matching_outcomes.empty:
        available_outcomes = result['Outcome'].tolist()
        raise ValueError(f"No outcomes found containing '{outcome}'. Available outcomes are:\n" + 
                       "\n".join(available_outcomes))
    
    if len(matching_outcomes) > 1:
        print(f"Warning: Multiple outcomes found containing '{outcome}':")
        for idx, row in matching_outcomes.iterrows():
            print(f"- {row['Outcome']}")
        print("Using the first match.")
    
    result_row = matching_outcomes.iloc[[0]]  # Take the first match
    
    features = []
    for i in range(5):
        num = i + 1
        string = 'Feature_' + str(num)
        features.append(result_row[string].iloc[0])
    return features
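
A short usage sketch may help show the table layout this lookup expects: one row per outcome, with columns `Outcome` and `Feature_1` through `Feature_5`. The helper `top_features` below is a compact, hypothetical variant written so the example runs on its own, and the outcome and feature names in `toy_lasso` are invented.

```python
import pandas as pd

def top_features(lasso_results, outcome, k=5):
    """Return the first k features of the first outcome matching `outcome`."""
    mask = lasso_results['Outcome'].str.contains(
        outcome, case=False, regex=False, na=False)  # literal, case-insensitive
    matches = lasso_results[mask]
    if matches.empty:
        raise ValueError(f"No outcomes found containing '{outcome}'")
    first = matches.iloc[0]
    return [first[f'Feature_{i}'] for i in range(1, k + 1)]

# Invented LASSO summary table with the expected column layout.
toy_lasso = pd.DataFrame({
    'Outcome': ['Death rate (CABG)', 'Readmission rate'],
    'Feature_1': ['bedsize', 'system_member'],
    'Feature_2': ['teaching_hospital', 'bedsize'],
    'Feature_3': ['for_profit', 'critical_access'],
    'Feature_4': ['rural_referral', 'for_profit'],
    'Feature_5': ['system_member', 'teaching_hospital'],
})

print(top_features(toy_lasso, '(cabg)'))
```

The literal match matters for names with parentheses, which a regular-expression search would interpret as a capture group.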

In [414]:
def visualize_adjusted_hospital_quality_over_time(df, outcome_column, additional_covariates):
    # Set publication-quality settings
    plt.style.use('seaborn-v0_8-whitegrid')
    plt.rcParams.update({
        'font.family': 'Helvetica',
        'font.size': 12,
        'axes.labelsize': 12,
        'axes.titlesize': 14,
        'xtick.labelsize': 10,
        'ytick.labelsize': 10,
        'legend.fontsize': 10,
        'figure.dpi': 300,
        'savefig.dpi': 300,
        'pdf.fonttype': 42,
        'ps.fonttype': 42,
        'axes.linewidth': 1,
        'axes.grid': True,
        'grid.alpha': 0.3,
        'grid.linestyle': '--',
        'grid.linewidth': 0.5
    })

    # Data preparation
    print("Data preparation and validation:")
    print(f"Total rows in dataset: {len(df)}")
    
    # Convert outcome to numeric and handle missing values
    df[outcome_column] = pd.to_numeric(df[outcome_column], errors='coerce')
    df['time_point'] = df['time_point'].astype(str).replace('nan', np.nan)
    
    # Apply the function to transform date format
    df['time_point2'] = df['time_point'].apply(get_sort_index)
    
    # Create model dataframe and handle missing values
    model_df = df[[outcome_column, 'time_point2', 'ai_base_score', 'id_it'] + additional_covariates].dropna()
    print(f"Rows after dropping missing values: {len(model_df)}")
    
    # Convert ai_base_score to category
    model_df['ai_cat'] = model_df['ai_base_score'].astype('category')
    
    # Create the formula string with all covariates
    formula = f"Q('{outcome_column}') ~ time_point2 + ai_cat + time_point2:ai_cat + " + " + ".join(additional_covariates)
    
    # Fit model
    model = smf.mixedlm(formula, model_df, groups='id_it').fit(reml=False)
    
    # Get unique time points from the data
    unique_time_points = sorted(model_df['time_point2'].unique())
    
    # Create prediction dataframe using only time points that exist in the data
    pred_rows = []
    for time_point in unique_time_points:
        for ai in model_df['ai_cat'].unique():
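            # Hold covariates at a typical value: median if numeric, else mode
            row = {'time_point2': time_point, 'ai_cat': ai}
            for cov in additional_covariates:
                if pd.api.types.is_numeric_dtype(model_df[cov]):
                    row[cov] = model_df[cov].median()
                else:
                    row[cov] = model_df[cov].mode()[0]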
            pred_rows.append(row)
    
    pred_df = pd.DataFrame(pred_rows)
    
    # Add the outcome column (required by statsmodels)
    pred_df[outcome_column] = 0  # This value doesn't matter for prediction
    
    # Get predictions
    pred_df['predicted'] = model.predict(pred_df)
    
    # Create mapping for x-axis labels
    quarter_labels = {
        '2022_01': 'Q1 2022',
        '2022_04': 'Q2 2022',
        '2022_07': 'Q3 2022',
        '2022_10': 'Q4 2022',
        '2023_01': 'Q1 2023',
        '2023_04': 'Q2 2023',
        '2023_07': 'Q3 2023',
        '2023_10': 'Q4 2023',
        '2024_01': 'Q1 2024',
        '2024_04': 'Q2 2024',
        '2024_07': 'Q3 2024',
        '2024_10': 'Q4 2024',
        '2025_02': 'Q1 2025',
        '2025_04': 'Q2 2025'
    }
    
    # Add numeric indices for proper sorting
    pred_df['x_idx'] = pred_df['time_point2'].map(lambda x: list(unique_time_points).index(x))
    
    # Create figure
    fig, ax = plt.subplots(figsize=(10, 6), dpi=300)
    
    # Define custom color palette
    colors = {
        0: '#808080',  # Gray for AI 0
        1: '#2ca02c',  # Green for AI 1
        2: '#1f77b4'   # Blue for AI 2
    }
    
    # Plot for each AI level
    for ai in sorted(pred_df['ai_cat'].unique()):
        subset = pred_df[pred_df['ai_cat'] == ai].sort_values('x_idx')
        
        # Plot lines connecting available data points
        ax.plot(subset['x_idx'], subset['predicted'], 
                label=f'AI {int(ai)}', 
                marker='o',
                color=colors[int(ai)],
                linewidth=2,
                markersize=8,
                markeredgecolor='white',
                markeredgewidth=1)
    
    # Set x-ticks
    x_ticks = range(len(unique_time_points))
    x_labels = [quarter_labels.get(tp, tp) for tp in unique_time_points]
    
    ax.set_xticks(x_ticks)
    ax.set_xticklabels(x_labels, rotation=45, ha='right')
    
    # Set axis limits
    ax.set_xlim(-0.5, len(unique_time_points) - 0.5)
    
    # Calculate appropriate y-axis limits
    y_min = pred_df['predicted'].min() - 0.05 * (pred_df['predicted'].max() - pred_df['predicted'].min())
    y_max = pred_df['predicted'].max() + 0.05 * (pred_df['predicted'].max() - pred_df['predicted'].min())
    ax.set_ylim(y_min, y_max)
    
    # Customize the plot
    ax.set_xlabel("Quarter", fontweight='bold', labelpad=10)
    ax.set_ylabel("Adjusted Quality Metric", fontweight='bold', labelpad=10)
    
    # Add title and subtitle
    ax.set_title(f"Adjusted {outcome_column} Over Time by AI Level", 
                 fontweight='bold', pad=20)

    
    # Add legend
    legend = ax.legend(title="AI Level", 
                      title_fontsize=12,
                      frameon=True,
                      framealpha=0.95,
                      edgecolor='black',
                      loc='best')
    
    # Add grid
    ax.grid(True, linestyle='--', alpha=0.3)
    
    # Remove top and right spines
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    
    # Adjust layout
    plt.tight_layout()
    
    # Print model summary and coefficient values
    print("\nModel Summary Statistics:")
    print(f"Number of observations: {len(model_df)}")
    print(f"Number of hospitals: {model_df['id_it'].nunique()}")
    print("\nFixed Effects:")
    print(model.summary().tables[1])
    
    # Print the actual coefficient values for additional covariates
    print("\nCoefficients for additional covariates:")
    for cov in additional_covariates:
        if cov in model.params.index:
            print(f"{cov}: {model.params[cov]:.4f}")
    
    plt.show()
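
The adjusted curves come from predicting on a grid: every observed time point crossed with every AI level, with each covariate held at a typical value. A self-contained sketch of that grid construction is given below; `build_prediction_grid` and the toy columns are illustrative, and holding covariates at the median (numeric) or mode (categorical) is one possible convention.

```python
from itertools import product

import pandas as pd

def build_prediction_grid(model_df, time_col, group_col, covariates):
    """Cross all time points with all groups; fix covariates at typical values."""
    typical = {}
    for cov in covariates:
        if pd.api.types.is_numeric_dtype(model_df[cov]):
            typical[cov] = model_df[cov].median()
        else:
            typical[cov] = model_df[cov].mode().iloc[0]
    times = sorted(model_df[time_col].unique())
    groups = sorted(model_df[group_col].unique())
    rows = [{time_col: t, group_col: g, **typical}
            for t, g in product(times, groups)]
    return pd.DataFrame(rows)

# Toy model data with one numeric and one categorical covariate.
toy_model_df = pd.DataFrame({
    'time_point2': ['2022_01', '2022_01', '2022_04', '2022_04', '2022_04'],
    'ai_cat': [0, 1, 0, 2, 1],
    'bedsize': [2, 4, 6, 8, 5],
    'ownership_type': ['for_profit', 'for_profit', 'other', 'for_profit', 'other'],
})

grid = build_prediction_grid(
    toy_model_df, 'time_point2', 'ai_cat', ['bedsize', 'ownership_type'])
print(grid)
```

Such a grid can be passed to the fitted model's `predict` method; the resulting lines describe a hospital with typical covariate values, not an average over hospitals.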