In [57]:
import os
os.getcwd()
os.chdir('../../../')
os.getcwd() # TODO: make this nicer

'/'

In [58]:
import ehrapy as ep
import pandas as pd
import tableone

## Case Study Data

This tutorial explores the MIMIC-II IAC dataset. It was created for a case study in the book *Secondary Analysis of Electronic Health Records*, published by Springer in 2016. In particular, the dataset was used throughout Chapter 16 (Data Analysis) by Raffa J. et al. to investigate the effectiveness of indwelling arterial catheters (IAC) in hemodynamically stable patients with respiratory failure for mortality outcomes. The dataset is derived from MIMIC-II, the publicly accessible critical care database. It contains summary clinical data and outcomes for 1,776 patients.

References:

[1] Critical Data, M.I.T., 2016. Secondary analysis of electronic health records (p. 427). Springer Nature. (https://link.springer.com/book/10.1007/978-3-319-43742-2)

[2] https://github.com/MIT-LCP/critical-data-book/tree/master/part_ii/chapter_16/jupyter

[3] https://stackoverflow.com/questions/27328623/anova-test-for-glm-in-python/60769343#60769343

In [59]:
adata = ep.dt.mimic_2(encoded=False)

In [60]:
adata.var.index

Index(['aline_flg', 'icu_los_day', 'hospital_los_day', 'age', 'gender_num',
       'weight_first', 'bmi', 'sapsi_first', 'sofa_first', 'service_unit',
       'service_num', 'day_icu_intime', 'day_icu_intime_num',
       'hour_icu_intime', 'hosp_exp_flg', 'icu_exp_flg', 'day_28_flg',
       'mort_day_censored', 'censor_flg', 'sepsis_flg', 'chf_flg', 'afib_flg',
       'renal_flg', 'liver_flg', 'copd_flg', 'cad_flg', 'stroke_flg',
       'mal_flg', 'resp_flg', 'map_1st', 'hr_1st', 'temp_1st', 'spo2_1st',
       'abg_count', 'wbc_first', 'hgb_first', 'platelet_first', 'sodium_first',
       'potassium_first', 'tco2_first', 'chloride_first', 'bun_first',
       'creatinine_first', 'po2_first', 'pco2_first', 'iv_day_1'],
      dtype='object')

In [61]:
ep.ad.infer_feature_types(adata)
ep.ad.replace_feature_types(adata, ['day_icu_intime_num', 'hour_icu_intime'], 'numeric')


! Features 'aline_flg', 'gender_num', 'service_num', 'day_icu_intime_num', 'hour_icu_intime', 'hosp_exp_flg', 'icu_exp_flg', 'day_28_flg', 'censor_flg', 'sepsis_flg', 'chf_flg', 'afib_flg', 'renal_flg', 'liver_flg', 'copd_flg', 'cad_flg', 'stroke_flg', 'mal_flg', 'resp_flg' were detected as categorical features stored numerically. Please verify and correct using `ep.ad.replace_feature_types` if necessary.


In [62]:
continuous_vars = adata.var.index[adata.var['feature_type'] == 'numeric']
categorical_vars = adata.var.index[adata.var['feature_type'] == 'categorical']

continuous_vars = continuous_vars.tolist()
categorical_vars = categorical_vars.tolist()
all_vars = continuous_vars + categorical_vars


## Case Study and Summary

Our objective for this case study is as follows:
***
To estimate the effect that administration of IAC during an ICU admission has on 28-day
mortality in patients without sepsis who received mechanical ventilation within MIMIC-II,
while adjusting for age, gender, severity of illness and comorbidities.
***
We also do not want to include the `sepsis_flg` variable as a covariate in any
of our models, as there are no patients with sepsis within this study to estimate the
effect of sepsis.

The next steps will vary slightly, but it is often useful to put yourself in the shoes
of a peer reviewer. What problems will a reviewer likely find with your study and
how can you address them? Usually, the reviewer will want to see how the population differs for different values of the covariate of interest. In our case study, if
the treated group (IAC) differed substantially from the untreated group (no IAC),
then this may account for any effect we demonstrate. We can check this by summarizing the two groups using the [tableone](https://github.com/tompollard/tableone) package.

In [63]:
table = tableone.TableOne(adata.to_df(), all_vars, categorical=categorical_vars, groupby="aline_flg")
print(table.tabulate(tablefmt="fancy_grid"))

╒═══════════════════════════════╤═══════════╤═══════════╤═════════════════╤═════════════════╤═════════════════╕
│                               │           │ Missing   │ Overall         │ 0               │ 1               │
╞═══════════════════════════════╪═══════════╪═══════════╪═════════════════╪═════════════════╪═════════════════╡
│ n                             │           │           │ 1776            │ 792             │ 984             │
├───────────────────────────────┼───────────┼───────────┼─────────────────┼─────────────────┼─────────────────┤
│ icu_los_day, mean (SD)        │           │ 0         │ 3.3 (3.4)       │ 2.1 (1.9)       │ 4.3 (3.9)       │
├───────────────────────────────┼───────────┼───────────┼─────────────────┼─────────────────┼─────────────────┤
│ hospital_los_day, mean (SD)   │           │ 0         │ 8.1 (8.2)       │ 5.4 (5.4)       │ 10.3 (9.3)      │
├───────────────────────────────┼───────────┼───────────┼─────────────────┼─────────────────┼───────────

As you can see in the table above, the IAC group differs in many respects from the
non-IAC group. Patients who were given IAC tended to have higher severity of
illness at baseline (`sapsi_first` and `sofa_first`), to be slightly older, to be less likely to
be from the MICU, and to have slightly different comorbidity profiles compared to the non-IAC group.

Next, we can see how the covariates are distributed among the different outcomes
(death within 28 days versus alive at 28 days). This will give us an idea of
which covariates may be important for affecting the outcome.

In [64]:
table = tableone.TableOne(adata.to_df(), all_vars, categorical=categorical_vars, groupby="day_28_flg")
print(table.tabulate(tablefmt="fancy_grid"))

╒═══════════════════════════════╤═══════════╤═══════════╤═════════════════╤═════════════════╤═════════════════╕
│                               │           │ Missing   │ Overall         │ 0               │ 1               │
╞═══════════════════════════════╪═══════════╪═══════════╪═════════════════╪═════════════════╪═════════════════╡
│ n                             │           │           │ 1776            │ 1493            │ 283             │
├───────────────────────────────┼───────────┼───────────┼─────────────────┼─────────────────┼─────────────────┤
│ icu_los_day, mean (SD)        │           │ 0         │ 3.3 (3.4)       │ 3.2 (3.2)       │ 4.0 (4.0)       │
├───────────────────────────────┼───────────┼───────────┼─────────────────┼─────────────────┼─────────────────┤
│ hospital_los_day, mean (SD)   │           │ 0         │ 8.1 (8.2)       │ 8.4 (8.4)       │ 6.4 (6.4)       │
├───────────────────────────────┼───────────┼───────────┼─────────────────┼─────────────────┼───────────

We see that of the 984 subjects receiving IAC, 170 (17.2 %) died
within 28 days, whereas 113 of 792 (14.2 %) died in the no-IAC group. In a
univariate analysis we can assess whether the lower mortality rate in the no-IAC group is statistically significant by fitting a logistic regression with the single covariate `aline_flg`. This is easily done using the `ep.tl.glm` function.

In [65]:

ep.ad.replace_feature_types(adata, ['day_28_flg'], 'numeric')
# target type has to be numerical for glm binomial see: https://en.wikipedia.org/wiki/Generalized_linear_model
glm_model = ep.tl.glm(adata, formula='day_28_flg ~ aline_flg', var_names=['day_28_flg', 'aline_flg'], family='Binomial')
res = glm_model.fit()
res.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | day_28_flg | No. Observations: | 1776.0 |
| Model: | GLM | Df Residuals: | 1774.0 |
| Model Family: | Binomial | Df Model: | 1.0 |
| Link Function: | Logit | Scale: | 1.0 |
| Method: | IRLS | Log-Likelihood: | -777.43 |
| Date: | Wed, 12 Mar 2025 | Deviance: | 1554.9 |
| Time: | 15:47:53 | Pearson chi2: | 1780.0 |
| No. Iterations: | 4 | Pseudo R-squ. (CS): | 0.00168 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1.7932 | 0.102 | -17.650 | 0.000 | -1.992 | -1.594 |
| aline_flg[T.1] | 0.2271 | 0.132 | 1.720 | 0.085 | -0.032 | 0.486 |


The p-value for `aline_flg` is about 0.09, so the difference in crude mortality is not
statistically significant at the 5% level. As we saw in the table grouped by `aline_flg` (Table 16.2 in the book),
there are likely several important covariates that differed between those who received
IAC and those who did not. These may serve as confounders, and the possible
association we observed in the univariate analysis may be stronger, non-existent, or
in the opposite direction (i.e., IAC having lower rates of mortality) once they are
adjusted for.
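
For a single binary covariate, the fitted logistic regression can be checked by hand, because the coefficient equals the log odds ratio of the 2×2 table of treatment by outcome. The following is a minimal sketch that uses only the counts quoted above (170 deaths among 984 IAC patients, 113 among 792 non-IAC patients); the variable names are illustrative and not part of ehrapy.

```python
import math

# Counts quoted in the text: deaths within 28 days by IAC status
deaths_iac, n_iac = 170, 984
deaths_no_iac, n_no_iac = 113, 792
alive_iac = n_iac - deaths_iac
alive_no_iac = n_no_iac - deaths_no_iac

# Odds of death in each group
odds_iac = deaths_iac / alive_iac
odds_no_iac = deaths_no_iac / alive_no_iac

# With one binary covariate, the intercept is the log odds in the reference
# group (no IAC) and the coefficient is the log odds ratio
intercept = math.log(odds_no_iac)
log_odds_ratio = math.log(odds_iac / odds_no_iac)

# Wald standard error of the log odds ratio from the four cell counts
std_err = math.sqrt(
    1 / deaths_iac + 1 / alive_iac + 1 / deaths_no_iac + 1 / alive_no_iac
)
z = log_odds_ratio / std_err
# Two-sided p-value under the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"intercept={intercept:.4f}, coef={log_odds_ratio:.4f}")
print(f"std err={std_err:.3f}, z={z:.3f}, p={p_value:.3f}")
print(f"odds ratio={math.exp(log_odds_ratio):.2f}")
```

Up to rounding, these values should agree with the `Intercept` and `aline_flg[T.1]` rows of the summary. The unadjusted odds ratio is about 1.25, with a confidence interval on the log scale that includes zero.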

In [66]:
from anndata import AnnData
import numpy as np
import pandas as pd


def cut2(adata, column_name, bins=5):
    """Return a new AnnData with `column_name` binned into quantiles as an extra categorical feature."""
    # TODO: only create one column with the bin number as categorical
    created_col_name = f"{column_name}_{bins}_bins"
    # Nothing to do if the binned column was already added
    if created_col_name in adata.obs.columns:
        return adata

    # Convert the AnnData object to a DataFrame
    df = adata.to_df()
    x = df[column_name]

    # Cut the data into quantile-based bins with one label per bin
    bin_labels = [f"{column_name}_bin_{i+1}" for i in range(bins)]
    binned = pd.qcut(x, q=bins, labels=bin_labels)
    binned.name = created_col_name

    # Append the binned column to the data matrix and to the observations
    df = pd.concat([df, binned], axis=1)
    new_obs = pd.concat([adata.obs.copy(), binned], axis=1)

    # Register the new feature and mark it as categorical
    new_var = adata.var.copy()
    new_var.loc[created_col_name] = 'categorical'

    # Create a new AnnData object from the updated DataFrame
    new_adata = AnnData(
        X=df.values,  # Use the updated DataFrame as the new .X
        obs=new_obs,  # Original observations plus the binned column
        var=new_var,
        uns=adata.uns.copy(),  # Copy the unstructured data
        obsm=adata.obsm.copy(),  # Copy the observation-level multidimensional annotations
        varm=adata.varm.copy(),  # Copy the variable-level multidimensional annotations
    )

    return new_adata

    

In [67]:

cut2_adata = cut2(adata, 'sapsi_first')
cut2_adata = cut2(cut2_adata, 'sofa_first')
cut2_adata.var.index

Index(['aline_flg', 'icu_los_day', 'hospital_los_day', 'age', 'gender_num',
       'weight_first', 'bmi', 'sapsi_first', 'sofa_first', 'service_unit',
       'service_num', 'day_icu_intime', 'day_icu_intime_num',
       'hour_icu_intime', 'hosp_exp_flg', 'icu_exp_flg', 'day_28_flg',
       'mort_day_censored', 'censor_flg', 'sepsis_flg', 'chf_flg', 'afib_flg',
       'renal_flg', 'liver_flg', 'copd_flg', 'cad_flg', 'stroke_flg',
       'mal_flg', 'resp_flg', 'map_1st', 'hr_1st', 'temp_1st', 'spo2_1st',
       'abg_count', 'wbc_first', 'hgb_first', 'platelet_first', 'sodium_first',
       'potassium_first', 'tco2_first', 'chloride_first', 'bun_first',
       'creatinine_first', 'po2_first', 'pco2_first', 'iv_day_1',
       'sapsi_first_5_bins', 'sofa_first_5_bins'],
      dtype='object')
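
The `cut2` helper above relies on `pd.qcut`, which places bin edges at quantiles so that each bin holds roughly the same number of observations. The sketch below illustrates this on toy data; the scores and the `toy_score` name are hypothetical and not taken from MIMIC-II.

```python
import pandas as pd

# Toy severity scores (hypothetical values, not taken from MIMIC-II)
scores = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], name="toy_score")

bins = 5
labels = [f"toy_score_bin_{i + 1}" for i in range(bins)]

# qcut places the bin edges at quantiles, so every bin receives
# about the same number of observations
binned = pd.qcut(scores, q=bins, labels=labels)

# Count observations per bin
counts = binned.value_counts(sort=False)
print(counts)
```

With ten distinct values and five bins, each bin receives two observations. When many observations share the same value, quantile edges can coincide and `pd.qcut` raises an error by default; its `duplicates` parameter controls that behavior.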

In [68]:
dependant_var = 'day_28_flg'
independent_vars = [
    "aline_flg", "age", "gender_num", "sapsi_first_5_bins", "sofa_first_5_bins", "service_unit",
    "chf_flg", "afib_flg", "renal_flg", "liver_flg", "copd_flg", "cad_flg", "stroke_flg", "mal_flg", "resp_flg",
]
formula = f"{dependant_var} ~ {' + '.join(independent_vars)}"
var_names = independent_vars + [dependant_var]
# Multivariable logistic regression; observations with missing values are dropped
co2_lm = ep.tl.glm(cut2_adata, var_names=var_names, formula=formula, missing="drop", family="Binomial")
co2_lm_result = co2_lm.fit()
co2_lm_result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | day_28_flg | No. Observations: | 1684.0 |
| Model: | GLM | Df Residuals: | 1661.0 |
| Model Family: | Binomial | Df Model: | 22.0 |
| Link Function: | Logit | Scale: | 1.0 |
| Method: | IRLS | Log-Likelihood: | -479.06 |
| Date: | Wed, 12 Mar 2025 | Deviance: | 958.12 |
| Time: | 15:47:53 | Pearson chi2: | 1350.0 |
| No. Iterations: | 7 | Pseudo R-squ. (CS): | 0.2311 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -7.4832 | 0.849 | -8.809 | 0.000 | -9.148 | -5.818 |
| aline_flg[T.1] | 0.0327 | 0.204 | 0.160 | 0.873 | -0.367 | 0.433 |
| gender_num[T.1.0] | 0.1422 | 0.172 | 0.824 | 0.410 | -0.196 | 0.480 |
| sapsi_first_5_bins[T.sapsi_first_bin_2] | 0.0390 | 0.342 | 0.114 | 0.909 | -0.631 | 0.709 |
| sapsi_first_5_bins[T.sapsi_first_bin_3] | 0.6417 | 0.293 | 2.190 | 0.028 | 0.067 | 1.216 |
| sapsi_first_5_bins[T.sapsi_first_bin_4] | 0.5174 | 0.293 | 1.769 | 0.077 | -0.056 | 1.091 |
| sapsi_first_5_bins[T.sapsi_first_bin_5] | 1.3695 | 0.295 | 4.637 | 0.000 | 0.791 | 1.948 |
| sofa_first_5_bins[T.sofa_first_bin_2] | 0.5223 | 0.302 | 1.729 | 0.084 | -0.070 | 1.114 |
| sofa_first_5_bins[T.sofa_first_bin_3] | 0.6462 | 0.301 | 2.144 | 0.032 | 0.055 | 1.237 |


In [None]:
from scipy.stats import chi2
import math
def drop1(adata, dependant_var, independent_vars, missing="drop"):
    """
    Python implementation of R's drop1 function using ehrapy.tl.glm with a binomial family.

    Args:
        adata: The AnnData object holding the data for the GLM.
        dependant_var: Name of the outcome variable (left-hand side of the formula).
        independent_vars: A list of var names used as the terms of the full model.
        missing: Available options are 'none', 'drop', and 'raise'.
                 If 'none', no nan checking is done. If 'drop', any observations with nans are dropped.
                 If 'raise', an error is raised.

    Returns:
        pd.DataFrame: A DataFrame containing the deviance, AIC, LRT statistic, and LRT p-value
        for each model with one term dropped.
    """
    # Fit the full model
    formula = f"{dependant_var} ~ {' + '.join(independent_vars)}"
    var_names = independent_vars + [dependant_var]
    full_model = ep.tl.glm(adata, var_names=var_names, formula=formula, missing=missing, family="Binomial")
    full_model_result = full_model.fit()

    results = []

    # Drop one term at a time and compare models
    for term in independent_vars:
        reduced_formula = f"{formula.split('~')[0].strip()} ~ {' + '.join([t for t in independent_vars if t != term])}"
        reduced_model = ep.tl.glm(adata, var_names=var_names, formula=reduced_formula, missing=missing, family="Binomial")
        reduced_model_result = reduced_model.fit()

        df = full_model.df_model - reduced_model.df_model
        aic = reduced_model_result.aic
        deviance = reduced_model_result.deviance
        
        lrt_stat = -2 * (reduced_model_result.llf - full_model_result.llf)
        lrt_pvalue = chi2.sf(lrt_stat, df=df)  # p-value using chi-squared distribution

        results.append({
            "Dropped Term": term,
            "Deviance": deviance,
            "AIC": aic,
            "LRT Statistic": lrt_stat,
            "LRT p-value": lrt_pvalue,
        })

    results_df = pd.DataFrame(results)
    return results_df 


In [149]:
drop1(cut2_adata, dependant_var, independent_vars, missing="drop")

full Deviance: 958.1208636145351, reduced Deviance: 958.1464907719766
-479.0604318072676 -479.0732453859883
full Deviance: 958.1208636145351, reduced Deviance: 1010.7005076630577
-479.0604318072676 -505.3502538315289
full Deviance: 958.1208636145351, reduced Deviance: 958.8320546358298
-479.0604318072676 -479.4160273179148
full Deviance: 958.1208636145351, reduced Deviance: 1069.7956827916175
-479.0604318072676 -534.8978413958088
full Deviance: 958.1208636145351, reduced Deviance: 978.0509467148296
-479.0604318072676 -489.0254733574148
full Deviance: 958.1208636145351, reduced Deviance: 963.0696607757438
-479.0604318072676 -481.5348303878719
full Deviance: 958.1208636145351, reduced Deviance: 959.1229614832073
-479.0604318072676 -479.56148074160365
full Deviance: 958.1208636145351, reduced Deviance: 964.9391907956656
-479.0604318072676 -482.4695953978328
full Deviance: 958.1208636145351, reduced Deviance: 962.1066415419648
-479.0604318072676 -481.0533207709824
full Deviance: 958.120863

| | Dropped Term | Deviance | AIC | LRT Statistic | LRT p-value |
|---|---|---|---|---|---|
| 0 | aline_flg | 958.146491 | 1002.146491 | 0.025627 | 8.728142e-01 |
| 1 | age | 1010.700508 | 1054.700508 | 52.579644 | 4.131493e-13 |
| 2 | gender_num | 958.832055 | 1002.832055 | 0.711191 | 3.990487e-01 |
| 3 | sapsi_first_5_bins | 1069.795683 | 1107.795683 | 111.674819 | 3.197094e-23 |
| 4 | sofa_first_5_bins | 978.050947 | 1016.050947 | 19.930083 | 5.155226e-04 |
| 5 | service_unit | 963.069661 | 1005.069661 | 4.948797 | 8.421362e-02 |
| 6 | chf_flg | 959.122961 | 1003.122961 | 1.002098 | 3.168034e-01 |
| 7 | afib_flg | 964.939191 | 1008.939191 | 6.818327 | 9.022704e-03 |
| 8 | renal_flg | 962.106642 | 1006.106642 | 3.985778 | 4.588591e-02 |
| 9 | liver_flg | 959.886794 | 1003.886794 | 1.765931 | 1.838865e-01 |


As you can see from the output, each covariate is listed along with a p-value (`LRT p-value`). Each row represents a hypothesis test in which the larger (alternative)
model is the full model and the null model is the full model
without that row's covariate. Among the listed p-values,
`aline_flg` has the largest, but we stipulated in our model selection plan
that we would retain this covariate as it is our covariate of interest. We therefore move
to the next largest p-value, which belongs to the `cad_flg` variable (coronary artery disease). From here we can update our model by removing it, and repeat the backwards elimination step on the
updated model.
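
Each likelihood ratio test above can be reproduced from two deviances: when both models are fitted to the same observations, the test statistic is the reduced-model deviance minus the full-model deviance. The sketch below recomputes a few rows from the deviances printed in the output above. It is restricted to binary flags, which cost one degree of freedom each, so the chi-squared tail probability can be written with `math.erfc`. Terms with several levels, such as `sapsi_first_5_bins`, need a chi-squared distribution with more degrees of freedom (for example `scipy.stats.chi2.sf`, as used in `drop1`). The helper name `lrt_1df` is illustrative only.

```python
import math

# Deviances copied from the output above
full_deviance = 958.1208636145351
reduced_deviances = {
    "aline_flg": 958.1464907719766,
    "gender_num": 958.8320546358298,
    "chf_flg": 959.1229614832073,
    "afib_flg": 964.9391907956656,
    "renal_flg": 962.1066415419648,
}


def lrt_1df(full_dev, reduced_dev):
    """Likelihood ratio test for dropping a term with one degree of freedom."""
    stat = reduced_dev - full_dev
    # Survival function of a chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


lrt_results = {
    term: lrt_1df(full_deviance, dev) for term, dev in reduced_deviances.items()
}
for term, (stat, p_value) in lrt_results.items():
    print(f"{term:12s} LRT statistic={stat:.6f}  p-value={p_value:.4g}")
```

A small p-value means that dropping the term worsens the fit noticeably, so the term is kept. Backwards elimination removes the term with the largest p-value, apart from covariates that the analysis plan says to retain.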