---
title: GLMM and Hierarchical Bayesian Models
subject: Methods
subtitle: Partial pooling for stable estimation of hospital-level SSI risk
exports: 
  - format: pdf
    template: curvenote
    # template: arxiv_two_column # requires abstract
  # - format: pdf+tex
keywords: [hierarchical Bayesian models, generalized linear mixed models, partial pooling, hospital profiling, surgical site infection, small-area estimation, healthcare quality]
---

In [1]:
#| load-data
library(arrow)
library(lme4)

# Load data
colon_fac_ach <- read_parquet("data/colon_fac_ach.parquet")

## GLMM

Facilities are nested within counties, creating a hierarchical data structure in which outcomes from facilities in the same county may be correlated due to shared local factors such as patient populations, referral patterns, or reporting practices.

To account for this clustering, I fit a binomial generalized linear mixed model with a random intercept for county and fixed effects for facility type. This specification allows the baseline log-odds of infection to vary across counties while estimating the average effect of facility type across all counties.

In [2]:
# Random intercepts glmm
glmm_fit <- 
  glmer(cbind(Infections_Reported, No_Infections) ~ Facility_Type + 
          (1 | County), data = colon_fac_ach, family = binomial)

summary(glmm_fit)

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: cbind(Infections_Reported, No_Infections) ~ Facility_Type + (1 |  
    County)
   Data: colon_fac_ach

      AIC       BIC    logLik -2*log(L)  df.resid 
    943.3     961.6    -466.6     933.3       283 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.0170 -0.7076 -0.3761  0.3658  6.6185 

Random effects:
 Groups Name        Variance Std.Dev.
 County (Intercept) 0.1779   0.4218  
Number of obs: 288, groups:  County, 42

Fixed effects:
                                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)                           -4.4636     0.2175 -20.523  < 2e-16 ***
Facility_TypeCommunity, >250 Beds      0.6328     0.2283   2.772  0.00556 ** 
Facility_TypeCommunity, 125-250 Beds   0.3296     0.2373   1.389  0.16478    
Facility_TypeMajor Teaching            0.6146     0.2152   2.855  0.00430 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The random-intercepts generalized linear mixed model yields results that are broadly consistent with those from the standard [logistic regression](#lr_fit) and [Bayesian binomial](#nh_fit) models. Community hospitals with more than 250 beds and major teaching hospitals have significantly higher odds of surgical site infection than the reference group, with odds ratios of about 1.88 ($e^{0.633}$, $p = 0.006$) and 1.85 ($e^{0.615}$, $p = 0.004$), respectively. The effect for community hospitals with 125 to 250 beds is positive (odds ratio about 1.39) but not statistically significant ($p = 0.16$).
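
As a rough check on the scale of these effects, the log-odds coefficients can be converted to odds ratios with Wald-type 95% intervals. The sketch below copies the estimates and standard errors from the summary output above rather than reading them from `glmm_fit`, so it runs on its own; the object names (`est`, `se`, `or_table`) are illustrative only, and the Wald interval is one common approximation, not necessarily the interval used elsewhere in this analysis.

```r
# Estimates and standard errors copied from the summary(glmm_fit) output above
est <- c("Community, >250 Beds"    = 0.6328,
         "Community, 125-250 Beds" = 0.3296,
         "Major Teaching"          = 0.6146)
se <- c(0.2283, 0.2373, 0.2152)

z <- qnorm(0.975)  # normal quantile for a two-sided 95% interval

# Exponentiate the log-odds estimates and the Wald interval limits
or_table <- data.frame(
  odds_ratio = exp(est),
  lower_95   = exp(est - z * se),
  upper_95   = exp(est + z * se)
)

print(round(or_table, 2))
```

With the fitted model object available, the same quantities could be obtained from `fixef(glmm_fit)` and the coefficient table of `summary(glmm_fit)` instead of hard-coded values.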

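The estimated county-level random intercept variance is 0.178, corresponding to a standard deviation of 0.422 on the log-odds scale. For facilities of the same type, the odds of infection in a county one standard deviation above the average are therefore about $e^{0.422} \approx 1.5$ times those in an average county, indicating meaningful heterogeneity in baseline infection risk across counties. This suggests that unobserved county-level factors contribute to variation in infection rates beyond what is explained by facility type alone.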
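
To put the county variance on a more interpretable scale, a few standard summaries can be computed from it directly. The sketch below is an addition to the analysis, not part of the original model output: it assumes the usual latent-variable convention for logistic models (level-1 variance of $\pi^2/3$) for the intraclass correlation, and uses the median odds ratio as a second heterogeneity summary. The variable names are illustrative.

```r
# County intercept variance copied from the summary(glmm_fit) output above
sigma2_county <- 0.1779

# Latent-scale intraclass correlation: share of total variance on the
# logistic scale that lies between counties (level-1 variance = pi^2 / 3)
icc <- sigma2_county / (sigma2_county + pi^2 / 3)

# Median odds ratio: median ratio of odds between two randomly chosen
# counties (higher-risk over lower-risk) for facilities of the same type
mor <- exp(sqrt(2 * sigma2_county) * qnorm(0.75))

# Odds multipliers spanning roughly the central 95% of counties
odds_range <- exp(c(-1, 1) * qnorm(0.975) * sqrt(sigma2_county))

print(round(c(icc = icc, mor = mor), 3))
print(round(odds_range, 2))
```

Under these assumptions the intraclass correlation is small (about 5%), while the odds multipliers still span a wide range, which is consistent with modest but non-negligible county-level clustering.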

Comparing AIC values with the [logistic model](#lr_fit) provides a rough measure of relative model fit. The comparison is only approximate: the models differ in structure, and the GLMM likelihood is evaluated with a Laplace approximation rather than exactly. Even so, the difference of roughly 35 points (978 for the facility-type-only logistic regression versus 943 for the GLMM) indicates that adding a county-level random intercept improves the fit substantially.
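
The information criteria reported in the summary can be reproduced by hand from the log-likelihood, which makes explicit how many parameters the GLMM is charged for. This is a minimal sketch using values copied from the output above; the parameter count of 5 (four fixed effects plus one variance component) is my reading of the model specification, and small discrepancies from the printed values reflect rounding of the log-likelihood.

```r
# Values copied from the summary(glmm_fit) output above
loglik <- -466.6   # maximized (Laplace-approximated) log-likelihood
n_obs  <- 288      # number of facility-level observations
k      <- 5        # assumed: 4 fixed effects + 1 random-intercept variance

aic <- -2 * loglik + 2 * k
bic <- -2 * loglik + log(n_obs) * k

# AIC of the facility-type-only logistic regression, as quoted in the text
aic_logistic <- 978
delta_aic <- aic_logistic - aic

print(round(c(AIC = aic, BIC = bic, delta_AIC = delta_aic), 1))
```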

## Partial pooling

The random-intercepts GLMM estimates facility-specific probabilities while accounting for clustering of facilities within counties. Each county $j$ is assigned a random intercept $u_j$ that represents its deviation from the overall mean log-odds. The model can be written as:

$$
\text{logit}(p_{ij}) = \beta_0 + \beta_1 \text{FacilityType}_{ij} + u_j
$$

where
- $p_{ij}$ is the probability of SSI for facility $i$ in county $j$
- $\beta_0$ is the overall intercept
- $\beta_1$ is the fixed effect of facility type (with four facility types, this stands for three coefficients, one per non-reference category)
- $u_j \sim N(0, \sigma^2_\text{county})$ is the county-level random intercept
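
One way to read the model is as a data-generating process: draw a county intercept, add it to the fixed part of the linear predictor, and draw infections from a binomial distribution. The sketch below simulates data this way using base R only. The parameter values are rounded from the fitted model, but the assignment of facilities to counties and types and the procedure counts (`n_proc`) are hypothetical and do not reflect the real data.

```r
set.seed(42)

n_county     <- 42                         # number of counties, as in the fit
n_fac        <- 288                        # number of facilities, as in the fit
sigma_county <- 0.42                       # SD of u_j, rounded from the fit
beta0        <- -4.46                      # overall intercept (reference type)
beta_type    <- c(0, 0.63, 0.33, 0.61)     # type effects; first entry = reference

# Hypothetical facility attributes: county, facility type, procedure count
county <- sample(seq_len(n_county), n_fac, replace = TRUE)
type   <- sample(seq_along(beta_type), n_fac, replace = TRUE)
n_proc <- sample(50:400, n_fac, replace = TRUE)

# County-level random intercepts u_j ~ N(0, sigma_county^2)
u <- rnorm(n_county, mean = 0, sd = sigma_county)

# Linear predictor on the log-odds scale, then inverse logit
eta <- beta0 + beta_type[type] + u[county]
p   <- plogis(eta)

# Binomial infection counts per facility
infections <- rbinom(n_fac, size = n_proc, prob = p)

sim <- data.frame(county, type, n_proc, infections,
                  no_infections = n_proc - infections)
head(sim)
```

A data frame built this way has the same two-column response layout (`infections`, `no_infections`) that the `glmer()` call above expects, so it could be used to check that the fitting procedure recovers known parameter values.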

The estimated $u_j$ are empirical Bayes estimates of the random intercepts, commonly referred to as BLUPs (for a GLMM they are, strictly speaking, conditional modes). Each estimate is a weighted combination of the county's own observed deviation and the overall mean, which implements partial pooling: counties with few facilities are shrunk strongly toward the overall mean, while counties with many facilities rely more on their own data. Under a simplified normal approximation, the BLUP can be expressed as:

$$ \hat{u}_j = \frac{\sigma^2_\text{county}}{\sigma^2_\text{county} + \sigma^2_\text{resid}/n_j} \, (\bar{y}_j - \bar{y})$$

where
- $\hat{u}_j$ is the estimated random intercept for county $j$ (BLUP)
- $\bar{y}_j$ is the observed mean for county $j$
- $\bar{y}$ is the overall mean
- $n_j$ is the number of facilities in county $j$ (more facilities → less shrinkage)
- $\sigma^2_\text{resid}$ is the residual variance at the facility level
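
The shrinkage weight in this formula can be evaluated for a few county sizes to see how quickly pooling relaxes as $n_j$ grows. In the sketch below, the county variance is taken from the fitted model, while the residual variance of 1 and the raw deviation of 0.8 are purely hypothetical values chosen for illustration (the binomial GLMM has no residual variance parameter of this kind).

```r
sigma2_county <- 0.1779   # county variance from the fitted model
sigma2_resid  <- 1        # hypothetical residual variance, for illustration only

n_j <- c(1, 2, 5, 10, 20) # number of facilities in a county

# Shrinkage weight: fraction of the county's own deviation that is retained
weight <- sigma2_county / (sigma2_county + sigma2_resid / n_j)

# Shrunken estimate for a hypothetical raw deviation (y_bar_j - y_bar) of 0.8
raw_dev <- 0.8
u_hat   <- weight * raw_dev

print(data.frame(n_j, weight = round(weight, 3), u_hat = round(u_hat, 3)))
```

Under these assumed values, a single-facility county keeps only about 15% of its observed deviation, whereas a county with 20 facilities keeps close to 80%.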

This weighting shows that small counties are pulled toward the overall mean, while larger counties contribute more of their own information. The GLMM thus provides conditional estimates with partial pooling, stabilizing predictions for low-volume counties.

Bayesian hierarchical models implement the same partial pooling concept. They additionally produce full posterior distributions for each parameter, allowing richer probabilistic interpretation.