# Handling Clustered Data

One of the main uses for estimating equations historically has been to handle clustered data. This use was popularized by Liang & Zeger (1986). While `delicatessen` relies on estimating equations for other tasks, it can also be used to handle clustered data. This tutorial reviews how clustered observations can be handled using built-in `delicatessen` functionalities.

## Setup

In [1]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

import delicatessen
from delicatessen import MEstimator
from delicatessen.estimating_equations import ee_regression
from delicatessen.utilities import aggregate_efuncs

print("Versions")
print("NumPy:       ", np.__version__)
print("SciPy:       ", sp.__version__)
print("pandas:      ", pd.__version__)
print("Matplotlib:  ", mpl.__version__)
print("Delicatessen:", delicatessen.__version__)

Versions
NumPy:        2.3.5
SciPy:        1.16.3
pandas:       2.3.3
Matplotlib:   3.10.8
Delicatessen: 4.1


This tutorial uses data from the High School and Beyond study from 1982. The data set comes from the `mlmRev` R package; see that package for details. In this example, we conduct a simple regression analysis of students' mathematics achievement scores on several individual- and school-level factors. Here, clustering is assumed to occur at the school level and is considered an incidental feature (i.e., school-specific coefficients are not of interest).

The following code loads this data (saved as a .csv file):

In [2]:
d = pd.read_csv("data/hsb82.csv")
d['intercept'] = 1
d['female'] = np.where(d['sx'] == 'Female', 0, 1)  # NOTE: coded 1 = male, 0 = female, despite the column name

In [3]:
y = np.asarray(d['mAch'])  # Math achievement score
X = np.asarray(d[['intercept', 'female', 'cses']])
g = np.asarray(d['school'])

To begin, consider what happens if we ignore the clustering. For this, we can fit a linear regression model using the built-in `ee_regression` function (as illustrated elsewhere).
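
For reference, with `model='linear'` the estimating functions are the usual least-squares score equations. Writing $x_i$ for the row of `X` and $y_i$ for the outcome of unit $i$ (notation introduced here for illustration), they can be sketched as

$$
\psi(O_i; \beta) = (y_i - x_i^T \beta)\, x_i, \qquad \sum_{i=1}^{n} \psi(O_i; \hat{\beta}) = 0
$$

so $\hat{\beta}$ has one element per column of `X`, which is why three starting values are passed to `init`.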

In [4]:
def psi_i(theta):
    return ee_regression(theta=theta, y=y, X=X, model='linear')

In [5]:
estr_i = MEstimator(psi_i, init=[10, 0, 0])
estr_i.estimate()

In [6]:
estr_i.print_results()

              Estimation Method: M-estimator
--------------------------------------------------------------
No. Observations:        7185 | No. Parameters:              3
Solving algorithm:         lm | Max Iterations:           5000
Solving tolerance:      1e-09 | Allow P-Inverse:             1
Derivative Method:     approx | Deriv Approx:            1e-09
Small N Correction:      None | Distribution:           Z-stat
   Theta   StdErr  Z-score      LCL      UCL  P-value  S-value 
--------------------------------------------------------------
   12.01     0.11   114.12    11.80    12.21     0.00      inf 
    1.57     0.16     9.93     1.26     1.88     0.00    74.75 
    2.14     0.12    17.71     1.90     2.38     0.00   230.81 


From this model, we see that being male and having a higher SES relative to your school's average SES were both positively associated with math achievement scores.

## Clustering by School

The previous results, particularly the inferential statistics (standard errors, Z-scores, confidence intervals, P-values, S-values), are all premised on the assumption that observations are independent. However, we might be skeptical of this assumption. In particular, students in the same school may be more similar to each other than to students from different schools. From a certain perspective, we can think of each of these correlated observations as contributing 'less than 1 unit's worth' of information to our model. We can use estimating equations and the sandwich variance to handle this challenge.

To do this, we will essentially collapse the estimating functions from $n$, the number of units, to $m$, the number of schools. So, this changes our sample size (and thus all asymptotics will be based on $m$ and not $n$ anymore). How we collapse observations is determined by something called the 'working correlation matrix'. This matrix stipulates how observations are correlated (and is something we assume beforehand). The good news is that the sandwich variance is robust to misspecification of this working correlation matrix.
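
To make the collapsing concrete, the display below writes out a cluster-level estimating function and the resulting sandwich variance. The symbols $g_i$ (the cluster of unit $i$), $\psi_j^{*}$, $B_m$, and $F_m$ are notation introduced here for illustration, and the display assumes units are combined by simple summation within a cluster (i.e., an independent working correlation).

$$
\psi_j^{*}(\theta) = \sum_{i:\, g_i = j} \psi(O_i; \theta), \qquad \sum_{j=1}^{m} \psi_j^{*}(\hat{\theta}) = 0
$$

$$
\widehat{V}(\hat{\theta}) = \frac{1}{m} B_m^{-1} F_m \left(B_m^{-1}\right)^T, \qquad
B_m = \frac{1}{m} \sum_{j=1}^{m} -\frac{\partial \psi_j^{*}(\hat{\theta})}{\partial \theta}, \qquad
F_m = \frac{1}{m} \sum_{j=1}^{m} \psi_j^{*}(\hat{\theta})\, \psi_j^{*}(\hat{\theta})^T
$$

The outer products in $F_m$ are taken over the $m$ cluster sums rather than the $n$ individual contributions, which is where within-cluster correlation enters the variance estimate.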

Within `delicatessen`, the collapsing of estimating functions from $n$ to $m$ is done by the `aggregate_efuncs` utility function. This function takes a given estimating function and adds together the contributions of observations within the same cluster, as defined by the `group` argument. Note that this function only supports the 'independent' working correlation matrix. While this might be a known misspecification (in this and other clustering settings), this choice was made for two reasons: (1) this approach is more flexible and easily generalizes to arbitrary input estimating functions, and (2) non-diagonal working correlation matrices rely on an additional assumption that may not hold, and they will produce biased *point* estimates when it does not. The second point is detailed further in Pepe & Anderson (1994) and Pan et al. (2000). The independent working correlation matrix does not rely on this assumption, so it avoids this source of bias in the point estimates.

The downside of this choice is a loss of efficiency: when the additional assumption does hold, a correctly specified non-independent working correlation matrix would give smaller standard errors than the independent one. Despite this downside, the flexibility and robustness offered by this approach seem preferable. Therefore, only the independent working correlation matrix was made available.
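
For context, the estimating equation of Liang & Zeger (1986) for a linear model with constant variance can be sketched as follows, where $X_j$ and $Y_j$ stack the covariates and outcomes of cluster $j$ and $R_j(\alpha)$ is the working correlation matrix (notation introduced here for illustration):

$$
\sum_{j=1}^{m} X_j^T R_j(\alpha)^{-1} \left(Y_j - X_j \beta\right) = 0
$$

Setting $R_j(\alpha) = I$ reduces this to

$$
\sum_{j=1}^{m} \sum_{i:\, g_i = j} \left(y_i - x_i^T \beta\right) x_i = 0
$$

which is the sum of the individual linear-regression estimating functions within each cluster. With a non-diagonal $R_j(\alpha)$, the residual of one unit is multiplied by the covariates of other units in the same cluster, which is why an additional assumption is needed for the estimating equation to remain unbiased.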

The following code uses the `aggregate_efuncs` function to condense the previous estimating functions:

In [7]:
def psi_s(theta):
    return aggregate_efuncs(psi_i(theta), group=g)
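
Conceptually, the aggregation step amounts to summing the columns of the stacked estimating functions within each cluster. The following is a minimal NumPy sketch of that idea, assuming the estimating functions are arranged as a parameters-by-units array. `sum_by_group` is a hypothetical helper written for illustration; it is not taken from the `delicatessen` source and is not a substitute for `aggregate_efuncs`.

```python
import numpy as np

def sum_by_group(efuncs, group):
    """Sum the columns of a (p, n) array within each group label.

    Hypothetical helper for illustration; not taken from delicatessen.
    Returns a (p, m) array with one column per unique label (sorted order).
    """
    efuncs = np.atleast_2d(np.asarray(efuncs, dtype=float))
    labels, inverse = np.unique(np.asarray(group).ravel(), return_inverse=True)
    # One-hot membership matrix: row i flags the cluster of unit i, shape (n, m)
    member = (inverse.ravel()[:, None] == np.arange(labels.size)[None, :]).astype(float)
    return efuncs @ member

# Toy example: 2 parameters, 4 units, 2 clusters
psi_units = np.array([[1.0, 2.0, 3.0, 4.0],
                      [10.0, 20.0, 30.0, 40.0]])
clusters = np.array(["a", "b", "a", "b"])
psi_clusters = sum_by_group(psi_units, clusters)
print(psi_clusters.shape)  # (2, 2)
```

The row totals are unchanged by the grouping, so the summed estimating equations have the same root before and after aggregation.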

In [8]:
estr_s = MEstimator(psi_s, init=[10, 0, 0])
estr_s.estimate()

In [9]:
estr_s.print_results()

              Estimation Method: M-estimator
--------------------------------------------------------------
No. Observations:         160 | No. Parameters:              3
Solving algorithm:         lm | Max Iterations:           5000
Solving tolerance:      1e-09 | Allow P-Inverse:             1
Derivative Method:     approx | Deriv Approx:            1e-09
Small N Correction:      None | Distribution:           Z-stat
   Theta   StdErr  Z-score      LCL      UCL  P-value  S-value 
--------------------------------------------------------------
   12.01     0.26    45.31    11.49    12.52     0.00      inf 
    1.57     0.31     5.03     0.96     2.19     0.00    20.96 
    2.14     0.13    16.76     1.89     2.39     0.00   207.06 


In the results output, we can see several changes. First, in the metadata, the number of observations drops from 7185 to 160. While there were 7185 students in the study, these students came from only 160 different schools. Second, the standard errors are larger than in the previous case (0.26 versus 0.11 for the intercept and 0.31 versus 0.16 for the sex indicator), as we would expect with clustered data where there is some correlation between observations. The standard error for the school-centered SES coefficient changes only slightly (0.13 versus 0.12). These changes in turn alter the Z-scores, confidence intervals, P-values, and S-values. While our inferential results changed, the point estimates did not. Again, this is because of our selection of the independent working correlation matrix.
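
The same pattern can be reproduced on simulated data. The following is a minimal NumPy sketch in which the sandwich variance is computed by hand rather than through `delicatessen`. All names, the number of clusters, and the data-generating values are assumptions made for illustration, and the covariate is deliberately constant within each cluster.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 50, 20                          # clusters and units per cluster (assumed values)
g = np.repeat(np.arange(m), k)         # cluster label for every unit
x = rng.normal(size=m)[g]              # covariate shared within a cluster
u = rng.normal(size=m)[g]              # shared cluster effect, induces correlation
y = 1.0 + 0.5 * x + u + rng.normal(size=m * k)
X = np.column_stack([np.ones(m * k), x])

# Root of the summed linear-regression estimating functions (ordinary least squares)
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Estimating functions evaluated at beta: per unit (p, n) and summed per cluster (p, m)
psi_units = (X * (y - X @ beta)[:, None]).T
psi_clusters = np.stack([psi_units[:, g == j].sum(axis=1) for j in range(m)], axis=1)

def sandwich_se(efuncs, bread):
    """Standard errors from bread^{-1} (sum of outer products) bread^{-T}."""
    meat = efuncs @ efuncs.T
    bread_inv = np.linalg.inv(bread)
    return np.sqrt(np.diag(bread_inv @ meat @ bread_inv.T))

bread = X.T @ X                        # minus the summed derivative of psi w.r.t. beta
se_units = sandwich_se(psi_units, bread)
se_clusters = sandwich_se(psi_clusters, bread)

print("units   :", psi_units.sum(axis=1).round(8), se_units.round(3))
print("clusters:", psi_clusters.sum(axis=1).round(8), se_clusters.round(3))
```

Both sets of estimating functions sum to (numerically) zero at the same `beta`, so the point estimates are identical, while the cluster-level outer products give larger standard errors in this simulated setting.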

That concludes this example of how to handle clustered data with `delicatessen`. While only shown in the context of linear regression, the `aggregate_efuncs` function can condense any user-specified estimating functions.

## References 

Liang KY, & Zeger SL. (1986). Longitudinal data analysis using generalized linear models. *Biometrika*, 73(1), 13-22.

Pan W, Louis TA, & Connett JE. (2000). A note on marginal linear regression with correlated response data. *The American Statistician*, 54(3), 191-195.

Pepe MS, & Anderson GL. (1994). A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. *Communications in Statistics-Simulation and Computation*, 23, 939-951.