# Dependent Data

Good video: https://www.youtube.com/watch?v=gWj4ZwB7f3o

We have dependent data when observations are __correlated__ because of a feature of the study design:
- several observations collected at one time point from __sampled clusters__ of analytic units (neighborhoods, schools, clinics, etc.) in __clustered__ studies
- several observations collected over time from the __same individuals__ in __longitudinal__ studies

The models that we fit to these types of datasets __need to reflect the correlations__. The models we talked about in week two assume that all the observations in the dataset are independent and identically distributed (i.i.d.). __Now we're talking about datasets where the observations might be correlated__, so we have to specify models whose structure reflects those correlations.

We will use __Multilevel Models__ and __Marginal Models__ to model this kind of dependent data.

<font color="red">__Note__: We could partially mimic this effect in an ordinary linear model by adding __dummy variables__ for each cluster, BUT we would still have to assume that observations from the same cluster are independent. This can be a problem with dependent data, and accounting for the correlations often substantially improves model fit. Also, when there are many subjects (or clusters), the large number of dummy variables uses up degrees of freedom and can seriously reduce the accuracy of the parameter estimates. </font>
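To make "correlated due to study design" concrete, here is a minimal simulation sketch using only the Python standard library. It assumes, purely for illustration, that each observation is a shared cluster effect plus independent noise; the function names and variance values are hypothetical and not part of the course material.

```python
import math
import random


def simulate_pairs(n_clusters, sd_between, sd_within, seed=0):
    """Draw two observations per cluster: shared cluster effect + own noise."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_clusters):
        cluster_effect = rng.gauss(0.0, sd_between)  # shared by both observations
        first = cluster_effect + rng.gauss(0.0, sd_within)
        second = cluster_effect + rng.gauss(0.0, sd_within)
        pairs.append((first, second))
    return pairs


def pearson(pairs):
    """Pearson correlation between the first and second observation per cluster."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_a * var_b)


# Equal between- and within-cluster variance: the theoretical
# within-cluster correlation is 1 / (1 + 1) = 0.5.
clustered = pearson(simulate_pairs(2000, sd_between=1.0, sd_within=1.0))
# No cluster effect: the two observations are independent.
independent = pearson(simulate_pairs(2000, sd_between=0.0, sd_within=1.0))
print(f"clustered: {clustered:.2f}, independent: {independent:.2f}")
```

The shared cluster effect is what induces the correlation. Setting its standard deviation to zero recovers the i.i.d. situation assumed by the earlier models.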

# Some reasonable modelling options
If we have dependent data, we have these options:
- Complete Pooling: One global model, irrespective of clusters. We ignore the clusters, treat all the data as one giant pool of i.i.d. observations, and fit a single model. The model can learn across the population but won't learn anything idiosyncratic about particular clusters.
- No Pooling: We build a separate model for each cluster, i.e. we split (shard) the data by cluster and fit each shard separately. The drawback is that each model has little training data and cannot learn across the population.
- One Hot Encode Cluster: Add the cluster as a regressor using dummy variables. This approach does not scale as the cardinality of the cluster variable rises. Tree-based approaches like random forests and gradient boosted trees struggle with high-cardinality categorical variables: an exhaustive search over the binary splits of an unordered categorical variable with $k$ levels has $2^{k-1}-1$ candidates, so the cost of splitting grows exponentially in $k$. scikit-learn's standard tree estimators do not split on categorical variables directly and work from the encoded columns instead. Packages that do split on categories directly, as some R packages do, limit the number of levels to around 50 because the split search becomes infeasible. Hence this is not a good option.
- Classical Mixed Effects Models: Linear, non-linear and hierarchical Bayesian models
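The first two options can be contrasted on a toy example. The sketch below uses hypothetical data for two clusters and a hand-written least-squares helper (`ols_line` is an illustrative name, not a library function).

```python
def ols_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical toy data: two clusters with different intercepts and slopes.
clusters = {
    "A": ([0, 1, 2, 3], [1.0, 2.0, 3.0, 4.0]),  # follows y = 1 + 1.0 * x
    "B": ([0, 1, 2, 3], [5.0, 5.5, 6.0, 6.5]),  # follows y = 5 + 0.5 * x
}

# Complete pooling: ignore cluster labels and fit one line to everything.
all_x = [x for xs, _ in clusters.values() for x in xs]
all_y = [y for _, ys in clusters.values() for y in ys]
complete_pooling = ols_line(all_x, all_y)

# No pooling: one separate line per cluster.
no_pooling = {name: ols_line(xs, ys) for name, (xs, ys) in clusters.items()}

print("complete pooling:", complete_pooling)
print("no pooling:", no_pooling)
```

The pooled line (intercept 3, slope 0.75) matches neither cluster, while the per-cluster fits recover each line but share no information with each other. Mixed effects models sit between these two extremes.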

# Multilevel Models

<font color="blue">Also known as: Random coefficient models, Varying coefficient models, Subject-specific models, Hierarchical linear models, Mixed-effects models</font>

Multilevel models are a general class of statistical models for dependent data, where the observations that arise from a randomly sampled cluster may be correlated with each other. What makes them unique is that the regression coefficients we talked about in previous weeks are allowed to vary randomly across the randomly sampled higher-level clusters. The regression coefficients, which describe relationships between predictors and the outcome, no longer have to be fixed constants: we allow them to vary across the higher-level units and estimate the amount of variability in them.

So in other words, each subject (or cluster) in our study could have their own unique intercept and their own unique slope, instead of assuming that everybody follows the same general relationship or the same general pattern. 

Multilevel models allow us to estimate the variability among subjects or clusters in terms of these coefficients of interest. They are still regression models: we still estimate fixed parameters that describe relationships between predictors and outcome variables. Beyond what we covered in previous weeks, we also estimate parameters that describe how much those relationships vary across the randomly sampled higher-level clusters. This expands the types of inferences we can make from fitting models to the data. We can still make inference about the relationships between predictor variables and outcomes, and on top of that:
- __NEW addition:__ We can make inferences about how variable the coefficients are in the larger population from which the clusters (for example, schools or clinics) were randomly sampled.
- __NEW addition:__ We can try to explain that variability among the higher-level clusters with cluster-level predictor variables, i.e. use some feature of the randomly sampled clusters to explain the variability in the coefficients.

__Level 1:__ 
$$y_{ij}=\beta_{0j} + \beta_{1j}x_{1ij}+e_{ij}$$
__Level 2:__
$$\beta_{0j} = \beta_0+u_{0j}$$
$$\beta_{1j} = \beta_1+u_{1j}$$

Here $\beta_{0j}$ (the intercept for each cluster) and $\beta_{1j}$ (the slope for each cluster) are __random coefficients__ and NOT parameters. In level 2, $\beta_0$ is the fixed regression parameter we are trying to estimate, while $u_{0j}$ is a random variable (also called a __random effect__). The same holds for $\beta_{1j}$, with fixed parameter $\beta_1$ and random effect $u_{1j}$. __Random effects__ are __random variables__ whose values for the different clusters (which depend on which clusters were randomly sampled!) are assumed to be drawn from a normal distribution with mean 0 and __some variance__. In multilevel models, we are specifically interested in __estimating that variance__, i.e. the variability of the cluster-specific coefficients around the overall coefficients $\beta_0$ and $\beta_1$. (Without random effects, we had to assume that observations from the same cluster were independent. That is a strong assumption, and multilevel models let us relax it.)

Here $j$ refers to the $j^{th}$ cluster and $i$ refers to the $i^{th}$ observation within the cluster. $\beta_1$ captures the overall relationship of the predictor $x_1$ with the dependent variable. The $j$ subscript in $\beta_{1j}$ makes that slope specific to cluster $j$. $u_{0j}$ and $u_{1j}$ are random variables that are assumed to come from a normal distribution.
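One way to internalize the two levels is to generate data from them. The sketch below is a minimal simulation under stated assumptions: the parameter values are hypothetical, and $u_{0j}$ and $u_{1j}$ are drawn independently for simplicity (the model itself allows them to co-vary).

```python
import random
import statistics


def simulate_multilevel(n_clusters, n_per_cluster, beta0, beta1,
                        sd_u0, sd_u1, sd_e, seed=0):
    """Generate (cluster, x, y) rows from the two-level model.

    Assumption for simplicity: u0j and u1j are drawn independently.
    """
    rng = random.Random(seed)
    rows, coefficients = [], []
    for j in range(n_clusters):
        # Level 2: cluster-specific coefficient = fixed parameter + random effect.
        beta0_j = beta0 + rng.gauss(0.0, sd_u0)
        beta1_j = beta1 + rng.gauss(0.0, sd_u1)
        coefficients.append((beta0_j, beta1_j))
        for _ in range(n_per_cluster):
            x = rng.uniform(0.0, 10.0)
            # Level 1: outcome from the cluster's own line plus residual error.
            y = beta0_j + beta1_j * x + rng.gauss(0.0, sd_e)
            rows.append((j, x, y))
    return rows, coefficients


rows, coefficients = simulate_multilevel(
    n_clusters=500, n_per_cluster=5,
    beta0=1.0, beta1=2.0, sd_u0=1.5, sd_u1=0.5, sd_e=1.0,
)
slopes = [b1 for _, b1 in coefficients]
print(f"mean of cluster slopes: {statistics.mean(slopes):.2f}")  # near beta1
print(f"sd of cluster slopes: {statistics.stdev(slopes):.2f}")   # near sd_u1
```

Fitting a multilevel model runs this process in reverse: from the rows alone, it estimates the fixed parameters and the variances of the random effects.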

<font color="green">__Key Research Question__: __How much__ of the unexplained variance is due to __between-cluster variance__ in intercepts or slopes for a given model?

That is the key research question we try to answer with multilevel models. If we don't care about that between-cluster variance, we may not need multilevel models for our analysis: we need an explicit research interest in estimating the variances of these random coefficients. If we're not interested in estimating that variance, we can consider other models for dependent data (like marginal models).
</font>

http://mfviz.com/hierarchical-models/


This application provides a visual overview of the basic ideas of fitting multilevel models. Learners should follow through the text and the corresponding visualizations to get a graphical sense of what exactly is happening when we fit multilevel models.

# When to use multilevel models
- First, we need a dataset that is organized into clusters (clinics, subjects, schools, neighborhoods, etc.) with several correlated observations collected from each cluster. The study design gives us reason to believe that observations on the dependent variable are correlated within a sampled cluster.
- Second, the clusters themselves need to be randomly sampled from a larger population of clusters. We can't treat variables like gender or race/ethnicity as cluster variables: these are group variables where all the possible groups are represented in the dataset, and we don't randomly sample their values from a larger population of values. We do randomly sample neighborhoods, clinics or hospitals, and the random effects allow us to make inference about the larger population from which the clusters were sampled.
- Third, we wish to explicitly model the correlation of observations within the same cluster. The study design gives rise to this dependency, and we want the statistical model to reflect it.
- Fourth, we have an explicit research interest in estimating the between-cluster variance in selected regression coefficients that define our model. There are other models for dependent data that we could use if we're not explicitly interested in that between-cluster variance.

Given this explicit research interest in estimating between-cluster variance in selected regression coefficients, here are some examples of questions we might want to answer with multilevel models:
- How much of the unexplained variance among hospitals in mean patient satisfaction is due to the size of the hospital? In other words, is there variability among hospitals, and can it be explained by how big the hospital is?
- How much variance is there in long-term trends of substance use for a sample of drug users? In other words, do different drug users follow different long-term trends? We can estimate that variance with multilevel models.


## Advantages of multilevel models over other approaches for dependent data

- When we fit these models, we estimate one parameter that represents the variance of a given random coefficient across the clusters, instead of estimating a unique regression coefficient for every cluster (as we would by creating dummy variables for each cluster). Rather than this purely stratified approach, where every cluster gets its own fixed regression coefficient, we estimate a single parameter that describes the variance of the random effects. This is much more efficient, especially when we have a large number of clusters.
- In addition, clusters with smaller sample sizes do not have as pronounced an effect on that variance estimate as larger clusters do. The effects of the smaller clusters shrink toward the overall mean of the outcome when we use the random effects approach. This is called shrinkage, and it really matters when many clusters have small sample sizes: you don't want them to have as large an influence on the overall variance estimate.
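Shrinkage can be illustrated with the standard partial-pooling formula for a random-intercept model with no predictors. The sketch below assumes the between-cluster and within-cluster variances are known, whereas in practice they are estimated. The function name and all numbers are hypothetical.

```python
def shrunken_mean(cluster_mean, n_obs, grand_mean, var_between, var_within):
    """Partially pooled cluster mean for a random-intercept model.

    Assumes the variance components are known; in practice they are estimated.
    """
    # Weight on the cluster's own data: approaches 1 as n_obs grows.
    weight = var_between / (var_between + var_within / n_obs)
    return weight * cluster_mean + (1.0 - weight) * grand_mean


# Hypothetical values: both clusters have a raw mean of 10, the grand mean is 0.
small = shrunken_mean(10.0, n_obs=1, grand_mean=0.0,
                      var_between=1.0, var_within=4.0)
large = shrunken_mean(10.0, n_obs=100, grand_mean=0.0,
                      var_between=1.0, var_within=4.0)
print(f"small cluster: {small:.2f}, large cluster: {large:.2f}")
```

The cluster with a single observation is pulled most of the way toward the grand mean, while the cluster with 100 observations stays close to its own raw mean.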

We can also add cluster-level predictors, like $T$, to explain variance in the random effects denoted by $u$.

Example: y = outcome, x = age, i = time point, j = subject

__Level 1:__
$$y_{ij}=\beta_{0j} + \beta_{1j}x_{1ij}+e_{ij}$$
__Level 2:__
$$\beta_{0j} = \beta_{00}+ \beta_{01}T_j + u_{0j}$$
$$\beta_{1j} = \beta_{10}+ \beta_{11}T_j + u_{1j}$$

Now we can __test hypotheses__ about the regression parameters for $T$ and make statements like _45% of the between-subject variance in the age-Y relationship is due to T!_

Once we estimate $\beta_{01}$ and $\beta_{11}$ and test hypotheses about them, significant parameters mean that $T$ explains some of the between-cluster variance. We can then make statements like "45 percent of the between-subject variance in the relationship between age and the dependent variable $y$ can be explained by the subject-level predictor $T$", and __this is a unique advantage of multilevel models. We can make inference about how much variance in the random effects gets explained by these higher-level covariates.__
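Substituting the level 2 equations into level 1 shows where $T$ enters the model: as a main effect and as an interaction with $x_1$.

$$y_{ij} = \beta_{00} + \beta_{01}T_j + \beta_{10}x_{1ij} + \beta_{11}T_j x_{1ij} + u_{0j} + u_{1j}x_{1ij} + e_{ij}$$

A statement such as "45% explained" is commonly computed as the proportional reduction in a random-effect variance after $T$ is added. Writing $\sigma^2_1$ for the variance of the random slopes $u_{1j}$, and using hypothetical estimates chosen only to reproduce the 45% figure:

$$\frac{\hat\sigma^2_{1}(\text{without } T) - \hat\sigma^2_{1}(\text{with } T)}{\hat\sigma^2_{1}(\text{without } T)} = \frac{0.20 - 0.11}{0.20} = 0.45$$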
 
## Multilevel models for continuous dependent variable

Model for a __continuous dependent variable Y__, measured on __person i__ within __cluster j__

$$y_{ij}=\beta_{0} + \beta_{1}x_{1ij}+u_{0j}+u_{1j}x_{1ij}+e_{ij}$$

Here we have two random effects: one for the intercept of each cluster and another for the slope of each cluster. The $u$ coefficients come from a bivariate normal distribution with means 0 and variance-covariance matrix $D$. (We usually model random effects and random error terms with normal distributions having mean 0 and specified variances and covariances.)

$$ \begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} \sim N \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, D \right), \qquad D \equiv \begin{bmatrix} \sigma^2_0 & \sigma_{01} \\ \sigma_{01} & \sigma^2_1 \end{bmatrix} $$

Variance-covariance matrix of Random Effects (D)

$$e_{ij} \sim N(0,\sigma^2)$$

We allow the two random effects, the random intercept and the random slope, to co-vary: $\sigma_{01}$, off the diagonal of the $D$ matrix, is their covariance. It could be, for example, that the higher a cluster's random intercept, the lower its random slope, in which case the covariance is negative. That is just one possibility, but the model allows for it. We also assume that the error terms within clusters follow a normal distribution with mean 0 and variance $\sigma^2$.

So, you can see we're actually estimating three variance components, $\sigma^2$, $\sigma^2_0$ and $\sigma^2_1$, in addition to the covariance of those two random effects $\sigma_{01}$.

__The errors are assumed to be independent of the random effects.__
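These assumptions determine how the model captures within-cluster correlation. As a sketch of the derivation, assume additionally that the errors $e_{ij}$ are independent of each other. For two different observations $i \neq i'$ in the same cluster $j$:

$$\mathrm{Var}(y_{ij} \mid x_{1ij}) = \sigma^2_0 + 2x_{1ij}\sigma_{01} + x_{1ij}^2\sigma^2_1 + \sigma^2$$

$$\mathrm{Cov}(y_{ij}, y_{i'j} \mid x_{1ij}, x_{1i'j}) = \sigma^2_0 + (x_{1ij} + x_{1i'j})\sigma_{01} + x_{1ij}x_{1i'j}\sigma^2_1$$

Observations from different clusters share no random effects, so their covariance is 0. In the special case of a random intercept only ($u_{1j}$ dropped), the within-cluster correlation reduces to the intraclass correlation:

$$\mathrm{Corr}(y_{ij}, y_{i'j}) = \frac{\sigma^2_0}{\sigma^2_0 + \sigma^2}$$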

Fixed effects = $\beta_{0} + \beta_{1}x_{1ij}$ (Note these beta parameters are fixed for entire model)

Random effects = $u_{0j}+u_{1j}x_{1ij}$ (u0 is for intercept and u1 is for slope)

Error = $e_{ij}$ (Error is capturing what is not covered by the regression coefficients and the random effects)

- __fixed effects__: these are regression coefficients or regression parameters that define unknown constants and these define the relationships between predictors and dependent variables that we wish to estimate. The fixed effects are largely used to define the __mean__ of our dependent variable in the model specification. 
- __random effects__: These are random variables. So, because they're random variables and not fixed constants, we need to define distributions for these random variables. 
 
Again: __Multilevel models are used because we have an explicit interest in estimating the variance of random cluster effects__

## Why do we think about the multilevel specification?

- This specification clearly defines the roles of covariates that are measured at higher levels in the multilevel model.
- We can view each level-two equation for a given random coefficient as an intercept-only regression model. So $\beta_{0j}$ is defined by a fixed intercept $\beta_0$ plus an error term: the random effect that allows each cluster to have a unique coefficient.
- We can explain variance in those random effects by adding fixed effects of level-two covariates (fixed regression parameters) to the level-two equations.

## Multilevel Logistic Regression Models
### Model specification
Multilevel model for __binary dependent variable Y__, measured on __person i__ within __cluster j__

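$$\ln \bigg[ \frac{P(y_{ij}=1)}{1-P(y_{ij}=1)} \bigg] = \text{logit}[P(y_{ij}=1)]=\beta_{0} + \beta_{1}x_{1ij}+u_{0j}+u_{1j}x_{1ij}$$

There is no separate error term $e_{ij}$ on the logit scale: the randomness of a binary outcome comes from its Bernoulli distribution, given the probability defined above.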

We use maximum likelihood estimation to fit these models and obtain parameter estimates. Because multilevel models are computationally demanding, the estimation sometimes fails to converge, and we then have to respecify the model (for example, by simplifying the random effects structure) and try again. Often the likelihood function cannot be written in closed form, and we use techniques like adaptive Gaussian quadrature to approximate it.
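The reason the likelihood has no closed form is that the random effects must be integrated out. The sketch below illustrates this for one cluster in a random-intercept logistic model with no predictors. It compares a fine midpoint-rule integral with ordinary three-point Gauss-Hermite quadrature. This is the non-adaptive rule: adaptive variants additionally recenter and rescale the nodes for each cluster, which is not shown here. All function names and data values are hypothetical.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def conditional_likelihood(ys, beta0, u):
    """Likelihood of one cluster's binary outcomes given its random intercept u."""
    p = expit(beta0 + u)
    return math.prod(p if y == 1 else 1.0 - p for y in ys)


def marginal_likelihood_grid(ys, beta0, sd_u, n_steps=4000, width=8.0):
    """Integrate the random intercept out with a fine midpoint rule (reference)."""
    lower = -width * sd_u
    step = 2.0 * width * sd_u / n_steps
    total = 0.0
    for k in range(n_steps):
        u = lower + (k + 0.5) * step
        density = math.exp(-0.5 * (u / sd_u) ** 2) / (sd_u * math.sqrt(2.0 * math.pi))
        total += conditional_likelihood(ys, beta0, u) * density * step
    return total


# Three-point Gauss-Hermite rule for the weight exp(-t^2):
# nodes 0 and +/- sqrt(3/2), weights 2*sqrt(pi)/3 and sqrt(pi)/6.
GH_NODES = (-math.sqrt(1.5), 0.0, math.sqrt(1.5))
GH_WEIGHTS = (
    math.sqrt(math.pi) / 6.0,
    2.0 * math.sqrt(math.pi) / 3.0,
    math.sqrt(math.pi) / 6.0,
)


def marginal_likelihood_gh3(ys, beta0, sd_u):
    """Same integral with three Gauss-Hermite nodes, using u = sqrt(2) * sd_u * t."""
    total = sum(
        w * conditional_likelihood(ys, beta0, math.sqrt(2.0) * sd_u * t)
        for t, w in zip(GH_NODES, GH_WEIGHTS)
    )
    return total / math.sqrt(math.pi)


ys = [1, 1, 0]  # hypothetical outcomes for one cluster
print(marginal_likelihood_grid(ys, beta0=0.0, sd_u=1.0))
print(marginal_likelihood_gh3(ys, beta0=0.0, sd_u=1.0))
```

The two values should be close but not identical, since three nodes is a coarse rule. The full likelihood multiplies such integrals over all clusters, and the optimizer re-evaluates them at every step, which is why these models are slow to fit.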

# When to use marginal models

- We have data from a longitudinal or clustered study design that __introduces dependencies in the collected data__, and we need to model those dependencies to obtain __accurate inferences__ about relationships of interest. (Note: we don't assume i.i.d. data here. We have __dependent data__)

- We have __no interest__ in estimating between-subject or between-cluster variance in coefficients of interest

- We wish to make inference about __overall, marginal relationships between IVs and DVs__ in the target population; i.e. we do not wish to condition on random effects of subjects or clusters in the modeling
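A marginal model specifies only the population-averaged mean and a working correlation structure for observations in the same cluster. As a sketch for a continuous outcome with an exchangeable working correlation (the symbol $\rho$ is notation introduced here for illustration):

$$E(y_{ij} \mid x_{1ij}) = \beta_0 + \beta_1 x_{1ij}$$

$$\mathrm{Corr}(y_{ij}, y_{i'j}) = \rho \quad (i \neq i'), \qquad \mathrm{Corr}(y_{ij}, y_{i'j'}) = 0 \quad (j \neq j')$$

There are no $u_{0j}$ or $u_{1j}$ terms, so nothing is conditioned on cluster-specific random effects.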

# Why do we Fit Marginal Models?
- These models offer some advantages over other approaches for __dependent data__ (e.g. multilevel modeling):
    - Quicker computational times; faster estimation (especially for dependent variables that aren't continuous or normally distributed)
    - Robust standard errors that reflect the specified correlation structure
    - Easier accommodation of non-normal outcomes (recall that multilevel models for non-normal outcomes can take a while to estimate!)
    - __Remember though__, we can no longer make inference about between-cluster variance in the coefficients of interest when fitting marginal models.
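To illustrate what a robust standard error does, here is a minimal sketch for the simplest possible marginal model: an intercept-only mean with an independence working correlation. It compares the naive standard error with a cluster-robust (sandwich) one. The function name and toy data are hypothetical, and no small-sample correction is applied.

```python
import math


def naive_and_robust_se(clusters):
    """Standard error of the overall mean, ignoring vs. respecting clusters.

    clusters: list of lists, one inner list of outcomes per cluster.
    """
    values = [y for cluster in clusters for y in cluster]
    n = len(values)
    mean = sum(values) / n

    # Naive SE: treats all n observations as independent.
    sample_var = sum((y - mean) ** 2 for y in values) / (n - 1)
    naive_se = math.sqrt(sample_var / n)

    # Sandwich SE: residuals are summed within each cluster before squaring,
    # so within-cluster correlation is reflected in the variance estimate.
    cluster_sums = [sum(y - mean for y in cluster) for cluster in clusters]
    robust_se = math.sqrt(sum(s ** 2 for s in cluster_sums)) / n

    return mean, naive_se, robust_se


# Hypothetical toy data: observations are identical within each cluster.
mean, naive_se, robust_se = naive_and_robust_se([[1.0, 1.0], [3.0, 3.0]])
print(f"mean={mean:.2f}, naive SE={naive_se:.3f}, robust SE={robust_se:.3f}")
```

Because the observations within each toy cluster are perfectly correlated, the naive standard error understates the uncertainty and the robust one is larger. The point estimate of the mean is the same either way.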