vignettes/practical_intersectionality.Rmd

---
title: "Case Study: Intersectionality Analysis Using The MAIHDA Framework"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Case Study: Intersectionality Analysis Using The MAIHDA Framework}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r set-options, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dev = "png",
  out.width = "100%",
  dpi = 300,
  message = FALSE,
  warning = FALSE,
  package.startup.message = FALSE
)

options(modelbased_join_dots = FALSE)
options(modelbased_select = "minimal")

pkgs <- c(
  "datawizard",
  "parameters",
  "glmmTMB",
  "performance",
  "marginaleffects",
  "Matrix",
  "see",
  "ggplot2"
)

if (!all(insight::check_if_installed(pkgs, quietly = TRUE))) {
  knitr::opts_chunk$set(eval = FALSE)
}
if (getRversion() < "4.1.0") {
  knitr::opts_chunk$set(eval = FALSE)
}
```

This vignette demonstrate how to use *modelbased* in the context of an intersectional multilevel analysis of individual heterogeneity, using the MAIHDA framework. The general approach of the MAIHDA framework (sometimes also _I-MAIHDA_) is described in _Axelsson Fisk et al. 2018_ and _Evans et al. 2024_.

Intersectionality analysis is a new approach in social epidemiology, which attempts to move away from looking at relevant social indicators in isolation.

> "The advantage of incorporating an intersectional framework in social epidemiology is that it goes beyond the unidimensional study of socioeconomic and demographic categorizations by considering the effect of belonging to specific strata simultaneously defined by multiple social, economic and demographic dimensions."

The steps we are showing here are:

1. Defining the intersectional strata.

2. Fitting a multilevel model to see whether intersectional strata contribute to between-stratum variance (which can be considered as "inequalities", whether social or health related).

3. Fitting partially adjusted multilevel models and calculating proportional change in the between-stratum variance (PCVs) to quantify to what degree the different intersectional dimensions contribute to the between-stratum variance (inequalities).

4. Calculate adjusted predictions (estimated marginal means) of the outcome by intersectional strata, to get a clearer picture of the variation between intersectional dimensions, as well as testing specific strata for significant differences.

## 1. Preparing the data and defining intersectional strata

First, we load the required packages and prepare a sample data set. We use the `efc` data from the **modelbased** package, which contains data of family carers who care for their elderly relatives. Our outcome of interest is _quality of life_ of family carers (score ranging from 0 to 25 points), the different dimensions of the intersectionality groups are _gender_ (male/female), _employment status_ (currently employed yes/no) and _age_ (three groups: until 40, 41 to 64 and 65 or older). We assume that there might be health-related inequalities, i.e. the quality of life differs depending on the characteristics that define our intersectional strata.

```{r}
library(modelbased) # predictions and significance testing
library(insight) # extracting random effects variances
library(datawizard) # data wrangling and preparation
library(parameters) # model summaries
library(performance) # model fit indices, ICC
library(glmmTMB) # multilevel modelling

# sample data set
data(efc, package = "modelbased")

efc <- efc |>
  # numeric to factors, set labels as levels
  to_factor(select = c("c161sex", "c172code", "c175empl")) |>
  # recode age into three groups
  recode_values(
    select = "c160age",
    recode = list(`1` = "min:40", `2` = 41:64, `3` = "65:max")
  ) |>
  # rename variables
  data_rename(
    select = c("c161sex", "c160age", "quol_5", "c175empl"),
    replacement = c("gender", "age", "qol", "employed")
  ) |>
  # age into factor, set levels, and change labels for education
  data_modify(age = factor(age, labels = c("-40", "41-64", "65+")))
```

To include the intersectional strata variables `gender`, `employed` and `age` in our mixed model, we will define them as interacting random effects (excluding main effects of interactions): `(1 | gender:employed:age)` (see also below). The idea is to have truly unique combinations in our model, similar as if we would create a factor variable with all combinations manually:

```{r}
efc$strata <- ifelse(
  is.na(efc$employed) | is.na(efc$gender) | is.na(efc$age),
  NA_character_,
  paste0(efc$gender, ", ", efc$employed, ", ", efc$age)
)
efc$strata <- factor(efc$strata)
data_tabulate(efc$strata)
```

We now have the choice and could either use the `strata` variable as group factor for our random effects, or `gender:employed:age`. For plotting predictions (see section 4), we get clearer plots when we include the three factors `gender`, `employed` and `age` instead of the integrated `strata` factor.

## 2. Fitting the simple intersectional model

Intersectionality analysis aims at recognizing effects of belonging to specific strata simultaneously. In the context of the MAIHDA framework, the interest lies in analysing the variation between strata regarding the outcome of interest. Thus, the indicators that define the intersectional dimensions are used as interacting _random effects_ (group factors) in a multilevel model (random-intercept model).

We start by fitting a linear mixed effects model, which includes no fixed effects, but only our different intersectional dimensions: `gender`, `employed` and `age`.

```{r}
# Quality of Life score ranges from 0 to 25
m_null <- glmmTMB(qol ~ 1 + (1 | gender:employed:age), data = efc)

# the above model is identical to:
# m_null <- glmmTMB(qol ~ 1 + (1 | strata), data = efc)
```

The purpose of this model is to quantify the "discriminatory accuracy", which is achieved by calculating the ICC (see [`performance::icc()`](https://easystats.github.io/performance/reference/icc.html)) of this model. The higher the ICC, the greater the degree of similarity _within the strata_ (regarding quality of life) and the greater the difference in quality of life _between the intersectional strata_. I.e., the higher the ICC, the better the model is at discriminating individuals with higher or lower quality of life score, as opposed to models with lower ICC.

We now look at the model parameters and the ICC of our simple intersectional model.

```{r}
model_parameters(m_null)

icc(m_null)
```

The ICC with a value of about 4% is rather low. Usually, this indicates that our dimensions used to define the intersectional strata do not suggest larger social inequalities regarding quality of life. But we ignore this fact for now, as the purpose of demonstrating the analysis approach is rarely affected.

## 3. Partially-adjusted intersectional model and PCV

In the next step we want to find out, which intersectional dimension contributes most to possible inequalities, i.e. which of our group factors `gender`, `employed` and `age` explains most of the between-stratum variance of the random effects. This is achieved by fitting partially-adjusted intersectional models.

> "The purpose of the partially adjusted model was to quantify to what degree the different dimensions used to construct the intersectional strata contributed to the between stratum variance seen in the previous model."

For each of the intersectional dimensions, a multilevel model including this dimension as fixed effect is fitted. We can then both look at the ICCs of the partially-adjusted models, as well as at the proportional change in the between-stratum variance, the so-called _PCV_ coefficients.

First, we fit three models each with one dimension as predictor.

```{r}
m_gender <- glmmTMB(qol ~ gender + (1 | gender:employed:age), data = efc)
m_employment <- glmmTMB(qol ~ employed + (1 | gender:employed:age), data = efc)
m_age <- glmmTMB(qol ~ age + (1 | gender:employed:age), data = efc)
```

The regression coefficients already give an impression how strong the association between each single dimension and the outcome is, taking between-stratum variance into account. The larger (in absolute values) the coefficients, the higher the degree that dimension contributed to the between-stratum variance.

```{r}
compare_parameters(m_gender, m_employment, m_age)
```

Looking at the summary tables above, it seems like `gender` is the dimension that explains least of the between-stratum variance, i.e. gender seems to be the characteristic that contributes least to potential social inequalities. `age`, in turn, seems to be the most important characteristic regarding inequalities.

Since the fixed effects now take away some of the proportion of the variance explained by the grouping factors (random effects), we expect the ICC for the above models to be lower.

```{r}
icc(m_gender)$ICC_adjusted
icc(m_employment)$ICC_adjusted
icc(m_age)$ICC_adjusted
```

Indeed, the ICC correlates with the fixed effects coefficients, i.e. the larger the coefficient (in absolute values), the lower the ICC.

Next, we want to quantify the degree the different dimensions contribute to the variance between groups more accurately. To do so, we calculate the _proportional change in between-stratum variance_, or _PCV_. This coefficient explains how much of the total proportion of explained variance by the strata can be explained by a single dimension that define those strata. The PCV ranges from 0 to 1, and the closer to 1, the more this particular dimension explains social inequalities.

```{r}
# extract random effect variances from all models
v_null <- get_variance(m_null)
v_gender <- get_variance(m_gender)
v_employment <- get_variance(m_employment)
v_age <- get_variance(m_age)

# PCV (proportional change in between-stratum variance)
# from null-model to gender-model
(v_null$var.random - v_gender$var.random) / v_null$var.random

# PCV from null-model to employment-model
(v_null$var.random - v_employment$var.random) / v_null$var.random

# PCV from null-model to age-model
(v_null$var.random - v_age$var.random) / v_null$var.random
```

Again, we see that the PCV is in line with the models' ICC's and regression coefficients. We see the highest proportional change for `age`, meaning that - although gender and education can contribute to inequalities - age is the most relevant predictor.

## 4. Predict between-stratum variance and test for significant differences

Finally, we may want to have a clearer picture of how the different strata vary, which combination of characteristics defines the highest or maybe lowest risk group. To do so, we calculate predictions of the random effects (*unit-level* predictions).

The following code shows the predicted average quality of life scores for the different groups.

```{r}
predictions <- estimate_relation(m_null, by = c("gender", "employed", "age"))
plot(predictions)
```

According to these results, employed male family carers, who are not older than 40 years, show on average the highest quality of life. On the other hand, unemployed female carers aged 65 or older have the lowest quality of life.

We can now calculate pairwise comparisons that show which differences between groups are statistically significant. Since all combinations of pairwise comparisons would return 66 rows in total, we just show the first ten rows for demonstrating purpose.

```{r}
# just show first 10 rows of output...
estimate_contrasts(predictions, contrast = c("gender", "employed", "age"))[1:10, ]
```

If we only want to modulate one factor and compare those groups within the levels of the other groups, we can use the `by` argument. This reduces the output and only compares the focal term(s) within the levels of the remaining predictors.

```{r}
# Compare levels of gender and employment status for age groups
estimate_contrasts(predictions, contrast = c("gender", "employed"), by = "age")
```

E.g., if we look at the plot and want to know whether female persons aged 65+ differ depending on their employment status, we can use the following code:

```{r}
# Compare levels employment status by gender and age groups
estimate_contrasts(predictions, contrast = "employed", by = c("gender", "age"))
```

## 5. Conclusion

Intersectional multilevel analysis of individual heterogeneity, using the MAIHDA framework, is a new approach in social epidemiology, which helps to understand the interaction of social indicators with regard to social inequalities.

This approach requires the application of multilevel models, where ICC and PCV are relevant coefficients. The *modelbased* package allows to go beyond quantifying to what degree different intersectional dimensions contribute to inequalities by predicting the average outcome by group, thereby explicitly showing the differences between those groups (strata).

Furthermore, with *modelbased* it is possible to compare differences between groups and test whether these differences are statistically significant or not, i.e. whether we find "evidence" for social inequalities in our data for certain groups (at risk).

# References

1. Axelsson Fisk S, Mulinari S, Wemrell M, Leckie G, Perez Vicente R, Merlo J. Chronic Obstructive Pulmonary Disease in Sweden: An intersectional multilevel analysis of individual heterogeneity and discriminatory accuracy. SSM - Population Health (2018) 4:334-346. doi: 10.1016/j.ssmph.2018.03.005

2. Evans CR, Leckie G, Subramanian SV, Bell A, Merlo J. A tutorial for conducting intersectional multilevel analysis of individual heterogeneity and discriminatory accuracy (MAIHDA). SSM - Population Health (2024) 26; doi: 10.1016/j.ssmph.2024.101664