```{r, echo=FALSE}
cat(paste("(C) (cc by-sa) Wouter van Atteveldt & Jan Kleinnijenhuis, file generated", format(Sys.Date(), format="%B %d %Y")))
```
Multilevel Modeling with R
====
```
*Caveat* I am not an expert on multilevel modelling. Please take this document as a source of inspiration rather than as a definitive set of answers.
```
This document gives an overview of two commonly used packages for multi-level modelling in R (also called 'mixed models' or 'random effects models').
Since the yearly time series data we have used so far are not suitable for multilevel analysis,
we take the textbook data from Joop Hox on the popularity of pupils in schools
(see also http://www.ats.ucla.edu/stat/examples/ma_hox/):
```{r}
library(foreign)
popdata<-read.dta("http://www.ats.ucla.edu/stat/stata/examples/mlm_ma_hox/popular.dta")
head(popdata)
```
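Before modelling, it can help to get a feel for the grouping structure. The following is a quick sketch (using only the `school` column shown above) that counts the schools and tabulates how many pupils each school contains:
```{r}
# Number of distinct schools in the data
length(unique(popdata$school))
# Distribution of the number of pupils per school
table(table(popdata$school))
```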
Now, we can fit a multilevel model with only a random intercept at the school level:
```{r}
library(nlme)
m = lme(popular ~ sex + texp, random=~1|school, popdata)
summary(m)
```
So, the popularity of a pupil is determined by both gender and teacher experience.
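If you also want confidence intervals for these fixed effects, `nlme` offers the `intervals` function; a minimal sketch:
```{r}
# Approximate 95% confidence intervals for the fixed effects of m
intervals(m, which="fixed")
```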
Let's try a varying slopes model, with teacher experience also differing per school,
and see whether that is a significant improvement:
```{r}
m2 = lme(popular ~ sex + texp, random=~texp|school, popdata)
anova(m, m2)
```
So, although the log likelihood of `m2` is slightly better, the model also uses more degrees of freedom and its BIC is higher,
indicating a worse fit/complexity trade-off. The likelihood-ratio test in the `anova` output shows that the improvement is not significant.
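As an aside, the information criteria used in this comparison can also be requested directly; a minimal sketch:
```{r}
# AIC and BIC for both models; lower values indicate a better fit/complexity trade-off
AIC(m, m2)
BIC(m, m2)
```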
Next, let's have a look at the slope of the gender effect.
A useful first step is a visual inspection of the slopes for a random sample of schools, just to get an idea of the variation.
First, take a random sample of 12 schools from the list of unique school ids:
```{r}
schools = sample(unique(popdata$school), size=12, replace=F)
sample = popdata[popdata$school %in% schools, ]
```
Now, we can use the `xyplot` function from the `lattice` package:
```{r}
library(lattice)
xyplot(popular~sex|as.factor(school),type=c("p","g","r"), col.line="black", data=sample)
```
So, (at least in my sample) there is considerable variation: in some schools gender has almost no effect,
but in other schools the slope is relatively steep and generally positive (meaning girls have higher popularity).
Let's test whether a model with a random slope on gender is a significant improvement:
```{r}
library(texreg)
m2 = lme(popular ~ sex + texp, random=~sex|school, popdata)
anova(m, m2)
screenreg(list(m, m2))
```
So, `m2` is indeed a significant improvement.
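To see how much the gender effect actually varies between schools, we can inspect the variance components of `m2`; a minimal sketch using `nlme`'s `VarCorr`:
```{r}
# Standard deviations of the random intercept and the random sex slope,
# their correlation, and the residual standard deviation
VarCorr(m2)
```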
The lme4 package
====
`lme4` is a package that gives a bit more flexibility in specifying multilevel models.
Specifically, it allows us to specify a binomial family, i.e. logistic regression.
The following dichotomizes the popularity score and checks whether the effect of gender on popularity depends on the school:
```{r}
library(lme4)
popdata$dich = cut(popdata$popular, 2, labels=c("lo","hi"))
m = glmer(dich ~ sex + (1|school), popdata, family="binomial")
m2 = glmer(dich ~ sex + (1 + sex|school), popdata, family="binomial")
summary(m2)
anova(m, m2)
```
If we would like to see for which schools the effect of gender is strongest,
we can use the `ranef` function to get the intercepts and slopes per group and order them by slope:
```{r}
effects = ranef(m2)$school
effects = effects[order(effects$sexgirl), ]
head(effects)
tail(effects)
```
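Finally, one way to visualize these school-level effects is a caterpillar-style dot plot; a sketch assuming the `lattice` package (loaded above) and `lme4`'s `dotplot` method for `ranef` objects:
```{r}
# Dot plot of the estimated random intercepts and sex slopes per school
dotplot(ranef(m2, condVar=TRUE))
```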