# Chapter 14.1: English s-duration (statistical analysis in R)

In [None]:
library(mgcv)
library(ggplot2)
library(GGally)

## Data preparation

Load the data created in the simulation step:

In [None]:
dat = read.csv("../res/s_duration_measures.csv")

In [None]:
head(dat)

Some data preprocessing, adapted from Schmitz et al. (2021):

In [None]:
dat <- droplevels(dat)

dat$speaker <- factor(dat$speaker)
dat$Word <- factor(dat$Word)

dat$Affix <- factor(dat$Affix)
dat$Affix <- relevel(dat$Affix, "NM")

dat$folType <- factor(dat$folType)
dat$folType <- relevel(dat$folType, "APP")

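# Derive biphoneProb from preC: each preC level is relabelled with a probability value.
dat$biphoneProb <- factor(dat$preC)
print(levels(dat$biphoneProb))

# The values are assigned by position, so they must follow the level order printed above.
levels(dat$biphoneProb) <- c("0", "0.00427397562455192", "0.000579054762036066",  "0.000716924943473225")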

table(dat$biphoneProb, dat$preC)

dat$biphoneProb <- as.numeric(as.character(dat$biphoneProb))
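
The recoding above assigns the probabilities by position, so it depends on the order in which `levels()` lists the categories. A minimal sketch of a name-based alternative follows. The category labels and values are invented for illustration and are not those of the data set:

```r
# Toy stand-in for dat$preC; labels and probabilities are hypothetical.
preC <- c("k", "t", "p", "t", "k", "V")

# Named lookup vector: each category is paired explicitly with its value,
# so the result does not depend on the order of the factor levels.
prob_lookup <- c(V = 0, k = 0.004, p = 0.0006, t = 0.0007)

# as.character() guards against indexing by the integer codes of a factor.
biphoneProb <- unname(prob_lookup[as.character(preC)])
print(biphoneProb)

# Pitfall: as.numeric() on a factor returns the internal level codes,
# which is why the original code goes through as.character() first.
f <- factor(c("0.5", "0.25", "0.5"))
codes  <- as.numeric(f)                 # level codes
values <- as.numeric(as.character(f))   # numbers encoded in the labels
print(codes)
print(values)
```

With a lookup of this kind, a category missing from `prob_lookup` yields `NA` instead of silently receiving a neighbouring value.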

## Classical model

We take the model specification from the "traditional model" of Schmitz et al. (2021), with two changes: we fit a Generalised Additive Mixed Model rather than a Linear Mixed Model, and we replace the random slope for Affix with a random intercept for Word.

In [None]:
gam.classical <- gam(sDurLog ~ 
                        Affix + 
                        s(speakingRate) +
                        s(baseDurLog) +
                        pauseBin +
                        biphoneProbSumBin +
                        folType +
                        s(speaker, bs = 're') + s(Word, bs = 're'),
                      data=dat, method="REML")

In [None]:
summary(gam.classical)
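
To see what a term such as `s(Word, bs = 're')` contributes, here is a small sketch on simulated data. All variable names and parameter values are invented for illustration. Each group receives its own intercept shift, and the `re` smooth estimates these shifts as penalised coefficients:

```r
library(mgcv)
set.seed(1)

n_words <- 30
n_per   <- 20
word <- factor(rep(paste0("w", seq_len(n_words)), each = n_per))

# One true intercept shift per factor level (indexed by level code)
word_shift <- rnorm(n_words, sd = 0.5)
x <- runif(n_words * n_per)
y <- 1 + 2 * x + word_shift[as.integer(word)] +
  rnorm(n_words * n_per, sd = 0.3)
toy <- data.frame(y, x, word)

# Random intercept for word, specified as in the model above
fit <- gam(y ~ x + s(word, bs = "re"), data = toy, method = "REML")

# Estimated standard deviation of the random intercepts
gam.vcomp(fit)

# Predicted shift for each word, in the order of levels(word)
is_re <- grepl("s(word)", names(coef(fit)), fixed = TRUE)
re_coefs <- coef(fit)[is_re]
print(cor(re_coefs, word_shift))
```

The predicted shifts are shrunk towards zero relative to the simulated ones, which is the usual behaviour of random intercepts.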

In [None]:
options(repr.plot.width=15, repr.plot.height=3.75)
#pdf("../../fig/s_dur.gam.classical.pdf", he=3.75, wi=15)
par(mfrow=c(1,4), mar=c(5.1, 5.1, 4.1, 2.1))
plot(gam.classical, scale=0, scheme=1, shade.col="steelblue2", 
     rug=TRUE, ylab="Log s-duration", cex.lab=2.5, cex.axis=2, cex.main=2)
#dev.off()

Check the model for concurvity, the analogue of collinearity for smooth terms: a smooth is concurve when it can be approximated by the other terms in the model.

In [None]:
concurvity(gam.classical)
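
As a point of reference for reading the output, the following sketch shows what a problematic case looks like. It uses simulated predictors with invented names, where `x2` is almost a copy of `x1` and `x3` is unrelated to both:

```r
library(mgcv)
set.seed(2)

n  <- 400
x1 <- runif(n)
x2 <- x1 + rnorm(n, sd = 0.02)   # nearly identical to x1
x3 <- runif(n)                   # independent of x1 and x2
y  <- sin(2 * pi * x1) + x3 + rnorm(n, sd = 0.3)

fit <- gam(y ~ s(x1) + s(x2) + s(x3), method = "REML")

# One column per term; values range from 0 (no concurvity) to 1
cc <- concurvity(fit, full = TRUE)
print(round(cc, 2))
```

The columns for `s(x1)` and `s(x2)` should show values close to 1, while the column for `s(x3)` should be clearly lower.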

Concurvity should not be a problem here.
Next, we check whether the residuals are approximately normally distributed:

In [None]:
options(repr.plot.width=7, repr.plot.height=7)
gam.check(gam.classical)

## DLM-based model

For the DLM-based model, we want to make use of the "Support" variable. For this, we first need to inspect its distribution:

In [None]:
plot(density(dat$Support))

The distribution is trimodal, which arises from systematic differences between words and between affixes. The differences between words will be captured by the random effect for Word, and the differences between affixes are exactly what we are interested in, so the multimodality is not a problem.
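
The following sketch illustrates how a multimodal pooled distribution can arise from group differences alone. The three groups and their centres are invented for illustration and are not estimates from the data:

```r
set.seed(3)

# Three hypothetical groups, each drawn from a single normal distribution
affix   <- factor(rep(c("A", "B", "C"), each = 300))
centre  <- c(A = 0, B = 5, C = 10)
support <- rnorm(900, mean = centre[as.character(affix)], sd = 0.5)

# Count local maxima of the kernel density estimate,
# ignoring bumps lower than 10% of the highest peak
count_modes <- function(v) {
  d <- density(v)$y
  inner   <- d[2:(length(d) - 1)]
  is_peak <- diff(sign(diff(d))) == -2
  sum(is_peak & inner > 0.1 * max(d))
}

n_modes <- count_modes(support)
print(n_modes)

# Once group means are removed, the remaining spread is small
centred <- support - ave(support, affix)
print(c(pooled_sd = sd(support), within_group_sd = sd(centred)))
```

In a model, the group means would be absorbed by the corresponding predictor or random effect, leaving only the within-group variation to be explained.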

We now replace the Affix predictor with the Support measure:

In [None]:
gam.ldl <- gam(sDurLog ~ 
                        s(Support) +
                        s(speakingRate) +
                        s(baseDurLog) +
                        pauseBin +
                        biphoneProbSumBin +
                        folType +
                        s(speaker, bs="re") + s(Word, bs="re"),
                      data=dat, method="REML")

In [None]:
summary(gam.ldl)

In [None]:
options(repr.plot.width=11.25, repr.plot.height=7.5)
#pdf("../../fig/s_dur.gam.ldl.pdf", he=7.5, wi=11.25)
par(mfrow=c(2,3), mar=c(5.1, 5.1, 4.1, 2.1))
plot(gam.ldl, scale=0, scheme=1, shade.col="steelblue2", 
     rug=TRUE, ylab="Log s-duration", cex.lab=2.5, cex.axis=2, cex.main=2)
#dev.off()

Run model checks:

In [None]:
concurvity(gam.ldl)

In [None]:
gam.check(gam.ldl)

Concurvity is a little high for the Support measure, but still acceptable. 

## Model comparison

Now compare the two models in terms of AIC:

In [None]:
AIC(gam.ldl)

In [None]:
AIC(gam.classical)
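
`AIC()` also accepts several models at once and returns a table, which makes differences easier to read. A sketch on simulated data follows. The variables `grp`, `m` and `rate` are invented stand-ins for a categorical predictor, a continuous measure and a shared covariate:

```r
library(mgcv)
set.seed(4)

n    <- 300
grp  <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
# Continuous measure whose level differs between groups
m    <- unname(c(a = 0, b = 1, c = 2)[as.character(grp)]) + rnorm(n, sd = 0.3)
rate <- runif(n)
y    <- 0.5 * m + sin(2 * pi * rate) + rnorm(n, sd = 0.2)
toy  <- data.frame(y, grp, m, rate)

fit_factor  <- gam(y ~ grp  + s(rate), data = toy, method = "REML")
fit_measure <- gam(y ~ s(m) + s(rate), data = toy, method = "REML")

# One row per model; delta is the distance to the best (lowest) AIC
aic_tab <- AIC(fit_factor, fit_measure)
aic_tab$delta <- aic_tab$AIC - min(aic_tab$AIC)
print(aic_tab)
```

In this simulation the continuous measure carries within-group information that the factor cannot express, so `fit_measure` obtains the lower AIC.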

Test if the difference is significant:

In [None]:
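# Install itsadug only if it is not already available
if (!requireNamespace("itsadug", quietly = TRUE)) install.packages("itsadug")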

library(itsadug)
compareML(gam.classical, gam.ldl)
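
REML scores are generally only comparable between models that share the same fixed-effects structure. The two models here differ in that respect (a parametric Affix term versus a smooth of Support), so one cautious option is to refit both with `method = "ML"` before comparing them. A sketch of the idea on simulated data, with invented variable names, using only `mgcv`:

```r
library(mgcv)
set.seed(5)

n    <- 300
grp  <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
m    <- unname(c(a = 0, b = 1, c = 2)[as.character(grp)]) + rnorm(n, sd = 0.3)
rate <- runif(n)
y    <- 0.5 * m + sin(2 * pi * rate) + rnorm(n, sd = 0.2)
toy  <- data.frame(y, grp, m, rate)

# Same model formulas as one would use with REML, but fitted with ML
fit_factor_ml  <- gam(y ~ grp  + s(rate), data = toy, method = "ML")
fit_measure_ml <- gam(y ~ s(m) + s(rate), data = toy, method = "ML")

# For ML fits, gcv.ubre holds the minimised negative log marginal
# likelihood, so lower values indicate a better fit
ml_scores <- c(factor  = as.numeric(fit_factor_ml$gcv.ubre),
               measure = as.numeric(fit_measure_ml$gcv.ubre))
print(ml_scores)
```

Models refitted in this way can then be passed to `compareML()` in place of the REML fits.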

## Exercises

Replace Support with SemanticSupportForForm in the DLM-based GAMM:

In [None]:
gam.ldl2 <- gam(sDurLog ~ 
                        s(SemanticSupportForForm) +
                        s(speakingRate) +
                        s(baseDurLog) +
                        pauseBin +
                        biphoneProbSumBin +
                        folType +
                        s(speaker, bs="re") + s(Word, bs="re"),
                      data=dat, method="REML")

In [None]:
summary(gam.ldl2)

In [None]:
options(repr.plot.width=11.25, repr.plot.height=7.5)
#pdf("../../fig/s_dur.gam.ldl.pdf", he=7.5, wi=11.25)
par(mfrow=c(2,3), mar=c(5.1, 5.1, 4.1, 2.1))
plot(gam.ldl2, scale=0, scheme=1, shade.col="steelblue2", 
     rug=TRUE, ylab="Log s-duration", cex.lab=2.5, cex.axis=2, cex.main=2)
#dev.off()

In [None]:
AIC(gam.ldl2)

The model fit is somewhat worse than with Support. One possible reason is that SemanticSupportForForm measures how much support all the trigrams in a word receive from its semantics, whereas Support only measures the support for the final trigram. Since we are investigating word-final s-durations, Support may be the more precise measure.
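
The contrast between the two measures can be made concrete with a toy calculation. The support values below are invented, and the sketch assumes that the whole-form measure aggregates by summing over trigrams and that Support is read off the word-final trigram:

```r
# Hypothetical support values for the trigrams of the form "cats"
# ("#" marks a word boundary); real values would come from the DLM mapping.
trigram_support <- c("#ca" = 0.40, "cat" = 0.35, "ats" = 0.10, "ts#" = 0.05)

# Whole-form measure (assumed here to be a sum over all trigrams)
support_all <- sum(trigram_support)

# Word-final measure: only the last trigram, which contains the final s
support_final <- unname(trigram_support[length(trigram_support)])

print(c(whole_form = support_all, word_final = support_final))
```

In this example the whole-form value is dominated by the stem trigrams, so two words could share a similar whole-form value while differing considerably in how well their final s is supported.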