# Chapter 14.2: Spoken word duration of English homophones EOL (statistics)

## Preparations

Load the CSV file with the EOL measures calculated for the training dataset.

In [None]:
eol = read.csv("../res/timeAndThyme_EOL_measures.csv")
head(eol)

Load the dataset with the homophones from Gahl and Baayen (2024)

In [None]:
dat = read.table("../dat/time_thyme.txt", header=TRUE)
head(dat)

In [None]:
homophones = merge(dat, eol, by.x="Spelling", by.y="Word")

__Not all homophones have fastText embeddings.__ The following counts the homophones dropped by the merge:

In [None]:
nrow(dat) - nrow(homophones)
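
To see which spellings were lost, and not only how many, one could compare the key columns directly. The following is a minimal sketch on toy data: the data frames `dat_toy` and `eol_toy` and all their values are invented for illustration, and the presence of a `Cind` column in the EOL file is an assumption. Only the key names `Spelling` and `Word` follow the notebook.

```r
# Toy stand-ins for `dat` and `eol` (hypothetical values, for illustration only)
dat_toy = data.frame(Spelling = c("time", "thyme", "knight", "night"),
                     LogMeanDuration = c(-1.2, -0.9, -1.1, -1.3))
eol_toy = data.frame(Word = c("time", "knight", "night"),
                     Cind = c(0.4, 0.7, 0.5))

# Inner join, as in the notebook: rows without a match in `eol_toy` are dropped
merged_toy = merge(dat_toy, eol_toy, by.x = "Spelling", by.y = "Word")

# Number of dropped rows, and the spellings that have no EOL measures
n_dropped = nrow(dat_toy) - nrow(merged_toy)
missing_spellings = setdiff(dat_toy$Spelling, eol_toy$Word)
print(n_dropped)
print(missing_spellings)
```

Applied to the real data, the same `setdiff` call on `dat$Spelling` and `eol$Word` would list the homophones without EOL measures.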

Load the GAM library

In [None]:
library(mgcv)

Prepare predictor variables

In [None]:
homophones$NounBias = factor(homophones$NounBias)
homophones$LogMeanBigramProbability = log(homophones$MeanBigramProbability)

Following Gahl & Baayen (2024), we use a baseline duration measure from which other predictors have been partialled out. We do this for both the model with localist variables and the model with DLM variables.

In [None]:
homophones$ResidualLogBaselineDuration =
   resid(gam(LogBaselineDuration ~ s(LogPronunciationFrequency, k=3) +
                                   s(PhonologicalNeighborhoodDensity) +
                                   s(LogMeanBigramProbability, k=5),
             data=homophones)
   )
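
What partialling out achieves can be illustrated on simulated data. The sketch below is not the analysis above: the data frame `toy` and its columns `Frequency` and `Baseline` are hypothetical, and a single predictor stands in for the three used in the notebook. It shows that the residuals of the GAM no longer correlate linearly with the predictor that was partialled out.

```r
library(mgcv)

set.seed(1)
n = 300
# Hypothetical predictor and a baseline measure that depends on it
toy = data.frame(Frequency = rnorm(n))
toy$Baseline = 0.5 * toy$Frequency + rnorm(n, sd = 0.3)

# Partial out the predictor: keep only what the smooth does not explain
toy$ResidualBaseline = resid(gam(Baseline ~ s(Frequency, k = 3), data = toy))

# Correlation with the predictor before and after residualization
r_before = cor(toy$Baseline, toy$Frequency)
r_after = cor(toy$ResidualBaseline, toy$Frequency)
print(round(c(before = r_before, after = r_after), 3))
```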

## GAMs

### Localist GAM

In [None]:
localist.gam = gam(list(LogMeanDuration ~ s(PauseQuotient) +
                                          NounBias +
                                          s(PhonologicalNeighborhoodDensity) +
                                          s(OrthographicRegularity) +
                                          te(LogCelexFrequency, LogRelativeFrequency) +
                                          s(ResidualLogBaselineDuration),
                                        ~ s(LogCelexFrequency)),
                   data=homophones, family="gaulss", method="ML")

In [None]:
summary(localist.gam)

In [None]:
AIC(localist.gam)

In [None]:
plot(localist.gam, pages=1, scale=0)
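
The `gaulss` family fits a Gaussian location-scale model: the first formula in the list describes the mean, and the second, one-sided formula describes the standard deviation. As a rough illustration of how such a model behaves, here is a sketch on simulated data in which the noise grows with the predictor. The names `toy`, `x`, `y`, and `toy.gam` are hypothetical and not part of the analysis.

```r
library(mgcv)

set.seed(2)
n = 500
x = runif(n)
# Both the mean and the standard deviation of y depend on x
y = sin(2 * pi * x) + rnorm(n, sd = exp(-2 + 2 * x))
toy = data.frame(x = x, y = y)

toy.gam = gam(list(y ~ s(x),   # formula for the mean
                     ~ s(x)),  # one-sided formula for the scale
              data = toy, family = gaulss(), method = "ML")

# For gaulss, the fitted values form a two-column matrix:
# column 1 is the mean, column 2 is the inverse of the standard deviation
fv = fitted(toy.gam)
sigma_hat = 1 / fv[, 2]

# The estimated standard deviation should increase with x
print(cor(sigma_hat, x))
```

Because the second column holds the inverse of the standard deviation, an upward-sloping smooth in the scale part of a `gaulss` plot corresponds to decreasing, not increasing, variability.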

### GAM with DLM predictors

SemanticSupportForForm and C.Precision are strongly correlated predictors, unsurprisingly:

In [None]:
cor(homophones$SemanticSupportForForm, homophones$C.Precision, method="spearman")
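
The Spearman correlation is the Pearson correlation of the ranks, which makes it insensitive to the skew of either variable. A small sketch on simulated data (the vectors `x` and `y` are hypothetical) confirms the equivalence:

```r
set.seed(3)
# A right-skewed predictor and a nonlinear but monotone-ish response
x = rexp(200)
y = x^2 + rnorm(200, sd = 0.5)

# Spearman correlation, and Pearson correlation computed on the ranks
rho = cor(x, y, method = "spearman")
rho_ranks = cor(rank(x), rank(y))
print(c(spearman = rho, pearson_on_ranks = rho_ranks))
```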

Gahl & Baayen (2024) selected SemanticSupportForForm, but this variable has a strong right skew, which requires a transformation. C.Precision doesn't require a transformation. In what follows, we consider both predictors.

In [None]:
options(repr.plot.width=10, repr.plot.height=5)
par(mfrow=c(1,2))
plot(density(homophones$SemanticSupportForForm), main="SemanticSupportForForm")
plot(density(homophones$C.Precision), main="C.Precision")

In [None]:
#pdf("../fig/densities_SSFF_CP.pdf", he=4,wi=8)
par(mfrow=c(1,2))
plot(density(homophones$SemanticSupportForForm), main="SemanticSupportForForm", col="steelblue2", lwd=3)
plot(density(homophones$C.Precision), main="C.Precision", col="steelblue2", lwd=3)
#dev.off()

In [None]:
homophones$LogSemanticSupportForForm = log(homophones$SemanticSupportForForm+0.05)

In [None]:
options(repr.plot.width=5, repr.plot.height=5)
plot(density(homophones$LogSemanticSupportForForm))

Note that after the transformation, we still have fairly long tails.
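
The effect of a shifted log transform on skew can be quantified with a simple moment-based skewness coefficient. The sketch below uses simulated log-normal data, not the homophone data. The helper `skewness` is defined here for illustration, and the offset 0.05 is simply carried over from the transformation above.

```r
# Moment-based sample skewness (hypothetical helper, not from a package)
skewness = function(v) {
  z = (v - mean(v)) / sd(v)
  mean(z^3)
}

set.seed(4)
# Simulated right-skewed variable standing in for a predictor
x = rlnorm(1000, meanlog = -1, sdlog = 1)
log_x = log(x + 0.05)

skew_raw = skewness(x)
skew_log = skewness(log_x)
print(round(c(raw = skew_raw, log_transformed = skew_log), 2))
```

Skewness summarizes asymmetry only, so a value near zero after the transformation does not rule out long tails on both sides.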

In [None]:
resid.gam = gam(LogBaselineDuration ~ s(C.Precision, k=3) +
                                      s(Cind) +
                                      s(HomophoneSemanticSimilarity, k=3),
                data = homophones)
homophones$ResidualLogBaselineDurationEOL = resid(resid.gam)

In [None]:
dlm_eol.gam = gam(list(LogMeanDuration ~ s(PauseQuotient) +
                                         NounBias +
                                         HomophoneSemanticSimilarity +
                                         C.Precision +
                                         OrthographicRegularity +
                                         ResidualLogBaselineDurationEOL +
                                         s(Cind),
                                       ~ s(Cind)),
                  data=homophones, family="gaulss", method="ML")

In [None]:
summary(dlm_eol.gam)

In [None]:
AIC(dlm_eol.gam, localist.gam)
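
When called with several models, `AIC` returns a data frame with one row per model and the columns `df` and `AIC`, with lower values indicating a better trade-off between fit and complexity. As a sketch of how such a table might be read, the following fits two simple linear models to simulated data. The models `m_x` and `m_z` and the added column `deltaAIC` are hypothetical.

```r
set.seed(5)
n = 200
x = rnorm(n)
z = rnorm(n)
# Only x is related to the response
y = 1 + 0.8 * x + rnorm(n)

m_x = lm(y ~ x)  # model with the relevant predictor
m_z = lm(y ~ z)  # model with an irrelevant predictor

# One row per model; add the difference from the best (lowest) AIC
aic_tab = AIC(m_x, m_z)
aic_tab$deltaAIC = aic_tab$AIC - min(aic_tab$AIC)
print(aic_tab)
```

Such comparisons assume that all models were fitted to the same observations and the same response.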

We refit the model with smooths for the terms known to be linear, for ease of visualization.

In [None]:
dlm_eol.gam0 = gam(list(LogMeanDuration ~ s(PauseQuotient) +
                                          NounBias +
                                          s(HomophoneSemanticSimilarity) +
                                          s(C.Precision) +
                                          s(OrthographicRegularity) +
                                          s(ResidualLogBaselineDurationEOL) +
                                          s(Cind),
                                        ~ s(Cind)),
                   data=homophones, family="gaulss", method="ML")

In [None]:
options(repr.plot.width=8, repr.plot.height=8)
par(mfrow=c(2,2))
ylimit=c(-0.25, 0.2)
plot(dlm_eol.gam0, select =2, scheme=1, scale=0, shade.col="steelblue2", ylab="partial effect (mean)", ylim=ylimit)
abline(h=0, col="indianred")
plot(dlm_eol.gam0, select =3, scheme=1, scale=0, shade.col="steelblue2", ylab="partial effect (mean)", ylim=ylimit)
abline(h=0, col="indianred")
plot(dlm_eol.gam0, select =6, scheme=1, scale=0, shade.col="steelblue2", ylab="partial effect (mean)", ylim=ylimit)
abline(h=0, col="indianred")
plot(dlm_eol.gam0, select =7, scheme=1, scale=0, shade.col="steelblue2", ylab="partial effect (variance)")
abline(h=0, col="indianred")

In [None]:
#pdf("../fig/time_thyme_gam.pdf", he=8, wi=8)
par(mfrow=c(2,2), oma=rep(0,4), mar=c(5,5,1,1))
ylimit=c(-0.25, 0.2)
plot(dlm_eol.gam0, select =2, scheme=1, scale=0, shade.col="steelblue2", ylab="partial effect (mean)", ylim=ylimit)
abline(h=0, col="indianred")
plot(dlm_eol.gam0, select =3, scheme=1, scale=0, shade.col="steelblue2", ylab="partial effect (mean)", ylim=ylimit)
abline(h=0, col="indianred")
plot(dlm_eol.gam0, select =6, scheme=1, scale=0, shade.col="steelblue2", ylab="partial effect (mean)", ylim=ylimit,
    xlab="contextual independence")
abline(h=0, col="indianred")
plot(dlm_eol.gam0, select =7, scheme=1, scale=0, shade.col="steelblue2", ylab="partial effect (variance)",
    xlab="contextual independence")
abline(h=0, col="indianred")
#dev.off()

In [None]:
cor(homophones$Cind, homophones$LogCelexFrequency)

### Analysis with SemanticSupportForForm (instead of C.Precision)

In [None]:
resid.gam = gam(LogBaselineDuration ~ s(LogSemanticSupportForForm, k=3) +
                                      s(Cind) +
                                      s(HomophoneSemanticSimilarity, k=3),
                data = homophones)
homophones$ResidualLogBaselineDurationEOLssf = resid(resid.gam)

In [None]:
dlm_eol.gam2 = gam(list(LogMeanDuration ~ s(PauseQuotient) +
                                          NounBias +
                                          HomophoneSemanticSimilarity +
                                          LogSemanticSupportForForm +
                                          OrthographicRegularity +
                                          ResidualLogBaselineDurationEOLssf +
                                          s(Cind),
                                        ~ s(Cind)),
                   data=homophones, family="gaulss", method="ML")

In [None]:
summary(dlm_eol.gam2)

In [None]:
AIC(dlm_eol.gam, localist.gam, dlm_eol.gam2)

In [None]:
options(repr.plot.width=10, repr.plot.height=10)
plot(dlm_eol.gam2, pages=1, scale=0, scheme=1)

# References

Gahl, S. and Baayen, R. H. (2024). Time and thyme again: Connecting spoken word duration to models of the mental lexicon. Language, accepted for publication.