# Chapter 14.2 Exercise 2 solutions: Spoken word duration of English homophones FIL (statistics)

## Preparations

Load the CSV file with the FIL measures calculated for the training dataset

In [None]:
fil = read.csv("../res/timeAndThyme_FIL_measures.csv")
head(fil)

Load the dataset with the homophones from Gahl and Baayen (2024)

In [None]:
dat = read.table("../dat/time_thyme.txt", header=TRUE)
head(dat)

In [None]:
homophones = merge(dat, fil, by.x="Spelling", by.y="Word")

__Not all homophones have a vector in the Common Crawl fastText embeddings, so some rows are lost in the merge__

In [None]:
nrow(dat) - nrow(homophones)
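
To see which spellings were lost in the merge, rather than only how many, one can compare the key columns directly. The sketch below uses two small made-up data frames: `dat_toy` and `fil_toy` are hypothetical stand-ins for `dat` and `fil`, and the column `SomeMeasure` is invented. Only the key names `Spelling` and `Word` are taken from the merge call above.

```r
# Toy stand-ins for dat and fil (all values invented for illustration only)
dat_toy = data.frame(Spelling = c("time", "thyme", "knight", "night"),
                     LogMeanDuration = c(5.6, 5.8, 5.5, 5.4))
fil_toy = data.frame(Word = c("time", "knight", "night"),
                     SomeMeasure = c(0.1, 0.3, 0.2))

# merge() performs an inner join by default, so unmatched rows are dropped silently
merged_toy = merge(dat_toy, fil_toy, by.x = "Spelling", by.y = "Word")

# Spellings present in dat_toy but without a matching row in fil_toy
missing_toy = setdiff(dat_toy$Spelling, fil_toy$Word)
missing_toy

# Number of rows dropped by the merge
nrow(dat_toy) - nrow(merged_toy)
```

Applied to the real data, the same `setdiff()` call on `dat$Spelling` and `fil$Word` would list the homophones without FIL measures.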

Load the GAM library

In [None]:
library(mgcv)

Prepare predictor variables

In [None]:
homophones$NounBias = factor(homophones$NounBias)
homophones$LogMeanBigramProbability = log(homophones$MeanBigramProbability)

Just as for the EOL-based GAMs, we use a baseline duration measure from which other predictors have been partialled out. We do this both for a model with localist variables and for a model with DLM variables. We first reconstruct the localist model.

In [None]:
homophones$ResidualLogBaselineDuration =
   resid(gam(LogBaselineDuration ~ s(LogPronunciationFrequency, k=3) +
                                   s(PhonologicalNeighborhoodDensity) +
                                   s(LogMeanBigramProbability, k=5),
             data=homophones)
   )
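
The logic of partialling out can be illustrated on simulated data. In this minimal sketch the variables `x`, `z`, and `y` are made up: `y` depends on both `x` and `z`, and the residuals of a GAM with a smooth of `x` keep the part of `y` that is related to `z` while no longer being correlated with `x`.

```r
library(mgcv)

set.seed(1)
n = 300
x = runif(n)
z = runif(n)
# Simulated response: nonlinear in x, linear in z, plus noise
y = sin(2 * pi * x) + 2 * z + rnorm(n, sd = 0.2)

# Partial out the (nonlinear) effect of x
y_resid = resid(gam(y ~ s(x)))

# The residuals are essentially uncorrelated with x ...
cor(y_resid, x)
# ... but still carry the effect of z
cor(y_resid, z)
```

The residualized baseline duration above plays the role of `y_resid`: it can enter a later model alongside the predictors that were partialled out without duplicating their contribution.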

## GAMs

### Localist GAM

In [None]:
localist.gam = gam(list(LogMeanDuration ~ s(PauseQuotient) +
                                          NounBias +
                                          s(PhonologicalNeighborhoodDensity) +
                                          s(OrthographicRegularity) +
                                          te(LogCelexFrequency, LogRelativeFrequency) +
                                          s(ResidualLogBaselineDuration),
                                        ~ s(LogCelexFrequency)),
                   data=homophones, family="gaulss", method="ML")
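
With the `gaulss` family, the first formula in the list describes the mean and the second, one-sided formula describes the scale of the response. The following sketch on simulated data (the variables `x` and `y` are invented) shows how the two parts can be inspected. It relies on the documented behaviour of `gaulss` that `fitted()` returns two columns, the mean and the inverse of the standard deviation.

```r
library(mgcv)

set.seed(2)
n = 500
x = runif(n)
# Simulated data in which both the mean and the standard deviation grow with x
y = 1 + 2 * x + rnorm(n, sd = 0.2 + 0.8 * x)

# First formula: mean; second (one-sided) formula: scale
m = gam(list(y ~ s(x), ~ s(x)), family = gaulss(), method = "ML")

# Column 1: fitted mean; column 2: inverse of the fitted standard deviation
fit = fitted(m)
sd_hat = 1 / fit[, 2]

# The estimated standard deviation should increase with x
cor(x, sd_hat, method = "spearman")
```

In the localist model above, the second formula therefore lets the variability of the log durations change with `LogCelexFrequency`.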

In [None]:
summary(localist.gam)

In [None]:
AIC(localist.gam)

In [None]:
plot(localist.gam, pages=1, scale=0)

### GAM with DLM predictors

SemanticSupportForForm and C.Precision are strongly correlated, even more strongly than when EOL is used:

In [None]:
cor(homophones$SemanticSupportForForm, homophones$C.Precision, method="spearman")

The two distributions are also similar:

In [None]:
options(repr.plot.width=10, repr.plot.height=5)
par(mfrow=c(1,2))
plot(density(homophones$SemanticSupportForForm))
plot(density(homophones$C.Precision))

A logarithmic transformation is not optimal:

In [None]:
homophones$LogSemanticSupportForForm = log(homophones$SemanticSupportForForm+0.2)

In [None]:
options(repr.plot.width=5, repr.plot.height=5)
plot(density(homophones$LogSemanticSupportForForm))

A square root transformation performs somewhat better.

In [None]:
homophones$SqrtSemanticSupportForForm = sqrt(homophones$SemanticSupportForForm+0.2)
homophones$SqrtCPrecision = sqrt(homophones$C.Precision+0.2)

In [None]:
options(repr.plot.width=10, repr.plot.height=5)
par(mfrow=c(1,2))
plot(density(homophones$SqrtSemanticSupportForForm))
plot(density(homophones$SqrtCPrecision))
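
The visual comparison of densities can be complemented with a numerical summary of asymmetry. The sketch below defines a small helper, `skewness()`, which is not part of base R and is written here only for illustration, and applies it to a simulated right-skewed variable `x_toy` (hypothetical, not the homophone data). The offset of 0.2 mirrors the one used above.

```r
# Sample skewness: third central moment scaled by the cubed standard deviation
# (0 for perfectly symmetric data, positive for a long right tail)
skewness = function(v) {
  d = v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(3)
# Simulated right-skewed predictor (invented for illustration)
x_toy = rexp(1000, rate = 5)

# Skewness of the raw variable and of the two candidate transformations
skews = c(raw  = skewness(x_toy),
          log  = skewness(log(x_toy + 0.2)),
          sqrt = skewness(sqrt(x_toy + 0.2)))
round(skews, 2)
```

The same helper could be applied to `homophones$SemanticSupportForForm` and its transformed versions; the transformation with skewness closest to zero yields the most symmetric distribution.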

In [None]:
resid.gam = gam(LogBaselineDuration ~ #s(SqrtSemanticSupportForForm, k=3) +   ## n.s.
                                      s(Cind) +
                                      s(HomophoneSemanticSimilarity, k=3),
                data = homophones)
homophones$ResidualLogBaselineDurationFIL = resid(resid.gam)

The effect of semantic support for form receives little support in this model:

In [None]:
dlm_fil.gam = gam(list(LogMeanDuration ~ s(PauseQuotient) +
                                         NounBias +
                                         s(HomophoneSemanticSimilarity) +
                                         s(SqrtSemanticSupportForForm) +
                                         s(OrthographicRegularity) +
                                         s(ResidualLogBaselineDurationFIL) +
                                         s(Cind),
                                       ~ s(Cind)),
                  data=homophones, family="gaulss", method="ML")

In [None]:
summary(dlm_fil.gam)

In [None]:
AIC(dlm_fil.gam, localist.gam)

In [None]:
options(repr.plot.width=10, repr.plot.height=10)
plot(dlm_fil.gam, pages=1, scale=0, scheme=1)

The partial effect plot suggests a positive trend where the data are dense. We therefore restrict the range of SemanticSupportForForm to the (0, 0.5) interval, removing the 43 outliers for which this measure is unlikely to be informative.

In [None]:
homophones2 = homophones[homophones$SemanticSupportForForm > 0 & homophones$SemanticSupportForForm < 0.5,]
nrow(homophones)-nrow(homophones2)

In [None]:
# proportion of observations removed (43 data points)
(nrow(homophones) - nrow(homophones2)) / nrow(homophones)

With the restricted range, the GAM no longer needs a transformed predictor.

In [None]:
dlm_fil.gam2 = gam(list(LogMeanDuration ~ s(PauseQuotient) +
                                          NounBias +
                                          s(HomophoneSemanticSimilarity) +
                                          s(SemanticSupportForForm) +
                                          s(OrthographicRegularity) +
                                          s(ResidualLogBaselineDurationFIL) +
                                          s(Cind),
                                        ~ s(Cind)),
                   data=homophones2, family="gaulss", method="ML")
summary(dlm_fil.gam2)

In [None]:
options(repr.plot.width=5, repr.plot.height=5)
plot(dlm_fil.gam2, select=3, scale=0, scheme=1, shade.col="steelblue2")
abline(h=0, col="indianred")

With FIL, the SemanticSupportForForm measure is so strongly correlated with frequency that it becomes difficult to see its effect in a regression model that includes a frequency measure.
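
One way to probe this kind of overlap between predictors is a rank correlation together with mgcv's `concurvity()` function. The sketch below uses simulated data: `freq_toy` and `support_toy` are hypothetical stand-ins for a frequency measure and a semantic support measure, constructed to be nearly collinear. It is not a reanalysis of the homophone data.

```r
library(mgcv)

set.seed(4)
n = 400
freq_toy = rnorm(n)                          # hypothetical frequency measure
support_toy = freq_toy + rnorm(n, sd = 0.1)  # nearly collinear with freq_toy
y = freq_toy + rnorm(n, sd = 0.5)            # response driven by frequency only

m = gam(y ~ s(freq_toy) + s(support_toy), method = "ML")

# Rank correlation between the two predictors
cor(freq_toy, support_toy, method = "spearman")

# Concurvity indices lie between 0 and 1; values near 1 indicate that a smooth
# can be approximated almost perfectly by the other terms in the model
con = concurvity(m, full = TRUE)
round(con, 2)
```

When the indices for a term are close to 1, its partial effect is hard to separate from that of the competing predictor, which is the situation described above.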

# References

Gahl, S. and Baayen, R. H. (2024). Time and thyme again: Connecting spoken word duration to models of the mental lexicon. Language. Accepted for publication.