# Chapter 13: JudiLingMeasures (R analysis)

If necessary, install the mgcv package, then load it.

In [None]:
#install.packages("mgcv")
library(mgcv)

Load the measures data calculated in the previous notebook.

In [None]:
all_measures = read.table("../res/dlp_measures.csv", sep=",", header=TRUE)

In [None]:
head(all_measures)

Load the DLP dataset (Keuleers et al., 2010), which includes information such as reaction times and accuracy. If you haven't downloaded it before, you can get it from [here](https://osf.io/uw7t6/).
Some spellings occur more than once in the item list; for these, remove the nonword entries.

In [None]:
dlp.items = read.table("../dat/dlp-items.txt", header=TRUE)
# Spellings that occur more than once
dup_spellings = dlp.items$spelling[duplicated(dlp.items$spelling)]
# Drop nonword rows ("N") with a duplicated spelling; a logical mask stays
# correct even when no row matches (negative indexing would then drop all rows)
is_dup_nonword = dlp.items$spelling %in% dup_spellings & dlp.items$lexicality %in% "N"
dlp.items = dlp.items[!is_dup_nonword,]

In [None]:
head(dlp.items)
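
The duplicate-removal logic can be tried out on a small table before running it on the full item list. The sketch below uses a made-up data frame (`toy_items` and its contents are hypothetical, not taken from the DLP) and a logical mask to drop nonword rows whose spelling also occurs elsewhere.

```r
# Toy item table (hypothetical data): "bank" occurs as a word and as a nonword
toy_items <- data.frame(
  spelling   = c("bank", "bank", "huis", "blorp"),
  lexicality = c("W", "N", "W", "N"),
  stringsAsFactors = FALSE
)

# Spellings that occur more than once
dup <- toy_items$spelling[duplicated(toy_items$spelling)]

# Flag nonword rows with a duplicated spelling; %in% treats NA as "no match"
drop_row <- toy_items$spelling %in% dup & toy_items$lexicality %in% "N"
toy_clean <- toy_items[!drop_row, ]

print(toy_clean)
```

The nonword "blorp" is kept because its spelling is unique; only the nonword copy of "bank" is removed.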

Merge the DLP reaction times and accuracy into the measures dataset, matching rows on spelling.

In [None]:
dlp = merge(all_measures, dlp.items[,c("spelling", "rt", "accuracy")], by="spelling")

In [None]:
head(dlp)
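
By default, `merge` performs an inner join, so spellings that appear in only one of the two tables are silently dropped. A minimal sketch with hypothetical toy tables (all names and values below are invented for illustration) shows how to check which items were lost:

```r
# Hypothetical toy tables standing in for the measures and the DLP items
toy_measures <- data.frame(spelling = c("huis", "boom", "kat"),
                           SemanticDensity = c(0.4, 0.7, 0.2))
toy_rts <- data.frame(spelling = c("huis", "kat", "vis"),
                      rt = c(520, 610, 580),
                      accuracy = c(0.98, 0.95, 0.90))

# Inner join on spelling: only spellings present in both tables survive
toy_merged <- merge(toy_measures, toy_rts, by = "spelling")

# Spellings from the measures table that found no match
dropped <- setdiff(toy_measures$spelling, toy_merged$spelling)
print(toy_merged)
print(dropped)
```

Comparing `nrow(all_measures)` with `nrow(dlp)` after the real merge gives the same kind of sanity check.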

Remove rows where RT is missing or infinite, or where word frequency is missing.

In [None]:
dlp = dlp[!is.na(dlp$rt),]
dlp = dlp[!is.infinite(dlp$rt),]
dlp = dlp[!is.na(dlp$celex.frequency),]
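
As an alternative formulation, base R's `is.finite` returns `FALSE` for `NA`, `NaN`, `Inf` and `-Inf` alike, so the two RT filters can be expressed as one condition. The sketch below uses a hypothetical toy data frame to illustrate this:

```r
# Hypothetical toy data with the kinds of problem values being filtered out
toy <- data.frame(rt = c(512, NA, Inf, NaN, 640),
                  celex.frequency = c(10, 3, 7, 1, NA))

# is.finite() is FALSE for NA, NaN and +/-Inf, so one test covers all RT cases
keep <- is.finite(toy$rt) & !is.na(toy$celex.frequency)
toy_kept <- toy[keep, ]

print(toy_kept)
```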

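Transform reaction times (negative inverse, -1000/RT) and word frequency (log, with a small offset of 0.002 so that zero frequencies stay finite). Add a column with the number of letters in each word.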

In [None]:
dlp$RTinv = -1000/dlp$rt
dlp$celex.frequency.log = log(as.numeric(dlp$celex.frequency) + 0.002)
dlp$nletters = nchar(dlp$spelling)
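
To see what these transformations do, here is a small sketch on invented values (the vectors `rt_ms` and `freq` are hypothetical). The negative inverse compresses long reaction times while keeping the ordering, so slower responses still get larger values; the offset keeps the log finite for a frequency of 0.

```r
# Hypothetical reaction times in milliseconds
rt_ms <- c(400, 500, 800, 1600)
rt_inv <- -1000 / rt_ms
print(rt_inv)

# Hypothetical frequency counts, including a zero
freq <- c(0, 1, 100)
freq_log <- log(freq + 0.002)
print(freq_log)
```

Note how the gap between 800 and 1600 ms shrinks to the same size as the gap between 400 and 800 ms after the transformation.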

Log-transform and standardise (z-score) Coltheart's N for all words for which Coltheart's N is > 0; words with N = 0 keep the value 0. Add a new column indicating whether Coltheart's N is 0 or > 0.

In [None]:
dlp$coltheart.N.log = as.numeric(dlp$coltheart.N)
dlp$has_neighbour = ifelse(dlp$coltheart.N > 0, 1, 0)
dlp$has_neighbour_fac = as.factor(dlp$has_neighbour)
dlp[dlp$has_neighbour == 1, "coltheart.N.log"] = as.numeric(scale(log(dlp$coltheart.N[dlp$has_neighbour==1])))
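
The same subset-wise transformation can be traced on a short hypothetical vector (`toy_n` is invented for illustration). Only the entries with at least one neighbour are logged and z-scored; the zeros are left untouched.

```r
# Hypothetical neighbourhood counts, two of them zero
toy_n <- c(0, 0, 1, 4, 12, 30)
has_neighbour <- ifelse(toy_n > 0, 1, 0)

# Start from the raw values, then overwrite the N > 0 subset with z-scored logs
n_log <- as.numeric(toy_n)
n_log[has_neighbour == 1] <- as.numeric(scale(log(toy_n[has_neighbour == 1])))

print(round(n_log, 3))
```

Within the N > 0 subset the result has mean 0 and standard deviation 1, which is why the indicator column is needed to tell a word without neighbours apart from a word with an average number of neighbours.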

Install and load packages for plotting.

In [None]:
#install.packages("ggplot2")
#install.packages("GGally")
library(ggplot2)
library(GGally)

Plot the distributions of and correlations between classical predictors of lexical decision reaction times.

In [None]:
options(repr.plot.width=5, repr.plot.height=5)
#pdf("../../fig/dlp.dist.classical.pdf", he=5, wi=5)
ggpairs(dlp[, c("RTinv", "celex.frequency.log", "coltheart.N.log", "nletters")])
#dev.off()

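Fit a GAM predicting inverse-transformed RT from word frequency, Coltheart's N and word length. The smooth for Coltheart's N is fitted separately for words with and without neighbours (`by=has_neighbour_fac`), and the factor itself is entered as a main effect.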

In [None]:
gam.classical = gam(RTinv ~ s(celex.frequency.log) + 
                                s(coltheart.N.log, by=has_neighbour_fac) + 
                                has_neighbour_fac  + 
                                s(nletters), 
                    data=dlp)
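
To illustrate what a factor `by` variable does in mgcv, the following sketch fits the same kind of model to simulated data. Everything here (variable names, effect shapes, sample size) is an assumption made for illustration; unlike in the DLP data, the predictor varies within both groups.

```r
library(mgcv)

set.seed(1)
n <- 400
grp <- factor(sample(c("0", "1"), n, replace = TRUE))
x <- runif(n)

# Simulated response: x only has an effect in group "1",
# and group "1" is also shifted upwards by 0.5
y <- ifelse(grp == "1", sin(2 * pi * x), 0) + 0.5 * (grp == "1") +
  rnorm(n, sd = 0.3)
toy <- data.frame(y = y, x = x, grp = grp)

# One smooth of x per factor level, plus the factor as a main effect
toy_gam <- gam(y ~ s(x, by = grp) + grp, data = toy)

# The fitted object holds one smooth per level of grp
smooth_labels <- sapply(toy_gam$smooth, function(s) s$label)
print(smooth_labels)
```

The main effect of the factor is included because smooths with a factor `by` variable are centred, so differences in the group means have to be modelled separately.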

Inspect model summary and plot.

In [None]:
summary(gam.classical)

In [None]:
options(repr.plot.width=15, repr.plot.height=4)
#pdf("../fig/dlp.gam.classical.pdf", he=4, wi=15)
par(mfrow=c(1,4), mar=c(5.1, 5.1, 4.1, 2.1))
plot(gam.classical, scale=F, rug=T, scheme=1, shade.col="steelblue2", ylab="RTinv", cex.lab=2.5, cex.axis=2)
#dev.off()

In [None]:
options(repr.plot.width=11.5, repr.plot.height=4)
pdf("../fig/dlp.gam.classical_bw.pdf", height=4, width=11.5)
par(mfrow=c(1,3), mar=c(5.1, 5.1, 4.1, 2.1))
plot(gam.classical, scale=F, rug=T, scheme=1, select=1, ylab="RTinv", cex.lab=2.5, cex.axis=2)
plot(gam.classical, scale=F, rug=T, scheme=1, select=3, ylab="RTinv", cex.lab=2.5, cex.axis=2)
plot(gam.classical, scale=F, rug=T, scheme=1, select=4, ylab="RTinv", cex.lab=2.5, cex.axis=2)
dev.off()

Plot the distributions of and correlations between reaction times and two DLM-based measures.

In [None]:
options(repr.plot.width=5, repr.plot.height=5)
#pdf("../../fig/dlp.dist.measures.pdf", he=5, wi=5)
dlp$SemanticSupportForForm.log = log(dlp$SemanticSupportForForm + 1)
ggpairs(dlp[, c("RTinv", "SemanticDensity", "SemanticSupportForForm.log")])
#dev.off()

GAM of RT predicted by word frequency, semantic density and semantic support for form.

In [None]:
gam.measures = gam(RTinv ~ s(celex.frequency.log) + 
                               s(SemanticDensity) + 
                               s(SemanticSupportForForm.log), 
                   data=dlp)

Inspect model summary and plot.

In [None]:
summary(gam.measures)

In [None]:
options(repr.plot.width=15, repr.plot.height=5)
#pdf("../../fig/dlp.gam.measures.pdf", he=5, wi=15)
par(mfrow=c(1,3), mar=c(5.1, 5.1, 4.1, 2.1))
plot(gam.measures, scale=F, rug=T, scheme=1, shade.col="steelblue2", ylab="RTinv", cex.lab=2.5, cex.axis=2)
#dev.off()

Inspect concurvity (collinearity in non-linear models) of the DLM-based model:

In [None]:
concurvity(gam.measures)

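Concurvity is fine. (The indices range from 0, indicating no problem, to 1, indicating that a smooth can be fully reproduced from the other terms in the model.)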
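
For comparison, this is what problematic concurvity looks like. The sketch simulates data (all variables are hypothetical) in which `x2` is almost an exact smooth function of `x1`, while `x3` is independent of both.

```r
library(mgcv)

set.seed(2)
n <- 300
x1 <- runif(n)
x2 <- x1^2 + rnorm(n, sd = 0.02)  # nearly a smooth function of x1
x3 <- runif(n)                    # unrelated to x1 and x2
y <- sin(2 * pi * x1) + x3 + rnorm(n, sd = 0.3)

toy_gam <- gam(y ~ s(x1) + s(x2) + s(x3))

# Rows: worst, observed, estimate; columns: parametric part and each smooth
conc <- concurvity(toy_gam, full = TRUE)
print(round(conc, 2))
```

In such a setup the indices for `s(x1)` and `s(x2)` should come out far higher than those for `s(x3)`.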

Calculate AIC of the classical and DLM-based measures GAMs.

In [None]:
AIC(gam.classical)

In [None]:
AIC(gam.measures)

In [None]:
AIC(gam.classical) - AIC(gam.measures)

The AIC of the DLM-based measures GAM is clearly lower than that of the classical GAM. Thus, the DLM-based measures GAM is much more likely to have produced the observed data.
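
One common way to put a number on "much more likely" is the relative likelihood exp(-ΔAIC/2), where ΔAIC is the difference between a model's AIC and the lowest AIC in the set. The sketch below uses made-up AIC values purely for illustration; they are not the values obtained from the models above.

```r
# Hypothetical AIC values, NOT the results of the models in this notebook
aic_classical <- 1010
aic_measures  <- 1000

# Difference relative to the better (lower-AIC) model
delta <- aic_classical - aic_measures

# Relative likelihood of the higher-AIC model compared to the lower-AIC one
rel_lik <- exp(-delta / 2)
print(rel_lik)
```

With a difference of 10 AIC units, the higher-AIC model is less than 0.01 times as likely as the lower-AIC model to be the better one. For the actual models, replace the two hypothetical values with `AIC(gam.classical)` and `AIC(gam.measures)`.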

# References

Keuleers, E., Diependaele, K., and Brysbaert, M. (2010). Practice effects in large-scale visual word recognition studies: A lexical decision study on 14,000 Dutch mono- and disyllabic words and nonwords. Frontiers in Psychology, 1:174.