# Chapter 15.2: Frequency effects and boundary effects as litmus tests?

Loading libraries

In [None]:
using CSV, RCall, JudiLing, JudiLingMeasures, DataFrames

We first implement some Julia code to obtain the measures and statistics that we will then further scrutinize in R.

If you haven't done so before, download the BLP data (blp-items.txt.zip and blp-stimuli.txt.zip) from [here](https://osf.io/b5sdk/), store the files in `dat`, and unzip them. Next, we load and merge the two dataframes:

In [None]:
items = JudiLing.load_dataset("../dat/blp-items.txt", delim="\t")
words = items[items.lexicality .== "W",:]
stimuli = JudiLing.load_dataset("../dat/blp-stimuli.txt", delim="\t")

# merge the two dataframes
english = leftjoin(words, stimuli, on = "spelling")

# only keep relevant columns
english = english[:, ["spelling", "subtlex.frequency", "coltheart.N", "nletters", "rt", "morphology"]]
english = english[english.rt .!= "NA",:]
english.frequency = english."subtlex.frequency" .+ 1;

Write out the data to provide the file that is used in the book.

In [None]:
CSV.write("../dat/words_dualroutes.csv", english)

Download the fastText embeddings from [here](https://fasttext.cc/docs/en/crawl-vectors.html) (the `text` embeddings for English), unzip them, and store them in `dat`. Then:

In [None]:
english, S = JudiLing.load_S_matrix_from_fasttext_file(english, "../dat/cc.en.300.vec", target_col=:spelling);

Create the cue object:

In [None]:
cue_obj = JudiLing.make_cue_matrix(english, grams=3, 
                                   target_col=:spelling, tokenized=false);

Comprehension and production mapping:

In [None]:
F = JudiLing.make_transform_matrix(cue_obj.C, S, english.frequency);
Shat = cue_obj.C * F;
G = JudiLing.make_transform_matrix(S, cue_obj.C, english.frequency);   
Chat = S * G;

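To get a feel for what the frequency argument plausibly does, here is a toy R sketch that treats frequency as a per-word weight in a least-squares fit. This is an assumption about the idea behind passing `english.frequency` to `make_transform_matrix`, not a copy of JudiLing's implementation; the matrices `C_toy` and `S_toy`, the frequencies, and the helper `fit_mapping` are all made up for illustration. Within this notebook the code could be run inside an `R"""..."""` cell.

```r
# Toy data (hypothetical): two words sharing a single cue, one semantic dimension.
C_toy = matrix(c(1, 1), ncol = 1)   # rows = words, columns = cues
S_toy = matrix(c(0, 1), ncol = 1)   # rows = words, columns = semantic dimensions
freq  = c(1, 99)                    # the second word is far more frequent

# Least-squares mapping from X to Y with row weights sqrt(w / sum(w)).
# Assumption: frequency acts as a row weight; JudiLing may normalize differently.
fit_mapping = function(X, Y, w = rep(1, nrow(X))) {
  sw = sqrt(w / sum(w))
  Xw = sw * X   # multiplies row i of X by sw[i]
  Yw = sw * Y
  solve(crossprod(Xw), crossprod(Xw, Yw))
}

F_plain = fit_mapping(C_toy, S_toy)        # every word counts equally
F_freq  = fit_mapping(C_toy, S_toy, freq)  # frequent words count more

# Absolute prediction error per word under each mapping
err_plain = abs(C_toy %*% F_plain - S_toy)
err_freq  = abs(C_toy %*% F_freq  - S_toy)
cbind(err_plain, err_freq)
```

In the weighted fit, the frequent word is predicted almost perfectly at the expense of the rare word, which is the kind of trade-off one would expect from a frequency-informed mapping.
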
Generate measures:

In [None]:
mes = JudiLingMeasures.compute_all_measures_train(english, 
       cue_obj, Chat, S, Shat, F, G, low_cost_measures_only=true);

We need the by-trigram semantic supports.

In [None]:
tri_sup = JudiLingMeasures.semantic_support_for_form(cue_obj, Chat, sum_supports=false);

In [None]:
@rput tri_sup;

In [None]:
R"""
head(tri_sup)
"""

In [None]:
@rput mes;

In [None]:
R"""
mes$LogSSF = log(unlist(mes$SemanticSupportForForm)+0.8)  # back-off from negative numbers and zero
mes$LogL1Chat = log(mes$L1Chat)
mes$logSubFreq = log(mes$subtlex.frequency+1)
mes$RTinv = -1000/as.numeric(mes$rt)
mes$origOrder = 1:nrow(mes)
colnames(mes)
"""

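The transformations in the preceding cell can be checked on a few made-up values. The sketch below is plain R (runnable inside an `R"""..."""` cell); the reaction times and support values are hypothetical, and only the constants `0.8` and `-1000` come from the notebook.

```r
# Hypothetical values, for illustration only
rt  = c("500", "625", "1000")   # reaction times stored as strings, as in mes$rt
ssf = c(-0.5, 0, 2.2)           # semantic support values, some of them <= 0

# Negative inverse RT: larger values still correspond to slower responses
RTinv = -1000 / as.numeric(rt)
RTinv                           # -2.0 -1.6 -1.0

# log() is not finite at zero, hence the shift by 0.8 before taking logs
log(ssf[2])                     # -Inf
LogSSF = log(ssf + 0.8)
all(is.finite(LogSSF))          # TRUE
```

Note that the shift only helps for values greater than -0.8; anything at or below that would still produce a non-finite log.
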
## 1. Parallel dual routes?

Predicting RTinv (-1000/RT) using DLM measures, compared with classical measures:

In [None]:
R"""
library(mgcv)
mes.gam1 = bam(RTinv ~ s(logSubFreq) + s(nletters, k=4) + s(coltheart.N, k=4),
               data=mes)
mes.gam2 = bam(RTinv ~ s(TargetCorrelation) + s(nletters, k=4) + s(LogL1Chat),
               data=mes)
summary(mes.gam2)
"""

In [None]:
R"""
par(mfrow=c(1,3))
for (i in 1:3) {
   plot(mes.gam2, select=i, scheme=1, shade.col="steelblue2")
   abline(h=0, col="indianred")
}
"""

The model with log frequency is better (one would need the contextual independence measure and/or FIDLL to render log frequency superfluous):

In [None]:
R"""
AIC(mes.gam1, mes.gam2)
"""

The mess of morphological types:

In [None]:
R"""
library(lattice)
tab = table(mes$morphology)
dotplot(sort(tab[tab>10]))
"""

We add in family frequency, family size, and stem frequency, for those words where there is a decent morphological structure (morphology=complex), and we exclude inflectional variants.

In [None]:
R"""
family_stats = read.csv("../dat/family_stats.csv", header=TRUE)
mes_fam = merge(mes, family_stats, by.x="spelling", by.y="Word")
f = 1.646 # scaling factor for difference in corpus size
mes_fam$logStemFreq = log((mes_fam$StemFreq*f)+1)
mes_fam$MinTC = 1-mes_fam$TargetCorrelation
""";

In [None]:
R"""
dim(mes_fam)
"""

GAMs comparing predictors' partial effects for RTinv and 1-TargetCorrelation.

In [None]:
R"""
mes_fam.gam1 = gam(RTinv ~ s(logSubFreq) + s(nletters, k=4) + s(logStemFreq), 
                           data = mes_fam)
mes_fam.gam2 = gam(MinTC ~ s(logSubFreq) + s(nletters, k=4) + s(logStemFreq), 
                           data = mes_fam)
""";

Visualization:

In [None]:
R"""
par(mfrow=c(2,3))
plot(mes_fam.gam1, select=1, scale=0,
  scheme=1, shade.col="steelblue2",
  xlab="log word frequency", ylab="partial effect RTinv")
abline(h=0, col="indianred")
plot(mes_fam.gam1, select=3, scale=0,
  scheme=1, shade.col="steelblue2",
  xlab="log stem frequency", ylab="partial effect RTinv")
abline(h=0, col="indianred")
plot(mes_fam.gam1, select=2, scale=0,
  scheme=1, shade.col="steelblue2",
  xlab="length", ylab="partial effect RTinv")
abline(h=0, col="indianred")
plot(mes_fam.gam2, select=1, scale=0,
  scheme=1, shade.col="steelblue2",
  xlab="log word frequency", ylab="partial effect 1-TargetCorrelation")
abline(h=0, col="indianred")
plot(mes_fam.gam2, select=3, scale=0,
  scheme=1, shade.col="steelblue2",
  xlab="log stem frequency", ylab="partial effect 1-TargetCorrelation")
abline(h=0, col="indianred")
plot(mes_fam.gam2, select=2, scale=0,
  scheme=1, shade.col="steelblue2",
  xlab="length", ylab="partial effect 1-TargetCorrelation")
abline(h=0, col="indianred")
""";

## 2. DLM correlates of word frequency and stem frequency

We add information about stem frequency for the complex words, resulting in a substantially reduced subset of words.

In [None]:
R"""
family_stats = read.csv("../dat/family_stats.csv", header=TRUE)
mes_fam = merge(mes, family_stats, by.x="spelling", by.y="Word")
f = 1.646 # scaling factor for difference in corpus sizes
mes_fam$logStemFreq = log((mes_fam$StemFreq*f)+1)
mes_fam$MinTC = 1-mes_fam$TargetCorrelation
c(nrow(mes)-nrow(mes_fam), nrow(mes_fam))
"""

We are interested in how word frequency and stem frequency relate to Target Correlation and Semantic Support for Form, with word length as control.

In [None]:
R"""
msf = gam(logSubFreq ~ s(TargetCorrelation) + s(LogSSF) + s(nletters), data = mes_fam)
mbf = gam(logStemFreq ~ s(TargetCorrelation) + s(LogSSF) + s(nletters), data = mes_fam)
""";

In [None]:
R"""
par(mfrow=c(2,2), oma=rep(0,4), mar=c(5,5,2,1))
plot(msf, select=1, scheme=1, shade.col="steelblue2", ylab="partial effect", main="form frequency") 
abline(h=0, col="indianred")
plot(msf, select=2, scheme=1, shade.col="steelblue2", ylab="partial effect", main="form frequency") 
abline(h=0, col="indianred")
plot(mbf, select=1, scheme=1, shade.col="steelblue2", ylab="partial effect", main="stem frequency") 
abline(h=0, col="indianred")
plot(mbf, select=2, scheme=1, shade.col="steelblue2", ylab="partial effect", main="stem frequency") 
abline(h=0, col="indianred")
""";

In [None]:
R"""
pdf("../fig/frequencies_and_DLM_measures.pdf", height=6, width=6)
par(mfrow=c(2,2), oma=rep(0,4), mar=c(5,5,2,1))
plot(msf, select=1, scheme=1, shade.col="steelblue2", ylab="partial effect", main="form frequency") 
abline(h=0, col="indianred")
plot(msf, select=2, scheme=1, shade.col="steelblue2", ylab="partial effect", main="form frequency") 
abline(h=0, col="indianred")
plot(mbf, select=1, scheme=1, shade.col="steelblue2", ylab="partial effect", main="stem frequency") 
abline(h=0, col="indianred")
plot(mbf, select=2, scheme=1, shade.col="steelblue2", ylab="partial effect", main="stem frequency") 
abline(h=0, col="indianred")
dev.off()
""";

Effects reported in neuroscience studies from the Marantz school and attributed to word frequency and stem frequency are likely confounded with Target Correlation and Semantic Support for Form.

## 3. Reduced semantic support for syllable boundaries and morpheme boundaries

For the majority of the words in `mes`, information on syllable and morpheme structure is available (taken or computed from the CELEX database):

In [None]:
R"""
sy_mo_boundaries = read.csv("../dat/syllable_morpheme_boundaries.csv", header=TRUE)
head(sy_mo_boundaries)
"""

We merge this information into `mes`:

In [None]:
R"""
mes2 = merge(mes, sy_mo_boundaries[,-1], by = "spelling")
nrow(mes)-nrow(mes2)
"""

In [None]:
R"""
mes2 = mes2[order(mes2$origOrder),]
head(mes2)
"""

We need an R function that generates trigrams.

In [None]:
R"""
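# Return the letter trigrams of s, joined by "_"; "#" marks the word edges
ngram = function(s) {
  s = paste0("#", s, "#")
  letter = unlist(strsplit(s, ""))
  len = length(letter)
  trigrams = NULL
  # seq_len() gives an empty loop for an empty string, where 1:(len - 2) would count down
  for (i in seq_len(len - 2)) {
      trigrams = c(trigrams, paste(letter[i:(i + 2)], collapse = ""))
  }
  return(paste(trigrams, collapse = "_"))
}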
""";

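As a quick check of the trigram format, the following plain-R sketch rebuilds the same steps under a different, hypothetical name (`toy_trigrams`, chosen so that it does not overwrite `ngram`) and applies them to a made-up word.

```r
# Same steps as in ngram(): pad with "#", slide a window of three letters,
# join the windows with "_". The function name is hypothetical.
toy_trigrams = function(s) {
  letter = strsplit(paste0("#", s, "#"), "")[[1]]
  windows = character(0)
  for (i in seq_len(length(letter) - 2)) {
    windows = c(windows, paste(letter[i:(i + 2)], collapse = ""))
  }
  paste(windows, collapse = "_")
}

toy_trigrams("cat")                       # "#ca_cat_at#"
strsplit(toy_trigrams("cat"), "_")[[1]]   # "#ca" "cat" "at#"
```

Splitting the result on `_` recovers the individual trigrams, which is how the function is used in the loops further down.
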
In [None]:
R"""
tri_sup2 = tri_sup[mes2$origOrder]
head(tri_sup2)
"""

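Why the detour via `origOrder`? `merge()` performs an inner join and, by default, sorts the result by the key, so the rows of the merged data frame no longer line up with the entries of the list of trigram supports. A toy sketch with made-up data (all names ending in `_toy` are hypothetical) shows how restoring the original order and indexing the list by `origOrder` brings the two back in step:

```r
# Hypothetical stand-ins for mes, the boundary table, and tri_sup
mes_toy = data.frame(spelling = c("walker", "cat", "unkind"),
                     origOrder = 1:3, stringsAsFactors = FALSE)
bounds_toy = data.frame(spelling = c("unkind", "walker"),
                        Morph = c("un+kind", "walk+er"), stringsAsFactors = FALSE)
sup_toy = list(c(0.1, 0.2), c(0.3), c(0.4, 0.5))   # one vector per row of mes_toy

# Inner join: "cat" is dropped, and the rows come back sorted by spelling
merged_toy = merge(mes_toy, bounds_toy, by = "spelling")
merged_toy$spelling                                  # "unkind" "walker"

# Restore the original row order, then pick the matching list entries
merged_toy = merged_toy[order(merged_toy$origOrder), ]
merged_toy$spelling                                  # "walker" "unkind"
sup_sub_toy = sup_toy[merged_toy$origOrder]          # entries 1 and 3
```

After these two steps, the i-th row of the merged data frame and the i-th element of the subsetted list refer to the same word.
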
### 3.1 Morphological boundaries

We now extract the semantic support for all trigrams that span a morphological boundary, and also for all trigrams that do not do so.

In [None]:
R"""
without_boundary_list = list()
with_boundary_list = list()

for (i in 1:nrow(mes2)) {
  trigrams1 = strsplit(ngram(mes2$spelling[i]), "_")[[1]]
  trigrams2 = strsplit(ngram(mes2$Morph[i]), "_")[[1]]
  without_boundary = which(trigrams1 %in% trigrams2)
  with_boundary = which(!is.element(trigrams1, trigrams2))
  without_boundary_list[[i]] = tri_sup2[[i]][without_boundary]
  with_boundary_list[[i]] = tri_sup2[[i]][with_boundary]
}

v_without_boundary = unlist(without_boundary_list)
v_with_boundary = unlist(with_boundary_list)
""";

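The classification logic of the loop can be checked on a single toy word. The sketch assumes that the `Morph` column holds the spelling with a boundary symbol inserted; the `+` marker, the example word, and the helper `trigrams_of` are hypothetical and only serve the illustration.

```r
# Trigrams of a string padded with "#", returned as a character vector
trigrams_of = function(s) {
  letter = strsplit(paste0("#", s, "#"), "")[[1]]
  sapply(seq_len(length(letter) - 2),
         function(i) paste(letter[i:(i + 2)], collapse = ""))
}

plain_toy  = trigrams_of("walker")    # trigrams of the spelling
marked_toy = trigrams_of("walk+er")   # trigrams of the boundary-marked form

# Trigrams of the spelling that survive in the marked form lie within a morpheme
plain_toy[plain_toy %in% marked_toy]      # "#wa" "wal" "alk" "er#"

# Trigrams that the marker breaks up straddle the boundary
plain_toy[!(plain_toy %in% marked_toy)]   # "lke" "ker"
```

Because `%in%` tests set membership rather than position, a straddling trigram that happens to recur inside a morpheme of the same word would be counted as not straddling; for most words this should not matter.
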
In [None]:
R"""
boxplot(v_without_boundary, v_with_boundary, horizontal=TRUE, names=c("without", "with"), 
        xlab="semantic support", col="steelblue2", cex.lab=1.5,
        ylab="morphological boundary")
""";

In [None]:
R"""
pdf("../fig/boxplotMorphBoundary.pdf", height=4, width=12)
boxplot(v_without_boundary, v_with_boundary, horizontal=TRUE, names=c("without", "with"), 
        xlab="semantic support", col="steelblue2", cex.lab=1.5,
        ylab="morphological boundary")
dev.off()
""";

In [None]:
R"""
wilcox.test(v_without_boundary, v_with_boundary)
"""

In [None]:
R"""
c(mean(v_without_boundary), mean(v_with_boundary))
"""

Trigrams that do not straddle a morphological boundary receive more semantic support.

### 3.2 Syllable boundaries

In [None]:
R"""
without_boundary_list = list()
with_boundary_list = list()

for (i in 1:nrow(mes2)) {
  trigrams1 = strsplit(ngram(mes2$spelling[i]), "_")[[1]]
  trigrams2 = strsplit(ngram(mes2$syll[i]), "_")[[1]]
  without_boundary = which(trigrams1 %in% trigrams2)
  with_boundary = which(!is.element(trigrams1, trigrams2))
  without_boundary_list[[i]] = tri_sup2[[i]][without_boundary]
  with_boundary_list[[i]] = tri_sup2[[i]][with_boundary]
}

v_without_boundary = unlist(without_boundary_list)
v_with_boundary = unlist(with_boundary_list)
""";

In [None]:
R"""
boxplot(v_without_boundary, v_with_boundary, horizontal=TRUE, names=c("without", "with"), 
        xlab="semantic support", col="steelblue2", cex.lab=1.5,
        ylab="syllable boundary")
""";

In [None]:
R"""
pdf("../fig/boxplotSyllBoundary.pdf", height=4, width=12)
boxplot(v_without_boundary, v_with_boundary, horizontal=TRUE, names=c("without", "with"), 
        xlab="semantic support", col="steelblue2", cex.lab=1.5,
        ylab="syllable boundary")
dev.off()
""";

In [None]:
R"""
wilcox.test(v_without_boundary, v_with_boundary)
"""

In [None]:
R"""
c(mean(v_without_boundary), mean(v_with_boundary))
"""

Trigrams that do not straddle a syllable boundary receive more semantic support.

# Exercises

In [None]:
R"""
mes_fam$rank = unlist(mes_fam$rank)
mes_fam.gam3 = gam(rank ~ s(logSubFreq) + s(nletters, k=4) + s(logStemFreq), 
                   data = mes_fam)
par(mfrow=c(1,3))
plot(mes_fam.gam3, select=1, scheme=1, shade.col="steelblue2"); abline(h=0)
plot(mes_fam.gam3, select=3, scheme=1, shade.col="steelblue2"); abline(h=0)
plot(mes_fam.gam3, select=2, scheme=1, shade.col="steelblue2"); abline(h=0)
""";

The functional shapes of the frequency effects are more similar for rank and RT than for MinTC and RT, but the effect of length basically disappears. Why this happens is unclear to us.