# Chapter 16.3: Conceptual relations and CARIN

## Getting the datasets

Download `LADECv1-2019.csv` from [here](https://era.library.ualberta.ca/items/dc3b9033-14d0-48d7-b6fa-6398a30e61e4). This is the Large Database of English Compounds (LADEC). We will use it for setting up the mappings. Save it in `../dat`.

Download `13423_2018_1478_MOESM1_ESM.csv` from [here](https://static-content.springer.com/esm/art%3A10.3758%2Fs13423-018-1478-x/MediaObjects/13423_2018_1478_MOESM1_ESM.csv). This is a database of conceptual relations, the "top relations" data of Schmidtke et al. (2018). We will use this dataset to zoom in on the role of conceptual relations. Store it in `../dat`.

For reaction times, we'll use the British Lexicon Project (BLP). If you haven't done so before, download `blp-items.txt.zip` from [here](https://osf.io/b5sdk/files/osfstorage), unzip, and place in `../dat`.

The data preparation below also reads `../dat/bnc_frequencies.txt`, a table of BNC frequencies with the columns `Word` and `Frequency`. Make sure this file is present in `../dat` as well.

## Packages

In [None]:
using JudiLing, JudiLingMeasures, CSV, DataFrames, RCall;

In [None]:
R"""
library(MASS)
library(mgcv)
library(party)
library(lattice)
""";

## Data preparation for DLM modeling

In [None]:
R"""
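ladec = read.csv("../dat/LADECv1-2019.csv")
# keep one row per unique (constituent 1, constituent 2, compound) triple
ladec = unique(ladec[,c("c1", "c2", "stim")])
# the word list contains all constituents and all compounds
words = unique(c(ladec$c1, ladec$c2, ladec$stim))
dat = data.frame(Word=words, IsCompound=words %in% ladec$stim)
# BNC frequencies; this assumes the rows of bnc are in the same order as dat$Word
bnc = read.table("../dat/bnc_frequencies.txt", header=TRUE)
dat$Frequency = bnc$Frequency+1   # shift by 1 so that no frequency is zero
head(dat)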
"""

In [None]:
@rget dat;

In [None]:
size(dat)

In [None]:
R"""
table(dat$IsCompound)
"""

## Defining form and meaning matrices

In [None]:
dat2, S = JudiLing.load_S_matrix_from_fasttext(dat, :en, target_col=:Word);

The resulting dataset `dat2` is somewhat smaller than `dat`, as fastText embeddings are not available for all words.

In [None]:
size(dat2)

In [None]:
cue_obj = JudiLing.make_cue_matrix(dat2, grams=3, target_col="Word");
size(cue_obj.C)

In [None]:
dat2.origOrder = collect(1:size(dat2)[1]);

## Endstate learning

In [None]:
F = JudiLing.make_transform_matrix(cue_obj.C, S);
G = JudiLing.make_transform_matrix(S, cue_obj.C);

In [None]:
Shat = cue_obj.C * F;
Chat = S * G;

In [None]:
JudiLing.eval_SC(Shat, S, dat2, "Word")

In [None]:
JudiLing.eval_SC_loose(Shat, S, 5, dat2, :Word)
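
To make explicit what the mapping and its evaluation amount to, here is a minimal sketch in R on toy matrices. It assumes that endstate learning corresponds to the least-squares solution of C F = S, and that accuracy is scored by correlating each predicted semantic vector with all target vectors. The matrices and the helper `accuracy_at_k` are made up for illustration and are not part of JudiLing.

```r
library(MASS)  # ginv() computes the Moore-Penrose pseudoinverse

set.seed(1)
n_words = 6
# Toy cue matrix: an identity block guarantees linearly independent rows,
# plus two extra binary cue columns.
C = cbind(diag(n_words), matrix(rbinom(2 * n_words, 1, 0.5), nrow = n_words))
# Toy semantic matrix with 5 dimensions.
S = matrix(rnorm(5 * n_words), nrow = n_words)

# Least-squares mapping from form to meaning, and the predicted meanings.
Fmat = ginv(C) %*% S
Shat = C %*% Fmat

# A word counts as correct if its own target vector is among the k target
# vectors that correlate most strongly with its predicted vector.
accuracy_at_k = function(Shat, S, k = 1) {
  cors = cor(t(Shat), t(S))   # cors[i, j]: prediction i vs. target j
  hits = sapply(seq_len(nrow(S)), function(i) {
    i %in% order(cors[i, ], decreasing = TRUE)[1:k]
  })
  mean(hits)
}

accuracy_at_k(Shat, S, k = 1)   # strict accuracy
accuracy_at_k(Shat, S, k = 5)   # lenient accuracy: target among the top 5
```

The toy mapping is exact by construction, so both accuracies come out as 1. The point is only the scoring logic, which carries over to larger matrices.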

In [None]:
all_measures = JudiLingMeasures.compute_all_measures_train(
    dat2, cue_obj, Chat, S, Shat, F, G, 
    low_cost_measures_only=true);
CSV.write("../res/endstate_measures.csv", all_measures);

## Frequency-informed learning (FIL)

In [None]:
dat2.Frequency = Int.(dat2.Frequency);

In [None]:
F = JudiLing.make_transform_matrix(cue_obj.C, S, dat2.Frequency);
G = JudiLing.make_transform_matrix(S, cue_obj.C, dat2.Frequency);
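
One way to think about frequency-informed learning is as a weighted least-squares problem in which each word's prediction error counts in proportion to its frequency. The sketch below illustrates this idea in R on toy data. Whether the frequency-aware call to `make_transform_matrix` computes exactly this solution is an assumption here, and `wls_mapping` and all values are hypothetical.

```r
set.seed(2)
C = matrix(rnorm(8 * 3), nrow = 8)     # toy: 8 words x 3 cues
S = matrix(rnorm(8 * 2), nrow = 8)     # toy: 8 words x 2 semantic dimensions
freq = c(100, 50, 20, 10, 5, 2, 1, 1)  # made-up word frequencies

# Weighted least squares: minimise sum_i w_i * ||C[i, ] %*% Fmat - S[i, ]||^2
wls_mapping = function(C, S, w) {
  P = diag(w / sum(w))
  solve(t(C) %*% P %*% C, t(C) %*% P %*% S)
}

F_fil = wls_mapping(C, S, freq)               # frequency-weighted mapping
F_eol = wls_mapping(C, S, rep(1, nrow(C)))    # equal weights: ordinary least squares

# Squared prediction error per word under each mapping
sq_err = function(Fmat) rowSums((C %*% Fmat - S)^2)
round(cbind(freq, endstate = sq_err(F_eol), fil = sq_err(F_fil)), 3)
```

With equal weights the solution reduces to ordinary least squares. With frequency weights, the frequency-weighted total error can only go down, at the cost of a larger unweighted error.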

In [None]:
Shat = cue_obj.C * F;
Chat = S * G;

In [None]:
JudiLing.eval_SC(Shat, S, dat2, "Word")

In [None]:
JudiLing.eval_SC_loose(Shat, S, 5, dat2, :Word)

In [None]:
JudiLing.eval_SC(Shat, S, dat2, :Word, freq=dat2.Frequency)

In [None]:
all_measures = JudiLingMeasures.compute_all_measures_train(
    dat2, cue_obj, Chat, S, Shat, F, G, 
    low_cost_measures_only=true);
CSV.write("../res/FIL_measures.csv", all_measures)

In [None]:
@rput S;

## Putting everything together for analyses of the compounds

In [None]:
R"""
fil = read.csv("../res/FIL_measures.csv")
colnames(fil)[5:ncol(fil)]=paste("FIL", colnames(fil)[5:ncol(fil)], sep="_")
head(fil)
eol = read.csv("../res/endstate_measures.csv")
colnames(eol)[5:ncol(eol)]=paste("EOL", colnames(eol)[5:ncol(eol)], sep="_")
dat3 = cbind(fil, eol[,5:ncol(eol)])
head(dat3, 3)
"""

Restrict to compounds.

In [None]:
R"""
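# IsCompound may be read back as logical or as character, depending on how the CSV is parsed
compounds = dat3[tolower(as.character(dat3$IsCompound))=="true",]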
nrow(compounds)
"""

Add in information about the constituents.

In [None]:
R"""
head(ladec)
"""

In [None]:
R"""
table(table(ladec$stim))
"""

We remove duplicate entries for compounds (for example, compounds with multiple parses), keeping the first row for each compound.

In [None]:
R"""
ladec2 = ladec[!duplicated(ladec$stim),]
c(nrow(ladec), nrow(ladec2))
"""
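
As a small illustration of this step, the toy table below contains one compound with two parses. The rows are made up for this example and are not taken from LADEC.

```r
toy = data.frame(
  c1   = c("black", "no", "now", "tea"),
  c2   = c("bird", "where", "here", "pot"),
  stim = c("blackbird", "nowhere", "nowhere", "teapot"),
  stringsAsFactors = FALSE
)

# How many compounds occur once, twice, ...?
table(table(toy$stim))

# duplicated() flags the second and later occurrences, so the first row
# for each compound is the one that is kept.
toy_unique = toy[!duplicated(toy$stim), ]
c(nrow(toy), nrow(toy_unique))
```

Note that the parse that is kept is simply the one listed first, which need not be the intended parse.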

In [None]:
R"""
compounds2 = merge(compounds, ladec2, by.x="Word", by.y="stim")
head(compounds2,3)
"""

In [None]:
R"""
c1Count = table(compounds2$c1)
c2Count = table(compounds2$c2)
compounds2$c1Count = as.vector(c1Count[compounds2$c1])
compounds2$c2Count = as.vector(c2Count[compounds2$c2])
head(compounds2)
"""
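
The counting step above relies on indexing a frequency table by name. A minimal sketch with made-up compounds:

```r
toy = data.frame(
  Word = c("blackbird", "blackboard", "bluebird", "teapot"),
  c1   = c("black", "black", "blue", "tea"),
  c2   = c("bird", "board", "bird", "pot"),
  stringsAsFactors = FALSE
)

# table() counts how often each constituent occurs; indexing the table with
# the (character) constituent column looks up the count for every row.
c1Count = table(toy$c1)
c2Count = table(toy$c2)
toy$c1Count = as.vector(c1Count[toy$c1])
toy$c2Count = as.vector(c2Count[toy$c2])
toy
```

The lookup is by name only when the constituent columns are character vectors, so wrapping them in `as.character()` is a cheap safeguard if they might be factors.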

## Adding information about conceptual relations

In [None]:
R"""
rels = read.csv("../dat/13423_2018_1478_MOESM1_ESM.csv")
head(rels)
"""

In [None]:
R"""
table(rels$PrintExposure)
"""

In [None]:
R"""
rels_high = rels[rels$PrintExposure=="high",]
"""

In [None]:
R"""
tab = table(apply(rels_high[,3:18], 1, FUN=function(v)sum(v>0)))
tab
"""

In [None]:
R"""
median(apply(rels_high[,3:18], 1, FUN=function(v)sum(v>0)))
"""

In [None]:
R"""
rels_high[which(apply(rels_high[,3:18], 1, FUN=function(v)sum(v>0))==1),]
"""

In [None]:
R"""
pos = which(compounds2$Word %in% rels$Compound)
length(pos)
"""

In [None]:
R"""
compounds3 = compounds2[pos,]
S_compounds3 = S[compounds3$origOrder,]
""";

In [None]:
R"""
head(compounds3,3)
"""

In [None]:
R"""
rels_high = rels_high[rels_high$Compound %in% compounds3$Word,]
sum(rels_high$Compound==compounds3$Word) == nrow(compounds3)
"""

Add the conceptual relation that is selected most often in `rels_high`. Ties are broken randomly.

In [None]:
R"""
semrels = colnames(rels_high[3:18])
is_best_jitter = function(v) {
  v = jitter(as.numeric(v))
  return(semrels[which(v==max(v))])
}
set.seed(314)
compounds3$bestSemRel_high = apply(rels_high[,3:18], 1, is_best_jitter)
compounds3$ent = rels_high$ent
head(compounds3)
"""
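
The tie-breaking logic can be checked on a toy table. The relation names and counts below are hypothetical. With its default settings, `jitter` adds noise that is smaller than the smallest gap between distinct values in a row, so only exact ties can be reordered.

```r
toy = data.frame(
  Compound = c("snowball", "teapot", "doghouse"),
  MADE_OF  = c(5, 0, 0),
  FOR      = c(1, 3, 4),
  HAS      = c(0, 3, 1),
  stringsAsFactors = FALSE
)
semrels = colnames(toy)[2:4]

# Add a little noise to the counts, then pick the relation with the maximum.
best_relation = function(v) {
  v = jitter(as.numeric(v))
  semrels[which.max(v)]
}

set.seed(314)
toy$best = apply(toy[, 2:4], 1, best_relation)
toy
```

For "teapot" the two tied relations are equally likely to be chosen, so the result for tied rows depends on the seed.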

In [None]:
R"""
nrow(compounds3)
"""

## FastText embeddings and conceptual relations

Can we predict the best-supported conceptual relation from fastText embeddings?

In [None]:
R"""
set.seed(314)
library(MASS)
m = lda(S_compounds3, as.factor(compounds3$bestSemRel_high), CV=TRUE) 
tab = table(compounds3$bestSemRel_high, m$class)
sum(diag(tab))/sum(tab)
"""

Comparison with majority baseline:

In [None]:
R"""
class_counts = table(compounds3$bestSemRel_high)
prop.test(c(max(class_counts), sum(diag(tab))), rep(nrow(compounds3),2))
"""
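
For readers who want to see the same recipe on a dataset that ships with R, here is a sketch using `iris` as a stand-in for the embeddings. The choice of `iris` is only for illustration: leave-one-out LDA, accuracy from the confusion table, and a proportions test against the majority baseline.

```r
library(MASS)

x = as.matrix(iris[, 1:4])   # stand-in for the embedding matrix
y = iris$Species             # stand-in for the relation labels

# CV=TRUE returns leave-one-out predictions in m$class
m = lda(x, y, CV = TRUE)
tab = table(y, m$class)
accuracy = sum(diag(tab)) / sum(tab)

# Majority baseline: always predict the most frequent class
class_counts = table(y)
baseline = max(class_counts) / length(y)

c(accuracy = accuracy, baseline = baseline)
prop.test(c(max(class_counts), sum(diag(tab))), rep(length(y), 2))
```

Using `diag(tab)` presupposes that rows and columns of the confusion table list the classes in the same order, which holds when both are based on the same factor levels.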

What about a non-linear classifier?

In [None]:
R"""
library(e1071)
msvm = svm(S_compounds3, as.factor(compounds3$bestSemRel_high), cross=10)
summary(msvm)
"""

In [None]:
R"""
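# use the cross-validated accuracy stored in the fitted object (0.364 in the original run)
prop.test(c(max(class_counts), round(msvm$tot.accuracy/100*nrow(compounds3))), rep(nrow(compounds3),2))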
"""

Conclusion: there is some information about conceptual relations in the embeddings, but prediction accuracy for held-out data is hardly above the majority baseline.

## Reaction times (RTs)

In [None]:
R"""
blp = read.table("../dat/blp-items.txt",T)
""";

In [None]:
R"""
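# add BLP lexical decision RTs (inner join on the word form)
compounds4 = merge(compounds3, blp[,c("spelling", "rt")], by.x="Word", by.y="spelling")
compounds4$WordLen = nchar(compounds4$Word)
compounds4 = compounds4[!is.na(compounds4$rt),]
# inverse-transformed RT; the minus sign keeps larger values = slower responses
compounds4$RTinv = -1000/compounds4$rt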
nrow(compounds4)
"""
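
A quick check of what the `-1000/rt` transformation does, using made-up reaction times: it preserves the ordering of the responses while pulling in the long right tail. The helper `tail_ratio` is a hypothetical, crude asymmetry index introduced only for this example.

```r
rt = c(450, 520, 610, 700, 1500)   # made-up RTs in ms, one slow outlier
RTinv = -1000 / rt

# Ratio of the upper to the lower half of the range around the median;
# values well above 1 indicate a long right tail.
tail_ratio = function(v) (max(v) - median(v)) / (median(v) - min(v))

round(RTinv, 3)
c(raw = tail_ratio(rt), transformed = tail_ratio(RTinv))
```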

In [None]:
R"""
head(compounds4, 3)
"""

## Modeling RTs with GAMs

Exploratory model using Endstate Learning.

In [None]:
R"""
compounds4.gam = gam(RTinv~s(EOL_L1Chat)+s(EOL_L1Shat)+s(EOL_C.Precision)+WordLen+s(ent), data=compounds4)
summary(compounds4.gam)
"""

Now a model using FIL measures.

In [None]:
R"""
compounds4.gamF = gam(RTinv~s(FIL_L1Chat)+s(FIL_L1Shat)+s(FIL_C.Precision)+WordLen+s(ent), data=compounds4)
summary(compounds4.gamF)
"""

The FIL model appears preferable, supporting the importance of taking frequency into account.

In [None]:
R"""
AIC(compounds4.gam, compounds4.gamF)
"""
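
The logic of this comparison can be tried out on simulated data where the answer is known. In the sketch below, the response depends on `x1` only, so the model with `s(x1)` should receive the lower AIC. All variable names and values are invented for the simulation.

```r
library(mgcv)

set.seed(314)
n = 300
toy = data.frame(x1 = runif(n), x2 = runif(n))
# By construction the response depends on x1 only.
toy$y = sin(2 * pi * toy$x1) + rnorm(n, sd = 0.3)

m1 = gam(y ~ s(x1), data = toy)
m2 = gam(y ~ s(x2), data = toy)

aic = AIC(m1, m2)   # one row per model, with columns df and AIC
aic
```

Lower AIC indicates the better trade-off between fit and complexity. Small differences should be interpreted with caution.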

A simplified version:

In [None]:
R"""
compounds4.gamF2 = gam(RTinv~s(FIL_L1Chat)+(FIL_TargetCorrelation), data=compounds4)
summary(compounds4.gamF2)
"""

In [None]:
R"""
AIC(compounds4.gamF, compounds4.gamF2)
"""

The simplified model has a slightly lower AIC.

In [None]:
R"""
plot(compounds4.gamF2, scheme=1, shade.col="steelblue2");abline(h=0, col="indianred")
""";

## Random forest analysis with measures from LADEC that have no missing values

In [None]:
R"""
ladec = read.csv("../dat/LADECv1-2019.csv")
head(ladec, 3) 
"""

In [None]:
R"""
vars=c(
#"id_master"               , "c1"                       ,
#"c2"                      , 
"stim"                     ,
"obs"                     , "obsc1"                    ,
"obsc2"                   , #"stimlen"                  ,
"c1len"                   , "c2len"                    ,
"nparses"                 , "correctParse"             ,
"ratingcmp"               , "ratingC1"                 ,
"ratingC2"                , "isPlural"                 ,
"nc1_cmp"                 , "nc2_cmp"                  ,
"nc1_cmpnoplural"         , "nc2_cmpnoplural"          ,
"sentiment_stim"          , "sentiment_c1"             ,
"sentiment_c2"            , "sentimentprobpos_stim"    ,
"sentimentprobpos_c1"     , "sentimentprobpos_c2"      ,
"sentimentprobneg_stim"   , "sentimentprobneg_c1"      ,
"sentimentprobneg_c2"     , "sentimentratioposneg_stim",
"sentimentratioposneg_c1" , "sentimentratioposneg_c2"  ,
"profanity_stim"          , "profanity_c1"             ,
"profanity_c2"            , #"isCommonstim"             ,
#"isCommonC1"              , "isCommonC2"               ,
#"bg_boundary"             , 
"bgJonesMewhort"           ,
"bgSUBTLEX"               , "bgFacebook"               ,
#"inSUBTLEX"               , "inBLP"                    ,
#"inELP"                   , "inJuhaszLaiWoodcock"      ,
#"c1_inELP"                , "c1_inBrysbaert"           ,
#"c1_inWordnet"            , "c1_inMMA"                 ,
#"c2_inELP"                , "c2_inBrysbaert"           ,
#"c2_inWordnet"            , "c2_inMMA"                 ,
#"LSAc1c2"                 , "LSAc1stim"                ,
#"LSAc2stim"               , "stim_SLlg10wf"            ,
"BLPbncfrequency"         , #"BLPbncfrequencymillion"   ,
#"BLPrt"                   , "elp_ld_rt"                ,
#"elp_naming_mean_rt"      , 
"c1c2_snautCos"            ,
#"c1stim_snautCos"         , "c2stim_snautCos"          ,
#"fbusfreq"                , "fbukfreq"                 ,
"valence_stim"            , "valence_c1"               ,
"valence_c2"              , "concreteness_stim"        ,
"concreteness_c1"         , "concreteness_c2"          ,
#"Juhasz_tran"             , "st_c1_mean"               ,
#"st_c2_mean"              , "Zipfvalue"                ,
"c1_SLlg10wf"             , "c2_SLlg10wf"              ,
"c1_BLPbncfrequency"      , #"c1_BLPbncfrequencymillion",
"c2_BLPbncfrequency")     # , "c2_BLPbncfrequencymillion")
""";

In [None]:
R"""
missingCounts = apply(ladec[,vars], 2, FUN=function(v)sum(is.na(v)))
vars = vars[missingCounts == 0]
vars
"""
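
The same filter on a toy table, with invented values, shows which columns survive. `colSums(is.na(...))` is an equivalent and slightly shorter way of counting missing values per column.

```r
toy = data.frame(
  stim         = c("blackbird", "teapot", "sunday"),
  c1len        = c(5, 3, 3),
  ratingcmp    = c(4.2, NA, 3.9),
  valence_stim = c(6.1, 5.8, NA),
  stringsAsFactors = FALSE
)
vars_toy = c("stim", "c1len", "ratingcmp", "valence_stim")

# Number of missing values per column; keep only complete columns.
missingCounts = colSums(is.na(toy[, vars_toy]))
kept = vars_toy[missingCounts == 0]
kept
```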

In [None]:
R"""
ladec1 = ladec[,vars]
ladec1$sentiment_stim = as.factor(ladec1$sentiment_stim)
ladec1$sentiment_c1 = as.factor(ladec1$sentiment_c1)
ladec1$sentiment_c2 = as.factor(ladec1$sentiment_c2)
ladec1$profanity_stim = as.factor(ladec1$profanity_stim)
ladec1$profanity_c1 = as.factor(ladec1$profanity_c1)
ladec1$profanity_c2 = as.factor(ladec1$profanity_c2)
""";

In [None]:
R"""
nrow(compounds4)
"""

In [None]:
R"""
compounds5 = merge(compounds4, ladec1, by.x="Word", by.y="stim")
nrow(compounds5)
"""

We are still missing constituent frequencies. We add these in from the BNC frequency table.

In [None]:
R"""
head(bnc)
freqs = bnc$Frequency
names(freqs) = bnc$Word
compounds5$C1Freq = freqs[as.character(compounds5$c1)]
compounds5$C2Freq = freqs[as.character(compounds5$c2)]
ncol(compounds5)
"""
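
The lookup above uses a named vector as a dictionary. A small sketch with made-up frequencies shows the behaviour, including what happens for a constituent that is absent from the frequency table:

```r
bnc_toy = data.frame(
  Word      = c("black", "bird", "tea"),
  Frequency = c(120, 80, 60),
  stringsAsFactors = FALSE
)

# Named vector: names are the words, values are their frequencies.
freqs = bnc_toy$Frequency
names(freqs) = bnc_toy$Word

constituents = c("tea", "black", "pot")
looked_up = unname(freqs[constituents])
looked_up                 # "pot" is not in the table and yields NA
sum(is.na(looked_up))     # number of constituents without a frequency
```

Because missing words silently become `NA`, it is worth counting the missing values before the new columns enter a model.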

In [None]:
R"""
head(compounds5, 3)
"""

In [None]:
R"""
vars = c(3, 5:12,14:25,27:30,33:34,35,36,37,38, 40:69)
colnames(compounds5)[vars]
"""

In [None]:
R"""
forest_input = compounds5[,vars]
head(forest_input,3)
"""

In [None]:
R"""
forest_input$bestSemRel_high = as.factor(forest_input$bestSemRel_high)
forest_input$correctParse = as.factor(forest_input$correctParse)
""";

In [None]:
R"""
compounds.cforest = cforest(rt~., forest_input)
varimps = varimp(compounds.cforest)
""";
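
The idea behind permutation-based variable importance can be sketched with a much simpler stand-in model. The example below uses a linear model on simulated data instead of a conditional forest, so it illustrates the principle only and does not reproduce what `varimp` does internally. All names and values are hypothetical.

```r
set.seed(314)
n = 500
toy = data.frame(signal = rnorm(n), noise = rnorm(n))
# By construction only `signal` affects the simulated RT.
toy$rt = 600 - 40 * toy$signal + rnorm(n, sd = 20)

fit = lm(rt ~ signal + noise, data = toy)
mse = function(d) mean((d$rt - predict(fit, newdata = d))^2)

# Importance of a predictor: average increase in prediction error after
# randomly shuffling that predictor, which breaks its link to the response.
perm_importance = function(var, nperm = 20) {
  base = mse(toy)
  mean(sapply(seq_len(nperm), function(i) {
    shuffled = toy
    shuffled[[var]] = sample(shuffled[[var]])
    mse(shuffled) - base
  }))
}

imp = sapply(c("signal", "noise"), perm_importance)
sort(imp)
```

A predictor that carries no information ends up with an importance close to zero, which is the reference point for reading the dotplots.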

In [None]:
R"""
pdf("../fig/dotplot_vars.pdf", he=10, wi=6)
print(dotplot(sort(varimps),xlab="variable importance"))
dev.off()
""";

In [None]:
R"""
png("../fig/dotplot_vars.png", he=960, wi=480)
print(dotplot(sort(varimps), xlab="variable importance"))
dev.off()
""";

In [None]:
R"""
png("../fig/dotplot_vars_small.png", he=480, wi=480)
print(dotplot(tail(sort(varimps),25), xlab="variable importance"))
dev.off()
""";

In [None]:
R"""
pdf("../fig/dotplot_vars_small.pdf", he=6, wi=8)
print(dotplot(tail(sort(varimps),25), xlab="variable importance"))
dev.off()
""";

![dotplot](../fig/dotplot_vars.png)

## Final GAM analysis taking the sentiment measures into account

In [None]:
R"""
compounds5.gam = gam(RTinv~s(FIL_L1Chat)+(FIL_TargetCorrelation) + #s(FIL_C.Precision)+
   s(sentimentprobneg_stim), # + s(sentimentratioposneg_stim),
  data=compounds5)
summary(compounds5.gam)
"""

In [None]:
R"""
par(mfrow=c(1,2))
plot(compounds5.gam, select=1, scheme=1, shade.col="steelblue2", ylab="partial effect");abline(h=0, col="indianred")
plot(compounds5.gam, select=2, scheme=1, shade.col="steelblue2", ylab="partial effect");abline(h=0, col="indianred")
abline(v=median(compounds5$sentimentprobneg_stim))
""";

In [None]:
R"""
pdf("../fig/CARIN_compounds_partial_effects.pdf", he=5, wi=10)
par(mfrow=c(1,2))
plot(compounds5.gam, select=1, scheme=1, shade.col="steelblue2", 
   xlab="L1Chat (FIL)",
   ylab="partial effect");abline(h=0, col="indianred")
plot(compounds5.gam, select=2, scheme=1, shade.col="steelblue2", 
   xlab = "probability of negative sentiment (compound)", 
   ylab="partial effect");abline(h=0, col="indianred")
abline(v=median(compounds5$sentimentprobneg_stim), col="indianred")
dev.off()
""";

Conclusions: 

1. RT decreases with greater TargetCorrelation, as expected.
2. Greater uncertainty about the spoken form leads to longer RTs; this is by far the strongest effect.
3. The sentiment measure suggests response optimization, with the slowest responses for median negativity, and faster responses for less common negativities.

Frequency is the strongest predictor, which is unsurprising because (1) it captures much more than lexical learning, and (2) the model was trained only on the words in this small dataset, which is only a sample of real experience.

In [None]:
R"""
compounds5$LogCompFreq = log(compounds5$Frequency+1)
compounds5.gam2 = gam(RTinv~s(FIL_L1Chat)+(FIL_TargetCorrelation) + s(LogCompFreq)+
   s(sentimentprobneg_stim), # + s(sentimentratioposneg_stim),
  data=compounds5)
summary(compounds5.gam2)
"""

In [None]:
R"""
concurvity(compounds5.gam2)
"""
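
To get a feel for the output of `concurvity`, the simulated example below includes two nearly identical predictors and one independent predictor. The index ranges from 0 (no problem) to 1 (a smooth can be fully reproduced from the other terms). The data are invented for illustration.

```r
library(mgcv)

set.seed(314)
n = 300
x1 = runif(n)
x2 = x1 + rnorm(n, sd = 0.05)   # almost a copy of x1
x3 = runif(n)                   # unrelated to x1 and x2
y = sin(2 * pi * x1) + x3 + rnorm(n, sd = 0.3)
toy = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)

m = gam(y ~ s(x1) + s(x2) + s(x3), data = toy)

# Rows: worst-case, observed, and estimated concurvity; columns: model terms
conc = concurvity(m)
round(conc, 2)
```

High values for a term mean that its effect is hard to separate from the other terms in the model, which is relevant when a predictor loses support after another one is added.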

The L1Chat measure survives the inclusion of compound frequency, but the sentiment measure does not.