# Chapter 14.3: Vertical tongue tip position

## Modeling

In [None]:
using JudiLing, JudiLingMeasures, DataFrames, RCall

__Training data__

In [None]:
german = JudiLing.load_dataset("../dat/frequencylist.txt", delim="\t");

In [None]:
first(german, 6)

In [None]:
size(german)

__Modeling with word2vec using frequency-informed learning__

__NB__ The following code snippets cannot be executed because we cannot distribute the specific word2vec embeddings used by Saito. The resulting SemanticSupport measure is available in the "articulography.csv" dataset that is used below.

In [None]:
#S, words = JudiLing.load_S_matrix("data/german_w2v.csv",
#                                  header=false, sep='\t');

In [None]:
#first(words, 6)

In [None]:
#size(words)

In [None]:
#Cue_object = JudiLing.make_cue_matrix(german, grams=3, target_col="WordPhono");

In [None]:
#G = JudiLing.make_transform_matrix(S, Cue_object.C, german.Frequency);

In [None]:
#Chat = S * G;

In [None]:
#semantic_support_word2vec = JudiLingMeasures.last_support(Cue_object, Chat);

__Modeling with FastText vectors using frequency-informed learning__

We require German fasttext vectors:

In [None]:
germanft, Sft = JudiLing.load_S_matrix_from_fasttext(
                    german,
                    :de,
                    target_col=:WordOrtho);

In [None]:
size(germanft)

In [None]:
Cue_object = JudiLing.make_cue_matrix(germanft, grams=3, target_col="WordPhono");

In [None]:
G = JudiLing.make_transform_matrix(Sft, Cue_object.C, germanft.Frequency);

In [None]:
Chat = Sft * G;

In [None]:
semantic_support_fasttext = JudiLingMeasures.last_support(Cue_object, Chat);
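
As we understand it, `last_support` extracts, for each word, the support that the predicted form matrix `Chat` provides for that word's final trigram. A minimal sketch of this idea in base Julia follows; the toy cue inventory, the matrix values, and the names `cue_index`, `word_trigrams` and `last_trigram_support` are hypothetical and do not mirror the internals of JudiLingMeasures.

```julia
# Hypothetical toy setup: 2 words and a cue inventory of 4 trigrams.
cues = ["#ta", "tak", "ak#", "#ak"]                      # column labels of Chat
cue_index = Dict(c => j for (j, c) in enumerate(cues))   # trigram => column

# Trigrams of each word, in order of occurrence (assumed known).
word_trigrams = [["#ta", "tak", "ak#"],   # word 1: "tak"
                 ["#ak", "ak#"]]          # word 2: "ak"

# Predicted support for every trigram (rows: words, columns: trigrams).
Chat = [0.9 0.8 0.7 0.1;
        0.0 0.1 0.6 0.95]

# For each word, look up the entry of Chat in the column of its last trigram.
last_trigram_support = [Chat[i, cue_index[last(tris)]]
                        for (i, tris) in enumerate(word_trigrams)]
```

The result is one number per word, which is what makes the measure usable as a word-level predictor in the statistical analysis.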

__Modeling with FastText vectors using endstate learning__

In [None]:
Cue_object = JudiLing.make_cue_matrix(germanft, grams=3, target_col="WordPhono");

In [None]:
G = JudiLing.make_transform_matrix(Sft, Cue_object.C);

In [None]:
Chat = Sft * G;

In [None]:
semantic_support_fasttext_EOL = JudiLingMeasures.last_support(Cue_object, Chat);
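
The two fasttext mappings differ only in whether word frequencies enter the estimation of `G`. As a rough sketch in base Julia (not JudiLing's actual implementation), endstate learning can be viewed as the least-squares solution of `S * G ≈ C`, and frequency-informed learning as a weighted least-squares solution in which each word's row is scaled by the square root of its relative frequency. The toy matrices, the frequencies, and the names `G_eol` and `G_fil` are made up for illustration; that JudiLing weights the rows in exactly this way is an assumption.

```julia
using LinearAlgebra

# Hypothetical toy data: 3 words, 2 semantic dimensions, 4 trigram cues.
S = [1.0 0.0;
     0.0 1.0;
     1.0 1.0]               # rows: semantic vectors
C = [1.0 0.0 1.0 0.0;
     0.0 1.0 0.0 1.0;
     1.0 1.0 0.0 0.0]       # rows: binary cue (form) vectors
freq = [100.0, 10.0, 1.0]   # made-up word frequencies

# Endstate learning: ordinary least-squares solution of S * G ≈ C.
G_eol = pinv(S) * C

# Frequency-weighted variant (assumed form): scale each word's row by the
# square root of its relative frequency, then solve by least squares.
w = sqrt.(freq ./ maximum(freq))
G_fil = pinv(w .* S) * (w .* C)

# Predicted form matrices under the two mappings.
Chat_eol = S * G_eol
Chat_fil = S * G_fil
```

In the weighted fit the errors of high-frequency words count more heavily, so the mapping is pulled towards predicting their forms well at the expense of low-frequency words.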

__Save results__

In [None]:
#dfrw2v = DataFrame(Word=german.WordOrtho, Phon=german.WordPhono, Frequency=german.Frequency, 
#    SemanticSupport=semantic_support_word2vec);
dfrft = DataFrame(Word=germanft.WordOrtho, Phon=germanft.WordPhono, Frequency=germanft.Frequency, 
    SemanticSupportFIL=semantic_support_fasttext, SemanticSupportEOL=semantic_support_fasttext_EOL);

In [None]:
# @rput dfrw2v;
@rput dfrft;

In [None]:
R"""
save(dfrft, file="../res/dfrft.rda")
""";

## Statistical analysis

We carry out the analysis here with the precomputed measures based on word2vec vectors, as reported in the book. For the analysis with a DLM based on fasttext vectors, see Exercise 1.

In [None]:
R"""
ema = read.csv("../dat/articulography.csv", header=TRUE)
head(ema)
"""

In [None]:
R"""
suppressPackageStartupMessages(library(mgcv))
suppressPackageStartupMessages(library(itsadug))
""";

__Analysis using the word2vec based measure__

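Prepare the dataset for modeling with GAMs. The word2vec-based measure (`SemanticSupportW2V`) is already included in "articulography.csv", so no merge is needed here.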

In [None]:
R"""
dat = ema
dat = dat[order(dat$OrigOrder),]
dat$Speaker  = factor(dat$Speaker)
dat$Prev1Seg = factor(dat$Prev1Seg)
dat$Next1Seg = factor(dat$Next1Seg)
head(dat)
"""

Detect and remove extreme outliers using a simple GAM model with little concurvity.

In [None]:
R"""
fmla = formula(SenTT.Z ~ s(Speaker, bs="re") + 
                         s(Prev1Seg, bs="re") + 
                         te(SemanticSupportW2V, WordFreq, nTime, k=c(3,3,3)))
m = bam(fmla, data=dat)
qqnorm(resid(m));qqline(resid(m))
dat2 = dat[resid(m) > -10,]
""";

In [None]:
R"""
w2v.gam = bam(fmla, data=dat2, 
              AR.start=AR.start.segment, rho=0.91,
              discrete=TRUE)
summary(w2v.gam)
"""

Using a scaled t distribution for the residuals (`family="scat"`) does not improve model fit.

In [None]:
R"""
w2v_scat.gam = bam(fmla, data=dat2, 
               AR.start=AR.start.segment, rho=0.91,
               family="scat", discrete=TRUE)
AIC(w2v.gam, w2v_scat.gam)
"""

Concurvity is low for the tensor product smooth, so the model, and specifically the te(), is interpretable.

In [None]:
R"""
concurvity(w2v.gam)
"""

Removing Semantic Support leads to a substantial increase in AIC:

In [None]:
R"""
fmlaAIC = formula(SenTT.Z ~ s(Speaker, bs="re") + 
                            s(Prev1Seg, bs="re") + 
                            te(WordFreq, nTime, k=c(3,3)))
m = bam(fmlaAIC, data=dat2, AR.start=AR.start.segment, rho=0.91,
              discrete=TRUE)
AIC(m, w2v.gam)
"""

The residuals of the model have long thin tails, but there is little we can do about this.

Visualisation of the tensor product smooth:

In [None]:
R"""
plot(w2v.gam, select=3, scheme=2,hcolors=topo.colors(20), main="tensor product smooth")
""";

In [None]:
R"""
pdf("../fig/w2v_FIL.pdf", he=8, wi=8)
plot(w2v.gam, select=3, scheme=2, hcolors=topo.colors(20),
     main="partial effect")
dev.off()
""";

## Exercises

__Analysis using the fasttext based measures__

Load the data and merge the fasttext-based measures with the articulography data.

In [None]:
R"""
load("../res/dfrft.rda")
dfrft$SemanticSupportFIL=as.vector(unlist(dfrft$SemanticSupportFIL))
dfrft$SemanticSupportEOL=as.vector(unlist(dfrft$SemanticSupportEOL))
ema = read.csv("../dat/articulography.csv", header=TRUE)
dat = merge(ema, dfrft, by="Word")
dat = dat[order(dat$OrigOrder),]
dat$Speaker  = factor(dat$Speaker)
dat$Prev1Seg = factor(dat$Prev1Seg)
dat$Next1Seg = factor(dat$Next1Seg)
head(dat)
"""

__Analysis using the fasttext+EOL based measure__

Detect and remove extreme outliers.

In [None]:
R"""
fmla = formula(SenTT.Z ~ s(Speaker, bs="re") + 
                         s(Prev1Seg, bs="re") + 
                         te(SemanticSupportEOL, WordFreq, nTime, k=c(3,3,3)))
m = bam(fmla, data=dat)
qqnorm(resid(m));qqline(resid(m))

dat2 = dat[resid(m) > -10,]
""";

In [None]:
R"""
ft_EOL.gam = bam(fmla, data=dat2, AR.start=AR.start.segment, rho=0.91,
             discrete=TRUE)
plot(ft_EOL.gam, select=3, scheme=2,hcolors=topo.colors(20))
""";

This partial effect is rather different from the one obtained with word2vec.

__Analysis using the fasttext+FIL based measure__

Detect and remove extreme outliers.

In [None]:
R"""
fmla = formula(SenTT.Z ~ s(Speaker, bs="re") + 
                         s(Prev1Seg, bs="re") + 
                         te(SemanticSupportFIL, WordFreq, nTime, k=c(3,3,3)))
m = bam(fmla, data=dat)
qqnorm(resid(m));qqline(resid(m))

dat2 = dat[resid(m) > -10,]
""";

In [None]:
R"""
ft_FIL.gam = bam(fmla, data=dat2, AR.start=AR.start.segment, rho=0.91,
             discrete=TRUE)
plot(ft_FIL.gam, select=3, scheme=2,hcolors=topo.colors(20))
""";

Carry out model comparison with AIC.

In [None]:
R"""
aics = AIC(ft_FIL.gam, ft_EOL.gam, w2v.gam)
aics[order(aics$AIC),]
"""

The model using word2vec is superior, and the tensor product interaction for this model is also the simplest and the easiest to make sense of.