# Chapter 14.1: English s-duration (DLM simulation in Julia)

First load the usual packages, including `JudiLingMeasures`, since we want to predict behavioural data.

In [None]:
using JudiLing, JudiLingMeasures, CSV, DataFrames

## Data preparation

We load the prepared subset of [MALD](https://era.library.ualberta.ca/items/3344343b-2b4a-4b8c-af8e-8bd829c76472) (Tucker et al., 2019) paired with phonological transcriptions from CELEX (Baayen et al., 1996).

In [None]:
mald = JudiLing.load_dataset("../dat/s_durations_base_data.csv")
first(mald, 5)

Load a subset of fastText vectors for the words in the MALD dataset. The function also returns `mald_small`, a reduced dataset containing only the words for which fastText vectors are available:

In [None]:
mald_small, S = JudiLing.load_S_matrix_from_fasttext(mald, :en, target_col=:Word);

Download the s-durations data from Schmitz et al. (2021) from OSF:

In [None]:
using Downloads
Downloads.download("https://osf.io/download/8ze5b/?view_only=ef43a5caf6444270a56074027d7d6482",
    "../dat/s_durations.csv")

Load the s-duration data and keep only the first 32 columns, which drops all columns containing LDL measures as computed by Schmitz et al. (2021):

In [None]:
s_duration = JudiLing.load_dataset("../dat/s_durations.csv")
s_duration = s_duration[:, 1:32]
first(s_duration, 5)

Create a subset of the s-duration data containing only unique stimuli:

In [None]:
s_duration_unique = unique(s_duration[:,[:DISC, :Word, :Base, :Affix]])

## Creating semantic vectors for the pseudowords

Create cue objects for the words from MALD and for the pseudowords. We use triphones, following Schmitz et al. (2021):

In [None]:
cue_obj_mald, cue_obj_s_dur = JudiLing.make_combined_cue_matrix(mald_small[:,[:Word, :DISC]],
                                                                s_duration_unique[:,[:Word, :DISC]],
                                                                target_col = :DISC,
                                                                grams=3);
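
To make the cue representation concrete, here is a minimal sketch of how triphone cues and a shared column index could be built by hand. The helpers `triphones`, `shared_index` and `cue_matrix` are hypothetical and not part of JudiLing, and the `#` boundary marker and the toy transcriptions are assumptions made for illustration only.

```julia
# Hypothetical helper (not part of JudiLing): split a transcription into
# overlapping triphones, padding with an assumed word-boundary marker.
function triphones(form::AbstractString; boundary::Char = '#')
    chars = collect(string(boundary, form, boundary))
    return [String(chars[i:i+2]) for i in 1:length(chars)-2]
end

# Map every distinct triphone in both datasets to one shared column index,
# so that the two cue matrices get identical column layouts.
function shared_index(forms_a, forms_b)
    f2i = Dict{String,Int}()
    for form in Iterators.flatten((forms_a, forms_b)), tri in triphones(form)
        get!(f2i, tri, length(f2i) + 1)
    end
    return f2i
end

# Binary cue matrix: one row per form, one column per triphone.
function cue_matrix(forms, f2i)
    C = zeros(Int, length(forms), length(f2i))
    for (row, form) in enumerate(forms), tri in triphones(form)
        C[row, f2i[tri]] = 1
    end
    return C
end

words = ["kat", "kats"]   # toy "real word" transcriptions (assumed)
pseudowords = ["bat"]     # toy pseudoword transcription (assumed)

f2i = shared_index(words, pseudowords)
C_words = cue_matrix(words, f2i)
C_pseudo = cue_matrix(pseudowords, f2i)
```

Because both matrices are built from the same `f2i` mapping, their columns line up, which is what allows a mapping trained on one to be applied to the other.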

Train a comprehension matrix based on the MALD data:

In [None]:
F_train = JudiLing.make_transform_matrix(cue_obj_mald.C, S, mald_small.frequency)
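
The third argument weights each word by its frequency. One way to think about this is as weighted least squares, sketched below with a hypothetical helper `weighted_mapping`. That JudiLing computes the mapping in exactly this way (for example, how it scales the frequencies) is an assumption and is not shown in this notebook.

```julia
using LinearAlgebra

# Hypothetical helper: minimise sum_i freq[i] * ||C[i, :]' * F - S[i, :]'||^2
# by scaling each row with sqrt(freq[i]) and applying the pseudoinverse.
function weighted_mapping(C::AbstractMatrix, S::AbstractMatrix, freq::AbstractVector)
    w = sqrt.(freq)
    return pinv(w .* C) * (w .* S)
end

C = ones(2, 1)                  # two toy words that share a single cue
S = reshape([1.0, 3.0], 2, 1)   # their one-dimensional toy semantic vectors

F_weighted = weighted_mapping(C, S, [3, 1])   # first word is more frequent
F_plain = weighted_mapping(C, S, [1, 1])      # equal frequencies
```

With frequencies 3 and 1, the learned value is pulled towards the target of the more frequent word, whereas equal frequencies give the plain average of the two targets.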

Create predicted semantic vectors for the pseudowords:

In [None]:
S_s_dur = cue_obj_s_dur.C * F_train

Now, we need to impute semantic vectors for all lemmas and inflectional/declensional features using the technique from Nikolaev et al. (2023). For this, we make use of the `:features` column in the `mald_small` dataset:

In [None]:
mald_small.features[1:5]

We hand this column to the `make_pS_matrix` function, which creates a binary matrix indicating for each word form which features it contains:

In [None]:
L = JudiLing.make_pS_matrix(mald_small, 
    features_col=:features)
JudiLing.display_matrix(mald_small, :Word, L, L.pS, :pS)

Next, we create a transformation matrix W, which contains the imputed vectors for all lexemes and features. For further information on this method see Nikolaev et al. (2023).

In [None]:
W = JudiLing.make_transform_matrix(L.pS, S)
JudiLing.display_matrix(mald_small, :Word, L, W, :F, ncol=5, nrow=4)
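
The idea behind this step can be sketched on a toy example: if each word vector is roughly the sum of the vectors of its features, then solving `pS * W ≈ S` for `W` recovers one vector per feature. The feature names, the toy matrices and the use of the pseudoinverse below are assumptions for illustration; the solver JudiLing uses internally may differ.

```julia
using LinearAlgebra

features = ["cat", "dog", "S", "P"]
f2i = Dict(f => i for (i, f) in enumerate(features))

# Rows: cat (cat + S), cats (cat + P), dog (dog + S), dogs (dog + P).
pS = [1.0 0.0 1.0 0.0
      1.0 0.0 0.0 1.0
      0.0 1.0 1.0 0.0
      0.0 1.0 0.0 1.0]

# Toy "true" feature vectors, used only to generate additive word vectors.
W_true = [1.0 0.0
          0.0 1.0
          0.2 0.2
          -0.3 0.5]
S = pS * W_true

# Least-squares solution of pS * W = S. pS is rank-deficient here, so an
# explicit tolerance keeps the pseudoinverse numerically stable.
W = pinv(pS; rtol = 1e-10) * S

# The individual rows of W are not unique, but the singular-to-plural
# shift is the same for every valid solution.
plural_shift = W[f2i["P"], :] - W[f2i["S"], :]
```

In this toy setup the imputed vectors reproduce the word vectors exactly, because `S` was constructed to be perfectly additive; with real embeddings the fit is only approximate.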

From this, we extract semantic vectors for the meanings of plural, singular, alien and creature:

In [None]:
plural_vec = W[L.f2i["P"],:]
sing_vec = W[L.f2i["S"],:]
alien_vec = W[L.f2i["alien"],:]
creature_vec = W[L.f2i["creature"],:]

...and add these to the predicted semantic vectors we created earlier.

In [None]:
for i in 1:size(S_s_dur,1)
    S_s_dur[i,:] = S_s_dur[i,:] + alien_vec + creature_vec
    if s_duration_unique[i, :Affix] == "PL"
        S_s_dur[i,:] = S_s_dur[i,:] + plural_vec
    else
        S_s_dur[i,:] = S_s_dur[i,:] + sing_vec
    end
end
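
In equation form, the loop above computes the following for each pseudoword $i$, where $\mathbf{c}_i$ is its row in the cue matrix, $\mathbf{F}$ is `F_train`, and $\mathbf{w}_x$ is the row of `W` for feature $x$ (these symbols are introduced here for illustration only):

$$
\mathbf{s}_i = \mathbf{c}_i \mathbf{F} + \mathbf{w}_{\text{alien}} + \mathbf{w}_{\text{creature}} +
\begin{cases}
\mathbf{w}_{\text{P}} & \text{if the affix is PL} \\
\mathbf{w}_{\text{S}} & \text{otherwise}
\end{cases}
$$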

## Learning the meanings of the pseudowords

All words in MALD are learned according to their frequency. Since participants see the pseudowords in the experiment only once, they get a frequency of 1. Then, the two datasets are combined into one big dataset.

In [None]:
s_duration_unique[!,:frequency] .= 1
combined_dataset = vcat(mald_small[:,[:Word, :DISC, :frequency]], s_duration_unique[:,[:Word, :DISC, :frequency]])

Now we create a cue object for the combined dataset. We pass it the cue object created earlier for the MALD words, so that the function reuses the existing `i2f` and `f2i` mappings. Because the columns then stay aligned, we can later reuse the cue object created for the pseudowords.

In [None]:
cue_obj_comb = JudiLing.make_cue_matrix(combined_dataset, cue_obj_mald, grams=3, target_col=:DISC);

We also need an S matrix containing all semantic vectors:

In [None]:
S_comb = vcat(S, S_s_dur)

Now we train F and G matrices based on the combined dataset.

In [None]:
F_train_comb = JudiLing.make_transform_matrix(cue_obj_comb.C, S_comb, combined_dataset.frequency)
G_train_comb = JudiLing.make_transform_matrix(S_comb, cue_obj_comb.C, combined_dataset.frequency)

And calculate predicted semantic and form matrices for the pseudowords:

In [None]:
Shat_s_dur = cue_obj_s_dur.C * F_train_comb
Chat_s_dur = S_s_dur * G_train_comb

We now evaluate the predicted semantic vectors. Given that each pseudoword has a frequency of only 1 during training, it is unsurprising that accuracy is low:

In [None]:
JudiLing.eval_SC(Shat_s_dur, S_s_dur, S, s_duration_unique[:,[:Word, :DISC]], mald_small[:,[:Word, :DISC]], :DISC)
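
As a rough illustration of what a correlation-based accuracy of this kind measures, the sketch below counts a prediction as correct when its own target vector is the best-correlated candidate among the targets plus a set of distractor vectors. The helper `correlation_accuracy` and the toy matrices are hypothetical, and details such as the handling of homophones are left out, so this is not a reimplementation of `eval_SC`.

```julia
using Statistics

# Hypothetical helper: share of predicted rows whose best-correlated
# candidate (own targets first, then distractors) is their own target.
function correlation_accuracy(Shat, S_target, S_rest)
    candidates = vcat(S_target, S_rest)
    # R[i, j] = correlation between prediction i and candidate j
    R = cor(permutedims(Shat), permutedims(candidates))
    hits = [argmax(R[i, :]) == i for i in 1:size(Shat, 1)]
    return mean(hits)
end

S_target = [1.0 0.0 0.0
            0.0 1.0 0.0]   # toy target vectors
S_rest = [0.0 0.0 1.0]     # toy distractor vector (1 x 3 matrix)
Shat = [0.9 0.1 0.0
        0.2 0.1 0.9]       # toy predictions; the second resembles the distractor

accuracy = correlation_accuracy(Shat, S_target, S_rest)
```

Adding the distractors matters: a prediction can be closer to its own target than to any other pseudoword and still be counted as wrong if a vector from the larger lexicon fits better.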

## Computing the measures

We use the JudiLingMeasures package to compute all measures. The pseudowords are only a small subset of the data the mappings were trained on, but measures such as semantic density should still take the training data into account. We therefore use the `compute_all_measures_val` function, which receives both the training and the validation data and returns measures for the validation data only.

In [None]:
all_measures = JudiLingMeasures.compute_all_measures_val(s_duration_unique,
                                          cue_obj_mald, 
                                          cue_obj_s_dur, 
                                          Chat_s_dur, 
                                          S, 
                                          S_s_dur, 
                                          Shat_s_dur, 
                                          F_train_comb, 
                                          G_train_comb, 
                                          low_cost_measures_only=true)

We computed measures only for the unique set of pseudowords. For easier processing later on, we merge these measures back into the dataset with responses from all participants, using a `leftjoin` operation:

In [None]:
all_measures_full = leftjoin(s_duration, all_measures, on = [:DISC, :Word, :Base, :Affix])

Finally, we can save this dataframe as a CSV file:

In [None]:
CSV.write("../res/s_duration_measures.csv", all_measures_full)

Please see the next notebook for the R analysis.

## References

Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1996). The CELEX lexical database (CD-ROM).

Nikolaev, A., Chuang, Y.-Y., & Baayen, R. H. (2023). A generating model for Finnish nominal inflection using distributional semantics. The Mental Lexicon.

Tucker, B. V., Brenner, D., Danielson, D. K., Kelley, M. C., Nenadić, F., & Sims, M. (2019). The massive auditory lexical decision (MALD) database. Behavior Research Methods, 51, 1187-1204.