# Chapter 12.4: Maltese nouns

In [None]:
using JudiLing, DataFrames, Statistics

Download the Maltese data (Nieder et al., 2023) from OSF and save it in the `dat` directory.

In [None]:
download("https://osf.io/download/whrqs/",
         "../dat/maltese.csv")

Inspect the dataframe:

In [None]:
dat = JudiLing.load_dataset("../dat/maltese.csv");
size(dat)

In [None]:
first(dat, 5)

Do a careful split of the Maltese data, making sure that all lemmas, number values, and bi-syllables in the validation data have already occurred in the training data.
The `Word_syll` column contains words that are already syllabified, so we have to tell the function that the input is tokenized and which separator token it uses.

In [None]:
data_train, data_val = JudiLing.loading_data_careful_split(
                        "../dat/maltese.csv", "maltese", "../dat/careful",
                        ["Lemma", "Number"],
                        n_grams_target_col = "Word_syll",
                        n_grams_tokenized = true,
                        n_grams_sep_token = ".",
                        n_grams_keep_sep = true,
                        grams = 2,
                        val_ratio = 0.1,
                        random_seed = 42)
first(data_train, 5)
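
The idea behind a careful split can be illustrated with a small sanity check. The sketch below is not part of JudiLing: the helper `unseen` and the toy rows are hypothetical placeholders, and it only checks that every value of a given feature in the validation items also occurs in the training items.

```julia
# Hypothetical toy rows; the labels are placeholders, not Maltese data.
train = [(Lemma = "lemma_a", Number = "sg"),
         (Lemma = "lemma_a", Number = "pl"),
         (Lemma = "lemma_b", Number = "sg")]
val   = [(Lemma = "lemma_b", Number = "pl")]

# Values of `col` that occur in `val` but never in `train`.
unseen(col::Symbol, train, val) =
    setdiff(Set(getproperty.(val, col)), Set(getproperty.(train, col)))

println(unseen(:Lemma, train, val))
println(unseen(:Number, train, val))
```

For a split that satisfies the constraint, both sets are empty, as they are for these toy rows.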

In [None]:
size(data_train)

In [None]:
size(data_val)

Now we load fastText vectors for both the training and the validation data.

In [None]:
data_train_small, data_val_small, S_train, S_val = JudiLing.load_S_matrix_from_fasttext(data_train, data_val,
                                                                                       :mt, target_col=:Word)

We evidently lose a substantial part of the data because embeddings are not available for all words.

In [None]:
size(data_train_small)

In [None]:
size(data_val_small)

Create cue matrices for training and validation data, based on bi-syllables.

In [None]:
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(data_train_small,
                                                               data_val_small,
                                                               grams=2,
                                                               target_col="Word_syll",
                                                               tokenized=true,
                                                               sep_token=".",
                                                               keep_sep=true);

In [None]:
JudiLing.display_matrix(data_train_small, :Word_syll, cue_obj_train, cue_obj_train.C, :C)
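
To make the notion of a bi-syllable cue concrete, here is a minimal sketch of how a syllabified string could be turned into bigrams of syllables. The function `toy_bigrams`, the boundary marker `#`, and the example string are assumptions for illustration; JudiLing's internal n-gram construction may differ in detail.

```julia
# Split a syllabified word on `sep`, pad it with a boundary marker,
# and join each pair of adjacent units with `sep` (cf. keep_sep = true).
function toy_bigrams(word::AbstractString; sep::AbstractString = ".",
                     boundary::AbstractString = "#")
    units = vcat(boundary, String.(split(word, sep)), boundary)
    return [join(units[i:i+1], sep) for i in 1:length(units)-1]
end

# Illustrative input: a two-syllable string with "." as separator.
println(toy_bigrams("kel.ma"))
```

Each distinct bigram then corresponds to one column of the cue matrix, with a 1 in the rows of the words that contain it.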

Train comprehension and production matrices.

In [None]:
F_train = JudiLing.make_transform_matrix(cue_obj_train.C, S_train)
G_train = JudiLing.make_transform_matrix(S_train, cue_obj_train.C)
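
Conceptually, the two mappings are linear least-squares solutions: `F` maps cues to semantics and `G` maps semantics to cues. The toy sketch below illustrates this with the pseudoinverse on small made-up matrices; it is not JudiLing's implementation, which may solve the same problem by other numerical means.

```julia
using LinearAlgebra

# Toy cue matrix: 3 words x 3 cues (1 = cue present in the word).
C = [1.0 1.0 0.0;
     0.0 1.0 1.0;
     1.0 0.0 1.0]

# Toy semantic matrix: 3 words x 2 semantic dimensions (made-up values).
S = [ 0.5 -0.2;
      0.1  0.9;
     -0.4  0.3]

# Least-squares mappings: C * F approximates S, and S * G approximates C.
F = pinv(C) * S
G = pinv(S) * C

# Predicted semantic and form matrices.
Shat = C * F
Chat = S * G
```

Because this toy `C` is invertible, `Shat` reproduces `S` exactly. `Chat`, by contrast, is only an approximation of `C`, since two semantic dimensions cannot fully encode three independent cue columns.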

Predict semantic and form matrices for the training data.

In [None]:
Shat_train = cue_obj_train.C * F_train
Chat_train = S_train * G_train

Evaluate comprehension accuracy.

In [None]:
JudiLing.eval_SC(Shat_train, S_train, data_train_small, :Word_syll)
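
The logic of this kind of evaluation can be sketched in a few lines: each predicted vector is correlated with all gold-standard vectors, and a prediction counts as correct when its own gold vector is the most strongly correlated one. The function `toy_eval_sc` and the matrices below are illustrative assumptions, not JudiLing code.

```julia
using Statistics

# Proportion of rows of `Shat` whose highest correlation is with the
# same-numbered row of `S`.
function toy_eval_sc(Shat::AbstractMatrix, S::AbstractMatrix)
    # R[i, j] = correlation between row i of Shat and row j of S.
    R = cor(permutedims(Shat), permutedims(S))
    return mean(argmax(R[i, :]) == i for i in 1:size(R, 1))
end

# Made-up gold vectors (3 words x 4 dimensions).
S = [1.0 2.0 3.0 4.0;
     4.0 3.0 2.0 1.0;
     1.0 3.0 2.0 4.0]

# Made-up predictions: rows 1 and 2 are close to their targets,
# row 3 resembles the gold vector of word 2 instead.
Shat = [1.1 2.0 2.9 4.0;
        3.9 3.1 2.0 1.0;
        4.0 3.0 2.1 1.0]

println(toy_eval_sc(Shat, S))
```

In this toy case, two of the three predictions are closest to their own target, so the accuracy is 2/3.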

Use the learn paths algorithm to predict forms for the training data.

In [None]:
res_learn_train = JudiLing.learn_paths(data_train_small, cue_obj_train, S_train, F_train, Chat_train,
                                       threshold=0.005)

In [None]:
JudiLing.eval_acc(res_learn_train, cue_obj_train)

Moving on to the validation data.

Predict semantic and form matrices for the validation data.

In [None]:
Shat_val = cue_obj_val.C * F_train
Chat_val = S_val * G_train

Compute accuracy.

In [None]:
JudiLing.eval_SC(Shat_val, S_val, S_train, data_val_small, data_train_small, :Word_syll)

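Compute accuracy@10, which counts a prediction as correct if the target is among the 10 most strongly correlated candidates.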

In [None]:
JudiLing.eval_SC_loose(Shat_val, S_val, S_train, 10, data_val_small, data_train_small, :Word_syll)
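
A minimal sketch of accuracy@k on made-up matrices is given below. The function `toy_eval_sc_loose` is a hypothetical illustration, not JudiLing code: a prediction counts as correct if its own gold vector is among the `k` gold vectors with the highest correlation.

```julia
using Statistics

# Proportion of rows of `Shat` whose own gold row in `S` is among the
# k most strongly correlated gold rows.
function toy_eval_sc_loose(Shat::AbstractMatrix, S::AbstractMatrix, k::Integer)
    # R[i, j] = correlation between row i of Shat and row j of S.
    R = cor(permutedims(Shat), permutedims(S))
    return mean(i in partialsortperm(R[i, :], 1:k, rev = true)
                for i in 1:size(R, 1))
end

# Made-up gold vectors (3 words x 4 dimensions).
S = [1.0 2.0 3.0 4.0;
     4.0 3.0 2.0 1.0;
     1.0 3.0 2.0 4.0]

# Made-up predictions: row 3 is closest to the gold vector of word 2,
# with its own gold vector only in second place.
Shat = [1.1 2.0 2.9 4.0;
        3.9 3.1 2.0 1.0;
        4.0 3.0 2.1 1.0]

println(toy_eval_sc_loose(Shat, S, 1))
println(toy_eval_sc_loose(Shat, S, 2))
```

With `k = 1` this reduces to strict accuracy (2/3 here); with `k = 2` the third word is also counted as correct, giving 1.0.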

Write the comprehension results to a dataframe.

In [None]:
acc_comp = JudiLing.accuracy_comprehension(S_val, S_train, Shat_val, data_val_small, data_train_small,
                                           target_col=:Word_syll);
acc_comp_dfr = acc_comp.dfr

Combine them with the validation dataframe.

In [None]:
data_val_small_comp = hcat(data_val_small, acc_comp.dfr)

Compute accuracy per number and plural type, followed by the number of items in each group.

In [None]:
gdf = groupby(data_val_small_comp, [:Number, :pluralType])
combine(gdf, :correct => mean)

In [None]:
gdf = groupby(data_val_small_comp, [:Number, :pluralType])
combine(gdf, nrow)

# Exercises

## Exercise 1
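Run the learn paths algorithm on the held-out data.

To produce the validation forms, first compute the maximum number of timesteps across the training and validation data, then call `learn_paths`.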

In [None]:
max_t = JudiLing.cal_max_timestep(data_train_small, data_val_small, :Word_syll, tokenized=true, sep_token=".")

In [None]:
res_learn_val = JudiLing.learn_paths(data_train_small,
                                     data_val_small,
                                     cue_obj_train.C,
                                     S_val,
                                     F_train,
                                     Chat_val,
                                     cue_obj_val.A,
                                     cue_obj_train.i2f,
                                     cue_obj_train.f2i,
                                     Shat_val=Shat_val,
                                     threshold=0.0005,
                                     is_tolerant=true,
                                     tolerance=-0.1,
                                     max_tolerance=2,
                                     target_col=:Word_syll,
                                     max_t=max_t,
                                     tokenized=true,
                                     sep_token=".",
                                     keep_sep=true,
                                     grams=2,
                                     verbose=true)

Evaluate production accuracy on the validation data.

In [None]:
JudiLing.eval_acc(res_learn_val, cue_obj_val)

## Exercise 2
Compute learn paths accuracy separately for singulars and plurals, with plurals broken down into broken and sound plurals.

Write the results to a dataframe.

In [None]:
prod_acc = JudiLing.write2df(res_learn_val, data_val_small, cue_obj_train,
                             cue_obj_val, target_col=:Word_syll, tokenized=true,
                             sep_token=".", output_sep_token="")

Subset the dataframe to include only the top candidates and combine it with the validation dataframe.

In [None]:
prod_acc_best = prod_acc[prod_acc.isbest .== true,:]
data_val_small_prod = hcat(data_val_small, prod_acc_best)

Compute production accuracy per number and plural type.

In [None]:
gdf = groupby(data_val_small_prod, [:Number, :pluralType])
combine(gdf, :iscorrect => mean)

It is noteworthy that the broken plurals show much higher accuracy than the sound plurals. This might be because the data are unequally distributed: due to the careful split, there are only 12 broken plurals but 212 sound plurals in the validation data.

In [None]:
gdf = groupby(data_val_small_prod, [:Number, :pluralType])
combine(gdf, nrow)

One possible explanation is therefore simply chance: with so few items, the accuracy estimate for the broken plurals is not statistically reliable.
However, further inspection reveals that all but one of these forms have a homograph in the training data:

In [None]:
for f in data_val_small_prod[data_val_small_prod.pluralType .== "broken", :Word]
    println("Broken plural: ", f)
    if f in data_train_small.Word
        println("\t Homograph lemma:", data_train_small[data_train_small.Word .== f, :Lemma])
    end
end

Since the fastText vectors we use do not distinguish between the meanings of homographs, these forms are effectively not held-out words: they already occur in the training data. This explains their exceptionally high accuracy. Incidentally, the only word that is not a homograph ("trabi") is also the only incorrectly produced form:

In [None]:
data_val_small_prod[data_val_small_prod.pluralType .== "broken",[:Word, :iscorrect]]

# References

Nieder, J., Chuang, Y.-Y., van de Vijver, R., and Baayen, H. (2023). A discriminative lexicon approach to word comprehension, production, and processing: Maltese plurals. Language, 99(2):242–274.