# Chapter 12.3: Kinyarwanda verbs

Load the usual packages:

In [None]:
using DataFrames, JudiLing

## Data preparation

First, download the file "kinyarwandaVerbsExtensionsSylJuliaTestCombo.csv" from the Supplementary Materials of van de Vijver, Uwambayinema and Chuang (2024), which can be found [here](https://osf.io/jdaqb/), and store it in the "dat" directory.

In [None]:
download("https://osf.io/8uzah/download", "../dat/kinyarwanda_verbs.csv")

Next, we load the full dataset for inspection:

In [None]:
dat = JudiLing.load_dataset("../dat/kinyarwanda_verbs.csv", delim=";");
first(dat[:, 1:6], 5)

In [None]:
first(dat[:,7:11], 5)

The downloaded file is semicolon-separated. Since `loading_data_careful_split` can only deal with comma-separated files, we first save our dataset as a proper .csv file, overwriting the downloaded version:

In [None]:
using CSV
CSV.write("../dat/kinyarwanda_verbs.csv", dat)

Now we can reload the data, splitting into training and validation data. We use the `loading_data_careful_split` function, ensuring that all cues, lexemes and inflectional features in the validation data have already been seen in the training data.
Following van de Vijver et al. (2024), we use bi-syllables as cues and hold out 10% of the data for validation.

In [None]:
data_train, data_val = JudiLing.loading_data_careful_split(
                        "../dat/kinyarwanda_verbs.csv", 
                        "kinyarwanda", 
                        "../dat/careful",
                        ["Lexeme", "Person", "Number", "Tense", "Voice", "Mood", "Extension", "Aspect"],
                        n_grams_target_col = "WordSyl2",
                        n_grams_tokenized = true,
                        n_grams_sep_token = ".",
                        n_grams_keep_sep = true,
                        grams = 2,
                        val_ratio = 0.1,
                        random_seed = 42,
                        verbose=true);
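
The guarantee that the careful split provides for cues can be illustrated on a toy example. The following is a minimal sketch in plain Julia, not JudiLing's implementation: the helper `bisyllables` and the toy word lists are hypothetical, and for simplicity the sketch ignores the word-boundary markers and separator handling that JudiLing applies when it builds cues.

```julia
# Hypothetical helper: split a "."-separated form into syllables and
# return all adjacent syllable pairs (bi-syllables), joined by the separator.
function bisyllables(word::AbstractString; sep::AbstractString = ".")
    syls = split(word, sep)
    return [string(syls[i], sep, syls[i + 1]) for i in 1:(length(syls) - 1)]
end

# Toy forms, invented for illustration (not taken from the dataset)
toy_train_words = ["ta.ki.mo", "lu.ta.ki", "ki.mo.se"]
toy_val_words   = ["lu.ta.ki.mo"]

# All bi-syllable cues that occur in the toy training forms
toy_train_cues = Set(c for w in toy_train_words for c in bisyllables(w))

# Cues in the toy validation forms that never occur in training
toy_unseen = [c for w in toy_val_words for c in bisyllables(w) if !(c in toy_train_cues)]

# The split is "careful" with respect to cues if nothing is unseen
toy_all_seen = isempty(toy_unseen)
```

The same kind of check can be applied to the lexeme and inflectional feature columns by comparing the sets of values in the two data frames.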

Inspect the training data:

In [None]:
first(data_train, 5)

Inspect the respective sizes of training and validation data:

In [None]:
size(data_train)

In [None]:
size(data_val)

## Model

### Matrix preparation

We now create cue objects for both the training and validation data. For this, we use the `make_combined_cue_matrix` function, which ensures that the training and validation cue matrices share the same columns, covering all cues that occur in either dataset. We also tell the function that the form representations are in column `:WordSyl2`, that they are already tokenized, and that tokens are separated by a `"."`. We want to use bi-syllables, so we set `grams=2`:

In [None]:
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(data_train, 
                                                               data_val,
                                                               grams=2, 
                                                               tokenized=true,
                                                               sep_token=".", 
                                                               keep_sep=true, 
                                                               target_col=:WordSyl2);

In [None]:
JudiLing.display_matrix(data_train, :WordSyl2, cue_obj_train, cue_obj_train.C, :C)

Now create semantic matrices for both datasets using `make_combined_S_matrix`. We want `:Lexeme` as the base for all semantic vectors, and all inflectional features (`["Person", "Number", "Tense", "Voice", "Mood", "Extension", "Aspect"]`) as feature vectors. We also set the dimension of the semantic matrix to be the same as that of the cue matrix.

In [None]:
S_train, S_val = JudiLing.make_combined_S_matrix(data_train, 
                                                data_val, 
                                                ["Lexeme"], 
                                                ["Person", "Number", "Tense", "Voice", "Mood", "Extension", "Aspect"], 
                                                ncol=size(cue_obj_train.C, 2));

### Training (Seen) data

Next, we train the F matrix on the training data:

In [None]:
F = JudiLing.make_transform_matrix(cue_obj_train.C, S_train);

... and compute the predicted semantic matrix for the training data:

In [None]:
Shat_train = cue_obj_train.C * F;
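
Conceptually, the mapping from forms to meanings is a linear transformation estimated so that the cue matrix multiplied by the mapping approximates the semantic matrix. The following toy sketch shows this as a least-squares problem using `pinv` from Julia's `LinearAlgebra` standard library. It is an illustration under assumptions: the matrices `C_toy` and `S_toy` are invented, and the numerical method that `make_transform_matrix` uses internally may differ.

```julia
using LinearAlgebra

# Toy cue matrix: 4 word forms (rows) x 3 cues (columns), binary indicators
C_toy = [1.0 1.0 0.0;
         0.0 1.0 1.0;
         1.0 0.0 1.0;
         1.0 1.0 1.0]

# Toy semantic matrix: 4 word forms (rows) x 2 semantic dimensions
S_toy = [ 0.5 -1.0;
          1.5  0.2;
         -0.3  0.8;
          0.9  0.1]

# Least-squares estimate of the mapping, so that C_toy * F_toy ≈ S_toy
F_toy = pinv(C_toy) * S_toy

# Predicted semantic vectors for the toy forms
Shat_toy = C_toy * F_toy

# For a least-squares solution the residual is orthogonal to the columns of C_toy
residual_toy = S_toy - Shat_toy
orthogonality_gap = norm(C_toy' * residual_toy)
```

Because there are more word forms than cues in this toy example, the predictions in `Shat_toy` only approximate `S_toy`; how close they are is what the comprehension evaluation measures.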

Evaluate comprehension accuracy on the training data:

In [None]:
JudiLing.eval_SC(Shat_train, S_train, data_train, :WordSyl2)
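
The idea behind this correlation-based evaluation can be sketched on toy data: a word counts as correctly understood if its predicted semantic vector correlates more strongly with its own gold-standard vector than with the vector of any other word. The sketch below uses only Julia's `Statistics` standard library. The matrices `S_gold` and `S_pred` are invented, and the sketch leaves out details of `eval_SC` such as the treatment of homophones.

```julia
using Statistics

# Toy gold-standard semantic vectors: 3 words (rows) x 4 dimensions
S_gold = [ 1.0  0.0 -1.0  0.5;
           0.0  1.0  0.5 -1.0;
          -1.0  0.5  1.0  0.0]

# Toy predictions: the gold vectors plus a small, fixed perturbation
S_pred = S_gold .+ [ 0.1 -0.1  0.0  0.1;
                    -0.1  0.0  0.1  0.0;
                     0.0  0.1 -0.1  0.1]

# R[i, j] = Pearson correlation between predicted row i and gold row j
R = cor(S_pred, S_gold; dims = 2)

# For each predicted vector, find the index of the best-matching gold vector
best_match = [argmax(R[i, :]) for i in 1:size(R, 1)]

# Proportion of words whose best match is their own gold vector
acc_toy = mean(best_match .== 1:size(R, 1))
```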

Moving on to production, we first train the G matrix and compute the predicted form vectors:

In [None]:
G = JudiLing.make_transform_matrix(S_train, cue_obj_train.C);
Chat_train = S_train * G;

Next, we need to assemble the produced forms from the predicted form vector. For this we make use of the `learn_paths` algorithm:

In [None]:
res_learn_train = JudiLing.learn_paths(data_train,
                                        cue_obj_train,
                                        S_train,
                                        F,
                                        Chat_train,
                                        threshold=0.01);

...and evaluate the result:

In [None]:
JudiLing.eval_acc(res_learn_train, cue_obj_train)

# Exercises
## Exercise 1
Evaluating the model on the validation data:

First compute the predicted semantic vectors:

In [None]:
Shat_val = cue_obj_val.C * F;

...and evaluate comprehension accuracy on the unseen data:

In [None]:
JudiLing.eval_SC(Shat_val, S_val, S_train, data_val, data_train, :WordSyl2)

Moving on to production, compute the predicted form matrix:

In [None]:
Chat_val = S_val * G;

Now we need to again run the `learn_paths` algorithm for the validation data. We need to use the somewhat more complex version of the algorithm. First we need to compute the maximum number of cues which can occur in a word:

In [None]:
max_t = JudiLing.cal_max_timestep(data_train, data_val, :WordSyl2, tokenized=true, sep_token=".")
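
To see where such a maximum comes from, the following toy sketch counts the bi-syllable cues per word and takes the largest count over training and validation forms. It rests on an assumption that is not stated in this notebook, namely that one boundary marker is added on each side of a word before n-grams are formed. The helper `n_cues` and the toy forms are hypothetical, and the value that `cal_max_timestep` returns may be defined differently.

```julia
# Hypothetical helper: number of n-gram cues in a tokenized word, assuming
# one boundary marker is added on each side before n-grams are formed.
function n_cues(word::AbstractString; sep::AbstractString = ".", grams::Int = 2)
    n_tokens = length(split(word, sep)) + 2   # syllables plus two boundary markers
    return max(n_tokens - grams + 1, 0)       # number of windows of size `grams`
end

# Toy forms, invented for illustration (not taken from the dataset)
toy_train_forms = ["ta.ki.mo", "ta.ki"]
toy_val_forms   = ["lu.ta.ki.mo"]

# Largest number of cues in any word of either dataset
max_cues_toy = maximum(n_cues, vcat(toy_train_forms, toy_val_forms))
```

Both datasets enter the computation because the longest word may occur only in the validation data.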

Now we can proceed to run the algorithm. Note that we decrease the `threshold` slightly to 0.005:

In [None]:
res_learn_val = JudiLing.learn_paths(data_train,
                                    data_val,
                                    cue_obj_train.C,
                                    S_val,
                                    F,
                                    Chat_val,
                                    cue_obj_train.A,
                                    cue_obj_train.i2f,
                                    cue_obj_train.f2i,
                                    Shat_val = Shat_val,
                                    max_t = max_t,
                                    grams=2,
                                    target_col=:WordSyl2,
                                    tokenized=true,
                                    sep_token=".",
                                    keep_sep=true,
                                    verbose=true,
                                    threshold=0.005)

In [None]:
JudiLing.eval_acc(res_learn_val, cue_obj_val)

## Exercise 2

Call the `accuracy_comprehension` function on the validation data, supplying all features as base and inflections.

In [None]:
acc_comp = JudiLing.accuracy_comprehension(S_val, S_train, Shat_val, 
                                            data_val, data_train, target_col="WordSyl2",
                                            base=["Lexeme"], 
                                            inflections=["Person", "Number", "Tense", 
                                                         "Voice", "Mood", "Extension", 
                                                         "Aspect"])
acc_comp_dfr = acc_comp.dfr

Keep only the rows with errors.

In [None]:
errors = acc_comp_dfr[acc_comp_dfr.correct .== 0,:]

In [None]:
names(errors)

Now we need to count how many times there were errors for each of the features. We can do this manually or by summing each of the feature columns.

In [None]:
sum.(eachcol(errors[:,
        [:Lexeme, :Person, :Number, :Tense, :Voice, :Mood, :Extension, :Aspect]]))

Since we summed over the rows with errors only, the numbers tell us how many times the pertinent feature was nevertheless understood correctly. Evidently, the second-to-last feature (Extension) was misunderstood most often.

In [None]:
combine(groupby(acc_comp_dfr[acc_comp_dfr.correct .== 0,:], :Extension), nrow)

The reason for this is quite straightforward: among the inflectional features, the Extension feature has the most unique classes.

In [None]:
combine(groupby(data_val, :Extension), nrow)

# References

van de Vijver, R., Uwambayinema, E. & Chuang, Y. (2024). Comprehension and production of Kinyarwanda verbs in the Discriminative Lexicon. Linguistics, 62(1), 79-119. https://doi.org/10.1515/ling-2021-0164